# Python Process Book for CS109 Project: Trends in Health Topics over Time


## Overview and Motivation:

This project looks at trends in health topics over time.  The members of this project team are all public health students.  We were interested in how health topics in a popular newspaper, The New York Times, have changed over time.  We decided to gain a deeper understanding of Latent Dirichlet Allocation (LDA) and use this topic modeling method to find the major topics in health over the past 5 decades (the period of time we were able to collect data for).  

Related work which inspired our project was LDA from Homework 5, as well as web scraping and data cleaning techniques from Homework 1 and Homework 2. In addition, we wanted to emulate Google Trends visualizations, which show trends in Google searches over time for specific topics.

## Initial Questions:

 Which topics are persistent over time?  
 
 Which topics have spikes, when do they occur, and why did they happen? 
 Are there any surprising topics?
 
 With regards to LDA:
 Should we use Pattern (as in Hw5) or is there a better option for our data?
 What hyperparameters should we choose?
 
 Are there further analyses we can consider based on our findings?
 
 Can we correlate the topics we determine by LDA with similar topics in Google Trends or PubMed?

## Data:

Data was pulled from the New York Times article API.  Using the API console (http://developer.nytimes.com/io-docs), we were able to inspect the type of results for a given query.  Originally, we decided to look at results using the 'fq=newsdesk:Health' option, which would pull results under the Health topic section of the Times (approximately 680,000 documents since 1851).  However, after looking at the first 1000 results, we found that the documents pulled mainly consisted of videos, slideshows, and interactive features instead of articles.  We remedied this issue by instead using a query for the keyword 'Health', which searched all articles and their headlines for the word health.  The results from the API console showed that, overall, using 'Health' as our query term instead of newsdesk:Health produced approximately 40,000 more documents (720,000 total).

The Times API has several limiting factors when pulling data: 10,000 calls per day and a maximum of 100 pages per query.  To handle these limitations, our code pulled data by year and split each month into 3 parts (based on testing date ranges for number of pages which would be pulled so the total number of pages would be less than 100).  Year and count were entered manually to keep track of how many calls were being made and to break up the data calls for when errors occurred (such as timeouts, key errors, and date errors; when these errors were encountered the solution was appended to the code which resulted in the final version below).     
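The three-way split of each month can also be generated instead of typed by hand. The sketch below is one way to do it, assuming the same 1-10, 11-20, and 21-to-month-end windows. `date_windows` is a hypothetical helper and not part of the project code. It relies on the standard library's `calendar.monthrange` so that month lengths and leap years are handled automatically.

```python
import calendar

def date_windows(year):
    """Return 36 (begin, end) date strings in YYYYMMDD form for one year.

    Each month is split into days 1-10, 11-20, and 21 through month end.
    """
    windows = []
    for month in range(1, 13):
        last_day = calendar.monthrange(year, month)[1]  # number of days in the month
        for start, end in [(1, 10), (11, 20), (21, last_day)]:
            begin = "{}{:02d}{:02d}".format(year, month, start)
            finish = "{}{:02d}{:02d}".format(year, month, end)
            windows.append((begin, finish))
    return windows

windows_1954 = date_windows(1954)
```

This sketch only builds the date strings; the requests themselves are made in the scraping code shown further down.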

In the scraping code, we requested the json dictionaries and used the relevant information to create a dataframe with the date, id, document type (article, blog, video), newsdesk section and subsection, and text from the title, abstract, and first paragraph.  The data was saved in a csv file for that section and tracked using an Excel file - [DateTracker](https://github.com/dyan1211/teamsignificant/blob/master/DateTracker.xlsx).
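As a rough illustration of the flattening step, the sketch below turns one result dictionary into a row. The field names match those used in our scraping code, but `doc_to_row` and the sample record are hypothetical and exist only for illustration. Using `.get` means a missing key yields `None` rather than raising a `KeyError`, which was one of the errors we ran into.

```python
def doc_to_row(doc):
    """Flatten one article dictionary into a row with a joined text field."""
    headline = doc.get('headline') or {}
    title = headline.get('main') or headline.get('name') or ' '

    paragraph = doc.get('lead_paragraph')
    if paragraph is None or paragraph == 'TK TK TK':  # placeholder text in some records
        paragraph = ' '
    abstract = doc.get('abstract') or ' '

    pub_date = doc.get('pub_date')
    return {
        'id': doc.get('_id'),
        'doctype': doc.get('document_type'),
        'date': pub_date[:10] if pub_date else None,  # keep only YYYY-MM-DD
        'text': ' '.join([title, paragraph, abstract]),
    }

# Hypothetical record, shaped like the fields we used
sample = {
    '_id': 'abc123',
    'document_type': 'article',
    'pub_date': '1966-03-01T00:00:00Z',
    'headline': {'main': 'City Opens Clinic'},
    'lead_paragraph': 'TK TK TK',
    'abstract': None,
}
row = doc_to_row(sample)
```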

At the beginning, we wanted to pull as much data as possible to get an idea of topic changes over time - in particular if there was a difference between today and some unspecified earlier time (articles range back to the mid-1800s).  Due to the limiting factors (and keeping in mind the amount of time needed for analysis), we pulled data through the late 1950s.  We decided to perform our analysis on data from the beginning of 1966 through the end of October 2015, giving us five decades of data.  

Text scraping and cleaning was done in https://github.com/dyan1211/teamsignificant/blob/master/DataScraping.ipynb

In [None]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In [None]:
import json
import time      #used for time.sleep between API calls
import requests  #used for the API requests

In [None]:
%%time
count = 1460 #enter new count starting number
year =  1954 #enter year
months = ['01', '01', '01', '02', '02', '02', '03', '03', '03', '04', '04', '04', '05', '05', '05', '06', '06', '06', '07', '07', '07', '08', '08', '08', '09', '09', '09', '10', '10', '10', '11', '11', '11', '12', '12', '12']
startdays = ['01', '11', '21', '01', '11', '21', '01', '11', '21', '01', '11', '21', '01', '11', '21', '01', '11', '21', '01', '11', '21', '01', '11', '21', '01', '11', '21', '01', '11', '21', '01', '11', '21', '01', '11', '21']
enddays = ['10', '20', '31', '10', '20', '27', '10', '20', '31', '10', '20', '30', '10', '20', '31', '10', '20', '30', '10', '20', '31', '10', '20', '31', '10', '20', '30', '10', '20', '31', '10', '20', '30', '10', '20', '31']

pcount = 0

for d in range(36):
    sdate = str(year) + months[d] + startdays[d]  #year is an int, so convert before joining
    edate = str(year) + months[d] + enddays[d]

    docs=[]
    #get 1st page and number of documents
    url1 = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Health&page=1&begin_date={}&end_date={}&api-key=5cdab36b05348a4da2e74046dfb16a03:17:73541790".format(sdate, edate) 
    lpage = requests.get(url1).json()['response']['meta']['hits']
    
    #calculate number of pages for call (divide total documents by 10 and add 2 in case not every page has 10 documents); 
    #if there are no hits skip to end
    if lpage != 0:
        numpages = int(lpage/10 + 2) 
        pcount += numpages
    
        #get json files for first page
        pagedoc1 = requests.get(url1).json()['response']['docs']
        for j in range(0,len(pagedoc1)):
            docs.append(pagedoc1[j])    
    
        #get json dictionaries for rest of the pages
        for i in range(2, numpages): 
            url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Health&page={}&begin_date={}&end_date={}&api-key=5cdab36b05348a4da2e74046dfb16a03:17:73541790".format(i, sdate, edate)
            pagedocs = requests.get(url).json()['response']['docs']
            time.sleep(1)
        
            for j in range(0,len(pagedocs)):
                docs.append(pagedocs[j])

    #pull information from json file into dictionary        
        docsinfo = []
        for d in docs:
            obs = {}
            obs['id'] = d['_id']
            obs['type'] = d['type_of_material']
            obs['doctype'] = d['document_type']
            obs['date'] = d['pub_date']
            obs['news_desk'] = d['news_desk']
            obs['section'] = d['section_name']
            obs['subsection'] = d['subsection_name']
            obs['abstract'] = d['abstract']
            obs['paragraph'] = d['lead_paragraph']
        
            #Headline exceptions
            if d['headline'].get('main') is not None:
                obs['headline'] = d['headline']['main']
            elif d['headline'].get('name') is not None:
                obs['headline'] = d['headline']['name']
            else:
                obs['headline'] = ' '
    
            #get the date part of datetime
            if obs['date'] is not None:
                obs['date'] = obs['date'][0:10]
    
            #Remove empty abstract and lead paragraph cell to join text
            if obs['abstract'] is None:
                a = ' '
            else: 
                a = obs['abstract']
            if obs['paragraph'] == 'TK TK TK' or obs['paragraph'] is None:
                p = ' '
            else:
                p = obs['paragraph']
    
            #Join all the text columns
            text = [obs['headline'], p, a]
            obs['text'] = " ".join(text)
    
            docsinfo.append(obs)

    #create dataframe from dictionary, make date column date type, and store in csv
        docsdf = pd.DataFrame(docsinfo)
        docsdf['date'] = pd.to_datetime(docsdf['date'])
        docsdf.to_csv("data/docsdf-{}.csv".format(count), encoding = 'utf-8') 

    count += 1 
    

Next the files were concatenated into a single dataframe and saved to a csv file of our data which is located in our dropbox: https://www.dropbox.com/sh/6f9vok950zrg97h/AAA91zJUr_GKUcSvwHatHMkya?dl=0

In [None]:
#1966-2015
#There were no documents for 08/11/1978 - 10/31/1978 (leading to gap in csv files)
frames = []
for i in range(1,1338) : 
    dfs = pd.read_csv("docsdf-"+str(i)+".csv")
    frames.append(dfs)
for i in range(1346,1784) : 
    dfs = pd.read_csv("docsdf-"+str(i)+".csv")
    frames.append(dfs)
totaldf = pd.concat(frames)

#set date column to pandas date type
totaldf['date'] = pd.to_datetime(totaldf['date'])

In [None]:
totaldf.to_csv("total.csv", index=False)

We looked at the types of documents to see if there were any trends and found that the majority of the documents were articles.  We are mainly interested in articles, and two of the document types are more recent types (multimedia and blogpost), so we decided to only include articles in our corpus.

Document analysis was done in https://github.com/dyan1211/teamsignificant/blob/master/Exploration.ipynb
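The counts behind the bar plot can be tallied in a few lines. This is a minimal sketch with made-up values: `doc_types` stands in for the `doctype` column and is not real data.

```python
from collections import Counter

# Hypothetical document-type values, standing in for the 'doctype' column
doc_types = ['article', 'article', 'blogpost', 'multimedia', 'article', 'column']

counts = Counter(doc_types)
labels = sorted(counts)                            # alphabetical, matching the plot's x-axis order
type_counts = [counts[label] for label in labels]  # one count per label
article_share = counts['article'] / float(len(doc_types))
```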

In [None]:
# Code to look at frequencies of document types
sns.set(style="white", context="talk")
ax = sns.barplot(x=('Article', 'Blogpost', 'Column', 'Multimedia', 'Recipe'), y=type_counts)
ax.set(title="Document Type Frequencies",ylim=(0,400000),yticks=[100000,200000,300000,400000])
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+0.2, height+10000, '%d'%height, fontsize=14)
sns.despine(bottom=True)

![](http://i.imgur.com/qLMXD3e.png)

In [None]:
#Only documents specified as article will be used
df = totaldf[totaldf['doctype'] == 'article']

After plotting our data over time and checking the days with the highest number of articles, we found strange spikes in 1980 and 2014.  We realized our data contained a number of duplicate documents that held only pieces of another document (such as a shortened title or only the first few sentences of a paragraph).  To remove them, we created a new column containing the first 50 characters of the paragraph column, which captures the matching sections of the two documents, and applied the drop_duplicates function to that column.

Duplicate dropping was done in https://github.com/dyan1211/teamsignificant/blob/master/DuplicateArticles.ipynb
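The prefix-matching idea can be sketched without pandas. `drop_prefix_duplicates` is a hypothetical helper written for illustration. It deliberately keeps rows whose paragraph is missing, which is a design choice of this sketch and not something our notebook did, so it may be worth checking how missing paragraphs are treated when using `drop_duplicates`.

```python
def drop_prefix_duplicates(rows, key='paragraph', n=50):
    """Keep the first row for each distinct n-character prefix of row[key].

    Rows whose value is missing or empty are kept, since an empty prefix
    says nothing about whether two documents match.
    """
    seen = set()
    kept = []
    for row in rows:
        prefix = (row.get(key) or '')[:n]
        if prefix and prefix in seen:
            continue  # same opening text as an earlier row
        seen.add(prefix)
        kept.append(row)
    return kept

# Hypothetical rows: the second is a truncated copy of the first
rows = [
    {'id': 1, 'paragraph': 'A' * 60 + ' full version of the paragraph'},
    {'id': 2, 'paragraph': 'A' * 60},
    {'id': 3, 'paragraph': None},
    {'id': 4, 'paragraph': None},
]
kept_ids = [r['id'] for r in drop_prefix_duplicates(rows)]
```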

In [None]:
#Bar graph of number of documents per year before dropping duplicates (spike in 1980 and 2014)
years = df.year.value_counts().sort_index()
years.plot(kind='bar')

![](http://imgur.com/rVyNJx1.png)

In [None]:
#Check dates with most articles 
days = df.date.value_counts()
days[:20].plot(kind='barh');

![](http://imgur.com/caAkxtg.png)

In [None]:
#Remove duplicates
df['dupcheck'] = df['paragraph'].str[0:50]
df = df.drop_duplicates('dupcheck')

In [None]:
#Bar graph of number of documents per year after dropping duplicates (smoothed out 1980, though still 2014 spike )
years = df.year.value_counts().sort_index()
years.plot(kind='bar')

![](http://i.imgur.com/MR7mqH5.png)

This graph looks much better, though there is still the 2014 spike.  Using the API console through NYT, we checked the total number of articles in 2014 compared to 2013 (not just 'Health' articles):

Total # of articles produced by NYT in 2013 = 117593

Total # of articles produced by NYT in 2014 = 308867

So the spike is not specific to our search.  We do not have a concrete explanation, though we considered the possibility that it is due to an increase in web content.

#### Choice of Time Periods

Before moving forward, we made the decision that in order to parse out clear and specific topics we should perform LDA on subsets of the data.  By using the entire dataset, we were concerned that the topics would end up being too general.  We decided to use 5 year periods - large enough to have data to run LDA on and to capture topics which would have a trend in time, but small enough that some major events would show up. 

The code for splitting the df was done in https://github.com/dyan1211/teamsignificant/blob/master/Textblob.ipynb
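A minimal sketch of how a year can be mapped to its 5-year period, assuming periods start in 1966 (1966-1970, 1971-1975, and so on). `period_start` is a hypothetical helper and is not taken from the notebook linked above.

```python
def period_start(year, first_year=1966, width=5):
    """Return the first year of the fixed-width period containing `year`."""
    if year < first_year:
        raise ValueError("year is before the first period")
    return first_year + ((year - first_year) // width) * width

period_labels = {y: period_start(y) for y in (1966, 1970, 1971, 2015)}
```

Grouping the articles by this value would give one subset per period, with the last period (starting in 2011) ending where our data ends.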

## Exploratory Data Analysis:

We used Spark to clean and analyze our data since these processes are easily parallelized.  Spark was installed using Homebrew on a Mac. 

In [None]:
import os
os.environ['PYSPARK_PYTHON'] = '/Applications/anaconda/bin/python'

In [None]:
import findspark
findspark.init()
print(findspark.find())

In [None]:
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local')
    .setAppName('pyspark')
    .set("spark.executor.memory", "2g"))
sc = pyspark.SparkContext(conf=conf)

In [None]:
import sys
rdd = sc.parallelize(range(10),10)

In [None]:
from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

We began by cleaning our text in order to perform LDA.  We initially decided to use NLTK to process our text since it is the leading platform for natural language processing.  However, we found that NLTK took an extremely long time to process our large data set.  Instead, we utilized TextBlob, which is built on both NLTK and Pattern.  TextBlob also has the advantage of implementing the Averaged Perceptron tagger, which has been reported to be faster and more accurate than the NLTK and Pattern taggers (http://stevenloria.com/tutorial-state-of-the-art-part-of-speech-tagging-in-textblob/).

The text was tokenized, tagged for part of speech using the AP tagger, made all lowercase, lemmatized using TextBlob's lemmatizer which is based on WordNet's morphy function, and the nouns were extracted.  Stopwords (based on NLTK's stopwords corpus) were not included in our list of nouns.  Punctuation and words made up of one letter were also not included.

We considered the issues that lowercasing creates for nouns such as AIDS and WHO, but decided that not using .lower would potentially cause more problems.  In addition, the AP tagger should be able to identify nouns based on the sentence structure, and even if a word has multiple meanings, a word that is mostly used in a specific way (HIV/AIDS vs government aid) should not be difficult to identify based on the other words which make up the topic and the documents associated with it.  (In fact, we were able to distinguish between the two in our analysis.)

The initial NLTK work was done in https://github.com/dyan1211/teamsignificant/blob/master/NLTKTester.ipynb

TextBlob work (and the following parseout work) was done in https://github.com/dyan1211/teamsignificant/blob/master/Textblob.ipynb

In [None]:
from textblob import TextBlob as tb
from textblob_aptagger import PerceptronTagger
from textblob import Blobber

In [None]:
import nltk
from nltk.corpus import stopwords #stopwords

In [None]:
#The necessary NLTK packages need to be downloaded to implement some of the functions. RUN ONLY ONCE
nltk.download()

In [None]:
#Set up Averaged Perceptron Tagger for TextBlob
tb = Blobber(pos_tagger=PerceptronTagger())

In [None]:
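#Clean the text for each document to extract the nouns 
#Cleaning includes lemmatization and removal of stopwords
from string import punctuation           #punctuation characters checked in get_parts
stops = set(stopwords.words('english'))  #NLTK English stopwords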

def get_parts(thetext):
    nouns=[]
    tagged = tb(thetext).tags # a list of tuples
    for tup in tagged:
        w, tag = tup  
        if tag in ['NN', 'NNS', 'NNP', 'NNPS']:
            word = w.lemmatize().lower()
            if word[-1] in punctuation : 
                word = word[:-1]
            if word in stops or word in punctuation or len(word)==1 :
                continue
            nouns.append(word)
    nouns2=[]
    for n in nouns:
        if len(n)!=0:
            nouns2.append(n)
    return nouns2

Parse out the nouns and organize in a list for each document.  We decided to parse on the document level (instead of the sentence level) under the assumption that each document could be classified as a topic (or a few topics).

The output from parseout is saved in separate txt files, available here: https://www.dropbox.com/sh/t8xuqgs3y92lxd8/AAAFSSBTMmGRDuFn5A4zEXLla?dl=0

*This code was run separately for each 5 year period - all example output is for 1966

In [None]:
%%time

parseout = []
for index, row in df.iterrows() : 
    parseout.append(get_parts(row.text))

In [None]:
# saving parseout to txt file
with open("parseout.txt", 'w') as f:
    json.dump(parseout,f)

Using the output for each 5-year time period, we were able to create word clouds to get an idea of the words in our corpus.  Not surprisingly, we can see that generalized words dominate - new, city, state, today.  But we can also pick out the presidents during most periods and some health related words also jump out - program, hospital, doctor. 

Wordcloud work was done in https://github.com/dyan1211/teamsignificant/blob/master/Textblob.ipynb

Code is based on https://github.com/amueller/word_cloud

In [None]:
from os import path
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS

In [None]:
# read the mask image taken from http://static01.nyt.com/images/icons/t_logo_291_black.png
nyt_mask = np.array(Image.open(path.expanduser("~/nyt_mask_2.png")))  #Image.open does not expand ~ on its own

In [None]:
wc = WordCloud(background_color="white", max_words=2000, mask=nyt_mask, stopwords=STOPWORDS)

In [None]:
# create single string of all text for each parseout list
import itertools
document1966 = list(itertools.chain.from_iterable(parseout1966))
document1966 = ' '.join(document1966)
wc.generate(document1966)  #generate wordcloud

# show
plt.imshow(wc)
plt.axis("off")
plt.figure()
plt.show()

![](http://i.imgur.com/m7nfPPr.png)
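To check the impression from a word cloud numerically, the nouns can simply be counted. The sketch below uses a tiny made-up list in place of a real parseout list; `parseout_sample` is hypothetical data.

```python
from collections import Counter
from itertools import chain

# Hypothetical parsed output: one list of nouns per document
parseout_sample = [
    ['city', 'hospital', 'program'],
    ['state', 'program', 'doctor'],
    ['city', 'program'],
]

# Flatten the per-document lists and count every noun
word_counts = Counter(chain.from_iterable(parseout_sample))
top_words = word_counts.most_common(2)
```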

### LDA 

LDA was done in https://github.com/dyan1211/teamsignificant/blob/master/Topics+LDA.ipynb

We started by creating a corpus of our words.

In [None]:
from gensim import corpora, models, similarities, matutils

In [None]:
documents = parseout

In [None]:
dictionary = corpora.Dictionary(documents)
dictionary.filter_extremes(no_below=5,no_above=0.75,keep_n=100000)
dictionary.compactify()

In [None]:
corpus = [dictionary.doc2bow(document) for document in documents]

We wanted a way to decide how many topics would be best for our data.  In addition, since each of our 5-year time periods had a different number of documents, we assumed they were likely to also have a different number of topics (or the corpus for that period may be able to support more topics based on its content). Ideally, we would check the topic output after using several different options for number of topics and decide which worked best for our data.  Due to time constraints, we searched for an automatic method. After searching for different ways to approach this task, we found a method to find the number of topics using  KL divergence based on the idea that we can view LDA as a matrix factorization.  

Code for this section (and details for the method) are based on: http://blog.cigrainger.com/2014/07/lda-number.html

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
from scipy import stats  #stats.entropy(p,q) gives the KL divergence of p from q

def sym_kl(p,q):
    return np.sum([stats.entropy(p,q),stats.entropy(q,p)])

In [None]:
def arun(corpus,dictionary,min_topics,max_topics,step):
    kl = []
    for i in range(min_topics,max_topics,step):
        lda = models.ldamodel.LdaModel(corpus=corpus,
            id2word=dictionary,chunksize=170,num_topics=i)  #we decided on a chunksize of approximately .01 of number of documents
        m1 = lda.expElogbeta
        U,cm1,V = np.linalg.svd(m1)
        #Document-topic matrix
        lda_topics = lda[corpus]
        m2 = matutils.corpus2dense(lda_topics, lda.num_topics).transpose()
        cm2 = l.dot(m2)
        cm2 = cm2 + 0.0001
        cm2norm = np.linalg.norm(l)
        cm2 = cm2/cm2norm
        kl.append(sym_kl(cm1,cm2))
    return kl

In [None]:
l = np.array([sum(cnt for _, cnt in doc) for doc in corpus])
kl = arun(corpus,dictionary,min_topics=1,max_topics=100,step=5)  #searched best number of topics between 1 and 100 by 5

In [None]:
# Plot kl divergence against number of topics
x = np.arange(1,100,5)
plt.plot(x, kl)
plt.ylabel('Symmetric KL Divergence')
plt.xlabel('Number of Topics')
plt.savefig('kldiv1966.png', bbox_inches='tight')

![](http://i.imgur.com/9aRvg9m.png)
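For readers who want to see what the divergence measures, here is a small pure-Python sketch of the symmetric KL divergence, which mirrors what `stats.entropy(p, q)` computes after normalizing its inputs. The selection step assumes that a lower divergence indicates a better number of topics, as in the method we followed. `kl_divergence`, `best_num_topics`, and the divergence values are hypothetical and only for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two sequences of positive weights (normalized first)."""
    p_total, q_total = float(sum(p)), float(sum(q))
    return sum((a / p_total) * math.log((a / p_total) / (b / q_total))
               for a, b in zip(p, q) if a > 0)

def sym_kl(p, q):
    """Symmetric KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def best_num_topics(candidates, divergences):
    """Return the candidate topic count with the smallest divergence."""
    return min(zip(divergences, candidates))[1]

example = sym_kl([0.5, 0.5], [0.9, 0.1])

# Hypothetical divergence values for topic counts 1, 6, 11, 16
chosen = best_num_topics([1, 6, 11, 16], [3.2, 1.4, 0.9, 1.1])
```

With the real `kl` list, the candidates would be the same values used on the x-axis of the plot.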

For the LDA we also needed to decide on the number for chunksize (the number of documents which are loaded into memory at a time).  Using 1966 as an example (with number of topics equal to 50), we tested chunksizes of 10%, 5%, and 1% of the total document number.  After inspecting the topic output, we found that smaller chunksizes (1% of document total) provided clearer topics (the grouped words told a clearer story on average for this chunksize).  

Therefore, for each period, we ran LDA using the number of topics chosen from above and a chunksize of 1% of the number of documents in that period.  
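The 1% rule is easy to compute per period. A minimal sketch follows; `chunksize_for` is a hypothetical helper, and the document count is a made-up round number chosen to reproduce the chunksize of 170 used for 1966.

```python
def chunksize_for(num_documents, fraction=0.01):
    """Return a chunk size equal to `fraction` of the corpus, at least 1."""
    return max(1, int(round(num_documents * fraction)))

# With roughly 17,000 documents (a hypothetical count), 1% gives 170
size = chunksize_for(17000)
```

The result could then be passed as the `chunksize` argument, for example `chunksize=chunksize_for(len(corpus))`.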

The output from our LDA (lda.print_topics) can be found [here](https://docs.google.com/a/mail.harvard.edu/document/d/12jsDy1T5_7QjslxpsDnzfzH60H00rV0YEPizmwzknNU/edit?usp=sharing)

We had a total of 520 topics - 50 topics on average for each period.

In [None]:
import gensim

In [None]:
lda1966 = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,chunksize=170, num_topics=50)

In [None]:
lda1966.print_topics(num_topics=50)

`[u'0.146*law + 0.102*may + 0.066*jersey + 0.044*new + 0.035*car + 0.034*state + 0.032*today + 0.025*critic + 0.023*code + 0.022*nj',
 u'0.138*agency + 0.135*department + 0.087*dept + 0.041*federal + 0.036*employes + 0.036*government + 0.034*u.s + 0.032*standard + 0.025*farm + 0.013*asthma',
 u'0.055*chief + 0.053*pay + 0.050*contract + 0.045*pact + 0.041*cost + 0.041*wage + 0.033*charles + 0.029*girl + 0.026*river + 0.025*ranch',
 u'0.112*south + 0.093*education + 0.082*plan + 0.042*feb + 0.040*line + 0.033*appeals + 0.033*southern + 0.033*opposition + 0.031*action + 0.028*ap',
 u'0.256*state + 0.059*nys + 0.050*gov + 0.046*unit + 0.038*albany + 0.020*race + 0.019*interest + 0.016*term + 0.016*connecticut + 0.016*governors',
 u'0.067*year + 0.060*disease + 0.053*heart + 0.051*dec + 0.048*work + 0.038*dies + 0.037*attack + 0.034*dr + 0.026*expert + 0.025*funds',
 u'0.215*johnson + 0.171*president + 0.078*oct + 0.049*washington + 0.029*change + 0.026*policy + 0.026*april + 0.022*capital + 0.018*set + 0.018*rule',
 u'0.223*city + 0.099*council + 0.062*budget + 0.060*rate + 0.028*year + 0.028*part + 0.028*francis + 0.026*increase + 0.026*rise + 0.026*yesterday',
 u'0.082*office + 0.057*poverty + 0.039*o'connor + 0.037*dec + 0.034*foundation + 0.033*gen + 0.031*suit + 0.028*poor + 0.028*abortion + 0.027*fed',
 u'0.069*hospitals + 0.063*police + 0.050*grant + 0.045*cut + 0.043*hotel + 0.035*gardner + 0.034*commission + 0.031*pub + 0.030*urban + 0.030*pressure',
 u'0.078*public + 0.065*act + 0.046*today + 0.041*upi + 0.034*calif + 0.031*team + 0.029*phs + 0.026*visit + 0.025*service + 0.023*inquiry',
 u'0.041*way + 0.027*year + 0.026*roosevelt + 0.022*finds + 0.022*activity + 0.020*prime + 0.020*renewal + 0.020*peking + 0.020*thursday + 0.018*fla',
 u'0.096*war + 0.038*march + 0.037*peace + 0.036*today + 0.032*secretary + 0.031*defense + 0.026*us + 0.026*world + 0.023*foreign + 0.021*threat', 
 u'0.077*birth + 0.072*control + 0.055*director + 0.049*woman + 0.046*use + 0.029*children + 0.026*back + 0.026*service + 0.023*data + 0.021*birthday',
 u'0.051*authority + 0.048*columbia + 0.045*season + 0.035*sale + 0.034*bank + 0.032*source + 0.028*pope + 0.027*stress + 0.024*lag + 0.021*named',
 u'0.093*party + 0.073*leader + 0.057*campaign + 0.054*senator + 0.038*candidate + 0.035*election + 0.031*l.i + 0.022*nassau + 0.020*leaders + 0.020*g.o.p',
 u'0.058*job + 0.056*water + 0.050*survey + 0.047*area + 0.043*men + 0.041*space + 0.035*brooklyn + 0.032*chairman + 0.019*league + 0.019*vaccine',
 u'0.078*welfare + 0.061*conf + 0.058*news + 0.036*com + 0.036*conference + 0.030*today + 0.030*committee + 0.029*white + 0.027*washington + 0.026*proposal',
 u'0.148*school + 0.050*court + 0.042*student + 0.042*negro + 0.037*order + 0.028*ct + 0.027*rights + 0.026*today + 0.025*federal + 0.025*fed',
 u'0.127*study + 0.066*report + 0.064*research + 0.042*operation + 0.037*natl + 0.036*problem + 0.035*group + 0.034*condition + 0.032*panel + 0.024*inst',
 u'0.054*john + 0.050*bellevue + 0.049*post + 0.049*china + 0.041*mao + 0.031*queens + 0.026*communist + 0.023*park + 0.021*summer + 0.020*story',
 u'0.026*sept + 0.022*july + 0.015*college + 0.015*nurse + 0.014*mr + 0.012*tex + 0.012*help + 0.010*today + 0.010*saigon + 0.009*equipment',
 u'0.045*security + 0.038*social + 0.036*americans + 0.035*soc + 0.032*econ + 0.032*business + 0.031*benefit + 0.028*article + 0.027*need + 0.025*growth',
 u'0.342*program + 0.070*administration + 0.050*adm + 0.030*federal + 0.029*u.s + 0.022*future + 0.022*javits + 0.020*washington + 0.015*priority + 0.012*harry',
 u'0.113*medicare + 0.094*home + 0.052*care + 0.046*death + 0.043*person + 0.038*question + 0.036*mass + 0.036*surgery + 0.030*fire + 0.029*message',
 u'0.147*nov + 0.120*dec + 0.043*rise + 0.042*today + 0.035*price + 0.031*cent + 0.029*ap + 0.027*increase + 0.026*san + 0.025*washington',
 u'0.093*problem + 0.086*system + 0.083*nation + 0.076*pres + 0.050*support + 0.044*crisis + 0.041*us + 0.040*int + 0.035*shortage + 0.028*london',
 u'0.131*center + 0.085*patient + 0.053*medicaid + 0.048*community + 0.046*drs + 0.039*nyc + 0.038*money + 0.031*clinic + 0.029*physician + 0.027*medicine',
 u'0.049*comr + 0.049*st + 0.046*building + 0.038*commissioner + 0.035*age + 0.033*population + 0.031*room + 0.030*field + 0.027*youth + 0.024*yesterday',
 u'0.114*time + 0.048*end + 0.045*airline + 0.039*legislation + 0.033*factor + 0.030*merger + 0.025*key + 0.025*animal + 0.024*agreement + 0.021*md',
 u'0.217*hospital + 0.084*hosp + 0.050*dr + 0.037*university + 0.033*yesterday + 0.028*east + 0.027*harlem + 0.022*series + 0.021*charge + 0.020*manhattan',
 u'0.100*union + 0.089*strike + 0.076*labor + 0.047*hearing + 0.044*worker + 0.042*dispute + 0.040*nurses + 0.031*progress + 0.030*demand + 0.027*amendment',
 u'0.065*company + 0.054*safety + 0.041*statement + 0.041*co + 0.040*robert + 0.036*international + 0.031*force + 0.029*industry + 0.025*resignation + 0.024*map',
 u'0.100*dr + 0.093*drug + 0.061*food + 0.056*effect + 0.042*doctor + 0.038*society + 0.028*research + 0.027*washington + 0.024*scientist + 0.021*lsd',
 u'0.190*city + 0.101*lindsay + 0.087*mayor + 0.053*yesterday + 0.033*service + 0.027*emergency + 0.021*resident + 0.020*bronx + 0.019*plant + 0.019*plan',
 u'0.125*vietnam + 0.062*housing + 0.053*test + 0.036*facility + 0.033*rev + 0.031*level + 0.030*training + 0.028*rusk + 0.025*tomorrow + 0.022*number',
 u'0.124*life + 0.056*result + 0.053*mental + 0.044*n.j + 0.026*accident + 0.026*dr + 0.024*discussion + 0.024*figure + 0.019*yale + 0.019*retirement',
 u'0.166*aid + 0.140*bill + 0.090*hosps + 0.050*right + 0.040*vote + 0.035*treatment + 0.025*today + 0.024*legis + 0.022*urge + 0.021*com',
 u'0.163*fund + 0.083*world + 0.074*project + 0.071*national + 0.069*head + 0.052*chicago + 0.030*construction + 0.030*william + 0.029*republican + 0.019*measure',
 u'0.149*united + 0.120*states + 0.044*son + 0.044*jr + 0.042*daughter + 0.033*army + 0.027*history + 0.025*shriver + 0.025*america + 0.022*james',
 u'0.085*drive + 0.061*record + 0.040*george + 0.033*hope + 0.026*theater + 0.026*high + 0.026*decade + 0.025*town + 0.022*reports + 0.022*sea',
 u'0.122*pollution + 0.115*air + 0.063*control + 0.058*role + 0.041*fight + 0.027*step + 0.026*exec + 0.025*first + 0.023*issue + 0.020*north',
 u'0.110*day + 0.067*man + 0.066*family + 0.055*illus + 0.049*planning + 0.047*night + 0.029*por + 0.020*year + 0.018*beach + 0.016*goal',
 u'0.138*rockefeller + 0.104*governor + 0.036*insurance + 0.029*yesterday + 0.028*tour + 0.027*costello + 0.026*catholic + 0.026*state + 0.025*session + 0.023*plan',
 u'0.127*mrs + 0.086*county + 0.082*child + 0.056*miss + 0.035*aug + 0.027*staff + 0.022*teacher + 0.019*mother + 0.018*parent + 0.018*philadelphia',
 u'0.082*june + 0.064*cabinet + 0.061*kennedy + 0.057*tax + 0.043*government + 0.031*today + 0.030*govt + 0.026*min + 0.024*gift + 0.021*little',
 u'0.198*new + 0.106*york + 0.043*medical + 0.035*city + 0.034*case + 0.031*american + 0.030*yesterday + 0.028*assn + 0.027*association + 0.024*ny',
 u'0.092*senate + 0.081*thaler + 0.078*sen + 0.069*brown + 0.050*angeles + 0.049*los + 0.038*move + 0.032*finance + 0.030*plans + 0.027*coast',
 u'0.085*board + 0.061*bd + 0.058*site + 0.056*meeting + 0.051*power + 0.048*talk + 0.037*general + 0.032*island + 0.031*executive + 0.028*california',
 u'0.149*house + 0.078*washington + 0.075*congress + 0.058*cancer + 0.041*today + 0.039*hr + 0.037*cigarette + 0.035*cong + 0.027*advertising + 0.023*ad']`

#### Visualization of topics

pyLDAvis is a Python library for visualizing topic models. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization. It was useful in classifying topics, as described further below. The package information can be found here: https://pypi.python.org/pypi/pyLDAvis

The interactive models for each period can be found at http://nbviewer.ipython.org/github/dyan1211/teamsignificant/blob/master/InteractiveLDA.ipynb

In [None]:
import pyLDAvis.gensim as gensimvis
import pyLDAvis

In [None]:
vis_1966 = gensimvis.prepare(lda1966, corpus, dictionary)
pyLDAvis.display(vis_1966)

![](http://i.imgur.com/dXcRcWj.png)

#### Creating new data frame 

The idea is to count the articles on a topic within each time period and then plot that count over time to understand the trend in the topic's popularity. Instead of assigning each article a single topic, we treat it as contributing to several topics according to the percent with which it belongs to each one; we considered this weighted sum a better measure of topic popularity. To do this, we created a new data frame whose columns are `Topics` and whose rows are `Articles`, with each cell holding the percent of that article under that topic, and then appended it to each 5-year data frame (to match up the `Date`). 

These dataframes are stored in our dropbox: https://www.dropbox.com/sh/6f9vok950zrg97h/AAA91zJUr_GKUcSvwHatHMkya?dl=0

In [None]:
n = 50
#Document-by-topic matrix of percents (zero where LDA returns no value for a topic)
topicdf = np.zeros(shape=(len(corpus),n))
for index, bow in enumerate(corpus) : 
    topics = lda1966.get_document_topics(bow)
    for t in topics :
        topicdf[index][t[0]] = t[1]

In [None]:
topicdf1966 = pd.DataFrame(topicdf,columns=["Topic"+str(i) for i in range(0,n)])
comb1966 = pd.concat([df1966.reset_index(drop=True), topicdf1966], axis=1)  #reset index so rows align by position
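The weighted count described above can be sketched in plain Python. Summing by year is an assumption made for this example (any time unit would work), and `weighted_topic_counts` and the sample rows are hypothetical.

```python
from collections import defaultdict

def weighted_topic_counts(rows, topic_columns):
    """Sum each topic's per-article share within each year.

    `rows` is a list of dicts with a 'date' string (YYYY-MM-DD) and one
    share per topic column; the result maps year -> {topic: summed share}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        year = int(row['date'][:4])
        for topic in topic_columns:
            totals[year][topic] += row.get(topic, 0.0)
    return {year: dict(by_topic) for year, by_topic in totals.items()}

# Hypothetical rows with two topics
rows = [
    {'date': '1966-01-05', 'Topic0': 0.7, 'Topic1': 0.3},
    {'date': '1966-08-19', 'Topic0': 0.2, 'Topic1': 0.8},
    {'date': '1967-02-11', 'Topic0': 0.5, 'Topic1': 0.5},
]
trend = weighted_topic_counts(rows, ['Topic0', 'Topic1'])
```

An article that is 70% one topic and 30% another adds 0.7 and 0.3 to the two totals, so each article still contributes a total weight of about one.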

#### Qualitative topic analysis

Next we define a function that takes the top 20 articles, based on their probability generated by LDA of falling within a certain topic, and print them out. With these prints, combined with topic output, we were able to qualitatively assign labels to topics.

In [None]:
import heapq

In [None]:
#Get the text of the 20 articles with the highest percent for a topic
def topic_check(dataframe, topic_num):
    top20 = dataframe[topic_num].nlargest(20).index  #row labels of the 20 largest values
    return list(dataframe.loc[top20, 'text'])

As an example, we analyzed Topic 48 from years 1971-1975 using the output generated by LDA:

`u'0.081*united + 0.078*disease + 0.067*states + 0.060*level + 0.048*us + 0.032*u.s + 0.022*death + 0.021*washington + 0.021*official + 0.020*dr',`

and the following output. In order to avoid potential biases, we analyzed all topics separately before comparing results. Ultimately, the following topic was labeled *epidemics*. This process was repeated for all 5-year datasets and all topics contained within them. Though time consuming, placing labels on topics allowed us to analyze similar topics across our datasets, and to further investigate topics with more interesting and clear labels. Admittedly, this type of analysis is not robust, and our subjective labeling of topics does not necessarily make them comparable. However, given that exploration is our primary objective, we reasoned that this effort was nevertheless useful. 

The full list of our topic labels is available here: https://docs.google.com/spreadsheets/d/15Ubu_jzF3-xKvSFrz13MUznu-azqrRRIfdVc8StBVHk/edit#gid=1655218739

In [None]:
topic_check(comb1971, 'Topic48')

`['Health Center Puts U.S. Flu Outbreaks At Epidemic Stage ATLANTA, Jan. 17 (UPI)-- The National Center for Disease control report today that influenza had spread throughout most of the United States. US Disease Control Center repts that influenza has spread through most of US; says influenza deaths, reptd at 754 for 2d wk of Jan, have exceeded expected levels in121 major cities; center epidemiologist Dr Lawrence Corey says localized outbreaks are occurring in Midwest and on East Coast (M)',
 'Seen Conducive to Epidemic A public health official warned today that immunization against polio was dropping to a level that might permit epidemics torecur in the United States. US Disease Control Center officials on July 1 warn that immunization against polio is dropping to level that might permit epidemics to recur in US',
 'FLU NOW EXCEEDS EPIDEMIC LEVEL; But Disease Control Office Is Cautious About Figures A Blip on the Graph The Center for Disease Control said today that deaths from influenza and pneumonia this winter across the country had moved above the epidemic threshold for the first time. PHS Center for Disease Control on Jan 12 says that deaths from pneumonia in US have moved above epidemic threshhold for 1st time',
 'NEW STRAINS OF FLU EXPECTED IN WINTER ATLANTA, Ga., Oct. 6 (AP) -- New strains of Type A influenza are likely to appear in the United States this winter, the Center for Disease Control reported today. US Center for Disease Control on Oct 6 repts new strains of Type A influenza are likely to appear in US during winter, but cannot determine whether widespread outbreaks are likely to occur; says currently available influenza vaccine should offer some protection against new strains',
 'Influenza Epidemic Strikes in Fort Lauderdale FORT LAUDERDALE, Fla., Jan. 5 (AP) -- An outbreak of influenza that produced 18 deaths in one week has forced health officials to declare an epidemic in this resort city. Fort Lauderdale has been rapidly filling up with vacationing elderly, who are the most, susceptible to the disease.  ',
 'FLU DEATH EPIDEMIC SPREADING TO COAST ATLANTA, July 24 (UPI) Deaths from influenza remained above the "epidemic threshold" for the second consecutive week, with the disease spreading to the West Coast for the first time this winter, the National Center for Disease Control reported today. Florida: Statement by Pinellas County (Fla) Health Dept dir George Dame that tourists should stay away from St Petersburg area because of flu epidemic draws sharp reaction from businessmen; Dame later suggests that older people wait 1 wk while younger people would be basically unaffected (S)',
 'HOG CHOLERA FOUND IN TEXAS OUTBREAK WASHINGTON, July 5 (AP) Hog producers throughout the Southwest and in the rest of the nation are being warned to watch for signs of hog cholera in their herds following the first confirmed outbreak of the disease in the United States in more than a year. Agr Dept confirms outbreak of hog cholera and warns hog producers throughout Southwest to watch for signs of disease; outbreak has been found in herd of 170 hogs owned by Louis Woodford, Texas (S)',
 'U Thant Is Dead of Cancer at 65; UT hant Is Dead of Cancer; United Nations Mourns U Thant, the quiet Burmese schoolteacher who became the third Secretary General of the United Nations and held that post longer than any other person, died yesterday at Columbia-Presbyterian Medical Center. He was 65 years old.  ',
 'Painful Lessons in \'Realism And Candor\'; United Nations: UNITED NATIONS -- "We have reached the point at which it is no service to the idea of the United Nations and no contribution to its future to blink at its limitations. . . . We believe that the time has come for a large dose of realism and candor in United States policy toward the United Nations."  ',
 'Flu Is Linked to 29 Deaths In Three Florida Counties ST. PETERSBURG, Fla., Jan. 22 (UPI)--A virulent strain of influenza, hitting especially hard at retired persons in the Tampa Bay area, was linked today to at least 29 deaths. virulent strain of influenza, hitting especially hard at retired persons in Tampa Bay area, Fla, linked Jan 22 to at least 29 deaths; heaviest toll is in Pinellas County where flu is given as contributing factor in 17 deaths, mostly of elderly persons suffering heart ailments (S)',
 'TYPHOID EPIDEMIC IN MEXICO ABATES; Doctors Say That Outbreak Is Waning After 3 Years Warning Issued on Haiti An epidemic of typhoid fever that affected thousands of Mexicans and that was described icans and that was described as the largest in the world in several decades has abated, Mexican and United States Public Health Service doctors said in a report received yesterday. US PHS repts typhoid epidemic in Mexico is abating; warns travelers to Haiti that typhoid fever is endemic in that country; warns that vaccination alone cannot replace good health practice in food and water consumption (M)',
 "Fighter Against TB As recently as three decades ago, forty thousand or more Americans died of tuberculosis each year, victims of what was then still one of mankind's major enemies among the infectious diseases. By the mid-nineteenfifties, however, the annual number of tuberculosis deaths in this country was down to ten thousand and it has continued to decline ever since. ed lauds development by Dr S A Waksman, who died on Aug 16, of streptomycin, which initiated fight to eradicate tuberculosis in US",`
 
....

#### Sum up the probabilities and standardize by total # of articles

Since each time period contains a different number of articles, as the graph at the beginning shows, we made the periods comparable by summing the probabilities of each topic over all articles and dividing by the total number of articles in that time range. We first tried grouping the data by month, but the plots were too noisy. We therefore grouped the data by quarter and created time series of the mean topic probability within each quarter.

In [None]:
ts = comb1981.set_index('date').resample('Q')['Topic10']
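The cell above relies on the pandas behavior at the time of writing, where `resample('Q')` averaged by default. Newer pandas versions require the aggregation to be stated explicitly. The following is a minimal sketch of the same quarterly mean using a period-based `groupby`; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical articles with a publication date and one topic probability.
toy = pd.DataFrame({
    "date": pd.to_datetime(["1981-01-15", "1981-02-20", "1981-05-03", "1981-06-30"]),
    "Topic10": [0.10, 0.30, 0.02, 0.04],
})

# Group articles by calendar quarter and average the topic probability,
# which is the same as summing and dividing by the number of articles.
quarter = toy["date"].dt.to_period("Q")
quarterly_mean = toy.groupby(quarter)["Topic10"].mean()
```

Here the first quarter averages two articles and the second quarter averages the other two, so quarters with more articles do not automatically score higher.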

#### Plotting each topic

Besides picking out interesting topics based on the probabilities and text from the LDA output, we also looked at the trend of each topic without reading any of the text. This gave us very interesting results, since it reduces the bias of subjective judgments based on the texts themselves. We plotted every topic over time to see how its popularity changes. We treated topics with an apparent regular pattern (periodic, increasing, or decreasing) and topics with an unusual spike at some point as potentially interesting topics or events, and then took a closer look at their documents and the historical evidence for them.

In [None]:
def plottopic(df, n) : 
    # floor division keeps the grid indices integers under both Python 2 and 3
    fig, axes = plt.subplots(nrows=n//5, ncols=5, figsize=(15, n//2), tight_layout=True)
    for i in range(n) : 
        ts = df.set_index('date').resample('Q')['Topic'+str(i)]
        ax = axes[i//5][i%5]
        ax.plot(ts,label="Topic%s" % i)
        ax.legend(frameon=False, loc="upper right")
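The panel indexing in `plottopic` places topic `i` in a grid of five columns, filled row by row. A small sketch of that mapping, using `divmod` and the illustrative names `n_topics`, `n_cols`, and `positions`:

```python
n_topics, n_cols = 50, 5

# divmod(i, n_cols) gives the (row, column) of panel i in a row-major grid.
positions = [divmod(i, n_cols) for i in range(n_topics)]
```

With 50 topics this yields ten rows of five panels, matching the `nrows` and `ncols` passed to `plt.subplots` above.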

In [None]:
plottopic(comb1966,50)

![](http://i.imgur.com/bsKQKgk.png)

# Final Analysis:

#### Plotting
We define fonts for consistency in plotting.

In [None]:
font = {'family': 'serif',
        'color':  'darkred',
        'weight': 'normal',
        'size': 16,
        }
font2 = {'family': 'serif',
         'color':  'black',
         'weight': 'normal',
         'size': 16,
         }
font3 = {'family': 'serif',
         'color':  'black',
         'weight': 'normal',
         'size': 20,
         }
font4 = {'family': 'serif',
         'color':  'black',
         'weight': 'normal',
         'size': 12,
         }

### Event plots

By looking at plots of each topic over time, we identified some topics with large spikes that we could correlate with specific events, ranging from outbreaks to high-profile lawsuits to some recent events.  Most of the identified events were New York-related, such as the Long Island Railroad massacre of 1993 (an event none of our team had heard of, but one important enough in the NYT to warrant its own topic). 

The next sequence of plots highlights events of interest that appeared in LDA topic modeling. Rather than being consistent throughout datasets, these events occurred within a single 5-year interval and reflected a very specific circumstance that dominated a series of New York Times articles. We identified these topics in the topic check, and then plotted them with the expectation that the topic would "spike" during the actual occurrence of the event and trail off in interest afterwards. Over the 5 decades, we identified 13 different events.  The Y axis represents LDA topic probabilities for articles averaged within quarters of a year (in other words, the 'popularity' of a topic), and the X axis is time.  We divided time into quarters of a year to cut back on noise while still being able to see how the topic popularity changes. 

Code for all plots can be found here: https://github.com/dyan1211/teamsignificant/blob/master/AIDS_vis.ipynb

In [None]:
#example code for a specific event plot
with plt.style.context('fivethirtyeight'):
    plot = comb[1991].set_index('date').resample('Q')['Topic66'].plot()
    plot.set_ylim(0.008,0.018)
    plot.set_xlim(pd.to_datetime('1990-12-1'),pd.to_datetime('1996-1-1'))
    plot.text(pd.to_datetime('1993-7-20'), 0.017, "Philip Morris Lawsuit", fontdict=font)
    plot.set_title("1991-1996 Topic 66", fontdict=font2)
    plot.tick_params(axis='both', which='major', labelsize=15) # increase xlabel fontsize
    plot.axes.get_yaxis().set_ticks([]) # remove yticks
    plot.set_xlabel("") # remove xlabel
    plot.grid(False) # remove grid

#### Some specific Health events and descriptions

Love Canal is a neighborhood of Niagara Falls in upstate NY which was built on top of the chemical dumping grounds of Hooker Chemical in the 1950s, when the dangers of chemical waste were not fully understood.  The population, especially the children, began experiencing increased rates of illness.  National attention to the issue started in 1978 (first spike), when President Carter declared it a federal health emergency and some families were relocated from the inner area.  In 1979 (second spike), a second evacuation was issued for pregnant women and infants after research confirmed increased numbers of miscarriages and birth defects.  It was not until late 1980 (the large spike in our graph) that the remaining residents were evacuated and the government reimbursed them for the cost of their houses.

![](http://i.imgur.com/CyOWVGT.png)

In 1979, the Three Mile Island nuclear power plant in Pennsylvania had a partial meltdown.  This event fueled anti-nuclear sentiment in the US.
![](http://imgur.com/JnmPgdv.png)

West Nile Virus is a mosquito-borne virus which has the potential to cause neurological disease and death (usually in less than 1% of cases).  Before the 1990s, it was sporadic and not considered dangerous.  Outbreaks began in the mid-1990s in Northern Africa and Europe.  In 1999 (medium spike), the virus spread to the US, with several cases of encephalitis occurring in New York.  The next year (large spike), it had spread to 14 states in the US, and many states, including New York, began spraying chemical pesticides in an attempt to be proactive against further cases.  Many of the articles during this spike are notices of spraying in New York.
![](http://i.imgur.com/G9twGkq.png)

Some other interesting events: 
![](http://i.imgur.com/4r7BXaz.png)
![](http://i.imgur.com/o5J2nYC.png)
![](http://i.imgur.com/qIsMFFP.png)
![](http://i.imgur.com/9E5ars8.png)
![](http://i.imgur.com/n1RkYA4.png)
![](http://i.imgur.com/khwkd1C.png)
![](http://i.imgur.com/mOFLWM0.png)

#### Compared with Google Trends

Google Trends provides data on topic searches, which we thought could be a useful comparison for some of our events. Its data only go back to 2004, so we picked three interesting health-related findings from recent years, namely Swine Flu, Ebola, and the ACA (Health Bill), to compare the LDA topic modeling based on New York Times articles with the popularity of terms in Google Trends. The numbers that appear in Google Trends data show total searches for a term relative to the total number of searches done on Google over time (scaled from 0 to 100). Technically, this is a relative popularity of the term. We rescaled the Google Trends score to a scale comparable with the probabilities we had from LDA, and plotted them together. 

The code below is an example for Swine Flu. The entire process can be found at https://github.com/dyan1211/teamsignificant/blob/master/Topics+LDA.ipynb

In [None]:
event_swine = comb2006.set_index('date').resample('Q')['Topic34']
swine = pd.read_csv("swine.csv")
swine['Week'] = pd.to_datetime(swine['Week'])
swines = swine.set_index('Week').resample('Q')['swine flu']/600
swinedf = pd.DataFrame({"NYT":event_swine,"Google Trends":swines},index=event_swine.index)
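The divisor of 600 above was chosen by hand to bring the two series onto a similar scale. As an alternative that is not what the notebook does, the sketch below rescales one series linearly onto the min-max range of another. The helper `rescale_to` and the example values are hypothetical.

```python
import pandas as pd

def rescale_to(reference, other):
    # Linearly map `other` onto the min-max range of `reference`.
    span = other.max() - other.min()
    if span == 0:
        # A constant series carries no trend; place it at the reference minimum.
        return pd.Series(reference.min(), index=other.index)
    unit = (other - other.min()) / span
    return unit * (reference.max() - reference.min()) + reference.min()

nyt = pd.Series([0.010, 0.030, 0.020])   # hypothetical quarterly topic means
trends = pd.Series([5.0, 100.0, 40.0])   # hypothetical Google Trends scores
trends_scaled = rescale_to(nyt, trends)
```

Either approach only aligns the scales for plotting. The rescaled values have no interpretation of their own, so the comparison remains one of timing and shape.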

In [None]:
with plt.style.context('fivethirtyeight'):
    plot = swinedf.plot()
    plot.set_title("Swine Flu - LDA Topic Modeling vs Google Trends Popularity", fontdict=font3)
    plot.tick_params(axis='both', which='major', labelsize=15) # increase xlabel fontsize
    plot.axes.get_yaxis().set_ticks([]) # remove yticks
    plot.set_xlabel("") # remove xlabel
    plot.grid(False) # remove grid
    plt.savefig('swineflu.png', bbox_inches='tight')

#### Swine Flu

The result shows that the two trends coincide with each other very well: both show a double spike (though the Google Trends one is less pronounced). It seems our model using The New York Times data correctly captures the actual trend of topic popularity. We note here (and for the other plots) that because LDA assigns some documents unrelated to the event to this topic, the baseline levels will be greater than 0 throughout.

![](http://i.imgur.com/CSoF5MS.png)

#### Patient Protection and Affordable Care Act

Unlike Swine Flu and Ebola (see below), the Google Trends plot for the ACA shows general attention lagging slightly behind the New York Times. This makes sense: the other two are disease outbreaks that spread extremely fast, while the PPACA is a political topic that takes time for people to pay attention to. 

![](http://i.imgur.com/a2mgS0C.png)

#### Ebola

Ebola showed a very significant spike which correlates with the major outbreak last year. It also coincides perfectly with Google Trends. 

Overall, though these correlations do not provide an exact method for verifying our results, they do increase our confidence, at least in the ability of LDA to identify major topics.

![](http://i.imgur.com/K97Td1J.png)

### Compared with PubMed

For some specific health-related topics, we were able to conduct further exploration by comparing our LDA topic modeling results with the publication frequency trends in PubMed. A great example here is the AIDS outbreak, which was in popular news and then sparked an entire body of health literature.

The original code for everything under this heading can be found here: https://github.com/dyan1211/teamsignificant/blob/master/AIDS_vis.ipynb

#### Reading in topic dataframes

In [None]:
comb = {}
for i in range(1966,2006,5):
    print "getting dataframe %d..." % i
    comb[i] = pd.read_csv("/Users/Luke/Documents/cs109/project/comb%d.csv" % i)
    comb[i]['date'] = pd.to_datetime(comb[i]['date'])
    comb[i]['year'] = comb[i]['date'].dt.year
    comb[i]['month'] = comb[i]['date'].dt.month

#### Issues with 2006 dataframe, NaT values in date column

A small issue was encountered with the `date` column of the 2006 dataset. It is possible there was an error in downloading the file on this computer, as some of us did not encounter this issue, though re-uploading and re-downloading the csv did not change the outcome. Eventually we discovered that two of the 45020 documents in the dataframe were missing values in the `date` column. These rows were dropped. Given how few they are relative to the full dataset, the potential loss of information is negligible.

In [None]:
s = comb[2006]['date'].convert_objects(convert_dates='coerce')
s[pd.isnull(s)]

`27827   NaT
32989   NaT
Name: date, dtype: datetime64[ns]`

In [None]:
comb[2006].shape

`(45020, 58)`

In [None]:
comb[2006] = comb[2006].drop(27827)
comb[2006] = comb[2006].drop(32989)
comb[2006].shape

`(45018, 58)`
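The same cleanup can be sketched without hard-coding row labels. The example below assumes a newer pandas version, where `pd.to_datetime(..., errors="coerce")` plays the role that `convert_objects` plays above. The frame `toy` and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one unparseable and one missing date.
toy = pd.DataFrame({
    "date": ["2006-03-01", "not a date", "2007-11-15", None],
    "text": ["a", "b", "c", "d"],
})

# errors="coerce" turns unparseable or missing entries into NaT.
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")
n_missing = toy["date"].isnull().sum()

# Drop rows with no usable date instead of dropping by hard-coded index.
cleaned = toy.dropna(subset=["date"])
```

Counting the missing values before dropping them gives the same check as comparing the shapes before and after.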

In [None]:
comb[2011] = pd.read_csv("/Users/Luke/Documents/cs109/project/comb2011.csv")

In [None]:
for i in [2006,2011]:
    comb[i]['date'] = pd.to_datetime(comb[i]['date'])
    comb[i]['year'] = comb[i]['date'].dt.year
    comb[i]['month'] = comb[i]['date'].dt.month

#### AIDS trends over time, as modeled by LDA

We begin with an initial look at the topic *AIDS* (acquired immunodeficiency syndrome) within our data. This topic was prevalent only in the years 1986-2001. Though the word *AIDS* did appear after 2001, it was buried by noise from other topics. Thus, the following analyses consider only data from the aforementioned time interval.

We chose to study this topic further for multiple reasons. First, it was a clear label that we easily determined in the topic check. Second, we hypothesized that there would be noticeable time trends for *AIDS* that might appear in the data. Third, *AIDS* is a unique and specific topic within the field of public health, allowing us to compare our results from LDA against scientific articles in peer-reviewed journals using PubMed's API.

The Y axis of the graph below is the probability of a specific topic (*AIDS*) generated by LDA. The X axis is the date of publication for these articles, divided into quarters within a year to reduce noise. By default, resampling gives the mean topic probability within each quarter. We then concatenated the data to provide a smooth line for the entire interval of interest.

In [None]:
cont86 = comb[1986].set_index('date').resample('Q')['Topic5']
cont91 = comb[1991].set_index('date').resample('Q')['Topic55']
work = cont86.combine_first(cont91)
cont96 = comb[1996].set_index('date').resample('Q')['Topic47']
work = work.combine_first(cont96)
cont01 = comb[2001].set_index('date').resample('Q')['Topic3']
work = work.combine_first(cont01)
plt.figure(figsize=(20,6)); work.plot();
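The stitching above relies on `combine_first`, which keeps the calling series' value wherever both series share a date and fills the remaining dates from the argument. A minimal sketch with made-up quarterly values (the names `first`, `second`, and `stitched` are illustrative):

```python
import pandas as pd

first = pd.Series([0.02, 0.03],
                  index=pd.to_datetime(["1990-09-30", "1990-12-31"]))
second = pd.Series([0.99, 0.04, 0.05],
                   index=pd.to_datetime(["1990-12-31", "1991-03-31", "1991-06-30"]))

# combine_first keeps values from `first` where both series have the same date
# and fills the remaining dates from `second`.
stitched = first.combine_first(second)
```

If two 5-year datasets share a boundary quarter, the earlier dataset's value is therefore the one that appears in the combined line.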

![](http://imgur.com/57Gyf4t.png)

#### Pubmed data retrieval

The following functions were created to retrieve data from PubMed's API. We used code provided publicly on GitHub to assist in this retrieval. Full discussion of these methods is available at the following URL: https://marcobonzanini.wordpress.com/2015/01/12/searching-pubmed-with-python/# and the author's GitHub: https://gist.github.com/bonzanini/5a4c39e4c02502a8451d. 

First we install the Biopython package in the terminal: `pip install biopython`

Next we import the `Entrez` module from Biopython. 

In [None]:
import requests
import time
import json
from Bio import Entrez

The search function takes `query` (the topic of interest - string), `retmax` (the maximum number of articles to retrieve from PubMed - integer), `mindate` (the starting date for the search - integer), `maxdate` (the ending date for the search - integer), and `reldate` (when interested in articles from the last N days - integer). We also provide an email to the function in case the Entrez database needs to contact us about an excessive number of calls. This function is the primary call to PubMed.

In [None]:
def search(query, retmax=False, mindate=False, maxdate=False, reldate=False): 
    Entrez.email = 'lukeam2929@gmail.com'
    handle = Entrez.esearch(db='pubmed', 
                            sort='relevance', 
                            retmax=retmax,
                            retmode='xml',
                            datetype='pdat',
                            #reldate=reldate,
                            mindate=mindate,
                            maxdate=maxdate,
                            term=query)
    results = Entrez.read(handle)
    return results

The function `fetch_details` takes the list of PubMed IDs returned by `search` and fetches the full record for each of them.

In [None]:
def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'your.email@example.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

The function `pubmeddf` combines the previously defined functions and outputs a pandas dataframe with the following columns: `PMID` - the id assigned to the article by PubMed, `Title` - the title of the publication, `Day`, `Month`, `Year`, and `Abstract` - the abstract of the publication, if available. In most cases, the dates for articles were publication dates. However, this information was not available for all articles, so in those cases we used the `DateCreated` parameter instead. `DateCreated` is the date PubMed assigned a PMID to the article and made it available for search in the database. We reasoned this was justifiable given its relative closeness to the publication dates (typically a few months after publication), and that the distribution of articles using `DateCreated` would not differ substantially between our 5-year intervals.

In [None]:
def pubmeddf(term, retmax, mindate, maxdate):
    results = search(term, retmax, mindate, maxdate) # add reldate if needed
    id_list = results['IdList']
    papers = fetch_details(id_list)
    all_papers = []
    for paper in papers:
        d = {}
        try:
            d['PMID'] = int(paper['MedlineCitation']['PMID'])
            d['Title'] = paper['MedlineCitation']['Article']['ArticleTitle']
            if len(paper['MedlineCitation']['Article']['ArticleDate']) > 0:
                d['Month'] = paper['MedlineCitation']['Article']['ArticleDate'][0]['Month']
                d['Day'] = paper['MedlineCitation']['Article']['ArticleDate'][0]['Day']
                d['Year'] = paper['MedlineCitation']['Article']['ArticleDate'][0]['Year']
            else:
                d['Month'] = paper['MedlineCitation']['DateCreated']['Month']
                d['Day'] =paper['MedlineCitation']['DateCreated']['Day']
                d['Year'] = paper['MedlineCitation']['DateCreated']['Year']
            try:
                d['Abstract'] = paper['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
            except:
                d['Abstract'] = None

            all_papers.append(d)
        except:
            continue

    return pd.DataFrame(all_papers)
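The date fallback inside `pubmeddf` can be isolated into a small helper. The sketch below uses plain dictionaries that mimic only the fields accessed above; real Entrez records contain many more fields and may differ in structure, and the helper name `extract_date` is hypothetical.

```python
def extract_date(paper):
    # Return (year, month, day) strings, preferring ArticleDate over DateCreated.
    citation = paper["MedlineCitation"]
    article_dates = citation["Article"].get("ArticleDate", [])
    source = article_dates[0] if article_dates else citation["DateCreated"]
    return source["Year"], source["Month"], source["Day"]

# Hypothetical records mimicking only the fields used above.
with_article_date = {"MedlineCitation": {
    "Article": {"ArticleDate": [{"Year": "1994", "Month": "03", "Day": "12"}]},
    "DateCreated": {"Year": "1994", "Month": "06", "Day": "01"}}}
without_article_date = {"MedlineCitation": {
    "Article": {"ArticleDate": []},
    "DateCreated": {"Year": "1988", "Month": "11", "Day": "30"}}}

dates = [extract_date(with_article_date), extract_date(without_article_date)]
```

Separating the fallback this way also makes it easy to record which source each date came from, should the effect of `DateCreated` need to be checked later.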

We called the function separately for each 5-year interval to emulate our process from LDA. We attempted to retrieve 10000 articles for each call, as this is the maximum amount PubMed allows in a single call. We also opted to use the term *HIV* instead of *AIDS*, as it is more concise and representative within medical literature.

In [None]:
pmdf = {}
pmdf[1986] = pubmeddf("HIV", "10000", 1986, 1990)
pmdf[1991] = pubmeddf("HIV", "10000", 1991, 1995)
pmdf[1996] = pubmeddf("HIV", "10000", 1996, 2000)
pmdf[2001] = pubmeddf("HIV", "10000", 2001, 2006)
pmdf[2006] = pubmeddf("HIV", "10000", 2006, 2011)
pmdf[2011] = pubmeddf("HIV", "10000", 2011, 2015)  # start year corrected from 2001 to match the interval

In [None]:
print(json.dumps(papers[20], indent=2, separators=(',', ':'))) #sample output from initial search

Printing the shape of each dataframe, we notice that some articles were lost in the calls. This was due to either a missing PMID or a missing article title.

In [None]:
for k in range(1986,2016,5):
    print pmdf[k].shape

`(9999, 6)
(9999, 6)
(9997, 6)
(10000, 6)
(9996, 6)
(9999, 6)`

Next we subset the data to include only articles published in their respective interval, and perform general housekeeping. Initial looks at the data revealed that the search returned some dates outside of the range of interest (e.g. the dataset 1986-1990 might have a few articles with dates in the late 90s), so we removed them before analysis. 

In [None]:
pmdf_subset = {}
for year in range(1986,2016,5):
    pmdf[year]['date'] = pmdf[year][['Year','Month','Day']].apply(lambda x: '-'.join(x), axis=1)
    pmdf[year]['date'] = pd.to_datetime(pmdf[year]['date'])
    pmdf[year]['Year'] = pmdf[year]['Year'].astype(int)
    pmdf_subset[year] = pmdf[year][pmdf[year]['Year']<year+5]
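The subsetting above filters only on the upper bound of each interval. As a sketch of a stricter variant (an assumption, not what the notebook ran), the example below assembles the date from its three string columns and keeps rows on both sides of the interval. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical PubMed-style date parts stored as strings.
toy = pd.DataFrame({
    "Year": ["1986", "1990", "1997"],
    "Month": ["01", "12", "05"],
    "Day": ["15", "31", "02"],
})

# pd.to_datetime can assemble dates from columns named year, month and day.
parts = toy[["Year", "Month", "Day"]].astype(int)
parts.columns = ["year", "month", "day"]
toy["date"] = pd.to_datetime(parts)

# Keep only rows inside the interval [start, start + 5).
start = 1986
in_range = toy[(toy["date"].dt.year >= start) & (toy["date"].dt.year < start + 5)]
```

In this toy data the 1997 row is dropped, which mirrors the late-90s dates that appeared in the 1986-1990 results.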

Here we generate frequencies of articles for every month in the data, and add these frequencies as a new column at the end of the subset dataframe (from above). Next, we create an additional column that scales the frequencies down to the LDA probabilities in order to compare trends on a single graph. Note, this scaling was done approximately and does not represent any quantity of interest. Further discussion is made in graph interpretation below.

In [None]:
for year in range(1986,2016,5):
    pmdf_subset[year]['counts'] = pmdf_subset[year].groupby(['Year','Month'])['Day'].transform('count')
    pmdf_subset[year]['scaled'] = pmdf_subset[year]['counts']/200
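The `transform('count')` call above returns one value per article rather than one value per month, which is why it can be assigned directly as a column. A minimal sketch on made-up rows (the frame `toy` is illustrative):

```python
import pandas as pd

# Hypothetical articles: two in Jan 1986, one in Feb 1986, one in Jan 1987.
toy = pd.DataFrame({
    "Year": [1986, 1986, 1986, 1987],
    "Month": ["01", "01", "02", "01"],
    "Day": ["03", "17", "09", "21"],
})

# transform('count') returns one value per original row: the size of that
# row's (Year, Month) group, so it can be stored as a new column directly.
toy["counts"] = toy.groupby(["Year", "Month"])["Day"].transform("count")
toy["scaled"] = toy["counts"] / 200.0
```

Since every article carries its month's count, a later quarterly mean over rows weights busy months more heavily than quiet ones.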

We reassess the shape of our datasets. Loss of articles was minimal in the subsetting process, though the 2006-2010 dataset shrank considerably. One contributor is visible in the code above: the query for this interval ran through 2011, while the subset keeps only years before 2011, so articles dated 2011 are dropped. Though further investigation into the Entrez database search process is needed, it is also possible that publication dates were not provided for some of these articles, so `DateCreated` was used instead, which may fall outside of the search bounds. Nevertheless, as our LDA data was restricted to 1986-2001, this curiosity could be ignored for analysis.

In [None]:
for k in range(1986,2016,5):
    print pmdf_subset[k].shape

`(8720, 9)
(9142, 9)
(9367, 9)
(7932, 9)
(3428, 9)
(9999, 9)`

In [None]:
pmdf_subset[1986].dtypes

`Abstract            object
Day                 object
Month               object
PMID                 int64
Title               object
Year                 int64
date        datetime64[ns]
counts               int64
scaled             float64
dtype: object`

The following code plots our LDA topic probabilities against PubMed article frequencies. The Y axis is topic probability for New York Times articles averaged over yearly quarters (i.e. 3 months), and is scaled article frequency for PubMed articles also averaged over yearly quarters. We concatenated data from different time intervals/dataframes as suggested previously. As such, the data are comparable by trend and "spikes" only (thus the removal of the Y axis labels). 

There are a few notable insights from this plot. First and foremost, we see that the NYT reported on AIDS frequently and early in the *AIDS* crisis, while it took researchers more time to publish in peer-reviewed journals. Next, we see that trends between the two publishers were fairly parallel during the 90s, as *AIDS* remained prevalent in the public and academic mindsets. Third, we see a spike in NYT reporting on AIDS in the early 2000s, most likely reflecting the AIDS crisis developing in Southern Africa. This spike does not seem to alter publication frequency, however, which continues at its existing pace. Perhaps this is due to research being reliant on grants, such that funding was being directed to issues more prevalent in researchers' home countries (e.g. the war on terror, Iraq). Further investigation into this topic is necessary, but our exploratory analyses may provide new hypotheses. 

In [None]:
cont86 = comb[1986].set_index('date').resample('Q')['Topic5']*65
cont91 = comb[1991].set_index('date').resample('Q')['Topic55']*80
work = cont86.combine_first(cont91)
cont96 = comb[1996].set_index('date').resample('Q')['Topic47']*60
work = work.combine_first(cont96)
cont01 = comb[2001].set_index('date').resample('Q')['Topic3']*40
work = work.combine_first(cont01)

pub86 = pmdf_subset[1986].set_index('date').resample('Q')['scaled']
pub91 = pmdf_subset[1991].set_index('date').resample('Q')['scaled']
pubwork = pub86.combine_first(pub91)
pub96 = pmdf_subset[1996].set_index('date').resample('Q')['scaled']
pubwork = pubwork.combine_first(pub96)
pub01 = pmdf_subset[2001].set_index('date').resample('Q')['scaled']
pubwork = pubwork.combine_first(pub01)

with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(20,6))
    nyt = work.plot(label='NYT LDA Topic Model Probability')
    pm = pubwork.plot(label='PubMed Publication Frequency')
    nyt.legend(loc=2,prop={'size':18})
    nyt.set_title("AIDs - LDA Topic Modeling and PubMed Article Publication Frequency", fontdict=font3)
    nyt.tick_params(axis='both', which='major', labelsize=15) # increase xlabel fontsize
    nyt.axes.get_yaxis().set_ticks([]) # remove yticks
    nyt.set_xlabel("") # remove xlabel
    nyt.grid(False) # remove grid

![](http://imgur.com/5Le87j4.png)

Subplots for each of the four 5-year datasets were also created for a more focused comparison, using the methods described previously. 

In [None]:
fig = plt.figure()

ax1 = fig.add_subplot(221)
(comb[1986].set_index('date').resample('Q')['Topic5']*65).plot(ax=ax1)
(pmdf_subset[1986].set_index('date').resample('Q')['scaled']).plot(ax=ax1)
ax1.set_title("1986-1991", fontdict=font4)
ax1.axes.get_yaxis().set_ticks([])
ax1.set_xlabel("")

ax2 = fig.add_subplot(222)
(comb[1991].set_index('date').resample('Q')['Topic55']*80).plot(ax=ax2)
(pmdf_subset[1991].set_index('date').resample('Q')['scaled']).plot(ax=ax2)
ax2.set_title("1991-1996", fontdict=font4)
ax2.axes.get_yaxis().set_ticks([])
ax2.set_xlabel("")

ax3 = fig.add_subplot(223)
(comb[1996].set_index('date').resample('Q')['Topic47']*60).plot(ax=ax3)
(pmdf_subset[1996].set_index('date').resample('Q')['scaled']).plot(ax=ax3)
ax3.set_title("1996-2001", fontdict=font4)
ax3.axes.get_yaxis().set_ticks([])
ax3.set_xlabel("")

ax4 = fig.add_subplot(224)
(comb[2001].set_index('date').resample('Q')['Topic3']*40).plot(ax=ax4)
(pmdf_subset[2001].set_index('date').resample('Q')['scaled']).plot(ax=ax4)
ax4.set_title("2001-2006",fontdict=font4)
ax4.axes.get_yaxis().set_ticks([])
ax4.set_xlabel("")

#plt.suptitle('AIDs - LDA Topic Modeling and PubMed Article Publication Frequency', fontdict=font4)
plt.tight_layout()

![](http://imgur.com/tJNkFXD.png)

### Consistency Plots

We also attempted to plot topics which were consistent across time frames, though comparison was difficult, as oftentimes one or two time frames did not identify the topic during LDA. For example, women's health was a consistent topic in many time frames (e.g. abortion in the 70s, breast cancer in recent years), though it was less pervasive in the 80s (compared to other topics identified by LDA). In addition, as we used a different number of topics for different time frames, topic probabilities had to be weighted in order to show trends across all time on a single plot. Specifically, we multiplied the probabilities by the number of topics, giving greater weight to topics identified among many topics. For example, women's health in 1991 contained specific, relatable attributes that informed our labeling, while women's health in 2011 appealed to a broader definition. Due to time constraints on the project, we did not investigate more robust weighting techniques than the one demonstrated here. 
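One way to read this weighting: in a model with K topics, a topic with an average share of each document has probability of roughly 1/K, so multiplying a mean probability by K expresses it as a ratio to that uniform share. The sketch below works through the arithmetic with made-up numbers; `frames` and `weighted` are illustrative names, not values from our data.

```python
# Hypothetical mean probabilities for a "same" topic in two time frames that
# were modeled with different numbers of topics.
frames = {
    1991: {"n_topics": 80, "mean_prob": 0.015},
    2011: {"n_topics": 40, "mean_prob": 0.030},
}

weighted = {}
for year, info in frames.items():
    # A topic with an average share would have probability 1 / n_topics,
    # so mean_prob * n_topics is the ratio to that uniform share.
    weighted[year] = info["mean_prob"] * info["n_topics"]
```

In this toy case the raw probabilities differ by a factor of two, yet both weighted values are the same, because each topic sits equally far above its own model's uniform share.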

Code for all plots can be found here: https://github.com/dyan1211/teamsignificant/blob/master/AIDS_vis.ipynb

In [None]:
#example code
# .mean() makes the quarterly averaging explicit (needed in pandas >= 0.18);
# each series is scaled by the number of topics used for its time frame
cont66 = comb[1966].set_index('date').resample('Q')['Topic13'].mean()*50
cont71 = comb[1971].set_index('date').resample('Q')['Topic28'].mean()*50
work = cont66.combine_first(cont71)
cont76 = comb[1976].set_index('date').resample('Q')['Topic31'].mean()*65
work = work.combine_first(cont76)

cont91 = comb[1991].set_index('date').resample('Q')['Topic4'].mean()*80
cont96 = comb[1996].set_index('date').resample('Q')['Topic50'].mean()*60
work2 = cont91.combine_first(cont96)

cont11 = comb[2011].set_index('date').resample('Q')['Topic11'].mean()*40

with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(20,6))
    work2.plot(color='cornflowerblue')
    cont11.plot(color='cornflowerblue')
    nyt = work.plot(color='cornflowerblue', label='NYT LDA Topic Model Probability')
    nyt.set_title("Women's Health", fontdict=font3)
    nyt.tick_params(axis='both', which='major', labelsize=15) # increase xlabel fontsize
    nyt.set_xlim(pd.to_datetime("1966-1-1"),pd.to_datetime("2016-1-1"))
    nyt.axes.get_yaxis().set_ticks([]) # remove yticks
    nyt.set_xlabel("") # remove xlabel
    nyt.grid(False) # remove grid

![](http://i.imgur.com/TIMAXKf.png)
![](http://i.imgur.com/FGHDe8z.png)
![](http://i.imgur.com/CaP9lgm.png)

#### Topic fonts

We explored alternative techniques for visualizing major health topics in the following diagrams. As before, we selected topics of interest based on how clearly they could be labeled and whether they fell under the broad umbrella of public health. For each time frame's dataframe, we calculated the mean of each selected topic (i.e., the topic probability from LDA averaged over all documents, weighted by the number of topics generated for that dataframe) and used that mean to set the font size of the topic's label. The number of topics shown for each time frame was not fixed; it varied with how many topics of interest we could find. In addition, topics were colored by general category to give a sense of how topics changed over time.  See below for examples.
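
The font-size calculation described above can be sketched as a small helper. The function name `topic_font_table`, its arguments, and the toy numbers are hypothetical and stand in for the project's `comb` dataframes; `scale` plays the role of the constant multiplier applied when the labels are drawn.

```python
import pandas as pd

def topic_font_table(df, labels, n_topics, scale=30):
    """Return one row per labeled topic with its weighted mean and font size.

    `labels` maps a display label to a topic column in `df` (hypothetical helper).
    """
    # Mean topic probability over all documents, weighted by the number of topics.
    rows = [(label, df[col].mean() * n_topics) for label, col in labels.items()]
    table = pd.DataFrame(rows, columns=['Topics', 'Means'])
    table['fontsize'] = table['Means'] * scale
    # Largest topics first, so they are drawn at the top of the panel.
    return table.sort_values('Means', ascending=False).reset_index(drop=True)

# Toy document-topic probabilities (made up for illustration).
docs = pd.DataFrame({'Topic5': [0.02, 0.04], 'Topic49': [0.01, 0.01]})
table = topic_font_table(docs, {'Heart Disease': 'Topic5', 'Smoking': 'Topic49'}, n_topics=50)
print(table)
```

Building the table from a label-to-column mapping keeps the labeling decisions in one place and leaves the input dataframe unmodified, so the cell can be re-run safely.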

In [None]:
# creating dataframes for topic-mean combinations
topicdict = {'Birth Control': np.mean(comb[1966]['Topic13']),
            'Hospitals': np.mean(comb[1966]['Topic30']),
            'Pollution': np.mean(comb[1966]['Topic41']),
            'Smoking': np.mean(comb[1966]['Topic49']),
            'FDA': np.mean(comb[1966]['Topic33']),
            'Medicare': np.mean(comb[1966]['Topic24']),
            'Health Research': np.mean(comb[1966]['Topic19']),
            'Heart Disease': np.mean(comb[1966]['Topic5']),
            'Community Health': np.mean(comb[1966]['Topic27']),
            'Typhoid': np.mean(comb[1966]['Topic14'])
            }
topicdict1 = {'Patients': np.mean(comb[1971]['Topic0']),
              'Heart Disease': np.mean(comb[1971]['Topic3']),
              'Family Health': np.mean(comb[1971]['Topic5']),
              'FDA': np.mean(comb[1971]['Topic7']),
              'Cancer': np.mean(comb[1971]['Topic8']),
              'Pollution': np.mean(comb[1971]['Topic20']),
              'Disasters': np.mean(comb[1971]['Topic23']),
              'Drug Abuse': np.mean(comb[1971]['Topic25']),
              "Women's Health": np.mean(comb[1971]['Topic28']),
              'Hospitals': np.mean(comb[1971]['Topic37']),
              'Epidemics': np.mean(comb[1971]['Topic48'])
             }
topicdict2 = {'Child Health': np.mean(comb[1976]['Topic10']),
              'Health Risks': np.mean(comb[1976]['Topic11']),
              'Energy': np.mean(comb[1976]['Topic17']),
              'International Disease': np.mean(comb[1976]['Topic19']),
              'Pollution': np.mean(comb[1976]['Topic20']),
              'Environment': np.mean(comb[1976]['Topic25']),
              "Abortion": np.mean(comb[1976]['Topic31']),
              'Cancer': np.mean(comb[1976]['Topic38']),
              'Doctors': np.mean(comb[1976]['Topic55']),
              'Hospitals': np.mean(comb[1976]['Topic58']),
              'Heart Disease': np.mean(comb[1976]['Topic63'])
             }
topicdict3 = {'Health Research': np.mean(comb[1981]['Topic4']),
              'Chemical Plant': np.mean(comb[1981]['Topic16']),
              'Chronic Disease': np.mean(comb[1981]['Topic22']),
              'Sports Injuries': np.mean(comb[1981]['Topic8'])
              #'General Interest': np.mean(comb[1981]['Topic21'])
             }
topicdict4 = {'Hospitals': np.mean(comb[1986]['Topic42']),
              'AIDs': np.mean(comb[1986]['Topic5']),
              'FDA': np.mean(comb[1986]['Topic9']),
              'Women': np.mean(comb[1986]['Topic10']),
              #'Dangers': np.mean(comb[1986]['Topic36']),
              'Environment': np.mean(comb[1986]['Topic17']),
              'Legislation': np.mean(comb[1986]['Topic24']),
              'Cancer': np.mean(comb[1986]['Topic27']),
              'Healthcare': np.mean(comb[1986]['Topic45']),
              'Homeless': np.mean(comb[1986]['Topic48']),
              'Childcare': np.mean(comb[1986]['Topic58']),
              'Job Safety': np.mean(comb[1986]['Topic61'])
              }
topicdict5 = {'CDC': np.mean(comb[1991]['Topic3']),
              'Abortion': np.mean(comb[1991]['Topic4']),
              'Sports': np.mean(comb[1991]['Topic5']),
              'Hospitals': np.mean(comb[1991]['Topic6']),
              'Pollution': np.mean(comb[1991]['Topic13']),
              'Children': np.mean(comb[1991]['Topic30']),
              "Women's Health": np.mean(comb[1991]['Topic60']),
              'AIDs': np.mean(comb[1991]['Topic55']),
              'FDA': np.mean(comb[1991]['Topic63']),
              'Cigarettes': np.mean(comb[1991]['Topic66']),
              'Medical Research': np.mean(comb[1991]['Topic74']),
              'Vaccines': np.mean(comb[1991]['Topic75']),
              'Patients': np.mean(comb[1991]['Topic76']),
              'Health Insurance': np.mean(comb[1991]['Topic68'])
              }
topicdict6 = {'Outbreak': np.mean(comb[1996]['Topic1']),
              'Hospitals': np.mean(comb[1996]['Topic7']),
              'Heart Research': np.mean(comb[1996]['Topic12']),
              'Cigarettes': np.mean(comb[1996]['Topic13']),
              'Family': np.mean(comb[1996]['Topic19']),
              'Health Insurance': np.mean(comb[1996]['Topic20']),
              "Poverty": np.mean(comb[1996]['Topic22']),
              'Prescription Drugs': np.mean(comb[1996]['Topic32']),
              'Chemical Plant': np.mean(comb[1996]['Topic39']),
              'Food': np.mean(comb[1996]['Topic42']),
              'Breast Cancer': np.mean(comb[1996]['Topic51']),
              'Children': np.mean(comb[1996]['Topic53'])
              }
topicdict7 = {'Research': np.mean(comb[2001]['Topic1']),
              'Anthrax': np.mean(comb[2001]['Topic2']),
              'Epidemics': np.mean(comb[2001]['Topic3']),
              'Sports Injuries': np.mean(comb[2001]['Topic7']),
              'FDA': np.mean(comb[2001]['Topic12']),
              'Hospitals': np.mean(comb[2001]['Topic15']),
              "Health Insurance": np.mean(comb[2001]['Topic19']),
              'Death': np.mean(comb[2001]['Topic28']),
              'Pollution': np.mean(comb[2001]['Topic31']),
              'Family': np.mean(comb[2001]['Topic34']),
              'Chronic Disease': np.mean(comb[2001]['Topic39'])
             }
topicdict8 = {'Gas': np.mean(comb[2006]['Topic6']),
              'Fitness': np.mean(comb[2006]['Topic9']),
              'FDA': np.mean(comb[2006]['Topic10']),
              'Hospitals': np.mean(comb[2006]['Topic11']),
              'Safety': np.mean(comb[2006]['Topic16']),
              'ACA': np.mean(comb[2006]['Topic20']),
              "Swine Flu": np.mean(comb[2006]['Topic34']),
              'Health Insurance': np.mean(comb[2006]['Topic37']),
              'Hurricane Katrina': np.mean(comb[2006]['Topic38']),
              'Doctors': np.mean(comb[2006]['Topic41']),
              'Prescription Drugs': np.mean(comb[2006]['Topic40']),
              'Children': np.mean(comb[2006]['Topic42'])
              }
topicdict9 = {'Bird Flu': np.mean(comb[2011]['Topic0']),
              'Ebola': np.mean(comb[2011]['Topic2']),
              #'Workers': np.mean(comb[2011]['Topic6']),
              'Safety': np.mean(comb[2011]['Topic7']),
              'Breast Cancer': np.mean(comb[2011]['Topic11']),
              'Abortion': np.mean(comb[2011]['Topic13']),
              "Health Insurance": np.mean(comb[2011]['Topic21']),
              'Outbreak': np.mean(comb[2011]['Topic26']),
              'Terrorist': np.mean(comb[2011]['Topic31']),
              'Researchers': np.mean(comb[2011]['Topic32'])
              #'Jobs': np.mean(comb[2011]['Topic18'])
             }
# list(...) keeps the constructor working on Python 3, where dict.items() is a view
td0 = pd.DataFrame(list(topicdict.items()), columns=['Topics', 'Means'])
td1 = pd.DataFrame(list(topicdict1.items()), columns=['Topics', 'Means'])
td2 = pd.DataFrame(list(topicdict2.items()), columns=['Topics', 'Means'])
td3 = pd.DataFrame(list(topicdict3.items()), columns=['Topics', 'Means'])
td4 = pd.DataFrame(list(topicdict4.items()), columns=['Topics', 'Means'])
td5 = pd.DataFrame(list(topicdict5.items()), columns=['Topics', 'Means'])
td6 = pd.DataFrame(list(topicdict6.items()), columns=['Topics', 'Means'])
td7 = pd.DataFrame(list(topicdict7.items()), columns=['Topics', 'Means'])
td8 = pd.DataFrame(list(topicdict8.items()), columns=['Topics', 'Means'])
td9 = pd.DataFrame(list(topicdict9.items()), columns=['Topics', 'Means'])

In [None]:
#weighting the means
tds = [td0,td1,td2,td3,td4,td5,td6,td7,td8,td9]
weights = [50,50,65,25,65,80,60,40,45,40]
tdss = []
for i, td in enumerate(tds):
    td.Means = td.Means*weights[i]
    # DataFrame.sort was removed from pandas; sort_values is its replacement
    tdss.append(td.sort_values('Means', ascending=False).reset_index())

In [None]:
# Adding colors by general topic category

tds[0]['col'] = ['brown', 'black', 'cyan', 'khaki', 'pink', 'black', 'khaki', 'green', 'chartreuse', 'red']
tds[1]['col'] = ['black', 'cyan', 'blue', 'khaki', 'purple', 'black', 'black', 'black', 'khaki', 'pink', 'chartreuse']
tds[2]['col'] = ['khaki', 'black', 'black', 'pink', 'blue', 'chartreuse', 'khaki', 'chartreuse', 'black', 'yellow', 'black']
tds[3]['col'] = ['khaki', 'black', 'chartreuse', 'khaki']
tds[4]['col'] = ['cyan', 'red', 'blue', 'black', 'black', 'pink', 'khaki', 'green', 'black', 'chartreuse']
tds[5]['col'] = ['green', 'blue', 'black', 'black', 'red', 'pink', 'black', 'purple', 'pink', 'brown', 'cyan', 'chartreuse', 'khaki', 'black']
tds[6]['col'] = ['black', 'blue', 'blue', 'green', 'brown', 'khaki', 'black', 'pink', 'cyan', 'chartreuse', 'purple', 'black']
tds[7]['col'] = ['purple', 'cyan', 'black', 'khaki', 'green', 'chartreuse', 'khaki', 'blue', 'black', 'red', 'black']
tds[8]['col'] = ['green', 'cyan', 'green', 'red', 'cyan', 'black','blue', 'black', 'black', 'black', 'red']
tds[9]['col'] = ['red', 'green', 'pink', 'khaki', 'pink', 'purple', 'black', 'red']
# tdss holds sorted copies of tds, so rebuild it here to carry the new 'col' column
tdss = [td.sort_values('Means', ascending=False).reset_index() for td in tds]

In [None]:
#list of years for reference in loop
years = range(1966,2016,5)

In [None]:
plt.figure(figsize=(35,10))
for i in range(5):
    td = tdss[i]
    plt.subplot(1,5,i+1)
    plt.ylim(0,13+0.5)
    plt.xticks([])
    plt.yticks([])
    plt.title(str(years[i])+'-'+str(years[i]+4))
    for index, row in td.iterrows():
        plt.text(0.3, 13-index-0.5, row.Topics, fontsize=row.Means*30, color=row.col)
plt.grid(False)
plt.show()

![](http://imgur.com/M60UpZX.png)

In [None]:
plt.figure(figsize=(35,10))
for i in range(5):
    td = tdss[i+5]
    plt.subplot(1,5,i+1)
    plt.ylim(0,15+0.5)
    plt.xticks([])
    plt.yticks([])
    plt.title(str(years[i+5])+'-'+str(years[i+5]+4))
    for index, row in td.iterrows():
        plt.text(0.3, 15-index-0.5, row.Topics, fontsize=row.Means*21, color=row.col)
plt.grid(False)
plt.show()

![](http://imgur.com/Yt6k0Cv.png)

In [None]:
# Color legend for topic categories

colors = ['green', 'blue', 'red', 'pink', 'purple', 'brown', 'cyan', 'chartreuse', 'yellow', 'black']
words = ['Health Insurance', 'Children/Family', 'Outbreak Events', "Women's Health", 'Outbreak (General)', 'Smoking', 'FDA/Drugs', 'Pollution', 'Chronic Disease/Research', 'Other']

plt.figure(figsize=(5,5))
plt.xticks([])
plt.yticks([])
plt.title('Legend')
for index in np.arange(0,10,1):
    plt.text(0.2, .8-index*.08, words[index], color=colors[index], fontsize=15)
plt.grid(False)
plt.show()

![](http://i.imgur.com/RcyYcsU.png)

From these topic lists we can see some patterns.  Smoking, for example, is prominent only in the late 1960s (when information about its hazards was being brought to public attention) and in the 1990s (when restrictions on smoking in public places were spreading across the US).  Women's Health topics are persistent over time, but the specific topic changes from birth control to abortion to breast cancer.  Every so often, outbreaks or epidemics enter the public consciousness.  

#### Conclusion

Overall, we were able to accomplish our goal of identifying and classifying popular health topics over the past 50 years.  We found some interesting specific events, and were able to match these events to Google Trends topics for Ebola, Swine Flu, and ACA and to the evolution of health literature in PubMed for AIDS.  Finally, we were able to look at how the popularity of major health topics has evolved over time by category.  We gained some interesting insights into the history (and present) of health in this country (and globally) while deepening our knowledge of the techniques and methods involved in topic modeling.  Though we mainly performed an exploratory analysis, our exploration may inform further hypotheses and inspire future research into this deep and wide-ranging topic!

## Presentation:

In addition to the results explained above, check out our [website](https://sites.google.com/site/cs109projectteamsignificant/home) and [video](https://youtu.be/JuxRVxczTrA). 