# Part 3: LDA
This section focuses on Latent Dirichlet Allocation (LDA). LDA is a probabilistic topic model that assumes each document is a mixture of topics and that each word in a document is attributable to one of that document's topics. For our implementation of LDA, we use the Gensim package.
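To make the "mixture of topics" assumption concrete, here is a minimal sketch of the generative story that LDA assumes, written with the standard library only. The two toy topics, the `alpha` values, and the helper names `sample_dirichlet` and `generate_document` are made up for illustration; this is not how Gensim fits the model, only the process the model assumes produced the text.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate one toy document following LDA's generative story."""
    # 1. Draw this document's topic mixture.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = list(topic_word[topic])
        weights = list(topic_word[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Hypothetical topics, made up for illustration only.
toy_topics = [
    {"meet": 0.5, "offic": 0.3, "senat": 0.2},
    {"call": 0.6, "tomorrow": 0.3, "confirm": 0.1},
]
rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Fitting LDA goes in the opposite direction: given only the words, it estimates the per-document mixtures and the per-topic word distributions.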

In [1]:
# basic imports 
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import cm
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import gensim
from gensim import corpora

# filenames
clean_emails_filename = 'preprocessed_emails.csv'
# load the clean emails prepared in the first part
clean_emails = pd.read_csv(clean_emails_filename, index_col=0, header=0).dropna(how='all')

In [2]:
clean_emails.head()

Unnamed: 0,0
0,wow
1,2011 945 latest syria aid qaddafi sid hrc memo...
2,chri steven thx
3,cairo condemn final
4,11 2011 136 huma abedin latest syria aid qadda...


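Below are some Gensim-specific conversions: we tokenize the emails, build an id-to-term dictionary, filter out extreme words (see the inline comment), and convert each email into a bag-of-words vector.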

In [3]:
# tokenize the emails for building the dictionary
email_token = [nltk.word_tokenize(t) for t in clean_emails.iloc[:, 0]]
# turn our tokenized emails into an id <-> term dictionary
email_dictionary = corpora.Dictionary(email_token)
# remove extremes (similar to the min/max df step used when creating the tf-idf matrix):
# no_below=1 keeps words seen in a single email; no_above=0.8 drops words found in more than 80% of emails
email_dictionary.filter_extremes(no_below=1, no_above=0.8)
# convert tokenized emails into bag-of-words vectors (a sparse document-term matrix)
email_corpus = [email_dictionary.doc2bow(text) for text in email_token]
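As a rough illustration of what these conversions produce, the sketch below mimics the dictionary, filtering, and bag-of-words steps in plain Python on three toy documents. The helpers `build_vocabulary` and `to_bow` are hypothetical stand-ins written for this example, not Gensim functions, and the ids they assign will not match those Gensim assigns.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, no_below=1, no_above=0.8):
    """Map each kept token to an integer id, filtering by document frequency."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents each token appears in.
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    kept = sorted(
        tok for tok, df in doc_freq.items()
        if df >= no_below and df <= no_above * n_docs
    )
    return {tok: idx for idx, tok in enumerate(kept)}

def to_bow(doc, vocab):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Toy documents, made up for illustration only.
toy_docs = [
    ["call", "tomorrow", "call"],
    ["call", "meet", "senat"],
    ["call", "meet", "offic"],
]
toy_vocab = build_vocabulary(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)
print(toy_corpus)
```

In this toy corpus "call" appears in all three documents, which is above the 80% threshold, so it is dropped from the vocabulary and from every bag-of-words vector.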

The model itself is built by the following function. We use 10 passes over the corpus for better convergence, at the cost of a long run time on our machine (see the timings below).

In [18]:
def create_lda_model(num_of_topics):
    # train an LDA model on the email corpus with the requested number of topics
    return gensim.models.LdaModel(
        email_corpus, id2word=email_dictionary, num_topics=num_of_topics,
        update_every=5, chunksize=1000, passes=10)

In [19]:
# create an LDA model, starting with 5 topics
%time lda_model = create_lda_model(5)

CPU times: user 1min 20s, sys: 424 ms, total: 1min 20s
Wall time: 1min 21s


In [20]:
lda_model.show_topics()

[(0,
  '0.017*"offic" + 0.016*"secretari" + 0.013*"depart" + 0.011*"senat" + 0.009*"meet" + 0.009*"state" + 0.009*"republican" + 0.008*"room" + 0.008*"hous" + 0.007*"obama"'),
 (1,
  '0.006*"would" + 0.005*"state" + 0.005*"american" + 0.005*"one" + 0.005*"us" + 0.004*"nt" + 0.004*"new" + 0.004*"said" + 0.004*"obama" + 0.004*"presid"'),
 (2,
  '0.009*"state" + 0.006*"work" + 0.006*"us" + 0.004*"issu" + 0.004*"benghazi" + 0.004*"report" + 0.004*"depart" + 0.003*"case" + 0.003*"thank" + 0.003*"also"'),
 (3,
  '0.032*"call" + 0.011*"talk" + 0.009*"tomorrow" + 0.008*"ok" + 0.007*"ap" + 0.006*"today" + 0.006*"get" + 0.005*"confirm" + 0.005*"want" + 0.005*"updat"'),
 (4,
  '0.014*"2010" + 0.012*"fyi" + 0.012*"stategov" + 0.008*"see" + 0.007*"2009" + 0.007*"call" + 0.007*"cheryl" + 0.006*"nt" + 0.006*"14" + 0.006*"thank"')]
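Each topic in the output above is a formatted string of `weight*"term"` pairs. If the weights are needed as numbers, for example to plot them, a small parser such as the sketch below can split the string; `parse_topic` is a hypothetical helper written for this example. Gensim's `show_topic` method should return (term, weight) pairs directly, so parsing is only needed when all that is left is the printed string.

```python
import re

def parse_topic(topic_string):
    """Split a 'weight*"term" + ...' string into (term, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(term, float(weight)) for weight, term in pairs]

# Topic 3 from the output above, truncated to its first four terms.
topic_3 = '0.032*"call" + 0.011*"talk" + 0.009*"tomorrow" + 0.008*"ok"'
parsed = parse_topic(topic_3)
print(parsed)
```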

We use the pyLDAvis library to visualize the topics interactively.

In [21]:
# note: in newer pyLDAvis releases this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim

viz_data = pyLDAvis.gensim.prepare(lda_model, email_corpus, email_dictionary)
pyLDAvis.display(viz_data)

In [22]:
# create an LDA model with 10 topics
%time lda_model = create_lda_model(10)

CPU times: user 1min 18s, sys: 728 ms, total: 1min 19s
Wall time: 1min 20s


In [23]:
viz_data = pyLDAvis.gensim.prepare(lda_model, email_corpus, email_dictionary)
pyLDAvis.display(viz_data)

In [24]:
# create an LDA model with 50 topics
%time lda_model = create_lda_model(50)

CPU times: user 1min 47s, sys: 1.49 s, total: 1min 48s
Wall time: 1min 48s


In [25]:
viz_data = pyLDAvis.gensim.prepare(lda_model, email_corpus, email_dictionary)
pyLDAvis.display(viz_data)