# Topic Modeling with Latent Dirichlet Allocation (LDA)

In this notebook, we cover latent Dirichlet allocation (LDA), a statistical method in natural language processing. The core idea of topic modeling with LDA is that each document is a mixture of topics, and each topic is in turn a distribution over words, with the highest-weighted words describing it best. The topics are unnamed, but it is often easy to see why the words in a topic belong together (though not always). The idea becomes much clearer in the visual at the end. Let's get to it.
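
To make the "mix of topics" idea concrete, here is a minimal sketch of the generative story LDA assumes, written with only the standard library. The two toy topics, their word probabilities, and the helper names (`sample_dirichlet`, `generate_document`) are made up for illustration and are not part of the analysis below.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical toy topics: each maps words to probabilities that sum to 1.
toy_topics = [
    {"health": 0.5, "insurance": 0.3, "coverage": 0.2},
    {"senate": 0.4, "repeal": 0.4, "republican": 0.2},
]

def generate_document(topics, n_words, alpha, rng):
    """Generate a toy document the way LDA assumes documents arise."""
    # 1. Draw this document's topic mixture.
    mixture = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word slot according to the mixture.
        topic = rng.choices(topics, weights=mixture, k=1)[0]
        # 3. Pick a word from that topic's word distribution.
        word = rng.choices(list(topic), weights=list(topic.values()), k=1)[0]
        words.append(word)
    return mixture, words

rng = random.Random(0)
mixture, words = generate_document(toy_topics, n_words=8, alpha=0.5, rng=rng)
print([round(p, 2) for p in mixture])
print(words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the topic word distributions that most plausibly produced them.
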
First, I am silencing a few warnings because I don't want big red text taking attention away from the outputs. Suppressing deprecation and future warnings does not change the results, so I will continue on.

In [19]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

In [20]:
#import packages used in the analysis
import spacy
import nltk
import random
from gensim import corpora
import pickle
import gensim
#newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

In the first few steps, we create the functions that our data will run through to produce the topic model and, finally, the visualization. Below we use spaCy, an industrial-grade NLP library, to write a function that parses the loaded text and splits it into tokens, which every later step depends on.

In [21]:
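#using spaCy's small English model to tokenize
#(install it once with: python -m spacy download en_core_web_sm)
parser = spacy.load('en_core_web_sm')
#lighter alternative that only tokenizes:
#from spacy.lang.en import English
#parser = English()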

#creating tokens
def tokenize(text):
    #listing them together
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('USER_NAME')
        else:
            #lowercasing tokens for consistency
            lda_tokens.append(token.lower_)
    return lda_tokens

From NLTK (the Natural Language Toolkit), I use WordNet to find the lemma of each word in our documents. In case you are unsure what lemmatizing is, it is similar to stemming, but instead of chopping off suffixes it maps a word to its dictionary form, which can depend on the word's part of speech. For example, the verb 'meeting' reduces to 'meet', while the noun 'meeting' keeps its whole form. Note that the function below calls `morphy` without a part of speech, so it does not use the surrounding context and returns whatever dictionary form WordNet finds for the bare word.

In [22]:
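from nltk.corpus import wordnet
#WordNet data must be available locally; run nltk.download('wordnet') once if it is not
def get_lem(word):
    #morphy returns None when WordNet has no entry for the word
    lem = wordnet.morphy(word)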
    if lem is None:
        return word
    else:
        return lem

This is the final cleaning step: one function that tokenizes the text, drops short tokens (four characters or fewer), removes stop-words, and lemmatizes what remains.

In [23]:
#using english stop-words from nltk
nltk.download('stopwords')
eng_stop = set(nltk.corpus.stopwords.words('english'))

#function to tokenize + clean
def prepare_data_lda(text):
    #tokenize
    tokens = tokenize(text)
    #keep only tokens longer than 4 characters
    tokens = [token for token in tokens if len(token) > 4]
    #remove stop-words
    tokens = [token for token in tokens if token not in eng_stop]
    #lemmatize
    tokens = [get_lem(token) for token in tokens]
    return tokens

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Eshita/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


For this analysis, I am using a dataset of articles that mention the Affordable Care Act, collected by a colleague. There are quite a few lines in this CSV (34,000+), so processing all of them takes a while. I therefore randomly sample about 20% of the lines to prepare the data with, but if your dataset is small enough, it should be fine to load all of it. The result is a list of cleaned token lists, one per sampled line.

In [None]:
text_data = []
#load in csv--this one using the articles2100 data
with open('articles2100.csv', errors='ignore', encoding='utf8') as f:
    for line in f:
        #keeps roughly 20% of the lines; change the condition to `if True:` to get tokens for all documents (more accurate but slower)
        if random.random() > .80:
            tokens = prepare_data_lda(line)
            #print(tokens)
            text_data.append(tokens)
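
Because the cell above draws from the global random generator, a different subset of lines is kept on every run. If you want a repeatable sample, one option is a seeded generator, sketched below. The helper `sample_lines`, the seed value, and the toy lines are assumptions for illustration, not part of the original analysis.

```python
import random

def sample_lines(lines, keep_fraction, seed):
    """Keep each line independently with probability keep_fraction."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [line for line in lines if rng.random() < keep_fraction]

# Hypothetical stand-in for the lines of the CSV file.
toy_lines = [f"article {i}" for i in range(1000)]
sample = sample_lines(toy_lines, keep_fraction=0.20, seed=42)
print(len(sample))  # roughly 200; the exact count depends on the seed
```

Calling `sample_lines` again with the same seed returns the same lines, which makes the topics easier to compare between runs.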

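We use gensim to build a dictionary that maps each unique token to an integer id. Then `doc2bow` converts each list of tokens into bag-of-words format, a list of (token id, count) pairs, which is the last step of processing our data. We also save the corpus and dictionary to disk so they can be reloaded later.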

In [24]:
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
#use a context manager so the file is closed after writing
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')
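
If the bag-of-words format is unfamiliar, here is a rough standard-library sketch of what a dictionary plus `doc2bow` produce. It is a simplified stand-in, not gensim's implementation: the helpers `build_vocab` and `to_bow` are hypothetical, and the ids gensim assigns may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

toy_docs = [
    ["health", "insurance", "health"],
    ["senate", "repeal", "health"],
]
vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)       # {'health': 0, 'insurance': 1, 'senate': 2, 'repeal': 3}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Word order is thrown away in this format; only which words occur and how often is kept, which is all LDA uses.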

We train and save our LDA model with 5 topics, then print the three highest-weighted words for each topic. The number in front of each word is its weight within that topic.

In [25]:
#how many topics desired to be created from corpus
num_tops = 5
#create LDA model and save to call later
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_tops, id2word=dictionary, passes=7)
ldamodel.save('model.gensim')
#three highest-weighted words for each of the five topics
topics = ldamodel.print_topics(num_words=3)
for topic in topics:
    print(topic)

(0, '0.013*"state" + 0.013*"texas" + 0.008*"think"')
(1, '0.030*"republican" + 0.025*"senate" + 0.019*"repeal"')
(2, '0.022*"trump" + 0.019*"house" + 0.017*"president"')
(3, '0.024*"state" + 0.020*"health" + 0.015*"medicaid"')
(4, '0.033*"health" + 0.028*"insurance" + 0.027*"people"')
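
Each printed topic is a string of `weight*"word"` terms, which is awkward to work with in code. As a small sketch, the hypothetical helper `parse_topic` below pulls the pairs out with a regular expression, using one of the topic tuples printed above as input.

```python
import re

def parse_topic(topic_string):
    """Split a string such as '0.013*"state" + 0.013*"texas"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# One of the topic tuples printed above.
topic_id, topic_string = (1, '0.030*"republican" + 0.025*"senate" + 0.019*"repeal"')
print(topic_id, parse_topic(topic_string))
# 1 [('republican', 0.03), ('senate', 0.025), ('repeal', 0.019)]
```

If you would rather skip the string parsing, gensim's `show_topics(formatted=False)` should return the word and weight pairs directly.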


In [26]:
#load in dictionary and model
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model.gensim')

In [27]:
#pyLDAvis visual
lda_vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=True)
pyLDAvis.display(lda_vis)

In the visual, saliency measures how informative a word is about the topics overall (frequent words that are concentrated in a few topics score highest), and the intertopic distance map shows how closely the topics are related. Note that pyLDAvis numbers topics from 1 and, with `sort_topics=True`, orders them by size, so its topic numbers do not match the 0-4 ids printed by gensim above. We see that Topic 1 includes words like health, insurance, coverage, and obamacare, while Topic 2 contains words like republican, obamacare again, repeal, senate, trump, and democrat. Topic 1 seems to cover the healthcare conversation directly, while Topic 2 seems to cover healthcare from a political party standpoint. You can play around with the number of topics and see how the topics make up your input documents.
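
For the curious, saliency can be computed by hand. The sketch below follows the definition I understand pyLDAvis to use (from Chuang et al.): the overall frequency of a term multiplied by its distinctiveness, which is how far the topic distribution given that term departs from the overall topic distribution. The function name and all numbers are hypothetical toy values, not taken from the model above.

```python
import math

def saliency(term_freq, topic_given_term, topic_prior):
    """Saliency = P(w) * sum_t P(t|w) * log(P(t|w) / P(t))."""
    distinctiveness = sum(
        p_tw * math.log(p_tw / p_t)
        for p_tw, p_t in zip(topic_given_term, topic_prior)
        if p_tw > 0
    )
    return term_freq * distinctiveness

# Hypothetical numbers for two topics of equal size.
topic_prior = [0.5, 0.5]
# A term spread evenly over both topics says nothing about which topic it came from.
even = saliency(0.02, [0.5, 0.5], topic_prior)
# An equally frequent term concentrated in one topic is far more informative.
concentrated = saliency(0.02, [0.95, 0.05], topic_prior)
print(even, concentrated)
```

The evenly spread term gets a saliency of zero, while the concentrated one scores higher even though both occur equally often.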

There you have it, we are done with LDA! I find this to be a pretty cool tool within NLP and the visual is user-friendly and easy to navigate.