A few days ago, we implemented topic modeling on company reviews from Indeed using the LDA (Latent Dirichlet Allocation) algorithm. 

Today, we will try to understand exactly how LDA works, using an example.

In [1]:
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from stop_words import get_stop_words

Our goal is to build a topic model around a small set of documents using LDA and, along the way, to understand the underpinnings of the algorithm. 

Let's say we have the following 5 sentences (copied from __[KDnuggets](http://www.kdnuggets.com/2016/07/text-mining-101-topic-modeling.html)__):

- Document 1: I had a peanut butter sandwich for breakfast.
- Document 2: I like to eat almonds, peanuts and walnuts.
- Document 3: My neighbor got a little dog yesterday.
- Document 4: Cats and dogs are mortal enemies.
- Document 5: You must not feed peanuts to your dog.

I have stored these sentences in a list called 'documents', as shown below. 

In [2]:
documents = ['I had a peanut butter sandwich for breakfast', 
             'I like to eat almonds, peanuts and walnuts',
             'My neighbor got a little dog yesterday', 
             'Cats and dogs are mortal enemies', 
             'You must not feed peanuts to your dog']

As mentioned in the post on __[step-by-step LDA](https://dsprojectsblog.wordpress.com/2017/08/21/step-by-step-guide-to-build-your-first-topic-model-using-lda/)__, we first need to preprocess our data. At the very minimum, we should tokenize and remove stop words from our documents. 

In [3]:
#lowercase all documents
lc_doc = [doc.lower() for doc in documents]

lc_doc

['i had a peanut butter sandwich for breakfast',
 'i like to eat almonds, peanuts and walnuts',
 'my neighbor got a little dog yesterday',
 'cats and dogs are mortal enemies',
 'you must not feed peanuts to your dog']

In [4]:
#tokenize documents
tokenizer = RegexpTokenizer(r'\w+')

tokenized_docs = [tokenizer.tokenize(doc) for doc in lc_doc]
tokenized_docs

[['i', 'had', 'a', 'peanut', 'butter', 'sandwich', 'for', 'breakfast'],
 ['i', 'like', 'to', 'eat', 'almonds', 'peanuts', 'and', 'walnuts'],
 ['my', 'neighbor', 'got', 'a', 'little', 'dog', 'yesterday'],
 ['cats', 'and', 'dogs', 'are', 'mortal', 'enemies'],
 ['you', 'must', 'not', 'feed', 'peanuts', 'to', 'your', 'dog']]

In [5]:
#remove stop words
nltk_stop_wds = stopwords.words('english')
get_stop_wds = get_stop_words('en')
all_stop_words = list(set(nltk_stop_wds + get_stop_wds))
all_stop_words += ['.', '...', ',', '(', ')', ':', '`', '``', ';']
all_stop_words += ["'s", "n't"]

doc_wo_stopwords = [[token for token in doc if token not in all_stop_words] for doc in tokenized_docs]
doc_wo_stopwords

[['peanut', 'butter', 'sandwich', 'breakfast'],
 ['like', 'eat', 'almonds', 'peanuts', 'walnuts'],
 ['neighbor', 'got', 'little', 'dog', 'yesterday'],
 ['cats', 'dogs', 'mortal', 'enemies'],
 ['must', 'feed', 'peanuts', 'dog']]

Now that we have a preprocessed list of documents, we can use this example to understand how topic modeling is performed. 

**Step 1: We choose the number of topics beforehand**

If you already have the domain knowledge, you may know the number of possible topics beforehand. 

Otherwise, the value can be chosen by trial and error or taken from previous estimates.

Glancing over our example, we can say that we have roughly two topics: food and pets.

**Step 2: Randomly assign a topic to every word in each document**

Let's say the following was the result of the initial random assignment of topics. 

In [6]:
pd.read_clipboard()

| | Document 1 | Randomly assigned topic | Document 2 | Randomly assigned topic | Document 3 | Randomly assigned topic | Document 4 | Randomly assigned topic | Document 5 | Randomly assigned topic |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | peanut | pets | like | food | neighbor | food | cats | pets | must | food |
| 1 | butter | food | eat | food | got | food | dogs | food | feed | pets |
| 2 | sandwich | food | almonds | food | little | pets | mortal | pets | peanuts | food |
| 3 | breakfast | pets | peanuts | food | dog | food | enemies | pets | dog | pets |
| 4 | | | walnuts | food | yesterday | food | | | | |
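
As a rough sketch of how such a starting point could be produced in code, the snippet below pairs every preprocessed token with a topic drawn uniformly at random. The function name `random_assignment` and its `seed` parameter are hypothetical helpers introduced only for illustration, and because the draws are random the result will generally not match the table above.

```python
import random

# Preprocessed documents from the stop-word removal step above
docs = [
    ['peanut', 'butter', 'sandwich', 'breakfast'],
    ['like', 'eat', 'almonds', 'peanuts', 'walnuts'],
    ['neighbor', 'got', 'little', 'dog', 'yesterday'],
    ['cats', 'dogs', 'mortal', 'enemies'],
    ['must', 'feed', 'peanuts', 'dog'],
]
topics = ['food', 'pets']  # Step 1: topics chosen beforehand


def random_assignment(docs, topics, seed=0):
    """Pair every token with a topic drawn uniformly at random.

    Hypothetical helper; the seed only makes the sketch repeatable.
    """
    rng = random.Random(seed)
    return [[(word, rng.choice(topics)) for word in doc] for doc in docs]


assignments = random_assignment(docs, topics)
for i, doc in enumerate(assignments, start=1):
    print(f"Document {i}: {doc}")
```

Changing the seed gives a different starting point. That is fine, since the later steps are meant to improve on whatever random assignment we begin with.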


**Step 3: Iterate over every word in each document and update its topic**

This is done by computing two quantities for each word and then resampling its topic:
- First, we find how relevant the word is to each topic. Mathematically, this is p(word | topic), i.e., given that the topic is 'food', how likely is it to produce the word 'peanut'?

This is computed as the proportion of assignments to topic t, over all documents d, that come from word w. 

In our example, treating 'peanut' and 'peanuts' as the same word, it is assigned to 'pets' in Document 1 and to 'food' in Documents 2 and 5. Across all five documents, 14 words are currently assigned to 'food' and 8 to 'pets'. 

So, p(word = 'peanut' | topic = 'food') = 2 / 14 = 1 / 7
p(word = 'peanut' | topic = 'pets') = 1 / 8


- Next, we find how relevant each topic is to the document. Mathematically, this is p(topic | document), i.e., given Document 1, how likely is one of its words to belong to 'food' or to 'pets'? 

This is computed as the proportion of words in document d that are assigned to topic t. 

So, in our example, this probability will be computed as:

p(topic = 'food' | Document 1) = 2 / 4
p(topic = 'pets' | Document 1) = 2 / 4

- Then, we reassign word w a new topic t’, where each candidate topic t’ is chosen with probability proportional to
p(topic = t’ | document d) * p(word w | topic = t’)

So, in our example, the word 'peanut' in Document 1 is currently assigned to 'pets'. 

For the topic 'food', we will compute: 

p_food = p(topic = 'food' | Document 1) * p(word = 'peanut' | topic = 'food') = 2 / 4 * 1 / 7 = 1 / 14

On the other hand, for the current assignment of the word 'peanut', we have:

p_pets = p(topic = 'pets' | Document 1) * p(word = 'peanut' | topic = 'pets') = 2 / 4 * 1 / 8 = 1 / 16

So, as we can see, p_pets < p_food, which makes 'food' the more likely choice. After normalizing, 'peanut' in Document 1 moves to 'food' with probability 8 / 15 and stays with 'pets' with probability 7 / 15. Let's say the draw picks 'food', so we update the topic assignment for 'peanut' in Document 1 to 'food'.
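
The two quantities and their product can be checked with a few lines of Python. This is only a sketch under stated assumptions: the `assignments` list is typed in by hand from the random assignment table in Step 2, 'peanuts' is written as 'peanut' so that both forms count as one word, and the helper names `p_word_given_topic` and `p_topic_given_doc` are made up for illustration. `Fraction` is used so the results print as exact fractions.

```python
from fractions import Fraction

# (word, topic) pairs taken from the random assignment table in Step 2.
# Assumption: 'peanuts' is written as 'peanut' so both forms count as one word;
# the other tokens follow the preprocessed documents.
assignments = [
    [('peanut', 'pets'), ('butter', 'food'), ('sandwich', 'food'), ('breakfast', 'pets')],
    [('like', 'food'), ('eat', 'food'), ('almonds', 'food'), ('peanut', 'food'), ('walnuts', 'food')],
    [('neighbor', 'food'), ('got', 'food'), ('little', 'pets'), ('dog', 'food'), ('yesterday', 'food')],
    [('cats', 'pets'), ('dogs', 'food'), ('mortal', 'pets'), ('enemies', 'pets')],
    [('must', 'food'), ('feed', 'pets'), ('peanut', 'food'), ('dog', 'pets')],
]
topics = ['food', 'pets']


def p_word_given_topic(word, topic):
    """Share of all assignments to `topic` that come from `word`."""
    pairs = [pair for doc in assignments for pair in doc]
    topic_total = sum(1 for _, t in pairs if t == topic)
    return Fraction(pairs.count((word, topic)), topic_total)


def p_topic_given_doc(topic, doc):
    """Share of the words in `doc` that are assigned to `topic`."""
    return Fraction(sum(1 for _, t in doc if t == topic), len(doc))


doc1 = assignments[0]
# Unnormalized weight of each candidate topic for 'peanut' in Document 1
weights = {t: p_topic_given_doc(t, doc1) * p_word_given_topic('peanut', t) for t in topics}
# Normalize so the weights sum to 1 and can be read as reassignment probabilities
total = sum(weights.values())
probs = {t: w / total for t, w in weights.items()}

for t in topics:
    print(t, 'weight:', weights[t], 'probability:', probs[t])
```

Running it shows 'food' coming out ahead of 'pets' for this word, which is why the reassignment favors 'food'.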

Similarly, we iterate over every word in each document, updating its topic as we go, and cycle through the entire collection of documents multiple times. This iterative updating is the key feature of LDA that generates a final solution with coherent topics.
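
To make the repeated cycling concrete, here is a minimal sketch of the loop, assuming the simplified update from this post rather than a production implementation. The function `run_sweeps` and its parameters `n_sweeps` and `seed` are hypothetical names. Unlike a full collapsed Gibbs sampler, this version does not use smoothing priors and does not exclude a word's own current assignment from the counts.

```python
import random

docs = [
    ['peanut', 'butter', 'sandwich', 'breakfast'],
    ['like', 'eat', 'almonds', 'peanuts', 'walnuts'],
    ['neighbor', 'got', 'little', 'dog', 'yesterday'],
    ['cats', 'dogs', 'mortal', 'enemies'],
    ['must', 'feed', 'peanuts', 'dog'],
]
topics = ['food', 'pets']


def run_sweeps(docs, topics, n_sweeps=50, seed=0):
    """Hypothetical helper: random start, then repeated passes over every word."""
    rng = random.Random(seed)
    # Step 2: random initial topic for every token; z[d][i] is the topic of token i in doc d
    z = [[rng.choice(topics) for _ in doc] for doc in docs]
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                weights = []
                for t in topics:
                    # p(topic | document): share of this document's words assigned to t
                    p_t_d = z[d].count(t) / len(doc)
                    # p(word | topic): share of t's assignments that come from this word
                    n_t = sum(row.count(t) for row in z)
                    n_wt = sum(
                        1
                        for other_doc, row in zip(docs, z)
                        for w, assigned in zip(other_doc, row)
                        if w == word and assigned == t
                    )
                    p_w_t = n_wt / n_t if n_t else 0.0
                    weights.append(p_t_d * p_w_t)
                # Step 3: draw the new topic in proportion to the weights.
                # The current topic always has a positive weight, so the sum is never zero.
                z[d][i] = rng.choices(topics, weights=weights)[0]
    return z


final = run_sweeps(docs, topics)
for i, (doc, row) in enumerate(zip(docs, final), start=1):
    print(f"Document {i}: {list(zip(doc, row))}")
```

The counts are recomputed from scratch for every word to keep the sketch close to the description above. That is fine for five short documents, but a real implementation would maintain running count tables instead.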

And there you have it! This is how LDA works under the hood. 