# DSCI 575: Advanced Machine Learning 
## (with an undercurrent of Natural Language Processing (NLP) applications)


## Lecture 5: The Viterbi algorithm and topic modeling 

UBC Master of Data Science program, 2018-19

Instructor: Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]

In [6]:
import os.path
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt

import gensim 

from gensim import matutils, models
import string

%matplotlib inline

## Learning outcomes

From this lesson you will be able to

- explain the general idea and purpose of the Viterbi algorithm
- apply the Viterbi algorithm given an HMM
- explain the general idea of topic modeling
- explain the data generation process given a topic model 
- explain at a high-level how to go from raw data to topics 
- carry out topic modeling by training an [LDA model with gensim](https://radimrehurek.com/gensim/models/ldamodel.html)

### Three fundamental questions for an HMM

#### Likelihood
Given a model with parameters $\theta = <\pi, T, B>$, how do we efficiently compute the likelihood of a particular observation sequence $O$?
#### Decoding
Given an observation sequence $O$ and a model $\theta$, how do we choose a state sequence $Q = q_0, q_1, \dots, q_T$ that best explains the observation sequence?
#### Learning
Training: Given a large observation sequence $O$ how do we choose the best parameters $\theta$ that explain the data $O$? 

### Decoding

- Given an observation sequence $O$ and a model $\theta$, how do we choose a state sequence $Q = q_0, q_1, \dots, q_T$ that best explains the observation sequence?
- Purpose: finding what's most likely going on under the hood. 
- For example: It tells us the most likely part-of-speech tags given an English sentence.

<blockquote>
Will/MD the/DT chair/NN chair/VB the/DT meeting/NN from/IN that/DT chair/NN?
</blockquote>    

### The Viterbi algorithm: Choosing $Q={q_{0:T}}$

- Given an HMM, choose the state sequence that maximizes the joint probability of the state sequence and the observation sequence.  
 * $Q^* = \arg \max\limits_Q P(O,Q;\theta)$
 * $P(O,Q;\theta) = \pi_{q_0}b_{q_0}(o_0) \prod\limits_{t=1}^{T}a_{q_{t-1}q_t}b_{q_t}(o_t)$
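
To make the objective concrete, here is a minimal sketch that scores one candidate state sequence as a product of initial, transition, and emission probabilities. The parameter values are the ones used in the happy/sad worked example later in this lecture; the function name `joint_probability` and the dictionary layout are illustrative choices, not part of the lecture.

```python
# Parameters of the happy/sad HMM used in this lecture's worked example.
pi = {"happy": 0.8, "sad": 0.2}
A = {  # A[i][j] = probability of moving from state i to state j
    "happy": {"happy": 0.7, "sad": 0.3},
    "sad": {"happy": 0.4, "sad": 0.6},
}
B = {  # B[i][o] = probability of emitting observation o in state i
    "happy": {"L": 0.7, "E": 0.2, "C": 0.1, "F": 0.0},
    "sad": {"L": 0.1, "E": 0.1, "C": 0.6, "F": 0.2},
}


def joint_probability(obs, path):
    """P(O, Q): initial term, then one transition * emission term per step."""
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return p


p_best = joint_probability("ELFC", ["happy", "happy", "sad", "sad"])
p_other = joint_probability("ELFC", ["sad", "sad", "sad", "sad"])
print(p_best, p_other)
```

Decoding asks for the path with the largest such score; the Viterbi algorithm finds it without scoring every path separately.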

<img src="images/HMM_example_trellis.png" height="700" width="700"> 

### The Viterbi algorithm: Choosing $Q={q_{0:T}}$

- Dynamic programming algorithm.
- We use a different kind of trellis.
- Want: Given an HMM, choose the state sequence that maximizes the joint probability of the state sequence and the observation sequence.  
 * $Q^* = \arg \max\limits_Q P(O,Q;\theta)$

### The Viterbi algorithm: Choosing $Q={q_{0:T}}$

- We store $\delta$ and $\psi$ values at each node in the trellis

- $\delta_i(t)$ = the probability of the most probable path leading to the trellis node at state $i$ and time $t$
- $\psi_i(t) =$ The best possible previous state if I am in state $i$ at time $t$. 

<img src="images/HMM_example_trellis.png" height="700" width="700"> 


### Viterbi: Initialization
- Initialize with $\delta_i(0) = \pi_i b_i(o_0)$ for all states
    - $\delta_🙂(0) = \pi_🙂 b_🙂(E) = 0.8 \times 0.2 = 0.16$
    - $\delta_😔(0) = \pi_😔 b_😔(E) = 0.2 \times 0.1 = 0.02$
    
- Initialize with $\psi_i(0) = 0 $, for all states   
    - $\psi_🙂(0) = 0, \psi_😔(0) = 0$
    
<img src="images/HMM_example_trellis.png" height="700" width="700"> 

### Viterbi: Induction

The probability $\delta_j(t)$ of the best path to state $j$ at time $t$ depends on the best-path probability $\delta_i(t-1)$ of each possible previous state $i$ and on the transition probability $a_{ij}$ from $i$ to $j$. 

- $\delta_j(t) = \max\limits_i [\delta_i(t-1)a_{ij}] b_j(o_t)$
- $\psi_j(t) = \arg \max\limits_i [\delta_i(t-1)a_{ij}] $

<img src="images/HMM_example_trellis.png" height="800" width="800"> 

### Viterbi induction: $\delta$ and $\psi$ intuition

<img src="images/viterbi_explanation.png" height="150" width="150"> 


- There are two possible paths to state 🙂 at $T = 1$. Which is the best one? 
- $\delta_🙂(1) = \max \begin{bmatrix} \delta_🙂(0) \times a_{🙂🙂} \\ \delta_😔(0) \times a_{😔🙂}\end{bmatrix}  \times b_🙂(L)$
- First take the max between $\delta_🙂(0) \times a_{🙂🙂}$ and $\delta_😔(0) \times a_{😔🙂}$ and then multiply the max by $b_🙂(L)$.   
    
- $\psi_🙂(1)$ = the state at $T=0$ from where the path to 🙂 at $T=1$ was the best one.     


### Viterbi:  notation

- **Note that we use square brackets to list the quantities over which the max is taken. (Not the best notation, but I have seen it used in this context.)**  
- $\delta_🙂(1) = \max \begin{bmatrix} \delta_🙂(0) \times a_{🙂🙂} \\ \delta_😔(0) \times a_{😔🙂}\end{bmatrix}  \times b_🙂(L)$
- First take the max between $\delta_🙂(0) \times a_{🙂🙂}$ and $\delta_😔(0) \times a_{😔🙂}$ and then multiply the max by $b_🙂(L)$.   

### Viterbi: Induction (T = 1)

$\delta$ and $\psi$ at state 🙂 and T = 1
- $\delta_{🙂}(1) = \max\limits_i [\delta_i(0)a_{ij}] b_j(o_t) = 
\max \begin{bmatrix} 0.16 \times 0.7 \\ 0.02 \times 0.4\end{bmatrix} \times 0.7 = 0.0784$
- $\psi_{🙂}(1) = \arg \max\limits_i [\delta_i(0)a_{ij}] = 🙂$

$\delta$ and $\psi$ at state 😔 and T = 1
- $\delta_{😔}(1) = \max\limits_i [\delta_i(0)a_{ij}] b_j(o_t) =  \max \begin{bmatrix} 0.16 \times 0.3 \\ 0.02 \times 0.6\end{bmatrix} \times 0.1 = 0.0048$
- $\psi_{😔}(1) = \arg \max\limits_i [\delta_i(0)a_{ij}] = 🙂$
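
The same step can be checked in a few lines of Python. This is a minimal sketch under the lecture's parameter values, with index 0 standing for 🙂 and index 1 for 😔; the helper name `viterbi_step` is an illustrative choice.

```python
pi = [0.8, 0.2]                   # initial probabilities (index 0 = happy, 1 = sad)
A = [[0.7, 0.3], [0.4, 0.6]]      # A[i][j] = P(next state j | current state i)
b_E = [0.2, 0.1]                  # emission probabilities of E in each state
b_L = [0.7, 0.1]                  # emission probabilities of L in each state


def viterbi_step(delta_prev, A, b_obs):
    """One induction step: return (delta, psi) for the next time step."""
    n = len(delta_prev)
    delta, psi = [], []
    for j in range(n):
        # Score of reaching state j from each possible previous state i.
        scores = [delta_prev[i] * A[i][j] for i in range(n)]
        best_i = max(range(n), key=lambda i: scores[i])
        delta.append(scores[best_i] * b_obs[j])  # take the max, then emit
        psi.append(best_i)                       # remember the back-pointer
    return delta, psi


delta_0 = [pi[i] * b_E[i] for i in range(2)]     # initialization with o_0 = E
delta_1, psi_1 = viterbi_step(delta_0, A, b_L)   # induction with o_1 = L
print(delta_1, psi_1)
```

Up to floating-point rounding, `delta_1` holds 0.0784 and 0.0048, and both back-pointers are 0 (🙂), matching the values above.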

<img src="images/HMM_example_trellis.png" height="800" width="800"> 

### Viterbi: Induction (T = 1)

$\delta$ and $\psi$ at state 🙂 and T = 1
- $\delta_{🙂}(1) = \max \begin{bmatrix} \delta_🙂(0) \times a_{🙂🙂} \\ \delta_😔(0) \times a_{😔🙂}\end{bmatrix}  \times b_🙂(L) = 
\max \begin{bmatrix} 0.16 \times 0.7 \\ 0.02 \times 0.4\end{bmatrix} \times 0.7 = 0.0784$
- $\psi_{🙂}(1)  = 🙂$

$\delta$ and $\psi$ at state 😔 and T = 1
- $\delta_{😔}(1) = \max\limits_i [\delta_i(0)a_{ij}] b_j(o_t) =  \max \begin{bmatrix} 0.16 \times 0.3 \\ 0.02 \times 0.6\end{bmatrix} \times 0.1 = 0.0048$
- $\psi_{😔}(1) = \arg \max\limits_i [\delta_i(0)a_{ij}] = 🙂$

<img src="images/HMM_example_trellis.png" height="800" width="800"> 

### Viterbi: Induction (T = 2)

- $\delta$ and $\psi$ at state 🙂 and T = 2
    - $\delta_{🙂}(2) = \max\limits_i [\delta_i(1)a_{ij}] b_j(o_t) =  \max \begin{bmatrix} 0.0784 \times 0.7 \\ 0.0048 \times 0.4 \end{bmatrix}\times 0 = 0$
    - $\psi_{🙂}(2) = \arg \max\limits_i [\delta_i(1)a_{ij}] = 🙂$

- $\delta$ and $\psi$ at state 😔 and T = 2
    - $\delta_{😔}(2) = \max\limits_i [\delta_i(1)a_{ij}] b_j(o_t) =  \max \begin{bmatrix} 0.0784 \times 0.3 \\ 0.0048 \times 0.6 \end{bmatrix}\times 0.2 = 4.704 \times 10^{-3}$
    - $\psi_{😔}(2) = \arg \max\limits_i [\delta_i(1)a_{ij}] = 🙂$

<br>

<img src="images/HMM_example_trellis.png" height="400" width="400"> 

### Viterbi: Induction (T = 3)

- $\delta$ and $\psi$ at state 🙂 and T = 3
    - $\delta_{🙂}(3) = \max\limits_i [\delta_i(2)a_{ij}] b_j(o_t) = \max \begin{bmatrix} 0 \times 0.7 \\ 4.704 \times 10^{-3} \times 0.4 \end{bmatrix} \times 0.1 = 1.88\times10^{-4}$
    - $\psi_{🙂}(3) = \arg \max\limits_i [\delta_i(2)a_{ij}] = 😔$

- $\delta$ and $\psi$ at state 😔 and T = 3
    - $\delta_{😔}(3) = \max\limits_i [\delta_i(2)a_{ij}] b_j(o_t) = \max \begin{bmatrix} 0 \times 0.3 \\ 4.704 \times 10^{-3} \times 0.6 \end{bmatrix} \times 0.6 = 1.69 \times 10^{-3}$
    - $\psi_{😔}(3) = \arg \max\limits_i [\delta_i(2)a_{ij}] = 😔$

<br>
<img src="images/HMM_example_trellis.png" height="400" width="400"> 


### Viterbi conclusion

- Choose the best final state: $q_T^* = \arg \max\limits_i \delta_i(T)$
- Working backwards, choose the best previous state by following the back-pointers: $q_{t-1}^* = \psi_{q_t^*}(t)$
    - The most likely state sequence for the observation sequence ELFC is 🙂🙂😔😔.
- The joint probability of this state sequence and the observations is the $\delta$ value of the best final state, $\max\limits_i \delta_i(T)$
    - $P(O = ELFC, Q = 🙂🙂😔😔) = 1.69 \times 10^{-3}$    
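
Putting initialization, induction, and backtracking together gives the sketch below. It assumes the lecture's parameter values and encodes the observations as integers (L=0, E=1, C=2, F=3); the function name `viterbi` and this encoding are illustrative choices. It works with raw probabilities for readability; for long sequences one would typically switch to log probabilities to avoid underflow.

```python
def viterbi(obs, pi, A, B):
    """Return the most probable state path and its joint probability."""
    n = len(pi)
    # Initialization: delta_i(0) = pi_i * b_i(o_0), psi_i(0) = 0.
    delta = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    psi = [[0] * n]
    # Induction: extend the best paths one time step at a time.
    for t in range(1, len(obs)):
        d_t, p_t = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[t - 1][i] * A[i][j])
            d_t.append(delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]])
            p_t.append(best_i)
        delta.append(d_t)
        psi.append(p_t)
    # Termination: pick the best final state, then follow the back-pointers.
    last = max(range(n), key=lambda i: delta[-1][i])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[-1][last]


symbols = {"L": 0, "E": 1, "C": 2, "F": 3}
pi = [0.8, 0.2]                                      # index 0 = happy, 1 = sad
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.7, 0.2, 0.1, 0.0], [0.1, 0.1, 0.6, 0.2]]     # rows: states, cols: L, E, C, F

obs = [symbols[c] for c in "ELFC"]
best_path, best_prob = viterbi(obs, pi, A, B)
print(best_path, best_prob)
```

For ELFC this returns the path `[0, 0, 1, 1]` (🙂🙂😔😔) with probability of about $1.69 \times 10^{-3}$, in agreement with the hand computation.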
    
<br>
<img src="images/HMM_example_trellis.png" height="800" width="800"> 

### Viterbi final comments

- This is how you find the best state sequence that explains the observation sequence using the Viterbi algorithm!   
- Much faster than the brute-force approach, which scores every possible state sequence (their number grows exponentially with the sequence length) and takes the one with maximum probability. Viterbi only needs on the order of $N^2$ operations per time step for $N$ states.
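
For comparison, a brute-force decoder can be sketched as follows. It enumerates every state sequence with `itertools.product` and keeps the one with the highest joint probability. The integer encoding (states 0 = 🙂, 1 = 😔; observations L=0, E=1, C=2, F=3) and the helper name `joint` are assumptions made for this illustration.

```python
from itertools import product

pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.7, 0.2, 0.1, 0.0], [0.1, 0.1, 0.6, 0.2]]   # columns: L, E, C, F
obs = [1, 0, 3, 2]                                  # the sequence E, L, F, C


def joint(path):
    """Joint probability of the observations and one candidate state path."""
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return p


# Every possible assignment of one of 2 states to each of the 4 time steps.
all_paths = list(product(range(2), repeat=len(obs)))
best = max(all_paths, key=joint)
print(len(all_paths), best, joint(best))
```

With 2 states and 4 observations there are only 16 candidates, but the count doubles with every additional time step, which is why the dynamic-programming approach matters for realistic sequences.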

### HMM for ASR


<img src="images/HMM_ASR.png" height="1000" width="1000"> 

(Credit: [Stanford cs224s lectures](https://web.stanford.edu/class/cs224s/lectures/224s.17.lec3.pdf))

### HMM for Gene prediction

- Annotated VSG genes and their predictions by an HMM with 3 states for the sequence 9.

<center>
<img src="files/images/HMM_gene_prediction.png" height="600" width="600"> 
</center>

(Credit: [Mesa et al. 2015](https://arxiv.org/pdf/1508.05367.pdf))

### (Optional) HMMs with [ `hmmlearn`](https://hmmlearn.readthedocs.io) 

In [2]:
import numpy as np
from hmmlearn import hmm

# Initializing an HMM 
states = ['Happy', 'Sad']
n_states = len(states)

observations = ['Learn', 'Eat', 'Cry', 'Facebook']
n_observations = len(observations)

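# Note: in recent hmmlearn releases (0.3+), categorical emissions like these
# are modelled by hmm.CategoricalHMM; MultinomialHMM is kept here as in the original run.
model = hmm.MultinomialHMM(n_components=n_states, init_params="")
model.startprob_ = np.array([0.8,0.2])
# The transition matrix attribute in hmmlearn is `transmat_`.
model.transmat_ = np.array([
 [0.7, 0.3],
 [0.4, 0.6]
])
model.emissionprob_ = np.array([
    [0.6, 0.3, 0.1, 0.0],
    [0.1, 0.1, 0.6, 0.2]
])

observation_sequence = np.array([[0, 0, 2, 3, 2, 0, 1, 0, 1, 2, 2]]).T
print(observation_sequence)

# Fit the model. fit() runs EM, so the hand-set parameters above only serve as
# the starting point (init_params="") and are re-estimated from this sequence.
model = model.fit(observation_sequence)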

[[0]
 [0]
 [2]
 [3]
 [2]
 [0]
 [1]
 [0]
 [1]
 [2]
 [2]]


In [3]:
# Likelihood computation
X, Z = model.sample(5)
print(X)
print(Z)
print('loglikelihood of X: ', model.score(X))
X, Z = model.sample(9)
print(X)
print(Z)
print('loglikelihood of X: ', model.score(X))

[[0]
 [0]
 [0]
 [2]
 [2]]
[0 0 0 1 1]
loglikelihood of X:  -3.864361568851093
[[0]
 [0]
 [0]
 [1]
 [3]
 [2]
 [2]
 [2]
 [3]]
[0 0 0 0 1 1 1 1 1]
loglikelihood of X:  -9.733978147247116


In [4]:
# Decoding
logprob, state_seq = model.decode(observation_sequence, algorithm="viterbi")
print('State sequence: ', state_seq)
print('Observations: ', ", ".join(map(lambda x: observations[x], observation_sequence.T[0])))
print('State sequence: ', ", ".join(map(lambda x: states[x], state_seq)))
print('Log probability of the state sequence: ', logprob)

State sequence:  [0 0 1 1 1 0 0 0 0 1 1]
Observations:  Learn, Learn, Cry, Facebook, Cry, Learn, Eat, Learn, Eat, Cry, Cry
State sequence:  Happy, Happy, Sad, Sad, Sad, Happy, Happy, Happy, Happy, Sad, Sad
Log probability of the state sequence:  -12.541693722559724


## Topic modeling

Attribution: Material and presentation in the next slides is adapted from [Jordan Boyd-Graber's excellent material on LDA](http://users.umiacs.umd.edu/~jbg/teaching/CMSC_726/16a.pdf).


### Topic modeling motivation

- Suppose you have a large collection of documents on a variety of topics. 

### Example: A corpus of news articles 

<img src="images/TM_NYT_articles.png" height="2000" width="2000"> 


### Example: A corpus of food magazines 

<img src="images/TM_food_magazines.png" height="2000" width="2000"> 


### A corpus of scientific articles

<img src="images/TM_science_articles.png" height="2000" width="2000"> 

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Topic modeling motivation

- Humans are pretty good at reading and understanding a document and answering questions such as 
    - What is it about?  
    - Which documents is it related to?     
- But for a large collection of documents it would take years to read all documents and organize and categorize them so that they are easy to search.
- You need an automated way
    - to get an idea of what's going on in the data or 
    - to pull documents related to a certain topic

### Topic modeling

- Topic modeling gives you an ability to summarize the major themes in a large collection of documents (corpus). 
    - Example: The major themes in a collection of news articles could be 
        - **politics**
        - **entertainment**
        - **sports**
        - **technology**
        - ...
- Unsupervised ML methods are a common tool for solving such problems.
- Given the hyperparameter $K$, the idea of topic modeling is to describe the data using $K$ "topics"

### Topic modeling: Input and output

- Input
    - A large collection of documents
    - A value for the hyperparameter $K$ (e.g., $K = 3$)
- Output
    1. Topic-words association 
        - For each topic, what words describe that topic? 
    2. Document-topics association
        - For each document, what topics are expressed by the document? 
    

### Topic modeling: Example

- Topic-words association 
    - For each topic, what words describe that topic?  
    - A topic is a mixture of words. 

<img src="images/topic_modeling_word_topics.png" height="1000" width="1000"> 

### Topic modeling: Example

- Document-topics association 
    - For each document, what topics are expressed by the document?
    - A document is a mixture of topics. 
    
<img src="images/topic_modeling_doc_topics.png" height="800" width="800"> 

### Topic modeling: Input and output

- Input
    - A large collection of documents
    - A value for the hyperparameter $K$ (e.g., $K = 3$)
- Output
    - For each topic, what words describe that topic?  
    - For each document, what topics are expressed by the document?

<img src="images/topic_modeling_output.png" height="1000" width="1000"> 

### Topic modeling: Some applications

- Topic modeling is a great EDA tool to get a sense of what's going on in a large corpus. 
- Some examples
    - Pulling documents related to a particular lawsuit. 
    - Examining people's sentiment towards a particular candidate or political party by pulling tweets or Facebook posts related to the election.   

## How do we do topic modeling? 

### Topic modeling as matrix factorization

- You can think of topic modeling as a matrix factorization problem. 
- Given
    - $K \rightarrow $ Number of topics
    - $M \rightarrow $ Number of documents
    - $V \rightarrow $ Size of vocabulary

<img src="images/topic_modeling_matrix_factorization.png" height="800" width="800"> 

- If you use SVD for the factorization, the approach is known as Latent Semantic Analysis (LSA), also called Latent Semantic Indexing (LSI) in information retrieval. 
- Perfectly valid approach! 

[Source](http://users.umiacs.umd.edu/~jbg/teaching/CMSC_726/16b.pdf)
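
To illustrate only the shapes involved, the sketch below multiplies a hypothetical $M \times K$ document-topic matrix by a hypothetical $K \times V$ topic-word matrix to obtain an $M \times V$ document-word matrix. All numbers are made up. This is not SVD itself (SVD factors are not constrained to be non-negative or to sum to one); it shows the probabilistic reading of the same factorization, where every row is a distribution.

```python
# Hypothetical toy values: M = 2 documents, K = 2 topics, V = 3 vocabulary words.
doc_topic = [   # M x K: each row is a document's distribution over topics
    [0.9, 0.1],
    [0.2, 0.8],
]
topic_word = [  # K x V: each row is a topic's distribution over words
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
]


def matmul(X, Y):
    """Plain-Python matrix product of X (a x b) and Y (b x c)."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


doc_word = matmul(doc_topic, topic_word)  # M x V: word probabilities per document
for row in doc_word:
    print([round(p, 2) for p in row], "sum =", round(sum(row), 2))
```

Topic modeling runs this in the opposite direction: it starts from observed document-word counts and tries to recover the two smaller matrices.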



### Alternative: Latent Dirichlet Allocation (LDA)

- A Bayesian, probabilistic, and generative approach  
- Developed by [David Blei](http://www.cs.columbia.edu/~blei/) and colleagues in 2003. 
    * One of the most cited papers in the last 15 years.
- DISCLAIMER    
    - We won't go into the math in detail because we do not have time. 
    - My goal is to give you an intuition of the model and show you how to use it to solve your problems. 

### LDA high-level idea

- The Dirichlet distribution is a distribution over distributions: each draw from it is itself a discrete probability distribution. 
- In our case,
    - Every document is a discrete probability distribution over topics. 
    - Every topic is a discrete probability distribution over words.
    - So we have distributions over distributions.     

### LDA: insight
- Each document is a mixture of corpus-wide topics
- Every topic is a mixture of words

<img src="images/TM_dist_topics_words_blei.png" height="1000" width="1000"> 

(Credit: [David Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Generative story of LDA
- The story that tells us how our data was generated. 
- The generative story of LDA to create Document 1 below:     
    1. Pick a topic from the topic distribution for Document 1. 
    2. Pick a word from the selected topic's word distribution. 

<img src="images/topic_modeling_generative_story.png" height="700" width="700"> 

### Mathematical presentation of the generative story (plate diagram)

- We are not going into the details but I would like you to be familiar with this picture at a high-level because it's likely that you might see it in the context of topic modeling. 

<img src="images/topic_modeling_plate_diagram.png" height="500" width="500"> 

- For each topic $k \in \{1, \dots, K\}$, draw a multinomial distribution $\beta_k$ from a Dirichlet distribution with parameter $\lambda$. 
- For each document $d \in \{1, \dots, M\}$, draw a multinomial distribution $\theta_d$ from a Dirichlet distribution with parameter $\alpha$.
- For each word position $n \in \{1, \dots, N\}$ in document $d$, select a hidden topic $Z_n$ from the multinomial distribution parameterized by $\theta_d$.
- Choose the observed word $w_n$ from the distribution $\beta_{Z_n}$. 

[Source](http://users.umiacs.umd.edu/~jbg/teaching/CMSC_726/16a.pdf)
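
The generative story can be simulated directly. The following is a minimal sketch using only the standard library; it assumes symmetric Dirichlet parameters (a single number for $\alpha$ and for $\lambda$) and a fixed document length $N$, and the helper names `sample_dirichlet` and `generate_corpus` as well as the small vocabulary are illustrative. A Dirichlet draw is obtained by normalizing independent Gamma draws.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from a Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, K, M, N, alpha, lam, seed=0):
    """Generate M documents of N words each by following the LDA story."""
    rng = random.Random(seed)
    V = len(vocab)
    # One word distribution beta_k per topic.
    beta = [sample_dirichlet([lam] * V, rng) for _ in range(K)]
    docs = []
    for _ in range(M):
        # One topic distribution theta_d per document.
        theta_d = sample_dirichlet([alpha] * K, rng)
        words = []
        for _ in range(N):
            z = rng.choices(range(K), weights=theta_d)[0]   # hidden topic Z_n
            w = rng.choices(vocab, weights=beta[z])[0]      # observed word w_n
            words.append(w)
        docs.append(words)
    return docs


vocab = ["fashion", "model", "probabilistic", "topic", "kiwi", "apple"]
docs = generate_corpus(vocab, K=3, M=4, N=5, alpha=0.5, lam=0.5)
for doc in docs:
    print(doc)
```

Inference, discussed next, goes the other way: given only the generated words, recover plausible values for the hidden topics and distributions.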

### LDA Inference

- Infer the underlying topic structure in the documents. In particular, 
    - Learn the discrete probability distributions of topics in each document
    - Learn the discrete probability distributions of words in each topic

### LDA Inference

- We are interested in the posterior distribution: $P(z, \beta, \theta| w_n, \alpha, \lambda)$
- Observations: words. Everything else is hidden (latent). 

<img src="images/topic_modeling_plate_diagram.png" height="600" width="600"> 


- $\lambda$: Hyperparameter for word proportion
    - High $\lambda$ &rarr; every topic contains a mixture of most of the words
    - Low $\lambda$ &rarr; every topic contains a mixture of only a few words
    
- $\alpha$: Hyperparameter for topic proportion  
   - High $\alpha$ &rarr; every document contains a mixture of most of the topics
   - Low $\alpha$ &rarr; every document is representative of only a few topics    
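
The effect of the concentration parameter can be seen by sampling. This sketch draws document-topic distributions for a low and a high symmetric $\alpha$ and compares how much probability the largest topic receives on average. The values 0.1 and 10, the number of topics, and the helper name `sample_dirichlet` are arbitrary choices for illustration.

```python
import random


def sample_dirichlet(alpha, K, rng):
    """Draw one K-dimensional probability vector from a symmetric Dirichlet."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
K = 5
low = [sample_dirichlet(0.1, K, rng) for _ in range(1000)]    # low alpha
high = [sample_dirichlet(10.0, K, rng) for _ in range(1000)]  # high alpha

# Average share of the single most probable topic in each draw.
avg_max_low = sum(max(theta) for theta in low) / len(low)
avg_max_high = sum(max(theta) for theta in high) / len(high)
print("low alpha :", round(avg_max_low, 2))
print("high alpha:", round(avg_max_high, 2))
```

With low $\alpha$ most of the mass sits on one topic per document; with high $\alpha$ the mass is spread much more evenly across topics. The same reasoning applies to $\lambda$ and the words within a topic.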

### How do we find the posterior distribution? 

- We are interested in the posterior distribution: $P(z, \beta, \theta| w_n, \alpha, \lambda)$
- How do we find it? 
    - Variational inference
    - **Gibbs sampling**

### LDA algorithm: Gibbs sampling 

- Sample topic assignments
- Calculate conditional probability of single word topic assignment conditioned on the rest of the parameters. 

<img src="images/topic_modeling_topic_word_assignment.png" height="700" width="700"> 

### Gibbs sampling equation: Calculating the conditional probability

- Two components
    - How much this document likes topic $k$: 
    $$\frac{n_{d,k} + \alpha_k}{\sum^K_i (n_{d,i} + \alpha_i)}$$
    - How much this topic likes word $w_{d,n}$: $$\frac{V_{k, w_{d,n}} + \lambda_{w_{d,n}}}{\sum_i (V_{k,i} + \lambda_i)}$$ 
- The conditional probability of a word's topic assignment given everything else in the model is proportional to: 

$$\frac{n_{d,k} + \alpha_k}{\sum^K_i (n_{d,i} + \alpha_i)} \frac{V_{k, w_{d,n}} + \lambda_{w_{d,n}}}{\sum_i (V_{k,i} + \lambda_i)}$$

- $n_{d,k} \rightarrow$ number of times document $d$ uses topic $k$ 
- $V_{k, w_{d,n}} \rightarrow$ number of times topic $k$ uses word type $w_{d,n}$
- $\alpha_k \rightarrow$ Dirichlet parameter for document to topic distribution
- $\lambda_{w_{d,n}} \rightarrow$ Dirichlet parameter for topic to word distribution
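
As a sketch, the two components can be combined in a small function. It assumes symmetric priors (one value of $\alpha$ for all topics and one value of $\lambda$ for all words), so the sums in the denominators become "total count plus $K\alpha$" and "total count plus $V\lambda$". The function name `topic_conditional`, the prior values, the vocabulary size, and the per-topic totals are hypothetical; the document-topic and word-topic counts mirror the _probabilistic_ example walked through later in this lecture.

```python
def topic_conditional(doc_topic_counts, word_topic_counts, topic_totals,
                      alpha, lam, V):
    """Normalized probability of each topic for one word, given all other counts."""
    K = len(doc_topic_counts)
    doc_total = sum(doc_topic_counts)
    weights = []
    for k in range(K):
        # How much this document likes topic k.
        doc_likes_topic = (doc_topic_counts[k] + alpha) / (doc_total + K * alpha)
        # How much topic k likes this word.
        topic_likes_word = (word_topic_counts[k] + lam) / (topic_totals[k] + V * lam)
        weights.append(doc_likes_topic * topic_likes_word)
    norm = sum(weights)
    return [w / norm for w in weights]


probs = topic_conditional(
    doc_topic_counts=[2, 2, 0],        # topic counts in the document (after decrement)
    word_topic_counts=[15, 1, 1],      # counts of the word in each topic (after decrement)
    topic_totals=[100, 100, 100],      # hypothetical total words per topic
    alpha=0.1, lam=0.1, V=15,          # hypothetical priors and vocabulary size
)
print([round(p, 3) for p in probs])
```

Under these assumed totals, almost all of the probability mass falls on the first topic, because the document uses it and the word occurs in it far more often than in the others.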

### LDA algorithm 

- Suppose $K$ is number of topics
- For each iteration $i$
    - For each document $d$ and word $n$ currently assigned to topic $Z_{old}$
        - Decrement $n_{d,Z_{old}}$ and $V_{Z_{old}, w_{d,n}}$
        - Sample $Z_{new} = k$ with probability proportional to $\frac{n_{d,k} + \alpha_k}{\sum^K_i (n_{d,i} + \alpha_i)} \frac{V_{k, w_{d,n}} + \lambda_{w_{d,n}}}{\sum_i (V_{k,i} + \lambda_i)}$
        - Increment $n_{d, Z_{new}}$ and $V_{Z_{new}, w_{d,n}}$
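
The loop above can be sketched in plain Python. This is an illustrative collapsed Gibbs sampler under simplifying assumptions: symmetric priors, a fixed number of iterations, and no convergence check. The function name `gibbs_lda`, the default prior values, and the four-document corpus are hypothetical. In practice you would use a library such as gensim rather than this sketch.

```python
import random


def gibbs_lda(docs, K, alpha=0.1, lam=0.01, iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors (illustrative)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)

    n_dk = [[0] * K for _ in docs]                      # document-topic counts
    v_kw = [dict.fromkeys(vocab, 0) for _ in range(K)]  # topic-word counts
    v_k = [0] * K                                       # total words per topic
    z = []                                              # topic of every word token

    # Random initial topic assignment for every word token.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(K)
            z[d].append(k)
            n_dk[d][k] += 1
            v_kw[k][w] += 1
            v_k[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                old = z[d][n]
                # Decrement the counts for the current assignment.
                n_dk[d][old] -= 1
                v_kw[old][w] -= 1
                v_k[old] -= 1
                # Unnormalized conditional; the document-side denominator is
                # the same for every k, so it can be dropped.
                weights = [
                    (n_dk[d][k] + alpha) * (v_kw[k][w] + lam) / (v_k[k] + V * lam)
                    for k in range(K)
                ]
                new = rng.choices(range(K), weights=weights)[0]
                # Increment the counts for the new assignment.
                z[d][n] = new
                n_dk[d][new] += 1
                v_kw[new][w] += 1
                v_k[new] += 1
    return n_dk, v_kw


docs = [
    ["famous", "fashion", "model"],
    ["probabilistic", "topic", "model"],
    ["apple", "kiwi", "nutrition"],
    ["kiwi", "health", "nutrition"],
]
n_dk, v_kw = gibbs_lda(docs, K=3)
for k, counts in enumerate(v_kw):
    print(k, {w: c for w, c in counts.items() if c > 0})
```

The returned counts play the roles of $n_{d,k}$ and $V_{k,w}$: normalizing them (with the priors added) gives estimates of the document-topic and topic-word distributions.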
    

### LDA algorithm example

### LDA algorithm example: Random topic assignment

- Randomly assign each word in each document to one of the topics. 
    - The same word in the vocabulary may have different topic assignments in different instances.  

### LDA algorithm example: Sample document and random topic assignment
- Consider this sample document (Document 10) with random topic assignment
<img src="images/topic_modeling_word_topic_assignment.png" height="800" width="800"> 


- With the current topic assignment, here are the topic counts in our document 
<img src="images/topic_modeling_doc_topic_counts.png" height="700" width="700"> 


### LDA algorithm example: Total topic counts

- For each word in our current document (Document 10), calculate how often that word occurs with each topic in all documents

<img src="images/topic_modeling_word_topic_counts.png" height="700" width="700"> 


### LDA algorithm example: Sample a word-topic assignment

- Suppose our sampled word-topic assignment is the word _probabilistic_ in Document 10 with assigned topic 3. 
- How often does Topic 3 occur in Document 10? Once. 
- How often does the word _probabilistic_ occur with Topic 3 in the corpus? Twice.  


<img src="images/topic_modeling_word_topic_assignment.png" height="600" width="600"> 

<img src="images/topic_modeling_doc_topic_counts.png" height="600" width="600"> 

<img src="images/topic_modeling_word_topic_counts.png" height="600" width="600"> 


### LDA algorithm example: Decrement counts

- We want to update the word topic assignment of _probabilistic_ and Topic 3. 
- Decrement the counts for this assignment: the count of Topic 3 in Document 10 and the count of _probabilistic_ in Topic 3.
    
<img src="images/topic_modeling_count_decrement.png" height="1200" width="1200"> 


### LDA algorithm example: Calculating conditional probability distribution
- How much does this document like each topic?
    - The document likes Topics 1 and 2 equally (2 occurrences each)
    - $\frac{n_{d,k} + \alpha_k}{\sum^K_i (n_{d,i} + \alpha_i)}$

- How much does each topic like the word? 
    - Topic 1 likes the word _probabilistic_ much more than the other topics do (15 occurrences in topic 1 vs. 1 occurrence each in topics 2 and 3)
    - $\frac{V_{k, w_{d,n}} + \lambda_{w_{d,n}}}{\sum_i (V_{k,i} + \lambda_i)}$
<img src="images/topic_modeling_decremented_counts.png" height="400" width="400"> 

### LDA algorithm example: Calculating conditional probability distribution 

- How much does Document 10 like each topic?
- How much does each topic like word _probabilistic_ ? 

<img src="images/topic_modeling_decremented_counts.png" height="500" width="500"> 

<img src="images/topic_modeling_conditional_proba.png" height="800" width="800">


### Updating topic assignment

- So update the topic of the current word _probabilistic_ in document 10 to **topic 1**
- Update the document-topic and word-topic counts accordingly. 

<img src="images/topic_modeling_update_count.png" height="1200" width="1200"> 


### LDA algorithm: conclusion

- In one pass, the algorithm repeats the above steps for each word in the corpus
- If you do this for several passes, meaningful topics emerge. 

## Topic modeling examples 

### Topic modeling: Input 

<br><br>
<img src="images/TM_science_articles.png" height="2000" width="2000"> 
    
Credit: [David Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf)

### Topic modeling: output

<img src="images/TM_topics.png" height="900" width="900"> 


(Credit: [David Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Topic modeling: output with interpretation
- Assigning labels to topics is a human task; the model only outputs word distributions. 

<img src="images/TM_topics_with_labels.png" height="800" width="800"> 

(Credit: [David Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### LDA topics in Yale Law Journal
<img src="images/TM_yale_law_journal.png" height="1500" width="1500"> 

(Credit: [David Blei's paper](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf))

### LDA topics in social media

<img src="images/TM_health_topics_social_media.png" height="1300" width="1300"> 


(Credit: [Health topics in social media](https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0103408.g002))

### Topic modeling with Python 

- Training LDA with [gensim](https://radimrehurek.com/gensim/models/ldamodel.html)



## Topic modeling pipeline 

- Preprocess your corpus (CRUCIAL!!)
- Train LDA using Gensim
- Interpret your topics     
- Evaluate
    - How well does your model do on unseen documents? 

### Training LDA with [gensim](https://radimrehurek.com/gensim/models/ldamodel.html)

To train an LDA model with [gensim](https://radimrehurek.com/gensim/models/ldamodel.html), you need

- Document-term matrix 
- Dictionary (vocabulary)
- The number of topics ($K$): `num_topics`
- The number of passes: `passes`

In [7]:
toy_df = pd.read_csv('data/toy_lda_data.csv')
toy_df

Unnamed: 0,doc_id,text
0,1,famous fashion model
1,2,fashion model pattern
2,3,fashion model probabilistic topic model confer...
3,4,famous fashion model
4,5,fresh fashion model
5,6,famous fashion model
6,7,famous fashion model
7,8,famous fashion model
8,9,famous fashion model
9,10,creative fashion model


In [8]:
corpus = [doc.split() for doc in toy_df['text'].tolist()]
corpus

[['famous', 'fashion', 'model'],
 ['fashion', 'model', 'pattern'],
 ['fashion', 'model', 'probabilistic', 'topic', 'model', 'conference'],
 ['famous', 'fashion', 'model'],
 ['fresh', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['creative', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['fashion', 'model', 'probabilistic', 'topic', 'model', 'conference'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'model', 'pattern'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['fashion', 'model', 'probabilistic', 'topic', 'model', 'conference'],
 ['apple', 'kiwi', 'nutrition'],
 ['kiwi', 'health', 'nutrition'],
 ['fresh', 'apple', 'health'],
 ['probabilisti

In [9]:
import gensim
from gensim import corpora
# Create a vocabulary for the lda model 
dictionary = corpora.Dictionary(corpus)
print(dictionary.token2id)

{'famous': 0, 'fashion': 1, 'model': 2, 'pattern': 3, 'conference': 4, 'probabilistic': 5, 'topic': 6, 'fresh': 7, 'creative': 8, 'apple': 9, 'kiwi': 10, 'nutrition': 11, 'health': 12, 'hidden': 13, 'markov': 14}


In [10]:
# Convert our corpus into document-term matrix for Lda
doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus]
doc_term_matrix

[[(0, 1), (1, 1), (2, 1)],
 [(1, 1), (2, 1), (3, 1)],
 [(1, 1), (2, 2), (4, 1), (5, 1), (6, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(1, 1), (2, 1), (7, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(1, 1), (2, 1), (8, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(1, 1), (2, 2), (4, 1), (5, 1), (6, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(2, 1), (3, 1), (5, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(1, 1), (2, 2), (4, 1), (5, 1), (6, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(10, 1), (11, 1), (12, 1)],
 [(7, 1), (9, 1), (12, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(8, 1), (11, 1), (12, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(2, 1), (5, 1), (13, 1), (14, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(2, 1), (5, 1), (6, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(9, 1), (10, 1), (12, 1)],
 [(9, 1), (1

In [11]:
from gensim.models import LdaModel

# Train an lda model
lda = models.LdaModel(corpus=doc_term_matrix, 
                      id2word=dictionary, 
                      num_topics=3, 
                      passes=10)

In [12]:
### Examine the topics in our LDA model
lda.print_topics(num_words=4)

[(0, '0.228*"kiwi" + 0.203*"nutrition" + 0.203*"apple" + 0.130*"health"'),
 (1, '0.326*"fashion" + 0.306*"model" + 0.192*"famous" + 0.053*"pattern"'),
 (2,
  '0.316*"model" + 0.296*"probabilistic" + 0.263*"topic" + 0.054*"conference"')]

In [16]:
### Examine the topic distribution for a document
print('Document: ', corpus[0])
print('Topic assignment for document: ', lda[doc_term_matrix[0]])

Document:  ['famous', 'fashion', 'model']
Topic assignment for document:  [(0, 0.08340233), (1, 0.8284444), (2, 0.08815324)]


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Visualize the topics
# Note: in pyLDAvis 3.x the gensim helper module is named `pyLDAvis.gensim_models`.
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, doc_term_matrix, dictionary, sort_topics=False)
vis

### Tips when you build an LDA model on a large corpus 

- **Preprocessing is crucial!**
    - Tokenize, remove punctuation, convert text to lower case
    - Discard words with length < threshold or word frequency < threshold        
    - Stoplist: Remove most commonly used words in English 
    - Possibly lemmatization: Consider the lemmas instead of inflected forms. 
    - Depending upon your application, restrict to specific part of speech;
        * For example, only consider nouns, verbs, and adjectives
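
A few of these steps can be sketched with the standard library alone. The stoplist below is a tiny hypothetical one, and the function name `preprocess` and the length threshold are illustrative. Lemmatization and part-of-speech filtering are not covered here since they need an NLP library such as spaCy.

```python
import re

# Hypothetical, deliberately tiny stoplist; real stoplists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "it", "was", "she",
             "that", "at", "but", "too", "then"}


def preprocess(text, min_length=3, stopwords=STOPWORDS):
    """Lowercase, tokenize on letters only, drop short words and stopwords."""
    # Keeping only runs of letters removes punctuation and digits in one step
    # (it also splits contractions, which may or may not be what you want).
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_length and t not in stopwords]


sample = ("but it was too dark to see anything; then she looked at the "
          "sides of the well, and noticed that the")
tokens = preprocess(sample)
print(tokens)
```

The resulting token lists are the kind of input that `corpora.Dictionary` and `doc2bow` expect.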

In [39]:
# BEGIN STARTER CODE
import spacy
# Load English model for SpaCy
nlp = spacy.load("en_core_web_sm")

from urllib.request import urlopen
alice_url = "http://www.umich.edu/~umfandsf/other/ebooks/alice30.txt"
alice_text = urlopen(alice_url).read().decode("utf-8")
alice_text = re.sub(r'\s+',' ', alice_text)
alice_text = alice_text[2000:2100] 

In [40]:
doc = nlp(alice_text)
tokens = [token for token in doc]
lemmas = [token.lemma_ for token in doc]
pos = [token.pos_ for token in doc]
print('\nTokens: ', tokens)
print('\nLemmas: ', lemmas)
print('\nPOS: ', pos)


Tokens:  [ , but, it, was, too, dark, to, see, anything, ;, then, she, looked, at, the, sides, of, the, well, ,, and, noticed, that, the]

Lemmas:  [' ', 'but', '-PRON-', 'be', 'too', 'dark', 'to', 'see', 'anything', ';', 'then', '-PRON-', 'look', 'at', 'the', 'side', 'of', 'the', 'well', ',', 'and', 'notice', 'that', 'the']

POS:  ['SPACE', 'CCONJ', 'PRON', 'VERB', 'ADV', 'ADJ', 'PART', 'VERB', 'NOUN', 'PUNCT', 'ADV', 'PRON', 'VERB', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT', 'CCONJ', 'VERB', 'ADP', 'DET']


### Summary

- Viterbi algorithm (HMM wrap-up)
    - We saw the Viterbi algorithm, which is a decoding algorithm for an HMM    
- Topic modeling
    - A tool to uncover themes in a large collection of documents
    - We used the LDA model for topic modeling, which is a Bayesian, probabilistic, and generative model. 
    - The primary idea of the model is 
        - A document is a mixture of topics 
        - A topic is a mixture of words in the vocabulary 
    - In the lab you will be training and interpreting your own LDA model using `gensim`. 

### Some useful resources and links 
- [Frank Rudzicz's slides on HMM](http://www.cs.toronto.edu/~frank/csc401/lectures2018/5-HMMs.pdf) 
- [Andrew McCallum's slides on HMM](https://people.cs.umass.edu/~mccallum/courses/inlp2004a/lect10-hmm2.pdf)
- [Jordan Boyd-Graber's very approachable explanation of LDA](https://www.youtube.com/watch?v=fCmIceNqVog)
- [lda2vec](https://github.com/cemoody/lda2vec)
- [Original topic modeling paper: David Blei et al. 2003](http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf)
- [Topic modeling for computational social scientists ](http://topicmodels.info/)
- [spaCy's Python for data science cheat sheet](http://datacamp-community-prod.s3.amazonaws.com/29aa28bf-570a-4965-8f54-d6a541ae4e06)