# Implementing Latent Dirichlet Allocation for Topic Modeling and Document Classification

### Andrew Cooper, Brian Cozzi

## Abstract

In this report we implement a form of Latent Dirichlet Allocation for topic modeling. Latent Dirichlet Allocation (LDA) was first introduced by Blei, Ng, and Jordan in 2003 as a hierarchical modeling approach for discrete data such as text in a corpus. This algorithm hinges on the notion that collections of data, such as text in a document, are generated according to a latent topic distribution, where each topic assigns probabilities to different words. The purpose of LDA in topic modeling is to group documents based on similar topic distributions, and to identify key words in each topic. Using a collapsed Gibbs sampler approach to LDA as described in Darling 2011, we implement an algorithm that estimates the latent topic distributions of a given corpus of documents. In addition, our algorithm returns key words assigned to each topic. We optimize our algorithm's Gibbs sampler using "just-in-time" compilation. We then evaluate the performance of our algorithm on both simulated data and documents from the Newsgroups corpus. Finally, we compare the accuracy of our algorithm to a variational Bayes approach to LDA and to Latent Semantic Analysis (LSA).

Key phrases: Latent Dirichlet Allocation, Topic Modeling, Collapsed Gibbs Sampler, Newsgroups Corpus, Variational Bayes, Latent Semantic Analysis

## 1. Background

This paper provides an overview of the implementation, optimization, and applications of Latent Dirichlet Allocation (LDA). LDA was first introduced in a seminal paper by Blei, Ng, and Jordan in 2003 as an attempt to hierarchically model discrete data such as text in a set of documents. The original paper focused primarily on topic modeling of text data, and for consistency and comprehensibility we focus on this application area for the remainder of this paper. In addition, rather than implementing inference with the Expectation Maximization approach outlined in Blei, Ng, and Jordan, we use a collapsed Gibbs sampler.

LDA is a widely applicable topic modeling framework for a variety of disciplines. In the social and life sciences, simplifying large sets of categorical data is necessary to quantify similarities and detect anomalies among large and superficially heterogeneous objects. Perhaps the most abundant and accessible source of this type of categorical data is text. Researchers have used LDA in projects as disparate (or seemingly disparate) as finding topics in scientific literature (3) and personalizing medical recommendations (4). With over 20,000 citations (as of 30 April 2019), LDA has emerged as one of the most prolific and dominant methods for modeling topics in text across disciplines.

LDA is a Bayesian hierarchical model with 3 "levels". The lowest is the (discrete) item level which, in an example using text, is a word. The model suggests that this item is produced by some latent "topic". Then, a finite number of these items are understood to be part of a collection (a document), making a collection a finite mixture of its constituents' underlying topics. Over a set of these collections, each topic is modeled as an infinite mixture over topic probabilities.

Despite its wide applicability, LDA is not without some minor drawbacks. Like many other approaches to topic modeling, it neglects the order of words in the documents. Certain extensions partially work around this. For example, rather than considering only occurrences of single words, we can also consider runs of $n$ adjacent words (commonly called $n$-grams) to capture information about phrases. Even in this case, the order of the $n$-grams themselves is neglected.
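As a small illustration of the $n$-gram idea, the sketch below turns a tokenized document into counts of adjacent-word tuples. The helper name `ngrams` and the example sentence are our own assumptions for illustration and are not part of the implementation described in this report.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple, in document order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example document, already split into words
tokens = "the cat sat on the mat".split()

# Counting bigrams discards their order, just as a bag of words does for single words
bigram_counts = Counter(ngrams(tokens, 2))
```

Each distinct tuple can then be treated as one "word" in the vocabulary, so the rest of the model is unchanged.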

Additionally, in its original formulation, LDA can be very challenging to implement. In this paper, we follow the lead of several others who have chosen to implement LDA using a form of Gibbs sampling, described below.

## 2. Description of Algorithm

The LDA algorithm takes in 4 inputs: a corpus of documents, the number of topics, two optional choices for hyperparameters, and an optional specification of the number of iterations for the Gibbs sampler. The ultimate goal of the algorithm is to estimate the topic distribution for each document as well as the word distribution for each topic. It does this by making inference on the latent topic of each word in the given corpus. We perform this inference by implementing a Gibbs sampler. First, the algorithm sets a starting point by randomly sampling a topic for each of the words in the corpus. Then it iteratively samples a new topic for each word using calculated posterior probabilities. After a number of iterations, the Gibbs sampler returns the estimated topics for each word, which are then used to calculate the latent topic distributions and word distributions using Monte Carlo estimation. These estimated quantities are returned to the user.

### 2.1 Variables

The LDA algorithm has many different symbols and components. We establish all the symbols used in our algorithm below:

* $K$ = The number of topics
* $M$ = The number of documents in the corpus
* $N_m$ = The number of words in document $m$
* $V$ = The number of possible words
* $w_{m,n}$ = The $n$th word in document $m$
* $z_{m,n}$ = The topic assigned to the $n$th word in document $m$
* $\theta_m$ = The topic distribution of document $m$
* $\phi_k$ = The word distribution of topic $k$

### 2.2 Algorithm Input

The algorithm takes as input a corpus of documents represented as an $M \times V$ bag-of-words matrix. In addition, it takes as input the number of topics $K$. It also takes in two positive values representing the hyperparameters for the topic distribution ($\alpha$) and the word distribution ($\beta$). For this implementation we use symmetric priors for the Dirichlet distributions, which means only one value is needed as input for each prior. Finally, it takes as input the number of iterations for the Gibbs sampler.
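To make the input format concrete, here is a minimal sketch of how a bag-of-words matrix could be built from tokenized documents. The function `bag_of_words` and the toy documents are assumptions for illustration; the package used in this report may construct its matrix differently.

```python
import numpy as np

def bag_of_words(docs):
    """Build an M x V count matrix from a list of tokenized documents."""
    # Sorted vocabulary so that column order is reproducible
    vocab = sorted({word for doc in docs for word in doc})
    index = {word: v for v, word in enumerate(vocab)}
    bow = np.zeros((len(docs), len(vocab)), dtype=np.int64)
    for m, doc in enumerate(docs):
        for word in doc:
            bow[m, index[word]] += 1
    return bow, vocab

# Hypothetical toy corpus with M = 2 documents
docs = [["god", "faith", "god"], ["pixel", "image", "pixel", "render"]]
bow, vocab = bag_of_words(docs)
```

Row $m$ of the result holds the word counts of document $m$, and column $v$ corresponds to `vocab[v]`.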

### 2.3 Gibbs Sampler

The algorithm for LDA has three main steps. The first step is preparing the data for the Gibbs sampler. As a starting point for our sampler, we first randomly assign a topic $z_{m,n}$ to each of the words in the given corpus. We then create two count matrices, $N_1$ and $N_2$: $N_1$ is an $M \times K$ matrix that counts how many words in each document are assigned to each topic, and $N_2$ is a $K \times V$ matrix that counts how many times each word is assigned to each topic. The count matrices are initialized according to the random topic assignment.
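The initialization step can be sketched as follows. This is an illustrative version under our own assumptions (the name `initialize_counts`, zero-based indices, and a NumPy `Generator` for randomness), not the `initialize` function from the package used later in this report.

```python
import numpy as np

def initialize_counts(bow, K, rng):
    """Randomly assign topics and build the count matrices N1 (M x K) and N2 (K x V)."""
    M, V = bow.shape
    # Expand each row of counts into an explicit array of word indices
    docs = [np.repeat(np.arange(V), bow[m]) for m in range(M)]
    # One random starting topic per word occurrence
    z = [rng.integers(0, K, size=len(doc)) for doc in docs]
    N1 = np.zeros((M, K), dtype=np.int64)
    N2 = np.zeros((K, V), dtype=np.int64)
    for m in range(M):
        for w, k in zip(docs[m], z[m]):
            N1[m, k] += 1  # topic k used once more in document m
            N2[k, w] += 1  # word w assigned once more to topic k
    return docs, z, N1, N2

rng = np.random.default_rng(0)
bow = np.array([[2, 1, 0], [0, 1, 3]])  # hypothetical 2 x 3 bag-of-words matrix
docs, z, N1, N2 = initialize_counts(bow, K=2, rng=rng)
```

Whatever the random assignment, the rows of `N1` sum to the document lengths and the columns of `N2` sum to the corpus-wide word counts, which is a useful sanity check.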

The second step is the implementation of the Gibbs sampler. For each iteration, our sampler loops through every word $w_{m,n}$ in every document of our corpus. For each word, we first remove its assigned topic and decrement the corresponding entries of the count matrices $N_1$ and $N_2$. We then calculate the posterior probabilities of the word belonging to each of the possible topics. One of these topics is sampled using these probabilities and is assigned to the word. Finally, the count matrices are incremented according to this new topic. This process is repeated for all the words in the corpus for the number of iterations the user specified.

***Gibbs Sampler Implementation***

* Randomly assign topics $z_{m,n}$ for all words in corpus, and initialize count matrices $N_1$ and $N_2$

* **for** $i = 1$ to n_iter:
   * **for** $m = 1$ to $M$:
       * **for** $n = 1$ to $N_m$:
           * $w = w_{m,n}$
           * $z_{old} = z_{m,n}$
           * $N_1[m, z_{old}] -= 1$
           * $N_2[z_{old}, w] -= 1$
           * **for** $k = 1$ to $K$:
               * $p_k = \Pr(z = k \mid \dots) \propto (N_1[m, k] + \alpha[k]) \cdot \dfrac{N_2[k, w] + \beta[w]}{\sum_{v=1}^{V} (N_2[k, v] + \beta[v])}$
           * Normalize $p$ 
           * $z_{new} = $ Sample from $Cat(K, p)$
           * $z_{m,n} = z_{new}$
           * $N_1[m, z_{new}] += 1$
           * $N_2[z_{new}, w] += 1$
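The pseudocode above can be sketched in Python as follows. This is a minimal, unoptimized illustration rather than the implementation in our package: the function name `gibbs_lda`, the use of scalar symmetric priors, zero-based indices, and the running `topic_totals` vector (which caches the row sums of $N_2$ so the denominator is not recomputed) are all assumptions made for the example.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha, beta, n_iter, rng):
    """Collapsed Gibbs sampler for LDA with symmetric scalar priors.

    docs is a list of integer arrays holding word indices in [0, V).
    Returns the topic assignments z and the count matrices N1 and N2.
    """
    M = len(docs)
    # Random starting topics and the matching count matrices
    z = [rng.integers(0, K, size=len(doc)) for doc in docs]
    N1 = np.zeros((M, K))
    N2 = np.zeros((K, V))
    for m, doc in enumerate(docs):
        for w, k in zip(doc, z[m]):
            N1[m, k] += 1
            N2[k, w] += 1
    # Number of words assigned to each topic (row sums of N2), kept in sync below
    topic_totals = N2.sum(axis=1)

    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment of this word
                z_old = z[m][n]
                N1[m, z_old] -= 1
                N2[z_old, w] -= 1
                topic_totals[z_old] -= 1
                # Unnormalized posterior over all K topics at once
                p = (N1[m] + alpha) * (N2[:, w] + beta) / (topic_totals + V * beta)
                p /= p.sum()
                # Sample a new topic and restore the counts
                z_new = rng.choice(K, p=p)
                z[m][n] = z_new
                N1[m, z_new] += 1
                N2[z_new, w] += 1
                topic_totals[z_new] += 1
    return z, N1, N2

rng = np.random.default_rng(0)
# Hypothetical corpus: two short documents over a vocabulary of V = 4 words
docs = [np.array([0, 0, 1, 1, 0]), np.array([2, 3, 3, 2, 2])]
z, N1, N2 = gibbs_lda(docs, K=2, V=4, alpha=1.0, beta=1.0, n_iter=50, rng=rng)
```

With a symmetric scalar $\beta$, the denominator $\sum_{v}(N_2[k, v] + \beta)$ reduces to the topic total plus $V\beta$, which is why the inner loop over $k$ can be written as a single vectorized expression here.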

### 2.4 Parameter Estimation

The third step to the algorithm is estimating and returning the quantities of interest. One quantity to estimate is the topic distribution $\theta_m$ for each document $m$. Another quantity to estimate is the word distribution $\phi_k$ for each topic. These quantities are estimated using the count matrices $N_1$ and $N_2$ established in the Gibbs sampler:

$\hat{\theta}_{m,k} = \dfrac{N_1[m, k] + \alpha[k]}{\sum_{k'=1}^{K} (N_1[m, k'] + \alpha[k'])}$

$\hat{\phi}_{k, v} = \dfrac{N_2[k, v] + \beta[v]}{\sum_{v'=1}^{V} (N_2[k, v'] + \beta[v'])}$
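These two estimators amount to adding the prior to each count matrix and normalizing its rows. A minimal sketch, assuming the hypothetical function name `estimate_parameters` and small made-up count matrices:

```python
import numpy as np

def estimate_parameters(N1, N2, alpha, beta):
    """Row-normalize the smoothed count matrices to estimate theta and phi."""
    # alpha and beta may be scalars or vectors of length K and V respectively
    theta = (N1 + alpha) / (N1 + alpha).sum(axis=1, keepdims=True)
    phi = (N2 + beta) / (N2 + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Hypothetical counts for M = 2 documents, K = 2 topics, V = 4 words
N1 = np.array([[4, 1], [0, 5]])
N2 = np.array([[3, 1, 0, 0], [0, 1, 3, 2]])
theta, phi = estimate_parameters(N1, N2, alpha=1.0, beta=1.0)
```

For the first document this gives $\hat{\theta}_1 = (5/7, 2/7)$: the prior pulls the estimate slightly toward uniform relative to the raw proportions $(4/5, 1/5)$.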

## 3. Description of Performance Optimization

In the algorithm described above, each iteration runs in $O(MNK)$ time, where $N$ denotes the typical number of words per document. On its surface, this does not appear to scale well, especially as more documents, topics, or words are added to the corpus. However, by leveraging compiled code, which executes loops much faster than base Python, we can make the run time much more manageable.

It is also worth considering what type of optimization cannot be done. The most obvious limitation is in the outermost loop (Loop 1 in the pseudocode below). This is a Gibbs sampler, which means that each iteration of the sampler depends on the previous one. Therefore, these iterations cannot be run in parallel.

JIT nopython mode is a Numba compilation mode that generates code that does not access the Python C API. This compilation mode produces the highest-performance code, but requires that the native types of all values in the function can be inferred. Unless otherwise instructed, the `@jit` decorator will automatically fall back to object mode if nopython mode cannot be used.
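To make the nopython idea concrete, the sketch below compiles a small function that computes the topic probabilities for a single word. It is an illustration only: the function `topic_probabilities` and its arguments are hypothetical and are not the `Gibbs_faster` routine from our package, and the `try`/`except` fallback is added so the example still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # njit is shorthand for jit(nopython=True)
except ImportError:
    # Assumed fallback: run as plain Python when Numba is unavailable
    def njit(func):
        return func

@njit
def topic_probabilities(N1_row, N2_col, topic_totals, alpha, beta, V):
    """Normalized posterior over the K topics for one word, with scalar priors."""
    K = N1_row.shape[0]
    p = np.empty(K)
    # Explicit loop over topics; nopython mode compiles this to machine code
    for k in range(K):
        p[k] = (N1_row[k] + alpha) * (N2_col[k] + beta) / (topic_totals[k] + V * beta)
    return p / p.sum()

# Hypothetical counts for K = 2 topics and a vocabulary of V = 4 words
p = topic_probabilities(np.array([2.0, 1.0]), np.array([3.0, 0.0]),
                        np.array([5.0, 4.0]), 1.0, 1.0, 4)
```

All arguments are NumPy arrays or plain numbers, so their native types can be inferred, which is the requirement for nopython compilation stated above.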

To investigate the remaining opportunities to optimize, we can consider the pseudocode below where we notice that there are 3 main loops for each iteration of the Gibbs sampler:

* **for** $i = 1$ to n_iter: $\boxed{\textbf{LOOP 1}}$
   * **for** $m = 1$ to $M$:  $\boxed{\textbf{LOOP 2}}$
       * **for** $n = 1$ to $N_m$: $\boxed{\textbf{LOOP 3}}$
           * $w = w_{m,n}$
           * $z_{old} = z_{m,n}$
           * $N_1[m, z_{old}] -= 1$
           * $N_2[z_{old}, w] -= 1$
           * **for** $k = 1$ to $K$: $\boxed{\textbf{LOOP 4}}$
               * $p_k = \Pr(z = k \mid \dots) \propto (N_1[m, k] + \alpha[k]) \cdot \dfrac{N_2[k, w] + \beta[w]}{\sum_{v=1}^{V} (N_2[k, v] + \beta[v])}$
           * Normalize $p$ 
           * $z_{new} = $ Sample from $Cat(K, p)$
           * $z_{m,n} = z_{new}$
           * $N_1[m, z_{new}] += 1$
           * $N_2[z_{new}, w] += 1$

### 3.1 Loop 4
The task of the innermost loop (Loop 4) is to calculate the posterior probability of each topic given the observed count matrices. As noted, the equation gives this probability only up to a normalizing constant. Inside this loop, the values used as inputs do not change from one pass to the next: the loop simply performs $K$ independent operations. This makes it a great candidate for parallelization; however, using JIT for the remaining loops made this impossible. We will expand on this later.

Instead, this loop was optimized using JIT and its _nopython_ option. Below, we provide initialization of a toy dataset and baseline timing for the Gibbs sampler:

#### _Initialization_

In [None]:
# Initializing Values and Count matrices 
from LDA_AandB.Optimization_Examples import initialize, Gibbs
M, doc_lens, Z, W, K, A, B, C, alpha, beta = initialize()

#### _Baseline Timing_

In [None]:
%timeit -r30 -n30 Gibbs(M, doc_lens, Z, W, K, A, B, C, alpha, beta)

#### _Loop 4 Optimization Timing_

In [None]:
from LDA_AandB.Optimization_Examples import Gibbs_faster
%timeit -r30 -n30 Gibbs_faster(M, doc_lens, Z, W, K, A, B, C, alpha, beta)

We can see from this timing that the optimization of the innermost loop resulted in a noticeable timing improvement from the original function. We now can turn our attention to loops over documents and words.

## 4. Applications to Simulated Datasets

In the LDA framework, documents are assumed to be generated under the following stochastic process:

For each document $m$, sample topic distribution $ \theta_m \sim Dirichlet(\alpha)$

For each topic $k$, sample word distribution $ \phi_k \sim Dirichlet(\beta)$

For each word $n$ in each document,

1) Sample topic $z_n \sim Cat(\theta_m)$

2) Sample word $w_n \sim Cat(\phi_{z_n})$
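The generative process above can be sketched directly in NumPy. This is a stand-in written for illustration, not the `simulate_corpus` function from our package; the name `simulate_documents`, the use of a NumPy `Generator`, and the small parameter values are assumptions.

```python
import numpy as np

def simulate_documents(alpha, beta, M, N_min, N_max, rng):
    """Draw a bag-of-words corpus from the LDA generative process."""
    K, V = len(alpha), len(beta)
    theta = rng.dirichlet(alpha, size=M)  # M x K topic distributions
    phi = rng.dirichlet(beta, size=K)     # K x V word distributions
    bow = np.zeros((M, V), dtype=np.int64)
    for m in range(M):
        # Document length drawn uniformly between N_min and N_max inclusive
        N_m = rng.integers(N_min, N_max + 1)
        # One topic per word, then one word per sampled topic
        z = rng.choice(K, size=N_m, p=theta[m])
        for k in z:
            w = rng.choice(V, p=phi[k])
            bow[m, w] += 1
    return bow, theta, phi

rng = np.random.default_rng(101)
bow_sim, theta_sim, phi_sim = simulate_documents(
    alpha=np.ones(2), beta=np.ones(20), M=5, N_min=30, N_max=40, rng=rng)
```

Because the true `theta_sim` and `phi_sim` are returned alongside the corpus, estimates from any fitting procedure can be compared against them.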

To assess the correctness of our LDA algorithm, we simulate data under this stochastic process. We then fit our algorithm to this data and compare the parameter estimates to the true parameters.

We simulate a corpus of 10 documents containing 100 unique "words". Documents in the corpus are composed of 2 different topics and contain between 150 and 200 words.

In [22]:
# Load libraries
import numpy as np
from LDA_AandB.test_data_generator import simulate_corpus
from LDA_AandB.lda_code import lda, group_docs
np.random.seed(101)

In [23]:
# Set corpus parameters
V = 100
N_min = 150
N_max = 200
K = 2
M = 10

# Set hyperparameters
alpha_true = np.random.randint(1, 10, K)
beta_true = np.random.randint(1, 10, V)

# Generate simulated corpus
bow, theta_true, phi_true = simulate_corpus(alpha_true, beta_true, M, N_min, N_max)

The accuracy of our LDA depends on the choice of the hyperparameters $\alpha$ and $\beta$. The closer these hyperparameters are to the true values of the dataset, the better the algorithm's estimates of the topic and word distributions. 

When the hyperparameters $\alpha$ and $\beta$ are chosen to be the true values, our LDA algorithm captures the true topic distributions very well.

In [24]:
# Train data on LDA implementation
theta, phi = lda(bow, K, alpha_true, beta_true, 1000)
#print("Estimated topic distributions:\n", theta)
#print("True topic distributions:\n", theta_true)
print("Mean squared-error in topic probability estimates:", np.mean((theta - theta_true)**2))
# group_docs prints its own report and returns None, so call it directly
print("LDA document groups:")
group_docs(theta, K)
print("True document groups:")
group_docs(theta_true, K)

Mean squared-error in topic probability estimates: 0.015881327708350247
LDA document groups:
Documents labeled in group 1 : []
Documents labeled in group 2 : [0 1 2 3 4 5 6 7 8 9]
True document groups:
Documents labeled in group 1 : []
Documents labeled in group 2 : [0 1 2 3 4 5 6 7 8 9]


However, in real-world scenarios we don't know what the true values of $\alpha$ and $\beta$ are. In the case where the chosen hyperparameters are **not** the true values from the data, our LDA algorithm's estimates are less accurate.

In [20]:
# Train data on LDA implementation
theta, phi = lda(bow, K, 1, 1, 1000)
#print("Estimated topic distributions:", theta)
#print("True topic distributions:", theta_true)
print("Mean squared-error in topic probability estimates:", np.mean((theta - theta_true)**2))
# group_docs prints its own report and returns None, so call it directly
print("LDA document groups:")
group_docs(theta, 2)
print("True document groups:")
group_docs(theta_true, 2)

Mean squared-error in topic probability estimates: 0.10722539488091598
LDA document groups:
Documents labeled in group 1 : [4 5 6]
Documents labeled in group 2 : [0 1 2 3 7 8 9]
True document groups:
Documents labeled in group 1 : []
Documents labeled in group 2 : [0 1 2 3 4 5 6 7 8 9]


While our LDA algorithm performs correctly, the simulated data testing illustrates how the choice of prior parameters for the model can severely affect accuracy in parameter estimation. Because of this, it is important to try different hyperparameters and perform sensitivity analysis.

## 5. Applications to Real Dataset

We apply our algorithm to the Newsgroups corpus. This popular corpus contains documents from 20 different topics ranging from science to politics to religion.

For our analysis we randomly select 15 documents from the Newsgroups corpus under the categories "Computer Graphics" and "Christianity". We then assess how accurately our LDA algorithm classifies these documents.

In [25]:
# Load libraries
import numpy as np
from LDA_AandB.lda_code import lda, group_docs, get_key_words
from LDA_AandB.test_data_generator import get_newsgroups, newsgroups_categories
np.random.seed(101)

In [26]:
cats = [newsgroups_categories[i] for i in [1, 15]]
bow_news, labels, words = get_newsgroups(cats, 15)
print("Document Categories:", [cats[i] for i in labels])

Document Categories: ['soc.religion.christian', 'soc.religion.christian', 'comp.graphics', 'soc.religion.christian', 'comp.graphics', 'soc.religion.christian', 'soc.religion.christian', 'soc.religion.christian', 'comp.graphics', 'soc.religion.christian', 'soc.religion.christian', 'comp.graphics', 'soc.religion.christian', 'soc.religion.christian', 'comp.graphics']


In [27]:
theta, phi = lda(bow_news, 2)

In [32]:
group_docs(theta, 2)
labels
get_key_words(phi, 50, words)

Documents labeled in group 1 : [ 0  1  3  5  6  7  8  9 10 11 12 13 14]
Documents labeled in group 2 : [2 4]
Key words for topic 1 :  ['accelerators', 'according', 'actions', 'actually', 'address', 'alive', 'allow', 'amiga', 'anonyomus', 'answer', 'ask', 'atheist', 'away', 'basis', 'behind', 'beset', 'beside', 'biblical', 'bigger', 'buy', 'cd', 'certainly', 'changing', 'cloudless', 'colour', 'commerical', 'copulating', 'copy', 'country', 'curious', 'curtain', 'declared', 'deforestation', 'demonstrated', 'demonstrates', 'descriptions', 'destroyed', 'determining', 'devastation', 'discovering', 'discussion', 'display', 'drilled', 'earth', 'email', 'encourage', 'evaluate', 'explanation', 'fail', 'finally']
Key words for topic 2 :  ['able', 'about', 'absolute', 'absolutely', 'absurdity', 'accept', 'accordance', 'accurate', 'across', 'acting', 'action', 'acts', 'add', 'adding', 'admit', 'adulterous', 'again', 'against', 'all', 'allows', 'alone', 'also', 'although', 'always', 'amazing', 'amou

## 6. Comparative Analysis with Competing Algorithms

We compare our LDA algorithm to an alternative approach to LDA and to another method of document classification called latent semantic analysis, or LSA. We use the same simulated dataset as in Section 4.

### 6.1 Latent Dirichlet Allocation (Variational Bayes Approach)

The sklearn package in Python implements LDA using a variational Bayes approach in the spirit of the original 2003 paper. The variational Bayes approach introduces additional variational parameters to optimize. The algorithm minimizes the KL divergence between the variational distribution, which is governed by these new parameters, and the true posterior distribution of the latent variables. The parameters are estimated using an Expectation-Maximization approach.

In [39]:
from sklearn.decomposition import LatentDirichletAllocation

# Fit sklearn's variational Bayes LDA; a separate name avoids shadowing our own lda function
lda_vb = LatentDirichletAllocation(n_components = 2,
                                   random_state = 0)
lda_vb.fit(bow)

# Per-document topic proportions, then group the documents
results = lda_vb.transform(bow)
group_docs(results, 2)

Documents labeled in group 1 : [0 1 2 3 4 5 6 7 8 9]
Documents labeled in group 2 : []




### 6.2 Latent Semantic Analysis

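LSA is a different approach to classification than LDA. In essence, LSA applies a truncated singular value decomposition, or SVD, to the bag-of-words matrix and represents each document by its coordinates along the leading components.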
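The SVD at the heart of LSA can be sketched with plain NumPy. The helper `lsa_embed` and the toy matrix are assumptions for illustration; the result should correspond, up to the sign of each component, to what a truncated SVD of the same uncentered matrix produces.

```python
import numpy as np

def lsa_embed(bow, n_components):
    """Project documents onto the leading singular directions of the count matrix."""
    U, s, Vt = np.linalg.svd(bow.astype(float), full_matrices=False)
    # Document coordinates: left singular vectors scaled by their singular values
    return U[:, :n_components] * s[:n_components]

# Hypothetical corpus: documents 0-1 share one vocabulary block, documents 2-3 another
bow_toy = np.array([[3, 2, 0, 0],
                    [2, 3, 0, 0],
                    [0, 0, 2, 1],
                    [0, 0, 1, 2]])
embedding = lsa_embed(bow_toy, 2)
```

In this toy case the two pairs of documents land on orthogonal axes of the embedding, which is the kind of separation LSA relies on for grouping.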

In [40]:
from sklearn.decomposition import TruncatedSVD

# Rank-2 truncated SVD of the bag-of-words matrix
svd = TruncatedSVD(n_components = 2, n_iter = 7, random_state = 42)

# Project each document onto the two components, then group the documents
results = svd.fit_transform(bow)
group_docs(results, 2)

Documents labeled in group 1 : [0 1 2 3 4 5 6 7 8 9]
Documents labeled in group 2 : []


The comparison between the three algorithms shows that LSA performs the least accurately in estimating the true document distributions. LSA is the least complex of the three models, so it makes sense it performs more poorly than LDA.

LDA implemented using the variational Bayes approach appears to perform similarly to LDA implemented using the collapsed Gibbs sampler, even slightly better. This is likely because the LDA algorithm implemented in sklearn does some form of parameter tuning/optimization for the hyperparameters, which improves its accuracy. In addition, the LDA algorithm under variational Bayes runs much faster. This is likely because the variational Bayes approach is faster computationally, as it doesn't need to iterate over all the words in the corpus as the Gibbs sampler approach does.

## 7. Discussion/Conclusion

The LDA algorithm proves to be a powerful tool in classifying 

## References/Bibliography

Darling, W.M. (2011). A Theoretical and Practical Implementation Tutorial on Topic Modeling and Gibbs Sampling.

Blei, D.M., Ng, A.Y., and Jordan, M.I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.