# Implementing Latent Dirichlet Allocation for Topic Modeling and Document Classification

### Andrew Cooper, Brian Cozzi

## Abstract

In this report we implement a form of Latent Dirichlet Allocation for topic modeling. Latent Dirichlet Allocation (LDA) was first introduced by Blei, Ng, and Jordan in 2003 as a hierarchical modeling approach for discrete data such as text in a corpus. This algorithm hinges on the notion that collections of data, such as text in a document, are generated according to a latent topic distribution, where each topic assigns probabilities to different words. The purpose of LDA in topic modeling is to group documents based on similar topic distributions, and to identify key words in each topic. Using a collapsed Gibbs sampler approach to LDA as described in Darling 2011, we implement an algorithm that estimates the latent topic distributions of a given corpus of documents. In addition, our algorithm returns key words assigned to each topic. We evaluate the performance of our algorithm on both simulated data and documents from the Newsgroups corpus.

Key phrases: Latent Dirichlet Allocation, Topic Modeling, Collapsed Gibbs Sampler

## 1. Background 

This paper provides an overview of the implementation, optimization, and applications of Latent Dirichlet Allocation (LDA). LDA was first introduced in a seminal paper by Blei, Ng, and Jordan in 2003 as an attempt to hierarchically model discrete data such as text in a set of documents. The original paper focused primarily on topic modeling of text data and, for consistency and comprehensibility, this is the application area on which we focus for the remainder of this paper. In addition, rather than using the variational Expectation Maximization procedure outlined in Blei, Ng, and Jordan, we implement the inference using a collapsed Gibbs sampler.

However, it is worth acknowledging that, across a variety of disciplines, simplifying large sets of categorical data is necessary to quantify similarities and detect anomalies among large and superficially heterogeneous objects.
- Applications of LDA beyond text

### How does this model work?
LDA is a Bayesian hierarchical model with 3 "levels". The lowest is the (discrete) item level which, in an example using text, is a word. The model suggests that this item is produced by some latent "topic".

Then, a finite number of these items are understood to be part of a collection (a document), making a collection a finite mixture of its constituents' underlying topics.

Over a set of these collections, each topic is modeled as an infinite mixture over topic probabilities. 
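The generative story above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this report: the function name `simulate_corpus`, the fixed document length, and the symmetric hyperparameter values are all assumptions made for the example.

```python
import numpy as np

def simulate_corpus(M, V, K, doc_len, alpha, beta, seed=0):
    """Draw a toy corpus from the LDA generative process (illustrative only)."""
    rng = np.random.default_rng(seed)
    # One word distribution per topic and one topic distribution per document.
    phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)
    theta = rng.dirichlet(np.full(K, alpha), size=M)   # shape (M, K)
    bow = np.zeros((M, V), dtype=np.int64)             # bag-of-words counts
    for m in range(M):
        # Item level: each word first draws a latent topic, then a word from it.
        z = rng.choice(K, size=doc_len, p=theta[m])
        for k in z:
            w = rng.choice(V, p=phi[k])
            bow[m, w] += 1
    return bow, theta, phi

# Hypothetical sizes chosen only to keep the example small.
bow, theta, phi = simulate_corpus(M=5, V=20, K=3, doc_len=50, alpha=0.5, beta=0.1)
```

Each row of `bow` is one document, so the row sums equal the chosen document length, while `theta` and `phi` hold the distributions that an inference procedure would try to recover.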

### Mathematical Formulation
The primary challenge in LDA is posterior inference of the latent variables $\theta$, $\phi$, and $\mathbf{z}$. $\theta$ represents the topic distribution of each document. $z_i$ is the discrete topic assigned to the $i$th word. $\phi$ represents the word distribution of each topic or, more generally, each topic's distribution over the set of all possible discrete items.

We are then tasked with computing

$$p(\theta, \phi, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \phi, \mathbf{z}, \mathbf{w} | \alpha, \beta)}{p(\mathbf{w} | \alpha, \beta)}$$

The original paper by Blei, Ng, and Jordan (2003) used variational inference to approximate this posterior. However, it is also possible to address this using Gibbs sampling, where at iteration $t+1$ we sample from the following conditional distributions:
$$\begin{align}
\theta_{t+1} \sim & \; p(\theta \mid \phi_t, \mathbf{z}_t, \mathbf{w}, \alpha, \beta) \\
\phi_{t+1} \sim & \; p(\phi \mid \theta_{t+1}, \mathbf{z}_t, \mathbf{w}, \alpha, \beta) \\
\mathbf{z}_{t+1} \sim & \; p(\mathbf{z} \mid \phi_{t+1}, \theta_{t+1}, \mathbf{w}, \alpha, \beta) \\ \end{align}$$


Darling (2011) shows how this scheme can be simplified into a "collapsed" Gibbs sampler. Because $\theta$ is just the set of document-topic proportions and $\phi$ is the set of topic-word distributions, both can be calculated using just the topic index assignments $\mathbf{z}$. Therefore, we only need to sample from the full conditional of $\mathbf{z}$ (after integrating out $\phi$ and $\theta$).

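In other words, we want to sample from $p(\mathbf{z} \mid \mathbf{w}, \alpha, \beta)$, which is proportional to the joint $p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta)$.

We start with the full joint distribution of all latent parameters and the set of words $\mathbf{w}$, $p(\theta, \phi, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)$, and integrate out $\theta$ and $\phi$: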

$$\begin{align} 
p(\mathbf{w}, \mathbf{z} | \alpha, \beta) =& \int \int p(\mathbf{z}, \mathbf{w}, \theta, \phi | \alpha, \beta) \mathrm{d} \theta \mathrm{d} \phi \\ 
=& \int p(\mathbf{z} | \theta) p(\theta | \alpha) \mathrm{d} \theta \int p\left(\mathbf{w} | \phi_{z}\right) p(\phi | \beta) \mathrm{d} \phi \\ \end{align}$$ 


Here $p(\mathbf{z} \mid \theta)$ and $p(\mathbf{w} \mid \phi_z)$ are multinomial, while $p(\theta \mid \alpha)$ and $p(\phi \mid \beta)$ are Dirichlet priors. Because the Dirichlet is conjugate to the multinomial, each integrand is an unnormalized Dirichlet density, so both integrals can be evaluated in closed form. This yields the final expression:

$$p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = \prod_{d} \frac{B\left(n_{d,\cdot}+\alpha\right)}{B(\alpha)} \prod_{k} \frac{B\left(n_{k,\cdot}+\beta\right)}{B(\beta)}$$

where $B$ is the multivariate beta function, $n_{d,\cdot}$ is the vector of topic counts in document $d$, and $n_{k,\cdot}$ is the vector of word counts assigned to topic $k$.
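As a rough check on the closed form, the collapsed joint can be evaluated on the log scale with log-gamma functions. The sketch below assumes symmetric scalar priors and plain nested lists of counts; the helper names `log_multivariate_beta` and `log_joint` are hypothetical and not part of the implementation described in this report.

```python
import math

def log_multivariate_beta(x):
    """log B(x) = sum_i log Gamma(x_i) - log Gamma(sum_i x_i)."""
    return sum(math.lgamma(v) for v in x) - math.lgamma(sum(x))

def log_joint(doc_topic_counts, topic_word_counts, alpha, beta):
    """log p(w, z | alpha, beta), assuming symmetric scalar priors."""
    K = len(topic_word_counts)
    V = len(topic_word_counts[0])
    total = 0.0
    for n_d in doc_topic_counts:     # one length-K vector of topic counts per document
        total += log_multivariate_beta([c + alpha for c in n_d])
        total -= log_multivariate_beta([alpha] * K)
    for n_k in topic_word_counts:    # one length-V vector of word counts per topic
        total += log_multivariate_beta([c + beta for c in n_k])
        total -= log_multivariate_beta([beta] * V)
    return total

# One document containing a single word, two topics, two vocabulary items.
value = log_joint([[1, 0]], [[1, 0], [0, 0]], alpha=1.0, beta=1.0)
```

With uniform priors, this tiny case has probability $1/2$ for the topic and $1/2$ for the word, so `value` should equal $\log(1/4)$.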

### How can it be applied to our research
Topic modeling with LDA is applicable wherever there is text data; one example is topic modeling for health records.

### Advantages over other algorithms
- LSI
- pLSI

### Disadvantages
- difficult to implement
- While this is not unique to this approach to topic modeling, LDA also neglects word order. Certain extensions work around this. For example, rather than just considering occurrences of single words, we can also consider $n$ adjacent words (commonly called $n$-grams) to capture information about phrases. In this case too, the order of the $n$-grams is neglected.
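To make the $n$-gram idea concrete, the following sketch counts contiguous $n$-grams in a tokenized document using only the standard library. The function name `ngram_counts` and the toy sentence are assumptions for illustration; the resulting counts could stand in for single-word counts in a bag-of-words representation.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token list; the order of the n-grams is discarded."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

tokens = "the cat sat on the mat the cat slept".split()
bigrams = ngram_counts(tokens, 2)
```

Here `bigrams` records that the phrase "the cat" occurs twice, information that single-word counts alone would not capture.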

### What will we do in this paper
In this paper, we describe our approach to implementing LDA using the collapsed Gibbs sampler method described in Darling (2011). Section 2 discusses the algorithm that has been implemented, while Section 3 provides an overview of how this algorithm was optimized. Sections 4 and 5 demonstrate this algorithm's application to simulated and real-world datasets respectively. Section 6 compares this algorithm to others $\textbf{SPECIFY THIS LATER}$. We conclude with a discussion in Section 7.

## 2. Description of Algorithm

The LDA algorithm takes the following inputs: a corpus of documents, the number of topics, two optional hyperparameters, and an optional number of iterations for the Gibbs sampler. The ultimate goal of the algorithm is to estimate the topic distribution for each document as well as the word distribution for each topic. It does this by making inference on the latent topic of each word in the given corpus. We perform this inference by implementing a Gibbs sampler. First, the algorithm sets a starting point by randomly sampling a topic for each of the words in the corpus. Then it iteratively samples a new topic for each word using calculated posterior probabilities. After a number of iterations, the Gibbs sampler returns the estimated topic for each word; these are then used to calculate the latent topic distributions and word distributions using Monte Carlo estimation. These estimated quantities are returned to the user.

### 2.1 Variables

The LDA algorithm has many different symbols and components. We establish all the symbols used in our algorithm below:

* $K$ = The number of topics
* $M$ = The number of documents in the corpus
* $N_m$ = The number of words in document $m$
* $V$ = The number of possible words
* $w_{m,n}$ = The nth word in document $m$
* $z_{m,n}$ = The topic assigned to the nth word in document $m$
* $\theta_m$ = The topic distribution of document $m$
* $\phi_k$ = The word distribution of topic $k$

### 2.2 Algorithm Input

The algorithm takes as input a corpus of documents represented as an $M \times V$ bag-of-words matrix. In addition, it takes as input the number of topics $K$. It also takes in two positive values representing the hyperparameters for the topic distribution ($\alpha$) and the word distribution ($\beta$). For this implementation we use symmetric priors for the Dirichlet distributions, which means only one value is needed as input for each prior. Finally, it takes as input the number of iterations for the Gibbs sampler.
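As an illustration of the expected input format, the sketch below builds a bag-of-words matrix from tokenized documents. The function name `build_bow` and the alphabetical vocabulary ordering are assumptions for this example; any consistent mapping from words to column indices would serve.

```python
import numpy as np

def build_bow(tokenized_docs):
    """Return an M x V count matrix and the vocabulary that indexes its columns."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    index = {tok: v for v, tok in enumerate(vocab)}
    bow = np.zeros((len(tokenized_docs), len(vocab)), dtype=np.int64)
    for m, doc in enumerate(tokenized_docs):
        for tok in doc:
            bow[m, index[tok]] += 1
    return bow, vocab

docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
bow, vocab = build_bow(docs)
```

Row $m$ of `bow` holds the word counts of document $m$, and `vocab[v]` gives the word that column $v$ refers to.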

### 2.3 Gibbs Sampler

The algorithm for LDA has three main steps. The first step of the algorithm is preparing the data for the Gibbs sampler. As a starting point for our sampler, we first must randomly assign topics $z_{m,n}$ for each of the words in the given corpus. We then create two different count matrices, $N_1$ and $N_2$: $N_1$ is an $M \times K$ matrix that counts the distribution of topics across documents. $N_2$ is a $K \times V$ matrix that counts the distribution of words across topics. The count matrices are initialized according to the random topic assignment.

The second step is the implementation of the Gibbs sampler. For each iteration, our sampler loops through every word $w_{m,n}$ in every document of our corpus. For each word, we first remove its assigned topic and decrement the corresponding entries of the count matrices $N_1$ and $N_2$. We then calculate the posterior probabilities of the word belonging to each of the possible topics. One of these topics is sampled using these probabilities and is assigned to the word. Finally, both count matrices are incremented according to this new topic. This process is repeated for all the words in the corpus for the number of iterations the user specified.

***Gibbs Sampler Implementation***

* Randomly assign topics $z_{m,n}$ for all words in corpus, and initialize count matrices $N_1$ and $N_2$

* **for** $i = 1$ to n_iter:
   * **for** $m = 1$ to $M$:
       * **for** $n = 1$ to $N_m$:
           * $w = w_{m,n}$
           * $z_{old} = z_{m,n}$
           * $N_1[m, z_{old}] -= 1$
           * $N_2[z_{old}, w] -= 1$
           * **for** $k = 1$ to $K$:
               * $p_k = Pr(z = k \mid \dots) \propto (N_1[m, k] + \alpha[k]) \cdot \dfrac{N_2[k, w] + \beta[w]}{\sum_{v=1}^{V}\left(N_2[k, v] + \beta[v]\right)}$
           * Normalize $p$ 
           * $z_{new} = $ Sample from $Cat(K, p)$
           * $z_{m,n} = z_{new}$
           * $N_1[m, z_{new}] += 1$
           * $N_2[z_{new}, w] += 1$
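The pseudocode above might be translated into NumPy roughly as follows. This is a sketch under stated assumptions, not the report's actual code: it assumes symmetric scalar priors, an integer bag-of-words matrix, and a hypothetical function name `lda_gibbs`. It also computes the loop over $k$ as a single array expression and keeps a running total of words per topic (`N2_rows`) so the denominator is not re-summed for every word.

```python
import numpy as np

def lda_gibbs(bow, K, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """Collapsed Gibbs sampler for LDA with symmetric scalar priors (sketch)."""
    rng = np.random.default_rng(seed)
    M, V = bow.shape
    # Expand the bag-of-words counts into one array of word indices per document.
    docs = [np.repeat(np.arange(V), bow[m]) for m in range(M)]
    # Random initial topic for every word, then build the count matrices.
    z = [rng.integers(K, size=len(d)) for d in docs]
    N1 = np.zeros((M, K))   # topic counts per document
    N2 = np.zeros((K, V))   # word counts per topic
    for m in range(M):
        for w, k in zip(docs[m], z[m]):
            N1[m, k] += 1
            N2[k, w] += 1
    N2_rows = N2.sum(axis=1)  # running total of words assigned to each topic
    for _ in range(n_iter):
        for m in range(M):
            for n, w in enumerate(docs[m]):
                k_old = z[m][n]
                N1[m, k_old] -= 1
                N2[k_old, w] -= 1
                N2_rows[k_old] -= 1
                # Unnormalized full conditional over the K topics.
                p = (N1[m] + alpha) * (N2[:, w] + beta) / (N2_rows + V * beta)
                k_new = rng.choice(K, p=p / p.sum())
                z[m][n] = k_new
                N1[m, k_new] += 1
                N2[k_new, w] += 1
                N2_rows[k_new] += 1
    return N1, N2

# Toy corpus: documents 0 and 2 share words, document 1 uses different ones.
toy_bow = np.array([[3, 2, 0, 0], [0, 0, 4, 1], [2, 2, 0, 0]])
N1, N2 = lda_gibbs(toy_bow, K=2, n_iter=20)
```

Because every word always carries exactly one topic, the row sums of `N1` match the document lengths and the column sums of `N2` match the corpus-wide word counts at every point between updates.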

### 2.4 Parameter Estimation

The third step to the algorithm is estimating and returning the quantities of interest. One quantity to estimate is the topic distribution $\theta_m$ for each document $m$. Another quantity to estimate is the word distribution $\phi_k$ for each topic. These quantities are estimated using the count matrices $N_1$ and $N_2$ established in the Gibbs sampler:

$\hat{\theta}_{m,k} = \dfrac{N_1[m, k] + \alpha[k]}{\sum_{k'=1}^{K}\left(N_1[m, k'] + \alpha[k']\right)}$ 

$\hat{\phi}_{k, v} = \dfrac{N_2[k, v] + \beta[v]}{\sum_{v'=1}^{V}\left(N_2[k, v'] + \beta[v']\right)}$
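A minimal sketch of this estimation step, assuming symmetric scalar priors and NumPy count matrices, is shown here. The names `estimate_parameters` and `top_words` are hypothetical; the second helper illustrates one way the key words of each topic could be read off $\hat{\phi}$.

```python
import numpy as np

def estimate_parameters(N1, N2, alpha, beta):
    """Smoothed, row-normalized estimates of theta (M x K) and phi (K x V)."""
    theta_hat = (N1 + alpha) / (N1 + alpha).sum(axis=1, keepdims=True)
    phi_hat = (N2 + beta) / (N2 + beta).sum(axis=1, keepdims=True)
    return theta_hat, phi_hat

def top_words(phi_hat, vocab, n=2):
    """Return the n highest-probability words for each topic."""
    order = np.argsort(phi_hat, axis=1)[:, ::-1][:, :n]
    return [[vocab[v] for v in row] for row in order]

# Small made-up count matrices: 2 documents, 2 topics, 3 vocabulary items.
N1 = np.array([[3, 1], [0, 4]])
N2 = np.array([[2, 1, 0], [0, 1, 4]])
theta_hat, phi_hat = estimate_parameters(N1, N2, alpha=1.0, beta=0.5)
keywords = top_words(phi_hat, ["a", "b", "c"], n=2)
```

Each row of `theta_hat` and `phi_hat` sums to one, so the rows can be read directly as probability distributions.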

## 3. Description of Performance Optimization

In the algorithm described above, we see that each iteration of the algorithm runs in $O(MNK)$ time, where $N$ is the number of words per document. On its surface, this does not appear to scale well, especially as more documents, topics, or words are added to the corpus. However, by leveraging tools that compile Python loops to native code, such as Numba and Cython, which execute loops much faster than base Python, we can make the run time much more manageable.

JIT nopython mode: A Numba compilation mode that generates code that does not access the Python C API. This compilation mode produces the highest performance code, but requires that the native types of all values in the function can be inferred. Unless otherwise instructed, the @jit decorator will automatically fall back to object mode if nopython mode cannot be used.

It is also worth considering what type of optimization cannot be done. The most obvious limitation is in the outermost loop. This is a Gibbs sampler, which means that subsequent iterations of the sampler depend on previous iterations. Therefore, these iterations cannot be done in parallel.

To investigate the remaining opportunities to optimize, we can consider the pseudocode below where we notice that there are 3 main loops for each iteration of the Gibbs sampler:

* **for** $i = 1$ to n_iter: 
   * **for** $m = 1$ to $M$:  $\boxed{\textbf{LOOP 1}}$
       * **for** $n = 1$ to $N_m$: $\boxed{\textbf{LOOP 2}}$
           * $w = w_{m,n}$
           * $z_{old} = z_{m,n}$
           * $N_1[m, z_{old}] -= 1$
           * $N_2[z_{old}, w] -= 1$
           * **for** $k = 1$ to $K$: $\boxed{\textbf{LOOP 3}}$
               * $p_k = Pr(z = k \mid \dots) \propto (N_1[m, k] + \alpha[k]) \cdot \dfrac{N_2[k, w] + \beta[w]}{\sum_{v=1}^{V}\left(N_2[k, v] + \beta[v]\right)}$
           * Normalize $p$ 
           * $z_{new} = $ Sample from $Cat(K, p)$
           * $z_{m,n} = z_{new}$
           * $N_1[m, z_{new}] += 1$
           * $N_2[z_{new}, w] += 1$

### Loop 3
The task of the innermost loop (Loop 3) is to calculate the posterior probability of each topic given the observed count matrices. As noted, the equation gives these probabilities only up to a normalizing constant. Inside this loop, the values used as inputs do not change from one value of $k$ to the next: the loop simply performs $K$ independent operations. This makes it a great candidate for parallelization.
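One simple way to exploit this independence, short of explicit multi-threading, is to replace the loop over $k$ with a single array expression. The sketch below compares the two forms; note that this is vectorization rather than true parallel execution, and the function names and symmetric scalar priors are assumptions made for the example.

```python
import numpy as np

def topic_probs_loop(N1, N2, m, w, alpha, beta):
    """Loop 3 written out explicitly: one independent computation per topic."""
    K, V = N2.shape
    p = np.empty(K)
    for k in range(K):
        p[k] = (N1[m, k] + alpha) * (N2[k, w] + beta) / (N2[k].sum() + V * beta)
    return p / p.sum()

def topic_probs_vectorized(N1, N2, m, w, alpha, beta):
    """The same K computations expressed as one NumPy array operation."""
    V = N2.shape[1]
    p = (N1[m] + alpha) * (N2[:, w] + beta) / (N2.sum(axis=1) + V * beta)
    return p / p.sum()

# Random count matrices used only to compare the two versions.
rng = np.random.default_rng(0)
N1_demo = rng.integers(0, 5, size=(3, 4))
N2_demo = rng.integers(0, 5, size=(4, 6))
p_loop = topic_probs_loop(N1_demo, N2_demo, m=1, w=2, alpha=0.1, beta=0.01)
p_vec = topic_probs_vectorized(N1_demo, N2_demo, m=1, w=2, alpha=0.1, beta=0.01)
```

Both functions return the same normalized probability vector; the vectorized form simply delegates the $K$ independent operations to compiled NumPy routines.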

```python
## Create .py for Optimization code
## Initialize parameters and compare both using %timeit
```

### Loops 1 and 2
At first glance, these loops might also seem like great candidates for parallelization, since one might think that all of the work happens at the word level. However, upon closer inspection, the count matrices are updated at each iteration based on the topics that are drawn. Furthermore, the topic drawn for each word depends on the count matrix $N_2$, which is a corpus-level count matrix shared across all documents. The iterations of Loops 1 and 2 are therefore not independent of one another.

- Sampling from a multinomial inside the JIT-compiled code was not possible: arrays could not be passed as arguments under JIT.
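One possible workaround for the sampling limitation is to draw from the categorical distribution by hand using the inverse-CDF method, which needs only a scalar uniform draw and a simple loop. The sketch below is plain Python with a hypothetical function name `sample_categorical`; it has not been verified under any particular Numba version, although scalar loops of this kind are the sort of code that JIT compilers typically handle well.

```python
import random

def sample_categorical(p, u):
    """Return the index whose slice of the cumulative sum of p contains u.

    p is assumed to be normalized and u is a uniform draw from [0, 1).
    """
    cumulative = 0.0
    for k in range(len(p)):
        cumulative += p[k]
        if u < cumulative:
            return k
    return len(p) - 1  # guard against rounding when u is very close to 1

rng = random.Random(0)
probs = [0.2, 0.5, 0.3]
draws = [sample_categorical(probs, rng.random()) for _ in range(1000)]
```

Every draw is a valid topic index, and over many draws the frequency of each index should approach the corresponding entry of `probs`.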

```python
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
# Parallel execution with Cython
# The %%cython magic must be the first line of the cell (run `%load_ext Cython` in an earlier cell).
# Will not work unless OpenMP is installed.

import cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def matrix_multiply2(double[:, :] u, double[:, :] v, double[:, :] res):
    cdef int i, j, k
    cdef int m, n, p

    m = u.shape[0]
    n = u.shape[1]
    p = v.shape[1]

    # Parallelize only the outer loop; the inner loops run serially within each thread.
    for i in prange(m, nogil=True):
        for j in range(p):
            res[i, j] = 0
            for k in range(n):
                res[i, j] += u[i, k] * v[k, j]
```

```python
# Parallelization with vectorize and guvectorize

import numpy as np
import numba
from numba import float32, float64, int32, int64

# Signatures are listed from most specific to least specific.
@numba.vectorize([float32(int32, int32),
                  float64(int64, int64),
                  float32(float32, float32),
                  float64(float64, float64)],
                 target='parallel')
def f_parallel(x, y):
    return np.sqrt(x**2 + y**2)
```

In the next section we will show that this code optimization offers a considerable speed up over the standard Python implementation.

## 4. Applications to Simulated Datasets

## 5. Applications to Real Dataset

- topic distribution by document
- cluster topics (extra)

## 6. Comparative Analysis with Competing Algorithms

## 7. Discussion/Conclusion

## 8. Installation Guide

https://packaging.python.org/tutorials/packaging-projects/

## References/Bibliography

1) Darling, W. M. (2011). A Theoretical and Practical Implementation Tutorial on Topic Modeling and Gibbs Sampling.

2) Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.