**I. Abstract**

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA takes a large collection of data and produces short descriptions of its members. These descriptions reduce the size of the data while preserving the relationships needed to carry out various inferences. For this project, we implemented the LDA algorithm presented in the paper “Latent Dirichlet Allocation” by David M. Blei, Andrew Y. Ng, and Michael I. Jordan. We use the implementation to analyze the text of famous American historical documents and to create clusters of movies based on user rating data.

**II. Background**

The algorithm assumes a generative process for each document in a corpus $D$. For a document of $N$ words, the topic proportions are first drawn as $\theta \sim Dir(\alpha)$. The topic for the $n$th word, $z_n$, is then drawn from Multinomial($\theta$), and the word $w_n$ is drawn from $P(w_n | z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$. (Blei et al., 2003, pg. 996)
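The generative process above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the project's code; the function names and the values of `alpha` and `beta` are assumptions chosen for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, beta, rng):
    """Generate one document as a list of word indices.

    alpha: k positive Dirichlet parameters.
    beta:  k rows of length V, where beta[i][j] = P(word j | topic i).
    """
    k = len(alpha)
    vocab_size = len(beta[0])
    theta = sample_dirichlet(alpha, rng)  # topic proportions for this document
    words = []
    for _ in range(n_words):
        z = rng.choices(range(k), weights=theta)[0]  # z_n ~ Multinomial(theta)
        w = rng.choices(range(vocab_size), weights=beta[z])[0]  # w_n ~ P(w | z_n, beta)
        words.append(w)
    return theta, words


# Illustrative values only: 3 topics, 5 vocabulary words.
rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]
beta = [
    [0.6, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.6],
]
theta, doc = generate_document(6, alpha, beta, rng)
```

Each call returns the sampled topic proportions and the word indices of one document; repeating the call once per document yields a synthetic corpus.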

LDA is best known for its application to the analysis of text data. It is used to create topics for documents, to classify documents based on these topics, and to determine which documents are similar to one another. The algorithm can also be applied to other problems that have a structure similar to the document-generating process described above. For example, Blei et al. (2003) used a data set in which web site users provide information about movies they enjoy. In this example, the users are analogous to the documents and their preferred movies are the “words”. Topics can then be found by determining which movies are similar.

There are multiple alternative algorithms that can be used for text analysis. These include the term-frequency inverse-document-frequency matrix, latent semantic indexing (LSI) and the probabilistic latent semantic indexing (pLSI) model. In Section 7 of the paper, the authors present the results of document modeling and document classification. In this section, the authors’ results show that their implementation of LDA performs better than competing methods in terms of perplexity measure, where better generalization performance is defined by a smaller perplexity measure. 
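For reference, the perplexity of a held-out set is the exponential of the negative average per-word log likelihood, as defined in Section 7 of Blei et al. The helper below is a small sketch of that formula; the function name and the example numbers are assumptions made for illustration.

```python
import math


def perplexity(log_likelihoods, doc_lengths):
    """Perplexity = exp(-sum_d log p(w_d) / sum_d N_d) over held-out documents."""
    total_words = sum(doc_lengths)
    if total_words == 0:
        raise ValueError("the held-out set contains no words")
    return math.exp(-sum(log_likelihoods) / total_words)


# Sanity check: a model that gives every word probability 1/10
# should have perplexity 10 regardless of document length.
uniform_ll = [4 * math.log(0.1), 7 * math.log(0.1)]
value = perplexity(uniform_ll, [4, 7])
```

A lower value means the model assigns higher probability to unseen documents, which is why smaller perplexity indicates better generalization.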

In addition, Section 7 demonstrates that other methods are prone to overfitting. As the number of topics increases, some of the alternative algorithms induce words that have small probabilities, because the documents in the corpus are divided into more collections. This can cause the perplexity of these alternative algorithms to become very large. The problem comes from the requirement that the topic proportions in a future document must be seen in at least one of the training documents. LDA, in contrast, does not have this overfitting problem. For more detail, see Section 7 of Blei et al.

A disadvantage of LDA is that exact inference of the posterior is intractable. Thus, it is necessary to use an approximation technique such as variational inference or a Gibbs sampler. The approximation of the posterior can be computationally intensive and time-consuming.
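As an illustration of the variational approach, the per-document updates given by Blei et al. are $\phi_{ni} \propto \beta_{i w_n} \exp(\Psi(\gamma_i))$ and $\gamma_i = \alpha_i + \sum_n \phi_{ni}$. The sketch below iterates these two updates for a single document. It is not the project's implementation: the function names are hypothetical, and the digamma function is a simple series approximation written here to keep the example free of external packages.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0 (recurrence, then an asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift x upward, where the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def e_step(doc, alpha, beta, tol=1e-6, max_iter=100):
    """Variational updates for one document given as a list of word indices."""
    k, n = len(alpha), len(doc)
    phi = [[1.0 / k] * k for _ in range(n)]  # initial phi_ni = 1/k
    gamma = [a + n / k for a in alpha]       # initial gamma_i = alpha_i + N/k
    for _ in range(max_iter):
        exp_psi = [math.exp(digamma(g)) for g in gamma]
        for pos, w in enumerate(doc):
            row = [beta[i][w] * exp_psi[i] for i in range(k)]
            norm = sum(row)
            phi[pos] = [r / norm for r in row]  # normalize over topics
        new_gamma = [alpha[i] + sum(phi[pos][i] for pos in range(n)) for i in range(k)]
        change = max(abs(a - b) for a, b in zip(new_gamma, gamma))
        gamma = new_gamma
        if change < tol:
            break
    return gamma, phi


# Illustrative two-topic, two-word example.
gamma, phi = e_step([0, 0, 1], [0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]])
```

Because each row of `phi` sums to one, the entries of `gamma` always sum to the sum of `alpha` plus the document length, which is a convenient sanity check.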



**III. Implementation**

**a. Code**

The LDA algorithm was implemented in Python in the Jupyter notebook environment. The LDA function requires four arguments. The first is the number of topics, k. The second is the output of the make_word_matrix function, which takes the corpus as input and returns three things: a matrix in which each word in the document is a row and the columns are the unique words in the corpus, the list of unique words in the corpus, and the number of documents in the corpus. The third argument is the tolerance for convergence. The fourth, needToSplit, indicates the form of the corpus documents: 0 if the documents are a list of strings, and 1 if the documents are one long string. We also created a function that returns a specified number of words for each topic.
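To make the shape of the word matrix concrete, here is a minimal sketch of a function with the same three return values. It is a hypothetical stand-in rather than the project's make_word_matrix: the name `make_word_matrix_sketch` is invented, and tokenization is assumed to be a lowercase whitespace split with no stop-word removal or stemming.

```python
def make_word_matrix_sketch(corpus):
    """Return (per-document one-hot matrices, vocabulary, number of documents)."""
    tokenized = [doc.lower().split() for doc in corpus]
    vocab = sorted({word for doc in tokenized for word in doc})
    index = {word: j for j, word in enumerate(vocab)}
    matrices = []
    for doc in tokenized:
        rows = []
        for word in doc:
            row = [0] * len(vocab)  # one column per unique word in the corpus
            row[index[word]] = 1    # mark the column of the word at this position
            rows.append(row)
        matrices.append(rows)
    return matrices, vocab, len(corpus)


matrices, vocab, n_docs = make_word_matrix_sketch(
    ["freedom nation freedom", "nation power"]
)
```

Each document becomes a matrix with one row per word occurrence, so documents of different lengths produce matrices with different numbers of rows.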


**b.	Testing and Base Cases**

In addition to the implementation described above, various checks have been included in the function to prevent incorrect arguments.  One check is that the number of topics specified by the user must be greater than one. It is not possible to have zero or less than zero topics, and one topic would just be described by the entire document. In addition, there is a check in place to ensure that the corpus is not empty and that each element of the corpus is a string. This prevents the user from receiving an error that the input is not a string. Lastly, there are checks to ensure that the tolerance is greater than zero and that the entry for needToSplit is either zero or one. These checks print messages that inform the user of why their input was problematic.
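The checks described above could look roughly like the following. This is a sketch under assumptions: the function name is hypothetical, the corpus is assumed to be a list of strings, and the messages are returned as a list rather than printed so that they are easy to inspect.

```python
def check_lda_inputs(corpus, k, tol, need_to_split):
    """Return a list of messages describing problematic arguments (empty if none)."""
    problems = []
    if not isinstance(k, int) or k <= 1:
        problems.append("The number of topics must be an integer greater than one.")
    if not corpus:
        problems.append("The corpus must not be empty.")
    elif not all(isinstance(doc, str) for doc in corpus):
        problems.append("Every element of the corpus must be a string.")
    if not tol > 0:
        problems.append("The tolerance must be greater than zero.")
    if need_to_split not in (0, 1):
        problems.append("needToSplit must be either 0 or 1.")
    return problems


ok = check_lda_inputs(["we hold these truths"], 3, 1e-4, 0)
bad = check_lda_inputs([], 1, 0, 2)
for message in bad:
    print(message)
```

Collecting every message before reporting lets the user fix all problematic arguments in one pass instead of one error at a time.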


**c.	Tests for Correctness**

*i.	Generative Model*

A generative model check was used to evaluate the functionality of the algorithm. The number of documents was drawn from a Poisson(5) distribution, and the resulting number was six. Then, for each of the six documents, the number of words was drawn from a Poisson(5) distribution; the documents had between 2 and 6 words. Random values were used to create the $\beta$ vector, and each element of the $\alpha$ vector was set to 50/k, where k = 3 is the number of topics. The 50/k value was chosen based on the results shown in Gregor Heinrich’s “Parameter estimation for text analysis” (Heinrich, 2008, 24). The corpus was defined to have five unique words that are not stop words.
This set-up was used to determine which words appeared in each document, following the generative model described in Blei et al.’s paper (Blei et al., 2003, 996). Once the documents were generated, our LDA implementation was run on this corpus to estimate the $\alpha$ and $\beta$ values.

\begin{center}
 \begin{tabular}{||c c||} 
 \hline
 MSE ($\alpha$) & MSE ($\beta$) \\ [0.5ex] 
 \hline
 0.06 & 0.63\\ [1ex] 
 \hline
\end{tabular}
\end{center}

Thus, the values returned by the algorithm are relatively close to the ones used to generate the corpus. It is not surprising that the mean squared error for the $\beta$ vector is larger, because there are more elements in the $\beta$ vector. The $\alpha$ vector also converges faster than the $\beta$ vector, which could contribute to the larger MSE for $\beta$.


*ii.	Comparison to Python package*

In addition to the generative model, the performance of our LDA implementation was compared to the implementation from the gensim package. This implementation can be found at https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html.


The results from the Python package and our LDA algorithm were not exactly the same. One topic is very similar for both algorithms: the two versions share many of the same words, with similar probabilities assigned to those words. Other topics are not very similar. This could be a result of differences between the two algorithms. The Python implementation requires the user to pass the number of iterations as an argument, whereas our implementation uses a convergence criterion. In addition, the Python implementation randomly initializes all of the parameters. Thus, the differences in starting positions and stopping rules could result in different topics. Given these differences, we are satisfied with the similarity found between our LDA implementation and the one in the gensim package.


**d.	Profiling Code and Speed Up**

We profiled the initial code by running the algorithm on a corpus of four of President Clinton’s State of the Union Addresses. The main bottleneck was the maximization step, which accounted for about 75% of the run time of the LDA algorithm. In the maximization step, the $\alpha$ update occurs in a while loop, and the $\beta$ update uses four for loops. The expectation step also slowed down the code, but accounted for only approximately 13% of the run time. This is most likely a result of having fewer for loops than the maximization step.
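A profile of this kind can be collected with Python's built-in cProfile and pstats modules. The sketch below shows the general pattern only; `slow_m_step` is a hypothetical loop-heavy stand-in, not the project's maximization step.

```python
import cProfile
import io
import pstats


def slow_m_step(n):
    """Hypothetical stand-in for a loop-heavy maximization step."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total


def profile_call(func, *args):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


report = profile_call(slow_m_step, 200)
print(report)
```

Sorting by cumulative time puts the functions that dominate the run time at the top, which is how a bottleneck such as the maximization step would show up.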

We identified inefficiencies in the matrix construction function and implemented a storage system to avoid the unnecessary repetition of some calculations. In addition, during the initial coding of the algorithm, care was taken to avoid for loops and use vectorization where possible. Just-in-time compilation (Numba) was also added to the code for the algorithm implementation.

Finally, with the assistance of Professor Chan, the code was adapted so that Cython could be used. It was first necessary to adjust the code to prevent ragged arrays because Cython does not easily work with sparse or ragged arrays. The phi and the corpus matrix both initially used ragged arrays. To fix this, the largest dimension of the matrices was found, and the rest of the matrices were filled in with zeros such that the dimensions matched. In addition, the digamma function from the Cython GSL package replaced that from the scipy package. The final change was using cdef to define the inputs to the functions and the variables used within. 
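The zero-padding idea can be illustrated on a list of rows of unequal length. This is a simplified sketch with a hypothetical helper name; the project's arrays have more dimensions, but the principle of padding to the largest size and remembering the true lengths is the same.

```python
def pad_ragged(rows, fill=0):
    """Pad each row with `fill` up to the longest row; also return true lengths."""
    lengths = [len(row) for row in rows]
    width = max(lengths, default=0)
    padded = [list(row) + [fill] * (width - len(row)) for row in rows]
    return padded, lengths


# Three "documents" with 3, 1 and 0 entries become a rectangular 3 x 3 array.
padded, lengths = pad_ragged([[1, 2, 3], [4], []])
```

Keeping the original lengths alongside the padded array lets later loops skip the padding so that the added zeros do not affect the updates.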

\begin{table}[H]
\caption{Toy Example Times} \label{tab:toy-times}
\begin{center}
\begin{tabular}{ c | l}
  \hline   
  Method & Total Time \\
  \hline
  No Speed-Up & 4.36 milliseconds \\
  Numba & 3.02 seconds \\
  Cython & 4.16 milliseconds \\
  \hline  
\end{tabular}
\end{center}
\end{table}

\begin{table}[H]
\caption{Clinton SOTU Timing} \label{tab:sotu-times}
\begin{center}
\begin{tabular}{ c | l}
  \hline   
  Method & Total Time \\
  \hline
  No Speed-Up & 4 minutes, 54 seconds \\
  Numba & 4 minutes, 51 seconds  \\
  Cython & 1 minute, 48 seconds \\
  \hline  
\end{tabular}
\end{center}
\end{table}

As the tables above show, Cython provided the fastest implementation of the algorithm. Numba was the slowest on the toy example and only marginally faster than the unoptimized code on the State of the Union example. According to the Numba documentation, there is compilation time associated with Numba, which likely dominates on a small example. Numba would also be faster in nopython mode, but we were not able to use that mode with our NumPy-based code. For the State of the Union example, the Cython code is roughly 2.7 times faster than the other implementations (108 seconds versus 294 and 291 seconds).


**IV.	Results**

The LDA algorithm was used to determine the topics that occurred in famous documents and speeches from American history. These included the Gettysburg Address, the Declaration of Independence, Martin Luther King’s “I Have A Dream” speech and President John F. Kennedy’s inauguration address. The first implementation defined the number of topics to be three. 

\begin{table}[H]
\caption{American Document Topics - Three Topics} \label{tab:three-topics}
\begin{center}
\begin{tabular}{ c | l}
  \hline   
  Topic & Potential Occurring Themes \\
  \hline
  1 & will, can, one, live, men \\
  2 & mentioning family members, ready for execution  \\
  3 & not as clear as other topics, life, family, time \\
  \hline  
\end{tabular}
\end{center}
\end{table}

To visualize the topic assignments, the excerpt from the Declaration of Independence below is colored by topic. If a particular word was assigned to more than one topic, it was assigned to the topic for which it had the highest probability.
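The highest-probability rule can be sketched as follows. The helper names and the per-word probabilities are invented for illustration; only the three colors follow the legend used for the three-topic excerpt.

```python
TOPIC_COLORS = ["red", "blue", "green"]  # Topic 1, Topic 2, Topic 3


def most_likely_topic(word_topic_probs):
    """Map each word to the index of its highest-probability topic."""
    return {
        word: max(range(len(probs)), key=probs.__getitem__)
        for word, probs in word_topic_probs.items()
    }


def colorize(word, topic_index):
    """Wrap a word in a span colored by its assigned topic."""
    color = TOPIC_COLORS[topic_index]
    return f'<span style="color:{color}">{word}</span>'


# Made-up probabilities, not results from the report.
probs = {"liberty": [0.2, 0.3, 0.5], "laws": [0.1, 0.8, 0.1]}
assignments = most_likely_topic(probs)
html = colorize("laws", assignments["laws"])
```

Ties are broken in favor of the lowest topic index, because `max` returns the first maximal element it encounters.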


<span style="color:red">Topic 1</span> 

<span style="color:blue">Topic 2</span> 

<span style="color:green">Topic 3</span> 

When in the <span style="color:blue">Course</span>  of <span style="color:green">human</span> <span style="color:red">events</span> , it <span style="color:red">becomes</span> <span style="color:green">necessary</span> for <span style="color:green">one</span> <span style="color:green">people</span> to <span style="color:green">dissolve</span> the <span style="color:green">political</span> <span style="color:red">bands</span> 
which have <span style="color:blue">connected</span> them with <span style="color:green">another</span>, and to <span style="color:blue">assume</span>, <span style="color:blue">among</span> the <span style="color:green">Powers</span> of the <span style="color:green">earth</span> , the <span style="color:green">separate</span> and <span style="color:green">equal</span> 
<span style="color:blue">station</span> to which the <span style="color:blue">Laws</span> of <span style="color:blue">Nature</span> and of <span style="color:blue">Nature's</span> <span style="color:green">God</span> <span style="color:green">entitle</span> them, a <span style="color:red">decent</span> <span style="color:red">respect</span> to the <span style="color:red">opinions</span> of 
<span style="color:red">mankind</span> <span style="color:green">requires</span> that they should 
<span style="color:green">declare</span> the <span style="color:red">causes</span> which <span style="color:red">impel</span> them to the <span style="color:green">separation</span>.

We <span style="color:blue">hold</span> these <span style="color:green">truths</span> to be <span style="color:blue">self-evident</span>, that all <span style="color:blue">men</span> are <span style="color:red">created</span> <span style="color:green">equal</span>, that they are <span style="color:blue">endowed</span> by their 
<span style="color:green">Creator</span> with <span style="color:blue">certain unalienable Rights</span>, that <span style="color:blue">among</span> these are <span style="color:blue">Life</span>, <span style="color:green">Liberty</span>, and the <span style="color:red">pursuit</span> of <span style="color:blue">Happiness</span>.

The second implementation defined the number of topics to be eight.

\begin{table}[H]
\caption{American Document Topics - Eight Topics} \label{tab:eight-topics}
\begin{center}
\begin{tabular}{ c | l}
  \hline   
  Topic & Most Probable Words \\
  \hline
  1 & will, can, one, live, men \\
  2 & one, us, freedom, govern, power  \\
  3 & one, us, everi, new, let, state \\
  4 & will, world, new, freedom, today \\
  5 & right, let, will, peopl, nation \\
  6 & right, let, us, nation, one \\
  7 & let, will, freedom, nation, everi\\
  8 & let, state, freedom, us, power\\
  \hline  
\end{tabular}
\end{center}
\end{table}


<span style="color:red">Topic 0</span> 
<span style="color:blue">Topic 1</span> 
<span style="color:green">Topic 2</span> 
<span style="color:darkmagenta">Topic 3</span> 
<span style="color:orange">Topic 4</span> 
<span style="color:lightseagreen">Topic 5</span> 
<span style="color:darkblue">Topic 6</span> 
<span style="color:magenta">Topic 7</span> 
<span style="color:cyan">Topic 8</span> 
<span style="color:chartreuse">Topic 9</span> 

**Declaration**

When in the <span style="color:chartreuse">Course</span> of <span style="color:lightseagreen">human</span> <span style="color:red">events</span>, it <span style="color:red">becomes</span> <span style="color:green">necessary</span> for <span style="color:blue">one</span> <span style="color:orange">people</span> to <span style="color:darkmagenta">dissolve</span> the <span style="color:magenta">political</span> <span style="color:cyan">bands</span> 
which have <span style="color:darkblue">connected</span> them with <span style="color:darkmagenta">another</span>, and to <span style="color:darkmagenta">assume</span>, <span style="color:blue">among</span> the <span style="color:chartreuse">Powers</span> of the <span style="color:chartreuse">earth</span>, the <span style="color:chartreuse">separate</span> and <span style="color:orange">equal</span> 
<span style="color:lightseagreen">station</span> to which the <span style="color:orange">Laws</span> of <span style="color:lightseagreen">Nature</span> and of <span style="color:lightseagreen">Nature's</span> <span style="color:darkmagenta">God</span> <span style="color:magenta">entitle</span> them, a <span style="color:lightseagreen">decent</span> <span style="color:green">respect</span> to the <span style="color:darkmagenta">opinions</span> of 
<span style="color:lightseagreen">mankind</span> <span style="color:magenta">requires</span> that they should 
<span style="color:lightseagreen">declare</span> the <span style="color:orange">causes</span> which <span style="color:lightseagreen">impel</span> them to the <span style="color:chartreuse">separation</span>.

We <span style="color:magenta">hold</span> these <span style="color:blue">truths</span> to be <span style="color:chartreuse">self-evident</span>, that all <span style="color:red">men</span> are <span style="color:chartreuse">created</span> <span style="color:orange">equal</span>, that they are <span style="color:blue">endowed</span> by their 
<span style="color:lightseagreen">Creator</span> with <span style="color:orange">certain</span> <span style="color:chartreuse">unalienable</span> <span style="color:lightseagreen">Rights</span>, that <span style="color:blue">among</span> these are <span style="color:red">Life</span>, <span style="color:darkblue">Liberty</span>, and the <span style="color:cyan">pursuit</span> of <span style="color:magenta">Happiness</span>.

**Gettysburg**

It is for <span style="color:lightseagreen">us</span> the <span style="color:red">living</span>, <span style="color:orange">rather</span>, to be <span style="color:orange">dedicated</span> here to the <span style="color:orange">unfinished</span>
<span style="color:magenta">work</span> which they who <span style="color:cyan">fought</span> here have <span style="color:orange">thus</span> <span style="color:darkblue">far</span> so <span style="color:lightseagreen">nobly</span> <span style="color:lightseagreen">advanced</span>.
It is <span style="color:orange">rather</span> for <span style="color:lightseagreen">us</span> to be here <span style="color:orange">dedicated</span> to the <span style="color:red">great</span> <span style="color:lightseagreen">task</span> <span style="color:chartreuse">remaining</span>
before <span style="color:lightseagreen">us</span>. . .that from these <span style="color:red">honored</span> <span style="color:darkblue">dead</span> we <span style="color:magenta">take</span> <span style="color:orange">increased</span> <span style="color:green">devotion</span>
to that <span style="color:orange">cause</span> for which they <span style="color:orange">gave</span> the <span style="color:orange">last</span> <span style="color:blue">full</span> <span style="color:magenta">measure</span> of <span style="color:green">devotion</span>. . .
that we here <span style="color:darkblue">highly</span> <span style="color:darkblue">resolve</span> that these <span style="color:darkblue">dead</span> <span style="color:magenta">shall</span> not have <span style="color:darkblue">died</span> in <span style="color:orange">vain</span>. . .
that this <span style="color:lightseagreen">nation</span>, under <span style="color:darkmagenta">God</span>, <span style="color:magenta">shall</span> have a <span style="color:darkblue">new</span> <span style="color:lightseagreen">birth</span> of <span style="color:cyan">freedom</span>. . .
and that <span style="color:darkblue">government</span> of the <span style="color:orange">people</span>. . .by the <span style="color:orange">people</span>. . .for the <span style="color:orange">people</span>. . .
<span style="color:magenta">shall</span> not <span style="color:lightseagreen">perish</span> from this <span style="color:chartreuse">earth</span>.

*ii. Movie Data*

In addition to the topic creation for the historical documents, the LDA algorithm was used to cluster movies based on user ratings. This example used the MovieLens data set from the GroupLens website (http://grouplens.org/datasets/movielens/). The data contain user ratings for different movies, and we use them to determine clusters of movies based on which movies the users prefer. A movie is preferred if a user rated it 4 or 5 (out of 5). The data were then reduced to the users who had at least 50 preferred movies. In this example, the users can be thought of as the documents and their preferred movies are the words. The corpus is the collection of users. LDA creates topics that should cluster movies together based on which users preferred which movies. The LDA algorithm was used to create five clusters of movies.
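The preprocessing described above can be sketched as follows. The function name and the (user, movie, rating) tuple layout are assumptions made for the example, not the MovieLens file format or the project's code.

```python
from collections import defaultdict


def build_user_documents(ratings, min_rating=4, min_preferred=50):
    """Turn (user_id, movie_id, rating) rows into one "document" per user.

    A movie is a "word" in a user's document if the user rated it at least
    min_rating; users with fewer than min_preferred such movies are dropped.
    """
    preferred = defaultdict(list)
    for user_id, movie_id, rating in ratings:
        if rating >= min_rating:
            preferred[user_id].append(movie_id)
    return {
        user: movies
        for user, movies in preferred.items()
        if len(movies) >= min_preferred
    }


# Tiny made-up example; the threshold is lowered so that one user survives.
toy_ratings = [(1, 10, 5), (1, 11, 4), (1, 12, 2), (2, 10, 5)]
docs = build_user_documents(toy_ratings, min_preferred=2)
```

The resulting dictionary plays the role of the corpus: each value is the list of "words" (preferred movies) for one "document" (user).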

*Cluster 1*

The first cluster appears to be composed mostly of children's and family movies. One exception is Freeway, an R-rated comedy, crime and thriller movie (http://www.imdb.com/title/tt0116361/?ref_=fn_al_tt_1). In addition, GoldenEye (http://www.imdb.com/title/tt0113189/?ref_=fn_al_tt_1) is a James Bond movie that does not seem to be very similar to the other movies in this cluster. Freeway and GoldenEye are probably enjoyed by similar viewers, so it is possible that this cluster is the result of parents who watch a lot of children's movies but also enjoy action movies.

*Cluster 2*

This cluster has fewer children's movies than Cluster 1. There are a lot of drama and thriller movies in this cluster. These could be rated by users who enjoy those types of movies. 

*Cluster 3*

This cluster also consists of many children's movies. The movies that are not children's movies are the comedies Ed's Next Move, Bottle Rocket and That Thing You Do! This could be another instance of parents rating movies, but instead of action movies they enjoy watching comedies when their children are not around.

*Cluster 4*

Cluster 4 has only one children's movie but contains many comedies. These could be the movies enjoyed by users who like humorous movies.

*Cluster 5*

Cluster 5 also has a lot of children's movies. 


**V.	Further Research**

One area in which we would like to expand our research and improve our project is the approximation of the posterior. In this project, variational inference was used to estimate the intractable posterior distribution. However, other methods can be used to estimate the posterior, including Gibbs sampling and Metropolis-Hastings. It is possible that these approximations might increase the speed of the implementation.

Another area to expand on is the determination of the number of topics. This is an ongoing area of debate and research. The R implementation of LDA provides different measures to determine the number of topics, and these are something that we can look into adding to our code.

To improve the topic definitions, we want to investigate weighting terms by the number of times that they appear in the documents. Many of the results in this report contain topics that share the same words. It makes sense that the same word may appear in documents on similar topics, such as American historical documents, but if a word appears many times in each document, that word will not be very helpful in defining topics. Another way to improve topic definition would be to implement smoothing in the algorithm. If a word occurs only once in a corpus, it will have a very small probability. However, this word could be important in defining the topic for a particular document. Smoothing would ensure that every word is assigned a positive probability.
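One simple form of smoothing is additive smoothing of the topic-word counts, which equals the posterior mean under a symmetric Dirichlet prior with parameter eta on each topic's word distribution. The sketch below illustrates that idea only. It is a simplified stand-in for the smoothed model in Blei et al., which places the prior on $\beta$ inside the variational procedure, and the function name and default `eta` are assumptions.

```python
def smoothed_topic_word_probs(counts, eta=0.01):
    """Convert topic-word counts to probabilities with additive smoothing.

    counts[i][j] is the (expected) number of times word j is assigned to
    topic i. Adding eta to every cell keeps all probabilities positive.
    """
    smoothed = []
    for row in counts:
        total = sum(row) + eta * len(row)
        smoothed.append([(c + eta) / total for c in row])
    return smoothed


# A word with zero count in a topic still receives positive probability.
probs = smoothed_topic_word_probs([[3, 0, 1]], eta=1.0)
```

Larger values of `eta` pull each topic toward the uniform distribution over the vocabulary, so the parameter trades fidelity to the observed counts against protection for rare words.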

In addition, we want to continue to apply the algorithm to situations beyond text corpora. The movie data example in this paper is one example of how LDA can be used for something beyond text data. The algorithm can be used for any problem that has the same structure as a text corpus, and we hope to explore more of these situations in the future.


**VI. References**

1. Colorado Reed, Latent Dirichlet Allocation: Towards a Deeper Understanding, January 2012, obphio.us/pdfs/lda_tutorial.pdf.
2. David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3, 2003, pg. 993-1022.
3. Internet Movie Database, www.imdb.com.
4. Max Sklar, Fast MLE Computation for the Dirichlet Multinomial, May 2014.
5. F. Maxwell Harper and Joseph A. Konstan, The MovieLens Datasets: History and Context, ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19, December 2015, 19 pages. DOI=http://dx.doi.org/10.1145/2827872
6. Thomas P. Minka, Estimating a Dirichlet Distribution, www.msr-waypoint.com.
