# Latent Dirichlet Allocation (LDA)

---

LDA is an **unsupervised learning** method.

LDA can model heterogeneity, which classical mixture models cannot: each document is not limited to a single topic but can exhibit multiple topics.

LDA is a probabilistic model, so the process is:


1. Data are assumed to be observed from a generative probabilistic process that includes hidden variables.
<br>
   --> In text, the hidden variables are the thematic structure.
<br> <br>
2. Infer the hidden structure using posterior inference
<br>
   --> What are the topics that describe this collection?
<br> <br>
3. Situate new data into the estimated model.
<br>
   --> How does a new document fit into the topic structure?

The generative model is illustrated in the picture below
* Each topic is a distribution over words
* Each document is a mixture of corpus-wide topics
* Each word is drawn from one of those topics
<img src = 'image/generative_model_lda.png'>
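
The three bullets above can be read as a sampling recipe. The following is a minimal sketch of that recipe using only the standard library. The function names, the toy vocabulary, and the Dirichlet parameters `alpha` and `eta` are illustrative assumptions, not part of the original notes.

```python
import random

def sample_dirichlet(params, rng):
    """Draw one sample from a Dirichlet by normalising Gamma draws."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, eta=0.5, seed=0):
    # alpha and eta are assumed hyperparameter values, chosen only for illustration.
    rng = random.Random(seed)
    # Each topic is a distribution over words.
    topics = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(n_topics)]
    proportions, docs = [], []
    for _ in range(n_docs):
        # Each document is a mixture of the corpus-wide topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            # Each word is drawn from one of those topics.
            z = rng.choices(range(n_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=topics[z])[0])
        proportions.append(theta)
        docs.append(words)
    return topics, proportions, docs

# Hypothetical toy vocabulary.
vocab = ["gene", "dna", "cell", "data", "model", "computer"]
topics, proportions, docs = generate_corpus(vocab, n_topics=2, n_docs=3, doc_len=8)
```

Only `docs` would be visible in practice; `topics`, `proportions`, and the per-word topic choices are the hidden variables.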



##### In reality, we only observe the documents. The rest of the structure consists of hidden variables.
##### Our goal is to infer the hidden variables.


Drawn as a graphical model, LDA looks like this:
<img src = 'image/lda_graphical_model.png'>

*the shaded node is the observed variable* <br>
*a plate denotes replication: the variables inside it are repeated*

This is the joint distribution of the observed and hidden random variables, which defines the LDA model:
<img src = 'image/lda_model_equation_2.png'>

So there are two Dirichlet distributions here: both beta (the topics) and theta (the per-document topic proportions) are Dirichlet-distributed.
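
Written out in the notation commonly used for LDA, the joint distribution can be sketched as below. The symbols are assumptions and may differ from those in the figure: `\beta_k` is topic k, `\theta_d` the topic proportions of document d, `z_{d,n}` the topic assignment of the n-th word in document d, `w_{d,n}` that observed word, and `\eta`, `\alpha` the Dirichlet hyperparameters.

```latex
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
  = \prod_{k=1}^{K} p(\beta_k)
    \prod_{d=1}^{D} p(\theta_d)
    \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\,
           p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right),
\qquad
\beta_k \sim \mathrm{Dirichlet}(\eta), \quad
\theta_d \sim \mathrm{Dirichlet}(\alpha)
```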

## Topic Modeling

Definitions of topic modeling:
* helps you automatically discover patterns in a corpus [1]
* algorithms that uncover the **hidden thematic structure** in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of texts [2]
* provides a simple way to analyze large volumes of **unlabeled** text. A "topic" consists of a **cluster of words** that frequently occur together [3]



Topic modeling is a form of soft (fuzzy) clustering: each document can belong to several clusters, each to a different degree.

<img src="image/type_of_clustering.png">

The flow of topic modeling is shown in the diagram below: we input a collection of text documents as the dataset and get topics as the output. The black box in between is the algorithm we use (LDA).

#### Diagram of topic modeling
<img src="image/topic_modeling_diagram2.png">

#### Algorithm:

1. Initialize parameters
   <br>
   such as: <br>
   * K (number of topics), which we must choose beforehand
   * number of iterations
   <br> <br>
2. Initialize topics randomly
   <br> 
   Each word in each document is assigned a topic at random
   <br>
   <img src="image/document_topic.png" width = "30%" height = "30%">
   <br>
3. Iterate
   <br>
   Resample the topic for each word (given all other words and their current topic assignments):
   replace the topic currently assigned to the word with a "better" topic by answering these 2 questions: <br>
   * Which topics occur in this document?
   * Which topics like the word X?
   <img src="image/document_topic_change.png" width = "35%" height = "35%">
   <br> <br>
4. Get result
   <br>
   We get the final **psi** and **phi** (the frequency of words and the distribution of topics)
   <br> <br>
5. Evaluate model
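
Steps 1 to 4 can be sketched in code. One common way to implement the resampling in step 3 is collapsed Gibbs sampling, which is what this minimal standard-library sketch assumes. The function name `lda_gibbs`, the smoothing values `alpha` and `eta`, and the toy documents are hypothetical, chosen only for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, eta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of token lists."""
    # Step 1: n_topics (K) and n_iter are chosen by the caller.
    # alpha and eta are assumed smoothing values, not taken from the notes.
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # topic counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)] # word counts per topic
    topic_total = [0] * n_topics                          # tokens assigned per topic

    # Step 2: assign a random topic to every word.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Step 3: resample each word's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, old = word_id[w], assignments[d][n]
                # Remove the word's current assignment from the counts.
                doc_topic[d][old] -= 1
                topic_word[old][v] -= 1
                topic_total[old] -= 1
                # "Which topics occur in this document?" times
                # "Which topics like this word?"
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + eta) / (topic_total[k] + n_words * eta)
                    for k in range(n_topics)
                ]
                new = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][n] = new
                doc_topic[d][new] += 1
                topic_word[new][v] += 1
                topic_total[new] += 1

    # Step 4: the final count tables summarise the result.
    return vocab, doc_topic, topic_word

docs = [
    ["gene", "dna", "cell", "gene", "dna"],
    ["data", "model", "computer", "data", "model"],
    ["gene", "cell", "data", "model", "dna"],
]
vocab, doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
```

Normalising each row of `doc_topic` gives a per-document topic mixture, and normalising each row of `topic_word` gives a per-topic word distribution.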


<img src="image/topic_modeling_diagram3.png">

#### Evaluation 

1. Human-in-the-loop
   * Word Intrusion [5]
   <br>
     For each topic, take some of its words at random and replace one of them with an intruder word. Check whether a human can tell which word is the intruder. If so, the topic is good.
   <br> <br>      
   * Topic Intrusion 
   <br>
     Add a topic that does not belong to a document to that document's topics, and ask a user to find which topic is not represented
          
     <img src="image/human_in_the_loop.png" width = "50%" height = "50%">          
     <br>
2. Metrics [4]
    
   * Cosine similarity
       Split each document into 2 parts and check:
       - intradistance (the two halves of the same document should have similarly distributed topics)
       - interdistance (the first half of one document and the second half of a different document should have mostly dissimilar topics)
       <br> <br>
   
   * Size (# of tokens assigned)
   * Within-doc rank
   * Similarity to corpus-wide distribution
   * Locally-frequent words
   * Co-doc Coherence
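
The cosine-similarity check can be illustrated with a small sketch. The topic distributions below are made-up numbers for two hypothetical documents, each split into halves; in practice they would come from the fitted model.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

# Hypothetical topic distributions over 3 topics (illustrative values only).
doc_a_first_half = [0.70, 0.20, 0.10]
doc_a_second_half = [0.60, 0.30, 0.10]
doc_b_second_half = [0.05, 0.15, 0.80]

# Intradistance: halves of the same document should be similar.
intra = cosine_similarity(doc_a_first_half, doc_a_second_half)
# Interdistance: halves of different documents should be mostly dissimilar.
inter = cosine_similarity(doc_a_first_half, doc_b_second_half)
```

With these numbers `intra` is close to 1 while `inter` is much lower, which is the pattern a good model should show.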

#### Python libraries

Gensim: https://radimrehurek.com/gensim/
<br>
Graphlab: https://dato.com/products/create/docs/generated/graphlab.topic_model.create.html
<br>
lda: http://pythonhosted.org//lda/

### Pipeline
<img src="image/pipeline.png">   

The words in the text documents have to be preprocessed first. There are two kinds of **PREPROCESSING**:
1. Tokenization
   How we split the documents into tokens
   * simple : just word by word
   * collocations : combine 2 or 3 words together when the combination is more meaningful
   * entities : keep only tokens of a particular kind, e.g. only adjectives
   * combination : a mix of the above
   * lemmatization : group the variant forms of the same word
   <br> <br>
2. Stopwords
   We may want to omit stopwords, which can be
   * language generic : a, above, across, after, all, the ...
   * domain specific : material, size, advance ..
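
A minimal sketch of these two preprocessing steps, using only the standard library. The stopword sets reuse the examples listed above; the helper names, the sample sentence, and the `("new", "york")` collocation are illustrative assumptions.

```python
import re

# Stopword examples taken from the lists above.
GENERIC_STOPWORDS = {"a", "above", "across", "after", "all", "the"}
DOMAIN_STOPWORDS = {"material", "size", "advance"}

def tokenize(text):
    """Simple tokenization: lower-case the text and split it word by word."""
    return re.findall(r"[a-z]+", text.lower())

def add_collocations(tokens, collocations):
    """Merge known two-word collocations into single tokens."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in collocations:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]

# Hypothetical input sentence and collocation list.
text = "The size of the New York office after all"
tokens = tokenize(text)
tokens = add_collocations(tokens, {("new", "york")})
tokens = remove_stopwords(tokens, GENERIC_STOPWORDS | DOMAIN_STOPWORDS)
# tokens is now ["of", "new_york", "office"]
```

Lemmatization and entity filtering are left out here because they usually rely on a language-specific resource or library.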
  

All the preprocessed data is then converted into a **VECTOR SPACE** representation
<img src = "image/vector_space.png">
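
One common vector-space representation is a bag of words, where each document becomes a list of `(word id, count)` pairs. The sketch below assumes that representation; the helper names and toy documents are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map every distinct token to an integer id (alphabetical order)."""
    vocab = sorted({token for doc in docs for token in doc})
    return {token: index for index, token in enumerate(vocab)}

def to_bag_of_words(doc, vocabulary):
    """Turn a token list into sorted (word id, count) pairs."""
    counts = Counter(token for token in doc if token in vocabulary)
    return sorted((vocabulary[token], n) for token, n in counts.items())

# Hypothetical preprocessed documents.
docs = [["gene", "dna", "gene"], ["data", "model", "data", "dna"]]
vocabulary = build_vocabulary(docs)
vectors = [to_bag_of_words(doc, vocabulary) for doc in docs]
# vocabulary: {"data": 0, "dna": 1, "gene": 2, "model": 3}
# vectors:    [[(1, 1), (2, 2)], [(0, 2), (1, 1), (3, 1)]]
```

Word order is discarded in this representation; only how often each word occurs in each document is kept.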

To get deeper insight, we can **VISUALIZE** the model with a library called LDAvis (pyLDAvis in Python)
<img src = "image/LDAVis.png">
https://github.com/bmabey/pyLDAvis

SOURCES:

[1] https://www.cs.jhu.edu/~jason/465/PowerPoint/lect-topicmodels-mpaul.pdf
<br>
[2] http://www.cs.princeton.edu/~blei/topicmodeling.html
<br>
[3] http://mallet.cs.umass.edu/topics.php
<br>
[4] http://mimno.infosci.cornell.edu/slides/details.pdf
<br>
[5] http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf
