# Latent Dirichlet Allocation

In this course we will delve into Latent Dirichlet Allocation (LDA), which is a more powerful model than the probabilistic Latent Semantic Analysis (pLSA) that we covered last time.

![The generative story of LDA](./my_lda.png)

The model is given text and, under the assumptions of the generative story described by the figure above, it returns the per-topic word distributions (left figure) and the per-document topic distributions.
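To make the generative story concrete, here is a minimal sketch that samples a toy corpus by following it step by step. The function name `generate_corpus`, its parameters, and the toy sizes are illustrative assumptions rather than part of the course material, and symmetric Dirichlet priors are assumed.

```python
import numpy as np

def generate_corpus(n_docs=3, doc_len=8, K=2, V=6, alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus by following the LDA generative story (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    # One word distribution per topic: phi[k] ~ Dirichlet(beta)
    phi = rng.dirichlet(np.full(V, beta), size=K)
    docs, thetas = [], []
    for _ in range(n_docs):
        # One topic distribution per document: theta ~ Dirichlet(alpha)
        theta = rng.dirichlet(np.full(K, alpha))
        # For each word position: draw a topic, then draw a word from that topic
        topics = rng.choice(K, size=doc_len, p=theta)
        words = [int(rng.choice(V, p=phi[z])) for z in topics]
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas), phi

toy_docs, toy_theta, toy_phi = generate_corpus()
print(toy_docs)
```

Inference goes in the opposite direction: only the word ids are observed, and the distributions that this sketch samples up front have to be estimated.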

In this session our goal is to code LDA. The model should be called with a number of topics and the values of the hyper-parameters $\alpha$ and $\beta$ as arguments. Below, you can find the equations and the structure of a class `LDA` that you are required to populate.


## Pseudocode for Gibbs Sampling updates

![Pseudocode for Gibbs Sampling LDA](./lda_pseudocode.png)

Note that the counts stand for: 

* $n_{d,k}$ : number of words in document $d$ that are assigned to topic $k$
* $n_{k,w}$ : number of times word $w$ is assigned to topic $k$
* $n_k$ : total number of words assigned to topic $k$
* $z_{m,n}$ : topic assigned to the $n$-th word of document $m$

The pseudocode is copied from http://u.cs.biu.ac.il/~89-680/darling-lda.pdf.
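As a rough illustration of a single update, the sketch below resamples the topic of one word occurrence using the standard collapsed Gibbs conditional $p(z = k \mid \cdot) \propto (n_{d,k} + \alpha)\,\frac{n_{k,w} + \beta}{n_k + V\beta}$. The helper `resample_topic` and the toy counts are assumptions made for this example. The counters here hold raw counts, whereas the `lda_gibbs_sampling` class further down initializes its counters with the smoothing terms already added.

```python
import numpy as np

def resample_topic(d, w, old_k, n_dk, n_kw, n_k, alpha, beta, rng):
    """Resample the topic of one occurrence of word w in document d (hypothetical helper)."""
    V = n_kw.shape[1]
    # Remove the current assignment from all three counters
    n_dk[d, old_k] -= 1
    n_kw[old_k, w] -= 1
    n_k[old_k] -= 1
    # Unnormalized conditional probability of each of the K topics
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    new_k = int(rng.choice(len(p), p=p / p.sum()))
    # Record the new assignment
    n_dk[d, new_k] += 1
    n_kw[new_k, w] += 1
    n_k[new_k] += 1
    return new_k

# Toy state: 1 document with 3 words, 2 topics, 3 word types (raw counts, no smoothing)
rng = np.random.RandomState(0)
n_dk = np.array([[2.0, 1.0]])
n_kw = np.array([[1.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
n_k = np.array([2.0, 1.0])
new_k = resample_topic(0, 2, 1, n_dk, n_kw, n_k, alpha=0.1, beta=0.1, rng=rng)
print(new_k, n_dk, n_k)
```

Whatever topic is drawn, the totals are preserved: each counter still sums to the number of words in the corpus.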


## Todos: 
1. Load data `wikipedia_data.txt`
2. Generate the correct representation to be fed to LDA (as described in the `__init__()` of `lda_gibbs_sampling` below)
3. Train LDA for several iterations
4. Visualize top words for a topic



In [1]:
documents = open('wikipedia_data.txt').read().splitlines()
print(len(documents))
vocabulary = {}
stopwords = set( open('stop_words.txt').read().splitlines()) 
        

2825


In [22]:
documents_lda_format = []

for document in documents:
    document_lda_format = []
    words = document.split()
    for word in words:
        word = word.lower()
        
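        # Keep alphabetic tokens only and drop stop words
        if word.isalpha() and word not in stopwords:
            try:
                document_lda_format.append(vocabulary[word])
            except KeyError:
                # First occurrence of this word: give it the next free integer id
                vocabulary[word] = len(vocabulary)
                document_lda_format.append(vocabulary[word])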
    documents_lda_format.append(document_lda_format)

In [23]:
len(documents), len(documents_lda_format)

(2825, 2825)
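The encoding step above can be checked on a tiny example. This is only a sketch: the two toy sentences and the one-word stop list are made up, and `dict.setdefault` is used as a compact alternative to the `try`/`except` lookup.

```python
toy_documents = ["The cat sat", "the dog sat down"]
toy_stopwords = {"the"}
toy_vocabulary = {}
toy_encoded = []

for document in toy_documents:
    ids = []
    for word in document.lower().split():
        if word.isalpha() and word not in toy_stopwords:
            # setdefault returns the existing id, or stores and returns the next free one
            ids.append(toy_vocabulary.setdefault(word, len(toy_vocabulary)))
    toy_encoded.append(ids)

print(toy_encoded)     # [[0, 1], [2, 1, 3]]
print(toy_vocabulary)  # {'cat': 0, 'sat': 1, 'dog': 2, 'down': 3}
```

The word "sat" receives the same id in both documents, which is what lets the sampler share word counts across documents.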

In [26]:
import numpy as np
class lda_gibbs_sampling:
    def __init__(self, K=25, alpha=0.5, beta=0.5, docs= None, V= None):
        """Initialization function.
        K : number of topics, integer
        alpha: Dirichlet hyper-parameter of the per-document topic distributions (default 0.5)
        beta: Dirichlet hyper-parameter of the per-topic word distributions (default 0.5)
        docs: documents as lists of integers (word ids): [[1,2,3,4,5,1,2], [2,4,5,1]] is two documents with 7 and 4 words respectively.
        V: vocabulary size
        The goal of the function is to initialize the internal variables of the class: those provided as arguments and the counter variables described in the pseudocode.
        """

        self.K = K
        self.alpha = alpha # parameter of topics prior
        self.beta = beta   # parameter of words prior
        self.docs = docs # a list of lists, each inner list contains the indexes of the words in a doc, e.g.: [[1,2,3],[2,3,5,8,7],[1, 5, 9, 10 ,2, 5]]
        self.V = V # how many different words in the vocabulary i.e., the number of the features of the corpus
        # Definition of the counters 
        self.z_m_n = [] # topic assignments for each of the N words in the corpus. N: total number of words in the corpus (not the vocabulary size).
        self.n_m_z = np.zeros((len(self.docs), K)) + alpha     # |docs| x K: number of words assigned to topic z in document m (alpha smoothing already added)
        self.n_z_t = np.zeros((K, V)) + beta # K x |V|: number of times a word t is assigned to a topic z (beta smoothing already added)
        self.n_z = np.zeros(K) + V * beta    # (K,): overall number of words assigned to a topic z (V * beta smoothing already added)

        self.N = 0
        for m, doc in enumerate(docs):         # Initialization of the data structures I need.
            self.N += len(doc)                 # to find the size of the corpus 
            z_n = []
            for t in doc:
                z = np.random.randint(0, K) # Randomly assign a topic to a word. Recall, topics have ids 0 ... K-1. randint returns integers in [0, K).
                z_n.append(z)                  # Keep track of the topic assigned 
                self.n_m_z[m, z] += 1          # increase the number of words assigned to topic z in the m doc.
                self.n_z_t[z, t] += 1          #  .... number of times a word is assigned to this particular topic
                self.n_z[z] += 1               # increase the counter of words assigned to z topic
            self.z_m_n.append(np.array(z_n))   # store the topic assignments of the words of document m

    def inference(self):
        """    The learning process. Code only one iteration over the data. 
               In the main function a loop will be calling this function.     
        """
        for m, doc in enumerate(self.docs):
            z_n, n_m_z = self.z_m_n[m], self.n_m_z[m]
            for n, t in enumerate(doc):
                z = z_n[n]
                n_m_z[z] -= 1
                self.n_z_t[z, t] -= 1
                self.n_z[z] -= 1
                # sampling topic new_z for t
                # The counters already contain the smoothing terms (alpha, beta, V * beta) added in __init__,
                # so they must not be added a second time here.
                p_z = self.n_z_t[:, t] * n_m_z / self.n_z                   # An array with one entry per topic
                new_z = np.random.multinomial(1, p_z / p_z.sum()).argmax()   # One multinomial draw from the normalized distribution over topics; argmax returns the index of the selected topic.
                # set z the new topic and increment counters
                z_n[n] = new_z
                n_m_z[new_z] += 1
                self.n_z_t[new_z, t] += 1
                self.n_z[new_z] += 1
                
    def get_topic_dist(self):
        """Returns the per document topic distribution for each of the documents. 
        This is a matrix of dimensions (D documents) x (K topics)
        """
        return self.n_m_z / self.n_m_z.sum(axis=1)[:, np.newaxis]
    
    def get_word_distributions(self):
        """Returns the topic-word distribution.
        This is a matrix of dimensions (K topics) x (V words).
        """
        # Normalize each row (one row per topic) by the smoothed number of words assigned to that topic to obtain probabilities.
        # np.newaxis turns n_z into a (K, 1) column so that it broadcasts across the V columns.
        return self.n_z_t / self.n_z[:, np.newaxis]

    
    


In [27]:
lda = lda_gibbs_sampling( K=10, alpha=0.1, beta=0.1, docs= documents_lda_format, V=len(vocabulary))

In [28]:
for i in range(10):
    lda.inference()
    print(i)

0
1
2
3
4
5
6
7
8
9


In [32]:
phi = lda.get_word_distributions()
inverse_vocabulary = {val: key for key, val in vocabulary.items()}  # .items() works in both Python 2 and Python 3
top10 = np.argsort(phi,axis=1)[:, -10:]
for i in top10:
    print([inverse_vocabulary[x] for x in i])


['side', 'new', 'south', 'street', 'west', 'two', 'line', 'north', 'located', 'station']
['game', 'games', 'first', 'also', 'major', 'new', 'boston', 'american', 'league', 'played']
['october', 'november', 'killed', 'john', 'born', 'states', 'united', 'american', 'mayor', 'served']
['aircraft', 'air', 'world', 'corps', 'medal', 'war', 'squadron', 'states', 'united', 'marine']
['also', 'candidates', 'members', 'formed', 'group', 'album', 'music', 'rock', 'released', 'band']
['known', 'including', 'used', 'products', 'based', 'software', 'first', 'founded', 'also', 'company']
['united', 'new', 'born', 'oregon', 'state', 'school', 'also', 'american', 'served', 'university']
['league', 'team', 'york', 'college', 'one', 'national', 'new', 'played', 'football', 'american']
['general', 'assembly', 'legislative', 'province', 'created', 'election', 'riding', 'electoral', 'provincial', 'district']
['born', 'won', 'known', 'american', 'film', 'also', 'television', 'new', 'america', 'miss']
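Because `np.argsort` sorts in ascending order, each list above ends with the most probable word of its topic. If a most-probable-first listing is preferred, the columns can be reversed before slicing. The sketch below shows this on a made-up 2 x 5 matrix; `top_words`, `toy_phi` and `toy_words` are illustrative names, not part of the notebook.

```python
import numpy as np

toy_words = ["station", "street", "band", "album", "league"]
toy_phi = np.array([
    [0.40, 0.30, 0.10, 0.10, 0.10],   # topic 0: places
    [0.05, 0.05, 0.45, 0.35, 0.10],   # topic 1: music
])

def top_words(phi, id_to_word, n=2):
    """Return the n most probable words of each topic, most probable first."""
    # argsort is ascending, so reverse each row before keeping the first n columns
    order = np.argsort(phi, axis=1)[:, ::-1][:, :n]
    return [[id_to_word[i] for i in row] for row in order]

print(top_words(toy_phi, toy_words))  # [['station', 'street'], ['band', 'album']]
```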
