<h2>Latent Dirichlet Allocation</h2>

Implementation by Valentin TASSEL

Latent Dirichlet allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora.

The goal is to find short descriptions of the members of a collection that enable efficient
processing of large collections while preserving the essential statistical relationships that are useful
for basic tasks such as classification, novelty detection, summarization, and similarity and relevance
judgments.

<h3>Notation and terminology</h3>


We define the following terms:
- A <i>word</i> is the basic unit of discrete data, defined to be an item from a vocabulary indexed by $\{1,\dots,V\}$. We represent words using unit-basis vectors that have a single component equal to one and all other components equal to zero. Thus, using superscripts to denote components, the $v$th word in the vocabulary is represented by a $V$-vector $w$ such that $w^v = 1$ and $w^u = 0$ for $u \neq v$.
- A <i>document</i> is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2,\dots,w_N)$, where $w_n$ is the $n$th word in the sequence.
- A <i>corpus</i> is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1,\mathbf{w}_2,\dots,\mathbf{w}_M\}$.
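
As a small illustration of this notation, the sketch below builds unit-basis vectors for a toy vocabulary. The vocabulary size, the helper `one_hot` and the example documents are hypothetical, and indices are 0-based as usual in Python, whereas the text counts from 1.

```python
import numpy as np

V = 5  # hypothetical toy vocabulary size

def one_hot(v, size):
    """Return the unit-basis vector for the word with 0-based index v."""
    w = np.zeros(size, dtype=int)
    w[v] = 1
    return w

# A document is a sequence of N words; here N = 3, stored as vocabulary indices.
document = [2, 0, 2]
document_vectors = np.stack([one_hot(v, V) for v in document])

# A corpus is a collection of M documents; here M = 2.
corpus_indices = [document, [4, 1]]

print(document_vectors)
```

In practice it is usually enough to store the indices, as in `corpus_indices`, because each one-hot vector is fully determined by its index.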


The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

We define two families of distributions:
- $\theta$ is the distribution of topics over the documents. Specifically, $\theta_i$ is the topic distribution for document $i$.
- $\varphi$ is the distribution of words over the topics. Specifically, $\varphi_k$ is the word distribution for topic $k$.
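
To make the roles of $\theta_i$ and $\varphi_k$ concrete, here is a minimal sketch of how one document could be generated under this model. It assumes symmetric Dirichlet priors, as in the standard LDA formulation; the sizes and the values of `alpha` and `beta` are hypothetical toy choices, not the ones used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

T, V, N = 3, 8, 12       # hypothetical numbers of topics, vocabulary words, words in the document
alpha, beta = 0.5, 0.5   # hypothetical symmetric Dirichlet parameters

# One word distribution per topic: each row of phi sums to 1.
phi = rng.dirichlet(np.full(V, beta), size=T)

# Topic distribution for a single document i.
theta_i = rng.dirichlet(np.full(T, alpha))

# For each word position, draw a topic from theta_i, then a word from that topic.
topics = rng.choice(T, size=N, p=theta_i)
words = np.array([rng.choice(V, p=phi[k]) for k in topics])

print(topics)
print(words)
```

Inference goes in the opposite direction: only `words` is observed, and the topic assignments, $\theta$ and $\varphi$ have to be estimated.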

We use Gibbs sampling to estimate $\theta$ and $\varphi$ on a Reddit News dataset.

<h3>Importing libraries and defining the vocabulary</h3>
<br>
We use a Reddit News dataset.

In [35]:
import numpy as np
import matplotlib.pyplot as plt
import csv
from tqdm.notebook import tqdm

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

In [50]:
n_samples = 10000

In [51]:
with open('data/RedditNews.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)


In [54]:
corpus = [x[1] for x in data]
corpus = corpus[:n_samples] 

In [55]:
len(corpus)

10000

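We fit the vocabulary on the news corpus, keeping at most 10,000 terms that appear in at least 2 documents and in at most 95% of them, and dropping English stop words.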

In [57]:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=10000,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(corpus)


In [59]:
vocabulary = tf_vectorizer.vocabulary_

In [68]:
dict(list(vocabulary.items())[0:10]) 

{'news': 5392,
 '117': 26,
 'year': 8821,
 'old': 5540,
 'woman': 8759,
 'mexico': 5070,
 'city': 1637,
 'finally': 3255,
 'received': 6483,
 'birth': 1069}
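
Since `vocabulary_` maps each word to its column index, reading topics back as words requires the inverse mapping. The sketch below shows one way to build it on a small excerpt of the output above; the names `vocabulary_sample` and `index_to_word` are hypothetical.

```python
# Toy excerpt of a word -> column index vocabulary, taken from the output above.
vocabulary_sample = {'news': 5392, 'year': 8821, 'city': 1637}

# Invert the mapping to look up a word from its column index.
index_to_word = {idx: word for word, idx in vocabulary_sample.items()}

print(index_to_word[1637])
```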

In [69]:
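# Turn each row of the document-term matrix into a list of word indices,
# repeating an index as many times as the word occurs in the document.
docs = []
for row in tf.toarray():
    # Vocabulary indices of the words present in this document
    present_words = np.where(row != 0)[0].tolist()
    present_words_with_count = []
    for word_idx in present_words:
        for count in range(row[word_idx]):
            present_words_with_count.append(word_idx)
    docs.append(present_words_with_count)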
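
The same expansion can be written more compactly with `np.repeat`. The sketch below is an alternative shown on a hypothetical toy count matrix, not the code used in this notebook.

```python
import numpy as np

# Toy document-term matrix: 2 documents, 3 vocabulary words.
counts = np.array([[0, 2, 1],
                   [1, 0, 0]])

# Repeat each column index as many times as its count in the row.
docs_toy = [np.repeat(np.arange(row.size), row).tolist() for row in counts]

print(docs_toy)  # [[1, 1, 2], [0]]
```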


<h3>Implement LDA with Gibbs sampling</h3>

- D : Number of documents (written $M$ in the notation above)
- V : Size of the vocabulary
- T : Number of topics
<br>
<br>
Parameters
- Alpha : The parameter of the Dirichlet prior on the per-document topic distributions
- Beta : The parameter of the Dirichlet prior on the per-topic word distributions

In [71]:
D = len(docs)
V = len(vocabulary)
T = 10

beta = 1/T
alpha = 1/T

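We sample a new topic $z_{ij}$ for the word $w_{ij}$ (the $j$th word of document $i$) from its conditional distribution given all other topic assignments:
<br>
$P(z_{ij} = k \mid z_{-ij}, w) \propto \frac{\theta_{ik} + \alpha}{N_i + \alpha T} \cdot \frac{\Phi_{k w_{ij}} + \beta}{\sum_{v=1}^{V}\Phi_{kv} + \beta V}$
<br>
Here $z_{-ij}$ denotes every topic assignment except $z_{ij}$. In this formula $\theta_{ik}$ and $\Phi_{kv}$ are counts rather than normalized distributions: $\theta_{ik}$ is the number of words in document $i$ assigned to topic $k$, $\Phi_{kv}$ is the number of times word $v$ is assigned to topic $k$, and $N_i$ is the number of words in document $i$, all computed without the current word.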
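
As a worked example, the sketch below evaluates this conditional distribution for one word. It assumes that $\theta_{ik}$ and $\Phi_{kw}$ are counts excluding the current word and that $\alpha$ and $\beta$ enter additively as smoothing terms; all numbers are hypothetical.

```python
import numpy as np

T, V = 3, 4             # toy numbers of topics and vocabulary words
alpha, beta = 0.1, 0.1  # hypothetical Dirichlet parameters

# Counts with the current word's assignment already removed.
theta_i = np.array([2.0, 0.0, 1.0])     # words of document i assigned to each topic
phi = np.array([[3.0, 0.0, 1.0, 0.0],   # phi[k, w]: times word w is assigned to topic k
                [0.0, 2.0, 0.0, 1.0],
                [1.0, 1.0, 1.0, 0.0]])
w = 2                                   # vocabulary index of the current word

# First factor: how much document i already uses each topic.
doc_term = (theta_i + alpha) / (theta_i.sum() + alpha * T)
# Second factor: how much each topic already uses word w.
word_term = (phi[:, w] + beta) / (phi.sum(axis=1) + beta * V)

p = doc_term * word_term
p /= p.sum()  # normalise over the T topics

print(p)
```

Topic 0 comes out as the most likely here because it is frequent in the document and has already generated word `w`; topic 1 is the least likely on both counts.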

In [84]:
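z_d_n = [[0 for _ in range(len(d))] for d in docs]  # z_d_n[d][n]: topic of the n-th word of document d
theta_d_z = np.zeros((D, T))  # document-topic counts, from which theta is estimated
phi_z_w = np.zeros((T, V))    # topic-word counts, from which phi is estimated
n_d = np.zeros(D)             # number of words in each document
n_z = np.zeros(T)             # number of words assigned to each topic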
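
From these arrays, one possible way to run the sampler is sketched below. It is a minimal collapsed Gibbs sampler written for illustration, not necessarily the implementation this notebook intended: the function `gibbs_lda`, its parameters and the toy documents are hypothetical. The array names follow the cell above, except that `n_d` is omitted because the document-length denominator is the same for every topic and cancels when normalizing.

```python
import numpy as np

def gibbs_lda(docs, V, T, alpha, beta, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word indices."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    theta_d_z = np.zeros((D, T))  # words of document d assigned to topic z
    phi_z_w = np.zeros((T, V))    # times word w is assigned to topic z
    n_z = np.zeros(T)             # total number of words assigned to topic z

    # Random initial assignment, recorded in the count arrays.
    z_d_n = [rng.integers(T, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            z = z_d_n[d][n]
            theta_d_z[d, z] += 1
            phi_z_w[z, w] += 1
            n_z[z] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = z_d_n[d][n]
                theta_d_z[d, z] -= 1
                phi_z_w[z, w] -= 1
                n_z[z] -= 1

                # Unnormalised conditional over topics for this word.
                p = (theta_d_z[d] + alpha) * (phi_z_w[:, w] + beta) / (n_z + beta * V)
                z = rng.choice(T, p=p / p.sum())

                # Record the new assignment.
                z_d_n[d][n] = z
                theta_d_z[d, z] += 1
                phi_z_w[z, w] += 1
                n_z[z] += 1
    return z_d_n, theta_d_z, phi_z_w

# Toy corpus: 3 documents over a vocabulary of 4 words.
toy_docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
V_toy, T_toy, alpha_toy, beta_toy = 4, 2, 0.5, 0.5
z, theta_counts, phi_counts = gibbs_lda(toy_docs, V_toy, T_toy, alpha_toy, beta_toy)

# Smoothed estimates of the distributions from the final counts.
theta_hat = (theta_counts + alpha_toy) / (theta_counts.sum(axis=1, keepdims=True) + alpha_toy * T_toy)
phi_hat = (phi_counts + beta_toy) / (phi_counts.sum(axis=1, keepdims=True) + beta_toy * V_toy)
```

Each row of `theta_hat` is an estimate of $\theta_i$ and each row of `phi_hat` an estimate of $\varphi_k$. On the full corpus, the same function could be called with `docs`, `V`, `T`, `alpha` and `beta` from the cells above, although the pure-Python loops would be slow for 10,000 documents.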

<h3>References</h3>


Tobias Sterbak - Latent Dirichlet Allocation
