# Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) formulates a hierachical model over collections of discrete data sets, for instance *words*, *documents* and *topics*. Following the [original paper](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) by David Blei:

1) every word of a document derives from a finite mixture of topics,

2) every topic is derived from an possibly infinite set of topics,

3) every document is consists of by several topics.

A *word* is the most basic unit of the data set, represented by a categorial variable and belonging to a vocabulary of size $W$. A *document*  $\mathbf{D}$ is a sequence of $N$ words, i.e. a $N$-dimensional vector of categorical variables.
A *corpus* $\mathbf{C}$ is a collection of $D$ documents, i.e. a binary $(D \times N)-$dimensional matrix. The goal of LDA is to infer topic probabilities for single words.

In this notebook we implement a Gibbs sampler for inferring the distribution over the parameters following Kevin Murphy's [book](https://mitpress.mit.edu/books/machine-learning-1) and the [original paper](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf).

LDA defines the following hierarchical Bayesian model for a corpus of documents:

\begin{align*}
\boldsymbol \pi_d & \sim \text{Dirichlet}_T(\alpha, \dots, \alpha) \\
\boldsymbol \phi_t & \sim \text{Dirichlet}_W(\gamma, \dots, \gamma) \\
z_{di} \mid \boldsymbol \pi_d & \sim \text{Categorical}(\boldsymbol \pi_d) \\
x_{di} \mid z_{di} = t & \sim \text{Categorical}(\boldsymbol \phi_t)
\end{align*}

The variables represent:
* $d$ indexes a document
* $T$ is the number of different topics
* $\boldsymbol \pi_d$ are probabilities for each topic in document $d$
* $t$ indexes a topic
* $W$ is the number of different words in the vocabulary
* $\boldsymbol \phi_t$ are probabilities for different words for topic $t$
* $i$ indexes the $i$th word in a document $d$
* $z_{di}$ is the latent topic of the $i$th word in document $d$
* $x_{di}$ is the $i$th word in document $d$

This inspires a very simple Gibbs sampler which we can use to infer the probabilities and the latent topics. Following Murphy's book:

1) Compute $P(z_{di} = t \mid \dots) \propto \exp \left( \log \pi_{dt} + \log \phi_{k, x_{di}} \right)$

2) Compute $P(\boldsymbol \pi_d \mid \dots) = \text{Dirichlet}\left(\{ \alpha_t + \sum_i \mathbb{I}(z_{di} = t )\} \right)$

3) Compute $P(\boldsymbol \phi_t \mid \dots) = \text{Dirichlet}\left(\{ \gamma_w + \sum_d \sum_i \mathbb{I}(x_{di} = w, z_{di} = t) \right \})$

In [53]:
n_w <- 5
words <- seq(5)
n_d <- 6
n_k <- 2
X <- matrix(c(0, 0, 1, 2, 2, 0, 0,
                      1, 1, 1, 0, 1, 2, 2, 2, 
                      rep(4, 5), 3, 3, 4 , 4, 4, 3, rep(4, 4)), ncol=5, byrow = T) + 1
X

0,1,2,3,4
1,1,2,3,3
1,1,2,2,2
1,2,3,3,3
5,5,5,5,5
4,4,5,5,5
4,5,5,5,5


In [35]:
alpha = 1
gamma = 1

In [7]:
library(e1071)
library(MCMCpack)

Loading required package: coda
Loading required package: MASS
##
## Markov Chain Monte Carlo Package (MCMCpack)
## Copyright (C) 2003-2018 Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
##
## Support provided by the U.S. National Science Foundation
## (Grants SES-0350646 and SES-0350613)
##


In [55]:
(Z = matrix(rdiscrete(n_d * n_w, c(0.5, 0.5)), n_d, n_w))

0,1,2,3,4
2,2,1,2,1
1,1,1,1,2
2,1,1,2,1
2,1,1,1,2
1,2,2,2,2
1,2,2,2,2


In [75]:
Pi <- matrix(0, n_d, n_k)
for (i in seq(n_d)) 
    Pi[i,] <- rdirichlet(1, alpha * rep(1, n_k))

In [78]:
Pi

0,1
0.65091572,0.3490843
0.12720843,0.8727916
0.8342422,0.1657578
0.05478778,0.9452122
0.13609949,0.8639005
0.89028534,0.1097147


In [76]:
B <- matrix(0, n_k, n_w)
for (i in seq(n_k))
    B[i,] <- rdirichlet(1, gamma * rep(1, n_w))

In [79]:
B

0,1,2,3,4
0.1789866,0.1825516,0.33164416,0.1551337,0.151684
0.258672,0.2663638,0.01744799,0.1545057,0.3030105


In [80]:
for (iter in seq(100)) {
    for (i in seq(n_d)) {
        for (l in seq(n_w)) {
            p_bar_il = exp(log(Pi[i, ]) + log(B[, X[i, l]]))
            p_il = p_bar_il / sum(p_bar_il)
            z_il = rdirichlet(1, p_il)
            Z[i, l] = which.max(z_il)
        }
    }
    
     for (i in seq(n_d)) {
        m <- sapply(seq(n_k), function(k) sum(Z[i, ] == k))
        Pi[i, ] <= rdirichlet(1, alpha + m)
    }
                    
    for (k in seq(n_k)) {
        n <- rep(0, n_w)
        for (v in seq(n_w)) {
            for (i in seq(n_d)) {
                for (l in seq(n_w)) {
                    n[v] <- n[v] + sum((X[i, l] == v) & (Z[i, l] == k))
                }
            }
        }
        B[k, ] <- rdirichlet(1, gamma + n)
    }
}

In [81]:
Pi

0,1
0.65091572,0.3490843
0.12720843,0.8727916
0.8342422,0.1657578
0.05478778,0.9452122
0.13609949,0.8639005
0.89028534,0.1097147


In [83]:
B

0,1,2,3,4
0.1461327,0.03898888,0.393484027,0.2640625,0.1573319
0.1958099,0.28953842,0.009634441,0.1345433,0.3704739


In [84]:
Z

0,1,2,3,4
1,1,2,1,1
2,2,2,2,2
1,1,1,1,1
2,2,2,2,2
1,2,2,2,2
1,1,1,1,1
