# 4. Modelling Document Collections

## 4.1. Introduction

* **Bag of Words:** represents a document by the frequency of occurrence of every distinct word, ignoring word order
* **Zipf's Law:** the frequency of any word is inversely proportional to its rank in the frequency table

><img src = 'images/image4_01.png'>
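
As a small illustration of the two definitions above, the following Python sketch builds a bag-of-words count table and lists the words by frequency rank. The tokeniser (lower-casing and whitespace splitting) and the toy sentence are assumptions made for this example, not part of the notes.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each distinct lower-cased token occurs (word order is discarded)."""
    return Counter(text.lower().split())

toy_text = "the cat sat on the mat the cat slept"  # hypothetical toy document
counts = bag_of_words(toy_text)

# Sort words by decreasing frequency; rank 1 is the most frequent word.
ranked = counts.most_common()
for rank, (word, freq) in enumerate(ranked, start=1):
    # Zipf's law says freq is roughly proportional to 1 / rank on a large corpus.
    print(rank, word, freq, rank * freq)
```

On a real corpus, Zipf's law predicts that `rank * freq` stays roughly constant; a nine-word toy text is far too small to show this.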

## 4.2. Discrete Binary Distributions

* **Discrete Distributions**

>||binary|multi-valued|
|-|-|-|
|**sequence**|binary categorical, $\pi^k(1-\pi)^{n-k}$|categorical, $\prod^m_{i=1} \pi_i^{k_i}$|
|**counts**|binomial|multinomial|

* **Bernoulli Distribution**

>$$p(X=1)=\pi \;\;\;,\;\;\; p(X=0)=1-\pi$$
>$$p(X=x|\pi) = \pi^x (1-\pi)^{1-x}$$

* **Binomial Distribution** ($k$ heads out of $n$ tosses)

>$$p(k|\pi, n) = \left( \begin{matrix} n \\ k \end{matrix} \right) \pi^k (1-\pi)^{n-k} \;\;\;\rightarrow\;\;\; \textbf{MLE: } \pi = \frac{k}{n}$$

>$$\left( \begin{matrix} n \\ k \end{matrix} \right) = \frac{n!}{k!(n-k)!}$$
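
The binomial formula and its maximum-likelihood estimate can be checked numerically. The sketch below is a minimal version using only the standard library; the values $n=10$, $k=7$ and the grid resolution are assumptions chosen for illustration.

```python
from math import comb

def binomial_pmf(k, n, pi):
    """p(k | pi, n) = C(n, k) * pi^k * (1 - pi)^(n - k)."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)

n, k = 10, 7          # assumed toy data: 7 heads in 10 tosses
pi_mle = k / n        # closed-form maximum-likelihood estimate

# Brute-force check: evaluate the likelihood on a grid and pick the maximiser.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda p: binomial_pmf(k, n, p))
print(pi_mle, best)
```

The grid search should land on the same value as $k/n$, up to the grid spacing.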

* **Beta Distribution**

>$$\text{Beta}(\pi|\alpha,\beta) = \frac{1}{B(\alpha,\beta)} \pi^{\alpha-1} (1-\pi)^{\beta-1} \;\;\;\rightarrow\;\;\; E(\pi) = \frac{\alpha}{\alpha+\beta}$$

>$$\frac{1}{B(\alpha,\beta)} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \;\;\;,\;\;\; \Gamma(\alpha)=\int^\infty_0 x^{\alpha-1} e^{-x} dx$$

* **Posterior** (Prior: Beta, Likelihood: Binomial)

>\begin{align}
p(\pi|\mathcal{D}) &= \frac{p(\pi|\alpha,\beta)p(\mathcal{D}|\pi)}{p(\mathcal{D})} \\
&\propto \text{Beta}(\pi|\alpha,\beta) \pi^k (1-\pi)^{n-k} \\
&\propto \text{Beta}(\pi|\alpha+k,\beta+(n-k))
\end{align}
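
The last step follows from collecting exponents of $\pi$ and $(1-\pi)$. A worked instance, with the values $\alpha=\beta=2$, $n=10$, $k=7$ assumed purely for illustration:

>$$\pi^{\alpha-1}(1-\pi)^{\beta-1} \cdot \pi^k (1-\pi)^{n-k} = \pi^{(\alpha+k)-1} (1-\pi)^{(\beta+n-k)-1}$$

>$$\text{Prior: } \text{Beta}(\pi|2,2) \;\;\;,\;\;\; k=7,\; n=10 \;\;\;\rightarrow\;\;\; p(\pi|\mathcal{D}) = \text{Beta}(\pi|9,5) \;\;\;,\;\;\; E(\pi|\mathcal{D}) = \frac{9}{14}$$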

* **Making Predictions**

>$$p(x_{next}=1 | \mathcal{D}) = \int p(x=1|\pi) p(\pi|\mathcal{D}) d\pi$$

>$$p(\pi > 0.5 | \mathcal{D}) = \int^1_{0.5} p(\pi'|\mathcal{D}) d\pi'$$
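
Because $p(x=1|\pi)=\pi$, the first integral is the posterior mean, $(\alpha+k)/(\alpha+\beta+n)$. The second integral has no equally simple form in general, so the sketch below estimates it by Monte Carlo using the standard-library `random.betavariate`. The function names, sample count and seed are illustrative assumptions.

```python
import random

def posterior_params(alpha, beta, k, n):
    """Beta prior plus k successes in n trials gives Beta(alpha + k, beta + n - k)."""
    return alpha + k, beta + (n - k)

def predictive_prob_one(alpha, beta, k, n):
    """p(x_next = 1 | D): the mean of the Beta posterior."""
    a, b = posterior_params(alpha, beta, k, n)
    return a / (a + b)

def prob_pi_greater_half(alpha, beta, k, n, num_samples=20000, seed=0):
    """Monte Carlo estimate of p(pi > 0.5 | D) from posterior samples."""
    rng = random.Random(seed)
    a, b = posterior_params(alpha, beta, k, n)
    hits = sum(rng.betavariate(a, b) > 0.5 for _ in range(num_samples))
    return hits / num_samples

print(predictive_prob_one(1, 1, 7, 10))   # uniform prior, 7 heads in 10 tosses
print(prob_pi_greater_half(1, 1, 7, 10))
```

A quadrature rule or a library incomplete-Beta function would give a more precise tail probability; sampling is used here only to keep the example dependency-free.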

## 4.3. Discrete Categorical Distribution

* **Categorical Distribution** (Multinomial with one trial)
* **Multinomial Distribution**

>$$p(\mathbf{k}|\boldsymbol{\pi},n) = \frac{n!}{k_1!...k_m!} \prod^m_{i=1} \pi_i^{k_i}$$

>$$\sum^m_{i=1} k_i = n \;\;\;,\;\;\; \sum^m_{i=1} \pi_i = 1$$

* **Dirichlet Distribution**

>$$\text{Dir}(\boldsymbol{\pi}|\alpha_1,...\alpha_m) = \frac{\Gamma (\sum^m_{i=1} \alpha_i)}{\prod^m_{i=1} \Gamma(\alpha_i)} \prod^m_{i=1} \pi_i^{\alpha_i - 1} = \frac{1}{B(\boldsymbol{\alpha})} \prod^m_{i=1} \pi_i^{\alpha_i - 1}$$

>$$E(\pi_j) = \frac{\alpha_j}{\sum^m_{i=1} \alpha_i}$$

>* $\boldsymbol{\alpha}=[\alpha_1,...,\alpha_m]^T$: shape parameters
>* $B(\boldsymbol{\alpha})$: multivariate Beta fn.
>* Symmetric Dirichlet distribution: $\alpha_i = \alpha, \forall i$ 
>  * `w=randg(alpha,D,1); bar(w/sum(w));`
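
The Octave/MATLAB one-liner above draws from a symmetric Dirichlet by normalising independent Gamma variates. A rough Python counterpart, using only the standard library, might look as follows; the function name and the chosen values of `alpha` and `dim` are assumptions for this sketch.

```python
import random

def sample_symmetric_dirichlet(alpha, dim, rng=random):
    """Draw one sample from Dir(alpha, ..., alpha) by normalising Gamma(alpha, 1) variates."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
sparse = sample_symmetric_dirichlet(0.1, 5, rng)   # small alpha: mass tends to sit on few entries
smooth = sample_symmetric_dirichlet(10.0, 5, rng)  # large alpha: entries tend to be near 1/dim
print(sparse)
print(smooth)
```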

## 4.4. Document Models

* **Simple Model**

><img src='images/image4_02.png' width = 200>

>* $w_{nd} \sim \text{Cat}(\boldsymbol{\beta})$, where $\boldsymbol{\beta}=[\beta_1,...,\beta_M]^T$
>* $w_{nd}$: the $n$-th word in the $d$-th document; $M$: number of distinct words in the vocabulary

* **Simple Model - MLE**

>$$\log p(\mathbf{w}|\boldsymbol{\beta}) = \sum^M_{m=1} c_m \log \beta_m$$

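>$$\text{Cost: } F = \sum^M_{m=1} c_m \log \beta_m + \lambda \left( 1-\sum^M_{m=1} \beta_m \right) \;\;\rightarrow\;\; \beta_m = \frac{c_m}{n}$$

>* $c_m$: number of occurrences of word $m$ in the collection, $n=\sum^M_{m=1} c_m$; $\lambda$: Lagrange multiplier enforcing $\sum^M_{m=1} \beta_m = 1$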
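
Setting $\partial F / \partial \beta_m = c_m/\beta_m - \lambda = 0$ gives $\beta_m = c_m/\lambda$, and the sum-to-one constraint then fixes $\lambda$ to the total word count. The sketch below computes this estimate for a toy word list and compares its log likelihood with that of a uniform $\boldsymbol{\beta}$; the helper names and the data are assumptions.

```python
from collections import Counter
from math import log

def fit_categorical_mle(words):
    """Maximum-likelihood word probabilities: relative frequencies c_m / n."""
    counts = Counter(words)
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()}

def log_likelihood(words, beta):
    """log p(w | beta) = sum_m c_m * log(beta_m)."""
    counts = Counter(words)
    return sum(c * log(beta[w]) for w, c in counts.items())

words = ["a", "b", "a", "c"]                 # hypothetical tiny corpus
beta_mle = fit_categorical_mle(words)
beta_uniform = {w: 1 / 3 for w in "abc"}
print(beta_mle)
print(log_likelihood(words, beta_mle), log_likelihood(words, beta_uniform))
```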

* **Mixture of Categoricals Model**

><img src='images/image4_03.png' width = 300>

>\begin{align}
p(\mathbf{w}|\boldsymbol{\theta},\boldsymbol{\beta}) &= \prod^D_{d=1} p(\mathbf{w}_d | \boldsymbol{\theta},\boldsymbol{\beta}) \\
&= \prod^D_{d=1} \sum^K_{k=1} p(\mathbf{w}_d,z_d=k|\boldsymbol{\theta},\boldsymbol{\beta}) \\
&= \prod^D_{d=1} \sum^K_{k=1} p(z_d=k|\boldsymbol{\theta})p(\mathbf{w}_d|z_d=k,\boldsymbol{\beta}_k) \\
&= \prod^D_{d=1} \sum^K_{k=1} p(z_d=k|\boldsymbol{\theta}) \prod^{N_d}_{n=1} p(w_{nd}|z_d=k,\boldsymbol{\beta}_k)
\end{align}

* **Mixture of Categoricals Model - EM**

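>$$F(\mathbf{R},\boldsymbol{\theta},\boldsymbol{\beta}) = \sum_{k,d} r_{kd} \left( \sum^M_{m=1} c_{md} \log \beta_{km} + \log \theta_k \right)$$

>* $r_{kd} = p(z_d=k|\mathbf{w}_d,\boldsymbol{\theta},\boldsymbol{\beta})$: responsibility of component $k$ for document $d$ (E-step); $c_{md}$: count of word $m$ in document $d$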

>$$\hat{\theta}_k = \underset{\theta_k}{\text{argmax}} \left[ F(\mathbf{R},\boldsymbol{\theta},\boldsymbol{\beta}) + \lambda \left( 1-\sum^K_{k'=1} \theta_{k'} \right) \right] = \frac{\sum^D_{d=1} r_{kd}}{D}$$

>$$\hat{\beta}_{km} = \underset{\beta_{km}}{\text{argmax}} \left[ F(\mathbf{R},\boldsymbol{\theta},\boldsymbol{\beta}) + \sum^K_{k'=1} \lambda_{k'} \left( 1-\sum^M_{m'=1} \beta_{k'm'} \right) \right] = \frac{\sum^D_{d=1} r_{kd} c_{md} }{\sum^M_{m'=1} \sum^D_{d=1} r_{kd} c_{m'd}}$$
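
The two updates above, alternated with an E-step that recomputes the responsibilities $r_{kd} \propto \theta_k \prod_m \beta_{km}^{c_{md}}$, can be sketched in plain Python as below. This is a minimal sketch, not a reference implementation: the function name, the optional `init_beta` argument, and the small smoothing constant `eps` (added only to avoid `log(0)`, so the M-step is very slightly regularised compared with the formulas above) are assumptions of this example.

```python
import math
import random

def em_mixture_of_categoricals(counts, K, num_iters=50, init_beta=None, eps=1e-3, seed=0):
    """counts[d][m] = c_md. Returns theta (K), beta (K x M) and R with R[d][k] = r_kd."""
    rng = random.Random(seed)
    D, M = len(counts), len(counts[0])
    theta = [1.0 / K] * K
    if init_beta is None:
        # Random strictly positive rows, each normalised to sum to one.
        beta = []
        for _ in range(K):
            row = [rng.random() + 0.1 for _ in range(M)]
            total = sum(row)
            beta.append([v / total for v in row])
    else:
        beta = [list(row) for row in init_beta]
    R = []
    for _ in range(num_iters):
        # E-step: responsibilities, computed in log space for numerical stability.
        R = []
        for d in range(D):
            log_r = [
                math.log(theta[k])
                + sum(counts[d][m] * math.log(beta[k][m]) for m in range(M))
                for k in range(K)
            ]
            top = max(log_r)
            weights = [math.exp(v - top) for v in log_r]
            total = sum(weights)
            R.append([w / total for w in weights])
        # M-step: theta_k from summed responsibilities (eps-smoothed, an assumption).
        theta = [(sum(R[d][k] for d in range(D)) + eps) / (D + K * eps) for k in range(K)]
        # M-step: beta_km from responsibility-weighted word counts (eps-smoothed).
        for k in range(K):
            num = [sum(R[d][k] * counts[d][m] for d in range(D)) + eps for m in range(M)]
            total = sum(num)
            beta[k] = [v / total for v in num]
    return theta, beta, R

# Hypothetical toy corpus: two documents use words 0-1, two use words 2-3.
toy_counts = [[5, 5, 0, 0], [6, 4, 0, 0], [0, 0, 5, 5], [0, 0, 4, 6]]
start = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
theta, beta, R = em_mixture_of_categoricals(toy_counts, K=2, init_beta=start)
print(theta)
print(R)
```

EM only finds a local optimum, so results with a random initialisation depend on the seed; the fixed `start` above is used to make the toy run predictable.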

## 4.5. Gibbs Sampling for Bayesian Mixture

* **Bayesian Mixture of Categoricals Model**
  * Returns **posterior** distributions of $\boldsymbol{\theta}$ and $\boldsymbol{\beta}$
  
><img src='images/image4_04.png' width = 550>


* **Latent Posterior**

>$$p(y_n|z_n=k,\boldsymbol{\beta}) = p(y_n|\beta_k) = p(y_n|\beta_{z_n}) \;\;\;,\;\;\; p(\boldsymbol{\beta}_k|\gamma) = \text{Dir}(\gamma)$$

>$$p(z_n = k | \boldsymbol{\theta}) = \theta_k \;\;\;,\;\;\; p(\boldsymbol{\theta}|\alpha) = \text{Dir}(\alpha)$$

>$$\therefore \;\;\; p(z_n = k | y_n, \boldsymbol{\theta}, \boldsymbol{\beta}) \propto p(z_n=k | \boldsymbol{\theta}) p(y_n | z_n=k, \boldsymbol{\beta}) \propto \theta_k p(y_n | \beta_k)$$

>* $\rightarrow$ Discrete distribution with $K$ possible outcomes

* **Gibbs Sampling**

>* Component parameters (the mixture aspect eliminated)

>$$p(\beta_k | \mathbf{y}, \mathbf{z}) \propto p(\beta_k) \prod_{n:z_n=k} p(y_n|\beta_k)$$

>* Latent allocations

>$$p(z_n = k | y_n, \boldsymbol{\theta}, \boldsymbol{\beta}) \propto \theta_k p(y_n|\beta_k)$$

>* Mixing proportions ($c_k$: counts for mixture $k$)

>$$p(\boldsymbol{\theta}|\mathbf{z}, \alpha) \propto p(\boldsymbol{\theta}|\alpha) p(\mathbf{z}|\boldsymbol{\theta}) \propto \text{Dir}(\boldsymbol{\theta}|\mathbf{c}+\boldsymbol{\alpha}) \;\;\;\rightarrow\;\;\; E(\theta_k|\mathbf{z}) = \frac{c_k + \alpha_k}{\sum^K_{j=1} (c_j + \alpha_j)}$$
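
The mixing-proportion step is the simplest of the three Gibbs updates: by Dirichlet-categorical conjugacy, the conditional for $\boldsymbol{\theta}$ is a Dirichlet whose parameters are the prior pseudo-counts plus the allocation counts $c_k$. A minimal sketch, assuming a symmetric prior with scalar `alpha` and hypothetical function names:

```python
import random

def allocation_counts(z, K):
    """c_k: number of data points currently allocated to component k."""
    counts = [0] * K
    for zn in z:
        counts[zn] += 1
    return counts

def sample_mixing_proportions(z, K, alpha, rng=random):
    """Draw theta ~ Dir(c_1 + alpha, ..., c_K + alpha) via normalised Gamma variates."""
    gammas = [rng.gammavariate(alpha + c, 1.0) for c in allocation_counts(z, K)]
    total = sum(gammas)
    return [g / total for g in gammas]

def posterior_mean(z, K, alpha):
    """E(theta_k | z) = (c_k + alpha) / sum_j (c_j + alpha)."""
    counts = allocation_counts(z, K)
    total = sum(c + alpha for c in counts)
    return [(c + alpha) / total for c in counts]

z = [0, 0, 1]                                   # hypothetical current allocations
theta_sample = sample_mixing_proportions(z, K=2, alpha=1.0, rng=random.Random(0))
print(theta_sample, posterior_mean(z, K=2, alpha=1.0))
```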

* **Collapsed Gibbs Sampler**

>* Marginalisation over $\boldsymbol{\theta}$ ($-n$: all except $n$)

>$$p(z_n=k|\mathbf{z}_{-n},\alpha) = \frac{\alpha+c_{-n,k}}{\sum^K_{j=1} (\alpha + c_{-n,j})}$$

>* Collapsed Gibbs Sampler for the latent assignments

>$$p(z_n=k|y_n,\mathbf{z}_{-n},\boldsymbol{\beta},\alpha) \propto p(y_n|\beta_k) \frac{\alpha + c_{-n,k}}{\sum^K_{j=1} (\alpha + c_{-n,j})}$$

>* $z_n$: conditionally independent given $\boldsymbol{\theta}$, but **dependent** once $\boldsymbol{\theta}$ is marginalised out
>* ***Rich get Richer*** property
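
The rich-get-richer effect can be read directly off the collapsed allocation probability: components that already hold more points receive proportionally more mass. A small sketch with hypothetical names and toy allocations:

```python
def collapsed_assignment_prior(z, n, K, alpha):
    """p(z_n = k | z_{-n}, alpha) for a symmetric Dir(alpha) prior, theta integrated out."""
    counts = [0] * K
    for i, zi in enumerate(z):
        if i != n:                      # leave out the point being resampled
            counts[zi] += 1
    total = sum(alpha + c for c in counts)
    return [(alpha + c) / total for c in counts]

z = [0, 0, 0, 1, 2, 0]                  # hypothetical current allocations
probs = collapsed_assignment_prior(z, n=5, K=3, alpha=1.0)
# Counts excluding point 5 are [3, 1, 1], so the weights are (1+3, 1+1, 1+1) / 8.
print(probs)  # [0.5, 0.25, 0.25]
```

In the full collapsed sampler these probabilities are multiplied by the likelihood term $p(y_n|\beta_k)$ before normalising.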

## 4.6. Latent Dirichlet Allocation for Topic Modelling

* **MoC - Generative Model**

>1. Draw a distribution $\boldsymbol{\theta}$ over $K$ topics from $\text{Dir}(\alpha)$
>2. For each topic $k$, draw a distribution $\boldsymbol{\beta}_k$ over words from $\text{Dir}(\gamma)$
>3. For each document $d$, draw a topic $z_d$ from $\text{Cat}(\boldsymbol{\theta})$
>4. For each document $d$, draw $N_d$ words $w_{nd}$ from $\text{Cat}(\boldsymbol{\beta}_{z_d})$
>  * $\rightarrow$ **However, the documents may span more than one topic**

* **LDA - Generative Model** (Now, every document has $\boldsymbol{\theta}_d$)

>1. For each document $d$, draw a distribution $\boldsymbol{\theta}_d$ over topics from $\text{Dir}(\alpha)$
>2. For each topic $k$, draw a distribution $\boldsymbol{\beta}_k$ over words from $\text{Dir}(\gamma)$
>3. Draw a topic $z_{nd}$ for the $n$-th word in document $d$ from $\text{Cat}(\boldsymbol{\theta}_d)$
>4. Draw word $w_{nd}$ from $\text{Cat}(\boldsymbol{\beta}_{z_{nd}})$
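
The four LDA steps above translate almost line by line into a sampler. The following is a sketch under assumed names and toy sizes (`D`, `K`, `M`, `doc_lengths`), with symmetric scalar `alpha` and `gamma`; it only generates synthetic data and performs no inference.

```python
import random

def sample_dirichlet(concentration, dim, rng):
    """Symmetric Dirichlet draw via normalised Gamma variates."""
    gammas = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_categorical(probs, rng):
    """Draw an index in range(len(probs)) with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_lda_corpus(D, K, M, doc_lengths, alpha, gamma, seed=0):
    rng = random.Random(seed)
    # Step 2: one word distribution beta_k per topic.
    beta = [sample_dirichlet(gamma, M, rng) for _ in range(K)]
    docs, topics, thetas = [], [], []
    for d in range(D):
        # Step 1: per-document topic proportions theta_d.
        theta_d = sample_dirichlet(alpha, K, rng)
        # Step 3: a topic z_nd for every word position.
        z_d = [sample_categorical(theta_d, rng) for _ in range(doc_lengths[d])]
        # Step 4: each word w_nd drawn from the word distribution of its topic.
        w_d = [sample_categorical(beta[k], rng) for k in z_d]
        thetas.append(theta_d)
        topics.append(z_d)
        docs.append(w_d)
    return docs, topics, thetas, beta

docs, topics, thetas, beta = generate_lda_corpus(
    D=3, K=2, M=6, doc_lengths=[8, 5, 10], alpha=0.5, gamma=0.5
)
print(docs)
```

Replacing `theta_d` with a single shared `theta` and drawing one topic per document would recover the mixture-of-categoricals process listed before it.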

* **LDA - Graphical Model**

><img src='images/image4_05.png' width=400>

>* Each **topic**: distribution over words
>* Each **document**: mixture of corpus-wide topics
>* Each **word**: drawn from one of these topics

>  * **Goal:** infer hidden variables from documents
>  * **Hidden Variables:** topics, proportions, assignments

* **LDA - Inference Problem** $\rightarrow$ **Intractable**

>$$p(\boldsymbol{\beta}_{1:K},\boldsymbol{\theta}_{1:D},\{z_{nd}\},\{w_{nd}\}|\gamma,\alpha) $$
>$$\;$$
>$$= \prod^K_{k=1} \; p(\boldsymbol{\beta}_k|\gamma) \; \prod^D_{d=1} \; \bigg[ \; p(\boldsymbol{\theta}_d|\alpha) \prod^{N_d}_{n=1} \; \big[ \; p(z_{nd}|\boldsymbol{\theta}_d) \; p(w_{nd}|\boldsymbol{\beta}_{1:K},z_{nd}) \; \big] \; \bigg]$$

>* Normalising constant of the posterior (evidence): requires summing over all possible assignments $\{z_{nd}\}$ and integrating over $\boldsymbol{\theta}_{1:D}$ and $\boldsymbol{\beta}_{1:K}$

>$$p(\{w_{nd}\}|\gamma,\alpha) = \int \int \sum_{\{z_{nd}\}} \prod^K_{k=1} p(\boldsymbol{\beta}_k|\gamma) \prod^D_{d=1} \bigg[ p(\boldsymbol{\theta}_d|\alpha) \prod^{N_d}_{n=1} p(z_{nd}|\boldsymbol{\theta}_d) \, p(w_{nd}|\boldsymbol{\beta}_{1:K}, z_{nd}) \bigg] \, d\boldsymbol{\beta}_{1:K} \, d\boldsymbol{\theta}_{1:D}$$

* **Collapsed Gibbs Sampler for LDA**

>\begin{align}
p(z_{nd}=k|\{z_{-nd}\},\{w\},\gamma,\alpha) &\propto p(z_{nd}=k|\{z_{-nd}\},\alpha) \; p(w_{nd}|z_{nd}=k,\{w_{-nd}\},\{z_{-nd}\},\gamma) \\
\;\\
&= \frac{\alpha + c^k_{-nd}}{\sum^K_{j=1} \left( \alpha+c^j_{-nd} \right)} \frac{\gamma + \tilde{c}^k_{-w_{nd}}}{\sum^M_{m=1} \left( \gamma + \tilde{c}^k_{-m} \right)}
\end{align}

>* $c^k_{-nd}$: no. of words from document $d$, excluding $n$, assigned to topic $k$
>* $\tilde{c}^k_{-m}$: no. of times word $m$ was generated from topic $k$, excluding $nd$
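
A compact sketch of this sampler keeps two count tables, one per document and topic and one per topic and word, and resamples each $z_{nd}$ after removing its own contribution from the counts. The function and variable names are assumptions, the prior parameters are taken to be symmetric scalars, and the first denominator is dropped because it does not depend on $k$.

```python
import random

def lda_collapsed_gibbs(docs, K, M, alpha, gamma, num_sweeps=100, seed=0):
    """docs[d] is a list of word ids in range(M). Returns z and the two count tables."""
    rng = random.Random(seed)
    z = [[rng.randrange(K) for _ in doc] for doc in docs]   # random initial topics
    doc_topic = [[0] * K for _ in docs]                     # words in doc d assigned to topic k
    topic_word = [[0] * M for _ in range(K)]                # times word m is assigned to topic k
    topic_total = [0] * K                                   # total words assigned to topic k
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(num_sweeps):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment so the counts exclude position (n, d).
                k_old = z[d][n]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1
                # Unnormalised conditional: (alpha + c^k_-nd) * (gamma + c~^k_-w) / (M*gamma + total_k).
                weights = [
                    (alpha + doc_topic[d][k]) * (gamma + topic_word[k][w])
                    / (M * gamma + topic_total[k])
                    for k in range(K)
                ]
                k_new = rng.choices(range(K), weights=weights, k=1)[0]
                # Add the new assignment back into the counts.
                z[d][n] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1
    return z, doc_topic, topic_word

toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1]]          # hypothetical word ids
z, doc_topic, topic_word = lda_collapsed_gibbs(toy_docs, K=2, M=4, alpha=0.5, gamma=0.5, num_sweeps=20)
print(z)
```

The returned `z` is a single sample from the chain; in practice one would discard early sweeps and monitor a quantity such as perplexity before trusting it.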

* **Per Word Perplexity**

>$$\text{perplexity} = \exp \left( -\frac{\text{log joint prob. over the words}}{\text{number of words}} \right)$$

>* A perplexity of $g$ corresponds to the uncertainty of a fair die with $g$ sides
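
The die analogy can be verified numerically: if every word has probability $1/g$, the per-word perplexity comes out as $g$. A minimal sketch, assuming the per-word log probabilities are already available as a list (the function name is hypothetical):

```python
import math

def per_word_perplexity(word_log_probs):
    """exp of minus the average log probability per word."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))

g = 6
uniform_log_probs = [math.log(1 / g)] * 20   # 20 words, each with probability 1/6
print(per_word_perplexity(uniform_log_probs))  # approximately 6
```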