# Exercise 6 - Latent Dirichlet Allocation

In this exercise, we will formulate the generative model of Latent Dirichlet Allocation (LDA) and use Gibbs Sampling to derive the latent topics of documents.

If a problem persists, do not hesitate to contact the course instructors at
- paul.kahlmeyer@uni-jena.de

### Submission

- Deadline of submission: 11.12.2022
- Submission on [moodle page](https://moodle.uni-jena.de/course/view.php?id=34630)

### Help
In case you cannot solve a task, you can use the saved values within the `help` directory to continue working on the other tasks:
- Load arrays with [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.load.html)
```
import numpy as np
np.load('help/array_name.npy')
```
- Load functions, classes and other objects with [Dill](https://dill.readthedocs.io/en/latest/dill.html)
```
import dill
with open('help/some_func.pkl', 'rb') as f:
    func = dill.load(f)
```

# Dataset

We will use a dataset consisting of [titles from scientific papers](https://www.kaggle.com/datasets/blessondensil294/topic-modeling-for-research-articles).

For a fast lookup, we convert the titles into lists of word indices based on a fixed vocabulary.

For example with the vocabulary

| idx | word |
| :- | -: | 
| 0 | big |
| 1 | sentence |
| 2 | random |

the sentence
    
    "This is a random sentence"
    
would be converted to 
    
    [2, 1]
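
The conversion above can be sketched in a few lines. This is a minimal illustration on the toy vocabulary from the table, not the solution to Task 1; the helper name `to_indices` and the lowercasing step are assumptions made for the example.

```python
import re

# toy vocabulary from the table above (illustration only)
vocab = ["big", "sentence", "random"]
vocab_id = {word: idx for idx, word in enumerate(vocab)}


def to_indices(text: str) -> list[int]:
    # assumption: words are matched case-insensitively
    words = re.findall(r"[a-z]+", text.lower())
    # words that are not in the vocabulary are silently dropped
    return [vocab_id[word] for word in words if word in vocab_id]


print(to_indices("This is a random sentence"))  # [2, 1]
```

A title without any vocabulary word maps to the empty list, which is why such titles have to be discarded afterwards.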

### Task 1

Load the titles from `titles.txt` and transform them into lists of word indices using the vocabulary from `vocab.txt`.
Discard titles that have no words in the vocabulary.

In [8]:
import re

# TODO: load titles and vocab
with open("titles.txt", 'r') as f:
    titles = f.read().replace(";", "").splitlines()
with open("vocab.txt", 'r') as f:
    vocab = f.read().replace(";", "").splitlines()

# TODO: transform each document into list of indices
vocab_id = {word: idx for idx, word in enumerate(vocab)}


def to_bag(title: str) -> list[int]:
    # replace non-alphabetical characters with whitespace
    title = re.sub(r"[^a-zA-Z]", " ", title)
    # strip whitespace left and right
    title = title.strip()
    # collapse all longer whitespace into 1 space
    title = re.sub(r"\s+", " ", title)
    # read the words
    words = title.split(" ")
    # convert to ids (can become an empty list)
    return [vocab_id[word] for word in words if word in vocab_id]


# collect all non-empty bags (each title is converted only once)
bags = [bag for bag in map(to_bag, titles) if bag]


# LDA Topic Model

Latent Dirichlet Allocation (LDA) is a model of how documents are generated from unobserved topics.
The goal is then to learn the hidden parameters of the model, which essentially means discovering the topics of the documents.

The model has the following parameters:

- $K$... number of topics
- $D$... number of documents
- $V$... size of vocabulary
- $N_i$... number of words in document $i$
- $z_{il}$... topic of $l$-th word of document $i$
- $w_{il}$... $l$-th word of document $i$

and is depicted in the following graphical model:
<div>
<img src="images/generative_model.png" width="400"/>
</div>


Sampling a corpus of documents is done with four steps:

1. Sample the topic distribution for each document from a Dirichlet distribution

\begin{equation}
\theta_i\sim\mathcal{D}(\alpha)\text{, for }i=1,\dots,D
\end{equation}

2. Sample the word distribution for each topic from a Dirichlet distribution

\begin{equation}
\varphi_j\sim\mathcal{D}(\beta)\text{, for }j=1,\dots,K
\end{equation}

3. Sample the topics for each word-position according to the document specific topic distribution

\begin{equation}
z_{il}\sim \text{Cat}(\theta_i)\text{, for }l=1,\dots,N_i
\end{equation}

4. Sample the words for each word-position according to the topic specific word distribution

\begin{equation}
w_{il}\sim \text{Cat}(\varphi_{z_{il}})\text{, for }l=1,\dots,N_i
\end{equation}

For further detail, please refer to the following videos:

- intuitive guide [Part1](https://www.youtube.com/watch?v=T05t-SqKArY), [Part2](https://www.youtube.com/watch?v=BaM1uiCpj_E)
- more technical [ML Lecture from Tübingen](https://www.youtube.com/watch?v=z2q7LhsnWNg)

### Task 2

Implement this generative model and explore the influence of the Dirichlet parameters $\alpha$ and $\beta$ on the topics of the document.

For simplicity, we assume that all documents have the same length $N$ and that we draw from the Dirichlet distributions using the uniform vectors $[\alpha]_{1,\dots,K}$ and $[\beta]_{1,\dots,V}$.

In [23]:
import numpy as np


def sample_LDA(alpha: float, beta: float, D: int, K: int, N: int, V: int) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    '''
    @Params:
        alpha... dirichlet prior for document-topic distribution
        beta... dirichlet prior for topic-word distribution
        D... number of documents
        K... number of topics
        N... number of words/document
        V... size of vocabulary

    @Returns:
        theta... document-topic distributions, shape (D, K)
        phi... topic-word distributions, shape (K, V)
        z... sampled topic for each word position, shape (D, N)
        w... sampled word for each word position, shape (D, N)
    '''

    # turn alpha and beta into the respective uniform vectors
    alpha = np.full(K, alpha)
    beta = np.full(V, beta)
    # theta[d], d in [D]: probability distribution of topics for document d
    # theta[d, t], d in [D], t in [K]: probability of topic t in document d
    theta = np.random.dirichlet(alpha, size=D)
    # phi[t], t in [K]: probability distribution of words for topic t
    # phi[t, w], t in [K], w in [V]: probability that topic t emits word w
    phi = np.random.dirichlet(beta, size=K)
    # z[d, n], d in [D], n in [N]: sampled topic in document d at position n
    z = np.array([np.random.choice(K, size=N, p=theta[d]) for d in range(D)])
    # w[d, n], d in [D], n in [N]: sampled word in document d at position n,
    # drawn from phi[t] where t = z[d, n] is the topic at that position
    w = np.array([[np.random.choice(V, p=phi[z[d, n]]) for n in range(N)] for d in range(D)])
    # return the latent distributions, the topic assignments and the sampled words
    return theta, phi, z, w


# TODO: observe statistics
# n_documents = len(bags)
# n_topics = 20
# n_words_per_document = 10
# vocab_size = len(vocab)
n_documents = 3
n_topics = 5
n_words_per_document = 10
vocab_size = 5

priors = [
    (0.1, 0.1),
    (1, 1),
    (0.5, 2),
    (2, 0.5),
    (5, 5)
]

# NOTE: low prior seems to promote sparsity
with np.printoptions(precision=3, suppress=True):
    for alpha, beta in priors:
        theta, phi, z, w = sample_LDA(alpha, beta, n_documents, n_topics, n_words_per_document, vocab_size)
        print(f"alpha, beta: {alpha:.2f}, {beta:.2f}\ntopics probs:\n{theta}")


alpha, beta: 0.10, 0.10
topics probs:
[[0.011 0.    0.004 0.985 0.   ]
 [0.    0.905 0.001 0.    0.094]
 [0.    0.012 0.    0.936 0.052]]
alpha, beta: 1.00, 1.00
topics probs:
[[0.432 0.133 0.152 0.261 0.021]
 [0.311 0.257 0.164 0.149 0.119]
 [0.053 0.491 0.101 0.041 0.313]]
alpha, beta: 0.50, 2.00
topics probs:
[[0.13  0.307 0.075 0.473 0.016]
 [0.44  0.179 0.024 0.    0.357]
 [0.02  0.001 0.098 0.125 0.756]]
alpha, beta: 2.00, 0.50
topics probs:
[[0.356 0.303 0.129 0.124 0.088]
 [0.221 0.049 0.242 0.219 0.269]
 [0.1   0.293 0.177 0.158 0.272]]
alpha, beta: 5.00, 5.00
topics probs:
[[0.199 0.263 0.183 0.226 0.129]
 [0.236 0.196 0.203 0.197 0.169]
 [0.15  0.187 0.181 0.35  0.133]]
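
The sparsity effect visible in the outputs above can also be quantified. The following is a minimal sketch, assuming a symmetric Dirichlet over $K=5$ topics; the helper `mean_max_prob` is a hypothetical name introduced only for this illustration. It reports how much probability mass the dominant topic receives on average.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of topics, as in the runs above


def mean_max_prob(concentration: float, n_samples: int = 2000) -> float:
    # draw n_samples topic distributions from a symmetric Dirichlet
    theta = rng.dirichlet(np.full(K, concentration), size=n_samples)
    # average probability mass of the most likely topic
    return float(theta.max(axis=1).mean())


for a in (0.1, 0.5, 1, 2, 5):
    print(f"alpha={a}: mean max prob = {mean_max_prob(a):.3f}")
```

Small concentration values push most of the mass onto a single topic, while large values move the samples towards the uniform distribution, where the dominant topic holds only slightly more than $1/K$.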


# Parameter Learning

In reality, we only know $W$ and we are interested in the latent matrices
\begin{equation}
\Theta = [\theta_i]_{i=1,\dots,D}
\end{equation}
and
\begin{equation}
\Phi = [\varphi_j]_{j=1,\dots,K}\,,
\end{equation}
which hold the probabilities over topics for each document and the probabilities over words for each topic respectively.
Both matrices are latent and governed by Dirichlet priors, hence the name "Latent Dirichlet Allocation".

In other words, we are interested in estimates
\begin{align}
\hat{\Theta}&\approx\Theta\\
\hat{\Phi}&\approx\Phi
\end{align}

In practice, it is sufficient to get an estimate for the word-topic assignments $Z$.

From $\hat{Z}$ and $W$ one can then simply calculate the Maximum Likelihood estimates $\hat{\Theta}_{\text{ML}}$ and $\hat{\Phi}_{\text{ML}}$ for the categoricals.
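
As a rough sketch of this step: for categorical distributions, the Maximum Likelihood estimates are normalized counts. The example below assumes the corpus is given as three flat integer arrays (word index, topic index and document index per word position); the function names and the toy data are hypothetical and serve only as illustration.

```python
import numpy as np


def normalize_rows(counts: np.ndarray) -> np.ndarray:
    totals = counts.sum(axis=1, keepdims=True)
    # rows without any counts stay all-zero instead of producing NaN
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


def ml_estimates(words, topics, docs, D: int, K: int, V: int):
    # doc_topic[d, k]: number of words in document d assigned to topic k
    doc_topic = np.zeros((D, K))
    np.add.at(doc_topic, (docs, topics), 1)
    # topic_word[k, v]: number of times word v is assigned to topic k
    topic_word = np.zeros((K, V))
    np.add.at(topic_word, (topics, words), 1)
    return normalize_rows(doc_topic), normalize_rows(topic_word)


# toy corpus: two documents with two words each
words = np.array([0, 1, 1, 2])
topics = np.array([0, 1, 1, 1])
docs = np.array([0, 0, 1, 1])
theta_ml, phi_ml = ml_estimates(words, topics, docs, D=2, K=2, V=3)
print(theta_ml)
print(phi_ml)
```

Each row of `theta_ml` is the fraction of a document's words per topic, and each row of `phi_ml` is the fraction of a topic's words per vocabulary entry.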


## Gibbs Sampling

The standard way to get $\hat{Z}$ is by estimating the posterior
\begin{equation}
p(Z|W)\,.
\end{equation}

Unfortunately this posterior is intractable to compute directly, so we have to fall back to sampling.

Let 
- $W$ be the given word collection
- $Z$ be an assignment of words to topics (collection of topics)
- $d_i\subseteq W$ be a specific document (collection of words)
- $(d_i, w_i, z_i)$ be the triple that defines a specific word in a document with a topic

[Gibbs Sampling](https://en.wikipedia.org/wiki/Gibbs_sampling) in the form we want to use here consists of a simple loop:

1. Create an initial $\hat{Z}$
2. For each word in each document $(d_i, w_i, z_i)$:
   - resample $z_i$ from $p(z_i = j | Z\setminus z_i, W)$
   - update $\hat{Z}$

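It turns out that, with all counts taken over $Z\setminus z_i$ (that is, excluding the word position currently being resampled),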

\begin{equation}
p(z_i = j | Z\setminus z_i, W)\propto\cfrac{|\left\{w\in W: w=w_i\wedge \text{topic}(w)=j\right\}| + \beta}{|\left\{w\in W: \text{topic}(w)=j\right\}| + V\beta}\cdot\cfrac{|\left\{w\in d_i: \text{topic}(w)=j\right\}| + \alpha}{|d_i| + K\alpha}\,.
\end{equation}
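
The conditional can be evaluated for all topics $j$ at once from two count tables. The following is a minimal sketch, assuming the tables already exclude the word position that is being resampled; the function name `topic_conditional` and the toy counts are hypothetical.

```python
import numpy as np


def topic_conditional(word: int, doc: int, n_word_topic: np.ndarray,
                      n_doc_topic: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """Normalized conditional over all topics j for one word position."""
    V, K = n_word_topic.shape
    # number of words per topic in the whole corpus
    n_topic = n_word_topic.sum(axis=0)
    # first factor: how strongly topic j is associated with this word
    word_term = (n_word_topic[word] + beta) / (n_topic + V * beta)
    # second factor: how strongly this document is associated with topic j
    doc_term = (n_doc_topic[doc] + alpha) / (n_doc_topic[doc].sum() + K * alpha)
    p = word_term * doc_term
    return p / p.sum()


# toy counts: V = 3 words, K = 2 topics, D = 2 documents
n_word_topic = np.array([[2, 0], [1, 1], [0, 3]])
n_doc_topic = np.array([[3, 1], [0, 3]])
p = topic_conditional(word=0, doc=0, n_word_topic=n_word_topic,
                      n_doc_topic=n_doc_topic, alpha=0.1, beta=0.1)
print(p)
```

The denominator of the document factor does not depend on $j$, so it cancels in the normalization and could be omitted in an implementation.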

### Task 3

Implement the following class for Gibbs Sampling.

The function `.sample` should return a dictionary that holds the intermediate results for each iteration (we need this later).

Use the class to sample for 200 iterations with $K=5$ topics, and $\alpha=\beta=0.1$.

In [40]:
class LDA_GibbsSampler:
    def __init__(self, docs: list[list[int]], vocab: list[str], K: int, alpha: float, beta: float):
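        '''
        @Params:
            docs... list of lists of indices (see Task 1)
            vocab... list of words (see Task 1)
            K... number of topics
            alpha... dirichlet prior for document-topic distribution
            beta... dirichlet prior for topic-word distribution
        '''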

        self.docs = docs
        self.vocab = vocab
        self.word2idx = {v: i for i, v in enumerate(self.vocab)}
        self.idx2word = {i: v for i, v in enumerate(self.vocab)}

        self.D = len(self.docs)
        self.V = len(self.word2idx)
        self.K = K
        self.alpha = alpha
        self.beta = beta

    def sample(self, iterations: int, seed: int = 0) -> dict:
        '''
        Performs Gibbs Sampling for LDA topic model

        @Params:
            iterations... number of iterations
            seed... random seed (for initialization)
        @Returns:
            dictionary with results from sampling process (key = iteration, value = results after iteration)
        '''

        # flatten the corpus: word_list[i] is the i-th word of the whole corpus
        word_list = np.array([word for document in self.docs for word in document], dtype=int)
        # respective_docs[i] is the index of the document that contains word i
        respective_docs = np.array([d for d, document in enumerate(self.docs) for _ in document], dtype=int)
        # we've got a topic for every word in every document
        z_size = len(word_list)

        # seed it
        np.random.seed(seed)
        # create initial Z (a topic t in [K] for every word)
        Z = np.random.choice(self.K, size=z_size)

        # (word, topic) -> number of occurrences of that word with that topic
        n_word_topic = np.zeros((self.V, self.K), dtype=int)
        np.add.at(n_word_topic, (word_list, Z), 1)
        # (doc, topic) -> number of words in the document with that topic
        n_doc_topic = np.zeros((self.D, self.K), dtype=int)
        np.add.at(n_doc_topic, (respective_docs, Z), 1)
        # topic -> number of words in the whole corpus with that topic
        n_topic = n_word_topic.sum(axis=0)

        # create the snapshot dict
        snapshots: dict[int, np.ndarray] = {}
        for iteration in range(iterations):
            for i in range(z_size):
                old_topic = Z[i]
                word = word_list[i]
                document = respective_docs[i]
                # take the current word out of the counts (we condition on Z without z_i)
                n_word_topic[word, old_topic] -= 1
                n_doc_topic[document, old_topic] -= 1
                n_topic[old_topic] -= 1
                # unnormalized probability for every new topic j; the factor
                # 1 / (|d_i| + K * alpha) is the same for all j and drops out when normalizing
                p = (n_word_topic[word] + self.beta) / (n_topic + self.V * self.beta) * (n_doc_topic[document] + self.alpha)
                # normalize probability
                p /= np.sum(p)
                # sample from it and overwrite Z
                new_topic = np.random.choice(self.K, p=p)
                Z[i] = new_topic
                # put the word back into the counts with its new topic
                n_word_topic[word, new_topic] += 1
                n_doc_topic[document, new_topic] += 1
                n_topic[new_topic] += 1
            snapshots[iteration] = Z.copy()
        return snapshots


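# TODO: sample for 200 iterations from p(Z|W)
n_topics = 5
alpha = 0.1
beta = 0.1
iterations = 200
sampler = LDA_GibbsSampler(bags, vocab, n_topics, alpha, beta)
snapshots = sampler.sample(iterations=iterations)
# show the topic assignment after the last iteration
print(snapshots[iterations - 1])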


[2 3 2 ... 2 2 2]


### Task 4

A good topic model should have two properties:

1. Each document should have only a few topics
2. Each topic should have only a few words

Think about how to measure these two properties, track them over the sampling iterations and visualize this process.
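
One possible measure, given here only as a sketch, is the average Shannon entropy of the rows of a count matrix: a low entropy of the document-topic counts means few topics per document, and a low entropy of the topic-word counts means few words per topic. The function name `mean_entropy` is hypothetical; the count matrices would have to be built from each snapshot of $Z$.

```python
import numpy as np


def mean_entropy(counts: np.ndarray) -> float:
    """Average Shannon entropy (in nats) of the row-normalized count matrix."""
    totals = counts.sum(axis=1, keepdims=True)
    # rows without any counts are treated as all-zero and contribute entropy 0
    probs = np.divide(counts, totals, out=np.zeros(counts.shape, dtype=float), where=totals > 0)
    # treat 0 * log(0) as 0
    logs = np.log(probs, out=np.zeros_like(probs), where=probs > 0)
    return float(-(probs * logs).sum(axis=1).mean())


# every document uses exactly one topic: entropy 0
concentrated = np.array([[4, 0], [0, 4]])
# the single document is split evenly over two topics: entropy log(2)
spread = np.array([[2, 2]])
print(mean_entropy(concentrated), mean_entropy(spread))
```

Tracking both values per iteration and plotting them against the iteration number would show whether the sampler moves towards sparser assignments.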

In [None]:
# TODO: track + visualize properties of topic assignment

# Inference

The results of our Gibbs sampling process are samples from $p(Z|W)$.

Remember that we want to have an estimate $\hat{Z}$ and from there calculate $\hat{\Theta}_{\text{ML}}$ and $\hat{\Phi}_{\text{ML}}$.

Based on these we can then answer all kinds of inference queries.

### Task 5

Use the topic assignments from the sampling process $Z^{(i)}$ to estimate 

\begin{equation}
\hat{Z} = \cfrac{1}{m} \sum\limits_{i=1}^m Z^{(i)} \approx \mathbb{E}_{p(Z|W)}Z
\end{equation}

For the estimation, take every 10-th sample from the 100-th iteration onwards.
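
A sketch of the thinning and averaging step is given below. It rests on two assumptions that are not stated in the task: the snapshot dictionary is keyed by a 0-based iteration index, and each $Z^{(i)}$ is averaged in one-hot encoding, so that the result holds per-position topic frequencies. The function name `average_one_hot` and the toy snapshots are hypothetical.

```python
import numpy as np


def average_one_hot(snapshots: dict, K: int, burn_in: int = 100, step: int = 10) -> np.ndarray:
    # keep every step-th snapshot from the burn-in iteration onwards
    kept = [snapshots[i] for i in sorted(snapshots)
            if i >= burn_in and (i - burn_in) % step == 0]
    # one-hot encode every kept assignment: shape (m, n_words, K)
    one_hot = np.stack([np.eye(K)[Z] for Z in kept])
    # average over the m kept samples: shape (n_words, K)
    return one_hot.mean(axis=0)


# toy snapshots: two word positions, two topics, 200 iterations
toy_snapshots = {i: np.array([i % 2, 1]) for i in range(200)}
Z_hat = average_one_hot(toy_snapshots, K=2)
print(Z_hat)
```

Taking the `argmax` over the last axis of the result would give a single topic per word position, if a hard assignment is needed.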

In [None]:
# TODO: estimate Z

### Task 6

Think about what the Maximum Likelihood estimates $\hat{\Theta}_{\text{ML}}$ and $\hat{\Phi}_{\text{ML}}$ should be given $\hat{Z}$.

Use your estimate $\hat{Z}$ to calculate $\hat{\Theta}_{\text{ML}}$ and $\hat{\Phi}_{\text{ML}}$.

In [None]:
# TODO: estimate phi, theta


### Task 7

For each topic, what are the top three words?
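
A minimal sketch of such a lookup, assuming a topic-word matrix `phi` of shape $(K, V)$ and a vocabulary list; the function name `top_words` and the toy values are hypothetical and do not come from the dataset.

```python
import numpy as np


def top_words(phi: np.ndarray, vocab: list[str], n: int = 3) -> list[list[str]]:
    # indices of the n largest probabilities per topic, largest first
    top = np.argsort(phi, axis=1)[:, ::-1][:, :n]
    return [[vocab[i] for i in row] for row in top]


# toy example with 2 topics and 4 words
toy_vocab = ["graph", "neural", "quantum", "market"]
toy_phi = np.array([[0.1, 0.6, 0.05, 0.25],
                    [0.4, 0.1, 0.3, 0.2]])
print(top_words(toy_phi, toy_vocab))
```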

In [None]:
# TODO: top three words for topics

### Task 8

What is the topic distribution for the second title?

In [None]:
# TODO: topic distribution for second title

### Task 9

Select the top 10 titles that are most typical for topic 4.

In [None]:
# TODO: top 10 titles for topic 4

### Task 10

Visualize the soft clustering that is induced by the topic distribution vectors for each title.

In [None]:
# TODO: visualize clustering