## [Probabilistic Topic Modeling](http://pyro.ai/examples/prodlda.html#Probabilistic-Topic-Modeling)

#### Topic models are a suite of unsupervised learning algorithms that aim to discover and annotate large archives of documents with thematic information. Probabilistic topic models use statistical methods to analyze the words in each text to discover common themes, how those themes are connected to each other, and how they change over time. They enable us to organize and summarize electronic archives at a scale that would be impossible with human annotation alone. The most popular topic model is called latent Dirichlet allocation, or LDA.

![](https://i.ibb.co/zV5rjX6/Screen-Shot-2020-09-24-at-11-21-38.png)

##### We assume that there is a given number of “topics,” each of which is a probability distribution over the words in the vocabulary (far left of the figure). Each document is assumed to be generated as follows: i) first, randomly choose a distribution over the topics (the histogram on the right); ii) then, for each word, randomly choose a topic assignment (the colored coins) and randomly choose the word from the corresponding topic. For an in-depth, intuitive description, see the excellent article by David Blei. The goal of topic modeling is to automatically discover the topics from a collection of documents. The documents themselves are observed, while the topic structure (the topics, the per-document topic distributions, and the per-document, per-word topic assignments) is hidden. The central computational problem for topic modeling is to use the observed documents to infer the hidden topic structure.
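
The two-step generative process described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the toy vocabulary, the two hand-written topics, the concentration value `alpha=0.5`, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical and are not part of the tutorial's model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocab, doc_length, alpha, rng):
    """Generate one document with the two-step process described above."""
    num_topics = len(topics)
    # i) choose this document's distribution over topics
    topic_weights = sample_dirichlet([alpha] * num_topics, rng)
    assignments, words = [], []
    for _ in range(doc_length):
        # ii) choose a topic for this word, then a word from that topic
        z = rng.choices(range(num_topics), weights=topic_weights)[0]
        w = rng.choices(vocab, weights=topics[z])[0]
        assignments.append(z)
        words.append(w)
    return topic_weights, assignments, words

rng = random.Random(0)
toy_vocab = ["gene", "dna", "cell", "data", "model", "computer"]  # hypothetical
# Two hand-written topics; each row is a probability distribution over toy_vocab.
toy_topics = [
    [0.40, 0.35, 0.20, 0.02, 0.02, 0.01],  # a "genetics"-like topic
    [0.01, 0.02, 0.02, 0.35, 0.30, 0.30],  # a "computing"-like topic
]
topic_weights, assignments, words = generate_document(
    toy_topics, toy_vocab, doc_length=10, alpha=0.5, rng=rng
)
print(topic_weights)                  # the hidden per-document topic distribution
print(list(zip(assignments, words)))  # hidden topic assignment and observed word
```

Only `words` would be observed in practice; inference has to recover `toy_topics`, `topic_weights`, and `assignments` from the words alone.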

### [Pre-processing Data and Vectorizing Documents](http://pyro.ai/examples/prodlda.html#Pre-processing-Data-and-Vectorizing-Documents)

In [7]:
import os
import torch

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
news = fetch_20newsgroups(subset='all')

In [6]:
# Drop rare words (those appearing in fewer than 20 documents), common words
# (those appearing in more than 50% of the documents), and English stop words.
vectorizer = CountVectorizer(max_df=0.5, min_df=20, stop_words='english')

In [9]:
docs = torch.from_numpy(vectorizer.fit_transform(news['data']).toarray())

In [25]:
docs.shape

torch.Size([18846, 12722])

In [23]:
vocab = pd.DataFrame(columns=['word', 'index'])
vocab['word'] = vectorizer.get_feature_names_out()
vocab['index'] = vocab.index

In [24]:
vocab.tail()

| | word | index |
|---|---|---|
| 12717 | zoom | 12717 |
| 12718 | zuma | 12718 |
| 12719 | zurich | 12719 |
| 12720 | zx | 12720 |
| 12721 | zz | 12721 |


##### The vocabulary contains 12,722 unique words, each with its own index. The corpus consists of 18,846 documents, stored as a matrix in which each row represents a document and each column represents a word in the vocabulary. Each entry is the number of times that word occurs in that document. Now we are ready to move on to the model.
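
To make the structure of this matrix concrete, here is a simplified pure-Python sketch of document-frequency filtering on a toy corpus. It is not scikit-learn's implementation: tokenization is a plain whitespace split, there is no stop-word list, and the function name `count_matrix` and the toy documents are hypothetical. The `min_df` and `max_df` arguments are interpreted as in the vectorizer cell above (an absolute document count and a fraction of documents, respectively).

```python
from collections import Counter

def count_matrix(documents, min_df=1, max_df=1.0):
    """Build a document-term count matrix with document-frequency filtering."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(documents)
    vocab = sorted(
        word for word, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)  # one column per vocabulary word
        for word in tokens:
            if word in index:
                row[index[word]] += 1
        matrix.append(row)      # one row per document
    return vocab, matrix

toy_docs = [  # hypothetical toy corpus
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog sat",
    "the bird sang",
]
toy_vocab, toy_matrix = count_matrix(toy_docs, min_df=2, max_df=0.75)
print(toy_vocab)   # ['cat', 'dog', 'sat']
for row in toy_matrix:
    print(row)
```

Here "the" is dropped for appearing in every document, and words that occur in only one document are dropped as too rare, so the last document ends up as an all-zero row.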

### [Probabilistic Modeling and the Dirichlet distribution in Pyro](http://pyro.ai/examples/prodlda.html#Probabilistic-Modeling-and-the-Dirichlet-distribution-in-Pyro)

##### To understand how probabilistic modeling and Pyro work, let’s imagine a very simple example. Let’s say we have a die and we want to determine whether it is loaded or fair. We cannot directly observe the die’s ‘fairness’; we can only infer it by throwing the die and observing the results. So, we throw it 30 times and observe the following results:

$$\mathcal{D} = \{5, 4, 2, 5, 6, 5, 3, 3, 1, 5, 5, 3, 5, 3, 5, 3, 5, 5, 3, 5, 5, 3, 1, 5, 3, 3, 6, 5, 5, 6\}$$

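Writing $\theta_k$ for the probability of rolling face $k$ and $N_k$ for the number of times face $k$ appears in $\mathcal{D}$, the likelihood of the observed sequence is:

$$p(\mathcal{D} | \theta) = \prod_{k = 1}^{6} \theta_{k}^{N_k}$$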
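
The likelihood depends on the data only through the per-face counts $N_k$. As a small worked example, the sketch below tallies the counts and compares the log-likelihood of a fair die with that of the maximum-likelihood estimate $\theta_k = N_k / N$. The names `rolls`, `face_counts`, and `log_likelihood` are introduced here for illustration only.

```python
import math
from collections import Counter

rolls = [5, 4, 2, 5, 6, 5, 3, 3, 1, 5, 5, 3, 5, 3, 5,
         3, 5, 5, 3, 5, 5, 3, 1, 5, 3, 3, 6, 5, 5, 6]

# N_k: number of times face k was observed, for k = 1..6
face_counts = Counter(rolls)
N = [face_counts[k] for k in range(1, 7)]
print(N)  # [2, 1, 9, 1, 14, 3]

def log_likelihood(theta, N):
    """log p(D | theta) = sum_k N_k * log(theta_k)."""
    return sum(n * math.log(t) for n, t in zip(N, theta))

fair = [1 / 6] * 6
mle = [n / sum(N) for n in N]  # the theta that maximizes the likelihood

print(log_likelihood(fair, N))
print(log_likelihood(mle, N))  # higher than the fair-die value
```

Faces 3 and 5 account for 23 of the 30 rolls, which is why the data favor a loaded die over a fair one.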

In [26]:
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

In [28]:
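def model(counts):
    # Prior: a uniform Dirichlet over the six face probabilities
    theta = pyro.sample('theta', dist.Dirichlet(torch.ones(6)))
    # Likelihood: the per-face counts are one Multinomial draw over all rolls
    total_count = int(counts.sum())
    pyro.sample('counts', dist.Multinomial(total_count, theta), obs=counts)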

In [46]:
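def model1(data):
    # Same Dirichlet prior, but each roll is modeled individually
    theta = pyro.sample('theta', dist.Dirichlet(torch.ones(6)))
    # `data` must hold 0-based face indices (0..5), as Categorical expects
    with pyro.plate('data', len(data)):
        pyro.sample('obs', dist.Categorical(theta), obs=data)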
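
The two models, `model` and `model1`, lead to the same posterior over `theta`. The Multinomial log-probability of the count vector differs from the sum of per-roll Categorical log-probabilities only by the multinomial coefficient, which does not depend on `theta`. The standard-library sketch below checks this numerically; the function names and the second test value `theta_b` are hypothetical choices made for the illustration.

```python
import math

def categorical_loglik(theta, N):
    """Sum of per-roll Categorical log-probabilities: sum_k N_k * log(theta_k)."""
    return sum(n * math.log(t) for n, t in zip(N, theta))

def multinomial_loglik(theta, N):
    """Multinomial log-pmf of the count vector N."""
    # log of N! / (N_1! * ... * N_6!), which does not involve theta
    log_coeff = math.lgamma(sum(N) + 1) - sum(math.lgamma(n + 1) for n in N)
    return log_coeff + categorical_loglik(theta, N)

N = [2, 1, 9, 1, 14, 3]  # per-face counts for the 30 rolls
theta_a = [1 / 6] * 6
theta_b = [0.1, 0.05, 0.3, 0.05, 0.4, 0.1]

gap_a = multinomial_loglik(theta_a, N) - categorical_loglik(theta_a, N)
gap_b = multinomial_loglik(theta_b, N) - categorical_loglik(theta_b, N)
print(gap_a, gap_b)  # the same constant for both values of theta
```

Because the gap is constant in `theta`, it cancels when the posterior is normalized, so either model can be passed to the sampler.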

In [34]:
data = torch.tensor([5, 4, 2, 5, 6, 5, 3, 3, 1, 5, 5, 3, 5, 3, 5,
                     3, 5, 5, 3, 5, 5, 3, 1, 5, 3, 3, 6, 5, 5, 6])

# Per-face counts N_1..N_6; bincount keeps a zero entry for any face that never occurs
counts = torch.bincount(data - 1, minlength=6).float()

In [47]:
nuts_kernel = NUTS(model1)
num_samples, warmup_steps = (1000, 200)

In [48]:
mcmc = MCMC(nuts_kernel, num_samples=num_samples, warmup_steps=warmup_steps)
mcmc.run(data - 1)  # shift faces 1..6 to the 0-based indices 0..5 that Categorical expects

Sample: 100%|██████████| 1200/1200 [00:15, 79.61it/s, step size=5.59e-01, acc. prob=0.938]


In [49]:
hmc_samples = {k: v.detach().cpu().numpy() for k, v in mcmc.get_samples().items()}

In [50]:
hmc_samples['theta'].mean(axis=0)

array([0.08168021, 0.05532762, 0.27888504, 0.05335578, 0.41732746,
       0.11342413], dtype=float32)

In [51]:
hmc_samples['theta'].std(axis=0)

array([0.04482878, 0.03823426, 0.07980289, 0.03685267, 0.08043303,
       0.05437296], dtype=float32)
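
Because the Dirichlet prior is conjugate to the Categorical (and Multinomial) likelihood, this particular posterior is also available in closed form: it is Dirichlet with parameters $\alpha_k + N_k$. That gives a quick sanity check on the MCMC estimates. The sketch below uses only the standard library; the variable names are introduced here for illustration.

```python
import math

N = [2, 1, 9, 1, 14, 3]   # per-face counts for the 30 rolls
prior = [1.0] * 6         # Dirichlet(1, ..., 1), as in the model
posterior = [a + n for a, n in zip(prior, N)]  # Dirichlet(alpha + N)
a0 = sum(posterior)

# Mean and standard deviation of each component of a Dirichlet distribution
post_mean = [a / a0 for a in posterior]
post_std = [math.sqrt(a * (a0 - a) / (a0 ** 2 * (a0 + 1))) for a in posterior]

print([round(m, 4) for m in post_mean])  # [0.0833, 0.0556, 0.2778, 0.0556, 0.4167, 0.1111]
print([round(s, 4) for s in post_std])
```

The exact posterior means agree with the sampled means above to within about 0.003, and the standard deviations are similarly close, which suggests the sampler has converged for this small problem.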

### [LDA pseudocode, mathematical form, and graphical model](http://pyro.ai/examples/prodlda.html#LDA-pseudocode,-mathematical-form,-and-graphical-model)