# Chapter 1 Results

**Objective:** Basic illustrative statistics to demonstrate the background.

**Method**: Using LDA/Gibbs, LDA/VB, LDA/CVB, MoM/Gibbs and MoM/VB, demonstrate how well topic models work. Using LDA/VB and HDP, demonstrate how well HDP finds the ideal number of topics. Using LDA/VB and online LDA (just use the version [packaged with sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)), demonstrate how online learning helps expedite learning on fast datasets. Using LDA and MoM on Reuters and 20News, use labels to demonstrate other evaluation metrics.

# Prelude

## Imports

In [1]:
import pandas as pd
import numpy as np
import numpy.random as rd
import scipy as sp
import scipy.stats as stats
import pathlib
import os
import sys
from IPython.display import display, Markdown

import matplotlib.pyplot as plt

In [2]:
%matplotlib inline

In [3]:
sys.path.append(str(pathlib.Path.cwd().parent))

In [4]:
from sidetopics.model.common import DataSet


## Configuration

In [5]:
DATASET_DIR = pathlib.Path('/') / 'Volumes' / 'DatasetSSD'
CLEAN_DATASET_DIR = DATASET_DIR / 'words-only'

T20_NEWS_DIR = CLEAN_DATASET_DIR / '20news4'
NIPS_DIR = CLEAN_DATASET_DIR / 'nips'
REUTERS_DIR = CLEAN_DATASET_DIR / 'reuters'

TRUMP_WEEKS_DIR = DATASET_DIR / 'TrumpDb'
NUS_WIDE_DIR = DATASET_DIR / 'NusWide'

CITHEP_DATASET_DIR = DATASET_DIR / 'Arxiv'
ACL_DATASET_DIR = DATASET_DIR / 'ACL' / 'ACL.100.clean'

In [6]:
DTYPE = np.float32

# DataSet Load

In [9]:
t20_news = DataSet.from_files(words_file=T20_NEWS_DIR / 'words.pkl')
reuters = DataSet.from_files(words_file=REUTERS_DIR / 'words.pkl')
acl = DataSet.from_files(words_file=ACL_DATASET_DIR / 'words.pkl')
arxiv = DataSet.from_files(words_file=CITHEP_DATASET_DIR / 'words.pkl')
nips = DataSet.from_files(words_file=NIPS_DIR / 'words.pkl')

In [10]:
t20_news.convert_to_dtype(DTYPE)
reuters.convert_to_dtype(DTYPE)
acl.convert_to_dtype(DTYPE)
arxiv.convert_to_dtype(DTYPE)
nips.convert_to_dtype(DTYPE)

In [11]:
def corpus_stats(dataset: DataSet) -> str:
    quarts = np.percentile(a=dataset.words.sum(axis=1), q=[0, 25, 50, 75, 100]).astype(np.int32)
    quarts_str = ' | '.join(f'{q:,}' for q in quarts)
    return f'{dataset.doc_count:,} | {int(dataset.word_count):,} | {dataset.words.shape[1]} | {quarts_str}'


display(Markdown(f"""

| Dataset | Document Count | Total Words | Vocabulary Size | DocLen (Min) | DocLen (25) | DocLen (50) | DocLen (75) | DocLen (Max) |
| ------- | -------------- | ----------- | --------------- | ------------ | ----------- | ----------- | ----------- | ------------ |
| Reuters-21578 | {corpus_stats(reuters)} |
| 20-News | {corpus_stats(t20_news)} |
| NIPS | {corpus_stats(nips)} |
| ACL | {corpus_stats(acl)} |
| Arxiv | {corpus_stats(arxiv)} |

"""))



| Dataset | Document Count | Total Words | Vocabulary Size | DocLen (Min) | DocLen (25) | DocLen (50) | DocLen (75) | DocLen (Max) |
| ------- | -------------- | ----------- | --------------- | ------------ | ----------- | ----------- | ----------- | ------------ |
| Reuters-21578 | 10,788 | 922,811 | 7729 | 4 | 29 | 56 | 105 | 999 |
| 20-News | 18,821 | 3,029,297 | 20835 | 1 | 57 | 98 | 168 | 7,393 |
| NIPS | 1,740 | 2,543,236 | 10422 | 19 | 1,272 | 1,495 | 1,724 | 4,773 |
| ACL | 13,554 | 41,009,480 | 107954 | 100 | 2,080 | 3,022 | 3,689 | 19,238 |
| Arxiv | 543 | 1,838,163 | 5563 | 12 | 1,614 | 2,504 | 3,875 | 43,606 |



The thing to emphasise here is that we deliberately chose a mix of datasets with small document lengths and large document lengths to look into overparameterisation.

In [79]:
from sklearn.datasets import fetch_rcv1
# rcv1 = fetch_rcv1()  a log TF-IDF rep, not much good

# Issue 1: MoM vs LDA

In [12]:
train, test = reuters.cross_valid_split(test_fold_id=0, num_folds=5)

In [13]:
test, valid = test.doc_completion_split()

In [20]:
from gensim.sklearn_api import HdpTransformer
from sklearn.decomposition import LatentDirichletAllocation

In [21]:
LatentDirichletAllocation.score?

Signature: LatentDirichletAllocation.score(self, X, y=None)
Docstring:
Calculate approximate log-likelihood as score.

Parameters
----------
X : array-like or sparse matrix, shape=(n_samples, n_features)
    Document word matrix.

y : Ignored

Returns
-------
score : float
    Use approximate bound as score.
File:      ~/Documents/GitHub/sidetopics/env/lib/python3.7/site-packages/sklearn/decomposition/online_lda.py
Type:      function


# How are we evaluating this?

First off, the `LatentDirichletAllocation` class in `sklearn` uses the variational bound as an approximation of the log likelihood given a set of doc-to-topic distributions. There's no doc-completion step here.

This is the basis of `score()`, which internally calculates the "unnormalized" topic distribution of the documents, then uses the variational bound to approximate the log likelihood; this in turn is the basis of `perplexity()`.
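
As a minimal sketch of how a log-likelihood score turns into a perplexity, assuming the usual per-word definition (exponentiated negative log likelihood per token). The helper name `perplexity_from_score` is hypothetical and not part of `sklearn`.

```python
import numpy as np


def perplexity_from_score(log_likelihood: float, doc_term_counts: np.ndarray) -> float:
    """Per-word perplexity: exp(-log_likelihood / total number of tokens)."""
    total_tokens = doc_term_counts.sum()
    return float(np.exp(-log_likelihood / total_tokens))


# Toy check: a model that gives every token probability 1/4
counts = np.array([[2, 1, 0, 0], [0, 0, 3, 2]])
loglik = counts.sum() * np.log(0.25)
toy_perplexity = perplexity_from_score(loglik, counts)
```

A model that spreads its mass uniformly over four words should come out with a perplexity of about four, which is a handy sanity check.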

What did Hannah Wallach say?

 * Well, she's thinking of T topics and, I guess, ? words, so here the component distribution is $\Phi \in \mathbb{R}^{T \times ?}$ with prior $\text{Dir}(\phi_t; \beta \boldsymbol{n})$
 * For each of the $D$ documents there's a topic distribution $\theta_d$ with prior $\text{Dir}(\theta_d; \alpha \boldsymbol{m})$

Finally, she notes the Polya identity, allowing the marginalisation of most parameters.

She then moves on to how to evaluate the probability of some held-out documents $W$ given some training documents $W'$, which is

$$
p(W | W') = \int d\Phi d\alpha d\boldsymbol{m}
             \text{ } p(W | \Phi, \alpha, \boldsymbol{m}) \text{ } p(\Phi, \alpha, \boldsymbol{m}|W')
$$

The thing to note here is she has already marginalised out $\Theta$ for the new documents. She assumes you learn the "global" parameters -- priors and component distribution -- and then fix these and use them to evaluate the new documents.

> So we have to think about what we're doing here. A mixture model is a good case. You can just directly evaluate the log likelihood $p(w|\alpha, \Phi) = \sum_k p(w | \phi_k)p(z=k|\alpha)$. Or you can determine the posterior over clusters and use that to evaluate... except that it doesn't decompose $p(w|\alpha, \Phi) = \sum_k p(w, z=k|\alpha, \Phi) = p(z=k|w, \alpha, \Phi)p(w|\ldots)$. But it seems obvious to see how well you can "explain" documents: this is what doc-completion does. Hence it should be introduced in the clustering section. It's also a good metric to use if you want to consider the predictive ability to, e.g. predict hashtags.
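
A small sketch of the direct evaluation for the mixture case, assuming a bag-of-words document and strictly positive component probabilities. The names `mom_log_likelihood`, `log_pi` (standing in for $\log p(z=k|\alpha)$) and `log_phi` are illustrative, not from the `sidetopics` code. The multinomial coefficient is dropped since it does not depend on the parameters.

```python
import numpy as np


def mom_log_likelihood(doc_counts, log_pi, log_phi):
    """Return log p(w | alpha, Phi) and log p(z=k | w, alpha, Phi) for one document.

    doc_counts: (V,) word counts for the document
    log_pi:     (K,) log mixing weights, log p(z=k | alpha)
    log_phi:    (K, V) log word probabilities for each component
    """
    log_joint = log_pi + log_phi @ doc_counts            # (K,) log p(w, z=k)
    m = log_joint.max()
    log_marginal = m + np.log(np.exp(log_joint - m).sum())  # log-sum-exp over k
    log_posterior = log_joint - log_marginal             # log p(z=k | w)
    return log_marginal, log_posterior


phi = np.array([[0.5, 0.5], [0.9, 0.1]])
pi = np.array([0.5, 0.5])
log_marginal, log_posterior = mom_log_likelihood(np.array([1, 1]), np.log(pi), np.log(phi))
```

The posterior over clusters falls out as a by-product of the same computation, which is why the direct route is available for mixtures but not for LDA.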

Now either way, you have to make a choice about your parameters. Are you using the _distribution_ over the parameters, or are you just taking a point estimate?

1. Drawing samples from the parameter posterior and taking an average to evaluate the integral, i.e.  $\mathbb{E}_{p(\Phi, \alpha, \boldsymbol{m}|W')}\left[ p(W | \Phi, \alpha, \boldsymbol{m}) \right]$. 
    * Stick a log in that expectation and you can start thinking about a variational approximation.
2. Taking a point estimate of -- I guess $\Phi, \alpha, \boldsymbol{m}$ -- and then using that to approximate the integral.

The paper is concerned with point estimates. So where's the uncertainty? Apparently it's in $p(\boldsymbol{w}_d | \alpha \boldsymbol{m}, \Phi)$.

The next thing is that we've marginalised out $\theta$ for each of the inference documents. We need this too. If you hold $\Phi$ fixed (and so let it be found by any inference method), you can use Gibbs sampling to quickly get a distribution over $z$ and thereby, $\theta$.

 * This is used by many methods she describes, being: FIXME
 * There are other methods that do not require this, being: FIXME
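
A rough sketch of the Gibbs step with $\Phi$ held fixed, for a single document. This is an illustration under the standard collapsed conditional $p(z_n = k | z_{\neg n}, w) \propto (\alpha m_k + n_k^{\neg n}) \Phi_{k, w_n}$, not the paper's or the `sidetopics` implementation; the function name and arguments are made up here.

```python
import numpy as np


def gibbs_topics_fixed_phi(word_ids, phi, alpha_m, n_sweeps=50, seed=0):
    """Sample topic assignments z for one document while Phi stays fixed.

    word_ids: (N,) token ids; phi: (K, V) topic-word probabilities;
    alpha_m:  (K,) Dirichlet parameters (alpha * m).
    Returns the last sample of z and the theta estimate it implies.
    """
    rng = np.random.default_rng(seed)
    word_ids = np.asarray(word_ids)
    K = phi.shape[0]
    z = rng.integers(K, size=len(word_ids))               # random initial assignments
    topic_counts = np.bincount(z, minlength=K).astype(float)
    for _ in range(n_sweeps):
        for n, w in enumerate(word_ids):
            topic_counts[z[n]] -= 1                        # take token n out of the counts
            p = (alpha_m + topic_counts) * phi[:, w]       # unnormalised conditional
            z[n] = rng.choice(K, p=p / p.sum())
            topic_counts[z[n]] += 1                        # put it back under its new topic
    theta = (alpha_m + topic_counts) / (alpha_m.sum() + len(word_ids))
    return z, theta
```

Because $\Phi$ never changes, each document can be sampled independently, so this is cheap compared with training.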
 


### Estimating $p(w|\Phi, \alpha \boldsymbol{m})$



#### Using Importance Sampling


Hence there are two options:

Directly sample $\theta \sim \text{Dir}(\alpha \boldsymbol{m})$ and average over all settings. But importance sampling doesn't work well in high dimensions: it has high variance, indeed infinite variance, with real-valued high-dimensional variables.
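
As a sketch of this first option, assuming a single document given as token ids: draw $\theta$ from the prior, evaluate $p(w|\theta, \Phi) = \prod_n \sum_k \theta_k \Phi_{k, w_n}$ for each draw, and average in log space. The function name is hypothetical.

```python
import numpy as np


def log_prob_prior_sampling(word_ids, phi, alpha_m, n_samples=1000, seed=0):
    """Monte Carlo estimate of log p(w | Phi, alpha m) using draws of theta from the prior."""
    rng = np.random.default_rng(seed)
    word_ids = np.asarray(word_ids)
    theta = rng.dirichlet(alpha_m, size=n_samples)      # (S, K) prior samples
    token_probs = theta @ phi[:, word_ids]              # (S, N): p(w_n | theta, Phi)
    log_lik = np.log(token_probs).sum(axis=1)           # (S,) log p(w | theta, Phi)
    m = log_lik.max()
    return m + np.log(np.mean(np.exp(log_lik - m)))     # log of the sample average
```

With long documents almost all prior draws land far from the posterior, so a handful of samples dominate the average; that is the variance problem in practice.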

The other is to choose a proposal distribution and weight such samples in the usual importance-sampling way. The proposal distribution is in fact a method for evaluating the posterior $p(z|w, \alpha \boldsymbol{m}, \Phi)$.

$$
\theta^0 \propto \left(\alpha \boldsymbol{m}\right) \text{.* } \Phi_{\cdot, w_{n}} 
$$

This is just the prior over topics times the probability of the word under each topic, i.e. $p(z = k| w, \Phi, \alpha \boldsymbol{m}) \propto p(w|\Phi, z=k)p(z=k| \alpha \boldsymbol{m})$.

To draw samples, simply iterate
$$
\begin{align*}
\text{for }& s = 0 \ldots S \\
 & z_n^{(s)} \sim \text{Mul}(\theta^{(s)}, 1) \\
 & \theta^{(s+1)} \propto \left(\alpha \boldsymbol{m} + \sum_{n' \neq n} \theta^{(s)} \text{.* } \boldsymbol{\bar{z}}_{n'}\right) \text{.* } \Phi_{\cdot, w_{n}}
\end{align*}
$$

(Recall that in more normal notation $\alpha \boldsymbol{m} = \boldsymbol{\alpha}$ and parameterises the prior. Also $z_n$ is the scalar and $\bar{\boldsymbol{z}}_n$ is the indicator vector.)
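
A simplified sketch of the importance-sampling estimate that uses only the initial proposal $\theta^0$ for every token and skips the iterative refinement above. Each sampled $z$ is weighted by $p(w|z, \Phi)\,p(z|\alpha \boldsymbol{m}) / q(z)$, with $p(z|\alpha \boldsymbol{m})$ coming from the Polya identity. Function names are made up for illustration.

```python
import math
import numpy as np


def log_polya(topic_counts, alpha_m):
    """log p(z | alpha m) with theta integrated out (Dirichlet-multinomial)."""
    a0, n = alpha_m.sum(), topic_counts.sum()
    per_topic = sum(math.lgamma(a + c) - math.lgamma(a)
                    for a, c in zip(alpha_m, topic_counts))
    return math.lgamma(a0) - math.lgamma(a0 + n) + per_topic


def log_prob_importance(word_ids, phi, alpha_m, n_samples=2000, seed=0):
    """Importance-sampling estimate of log p(w | Phi, alpha m) for one document."""
    rng = np.random.default_rng(seed)
    word_ids = np.asarray(word_ids)
    K, N = phi.shape[0], len(word_ids)
    q = alpha_m[:, None] * phi[:, word_ids]              # (K, N) proposal, theta^0 per token
    q = q / q.sum(axis=0, keepdims=True)
    log_w = np.empty(n_samples)
    for s in range(n_samples):
        z = np.array([rng.choice(K, p=q[:, n]) for n in range(N)])
        log_p_w_given_z = np.log(phi[z, word_ids]).sum()
        log_p_z = log_polya(np.bincount(z, minlength=K), alpha_m)
        log_q_z = np.log(q[z, np.arange(N)]).sum()
        log_w[s] = log_p_w_given_z + log_p_z - log_q_z   # log importance weight
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))        # log of the average weight
```

The proposal here treats tokens as independent, so it ignores the coupling between topic assignments within a document; the iterated version is meant to bring the proposal closer to the true posterior.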

#### Use the Harmonic Mean

Use Gibbs sampling to get a _posterior_ distribution over $z_n^{(s)}$.

Then instead of using that to materialise an estimate of $\theta$ (WHY), use it directly to figure out $p(w | \alpha \boldsymbol{m}, \Phi)$.
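
A minimal sketch of the harmonic mean estimate, assuming you already have $\log p(w | z^{(s)}, \Phi)$ for each posterior sample $s$ from a Gibbs run. The estimate is $\left(\frac{1}{S}\sum_s 1 / p(w | z^{(s)}, \Phi)\right)^{-1}$, computed in log space; the helper name is hypothetical.

```python
import numpy as np


def log_harmonic_mean(log_liks):
    """log of the harmonic mean of likelihoods, given their logs."""
    neg = -np.asarray(log_liks, dtype=float)               # log(1 / p_s)
    m = neg.max()
    log_mean_inverse = m + np.log(np.mean(np.exp(neg - m)))  # log mean of 1 / p_s
    return -log_mean_inverse


# Toy usage with two sampled likelihoods, 1.0 and 0.5
toy_estimate = log_harmonic_mean([np.log(1.0), np.log(0.5)])
```

The smallest sampled likelihoods dominate the sum, so the estimate can swing a lot between runs.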