# Chapter 1 Results

**Objective:** Present basic illustrative statistics that demonstrate the background.

**Method:** Using LDA/Gibbs, LDA/VB, LDA/CVB, MoM/Gibbs and MoM/VB, demonstrate how well topic models work. Using LDA/VB and HDP, demonstrate how well HDP finds the ideal number of topics. Using LDA/VB and online LDA (just use the version [packaged with sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)), demonstrate how online learning helps expedite learning on fast datasets. Using LDA and MoM on Reuters and 20News, use the labels to demonstrate other evaluation metrics.
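The batch-versus-online comparison mentioned above could be set up roughly as follows. This is a minimal sketch, not the experiment itself: it uses a small synthetic count matrix instead of the real corpora, and the topic count, batch size and iteration count are illustrative assumptions. Only the `learning_method` switch on sklearn's `LatentDirichletAllocation` is taken from the linked documentation.

```python
import time

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Synthetic document-term counts (200 documents, 50 terms); illustrative only.
X = rng.poisson(1.0, size=(200, 50))


def fit_lda(method):
    """Fit LDA with the given learning method and return (model, seconds)."""
    lda = LatentDirichletAllocation(
        n_components=5,          # assumed topic count
        learning_method=method,  # "batch" (plain VB) or "online" (mini-batch VB)
        batch_size=64,           # only used by the online method
        max_iter=5,
        random_state=0,
    )
    start = time.perf_counter()
    lda.fit(X)
    return lda, time.perf_counter() - start


batch_lda, batch_secs = fit_lda("batch")
online_lda, online_secs = fit_lda("online")

# Compare wall-clock time and in-sample perplexity for the two methods.
print(f"batch:  {batch_secs:.3f}s, perplexity {batch_lda.perplexity(X):.1f}")
print(f"online: {online_secs:.3f}s, perplexity {online_lda.perplexity(X):.1f}")
```

On a matrix this small the timings say little; the point is only the shape of the comparison, which would be repeated on the real datasets.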

# Prelude

## Imports

In [None]:
import numpy as np
import numpy.random as rd
import scipy as sp
import scipy.stats as stats
import pathlib
import os
import sys
from IPython.display import display, Markdown

In [None]:
sys.path.append(str(pathlib.Path.cwd().parent))

In [None]:
from sidetopics.model.common import DataSet


## Configuration

In [None]:
DATASET_DIR = pathlib.Path('/') / 'Volumes' / 'DatasetSSD'
CLEAN_DATASET_DIR = DATASET_DIR / 'words-only'

T20_NEWS_DIR = CLEAN_DATASET_DIR / '20news4'
NIPS_DIR = CLEAN_DATASET_DIR / 'nips'
REUTERS_DIR = CLEAN_DATASET_DIR / 'reuters'

TRUMP_WEEKS_DIR = DATASET_DIR / 'TrumpDb'
NUS_WIDE_DIR = DATASET_DIR / 'NusWide'

CITHEP_DATASET_DIR = DATASET_DIR / 'Arxiv'
ACL_DATASET_DIR = DATASET_DIR / 'ACL' / 'ACL.100.clean'
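The paths above are hard-coded to one machine's volume layout. A minimal sketch of one way to make the root overridable is shown below; the environment variable name `SIDETOPICS_DATASET_DIR` and the helper `resolve_dataset_dir` are hypothetical and are not part of the notebook or the `sidetopics` package.

```python
import os
import pathlib

# Hypothetical variable name; not defined anywhere in the notebook.
ENV_VAR = "SIDETOPICS_DATASET_DIR"
DEFAULT_DATASET_DIR = pathlib.Path('/') / 'Volumes' / 'DatasetSSD'


def resolve_dataset_dir(environ=os.environ):
    """Return the dataset root from the environment, else the notebook default."""
    override = environ.get(ENV_VAR)
    return pathlib.Path(override) if override else DEFAULT_DATASET_DIR


dataset_dir = resolve_dataset_dir()
clean_dataset_dir = dataset_dir / 'words-only'
print(dataset_dir, clean_dataset_dir)
```

The remaining directory constants would then be derived from `dataset_dir` exactly as in the configuration cell.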

In [None]:
DTYPE = np.float32

# DataSet Load

In [None]:
t20_news = DataSet.from_files(words_file=T20_NEWS_DIR / 'words.pkl')
reuters = DataSet.from_files(words_file=REUTERS_DIR / 'words.pkl')
acl = DataSet.from_files(words_file=ACL_DATASET_DIR / 'words.pkl')
arxiv = DataSet.from_files(words_file=CITHEP_DATASET_DIR / 'words.pkl')
nips = DataSet.from_files(words_file=NIPS_DIR / 'words.pkl')

In [None]:
t20_news.convert_to_dtype(DTYPE)
reuters.convert_to_dtype(DTYPE)
acl.convert_to_dtype(DTYPE)
arxiv.convert_to_dtype(DTYPE)
nips.convert_to_dtype(DTYPE)

In [None]:
display(Markdown(f"""

| Dataset | Document Count | Total Words | Vocabulary Size |
| ------- | -------------- | ----------- | --------------- |
| 20-News | {t20_news.doc_count:,} | {int(t20_news.word_count):,} | {t20_news.words.shape[1]:,} |
| Reuters | {reuters.doc_count:,} | {int(reuters.word_count):,} | {reuters.words.shape[1]:,} |
| NIPS | {nips.doc_count:,} | {int(nips.word_count):,} | {nips.words.shape[1]:,} |
| ACL | {acl.doc_count:,} | {int(acl.word_count):,} | {acl.words.shape[1]:,} |
| Arxiv | {arxiv.doc_count:,} | {int(arxiv.word_count):,} | {arxiv.words.shape[1]:,} |

"""))
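For readers without access to the `DataSet` class, the three statistics in the table can be sketched on a toy matrix. This assumes that `words` is a documents-by-vocabulary matrix of term counts stored as a SciPy sparse matrix; that layout is an assumption suggested by the use of `words.shape[1]` above, not something the notebook confirms.

```python
import numpy as np
import scipy.sparse as ssp

# Toy document-term count matrix: 3 documents, 4 vocabulary terms (illustrative).
words = ssp.csr_matrix(
    np.array(
        [[2, 0, 1, 0],
         [0, 3, 0, 0],
         [1, 1, 0, 4]],
        dtype=np.float32,
    )
)

doc_count, vocab_size = words.shape  # rows are documents, columns are terms
total_words = int(words.sum())       # sum of all term counts in the corpus

print(f"| Toy | {doc_count:,} | {total_words:,} | {vocab_size:,} |")
```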

# Issue 1: MoM vs LDA