In [3]:
# Change directory to VSCode workspace root so that relative path loads work correctly. Turn this addition off with the DataScience.changeDirOnImportExport setting
# ms-python.python added
import os
try:
	os.chdir(os.path.join(os.getcwd(), '../..'))
	print(os.getcwd())
except:
	pass

/Users


In [6]:
# Use the %cd magic: `! cd` runs in a subshell and does not change the kernel's working directory.
%cd ./sanjeevtewani/x/coursework/coms6998/SommeliAI/notebooks/

 # Why do we care?
 We're interested in exploring probabilistic programming as it
 applies to LDA and its variants.  LDA describes an extremely
 intuitive generative process, and because of this it enjoys
 continued research into its expansions. It is also both flexible
 and interpretable.  Since everybody here knows what LDA is, we
 won't go over the details.
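
For reference, the generative process can still be sketched in a few lines of plain Python. This is a minimal sketch and not the Pyro model used later: the function names, the symmetric hyperparameters `alpha` and `eta`, and the toy sizes are all assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, eta=0.1, seed=0):
    """Sample a toy corpus from the LDA generative process (hypothetical helper)."""
    rng = random.Random(seed)
    # One word distribution per topic: beta_k ~ Dirichlet(eta)
    topics = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Per-document topic proportions: theta_d ~ Dirichlet(alpha)
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)      # topic assignment z_{d,n}
            w = sample_categorical(topics[z], rng)  # word w_{d,n} drawn from topic z
            doc.append(w)
        corpus.append(doc)
    return topics, corpus

topics, corpus = generate_corpus(n_docs=3, doc_len=8, n_topics=2, vocab_size=5)
```

Each document is a list of word ids; inference has to recover `topics`, `theta`, and every `z` from those ids alone.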

 But, inference is really, really hard.  Most expansions to LDA that
 get published require subtle tricks to even get inference working.

 Examples are:
   1. Supervised LDA only works for real-valued supervision, which rules out
      classification (Blei & McAuliffe, 2008).
   2. Multiple Classification LDA was published a full year later; it required a
      subtle application of Jensen's inequality to reduce O(K^N) time to O(K^2)
      (Wang et al., 2009).
   3. Hierarchical Supervised LDA (Perotte et al., 2011) can't model a true
      is-a-this-and-not-that hierarchical relationship.

 Just about every incremental idea requires some special trick, no matter how
 logical the idea is.

__In spite of how understandable and flexible LDA is__, even statisticians and
practitioners will have a tough time deploying criticizable models for their needs.

Fortunately, using Pyro to overcome intractable integrals was day 2, so let's get started
with LDA in Pyro!

 # Dataset and Course of Research
 We study almost 300,000 wine reviews from WineEnthusiast.com.  This dataset is
 richly tagged, with numerical scores from 0 to 100, hierarchical region information
 (country->region->winery) and variety.
 Reviews are short but basically already bags of words; we don't anticipate many words
 being wasted on nuanced semantics or stop words, so we believe that these reviews may
 be long enough for LDA to work.
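
To make the "already bags of words" claim concrete, here is a minimal sketch that turns the sample review quoted in this section into word counts. The tokenizer regex and the tiny stop-word list are assumptions for illustration, not the preprocessing actually used for the results in this notebook.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real pipeline would use a fuller one).
STOP_WORDS = {"a", "all", "and", "as", "in", "of", "on", "the", "this", "well", "with"}

def bag_of_words(text):
    """Lowercase, split into (possibly hyphenated) words, drop stop words, count."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

review = (
    "Damp earth, black plum and dank forest herbs show on the nose of this "
    "single-vineyard expression. The palate offers cranberry and raspberry as "
    "well as savory soy and dried beef flavors, all with earthy herbs in the "
    "background."
)
counts = bag_of_words(review)
print(counts.most_common(5))
```

Most of the surviving tokens are descriptors (fruit, earth, herbs), which is the property we are counting on.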
 
 > "Damp earth, black plum and dank forest herbs show
 > on the nose of this single-vineyard expression. The palate offers
 > cranberry and raspberry as well as savory soy and dried beef flavors,
 > all with earthy herbs in the background."
 
As such, we expect to see a progression of topic modelling capabilities as we march
down our list of models:
 - LDA
 - LDA + classification v. supervised LDA
 - Hierarchical LDA + classification v. supervised LDA
 - Hierarchical supervised LDA v. supervised LDA
 - Spectral Methods for supervised LDA v. supervised LDA

 # Vanilla LDA
 __Top Words Per Topic (10 topics)__
 - aroma,dry,black,tannin,drink,cherry,finish,fruit,wine,flavor
 - black,drink,acidity,palate,finish,tannin,cherry,fruit,wine,flavor
 - aroma,ripe,dry,tannin,acidity,finish,cherry,fruit,wine,flavor
 - acidity,dry,drink,tannin,aroma,finish,cherry,fruit,flavor,wine
 - drink,black,palate,acidity,tannin,finish,cherry,fruit,wine,flavor
 - acidity,drink,tannin,aroma,dry,cherry,finish,fruit,flavor,wine
 - aroma,acidity,palate,black,tannin,finish,cherry,fruit,wine,flavor
 - acidity,black,aroma,tannin,dry,cherry,finish,fruit,wine,flavor
 - drink,aroma,cherry,acidity,dry,tannin,finish,fruit,flavor,wine
 - aroma,dry,palate,acidity,tannin,cherry,finish,fruit,wine,flavor

I'm not certain what iteration or hyperparameter was used for this run,
but I can assure you we ran many, and the results were all the same.
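
One way to put a number on "all the same" is the pairwise Jaccard overlap between top-word lists. The sketch below uses the first three topics listed above; the choice of metric and the helper name `jaccard` are assumptions, not something computed in the original run.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two non-empty collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# First three topics from the list above, copied verbatim.
top_words = [
    "aroma,dry,black,tannin,drink,cherry,finish,fruit,wine,flavor".split(","),
    "black,drink,acidity,palate,finish,tannin,cherry,fruit,wine,flavor".split(","),
    "aroma,ripe,dry,tannin,acidity,finish,cherry,fruit,wine,flavor".split(","),
]

# Overlap for every pair of topics, keyed by (i, j).
overlaps = {
    (i, j): jaccard(top_words[i], top_words[j])
    for i, j in combinations(range(len(top_words)), 2)
}
mean_overlap = sum(overlaps.values()) / len(overlaps)
print(overlaps, mean_overlap)
```

Well-separated topics would score near 0 on this measure; every pair here shares more than half of its words.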

# Supervised LDA?
Perhaps our data is truly just a very poor fit for LDA; this was always a risk.  We do expect, however, that more supervision can separate out meaningful topics for classification: if the LDA objective alone does not offer enough reward for separating topics, an extra loss may help.

# sLDA's top words per topic
 - palate, style, structure, aromas, sweet, crisp, mouth, cherry, soft, spice
 - soft, citrus, cherry, crisp, palate, blackberry, apple, like, aromas, cabernet
 - aromas, cherry, spice, palate, chocolate, soft, cabernet, best, sweet, pinot
 - sweet, soft, cherry, palate, style, crisp, aromas, shows, mouth, plenty
 - cherry, soft, aromas, palate, touch, cedar, spice, sweet, made, plenty
 - cherry, palate, aromas, chocolate, merlot, crisp, plenty, fresh, pomegranate, sauvignon
 - sweet, cherries, shows, soft, aromas, palate, structure, cherry, cabernet, green
 - cherry, palate, like, spice, shows, sweet, soft, full, blend, green
 - palate, cherry, shows, aromas, cola, soft, cherries, raspberry, structure, crisp
 - cherry, palate, like, sweet, shows, citrus, polished, aromas, soft, cherries
 
What's going on?

# ALL VISUALIZATIONS

# Posterior Collapse


We want to scrutinize each parameter update from Pyro and from CAVI (coordinate ascent variational inference) to get a sense of what's going on.  Pyro employs BBVI (black-box variational inference) for discrete distributions; it furthermore uses the dependency structure declared by plates to Rao-Blackwellize the gradient estimator where possible.  We hand-derive the updates for the specific case of vanilla LDA.
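
As a reminder of what BBVI does for a discrete latent, here is a minimal sketch of the score-function gradient estimator for a single categorical variable. It is not Pyro's implementation: the function names and the stand-in objective `f` are assumptions. In BBVI, `f(z)` would be log p(x, z) - log q(z).

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_function_gradient(logits, f, n_samples, rng):
    """Monte Carlo estimate of d/d(logits) E_{z~q}[f(z)], q = Categorical(softmax(logits))."""
    probs = softmax(logits)
    n_cat = len(logits)
    grad = [0.0] * n_cat
    for _ in range(n_samples):
        z = rng.choices(range(n_cat), weights=probs, k=1)[0]
        for j in range(n_cat):
            # d log q(z) / d logit_j = 1[z == j] - q_j
            score = (1.0 if z == j else 0.0) - probs[j]
            grad[j] += f(z) * score / n_samples
    return grad

def exact_gradient(logits, f):
    """Closed form for comparison: q_j * (f(j) - E_q[f])."""
    probs = softmax(logits)
    expected_f = sum(p * f(z) for z, p in enumerate(probs))
    return [probs[j] * (f(j) - expected_f) for j in range(len(logits))]

rng = random.Random(0)
logits = [0.0, 0.5, -0.5]
f = lambda z: float(z)  # hypothetical stand-in for log p(x, z) - log q(z)
estimate = score_function_gradient(logits, f, n_samples=20000, rng=rng)
exact = exact_gradient(logits, f)
print(estimate, exact)
```

The estimator is unbiased but noisy, which is why variance reduction such as Rao-Blackwellization matters in practice.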


# Parameter Updates: Gamma and Phi
BBVI:
![](files/bbvi_update_gamma.png)
![](files/bbvi_update_phi.png)

Via Monte-Carlo (B samples from gamma)
![](files/montecarlo_update_gamma.png)
![](files/montecarlo_update_phi.png)

And compared to CAVI
![](files/cavi_update_gamma.png)
![](files/cavi_update_phi.png)


A few things to note:
 - Updates in each aren't directly comparable: CAVI is run until convergence, while BBVI takes iterative gradient steps (along with the other parameters).  But the similarities are evident.
 - All updates show some tradeoff between updating a parameter and shrinking toward its prior. \theta isn't the prior of \phi, but they are closely related via the model.
 - Both BBVI updates have Monte-Carlo estimates that can be written as KL divergences.

How so?

![](files/bbvi_kldiv_gamma.png)
![](files/bbvi_kldiv_phi.png)


This term, log(p(z_{d,n}) / q(z_{d,n} \vert \phi_{d,n})), should trigger some alarm bells.  It guides our variational distribution towards the unconditional p(z), which is a hallmark of posterior collapse.  Once the variational parameters are guided towards a collapsed maximum, the gradient updates will drag all model parameters towards the collapse.
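
A small numeric check makes the pull visible. Taken in expectation under q, the log-ratio above is -KL(q || p), which reaches its maximum of zero exactly when q equals p(z). The sketch below uses made-up categorical distributions over four topics; the values and the helper name are assumptions for illustration.

```python
import math

def neg_kl_term(q, p):
    """E_q[log p(z) - log q(z)] = -KL(q || p) for categorical q and p."""
    return sum(qk * (math.log(pk) - math.log(qk)) for qk, pk in zip(q, p) if qk > 0)

prior = [0.25, 0.25, 0.25, 0.25]      # unconditional p(z), made up for the example
peaked_q = [0.85, 0.05, 0.05, 0.05]   # a q that commits to one topic
collapsed_q = list(prior)             # a q that ignores the words entirely

print(neg_kl_term(peaked_q, prior))     # negative: committing to a topic is penalized
print(neg_kl_term(collapsed_q, prior))  # zero: the collapsed q maximizes this term
```

Unless the likelihood terms push back hard enough, this term alone favors the q that carries no information about the document.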

 # CAVI for LDA
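
For comparison with the BBVI updates, here is a minimal sketch of the standard mean-field coordinate updates for a single document, with the topics held fixed. It is an illustration under assumptions, not the implementation behind the results above: the function names, the toy topics, and the pure-Python `digamma` approximation are all ours.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0 (recurrence up to 6, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv_sq = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv_sq * (1.0 / 12 - inv_sq * (1.0 / 120 - inv_sq / 252)))

def cavi_document(words, log_beta, alpha, n_iters=50):
    """Alternate the phi and gamma updates for one document with fixed topics."""
    n_topics = len(alpha)
    gamma = [a + len(words) / n_topics for a in alpha]
    phi = []
    for _ in range(n_iters):
        # E_q[log theta_k] up to a constant shared by all k (it cancels in phi).
        e_log_theta = [digamma(g) for g in gamma]
        phi = []
        for w in words:
            # phi_{n,k} is proportional to exp(E[log theta_k] + log beta_{k,w})
            scores = [e_log_theta[k] + log_beta[k][w] for k in range(n_topics)]
            m = max(scores)
            unnorm = [math.exp(s - m) for s in scores]
            total = sum(unnorm)
            phi.append([u / total for u in unnorm])
        # gamma_k = alpha_k + sum_n phi_{n,k}
        gamma = [alpha[k] + sum(row[k] for row in phi) for k in range(n_topics)]
    return gamma, phi

# Toy setup: two topics over a four-word vocabulary (made-up numbers).
beta = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
log_beta = [[math.log(p) for p in row] for row in beta]
alpha = [0.1, 0.1]
words = [0, 0, 1]
gamma, phi = cavi_document(words, log_beta, alpha)
print(gamma)
```

The gamma update is the prior `alpha` plus expected counts, the same "data versus prior" tradeoff noted for the BBVI updates.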

 # References
 - (Blei & McAuliffe, 2008) https://papers.nips.cc/paper/3328-supervised-topic-models.pdf
 - (Wang et al., 2009) http://vision.stanford.edu/pdf/WangBleiFei-Fei_CVPR2009.pdf
 - (Schroeder, 2018) https://edoc.hu-berlin.de/bitstream/handle/18452/19516/thesis_schroeder_ken.pdf?sequence=3
 - (Perotte et al., 2011) https://papers.nips.cc/paper/4313-hierarchically-supervised-latent-dirichlet-allocation
 - (Thoutt, 2017) https://www.kaggle.com/zynicide/wine-reviews