# pyLDAvis

[`pyLDAvis`](https://github.com/bmabey/pyLDAvis) is a Python library for interactive topic model visualization.
It is a port of the fabulous [R package](https://github.com/cpsievert/LDAvis) by Carson Sievert and Kenny Shirley. They did the hard work of crafting an effective visualization. `pyLDAvis` makes it easy to use the visualization from Python and, in particular, IPython notebooks. To learn more about the method behind the visualization, I suggest reading the [original paper](http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) explaining it.

This notebook provides a quick overview of how to use `pyLDAvis`. Refer to the [documentation](https://pyldavis.readthedocs.org/en/latest/) for details.


## BYOM - Bring your own model

`pyLDAvis` is agnostic to how your model was trained. To visualize it you need to provide the topic-term distributions, the document-topic distributions, and basic information about the corpus the model was trained on. The main function is [`prepare`](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.prepare), which transforms your data into the format needed for the visualization.
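
If you are bringing your own model, it can help to sanity-check the inputs before handing them to `prepare`. The sketch below is my own illustration, not part of `pyLDAvis`: `check_prepare_inputs` and the toy data are hypothetical, and it assumes the layout used in this notebook (topic-term rows are topics over the vocabulary, document-topic rows are documents over topics, and each row is a probability distribution that sums to 1).

```python
import math

def check_prepare_inputs(topic_term_dists, doc_topic_dists, doc_lengths,
                         vocab, term_frequency, tol=1e-6):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    n_topics = len(topic_term_dists)
    n_terms = len(vocab)
    n_docs = len(doc_lengths)

    # One frequency per vocabulary term, one topic mixture per document.
    if len(term_frequency) != n_terms:
        problems.append('term_frequency and vocab differ in length')
    if len(doc_topic_dists) != n_docs:
        problems.append('doc_topic_dists and doc_lengths differ in length')

    # Each topic is a distribution over the whole vocabulary.
    for k, row in enumerate(topic_term_dists):
        if len(row) != n_terms:
            problems.append('topic %d has %d terms, expected %d'
                            % (k, len(row), n_terms))
        elif not math.isclose(math.fsum(row), 1.0, abs_tol=tol):
            problems.append('topic %d does not sum to 1' % k)

    # Each document is a distribution over all topics.
    for d, row in enumerate(doc_topic_dists):
        if len(row) != n_topics:
            problems.append('document %d has %d topics, expected %d'
                            % (d, len(row), n_topics))
        elif not math.isclose(math.fsum(row), 1.0, abs_tol=tol):
            problems.append('document %d does not sum to 1' % d)

    return problems

# Toy data (made up): 2 topics, 3 terms, 4 documents.
toy = {
    'topic_term_dists': [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]],
    'doc_topic_dists': [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [1.0, 0.0]],
    'doc_lengths': [10, 12, 8, 5],
    'vocab': ['film', 'plot', 'actor'],
    'term_frequency': [14, 9, 12],
}

print(check_prepare_inputs(**toy))
```

The keyword names match the arguments passed to `prepare` in this notebook, so a dictionary that passes the check can be unpacked with `**` in the same way.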

Below we load a model trained in R and then visualize it. The model was trained on a corpus of 2000 movie reviews parsed by [Pang and Lee (ACL, 2004)](http://www.cs.cornell.edu/people/pabo/movie-review-data/), originally gathered from the IMDB archive of the rec.arts.movies.reviews newsgroup.

In [8]:
import json
import numpy as np

def load_R_model(filename):
    with open(filename, 'r') as j:
        data_input = json.load(j)
    data = {'topic_term_dists': data_input['phi'], 
            'doc_topic_dists': data_input['theta'],
            'doc_lengths': data_input['doc.length'],
            'vocab': data_input['vocab'],
            'term_frequency': data_input['term.frequency']}
    return data

movies_model_data = load_R_model('data/movie_reviews_input.json')

print('Topic-Term shape: %s' % str(np.array(movies_model_data['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(movies_model_data['doc_topic_dists']).shape))

Topic-Term shape: (20, 14567)
Doc-Topic shape: (2000, 20)


Now that we have the data loaded we use the `prepare` function:

In [9]:
import pyLDAvis
movies_vis_data = pyLDAvis.prepare(**movies_model_data)

Once you have the visualization data prepared you can do a number of things with it. You can [save the vis](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.save_html) to a stand-alone HTML file, [serve it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.show), or [display it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.display) in the notebook. Let's go ahead and display it:
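
For the save option, here is a minimal sketch. The wrapper `save_vis` and the file name `movies_vis.html` are my own placeholders; the only library call is `pyLDAvis.save_html`, which, as far as I know, accepts the prepared data and a file name or file object.

```python
def save_vis(vis_data, path):
    """Hypothetical wrapper: write prepared data to a stand-alone HTML file."""
    # Imported inside the function so the helper can be defined
    # even before pyLDAvis is installed.
    import pyLDAvis
    pyLDAvis.save_html(vis_data, path)
    return path

# Usage, with the prepared data from the cell above:
# save_vis(movies_vis_data, 'movies_vis.html')
```

The resulting file can be opened in a browser or shared without a running notebook.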

In [10]:
pyLDAvis.display(movies_vis_data)

Pretty, huh?! Again, you should be thanking the original [LDAvis people](https://github.com/cpsievert/LDAvis) for that. You may thank me for the IPython integration though. :)

To see other models visualized, check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Movie%20Reviews,%20AP%20News,%20and%20Jeopardy.ipynb).

*ProTip:* To avoid tediously typing in `display` all the time use:

In [11]:
pyLDAvis.enable_notebook()

## Making the common case easy - Gensim and others!

Built on top of the generic `prepare` function are helper functions for [gensim](https://radimrehurek.com/gensim/) and [GraphLab Create](https://dato.com/products/create/). To demonstrate, below I load a trained gensim model and the corresponding dictionary and corpus (see [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb) for how these were created):

In [12]:
import gensim

dictionary = gensim.corpora.Dictionary.load('newsgroups.dict')
corpus = gensim.corpora.MmCorpus('newsgroups.mm')
lda = gensim.models.ldamodel.LdaModel.load('newsgroups_50.model')

In the dark ages, all we had for inspecting our topics was `show_topics` and friends:

In [13]:
lda.show_topics()

[u'0.017*minnesota + 0.014*income + 0.012*morris + 0.012*roy + 0.012*partners + 0.010*francis + 0.008*amounts + 0.007*gear + 0.007*antenna + 0.007*motorola',
 u"0.015*don't + 0.014*know + 0.012*said + 0.011*time + 0.011*one + 0.010*didn't + 0.009*get + 0.008*like + 0.008*i'm + 0.007*going",
 u'0.016*university + 0.016*article + 0.015*pat + 0.012*john + 0.012*tony + 0.010*greg + 0.010*orbit + 0.009*toronto + 0.009*gerald + 0.009*steve',
 u'0.018*engine + 0.017*moon + 0.014*senate + 0.012*front + 0.011*car + 0.011*new + 0.010*bmw + 0.009*dod + 0.009*ford + 0.008*honda',
 u'0.022*president + 0.018*government + 0.015*clinton + 0.013*white + 0.011*house + 0.011*security + 0.010*secret + 0.010*clipper + 0.009*david + 0.009*encryption',
 u'0.026*bit + 0.023*scsi + 0.021*chip + 0.018*mac + 0.018*speed + 0.015*data + 0.011*chips + 0.011*mhz + 0.010*fast + 0.009*serial',
 u'0.025*greece + 0.024*doug + 0.023*energy + 0.022*blood + 0.014*jew + 0.013*judas + 0.012*ron + 0.011*murray + 0.010*hiv + 0

Thankfully, in addition to these *still helpful functions*, we can get a feel for all of the topics with this one-liner:

In [14]:
import pyLDAvis.gensim

pyLDAvis.gensim.prepare(lda, corpus, dictionary)
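
One caveat if you run this on a recent install: to my knowledge, newer `pyLDAvis` releases (3.x) ship the gensim helper as `pyLDAvis.gensim_models` rather than `pyLDAvis.gensim`. The sketch below tries both names; `first_importable` is a hypothetical helper of mine, not a `pyLDAvis` function.

```python
import importlib

def first_importable(module_names):
    """Return the first module in module_names that imports, else None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Module (or one of its dependencies) is missing; try the next name.
            continue
    return None

# Assumption: the newer name is tried first, then the one used in this notebook.
gensimvis = first_importable(['pyLDAvis.gensim_models', 'pyLDAvis.gensim'])

if gensimvis is None:
    print('No pyLDAvis gensim helper found; is pyLDAvis installed?')
else:
    print('Using %s' % gensimvis.__name__)
    # With lda, corpus and dictionary loaded as above:
    # vis_data = gensimvis.prepare(lda, corpus, dictionary)
```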

## GraphLab

As I mentioned above, you can also easily visualize GraphLab TopicModels. Check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/GraphLab.ipynb#topic=7&lambda=0.41&term=) if you are interested in that.


## Go forth and visualize!

What are you waiting for? Go ahead and `pip install pyldavis`.