## Visualization of Topics

We use the pyLDAvis visualization, a Python port by [Ben Mabey](https://github.com/bmabey/pyLDAvis) of the original R package LDAvis.

pyLDAvis shows topics as circles in a 2D plot. The distances between circles approximate topic similarity: the more similar two topics are, the closer they appear in the plot. The size of each circle corresponds to the prevalence of the topic in the corpus.
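
To make "similarity" and "size" concrete, here is a minimal sketch of the two quantities the plot encodes. As far as we know, pyLDAvis by default measures the distance between topics with the Jensen-Shannon divergence of their term distributions and then projects those distances to 2D, and it sizes circles by the token-weighted share of each topic. The distributions and document lengths below are made-up toy numbers, and the helper names are ours, not part of the pyLDAvis API.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 when p equals q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy topic-term distributions over a 4-word vocabulary (made-up numbers).
topic_a = [0.50, 0.30, 0.15, 0.05]
topic_b = [0.45, 0.35, 0.10, 0.10]   # close to topic_a
topic_c = [0.05, 0.05, 0.30, 0.60]   # very different from topic_a

d_ab = jensen_shannon(topic_a, topic_b)
d_ac = jensen_shannon(topic_a, topic_c)
print(f"JS(a, b) = {d_ab:.4f}")   # small: these circles would sit close together
print(f"JS(a, c) = {d_ac:.4f}")   # larger: these circles would sit far apart

# Circle size: token-weighted share of each topic in the corpus.
doc_lengths = [100, 50, 50]          # tokens per document (made up)
doc_topic = [[0.8, 0.1, 0.1],        # per-document topic weights (made up)
             [0.2, 0.7, 0.1],
             [0.1, 0.1, 0.8]]
topic_tokens = [sum(n * row[k] for n, row in zip(doc_lengths, doc_topic))
                for k in range(3)]
total = sum(topic_tokens)
prevalence = [t / total for t in topic_tokens]
print(prevalence)
```

The 2D projection itself (a multidimensional scaling step) is left out here; the sketch only shows the pairwise distances that the projection tries to preserve.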

When no topic is selected, the bar chart shows the 30 most salient terms (not the most frequent!), where saliency measures how informative a term is about the topics. If a term is frequent in one topic but also across the entire corpus, it gets a lower saliency score than a term that is frequent in that topic alone. Conceptually, this is similar to TF-IDF.
When a topic is selected, the chart shows the most relevant terms for that topic instead. Relevance is similar to saliency but is computed per topic: it is a weighted combination of the log of the term's probability within the topic and the log of its lift, where lambda = 1 ranks terms only by probability and lambda = 0 ranks only by lift (the ratio of a term’s probability within a topic to its marginal probability across the corpus).
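
The sketch below works through both measures on a toy model. It assumes the relevance definition from the Sievert and Shirley paper (lambda times the log probability plus one minus lambda times the log lift) and the saliency definition it builds on (term probability times the divergence between the topic distribution given the term and the overall topic distribution). All numbers, the vocabulary, and the function names are made up for illustration; this is not how pyLDAvis is implemented internally.

```python
import math

# Toy model (made-up numbers): 2 topics over a 3-word vocabulary.
vocab = ["data", "patent", "startup"]
phi = [[0.60, 0.35, 0.05],    # p(word | topic 0)
       [0.60, 0.05, 0.35]]    # p(word | topic 1)
topic_weight = [0.5, 0.5]     # p(topic)

# Marginal probability of each word across the corpus.
p_word = [sum(topic_weight[t] * phi[t][w] for t in range(2)) for w in range(3)]

def relevance(t, w, lam):
    """lam * log p(w | t) + (1 - lam) * log lift of word w in topic t."""
    lift = phi[t][w] / p_word[w]
    return lam * math.log(phi[t][w]) + (1 - lam) * math.log(lift)

def saliency(w):
    """p(w) times the KL divergence between p(topic | w) and p(topic)."""
    post = [topic_weight[t] * phi[t][w] / p_word[w] for t in range(2)]
    distinct = sum(q * math.log(q / topic_weight[t])
                   for t, q in enumerate(post) if q > 0)
    return p_word[w] * distinct

def rank(t, lam):
    """Vocabulary sorted by decreasing relevance for topic t."""
    return sorted(vocab, key=lambda term: relevance(t, vocab.index(term), lam),
                  reverse=True)

print(rank(0, 1.0))   # ranking by probability only
print(rank(0, 0.0))   # ranking by lift only
print([round(saliency(w), 4) for w in range(3)])
```

In this toy model "data" is the most probable word in both topics, so it leads the ranking at lambda = 1 but carries no saliency, while "patent" is concentrated in topic 0 and moves to the top at lambda = 0.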

In [13]:
import pickle
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.3 and later
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel

In [9]:
# used to hush the warnings that appear in pyLDAvis
import warnings
warnings.filterwarnings("ignore")
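
Ignoring every warning for the rest of the session can also hide useful messages from other libraries. As a narrower alternative, the sketch below silences one warning category inside a `with` block only. It assumes the noise consists of `DeprecationWarning`s, which we have not verified for pyLDAvis; `noisy_call` is a hypothetical stand-in for the library call.

```python
import warnings

def noisy_call():
    """Hypothetical stand-in for a library call that emits a deprecation warning."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return "result"

# Suppress only DeprecationWarning, and only inside this block;
# the previous warning filters are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    value = noisy_call()

print(value)
```

In the notebook, the call to `pyLDAvis.gensim.prepare` would take the place of `noisy_call()` inside the `with` block.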

Load the data saved in the previous step and prepare it for the visualization.

In [10]:
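dictionary = Dictionary.load('Innovation/innov_dictionary.gensim')
with open('Innovation/innov_corpus.pkl', 'rb') as f:  # close the file once the corpus is loaded
    corpus = pickle.load(f)
lda = LdaModel.load('Innovation/model5.gensim')

# sort_topics=False keeps the topic order of the gensim model instead of reordering by size
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)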

The next cell renders an interactive visualization of the topics from the LDA model.

You can select a topic manually by clicking on its circle in the plot or by entering the topic number in the control area at the top.

On the right, you see the most relevant terms for the selected topic. If you hover over a term in the bar chart on the right, the topic circles resize to show how strongly that term is associated with each topic.

In [11]:
pyLDAvis.display(lda_display)

You can learn more about the pyLDAvis visualization from a slightly more complex example:

https://nbviewer.jupyter.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb

Or read the original paper for the method:

https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf