## Visualization of Topics

We use the pyLDAvis visualization by [Ben Mabey](https://github.com/bmabey/pyLDAvis), a Python port of the original R package LDAvis.

pyLDAvis shows topics as circles in a 2D plot. The distance between circles approximates topic similarity: the more similar two topics are, the closer together they appear in the plot. The size of a circle corresponds to the prevalence of the topic in the corpus.
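
To make "topic similarity" concrete, here is a minimal sketch of one common way to measure the distance between two topics, the Jensen-Shannon divergence between their term distributions. As far as we know, this is the distance pyLDAvis uses by default before projecting the topics to 2D, but treat that as an assumption; the three toy topics below are hypothetical and do not come from our model.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric distance-like score between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-term distributions over a 4-word vocabulary
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.65, 0.25, 0.05, 0.05]  # close to topic_a
topic_c = [0.05, 0.05, 0.20, 0.70]  # very different from topic_a

print("a vs b:", js_divergence(topic_a, topic_b))
print("a vs c:", js_divergence(topic_a, topic_c))
```

Topics with a small divergence (a and b here) would be drawn close together, topics with a large divergence (a and c) far apart.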

With no topic selected, the visualization shows the top 30 most salient terms (not the most frequent ones!), where saliency measures how informative a word is about the topics. A word that is frequent in a topic but also frequent across the entire corpus gets a lower saliency score than a word that is frequent in one topic alone. Conceptually, this is similar to TF-IDF.
When a topic is selected, the chart shows the most relevant terms for that topic. Relevance is similar to saliency: it is a weighted combination of a term's log probability within the topic and its log lift, where lambda = 1 ranks terms only by their probability within the topic and lambda = 0 ranks them only by lift (the ratio of a term’s probability within a topic to its marginal probability across the corpus).
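
The effect of lambda is easiest to see on a toy example. The sketch below computes relevance as lambda * log p(term | topic) + (1 - lambda) * log lift. The three terms and their probabilities are hypothetical, chosen only to show how the ranking flips between the two extremes.

```python
import math

def relevance(p_term_in_topic, p_term_in_corpus, lam):
    """Weighted mix of log probability within the topic and log lift."""
    lift = p_term_in_topic / p_term_in_corpus
    return lam * math.log(p_term_in_topic) + (1 - lam) * math.log(lift)

# Hypothetical values: term -> (p(term | topic), p(term) across the corpus)
terms = {
    "destination": (0.050, 0.040),  # frequent in the topic, but frequent everywhere
    "hotel": (0.030, 0.006),
    "spa": (0.010, 0.001),          # rare overall, concentrated in this topic
}

def rank_terms(terms, lam):
    """Return the terms sorted from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

print(rank_terms(terms, 1.0))  # ['destination', 'hotel', 'spa']: by probability
print(rank_terms(terms, 0.0))  # ['spa', 'hotel', 'destination']: by lift
```

Values of lambda between 0 and 1 trade off the two rankings, which is what the slider in the visualization controls.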

In [1]:
import pickle

from gensim import corpora
from gensim.models.ldamodel import LdaModel
import pyLDAvis.gensim  # in pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models

In [2]:
# used to hush the warnings that appear in pyLDAvis
import warnings
warnings.filterwarnings("ignore")

In [3]:
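# load the preprocessed documents (a list of token lists); the file is closed automatically
with open('../Preprocessing/tokens.pkl', 'rb') as f:
    tokens = pickle.load(f)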

Check the 100 most frequent tokens printed by the next cell for redundant words, i.e. words that carry no useful meaning for the topics.

In [None]:
from nltk.probability import FreqDist

flat_tokens = [t for doc in tokens for t in doc]
fdist = FreqDist(flat_tokens)
# print the 100 most frequent tokens with their counts
for token, count in fdist.most_common(100):
    print(token, ": ", count)
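
If the frequency list reveals redundant words, one option is to drop them from every document before building the dictionary. This is only a sketch: `remove_words` is a hypothetical helper, and the sample documents and the word list are made up for illustration.

```python
def remove_words(docs, unwanted):
    """Return a copy of docs (a list of token lists) without the unwanted tokens."""
    unwanted = set(unwanted)  # set membership tests are fast
    return [[t for t in doc if t not in unwanted] for doc in docs]

# Hypothetical documents and redundant words
sample_docs = [["tourism", "paper", "hotel"], ["paper", "study", "destination"]]
redundant = ["paper", "study"]

cleaned = remove_words(sample_docs, redundant)
print(cleaned)  # [['tourism', 'hotel'], ['destination']]
```

In this notebook the same call would be `tokens = remove_words(tokens, redundant)`, with `redundant` filled in by hand.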

In [4]:
dictionary = corpora.Dictionary(tokens)
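# no_below is an absolute number of documents, no_above a fraction of the corpus.
# With no_below=0.1 no rare token is removed; only tokens that occur in more than
# 90% of the documents are dropped. Use an integer (e.g. no_below=5) to drop rare tokens.
dictionary.filter_extremes(no_below=0.1, no_above=0.9)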
dictionary.save('LDA_dictionary.gensim')
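
The two thresholds of `filter_extremes` use different units: `no_below` is an absolute number of documents, while `no_above` is a fraction of all documents. The plain-Python sketch below mimics that filtering rule on a made-up corpus to show the difference. It is a simplified illustration, not gensim's implementation, and `keep_tokens` is a hypothetical helper.

```python
from collections import Counter

def document_frequencies(docs):
    """Count in how many documents each token occurs."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each token once per document
    return df

def keep_tokens(docs, no_below, no_above):
    """Tokens found in at least `no_below` documents (a count)
    and in at most `no_above` of all documents (a fraction)."""
    df = document_frequencies(docs)
    n_docs = len(docs)
    return sorted(t for t, n in df.items()
                  if n >= no_below and n / n_docs <= no_above)

# Hypothetical corpus of four short documents
toy_docs = [
    ["tourism", "hotel", "city"],
    ["tourism", "hotel", "park"],
    ["tourism", "hotel", "city"],
    ["tourism", "smart"],
]

print(keep_tokens(toy_docs, no_below=2, no_above=0.9))    # ['city', 'hotel']
print(keep_tokens(toy_docs, no_below=0.1, no_above=0.9))  # ['city', 'hotel', 'park', 'smart']
```

In the second call the lower bound of 0.1 documents is met by every token, so only "tourism", which appears in all documents, is removed.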

In [5]:
corpus = [dictionary.doc2bow(text) for text in tokens]
# save the bag-of-words corpus; the file is closed automatically
with open('LDA_corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
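
For readers unfamiliar with the bag-of-words format: each document becomes a list of `(token_id, count)` pairs, and tokens that are not in the dictionary are ignored. The sketch below reproduces the idea in plain Python. `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign will not necessarily match the ones gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Hypothetical documents
bow_docs = [["hotel", "city", "hotel"], ["city", "park"]]
vocab = build_vocab(bow_docs)  # {'hotel': 0, 'city': 1, 'park': 2}

# "beach" is not in the vocabulary, so it is dropped
print(to_bow(["hotel", "hotel", "city", "beach"], vocab))  # [(0, 2), (1, 1)]
```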

Decide on the number of topics you wish to observe.

In [6]:
NUM_TOPICS = 10
ldamodel = LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('LDA_model{}.gensim'.format(NUM_TOPICS))

In [7]:
topics = ldamodel.print_topics(num_words=10)
for topic_id, topic_terms in topics:
    print("Topic {}:".format(topic_id + 1))
    # each entry has the form 0.021*"development"
    for entry in topic_terms.split(' + '):
        weight, term = entry.split('*')
        print('   {}: {}'.format(term, weight))

Topic 1:
   "development": 0.021
   "economic": 0.008
   "industry": 0.007
   "business": 0.006
   "region": 0.006
   "destination": 0.006
   "new": 0.006
   "model": 0.006
   "regional": 0.006
   "process": 0.006
Topic 2:
   "service": 0.015
   "technology": 0.010
   "information": 0.009
   "destination": 0.007
   "customer": 0.007
   "industry": 0.007
   "travel": 0.006
   "management": 0.006
   "use": 0.006
   "hotel": 0.006
Topic 3:
   "destination": 0.008
   "hotel": 0.007
   "city": 0.006
   "park": 0.005
   "sector": 0.005
   "website": 0.004
   "activity": 0.004
   "service": 0.004
   "local": 0.004
   "development": 0.004
Topic 4:
   "industry": 0.007
   "development": 0.006
   "service": 0.006
   "city": 0.006
   "smart": 0.005
   "theme": 0.005
   "article": 0.005
   "process": 0.004
   "technology": 0.004
   "issue": 0.004
Topic 5:
   "community": 0.008
   "knowledge": 0.008
   "new": 0.008
   "local": 0.008
   "development": 0.007
   "approach": 0.005
   "product": 0.004
 

An interactive visualization of the topics from the LDA model.

You can select a topic manually by clicking on its circle in the plot or by entering the topic number in the control area at the top.

On the right, you see the most relevant terms for the selected topic. If you hover over a word in the bar chart on the right, the topic circles resize according to how strongly that term is associated with each topic.
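
One way to read the resized circles is as the share of a term's occurrences that falls into each topic. The sketch below computes such shares with Bayes' rule from made-up numbers. That pyLDAvis sizes the circles exactly this way is our assumption, and `topic_shares_for_term` is a hypothetical helper.

```python
def topic_shares_for_term(p_term_given_topic, p_topic):
    """p(topic | term) for every topic, from p(term | topic) and p(topic)."""
    joint = [pw * pt for pw, pt in zip(p_term_given_topic, p_topic)]
    total = sum(joint)  # marginal probability of the term
    return [j / total for j in joint]

# Hypothetical model with three topics
p_topic = [0.5, 0.3, 0.2]               # prevalence of each topic
p_hotel_given_topic = [0.01, 0.05, 0.0]  # probability of "hotel" in each topic

shares = topic_shares_for_term(p_hotel_given_topic, p_topic)
print(shares)
```

Here the second topic accounts for about three quarters of the occurrences of "hotel", the first for about one quarter, and the third for none, so the third circle would disappear for this term.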

In [None]:
lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)
# pyLDAvis will throw a FutureWarning, which you can ignore
pyLDAvis.display(lda_display)