# LDA Model Exploration 

**Questions**

If the `print_topics` and `show_topic` methods show the probability of each word under a topic, why are the numbers so low?
The probabilities should sum to one over all the words in the vocabulary, right? So how can this be the interpretation if the most significant word in a topic such as Topic 0 has a probability of only about 2%?
Is it simply because the vocabulary is so large that a probability of 2% is actually high relative to all the other words? There are just over 4 million tokens in the cleaned corpus.
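
One way to sanity-check this interpretation is to compare a top word's probability with the uniform baseline 1/V, where V is the number of distinct words in the dictionary (not the 4 million token count). The sketch below uses only the standard library; the vocabulary size of 50,000 and the Zipf-like weights are hypothetical stand-ins, not values taken from this model.

```python
# Hypothetical vocabulary size; the real value would be len(model1.id2word).
vocab_size = 50_000

# Zipf-like weights (rank r gets weight 1/r), normalized into a distribution.
weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
total_weight = sum(weights)
topic_probs = [w / total_weight for w in weights]

uniform_baseline = 1.0 / vocab_size  # probability if every word were equally likely
top_prob = max(topic_probs)
lift = top_prob / uniform_baseline   # how many times more likely than uniform

print(f"sum of probabilities: {sum(topic_probs):.6f}")
print(f"uniform baseline:     {uniform_baseline:.6f}")
print(f"top word probability: {top_prob:.4f} ({lift:.0f}x uniform)")
```

Even in this toy distribution the top word stays well below 10% while being thousands of times more likely than the uniform baseline. For the real model, the same comparison could be run on a row of `model1.get_topics()`, which gensim documents as a `(num_topics, vocabulary_size)` matrix of word probabilities.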

## import packages 

In [14]:
import pandas as pd 

# gensim
from gensim.models.ldamulticore import LdaMulticore
from gensim.corpora.dictionary import Dictionary

# LDA viz
import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis releases (3.x) name this module pyLDAvis.gensim_models

## load the LDA model 

In [3]:
model1 = LdaMulticore.load("results/lda_model1.gensim")

## print topics
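This method returns the topics as formatted strings. gensim documents LDA topics as having no natural ordering, so the order shown here (by topic id) should not be read as a ranking by significance.

Each word in a topic is prefixed by its probability under that topic.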

In [13]:
model1.print_topics(num_words=20)

[(0,
  '0.020*"movie" + 0.010*"love" + 0.009*"like" + 0.009*"character" + 0.007*"time" + 0.007*"story" + 0.007*"great" + 0.007*"watch" + 0.007*"film" + 0.007*"think" + 0.006*"good" + 0.006*"see" + 0.006*"play" + 0.006*"well" + 0.006*"life" + 0.005*"know" + 0.005*"find" + 0.005*"year" + 0.004*"come" + 0.004*"family"'),
 (1,
  '0.013*"film" + 0.006*"man" + 0.006*"war" + 0.004*"world" + 0.004*"people" + 0.004*"time" + 0.004*"story" + 0.004*"life" + 0.003*"movie" + 0.003*"like" + 0.003*"live" + 0.003*"way" + 0.003*"character" + 0.003*"human" + 0.003*"come" + 0.003*"take" + 0.003*"know" + 0.003*"end" + 0.002*"american" + 0.002*"year"'),
 (2,
  '0.018*"film" + 0.009*"play" + 0.006*"role" + 0.006*"well" + 0.005*"good" + 0.005*"man" + 0.005*"performance" + 0.005*"star" + 0.005*"cast" + 0.004*"character" + 0.004*"scene" + 0.004*"time" + 0.004*"great" + 0.004*"look" + 0.004*"get" + 0.003*"actor" + 0.003*"john" + 0.003*"work" + 0.003*"like" + 0.003*"plot"'),
 (3,
  '0.058*"film" + 0.009*"like" + 
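
The strings above follow the pattern `probability*"word" + ...`. When only this printed form is available, it can be turned back into numbers with a small parser. The sketch below is a minimal example; the helper name `parse_topic_string` is hypothetical, and the input is the first four terms of Topic 0 copied from the output.

```python
import re

# First four terms of Topic 0, copied from the print_topics output.
topic_string = '0.020*"movie" + 0.010*"love" + 0.009*"like" + 0.009*"character"'

def parse_topic_string(text):
    """Turn 'p*"word" + ...' into a list of (word, probability) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', text)
    return [(word, float(prob)) for prob, word in pairs]

parsed = parse_topic_string(topic_string)
print(parsed)
```

The printed probabilities are rounded to three decimals, so values recovered this way are less precise than those returned by `show_topic`.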

**The `show_topic` method returns a list of `(word, probability)` tuples for a single topic, sorted so that the first tuple holds the word with the highest probability under that topic.**

In [5]:
model1.show_topic(topicid=0, topn=10)

[('movie', 0.01983073),
 ('love', 0.01042427),
 ('like', 0.009094607),
 ('character', 0.00899004),
 ('time', 0.007485336),
 ('story', 0.0073223356),
 ('great', 0.0072282213),
 ('watch', 0.006888573),
 ('film', 0.006662686),
 ('think', 0.006589944)]
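
Summing the ten probabilities above shows how little of the topic's total mass the listed words cover, which is one reason each individual value looks small. The values below are copied from the `show_topic` output; the only assumption is that the topic's word probabilities sum to one over the whole vocabulary.

```python
# (word, probability) pairs copied from the show_topic output for Topic 0.
topic0_top10 = [
    ("movie", 0.01983073),
    ("love", 0.01042427),
    ("like", 0.009094607),
    ("character", 0.00899004),
    ("time", 0.007485336),
    ("story", 0.0073223356),
    ("great", 0.0072282213),
    ("watch", 0.006888573),
    ("film", 0.006662686),
    ("think", 0.006589944),
]

top10_mass = sum(prob for _, prob in topic0_top10)
remaining_mass = 1.0 - top10_mass  # spread over the rest of the vocabulary

print(f"top-10 mass:    {top10_mass:.4f}")
print(f"remaining mass: {remaining_mass:.4f}")
```

The ten most probable words account for roughly 9% of Topic 0, so the remaining mass is spread thinly over every other word in the dictionary.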

## get top topics 

This method returns the topics ordered by coherence score, highest first. Each entry pairs a topic's list of `(probability, word)` tuples with its coherence score. Note that the order in `print_topics` should not be read as a ranking: gensim documents LDA topics as having no natural ordering, and the output above simply follows topic ids. In `top_topics`, by contrast, **Topic 3** comes first because it is the most coherent under the `u_mass` coherence metric.
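
As a rough illustration of what a `u_mass`-style score measures, the sketch below computes a document co-occurrence coherence in the spirit of the UMass metric: for each pair of top words, it takes the log of how often the two words co-occur in documents relative to how often the higher-ranked word occurs. The function name, the toy documents, the +1 smoothing, and the averaging over pairs are assumptions for illustration; gensim's own implementation may differ in these details.

```python
import math
from itertools import combinations

def umass_style_coherence(top_words, docs):
    """Average log((D(w_hi, w_lo) + 1) / D(w_hi)) over ordered word pairs.

    top_words: topic words, most probable first.
    docs: list of tokenized documents (lists of strings).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        w_hi, w_lo = top_words[i], top_words[j]  # w_hi is ranked above w_lo
        d_hi = doc_freq(w_hi)
        if d_hi == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_freq(w_hi, w_lo) + 1) / d_hi))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy documents, invented for this example.
toy_docs = [
    ["film", "story", "plot"],
    ["film", "story"],
    ["film", "war"],
    ["cat", "dog"],
]

coherent = umass_style_coherence(["film", "story"], toy_docs)
incoherent = umass_style_coherence(["film", "cat"], toy_docs)
print(coherent, incoherent)
```

Under this kind of measure, higher (less negative) scores indicate top words that tend to appear in the same documents, so the pair that co-occurs scores above the pair that never does.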

In [11]:
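# load the tokenized documents
docs = pd.read_parquet("data/train_clean.parquet")
docs = docs.tokenized_docs
# rebuild the dictionary and the bag-of-words corpus
# (assumes this reproduces the id mapping used to train model1; compare with model1.id2word)
imdb_dictionary = Dictionary(docs)
imdb_corpus = [imdb_dictionary.doc2bow(doc) for doc in docs]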

In [12]:
model1.top_topics(corpus=imdb_corpus, coherence='u_mass')

[([(0.057911348, 'film'),
   (0.008806334, 'like'),
   (0.0075893807, 'character'),
   (0.0075204363, 'story'),
   (0.0067268494, 'time'),
   (0.006577165, 'scene'),
   (0.00598517, 'movie'),
   (0.005974197, 'see'),
   (0.0052036904, 'watch'),
   (0.004626276, 'well'),
   (0.004608676, 'good'),
   (0.0045326888, 'director'),
   (0.0045085903, 'work'),
   (0.004289764, 'feel'),
   (0.0041118073, 'look'),
   (0.004108855, 'plot'),
   (0.004057151, 'think'),
   (0.003788145, 'way'),
   (0.0037632992, 'end'),
   (0.0037044906, 'act')],
  -1.1627743286282455),
 ([(0.05261422, 'movie'),
   (0.016147379, 'like'),
   (0.011652745, 'bad'),
   (0.010767524, 'watch'),
   (0.0105256215, 'good'),
   (0.008885056, 'think'),
   (0.007826289, 'time'),
   (0.0075075435, 'see'),
   (0.007425785, 'film'),
   (0.007005749, 'look'),
   (0.0067885183, 'wrong'),
   (0.0066666137, 'act'),
   (0.0064782854, 'get'),
   (0.0062964913, 'thing'),
   (0.0060903034, 'go'),
   (0.0059217117, 'people'),
   (0.0057807

## Visualize

In [15]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model1, imdb_corpus, imdb_dictionary)
vis

## Save LDA vis

In [16]:
pyLDAvis.save_html(data=vis, fileobj="lda_model1.html")