# DS15 Lecture Topic Modeling
## IMDB Movie Reviews
Author: DS15
## Guiding Question
What are common themes in movies that people liked?
## Approach
 - Apply NLP to movie reviews
 - Estimate a topic model on the movie reviews
 - Create visualizations relating the topics to the review scores

In [1]:
import pandas as pd

df = pd.read_csv('IMDB Dataset.csv')

FileNotFoundError: [Errno 2] File IMDB Dataset.csv does not exist: 'IMDB Dataset.csv'

In [None]:
print(df.shape)  # (rows, columns)
df.head()        # first five rows

In [None]:
# Replace HTML line breaks with a space so adjacent words are not glued together
df['review'] = df['review'].apply(lambda x: x.replace('<br />', ' '))

## Applying NLP

* We need to tokenize the text
* How should we do it?
 - Use Spacy
 - Figure out what our unit of analysis is (lemmas, adjs, keywords, nouns, spacy tokens, etc.)

In [None]:
import spacy 

nlp = spacy.load('en_core_web_lg')

In [None]:
# Try lemmatization as our first experiment

def get_lemmas(text):
    
    lemmas = []
    
    doc = nlp(text)
    
    for token in doc:
        # token.pos_ is the string tag; token.pos is an integer id and never equals 'PRON'
        conditions = (not token.is_stop) and (not token.is_punct) and (token.pos_ != 'PRON')
        if conditions: 
            lemmas.append(token.lemma_)
            
    return lemmas

In [None]:
# Same lemmatization as above, applied to a whole collection of texts,
# with a progress printout because the full dataset takes a long time

def get_lemmas(texts):
    
    ln = len(texts)
    all_lemmas = []
    
    for i, text in enumerate(texts):
        
        lemmas = []
        
        doc = nlp(text)
        
        for token in doc:
            
            conditions = (not token.is_stop) and (not token.is_punct) and (token.pos_ != 'PRON')
            if conditions: 
                lemmas.append(token.lemma_)
                
        all_lemmas.append(lemmas)
        print(f"{((i + 1)/ln)*100:.2f}%") 
            
    # Return one list of lemmas per input text
    return all_lemmas

In [None]:
# This version takes the whole column, so call it directly instead of using .apply
df['lemmas'] = get_lemmas(df['review'])

## Topic Modeling w/ Gensim
- Learn a vocabulary
- Create a bag of words representation of each document
- Estimate our LDA model
- Clean up the results
- Add topic information back to dataframe
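
The first two steps can be illustrated without Gensim. The following is a minimal, dependency-free sketch of what "learning a vocabulary" and "bag of words" mean; the helper names `build_vocabulary` and `to_bow` and the toy documents are made up for illustration, and the integer ids Gensim assigns will generally differ.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical toy documents, already tokenized
toy_docs = [["great", "plot", "great", "cast"], ["weak", "plot"]]

vocab = build_vocabulary(toy_docs)
bows = [to_bow(doc, vocab) for doc in toy_docs]

print(vocab)
print(bows)
```

Each document becomes a short list of `(token_id, count)` pairs, which is the same shape of input the LDA model consumes.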


In [None]:
import gensim
from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore

In [None]:
id2word = corpora.Dictionary(df['lemmas'])

In [None]:
# Drop tokens that appear in fewer than 50 reviews or in more than 90% of reviews

id2word.filter_extremes(no_below=50, no_above=.90)
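
To make the two thresholds concrete, here is a rough pure-Python sketch of document-frequency filtering. It assumes `no_below` is an absolute document count and `no_above` is a fraction of all documents; the function name and toy data are hypothetical, and the sketch ignores any other options the Gensim method may apply.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most `no_above` (fraction) of docs."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {
        token
        for token, freq in doc_freq.items()
        if no_below <= freq <= no_above * n_docs
    }

# Hypothetical toy corpus: "movie" is in every document, several words appear only once
toy_docs = [
    ["movie", "great", "plot"],
    ["movie", "weak", "plot"],
    ["movie", "great", "cast"],
    ["movie", "slow"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.9)
print(sorted(kept))
```

Here "movie" is dropped for being too common and the one-off words for being too rare, leaving only the mid-frequency tokens that help tell topics apart.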

In [None]:
# Returns how many tokens remain in the dictionary after filtering

len(id2word.keys())

In [None]:
corpus = [id2word.doc2bow(doc) for doc in df['lemmas']]

lda = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   num_topics=15,
                   passes=10,  # number of full passes over the corpus; each pass re-fits and updates the parameters, so more passes usually improve the model at the cost of training time
                   workers=12,
                   random_state=812
                   )

In [None]:
import re

words = [re.findall(r'"([^"]*)"',t[1]) for t in lda.print_topics()]


In [None]:
topics = [' '.join(t[0:5]) for t in words]

In [None]:
print(topics[0])
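
The regular expression above relies on the string format of each topic, where every word is quoted and preceded by its weight. The sketch below applies the same pattern to a single made-up topic string (the words and weights are hypothetical, not model output) to show what gets extracted.

```python
import re

# Hypothetical topic string, written in the weight*"word" format used for printed topics
sample_topic = '0.021*"war" + 0.015*"soldier" + 0.011*"battle" + 0.009*"army"'

# Everything between double quotes is a topic word
topic_words = re.findall(r'"([^"]*)"', sample_topic)

# The number before each asterisk is that word's weight in the topic
topic_weights = [float(w) for w in re.findall(r'(\d+\.\d+)\*', sample_topic)]

# A short human-readable label built from the top three words
label = ' '.join(topic_words[:3])

print(topic_words)
print(topic_weights)
print(label)
```

If parsing strings feels fragile, `lda.show_topic(topic_id, topn=5)` returns `(word, probability)` pairs directly.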

In [None]:
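# If we need to remove extra stop words after running the function on the entire dataframe
# (rerun the dictionary, corpus, and LDA cells afterwards so the model reflects the change)

extra_stops = ['movie', 'film', 'character', 'actor']
df['lemmas'] = df['lemmas'].apply(lambda x: [l for l in x if l not in extra_stops])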

## Analyzing the Results of LDA
- How good are the topics themselves?
    * Using intertopic distance visualization
    * Looking at some of the token distributions 
- How can we use the LDA topics for analysis?
    * Score each review with the top topic
    * Create summary visualizations of top topic vs. sentiment 
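
One possible way to do the scoring step is sketched below in plain Python. It assumes each review's topic distribution is a list of `(topic_id, probability)` pairs, as `lda[bow]` returns for a single document, and that sentiment is stored as the strings `"positive"` and `"negative"`. The helper names and the toy data are hypothetical.

```python
from collections import defaultdict

def top_topic(topic_probs):
    """Return the topic id with the highest probability, or None if the list is empty."""
    if not topic_probs:
        return None
    return max(topic_probs, key=lambda pair: pair[1])[0]

def positive_share_by_topic(top_topics, sentiments):
    """For each top topic, compute the fraction of its reviews labeled 'positive'."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for topic, sentiment in zip(top_topics, sentiments):
        totals[topic] += 1
        if sentiment == "positive":
            positives[topic] += 1
    return {topic: positives[topic] / totals[topic] for topic in totals}

# Hypothetical per-review topic distributions and sentiment labels
doc_topics = [
    [(0, 0.7), (3, 0.3)],
    [(3, 0.9), (5, 0.1)],
    [(0, 0.55), (5, 0.45)],
    [(3, 0.6), (0, 0.4)],
]
sentiments = ["positive", "negative", "negative", "negative"]

tops = [top_topic(probs) for probs in doc_topics]
shares = positive_share_by_topic(tops, sentiments)

print(tops)
print(shares)
```

On the real data, the per-review input would come from something like `[lda[bow] for bow in corpus]`, and the resulting shares can be drawn as a bar chart of topic versus positive rate.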

In [6]:
# Note: pyLDAvis 3.3+ renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

In [None]:
pyLDAvis.gensim.prepare(lda, corpus, id2word)

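In the intertopic distance map, the circles should be roughly equal in size and far apart from one another. Overlapping circles suggest topics that are not distinct, and one very large circle suggests a topic that is absorbing too much of the corpus.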