# Topic Modeling (Prepare)

On Monday we talked about summarizing your documents using just token counts. Today, we're going to learn about a much more sophisticated approach: learning 'topics' from documents. Topics are a latent structure. They are not directly observable in the data, but a human reader can tell they are there from reading the documents.

> **latent**: existing but not yet developed or manifest; hidden or concealed.
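
To make "latent" concrete, here is a minimal sketch of the generative story that LDA assumes: each document has a hidden mixture of topics, and each word is drawn from one of those topics. The two topics, their vocabularies, and the `alpha` value are made up for illustration; a real model learns them from the corpus.

```python
import random

random.seed(42)

# Hypothetical topics: each is a probability distribution over a tiny vocabulary.
topics = {
    "horror": {"gore": 0.5, "scream": 0.3, "night": 0.2},
    "romance": {"love": 0.5, "life": 0.3, "night": 0.2},
}

def sample_dirichlet(alpha, k):
    """Draw a k-dimensional mixture from a symmetric Dirichlet(alpha)."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    """Generate one document from a hidden topic mixture."""
    names = list(topics)
    mixture = sample_dirichlet(alpha, len(names))
    words = []
    for _ in range(n_words):
        # Pick a topic for this word, then pick a word from that topic.
        topic = random.choices(names, weights=mixture)[0]
        vocab = topics[topic]
        words.append(random.choices(list(vocab), weights=list(vocab.values()))[0])
    return mixture, words

mixture, words = generate_document(8)
print(words)    # what we observe
print(mixture)  # what stays hidden, and what LDA tries to infer
```

We only ever see `words`. Fitting an LDA model amounts to working backwards from many such documents to plausible topics and mixtures.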

## Use Cases
Primary use case: what the hell are your documents about? Who in industry might want to know that? Some examples:
* Identifying common themes in customer reviews
* Discovering the needle in a haystack 
* Monitoring communications (Email - State Department) 

## Learning Objectives
*At the end of the lesson you should be able to:*
* <a href="#p0">Part 0</a>: Warm-Up
* <a href="#p1">Part 1</a>: Describe how an LDA Model works
* <a href="#p2">Part 2</a>: Estimate an LDA Model with Gensim
* <a href="#p3">Part 3</a>: Interpret LDA results
* <a href="#p4">Part 4</a>: Select the appropriate number of topics

## Warm-Up
How do we do a grid search? 

In [5]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
data = fetch_20newsgroups()

### GridSearch on Just Classifier
* Fit the vectorizer and transform the text BEFORE it goes into the grid search

In [7]:
v1 = TfidfVectorizer()
X_train = v1.fit_transform(data['data'])

In [10]:
p1 = {
    'n_estimators':[10,20],
    'max_depth': [None, 7]
}
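
A grid search simply tries every combination of the values in the grid. The sketch below expands `p1` by hand to show where the candidate count comes from; `expand_grid` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
from itertools import product

p1 = {
    'n_estimators': [10, 20],
    'max_depth': [None, 7],
}

def expand_grid(grid):
    """Return every combination of parameter values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

candidates = expand_grid(p1)
cv = 5
total_fits = len(candidates) * cv

for params in candidates:
    print(params)
print(f"{len(candidates)} candidates x {cv} folds = {total_fits} fits")
```

With `cv=5`, each of the 4 candidates is fit once per fold, which gives the 20 fits that `GridSearchCV` reports. With the default `refit=True`, the best candidate is then fit once more on the full training set.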

In [11]:
clf = RandomForestClassifier()
gs1 = GridSearchCV(clf, p1, cv=5,n_jobs=-1, verbose=1)
gs1.fit(X_train, data['target'])

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:   16.2s remaining:    1.8s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   16.4s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [12]:
gs1.predict(["No the drama llama was in Portland last week."])

ValueError: could not convert string to float: 'No the drama llama was in Portland last week.'
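
The error happens because the classifier was trained on TF-IDF features, not on strings. A minimal sketch of the workaround, using a made-up toy corpus and a plain `RandomForestClassifier` in place of the grid search: push new text through the same fitted vectorizer before predicting.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (invented for illustration), two classes.
texts = [
    "the llama ate grass in the field",
    "a llama spat at the farmer",
    "llamas live in the mountains",
    "the llama herd moved north",
    "the rocket reached orbit today",
    "engineers tested the rocket engine",
    "the launch was delayed by weather",
    "a satellite was deployed from the rocket",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, labels)

new_text = ["No the drama llama was in Portland last week."]

# Passing raw strings fails: the classifier only understands numeric features.
try:
    clf.predict(new_text)
    raw_text_failed = False
except ValueError:
    raw_text_failed = True

# Reuse the already-fitted vectorizer (transform, not fit_transform).
prediction = clf.predict(vectorizer.transform(new_text))
print(raw_text_failed, prediction)
```

The same idea should apply to the cells above as `gs1.predict(v1.transform([...]))`. It works, but it means the vectorizer has to travel with the model everywhere, which is the problem the pipeline in the next section solves.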

### GridSearch with BOTH the Vectorizer & Classifier

In [18]:
from sklearn.pipeline import Pipeline

v2 = TfidfVectorizer()
clf = RandomForestClassifier()

pipe = Pipeline([
    ('magicUnicorns', v2),
    ('clf', clf)
])

p2 = {
    'magicUnicorns__max_features':[1000,5000],
    'clf__n_estimators':[10,20],
    'clf__max_depth': [None, 7]
}

gs2 = GridSearchCV(pipe, p2, cv=5,n_jobs=-1, verbose=1)
gs2.fit(data['data'], data['target'])

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:   32.9s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('magicUnicorns',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                 

In [19]:
gs2.predict(["No the drama llama was in Portland last week."])

array([9])

Advantages to using GS with the Pipe:
* Allows us to make predictions on raw text, increasing reproducibility. :)
* Allows us to tune the parameters of the vectorizer alongside the classifier. :D 
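
The keys in `p2` follow the pattern `<step name>__<parameter>`, which is how a value is routed to one step of the pipeline. The short sketch below sets two such parameters by hand, which is roughly what the grid search does for each candidate; it reuses the step names from the cell above and nothing else.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('magicUnicorns', TfidfVectorizer()),
    ('clf', RandomForestClassifier()),
])

# '<step name>__<parameter>' routes a value to one step of the pipeline.
pipe.set_params(magicUnicorns__max_features=1000, clf__max_depth=7)

print(pipe.named_steps['magicUnicorns'].max_features)
print(pipe.named_steps['clf'].max_depth)

# List the tunable names that belong to the vectorizer step.
vectorizer_params = sorted(
    name for name in pipe.get_params() if name.startswith('magicUnicorns__')
)
print(vectorizer_params[:5])
```

`pipe.get_params()` is a handy way to look up valid key names before writing a parameter grid.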

## Intro to LDA

In [21]:
df = pd.read_csv('./data/imbd_keywords (2).csv')

In [22]:
df.head()

Unnamed: 0,review,sentiment,keywords
0,One of the other reviewers has mentioned that ...,positive,"['other shows', 'graphic violence', 'prison ex..."
1,A wonderful little production. The filming tec...,positive,"['halliwell', 'michael sheen', 'realism', 'com..."
2,I thought this was a wonderful way to spend ti...,positive,"['spirited young woman', 'devil wears prada', ..."
3,Basically there's a family where a little boy ...,negative,"['playing parents', 'jake', 'parents', 'descen..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"['mr. mattei', 'good luck', 'mattei', 'human r..."


## Estimating LDA

In [23]:
# Tiny Update to change "keywords" from string to a list
from ast import literal_eval

df['keywords'] = df['keywords'].apply(literal_eval)

In [26]:
import gensim
from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore

In [37]:
# A Dictionary representation of all the words
# Input: a list of lists of strings (tokens, lemmas, phrases, etc.)
id2word = corpora.Dictionary(df['keywords'])

In [28]:
id2word.token2id['spirited young woman']

112

In [29]:
id2word.doc2bow(['other show', 'graphic violence'])

[(16, 1)]
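
Only one pair comes back even though two phrases went in, presumably because 'other show' (singular) is not in the dictionary and unknown tokens are skipped. The sketch below mimics that behavior in plain Python. It is not gensim's implementation, and the IDs in `token2id` are invented to echo the output above.

```python
from collections import Counter

# Hypothetical token -> id mapping standing in for id2word.token2id.
token2id = {'other shows': 15, 'graphic violence': 16, 'prison experience': 17}

def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(doc2bow_sketch(['other show', 'graphic violence'], token2id))
# [(16, 1)]
print(doc2bow_sketch(['other shows', 'graphic violence', 'graphic violence'], token2id))
# [(15, 1), (16, 2)]
```

Each pair reads as (token id, number of times that token appears in the document).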

In [31]:
len(id2word.keys())

625927

In [38]:
id2word.filter_extremes(no_below=7, no_above=0.95)

In [39]:
len(id2word.keys())

25363
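
`filter_extremes` prunes the vocabulary by document frequency: `no_below=7` drops tokens that appear in fewer than 7 documents, and `no_above=0.95` drops tokens that appear in more than 95% of documents. The sketch below reproduces that idea on a toy corpus with smaller thresholds. The function name and the documents are made up, and gensim's own boundary handling may differ slightly.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most a fraction `no_above` of docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    n_docs = len(docs)
    return {
        token for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }

docs = [
    ['this movie', 'gore', 'zombies'],
    ['this movie', 'love', 'paris'],
    ['this movie', 'gore', 'vampires'],
    ['this movie', 'love', 'wedding'],
]

kept = filter_by_document_frequency(docs, no_below=2, no_above=0.75)
print(sorted(kept))
# ['gore', 'love']
```

Rare tokens carry too little signal to define a topic and near-universal tokens carry none, which is why the vocabulary above shrinks from 625927 entries to 25363.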

In [40]:
corpus = [id2word.doc2bow(text) for text in df['keywords']]

In [43]:
corpus[56]

[(24, 1),
 (33, 1),
 (64, 1),
 (66, 1),
 (105, 1),
 (344, 1),
 (439, 1),
 (486, 1),
 (588, 1),
 (780, 1),
 (1090, 1),
 (1091, 1),
 (1092, 1),
 (1093, 1),
 (1094, 1),
 (1095, 1),
 (1096, 1),
 (1097, 1),
 (1098, 1),
 (1099, 1),
 (1100, 1)]

In [45]:
lda = LdaMulticore(corpus=corpus, 
                  id2word=id2word,
                  num_topics=20, 
                  passes=50,
                  workers=12
                )

In [46]:
lda.print_topics()

[(0,
  '0.013*"this movie" + 0.012*"it" + 0.011*"i" + 0.008*"this film" + 0.007*"the movie" + 0.006*"first" + 0.005*"you" + 0.005*"start" + 0.005*"finish" + 0.004*"this one"'),
 (1,
  '0.010*"i" + 0.009*"it" + 0.008*"american" + 0.008*"this film" + 0.008*"this movie" + 0.007*"today" + 0.007*"the film" + 0.006*"british" + 0.006*"war" + 0.006*"the story"'),
 (2,
  '0.013*"i" + 0.011*"this film" + 0.009*"it" + 0.007*"people" + 0.006*"the story" + 0.006*"this movie" + 0.005*"who" + 0.005*"the movie" + 0.004*"american" + 0.004*"time"'),
 (3,
  '0.011*"it" + 0.010*"i" + 0.010*"this movie" + 0.007*"the film" + 0.006*"people" + 0.006*"this film" + 0.005*"movies" + 0.004*"you" + 0.004*"second" + 0.004*"the end"'),
 (4,
  '0.013*"first" + 0.011*"i" + 0.010*"this film" + 0.008*"the story" + 0.008*"it" + 0.007*"the end" + 0.006*"second" + 0.006*"a lot" + 0.005*"people" + 0.005*"this movie"'),
 (5,
  '0.009*"love" + 0.009*"the story" + 0.008*"the film" + 0.008*"life" + 0.005*"the end" + 0.005*"it" 

In [47]:
import re

words = [re.findall(r'"[^"]*"', t[1]) for t in lda.print_topics(20)]

In [48]:
topics = [', '.join(t[0:5]) for t in words]
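
The pattern `"[^"]*"` keeps the surrounding quotes in every match, which is why they show up in the printed topics. As a small variation, a capture group returns only the term, and a second group can pull out the weight as well. The string below is Topic 0 copied from the `print_topics()` output above.

```python
import re

# Topic 0 as printed by lda.print_topics() above.
topic_string = (
    '0.013*"this movie" + 0.012*"it" + 0.011*"i" + 0.008*"this film" + '
    '0.007*"the movie" + 0.006*"first" + 0.005*"you" + 0.005*"start" + '
    '0.005*"finish" + 0.004*"this one"'
)

# Without a group, findall returns the whole match, quotes included.
with_quotes = re.findall(r'"[^"]*"', topic_string)

# With a capture group, findall returns only the text inside the quotes.
without_quotes = re.findall(r'"([^"]*)"', topic_string)

# Two groups: pair each term with its weight.
weighted = [(term, float(w)) for w, term in re.findall(r'([\d.]+)\*"([^"]*)"', topic_string)]

print(', '.join(with_quotes[:5]))
print(', '.join(without_quotes[:5]))
print(weighted[:2])
```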

In [49]:
# Use topic_id rather than id to avoid shadowing the built-in id()
for topic_id, t in enumerate(topics): 
    print(f"------ Topic {topic_id} ------")
    print(t, end="\n\n")

------ Topic 0 ------
"this movie", "it", "i", "this film", "the movie"

------ Topic 1 ------
"i", "it", "american", "this film", "this movie"

------ Topic 2 ------
"i", "this film", "it", "people", "the story"

------ Topic 3 ------
"it", "i", "this movie", "the film", "people"

------ Topic 4 ------
"first", "i", "this film", "the story", "it"

------ Topic 5 ------
"love", "the story", "the film", "life", "the end"

------ Topic 6 ------
"i", "gore", "the film", "this movie", "this one"

------ Topic 7 ------
"the film", "it", "i", "the end", "first"

------ Topic 8 ------
"first", "it", "the movie", "the end", "the film"

------ Topic 9 ------
"i", "this movie", "the movie", "the film", "this film"

------ Topic 10 ------
"i", "this movie", "this film", "the acting", "first"

------ Topic 11 ------
"i", "this movie", "the movie", "it", "this film"

------ Topic 12 ------
"first", "this movie", "i", "the film", "it"

------ Topic 13 ------
"the movie", "i", "the film", "it", "peop

### How could we change the text processing to get better results?
- Remove Stopwords
    - film
    - this movie
    - it
    - the film
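
One low-cost way to try this is to filter the keyword phrases before building the dictionary. A minimal sketch, assuming exact matches on whole phrases are enough; `remove_stop_phrases` and the sample keyword lists are hypothetical.

```python
# Stop phrases taken from the list above; extend as needed.
stop_phrases = {'film', 'this movie', 'it', 'the film'}

def remove_stop_phrases(keywords, stop_phrases):
    """Drop keyword phrases that match a stop phrase, ignoring case and outer whitespace."""
    return [kw for kw in keywords if kw.strip().lower() not in stop_phrases]

# Hypothetical keyword lists in the same shape as df['keywords'].
keyword_docs = [
    ['This movie', 'graphic violence', 'it'],
    ['the film', 'michael sheen', 'realism'],
]

cleaned = [remove_stop_phrases(doc, stop_phrases) for doc in keyword_docs]
print(cleaned)
# [['graphic violence'], ['michael sheen', 'realism']]
```

On the real data this could be applied with `df['keywords'].apply(lambda kws: remove_stop_phrases(kws, stop_phrases))`, followed by rebuilding `id2word` and `corpus`.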

In [58]:
import spacy
nlp = spacy.load("en_core_web_lg")

# Display spaCy's default English stop words
nlp.Defaults.stop_words

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [59]:
custom = {'film', 'movie', 'story'}
nlp.Defaults.stop_words |= custom

def tokenize(text):
    """
    Parse a raw string and return a list of lemmas,
    skipping stop words, punctuation, and pronouns.
    """
    
    doc = nlp(text)
    lemmas = []
    
    for token in doc:
        # token.pos is an integer ID; token.pos_ is the string label
        if not token.is_stop and not token.is_punct and token.pos_ != 'PRON':
            lemmas.append(token.lemma_)
    
    return lemmas

In [65]:
from tqdm import tqdm

tqdm.pandas()

df['lemmas'] = df['review'].progress_apply(tokenize)

  8%|▊         | 3357/40436 [02:01<22:17, 27.73it/s]


KeyboardInterrupt: 

## Interpret LDA Results


In [67]:
# Note: in newer pyLDAvis releases this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

In [68]:
pyLDAvis.gensim.prepare(lda,corpus, id2word)