# Guiding Question: What are customers saying about our movies?

## Our approach to analyzing:
- Analyze text of movie reviews
- Clean the review text
- HOW: Topic Modeling
- Label the reviews with their most important topics
- Visualize the results

In [1]:
%pwd

'/Users/ekselan/Documents/GitHub/DS-Unit-4-Machine-Learning/1-NLP/DS-Unit-4-Sprint-1-NLP/module4-topic-modeling'

In [2]:
import pandas as pd

PATH = "/Users/ekselan/Desktop/LAMBDA/134715_320111_compressed_IMDB Dataset.csv.zip"
df = pd.read_csv(PATH)

print(df.shape)
df.head()

(50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Clean Text Data

In [None]:
# Remove br tags
"""
1. .replace method
2. BeautifulSoup get text
3. Regex
"""
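Options 1 and 3 from the list above need only the standard library. A minimal sketch, assuming the only markup in the reviews is `<br />`-style line breaks; the sample string is a toy modeled on row 1 of the DataFrame, and `strip_br_replace` / `strip_br_regex` are illustrative names, not part of the notebook:

```python
import re

# Toy string modeled on row 1 of the DataFrame above (not the full review).
raw = "A wonderful little production. <br /><br />Loved it."

def strip_br_replace(text):
    """Option 1: str.replace only catches the exact spelling '<br />'."""
    return " ".join(text.replace("<br />", " ").split())

def strip_br_regex(text):
    """Option 3: a regex also catches variants such as <br>, <br/>, <BR />."""
    no_tags = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    return " ".join(no_tags.split())

print(strip_br_replace(raw))
print(strip_br_regex(raw))
```

Both helpers collapse the leftover whitespace with `" ".join(text.split())`. Neither handles other HTML tags, which is one reason to prefer a real parser such as BeautifulSoup.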

In [3]:
from bs4 import BeautifulSoup

def clean_description(desc):
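    # Name the parser explicitly so results do not depend on which parsers are installed
    soup = BeautifulSoup(desc, "html.parser")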
    return soup.get_text()

In [4]:
df['review'] = df['review'].apply(clean_description)

In [5]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [14]:
samp = df.sample(15000)

## Tokenization

- Spacy
- Pure Python (stopwords, punct, etc)
- Gensim
- NLTK (not recommended)
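For comparison with the spaCy approach used below, here is a rough sketch of the "pure Python" option. The stop word set is a hypothetical, deliberately tiny list, and `tokenize` is an illustrative name; unlike spaCy, this does no lemmatization.

```python
import string

# Hypothetical, deliberately tiny stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "this", "was", "to", "and", "i"}

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]

print(tokenize("I thought this was a wonderful way to spend time."))
```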

In [7]:
import spacy

nlp = spacy.load("en_core_web_lg") #> md and lg use good word embeddings

In [15]:
## Use Lemmas as our tokens

tokens = []

for doc in nlp.pipe(samp['review']):
    
    doc_tokens = []
    
    for token in doc:
        # token.pos is an integer ID; token.pos_ is the string label we want to compare
        if not token.is_stop and not token.is_punct and token.pos_ != 'PRON':
            doc_tokens.append(token.lemma_.strip())
            
    tokens.append(doc_tokens)

In [19]:
# Sanity check: we should have one token list per sampled review

len(tokens) == samp.shape[0]

True

## Gensim LDA Topic Modeling

In [16]:
import gensim

from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore

Sklearn's `CountVectorizer` vs. `gensim.corpora`

| `CountVectorizer` | `gensim.corpora` | Description |
| --- | --- | --- |
| Instance | - | N/A |

In [17]:
id2word = corpora.Dictionary(tokens)

In [18]:
len(id2word.keys())

71879

In [23]:
id2word.filter_extremes(no_below=10, no_above=.95)

In [24]:
len(id2word.keys())

11468
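The drop from 71879 to 11468 terms comes from document-frequency filtering. A pure-Python sketch of the idea behind `no_below` and `no_above`, on made-up toy documents; `filter_extremes_sketch` is an illustrative name and this is not gensim's implementation (which, among other things, also caps the vocabulary size):

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above):
    """Keep tokens found in >= no_below docs and in <= no_above (a fraction) of docs."""
    # Count each token once per document (document frequency, not term frequency).
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, n_docs in doc_freq.items() if no_below <= n_docs <= max_docs}

# Made-up toy corpus: "movie" is in every doc, "zzz" in only one.
toy_docs = [
    ["movie", "good", "plot"],
    ["movie", "bad", "plot"],
    ["movie", "good", "acting"],
    ["movie", "zzz"],
]
kept = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "movie" is dropped for being too common and "bad", "acting", and "zzz" for being too rare, leaving only "good" and "plot".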

In [25]:
# Note: this rebinds the name `corpora`, shadowing the gensim.corpora module imported above
corpora = [id2word.doc2bow(doc) for doc in tokens]

In [None]:
corpora[0]
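Each element of the bag-of-words corpus is a list of `(token_id, count)` pairs. A pure-Python sketch of that representation on toy tokens; `build_vocab` and `doc2bow_sketch` are illustrative names, and gensim's `Dictionary` may assign IDs in a different order:

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def doc2bow_sketch(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the vocab are skipped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_tokens = [["movie", "good", "movie"], ["film", "good"]]
vocab = build_vocab(toy_tokens)
print(doc2bow_sketch(toy_tokens[0], vocab))
```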

In [26]:
lda = LdaMulticore(corpus=corpora,
                   id2word=id2word,
                   num_topics=15,
                   passes=100,
                   workers=12)

In [27]:
lda.print_topics()

[(0,
  '0.014*"play" + 0.014*"game" + 0.013*"movie" + 0.011*"like" + 0.009*"school" + 0.008*"music" + 0.008*"love" + 0.008*"great" + 0.006*"girl" + 0.006*"song"'),
 (1,
  '0.080*"movie" + 0.020*"bad" + 0.018*"like" + 0.016*"good" + 0.015*"watch" + 0.015*"film" + 0.012*"think" + 0.011*"see" + 0.010*"time" + 0.008*"character"'),
 (2,
  '0.051*"film" + 0.016*"horror" + 0.015*"movie" + 0.013*"see" + 0.013*"good" + 0.009*"watch" + 0.008*"great" + 0.008*"year" + 0.007*"time" + 0.007*"dvd"'),
 (3,
  '0.009*"kill" + 0.007*"like" + 0.007*"get" + 0.007*"scene" + 0.007*"film" + 0.006*"bad" + 0.006*"look" + 0.006*"movie" + 0.006*"guy" + 0.005*"man"'),
 (4,
  '0.040*"movie" + 0.024*"funny" + 0.022*"like" + 0.015*"comedy" + 0.013*"laugh" + 0.012*"watch" + 0.012*"good" + 0.010*"bad" + 0.009*"think" + 0.009*"time"'),
 (5,
  '0.014*"film" + 0.006*"scene" + 0.006*"New" + 0.006*"good" + 0.005*"go" + 0.005*"play" + 0.005*"man" + 0.004*"car" + 0.004*"thriller" + 0.004*"plot"'),
 (6,
  '0.037*"film" + 0.006

In [28]:
import re
words = [re.findall('"([^"]*)"',t[1]) for t in lda.print_topics()]
topics = [' '.join(t[0:5]) for t in words]
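To see what the regular expression does, here it is applied to the first few terms of topic 0, copied from the `print_topics()` output above. It captures everything between each pair of double quotes, and the first five captures become the topic label:

```python
import re

# First terms of topic 0, copied from the print_topics() output above.
topic_string = (
    '0.014*"play" + 0.014*"game" + 0.013*"movie" + '
    '0.011*"like" + 0.009*"school" + 0.008*"music"'
)

terms = re.findall(r'"([^"]*)"', topic_string)  # text inside each pair of quotes
label = " ".join(terms[:5])                     # first five terms as a readable label
print(label)
```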

In [29]:
for topic_id, t in enumerate(topics):  # avoid shadowing the built-in `id`
    print(f"----- Topic {topic_id} -----")
    print(t, end="\n\n")

----- Topic 0 -----
play game movie like school

----- Topic 1 -----
movie bad like good watch

----- Topic 2 -----
film horror movie see good

----- Topic 3 -----
kill like get scene film

----- Topic 4 -----
movie funny like comedy laugh

----- Topic 5 -----
film scene New good go

----- Topic 6 -----
film story time work life

----- Topic 7 -----
series episode tv show season

----- Topic 8 -----
film play role performance good

----- Topic 9 -----
film character like good story

----- Topic 10 -----
book play story good character

----- Topic 11 -----
film good western great short

----- Topic 12 -----
movie action good fight bad

----- Topic 13 -----
film life love man people

----- Topic 14 -----
movie love kid like child



## Interpret LDA Results

1. Topic Term Distribution (how good is our LDA model?) / what are the topics
2. Document Topic Distribution
    - What are the documents about?
    - What are the most common themes(topics)?
    - What topics are associated with positive and negative sentiment?

In [30]:
# Part 1: Topic Distance Visualization
"""
Tells us if the topics are distinct, and what terms are most important
"""

# Newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

In [32]:
pyLDAvis.gensim.prepare(lda, corpora, id2word)

# Generally don't want overlapping topics

In [None]:
# Part 2: What are the documents about?
# Eqv. to a `.predict` statement in sklearn
# Scoring the topic distribution of a single document

lda[corpora[1]]

In [34]:
distro = [lda[d] for d in corpora] #> score all docs in corpus

In [35]:
def update(doc):
    # Start every topic at 0 so topics missing from the sparse output still get a column
    d_dist = {k: 0 for k in range(lda.num_topics)}
    for t in doc:
        d_dist[t[0]] = t[1]
    return d_dist

new_distro = [update(d) for d in distro]
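The reason for `update` is that scoring a document returns a sparse list of `(topic_id, probability)` pairs: by default gensim omits topics whose probability falls below a small threshold. A standalone sketch of the same densifying step, using the non-zero values of row 0 from the `doc_topics` table in this notebook; `to_dense` is an illustrative name:

```python
def to_dense(sparse_doc, num_topics):
    """Expand sparse (topic_id, prob) pairs into a fixed-length list, zero-filled."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_doc:
        dense[topic_id] = prob
    return dense

# Non-zero entries of row 0 in the doc_topics table.
sparse_row = [(0, 0.286967), (3, 0.213201), (4, 0.293703), (13, 0.195935)]
dense_row = to_dense(sparse_row, num_topics=15)

# Index of the largest probability, i.e. the document's primary topic.
primary = max(range(len(dense_row)), key=dense_row.__getitem__)
print(primary)
```

The primary topic here is index 4, which corresponds to the "movie funny like comedy laugh" label.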

In [36]:
doc_topics = pd.DataFrame.from_records(new_distro)
doc_topics.columns = topics

In [37]:
doc_topics.head()

Unnamed: 0,play game movie like school,movie bad like good watch,film horror movie see good,kill like get scene film,movie funny like comedy laugh,film scene New good go,film story time work life,series episode tv show season,film play role performance good,film character like good story,book play story good character,film good western great short,movie action good fight bad,film life love man people,movie love kid like child
0,0.286967,0.0,0.0,0.213201,0.293703,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.195935,0.0
1,0.150832,0.0,0.0,0.031807,0.0,0.0,0.040211,0.0,0.419399,0.0,0.230947,0.0,0.0,0.115591,0.0
2,0.0,0.0,0.096118,0.197774,0.0,0.0,0.363851,0.0,0.170439,0.0,0.0,0.034946,0.0,0.0,0.133371
3,0.39535,0.0,0.0,0.0,0.248255,0.0,0.0,0.055433,0.0,0.0,0.121306,0.0,0.0,0.170517,0.0
4,0.102382,0.387505,0.0,0.257865,0.0,0.0,0.0,0.0,0.0,0.04157,0.0,0.0,0.0,0.201652,0.0


In [39]:
doc_topics['primary_topic'] = doc_topics.idxmax(axis=1)

In [40]:
doc_topics['primary_topic'].value_counts() #> getting idea of most important topics

movie bad like good watch          3935
film story time work life          1693
film character like good story     1615
film life love man people          1323
kill like get scene film           1184
movie funny like comedy laugh      1127
film horror movie see good          872
book play story good character      631
movie love kid like child           534
film play role performance good     512
film scene New good go              387
series episode tv show season       379
play game movie like school         366
film good western great short       223
movie action good fight bad         219
Name: primary_topic, dtype: int64
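One way to approach the sentiment question is to tally sentiment labels within each primary topic. A minimal sketch on hypothetical pairs (the counts below are made up for illustration, not results from this model); in the notebook the pairs could presumably be built with something like `zip(doc_topics['primary_topic'], samp['sentiment'])`, assuming both are still in the same row order:

```python
from collections import Counter, defaultdict

# Hypothetical (primary_topic, sentiment) pairs, one per review; not real results.
pairs = [
    ("movie bad like good watch", "negative"),
    ("movie bad like good watch", "negative"),
    ("movie bad like good watch", "positive"),
    ("film life love man people", "positive"),
    ("film life love man people", "positive"),
]

# counts[topic] is a Counter of sentiment labels for that topic.
counts = defaultdict(Counter)
for topic, sentiment in pairs:
    counts[topic][sentiment] += 1

def positive_share(topic):
    """Fraction of a topic's reviews that are labeled positive."""
    topic_counts = counts[topic]
    return topic_counts["positive"] / sum(topic_counts.values())

for topic in counts:
    print(f"{topic}: {positive_share(topic):.2f} positive")
```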

## Selecting the Number of Topics (Learn)

In [None]:
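from gensim.models.coherencemodel import CoherenceModel

def compute_coherence_values(dictionary, corpus, limit, start=2, step=3, passes=5):
    """
    Compute u_mass coherence for various numbers of topics
    
    Parameters:
    -----------
    dictionary: Gensim dictionary
    corpus: Gensim corpus
    limit: Max number of topics
    start: Min number of topics
    step: Increment between the numbers of topics tried
    passes: the number of times the entire lda model & coherence values are calculated
    
    Returns:
    ---------
    coherence_values: list of dicts, one per (pass, number of topics)
    """
    # Sketch of the body, assuming one LDA fit and one u_mass score per topic count,
    # repeated `passes` times; the record keys match the columns plotted below.
    coherence_values = []
    for pass_id in range(passes):
        for num_topics in range(start, limit, step):
            model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics)
            cm = CoherenceModel(model=model, corpus=corpus,
                                dictionary=dictionary, coherence='u_mass')
            coherence_values.append({'pass': pass_id,
                                     'num_topic': num_topics,
                                     'coherence_score': cm.get_coherence()})
    return coherence_values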

In [None]:
# Will take a long time to run - test first with a very small sample

coherence_values = compute_coherence_values(dictionary=id2word,
                                            corpus=corpora,
                                            start=3,
                                            limit=40,
                                            step=2,
                                            passes=1)

In [None]:
topic_coherence = pd.DataFrame.from_records(coherence_values)

In [None]:
topic_coherence.head()

In [None]:
### looking for the highest point past zero when looking at this plot

import seaborn as sns

az = sns.lineplot(x="num_topic", y="coherence_score", data=topic_coherence)

### Once you have the optimal number of topics, re-run the LDA model with
### that number, increase the number of passes to get more stable results,
### and then interpret
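The selection step can also be done numerically rather than by eye. A sketch assuming records with the same `num_topic` and `coherence_score` keys that the plot uses, and assuming a higher (closer to zero) u_mass score is better; the scores below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical coherence records (two repeats per topic count); scores are made up.
records = [
    {"num_topic": 3, "coherence_score": -1.9},
    {"num_topic": 5, "coherence_score": -1.4},
    {"num_topic": 7, "coherence_score": -1.6},
    {"num_topic": 3, "coherence_score": -2.1},
    {"num_topic": 5, "coherence_score": -1.2},
    {"num_topic": 7, "coherence_score": -1.8},
]

# Group scores by topic count, then average across repeats.
scores = defaultdict(list)
for rec in records:
    scores[rec["num_topic"]].append(rec["coherence_score"])
mean_score = {k: sum(v) / len(v) for k, v in scores.items()}

# Pick the topic count with the highest mean coherence.
best_k = max(mean_score, key=mean_score.get)
print(best_k)
```

Averaging over repeats smooths out run-to-run variation in LDA; the chosen `best_k` would then be passed as `num_topics` in the final model.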

In [None]:
### tokenization, document classification, and a 
### little lda (gensim api / analyzing results of model)