## Bigram Model 

**Notes:**

- Use the `Phraser` when you no longer need to update bigram stats with new documents
- Export the `Phrases` to a `Phraser` to save memory and make the model faster
- Save the `Phraser` version as a `.pkl`

**Theoretical Questions**

- Should the bigram model be applied to the cleaned text, or to text that is closer to raw? Will it produce nonsense bigrams otherwise?
- The cleaned text has already been lemmatized and had its stopwords removed, which changes which words sit next to each other. How does this affect the quality of the detected bigrams?
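
A minimal sketch of the second question, using only the standard library. The sentence and the stopword set are made up for illustration and are not taken from the corpus or from the actual cleaning pipeline. It shows that removing stopwords creates adjacent word pairs that never occurred in the raw text, and those pairs are exactly what a bigram model counts.

```python
def adjacent_pairs(tokens):
    """Return the set of (left, right) pairs of neighbouring tokens."""
    return set(zip(tokens, tokens[1:]))


# Hypothetical raw sentence and stopword list (assumptions for this sketch).
raw_tokens = "the script is weak and the plot is thin".split()
stopwords = {"the", "is", "and"}

# Same kind of filtering the cleaning step performs, minus lemmatization.
clean_tokens = [tok for tok in raw_tokens if tok not in stopwords]

# Pairs that exist only because stopwords were removed.
new_pairs = adjacent_pairs(clean_tokens) - adjacent_pairs(raw_tokens)

print(clean_tokens)
print(sorted(new_pairs))
```

Here every pair in the cleaned sentence is new, including `('script', 'weak')`, which a reader of the raw review would never see as a two-word phrase.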

## import packages

In [2]:
import os
import pandas as pd 

from gensim.models.phrases import Phrases, Phraser

In [3]:
# set notebook's path to project root 
root = "/Users/lesleymi/data_science_portfolio/"
os.chdir(root + 'IMDB_Sentiment_Analysis/')

## load docs 

In [4]:
# load movie reviews
docs = pd.read_parquet('data/train_clean.parquet').tokenized_docs

## train bigram model 

**Question**

Do I need to supply my own stopword list to the `Phrases` model? What does it do by default? The docs say `common_terms=frozenset({})`. What is a `frozenset`?
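
A short sketch of what a `frozenset` is, using only built-in Python. A `frozenset` is an immutable set: it supports membership tests like a normal `set`, but it cannot be changed after creation, which also makes it hashable. Since `frozenset({})` is empty, the default appears to mean that no common terms are treated specially unless a list is supplied. The terms in `custom_terms` are hypothetical examples, not a recommended list.

```python
# The default from the docs: frozenset({}) is simply an empty frozenset.
default_terms = frozenset({})
print(len(default_terms))

# A hypothetical custom list of common terms (an assumption for this sketch).
custom_terms = frozenset({"of", "the", "and"})
print("of" in custom_terms)

# Unlike a set, a frozenset has no add() method, so it cannot be modified.
try:
    custom_terms.add("with")
    was_mutated = True
except AttributeError:
    was_mutated = False
print(was_mutated)

# Because it is immutable, a frozenset is hashable and can be a dict key.
lookup = {custom_terms: "hashable"}
```

Immutability is presumably why it is used as a default argument value: a mutable default such as `set()` could be changed by accident between calls.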

In [5]:
phrases = Phrases(docs, min_count=5, threshold=10)

In [6]:
# Export the trained model = use less RAM, faster processing. Model updates no longer possible.
bigram = Phraser(phrases)

In [7]:
# Apply the exported model to each movie review in the corpus:
bigram_docs = list(bigram[docs])
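
To make the step above more concrete, here is a simplified sketch of what applying a frozen bigram model does to one tokenized review. This is not gensim's implementation: `merge_bigrams` and `accepted_pairs` are hypothetical names, and the sketch assumes the model boils down to a fixed set of accepted word pairs that are merged greedily from left to right with `_` as the delimiter.

```python
def merge_bigrams(tokens, accepted, delimiter="_"):
    """Greedily join adjacent tokens whose pair is in `accepted`."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in accepted:
            # Emit the joined phrase and skip both of its words.
            merged.append(delimiter.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical accepted pairs, chosen to mirror bigrams seen in the output.
accepted_pairs = {("saturday", "morning"), ("theme", "tune")}
tokens = ["snappy", "theme", "tune", "early", "saturday", "morning", "television"]

merged_tokens = merge_bigrams(tokens, accepted_pairs)
print(merged_tokens)
```

Words that are not part of an accepted pair pass through unchanged, which matches the mix of single tokens and `word_word` tokens in the documents.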

## bigram model exploration

Look at a few movie reviews. What kinds of bigrams did the model learn?

I'm happy with the bigrams found in `Doc 0`. They make sense. 

In [8]:
bigram_docs[0]

['grow',
 'watch',
 'love',
 'thunderbirds',
 'mate',
 'school',
 'watch',
 'play',
 'thunderbirds',
 'school',
 'lunch',
 'school',
 'want',
 'virgil',
 'scott',
 'want',
 'alan',
 'count',
 'art_form',
 'take',
 'child',
 'movie',
 'hope',
 'glimpse',
 'love',
 'child',
 'bitterly_disappoint',
 'high',
 'point',
 'snappy',
 'theme_tune',
 'compare',
 'original',
 'score',
 'thunderbirds',
 'thankfully',
 'early',
 'saturday_morning',
 'television',
 'channel',
 'play',
 'rerun',
 'series',
 'gerry_anderson',
 'wife',
 'create',
 'jonatha',
 'frakes',
 'hand',
 'director_chair',
 'version',
 'completely',
 'hopeless',
 'waste',
 'film',
 'utter_rubbish',
 'cgi',
 'remake',
 'acceptable',
 'replace',
 'marionette',
 'homo',
 'sapiens',
 'subsp',
 'sapiens',
 'huge',
 'error',
 'judgment']

In [9]:
bigram_docs[1]

['movie',
 'dvd_player',
 'sit',
 'coke',
 'chip',
 'expectation',
 'hope',
 'movie',
 'contain',
 'strong_point',
 'movie',
 'awsome',
 'animation',
 'good',
 'flow',
 'story',
 'excellent',
 'voice',
 'cast',
 'funny',
 'comedy',
 'kick_ass',
 'soundtrack',
 'disappointment',
 'find',
 'atlantis_milo',
 'return',
 'read_review',
 'let',
 'follow',
 'paragraph',
 'direct',
 'see',
 'movie',
 'enjoy',
 'primarily',
 'point',
 'scene',
 'appear',
 'shock',
 'pick',
 'atlantis_milo',
 'return',
 'display',
 'case',
 'local',
 'videoshop',
 'expectation',
 'music',
 'feel',
 'bad',
 'imitation',
 'movie',
 'voice',
 'cast',
 'replace',
 'fit',
 'exception',
 'character',
 'like',
 'voice',
 'sweet',
 'actual',
 'drawing',
 'not',
 'bad',
 'animation',
 'particular',
 'sad',
 'sight',
 'storyline',
 'pretty',
 'weak',
 'like',
 'episode',
 'schooby',
 'doo',
 'single',
 'adventurous',
 'story',
 'get',
 'time',
 'not',
 'misunderstand',
 'good',
 'schooby',
 'doo',
 'episode',
 'not',
 'la

In [10]:
bigram_docs[2]

['people',
 'know',
 'particular',
 'time',
 'past',
 'like',
 'feel',
 'need',
 'try',
 'define',
 'time',
 'replace',
 'woodstock',
 'civil_war',
 'apollo',
 'moon',
 'land',
 'titanic_sink',
 'get',
 'realistic',
 'flick',
 'formulaic',
 'soap_opus',
 'populate',
 'entirely',
 'low',
 'life',
 'trash',
 'kid',
 'young',
 'allow',
 'woodstock',
 'fail',
 'grade_school',
 'composition',
 'old',
 'meanie',
 'movie',
 'prove',
 'know',
 'nuttin',
 'topic',
 'money',
 'yes',
 'know',
 'thing',
 'watch',
 'film',
 'little',
 'insight',
 'underclass',
 'think',
 'time',
 'slut',
 'bar',
 'look_like',
 'diane_lane',
 'run',
 'way',
 'child_abuse',
 'let',
 'parent',
 'worthless',
 'raise',
 'kid',
 'audience',
 'abuse',
 'simply',
 'stick',
 'woodstock',
 'moonlanding',
 'flick',
 'ipso',
 'facto',
 'mean',
 'film',
 'portray']

## export phrases

In [23]:
bigrams_list = []
scores = []
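# export_phrases yields a (phrase, score) pair for each detected occurrence,
# so the same bigram can appear many times
for phrase, score in phrases.export_phrases(docs):
    bigrams_list.append(phrase)
    scores.append(score)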

In [60]:
# create a df of the bigrams and their scores 
bigram_dict = {'bigrams': bigrams_list, 'score': scores}
bigram_df = pd.DataFrame(bigram_dict)
bigram_df.head()

| | bigrams | score |
|---|---|---|
| 0 | b'art form' | 29.572952 |
| 1 | b'bitterly disappoint' | 280.301529 |
| 2 | b'theme tune' | 27.897067 |
| 3 | b'saturday morning' | 822.454814 |
| 4 | b'gerry anderson' | 333.113855 |


In [38]:
# how many bigrams were detected in the corpus? 
print("There are {} total bigrams detected.\n".format(bigram_df.shape[0]))
print("There are {} unique bigrams detected.".format(bigram_df.bigrams.nunique()))

There are 313752 total bigrams detected.

There are 13939 unique bigrams detected.


In [30]:
# what's the max bigram score? 
print("The max bigram score in the corpus is: {}".format(round(bigram_df.score.max(), 2)))

The max bigram score in the corpus is: 104250.76


In [61]:
# drop the duplicate bigrams from the df to get just the unique ones 
unique_bigram_df = bigram_df.drop_duplicates().reset_index(drop=True)

## top scored bigrams 

**Why do the bigrams with the highest scores look like nonsense?**
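
One plausible explanation, assuming the model uses gensim's default scoring as described in its docs (the formula from Mikolov et al.): the score is `(bigram_count - min_count) / (count_a * count_b) * vocab_size`. Pairs whose words almost never appear apart, such as rare names, get a tiny denominator and so a very large score. The sketch below re-implements that formula in plain Python. All counts are invented for illustration and are not taken from this corpus.

```python
def default_style_score(count_a, count_b, bigram_count, vocab_size, min_count):
    """Score a word pair with the Mikolov-style formula (assumed default)."""
    return (bigram_count - min_count) / count_a / count_b * vocab_size


# Hypothetical vocabulary size; min_count matches the value used above.
VOCAB_SIZE = 50_000
MIN_COUNT = 5

# A rare name: each word occurs 8 times, always next to the other.
rare_score = default_style_score(8, 8, 8, VOCAB_SIZE, MIN_COUNT)

# Two very common words that sit together fairly often.
common_score = default_style_score(5_000, 3_000, 200, VOCAB_SIZE, MIN_COUNT)

print(round(rare_score, 2), round(common_score, 2))
print(rare_score > common_score)
```

Under these made-up counts, the rare pair clears `threshold=10` easily while the common pair does not, even though the common pair occurs 25 times more often.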

In [62]:
# look at the top 20 bigrams by score
unique_bigram_df.sort_values(by='score', ascending=False).head(20)

| | bigrams | score |
|---|---|---|
| 6754 | b'yaphet kotto' | 104250.756944 |
| 11393 | b'iwo jima' | 100527.515625 |
| 5070 | b'atul kulkarni' | 97481.227273 |
| 12585 | b'hakuna matata' | 97481.227273 |
| 3330 | b'wishy washy' | 97481.227273 |
| 1986 | b'alok nath' | 97481.227273 |
| 13618 | b'brysomme mannen' | 95314.977778 |
| 2943 | b'tilda swinton' | 89982.671329 |
| 6917 | b'fausa aurvaag' | 89357.791667 |
| 7973 | b'nessun dorma' | 87534.163265 |


**These look pretty reasonable. Most of them are actors' or actresses' names.**

In [63]:
unique_bigram_df.sort_values(by='score', ascending=False).head(1000).tail(10)

| | bigrams | score |
|---|---|---|
| 9535 | b'ronald colman' | 6058.155367 |
| 4421 | b'seth macfarlane' | 6058.155367 |
| 12002 | b'ching wan' | 6058.155367 |
| 11827 | b'evan almighty' | 6052.894268 |
| 2534 | b'eugene hutz' | 6043.522142 |
| 11766 | b'bo svenson' | 6041.090141 |
| 8480 | b'amanda bynes' | 6036.052226 |
| 4251 | b'cillian murphy' | 6020.94964 |
| 9954 | b'homer hickam' | 6020.738349 |
| 12944 | b'hou hsiao' | 6007.246499 |


**Look at another set of high score bigrams**

In [64]:
unique_bigram_df.sort_values(by='score', ascending=False).head(3000).tail(30)

| | bigrams | score |
|---|---|---|
| 11415 | b'dan blocker' | 865.450767 |
| 5014 | b'portobello road' | 865.450767 |
| 9416 | b'rukh khan' | 865.101654 |
| 10787 | b'dorian gray' | 863.360306 |
| 3487 | b'stuart gordon' | 862.702257 |
| 1716 | b'olivia newton' | 862.665728 |
| 11493 | b'emily mortimer' | 862.665728 |
| 4210 | b'skimpy outfit' | 862.578984 |
| 2871 | b'buster babs' | 862.203297 |
| 9623 | b'tom servo' | 858.464341 |


**These are looking pretty good**

In [65]:
unique_bigram_df.sort_values(by='score', ascending=False).head(5000).tail(30)

| | bigrams | score |
|---|---|---|
| 12617 | b'dick hickock' | 222.329152 |
| 10114 | b'figuratively literally' | 222.329152 |
| 7435 | b'jimmy wang' | 222.236995 |
| 1743 | b'adam burt' | 222.232389 |
| 7106 | b'john schlesinger' | 222.075903 |
| 169 | b'internal logic' | 222.006936 |
| 13051 | b'johnny weismuller' | 222.006936 |
| 1049 | b'david bradley' | 221.869129 |
| 10342 | b'jonathan hale' | 221.662739 |
| 10941 | b'rant rave' | 221.609293 |


## lowest scored bigrams 

**Some of these look pretty good, like `extremely sexy`, `incredible job`, and `nice guy`.**

Some of these make less sense, and I wonder if it's because the bigram model was applied AFTER stopwords had already been removed, so the model joins words that would not have occurred together in the original text.

In [66]:
# what do the bigrams with the lowest scores look like? 
unique_bigram_df.sort_values(by='score').head(30)

| | bigrams | score |
|---|---|---|
| 3203 | b'nice guy' | 10.001282 |
| 8239 | b'c scott' | 10.003251 |
| 639 | b'order avoid' | 10.004371 |
| 4995 | b'gradually reveal' | 10.004604 |
| 620 | b'script weak' | 10.004634 |
| 5872 | b'large budget' | 10.005795 |
| 11494 | b'try cope' | 10.005841 |
| 7018 | b'fight legion' | 10.007312 |
| 8120 | b'red paint' | 10.007627 |
| 840 | b'character likable' | 10.010405 |
