# Predicting a Good Movie Based on Plot

#### Executive Summary 
Aakash Tandel 

Predicting Movie Quality from Plot Summaries through Natural Language Processing

Overview: My project attempts to use Wikipedia plot summaries to predict Metacritic scores. The goal is to use natural language processing to figure out what makes a movie score high on Metacritic: in other words, which plot elements (people, subjects, etc.) make a movie “good” according to Metacritic reviews.

Acknowledgements: I want to acknowledge the people who assisted (virtually and in person) with this project. I watched and followed Patrick Harrison's Modern NLP in Python lecture at PyData DC 2016 and Bhargav Desikan's Topic Modeling with NLP Framework in Gensim lecture from PyData Berlin 2017. I listened to, and modeled my code after, a lecture on LDA by Evann Smith, Ph.D., senior data scientist at Thresher. Mark Mummert, Matt Speck, and Matt Brems, data scientists at General Assembly in DC, assisted with the work as well.

Data: I scraped Metacritic for movie titles, scores, cast, director, and genre. The scraping was batched by genre because attempting to scrape the whole site in one pass was problematic. The Wikipedia plots were borrowed from the GitHub user Markriedl (https://github.com/markriedl/WikiPlots); I found this dataset through the Data is Plural archive. spaCy was used to tokenize the plots and for EDA: sentence segmentation, part-of-speech tagging, and named entity recognition. Please email or message me if you would like the data.

Processing: I attempted to use fuzzy matching to match Metacritic titles to Wikipedia titles (h/t to Roland Jennier). This wasn’t as fruitful as I had hoped, so I matched plots and titles with exact (hard) matching instead. The plots were tokenized, stemmed, lemmatized, and stripped of stop words, punctuation, and words shorter than 3 characters. This was helpful in breaking down the relatively large corpus. It was done with NLTK because spaCy ended up causing problems.

Modeling: All movies with Metacritic scores greater than 75 were labeled as “Good” movies. I used TF-IDF to vectorize the plots, then trained SVM, Random Forest, and XGBoost models as binary classifiers. The ROC-AUC score ended up at 61%.
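
As a rough illustration of what TF-IDF weighting does to tokenized plots, here is a minimal pure-Python sketch. It is not the notebook's model code, which most likely relies on scikit-learn's `TfidfVectorizer` (imported later in the notebook). The toy plots and the `tfidf` helper are made up, and the formula is the plain textbook variant rather than scikit-learn's smoothed one.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (textbook tf * log(N / df))."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already-preprocessed plots.
toy_plots = [
    "detective hunt killer city",
    "robot war destroy city",
    "detective fall love singer",
]
weights = tfidf(toy_plots)
```

Terms that appear in only one plot (such as "hunt") end up with a larger weight than terms shared across plots (such as "city"), which is the property a classifier trained on these features would lean on.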

Conclusion: Not Predictive
Based on the relatively low score, I can say that the Wikipedia plot summaries are not good at predicting whether or not a film will be a hit on Metacritic.
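
To put the 61% figure in context: ROC-AUC can be read as the probability that a randomly chosen "Good" movie receives a higher model score than a randomly chosen "Bad" one, so 0.5 corresponds to guessing and 1.0 to a perfect ranking. The sketch below computes it by brute force over all pairs. The labels and scores are invented for illustration and `roc_auc` is a hypothetical helper, not the scoring function used in the project.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
perfect = roc_auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])   # every pair ordered correctly
constant = roc_auc(labels, [0.5, 0.5, 0.5, 0.5, 0.5, 0.5])  # no information at all
weak = roc_auc(labels, [0.9, 0.4, 0.3, 0.8, 0.35, 0.1])     # some pairs misordered
```

A score of 0.61 sits much closer to the uninformative case than to the perfect one.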

Additional Processing: I used LDA and HDP topic modeling techniques to dive into the data a bit more. HDP was able to group specific movies and series (like Transformers and The Hunger Games) together. Additionally, I used Word2Vec to see if word embeddings would yield interesting results. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from tqdm import tqdm
tqdm.pandas(desc='progress-bar')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%pylab inline
sns.set_style("whitegrid")

Populating the interactive namespace from numpy and matplotlib


# Part 0: Webscraping Metacritic
This was originally done in another Jupyter notebook, but I have recreated it here as a markdown cell for completeness.

```Python

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'http://www.metacritic.com/browse/movies/genre/date/action?view=condensed'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
site = requests.get(url, headers=headers)
html = BeautifulSoup(site.text, 'lxml')

# Genre needs to be specified. 
genres = [  'thriller' ]

def get_movie(url):
    
    name = None
    year = None
    score = None
    cast = None
    summary = None
    director = None

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    site = requests.get(url, headers=headers)
    html = BeautifulSoup(site.text, 'lxml')


    try:
        title = html.find(class_='product_page_title')
    except:
        pass
    try:
        name = title.find('h1').text
    except:
        pass
    try:
        year = title.find(class_='release_year').text
    except:
        pass


    try:
        score = html.find('span', class_='metascore_w').text
    except:
        pass  
    
    try:
        summary = html.find(class_='blurb_expanded').text
    except:
        pass
    try:
        director = [item.text for item in html.find(class_='director').findAll('a')]
    except:
        pass
    
#     print('Name:', name)
#     print('Year:', year)
#     print('Score:', score)
#     print('Cast:', cast)
#     print('Summary:', summary)
#     print('Director:', director)
    
    movie_dict = {'Title': name, 'Year': year, 'Score': score, 'Cast': cast, 'Director': director}
#     df = pd.DataFrame().from_dict(movie_dict)
    
#     print(movie_dict)
    return movie_dict

get_movie('http://www.metacritic.com/movie/john-wick-chapter-2')
df = pd.DataFrame(columns=['Title', 'Year', 'Score', 'Cast', 'Director'])

url = 'http://www.metacritic.com/browse/movies/genre/date/'
i = 0

for genre in genres:
    page = 0
    url_temp = url + genre + '?view=detailed&page=' + str(page)
    site_temp = requests.get(url_temp, headers=headers)  # Request the genre page, not the bare base URL.
    
    while site_temp.status_code == 200:
        url_temp = url + genre + '?view=detailed&page=' + str(page)
#         print(url_temp)
        site_temp = requests.get(url_temp, headers=headers)
        html = BeautifulSoup(site_temp.text, 'html.parser')
    
        movies = [item['href'] for item in html.findAll('a', class_='product_score')]
        for movie in movies:
            movie_link = 'http://www.metacritic.com' + movie
#             print(movie_link)
#             df.append(get_movie(movie_link), ignore_index=True)
            movie_dict = get_movie(movie_link)
            df.loc[i] = movie_dict
            i += 1
            print(movie_dict['Title'])
        
        page += 1
        
        print('Just hit page', page, '!')
        
        # Checkpoint the results every second page.
        if not page % 2:
            df.to_csv('metacritic_thriller.csv', encoding='utf-8')

```
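
The paging loop above builds one listing URL per genre and page. A small sketch of just that URL construction, isolated so it can be checked without touching the network. The base URL and query parameters are the ones used in the scraper; `listing_url` is a hypothetical helper name, and the site layout may well have changed since the scrape.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.metacritic.com/browse/movies/genre/date/'

def listing_url(genre, page):
    """Build the detailed-view listing URL for one genre and page number."""
    query = urlencode({'view': 'detailed', 'page': page})
    return BASE_URL + genre + '?' + query

# First three listing pages for the thriller genre.
first_pages = [listing_url('thriller', page) for page in range(3)]
```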

# Part 1: Understanding Wikipedia Plot Data with spaCy

In this part, we preprocess the Wikipedia data and run some basic natural language processing on the text to get a feel for it. Parts 2 and 3 then match the plots to Metacritic scores and build a model that predicts whether or not a movie is good from its plot.

Part 1 uses all of the Wikipedia plots pulled from Kaggle.com. The later parts only include the plots that had Metacritic scores associated with them, so this first part works with a fair amount of extra data.

In [2]:
wiki = pd.read_csv('/Users/aakashtandel/Documents/Data/Wikipedia Plots/wiki.csv', index_col=0)
wiki.head()

Unnamed: 0,Title,Plot,Title Short,Extra Info
0,Animal Farm,"Old Major, the old boar on the Manor Farm, sum...",Animal Farm,
1,A Clockwork Orange (novel),Alex is a 15-year-old living in near-future dy...,A Clockwork Orange,novel
2,The Plague,The text of The Plague is divided into five pa...,The Plague,
3,Actaeon,"Among others, John Heath has observed, ""The un...",Actaeon,
4,A Fire Upon the Deep,"An expedition from Straumli Realm, an ambitiou...",A Fire Upon the Deep,


In [3]:
wiki.Plot[5129]  # For example, A Game of Thrones, the novel behind the hit HBO series.

'A Game of Thrones follows three principal storylines simultaneously. At the beginning of the story, Lord Eddard "Ned" Stark executes a deserter from the Night\'s Watch, who has betrayed his vows and fled from the Wall. On the way back, his children adopt six direwolf pups, the animal of his sigil. There are three male and two female direwolf pups, as well as an albino runt, which aligns with his three trueborn sons, two trueborn daughters, and one bastard son. That night, Ned receives word of the death of his mentor, Lord Jon Arryn, the principal advisor to Ned\'s childhood friend, King Robert Baratheon. During his own visit to Ned\'s castle of Winterfell, Robert recruits Ned to replace Arryn as the King\'s Hand. Ned is reluctant, but agrees to go when he learns that Arryn\'s widow Lysa believes Queen Cersei Lannister and her family poisoned Arryn. Shortly thereafter, Ned\'s son Bran inadvertently discovers Cersei in coitus with her twin brother Jaime Lannister, who throws Bran from t

We have over 112,000 plots from Wikipedia. These include movies, novels, and television shows. The modeling later in this project focuses on movies.

The following analysis was built with the help of Patrick Harrison's Modern NLP in Python lecture at PyData DC 2016. His Jupyter notebook can be found at https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb

Bhargav Desikan's Topic Modeling with NLP framework Gensim lecture from PyData Berlin 2017 was also helpful. His Jupyter notebook can be found at https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb

In [4]:
import spacy
nlp = spacy.load('en')  # Loading the English model and assigning it to NLP.

In [5]:
parsed_got = nlp(wiki.Plot[5129].decode('utf-8'))  # spaCy expects a unicode object.
parsed_got  # Displaying the parsed Doc shows the original text; spaCy keeps every token and layers its annotations on top.

A Game of Thrones follows three principal storylines simultaneously. At the beginning of the story, Lord Eddard "Ned" Stark executes a deserter from the Night's Watch, who has betrayed his vows and fled from the Wall. On the way back, his children adopt six direwolf pups, the animal of his sigil. There are three male and two female direwolf pups, as well as an albino runt, which aligns with his three trueborn sons, two trueborn daughters, and one bastard son. That night, Ned receives word of the death of his mentor, Lord Jon Arryn, the principal advisor to Ned's childhood friend, King Robert Baratheon. During his own visit to Ned's castle of Winterfell, Robert recruits Ned to replace Arryn as the King's Hand. Ned is reluctant, but agrees to go when he learns that Arryn's widow Lysa believes Queen Cersei Lannister and her family poisoned Arryn. Shortly thereafter, Ned's son Bran inadvertently discovers Cersei in coitus with her twin brother Jaime Lannister, who throws Bran from the towe

Sentence segmentation with spaCy. This could be very helpful if you had a large corpus on a single subject. For example, if you had all of the Harry Potter books in a single corpus, it may be worthwhile to segment the text into sentences.

In [6]:
got_sentences = []
for num, sentence in enumerate(parsed_got.sents):
    print 'Sentence {}:'.format(num + 1)
    print sentence
    got_sentences.append(sentence)
    print ''

Sentence 1:
A Game of Thrones follows three principal storylines simultaneously.

Sentence 2:
At the beginning of the story, Lord Eddard "Ned" Stark executes a deserter from the Night's Watch, who has betrayed his vows and fled from the Wall.

Sentence 3:
On the way back, his children adopt six direwolf pups, the animal of his sigil.

Sentence 4:
There are three male and two female direwolf pups, as well as an albino runt, which aligns with his three trueborn sons, two trueborn daughters, and one bastard son.

Sentence 5:
That night, Ned receives word of the death of his mentor, Lord Jon Arryn, the principal advisor to Ned's childhood friend, King Robert Baratheon.

Sentence 6:
During his own visit to Ned's castle of Winterfell, Robert recruits Ned to replace Arryn as the King's Hand.

Sentence 7:
Ned is reluctant, but agrees to go when he learns that Arryn's widow Lysa believes Queen Cersei Lannister and her family poisoned Arryn.

Sentence 8:
Shortly thereafter, Ned's son Bran inadve

Named entity recognition pulls out the names of people, places, organizations, and similar items. This lets us identify things like Lord Eddard as a person, although it is not always right: Winterfell is tagged as an organization when it is really a location. This would be really important in certain use cases. The problem with the current corpus is that words like "Eddard" will ultimately be dropped because they are infrequent across the entire corpus.

In [7]:
for num, entity in enumerate(parsed_got.ents):
    print 'Entity {}:'.format(num + 1), entity, '-', entity.label_
    print ''

Entity 1: three - CARDINAL

Entity 2: Lord Eddard - PERSON

Entity 3: Ned" Stark - PERSON

Entity 4: Night - ORG

Entity 5: six - CARDINAL

Entity 6: three - CARDINAL

Entity 7: two - CARDINAL

Entity 8: three - CARDINAL

Entity 9: two - CARDINAL

Entity 10: one - CARDINAL

Entity 11: Ned - PERSON

Entity 12: Lord Jon Arryn - PERSON

Entity 13: Ned - PERSON

Entity 14: Robert Baratheon - PERSON

Entity 15: Ned - PERSON

Entity 16: Winterfell - ORG

Entity 17: Robert - PERSON

Entity 18: Ned - PERSON

Entity 19: Arryn - PERSON

Entity 20: Ned - PERSON

Entity 21: Arryn - PERSON

Entity 22: Lysa - PERSON

Entity 23: Queen Cersei Lannister - PERSON

Entity 24: Arryn - PERSON

Entity 25: Ned - PERSON

Entity 26: Bran - PERSON

Entity 27: Cersei - PERSON

Entity 28: Jaime Lannister - PERSON

Entity 29: Bran - PERSON

Entity 30: Ned - PERSON

Entity 31: Sansa - PERSON

Entity 32: Arya - PERSON

Entity 33: King's Landing - ORG

Entity 34: Catelyn - PERSON

Entity 35: Robb - PERSON

Entity 36:

We can also do part-of-speech tagging in order to understand syntax. For example, "Eddard" is tagged as a proper noun. This could be very beneficial in different use cases. spaCy makes this very easy with the token attribute .pos_.

In [8]:
token_text = [token.orth_ for token in parsed_got]  # Each word becomes a token. 
token_pos = [token.pos_ for token in parsed_got]

pd.DataFrame(list(zip(token_text, token_pos)),  # list() keeps this working where zip returns an iterator.
             columns=['token_text', 'part_of_speech'])

Unnamed: 0,token_text,part_of_speech
0,A,DET
1,Game,NOUN
2,of,ADP
3,Thrones,NOUN
4,follows,VERB
5,three,NUM
6,principal,ADJ
7,storylines,NOUN
8,simultaneously,ADV
9,.,PUNCT


spaCy can also normalize text by lemmatizing, which reduces each word to its base form. We can see the effect below: "follows" becomes "follow" and "storylines" becomes "storyline". The lemmas are also lowercased, so "Game" and "game" are interpreted as the same thing by the computer.

In [9]:
token_lemma = [token.lemma_ for token in parsed_got]
token_shape = [token.shape_ for token in parsed_got]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),
             columns=['token_text', 'token_lemma', 'token_shape'])

Unnamed: 0,token_text,token_lemma,token_shape
0,A,a,X
1,Game,game,Xxxx
2,of,of,xx
3,Thrones,throne,Xxxxx
4,follows,follow,xxxx
5,three,three,xxxx
6,principal,principal,xxxx
7,storylines,storyline,xxxx
8,simultaneously,simultaneously,xxxx
9,.,.,.


Token-level entity analysis tells us what type of entity each token belongs to (person, organization, etc.). The inside-outside-begin (IOB) tag tells us whether a token begins an entity (B), sits inside one (I), or is outside any entity (O).

In [10]:
token_entity_type = [token.ent_type_ for token in parsed_got]
token_entity_iob = [token.ent_iob_ for token in parsed_got]

pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
             columns=['token_text', 'entity_type', 'inside_outside_begin'])

Unnamed: 0,token_text,entity_type,inside_outside_begin
0,A,,O
1,Game,,O
2,of,,O
3,Thrones,,O
4,follows,,O
5,three,CARDINAL,B
6,principal,,O
7,storylines,,O
8,simultaneously,,O
9,.,,O
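
To make the inside-outside-begin scheme concrete, here is a minimal sketch that tags tokens from hand-written entity spans: the first token of each entity gets B, the remaining tokens get I, and everything else gets O. The spans are labelled by hand for illustration (they are not spaCy output) and `iob_tags` is a hypothetical helper.

```python
def iob_tags(tokens, entity_spans):
    """Return (token, entity_type, iob) rows; spans are (start, end, label), end exclusive."""
    types = [''] * len(tokens)
    iob = ['O'] * len(tokens)
    for start, end, label in entity_spans:
        for i in range(start, end):
            types[i] = label
            iob[i] = 'B' if i == start else 'I'
    return list(zip(tokens, types, iob))

tokens = ['Lord', 'Jon', 'Arryn', 'advises', 'King', 'Robert', 'Baratheon', '.']
# Hand-labelled spans for illustration only.
spans = [(0, 3, 'PERSON'), (5, 7, 'PERSON')]
tagged = iob_tags(tokens, spans)
```

The rows have the same three columns as the table above, so a multi-token name such as "Lord Jon Arryn" shows up as one B row followed by I rows.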


We can use spaCy to flag stop words, punctuation, whitespace, and number-like tokens, and to check each token's log probability. The smaller (more negative) a log probability is, the rarer the token.

In [11]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_got]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
                                               
df

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,A,-7.385418,Yes,,,,
1,Game,-10.371019,,,,,
2,of,-4.275874,Yes,,,,
3,Thrones,-12.142232,,,,,
4,follows,-11.162002,,,,,
5,three,-8.723103,Yes,,,Yes,
6,principal,-12.080091,,,,,
7,storylines,-13.209061,,,,,
8,simultaneously,-11.605341,,,,,
9,.,-3.067898,,Yes,,,
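
Why a more negative value means a rarer token: a log probability is the logarithm of a relative frequency, and the logarithm of a smaller fraction is more negative. The sketch below shows this on an invented ten-word corpus. spaCy's values come from its own frequency estimates over a much larger corpus, so the numbers here are only meant to show the direction of the relationship.

```python
import math
from collections import Counter

toy_corpus = "the king and the hand of the king rode north".split()
counts = Counter(toy_corpus)
total = sum(counts.values())

# Natural-log relative frequency of each word in the toy corpus.
log_prob = {word: math.log(count / total) for word, count in counts.items()}
```

Here "the" (3 of 10 words) gets a higher log probability than "king" (2 of 10), which in turn is higher than "north" (1 of 10).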


# Part 2: Matching Metacritic Scores and Wikipedia Plots

The code below was developed separately in a notebook called 'Wikipedia Plot Data' but is incorporated here for completeness.

In [12]:
# Plots are stored in one file, separated by an <EOS> marker.
with open('/Users/aakashtandel/Documents/Data/Wikipedia Plots/plots.txt', 'r') as plot_data:
    text = plot_data.read()
split_text = text.split('<EOS>')
cleanish = [' '.join(each.split()) for each in split_text]  # Collapse runs of whitespace.
# Titles are stored one per line and are paired with the plots by position.
with open('/Users/aakashtandel/Documents/Data/Wikipedia Plots/titles.txt', 'r') as titles_data:
    titles = titles_data.read()
split_titles = titles.split('\n')
tups = list(zip(split_titles, cleanish))
df = pd.DataFrame(tups, columns=['Title', 'Plot'])

In [13]:
# Short title: drop any parenthetical, e.g. "A Clockwork Orange (novel)" -> "A Clockwork Orange ".
df['Title Short'] = df['Title'].str.replace(r"\(.*\)", "", regex=True)
# Extra info: keep only the text inside the parentheses, e.g. "novel".
df['Extra Info'] = df['Title'].str.replace(r'[^(]*\(|\)[^)]*', '', regex=True)
# Titles without parentheses come back unchanged, so blank those out.
df.loc[df['Extra Info'] == df['Title'], 'Extra Info'] = None
df.head(20)

Unnamed: 0,Title,Plot,Title Short,Extra Info
0,Animal Farm,"Old Major, the old boar on the Manor Farm, sum...",Animal Farm,
1,A Clockwork Orange (novel),Alex is a 15-year-old living in near-future dy...,A Clockwork Orange,novel
2,The Plague,The text of The Plague is divided into five pa...,The Plague,
3,Actaeon,"Among others, John Heath has observed, ""The un...",Actaeon,
4,A Fire Upon the Deep,"An expedition from Straumli Realm, an ambitiou...",A Fire Upon the Deep,
5,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ger...",All Quiet on the Western Front,
6,Anyone Can Whistle,The story is set in an imaginary American town...,Anyone Can Whistle,
7,A Funny Thing Happened on the Way to the Forum,"In ancient Rome, some neighbors live in three ...",A Funny Thing Happened on the Way to the Forum,
8,Army of Darkness,"Being transported to the Middle Ages, Ash Will...",Army of Darkness,
9,The Birth of a Nation,The film follows two juxtaposed families. One ...,The Birth of a Nation,
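
The two regular expressions in the cell above can be tried on single strings with the standard `re` module. This is a small sketch of the same splitting logic outside pandas; `split_title` is a hypothetical helper, and it trims the trailing space right away, whereas the notebook strips the short titles in a later cell.

```python
import re

def split_title(title):
    """Return (short_title, extra_info); extra_info is None without parentheses."""
    # Remove the parenthetical to get the short title.
    short = re.sub(r"\(.*\)", "", title).strip()
    # Remove everything up to "(" and from ")" onward to keep the parenthetical text.
    extra = re.sub(r'[^(]*\(|\)[^)]*', '', title)
    if extra == title:
        extra = None
    return short, extra

examples = [split_title(t) for t in ['Animal Farm', 'A Clockwork Orange (novel)']]
```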


The code below was developed in a Jupyter notebook titled 'Fuzzy Matching' but is included here for completeness.

In [14]:
titles = df
titles['Title Short'] = titles['Title Short'].str.lower().str.replace('&', ' ')
titles['Title Short'] = titles['Title Short'].astype(str)
titles['Title Short'] = titles['Title Short'].apply(str.strip)
meta_action = pd.read_csv('/Users/aakashtandel/Documents/Data/Metacritic Scrape/metacritic_action_adventure_drama.csv', index_col=0)
meta_fantasy = pd.read_csv('/Users/aakashtandel/Documents/Data/Metacritic Scrape/metacritic_fantasy.csv', index_col=0)
meta_history = pd.read_csv('/Users/aakashtandel/Documents/Data/Metacritic Scrape/metacritic_filmnoir_history.csv', index_col=0)
meta_horror = pd.read_csv('/Users/aakashtandel/Documents/Data/Metacritic Scrape/metacritic_horror.csv', index_col=0)
meta_mystery = pd.read_csv('/Users/aakashtandel/Documents/Data/Metacritic Scrape/metacritic_mystery.csv', index_col=0)
meta_romance = pd.read_csv('/Users/aakashtandel/Documents/Data/Metacritic Scrape/metacritic_romance.csv', index_col=0)
meta_scifi = pd.read_csv('/Users/aakashtandel/Documents/Data/Metacritic Scrape/metacritic_scifi_sport.csv', index_col=0)
meta_thriller = pd.read_csv('/Users/aakashtandel/Documents/Data/Metacritic Scrape/metacritic_thriller.csv', index_col=0)
meta_western = pd.read_csv('/Users/aakashtandel/Documents/Data/Metacritic Scrape/metacritic_western.csv', index_col=0)
meta = pd.concat([meta_action, meta_fantasy, meta_history, meta_horror, meta_mystery, meta_romance, meta_thriller,
                 meta_scifi, meta_western], axis=0)
meta = meta[pd.notnull(meta['Score'])]
print (meta.shape)
meta.head()

(19486, 5)


Unnamed: 0,Title,Year,Score,Cast,Director
0,Blade of the Immortal,2017.0,71.0,"[Chiaki Kuriyama, Erika Toda, Hana Sugisaki, H...",[Takashi Miike]
1,American Made,2017.0,63.0,"[April Billingsley, Benito Martinez, Caleb Lan...",[Doug Liman]
2,Unlocked,2017.0,40.0,"[Adelayo Adedayo, Akshay Kumar, Aymen Hamdouch...",[Michael Apted]
3,Death Note,2017.0,41.0,"[Christopher Britton, Jack Ettlinger, Jason Li...",[Adam Wingard]
4,The Villainess,2017.0,61.0,"[Eun-ji Jo, Ha-kyun Shin, Ok-bin Kim, Seo-hyeo...",[Byung-gil Jung]


In [15]:
from fuzzywuzzy import fuzz  # Assumption: this cell never imports fuzz; fuzzywuzzy provides fuzz.ratio.

def match_name(name, list_names, min_score=0, opt=90):
    # -1 score in case we don't get any matches
    max_score = -1
    # Returning an empty name for no match as well
    max_name = ""
    # Iterating over all names in the other list
    for name2 in list_names:
        # Finding the fuzzy match score
        score = fuzz.ratio(name, name2)
        # Checking if we are above our threshold and have a better score
        if score > min_score and score > max_score:
            max_name = name2
            max_score = score
            # Stop early once a match scores exactly opt
            if max_score == opt:
                break
    return (max_name, max_score)
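
For readers who want to experiment without a fuzzy-matching package, the same best-match search can be sketched with the standard library's `difflib.SequenceMatcher` as a stand-in scorer. This is an assumption-laden substitute: its scores are on a similar 0 to 100 scale but are not guaranteed to equal `fuzz.ratio`, and `similarity` and `best_match` are hypothetical names.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity on a 0-100 scale, comparable in spirit to a fuzzy ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_match(name, candidates, min_score=0):
    """Return (candidate, score) for the highest score above min_score."""
    best_name, best_score = "", -1
    for candidate in candidates:
        score = similarity(name, candidate)
        if score > min_score and score > best_score:
            best_name, best_score = candidate, score
    return best_name, best_score

wiki_titles = ['the dark tower', 'dunkirk', 'atomic blonde']
match = best_match('dark tower', wiki_titles)         # close but not identical
exact = best_match('dunkirk', wiki_titles)            # identical strings
miss = best_match('zzz', wiki_titles, min_score=80)   # nothing clears the threshold
```

Note that this kind of search scores every Metacritic title against every Wikipedia title, so with roughly 19,000 titles on one side and 112,000 on the other it is far more expensive than an exact lookup.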

In [16]:
meta.Title = meta.Title.str.lower().str.replace('&', ' ')
dict_list = []
import time 
title_short = set(titles['Title Short'])
for name in meta.Title:
    t0 = time.time()
    dict_ = {}
    match = name in title_short 
    dict_.update({"movie" : name})
    dict_.update({"movie match" : match})
    dict_list.append(dict_)
    #print (time.time() - t0)
merge_table = pd.DataFrame(dict_list)
toge = pd.concat([meta.set_index('Title'), merge_table.set_index('movie')], axis = 1)
toge = toge.reset_index()
toge.columns = (['Title', 'Year', 'Score', 'Cast', 'Director', 'movie match'])
titles = titles[['Plot', 'Title Short', 'Extra Info']]
titles['Title'] = titles['Title Short']
results = pd.merge(toge, titles, on=['Title'])
results.head()

Unnamed: 0,Title,Year,Score,Cast,Director,movie match,Plot,Title Short,Extra Info
0,american made,2017.0,63.0,"[April Billingsley, Benito Martinez, Caleb Lan...",[Doug Liman],True,American Made tells the story of Barry Seal (T...,american made,film
1,american made,2017.0,63.0,"[April Billingsley, Benito Martinez, Caleb Lan...",[Doug Liman],True,American Made tells the story of Barry Seal (T...,american made,film
2,american made,2017.0,63.0,"[April Billingsley, Benito Martinez, Caleb Lan...",[Doug Liman],True,American Made tells the story of Barry Seal (T...,american made,film
3,american made,2017.0,63.0,"[April Billingsley, Benito Martinez, Caleb Lan...",[Doug Liman],True,American Made tells the story of Barry Seal (T...,american made,film
4,american made,2017.0,63.0,"['April Billingsley', 'Benito Martinez', 'Cale...",['Doug Liman'],True,American Made tells the story of Barry Seal (T...,american made,film
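
The hard matching above boils down to normalizing both title lists the same way and then testing set membership. A minimal sketch with made-up titles follows. `normalize` is a hypothetical helper that applies the lowercasing and '&' replacement used in the notebook, plus trimming.

```python
def normalize(title):
    """Lowercase, replace '&' with a space, and trim surrounding whitespace."""
    return title.lower().replace('&', ' ').strip()

# A set gives constant-time lookups, which matters with ~112,000 Wikipedia titles.
wiki_titles = {normalize(t) for t in ['American Made ', 'Dunkirk', 'Unlocked']}
meta_titles = ['American Made', 'DUNKIRK', 'Blade of the Immortal']

matches = {title: normalize(title) in wiki_titles for title in meta_titles}
```

Matching on the title alone cannot tell apart different works that share a name, so the `Extra Info` column (film, novel, and so on) is one signal that could be used to filter the matches further.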


In [17]:
results = results.drop_duplicates()
results = results.reset_index()
# Fill missing years with the placeholder value 2007.
results['Year'] = results['Year'].fillna(2007)
results['Cast'] = results['Cast'].str.strip('[]').str.strip("'")
results['Director'] = results['Director'].str.strip('[]').str.strip("'")
results.head()


Unnamed: 0,index,Title,Year,Score,Cast,Director,movie match,Plot,Title Short,Extra Info
0,0,american made,2017.0,63.0,"April Billingsley, Benito Martinez, Caleb Land...",Doug Liman,True,American Made tells the story of Barry Seal (T...,american made,film
1,4,american made,2017.0,63.0,"April Billingsley', 'Benito Martinez', 'Caleb ...",Doug Liman,True,American Made tells the story of Barry Seal (T...,american made,film
2,5,american made,2017.0,63.0,,Doug Liman,True,American Made tells the story of Barry Seal (T...,american made,film
3,6,unlocked,2017.0,40.0,"Adelayo Adedayo, Akshay Kumar, Aymen Hamdouchi...",Michael Apted,True,A CIA interrogator is lured into a ruse that p...,unlocked,2017 film
4,7,unlocked,2017.0,45.0,"Adelayo Adedayo', 'Akshay Kumar', 'Aymen Hamdo...",Michael Apted,True,A CIA interrogator is lured into a ruse that p...,unlocked,2017 film


In [18]:
export = results[['Title', 'Score', 'Plot']]
export = export.drop_duplicates(subset='Title')
export = export.reset_index()
export = export.drop(['index'], axis=1)
export = export.drop_duplicates()
print (export.shape)
export.head(20)

(7187, 3)


Unnamed: 0,Title,Score,Plot
0,american made,63.0,American Made tells the story of Barry Seal (T...
1,unlocked,40.0,A CIA interrogator is lured into a ruse that p...
2,death note,41.0,Light Yagami is a genius high school student w...
3,birth of the dragon,35.0,"In 1965 in San Francisco, Bruce Lee, spurred b..."
4,the dark tower,34.0,The story deals with an early rendition of int...
5,kidnap,44.0,"Sonia (Minissha Lamba) lives with her mother, ..."
6,atomic blonde,63.0,"The film takes place in Berlin, 1989, on the e..."
7,dunkirk,94.0,The film relates the story of Operation Dynamo...
8,despicable me 3,49.0,"Gru faces off against Balthazar Bratt, a forme..."
9,once upon a time in venice,28.0,When a Los Angeles-based private detective tri...


In [19]:
export.Title.value_counts()

the adventures of pluto nash                1
passion play                                1
the colony                                  1
london fields                               1
the lone ranger                             1
spring breakers                             1
50/50                                       1
theeb                                       1
rocky road                                  1
bellflower                                  1
everyone else                               1
the story of the weeping camel              1
uncle nino                                  1
elizabethtown                               1
captain underpants: the first epic movie    1
sinister 2                                  1
a bug's life                                1
unfriended                                  1
the patriot                                 1
admission                                   1
aloha                                       1
mystery, alaska                   

# Part 3: Predicting Movie Reviews

In [39]:
data = export[['Title', 'Score', 'Plot']].copy()  # Copy so later column assignments don't touch export.
data_seg = export[['Title', 'Score', 'Plot']].copy()

In [40]:
data.shape  

(7187, 3)

In [41]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7187 entries, 0 to 7186
Data columns (total 3 columns):
Title    7187 non-null object
Score    7187 non-null float64
Plot     7187 non-null object
dtypes: float64(1), object(2)
memory usage: 544.6+ KB


In [42]:
bins = [0, 50, 100]  # Two ordinal categories: (0, 50] is 'Bad' and (50, 100] is 'Good'.
group_names = ['Bad', 'Good']  # These names made the most sense. 
data['Categories'] = pd.cut(data['Score'], bins, labels=group_names)
data.head()

Unnamed: 0,Title,Score,Plot,Categories
0,american made,63.0,American Made tells the story of Barry Seal (T...,Good
1,unlocked,40.0,A CIA interrogator is lured into a ruse that p...,Bad
2,death note,41.0,Light Yagami is a genius high school student w...,Bad
3,birth of the dragon,35.0,"In 1965 in San Francisco, Bruce Lee, spurred b...",Bad
4,the dark tower,34.0,The story deals with an early rendition of int...,Bad
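
By default `pd.cut` uses right-closed intervals, so with the bins above a score of exactly 50 is 'Bad', 51 is 'Good', and a score of exactly 0 falls outside every bin and becomes NaN. The plain-Python sketch below mirrors that interval rule so the edge cases are easy to check. `bin_score` is a hypothetical helper, not part of pandas, and it returns `None` where pandas would return NaN.

```python
from bisect import bisect_left

def bin_score(score, edges, labels):
    """Label a score using right-closed bins (edges[i], edges[i + 1]]."""
    idx = bisect_left(edges, score)
    if 1 <= idx <= len(labels):
        return labels[idx - 1]
    return None  # Outside every bin, e.g. a score equal to the lowest edge.

edges, labels = [0, 50, 100], ['Bad', 'Good']
binned = {s: bin_score(s, edges, labels) for s in [0, 40, 50, 51, 63, 100]}
```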


In [43]:
bins_ = [0, 25, 50, 75, 100]  # Going to bin the movies into four different ordinal categories. 
group_names_ = ['Bad', 'Okay', 'Good', 'Great']  # These names made the most sense. 
data_seg['Categories'] = pd.cut(data_seg['Score'], bins_, labels=group_names_)
data_seg.head()

Unnamed: 0,Title,Score,Plot,Categories
0,american made,63.0,American Made tells the story of Barry Seal (T...,Good
1,unlocked,40.0,A CIA interrogator is lured into a ruse that p...,Okay
2,death note,41.0,Light Yagami is a genius high school student w...,Okay
3,birth of the dragon,35.0,"In 1965 in San Francisco, Bruce Lee, spurred b...",Okay
4,the dark tower,34.0,The story deals with an early rendition of int...,Okay


In [44]:
# data['Good'] = pd.Series(data['Score']/100.0 + .25).apply(int)
# data['Good'].value_counts()  #h/t to Roland for this cool trick. 

In [45]:
# print (6203.0/(7187.1))

In [46]:
def str_conv(x):
    # Cast each plot to a plain string before tokenizing.
    return str(x)

In [47]:
data.Plot = data.Plot.progress_map(str_conv)
data_seg.Plot = data_seg.Plot.progress_map(str_conv)

progress-bar: 100%|██████████| 7187/7187 [00:03<00:00, 2095.13it/s]
progress-bar: 100%|██████████| 7187/7187 [00:03<00:00, 1889.86it/s]


In [48]:
data.Plot.head()

0    American Made tells the story of Barry Seal (T...
1    A CIA interrogator is lured into a ruse that p...
2    Light Yagami is a genius high school student w...
3    In 1965 in San Francisco, Bruce Lee, spurred b...
4    The story deals with an early rendition of int...
Name: Plot, dtype: object

In [49]:
print (pd.value_counts(data['Categories']))
print (pd.value_counts(data_seg['Categories']))

Good    4258
Bad     2929
Name: Categories, dtype: int64
Good     3367
Okay     2534
Great     891
Bad       395
Name: Categories, dtype: int64


In [50]:
# With help from 
# Evann Smith, Ph.D.
# Senior Data Scientist - Thresher 
# evann@thresher.io 
# http://www.evannsmith.com

In [51]:
import random
import string
import re
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in later scikit-learn releases.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
import gensim
from gensim import corpora, models
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
from gensim.models.wrappers import LdaMallet
from gensim.corpora import Dictionary
import pyLDAvis.gensim

In [52]:
stop = set(stopwords.words('english') + list(string.punctuation))
#stemmer = PorterStemmer()
lemma = WordNetLemmatizer() # Attempted both stemming and lemmatizing. 
re_punct = re.compile('[' + re.escape(string.punctuation) + ']')  # Escape so ], \ and ^ are treated literally.

In [53]:
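def preprocess(text):  # Courtesy of Evann Smith.
    try:
        text = text.lower()
        tokens = word_tokenize(text)
        tokens = [t for t in tokens if t not in stop]    # Drop stop words and bare punctuation.
        tokens = [re_punct.sub('', t) for t in tokens]   # Strip punctuation inside tokens.
        tokens = [t for t in tokens if len(t) > 2]       # Drop tokens shorter than 3 characters.
        tokens = [lemma.lemmatize(t) for t in tokens]
        if len(tokens) == 0:
            return None
        else:
            return ' '.join(tokens)
    except Exception:
        # Any plot that fails to process returns None and is dropped from the data afterwards.
        return None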

In [54]:
data['Tokenized'] = data['Plot'].progress_map(preprocess)  # Like map, but with a progress bar.
data = data[data['Tokenized'].notnull()].reset_index(drop=True)  # Drop plots that failed preprocessing.

progress-bar: 100%|██████████| 7187/7187 [01:10<00:00, 101.39it/s]


In [55]:
print('{} reviews'.format(len(data)))
data.head()

4588 reviews


Unnamed: 0,Title,Score,Plot,Categories,Tokenized
0,american made,63.0,American Made tells the story of Barry Seal (T...,Good,american made tell story barry seal tom cruise...
1,unlocked,40.0,A CIA interrogator is lured into a ruse that p...,Bad,cia interrogator lured ruse put london risk bi...
2,birth of the dragon,35.0,"In 1965 in San Francisco, Bruce Lee, spurred b...",Bad,1965 san francisco bruce lee spurred student s...
3,the dark tower,34.0,The story deals with an early rendition of int...,Bad,story deal early rendition interdimensional tr...
4,kidnap,44.0,"Sonia (Minissha Lamba) lives with her mother, ...",Bad,sonia minissha lamba life mother mallika vidya...


In [56]:
data_seg['Tokenized'] = data_seg['Plot'].progress_map(preprocess)  # Similar to the apply function except with a progress bar.
data_seg = data_seg[data_seg['Tokenized'].notnull()].reset_index(drop=True)  # Drop rows where preprocessing returned None.

progress-bar: 100%|██████████| 7187/7187 [01:11<00:00, 99.83it/s] 


In [57]:
print('{} reviews'.format(len(data_seg)))
data_seg.head()

4588 reviews


Unnamed: 0,Title,Score,Plot,Categories,Tokenized
0,american made,63.0,American Made tells the story of Barry Seal (T...,Good,american made tell story barry seal tom cruise...
1,unlocked,40.0,A CIA interrogator is lured into a ruse that p...,Okay,cia interrogator lured ruse put london risk bi...
2,birth of the dragon,35.0,"In 1965 in San Francisco, Bruce Lee, spurred b...",Okay,1965 san francisco bruce lee spurred student s...
3,the dark tower,34.0,The story deals with an early rendition of int...,Okay,story deal early rendition interdimensional tr...
4,kidnap,44.0,"Sonia (Minissha Lamba) lives with her mother, ...",Okay,sonia minissha lamba life mother mallika vidya...


In [58]:
data[data.Categories=='Good'][:20]

Unnamed: 0,Title,Score,Plot,Categories,Tokenized
0,american made,63.0,American Made tells the story of Barry Seal (T...,Good,american made tell story barry seal tom cruise...
5,dunkirk,94.0,The film relates the story of Operation Dynamo...,Good,film relates story operation dynamo evacuation...
9,captain underpants: the first epic movie,69.0,"Two elementary school students, George Beard a...",Good,two elementary school student george beard har...
10,god of war,54.0,"Kim Jun is the son of an escaped palace slave,...",Good,kim jun son escaped palace slave get raised mo...
12,alien: covenant,65.0,Bound for a remote planet on the far side of t...,Good,bound remote planet far side galaxy crew colon...
14,sleight,62.0,A young street magician (Jacob Latimore) is le...,Good,young street magician jacob latimore left care...
16,the fate of the furious,56.0,"With Dom and Letty on their honeymoon, Brian a...",Good,dom letty honeymoon brian mia retired game res...
17,colossal,70.0,After losing her job and boyfriend in New York...,Good,losing job boyfriend new york gloria anne hath...
18,ghost in the shell,52.0,The plot follows the members of Public Securit...,Good,plot follows member public security section ma...
20,logan,77.0,"With all his memories back, Wolverine has retu...",Good,memory back wolverine returned japan one first...


#### Analysis of Movie Words by Category

In [61]:
texts_bad = data_seg[data_seg.Categories=='Bad'].Tokenized.tolist() 
texts_okay = data_seg[data_seg.Categories=='Okay'].Tokenized.tolist() 
texts_good = data_seg[data_seg.Categories=='Good'].Tokenized.tolist() 
texts_great = data_seg[data_seg.Categories=='Great'].Tokenized.tolist() 

In [67]:
cvec = CountVectorizer(min_df=1, max_df=500, max_features=10000)

In [68]:
X_bad = cvec.fit_transform(texts_bad)
freqs_bad = [(word, X_bad.getcol(idx).sum()) for word, idx in cvec.vocabulary_.items()]
# Sort from largest to smallest.
print(sorted(freqs_bad, key=lambda x: -x[1])[:20])  # Looking at the top 20 most frequent words in "Bad" movies.

[(u'find', 389), (u'one', 279), (u'get', 274), (u'take', 264), (u'tell', 249), (u'back', 229), (u'friend', 214), (u'father', 203), (u'two', 197), (u'kill', 197), (u'later', 188), (u'life', 186), (u'new', 184), (u'house', 181), (u'make', 180), (u'man', 176), (u'day', 170), (u'come', 156), (u'child', 155), (u'try', 153)]


In [69]:
X_okay = cvec.fit_transform(texts_okay)
freqs_okay = [(word, X_okay.getcol(idx).sum()) for word, idx in cvec.vocabulary_.items()]
# Sort from largest to smallest.
print(sorted(freqs_okay, key=lambda x: -x[1])[:20])  # Looking at the top 20 most frequent words in "Okay" movies.

[(u'kill', 1175), (u'family', 995), (u'mother', 921), (u'police', 902), (u'car', 891), (u'escape', 862), (u'night', 800), (u'child', 746), (u'death', 742), (u'next', 742), (u'love', 739), (u'film', 730), (u'woman', 712), (u'killed', 711), (u'away', 710), (u'call', 698), (u'school', 688), (u'reveals', 666), (u'leaf', 664), (u'way', 659)]


In [70]:
X_good = cvec.fit_transform(texts_good)
freqs_good = [(word, X_good.getcol(idx).sum()) for word, idx in cvec.vocabulary_.items()]
# Sort from largest to smallest.
print(sorted(freqs_good, key=lambda x: -x[1])[:20])  # Looking at the top 20 most frequent words in "Good" movies.

[(u'police', 967), (u'wife', 922), (u'child', 838), (u'escape', 825), (u'school', 823), (u'car', 822), (u'call', 818), (u'john', 797), (u'son', 774), (u'money', 760), (u'asks', 737), (u'killed', 730), (u'order', 728), (u'room', 715), (u'show', 712), (u'group', 698), (u'girl', 698), (u'leave', 693), (u'decides', 690), (u'run', 683)]


In [71]:
X_great = cvec.fit_transform(texts_great)
freqs_great = [(word, X_great.getcol(idx).sum()) for word, idx in cvec.vocabulary_.items()]
# Sort from largest to smallest.
print(sorted(freqs_great, key=lambda x: -x[1])[:20])  # Looking at the top 20 most frequent words in "Great" movies.

[(u'find', 537), (u'one', 528), (u'take', 467), (u'tell', 449), (u'home', 413), (u'two', 410), (u'family', 391), (u'back', 379), (u'life', 377), (u'later', 359), (u'return', 351), (u'time', 336), (u'day', 334), (u'friend', 327), (u'mother', 325), (u'new', 325), (u'father', 322), (u'get', 320), (u'man', 314), (u'house', 312)]


We have found something rather unfortunate here. The most frequent words in each category are not very informative: words like 'find', 'take', 'family', and 'tell' are among the most common. I will need to change the tokenization parameters and remove some extra words. Additionally, when we cap document frequency with max_df in the CountVectorizer, a lot of names pop up in the most frequent word lists. It may be a good idea to remove these as well.
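
One way to act on this, sketched below in plain Python rather than with the CountVectorizer used above: extend the stop word list with the generic words that dominate every category, then recount. The `extra_stop` set, the `top_words` helper, and the toy `docs` list are illustrative assumptions, not the words or data actually used in this project.

```python
from collections import Counter

# Hypothetical additions to the stop word list, chosen only for illustration.
extra_stop = {'find', 'take', 'tell', 'one', 'get', 'back', 'two'}

def top_words(docs, stop, n=3):
    """Count tokens across whitespace-tokenized docs, skipping anything in `stop`."""
    counts = Counter()
    for doc in docs:
        counts.update(t for t in doc.split() if t not in stop)
    return counts.most_common(n)

# Toy documents standing in for the 'Tokenized' column.
docs = [
    'find soldier escape camp',
    'tell soldier find war',
    'take soldier back war camp',
]

print(top_words(docs, stop=set()))       # Generic verbs compete with content words.
print(top_words(docs, stop=extra_stop))  # Only content words remain.
```

The same set could be merged into the `stop` set used by `preprocess`, so that every downstream model benefits from the filtering and not just the frequency counts.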

In [73]:
just_dummies = pd.get_dummies(data['Categories'])
data = pd.concat([data, just_dummies['Good']], axis=1)    
data.head()

Unnamed: 0,Title,Score,Plot,Categories,Tokenized,Good
0,american made,63.0,American Made tells the story of Barry Seal (T...,Good,american made tell story barry seal tom cruise...,1
1,unlocked,40.0,A CIA interrogator is lured into a ruse that p...,Bad,cia interrogator lured ruse put london risk bi...,0
2,birth of the dragon,35.0,"In 1965 in San Francisco, Bruce Lee, spurred b...",Bad,1965 san francisco bruce lee spurred student s...,0
3,the dark tower,34.0,The story deals with an early rendition of int...,Bad,story deal early rendition interdimensional tr...,0
4,kidnap,44.0,"Sonia (Minissha Lamba) lives with her mother, ...",Bad,sonia minissha lamba life mother mallika vidya...,0


In [74]:
texts = data.Tokenized.tolist() 
y = data.Good.tolist()
vectorizer = TfidfVectorizer() # Term Frequency Inverse Document Frequency Vectorizer
X = vectorizer.fit_transform(texts)

In [75]:
texts[:3]

[u'american made tell story barry seal tom cruise twa pilot recruited cia help counter emerging communist threat central america seal role major cia covert operation led turn involvement medellin cartel ultimately embarrassed reagan white house irancontra scandal became public',
 u'cia interrogator lured ruse put london risk biological attack',
 u'1965 san francisco bruce lee spurred student steve mckee challenge shaolin monk kung master wong jack man martial art fight']
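
For reference, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies under its default settings (smoothed IDF, then L2 normalization of each row). The `tfidf` helper and the toy documents are illustrative assumptions; the real vectorizer also handles tokenization and sparse storage.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    idf(t) = ln((1 + n) / (1 + df(t))) + 1, then each row is scaled to unit L2 norm.
    """
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: c * (math.log((1.0 + n) / (1.0 + df[t])) + 1.0)
                   for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Toy documents, not the project's data.
docs = ['cia pilot story', 'cia interrogator london', 'story bruce lee']
rows = tfidf(docs)
# 'pilot' appears in one document and 'cia' in two, so 'pilot' gets the larger weight.
print(rows[0])
```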

In [76]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=58) # Train Test split

In [77]:
from sklearn.metrics import roc_auc_score

In [78]:
classifier_ln = LinearSVC()
%time classifier_ln.fit(X_train, y_train)
print('Accuracy: {}'.format(round(classifier_ln.score(X_test, y_test), 5)))

CPU times: user 113 ms, sys: 7.49 ms, total: 121 ms
Wall time: 124 ms
Accuracy: 0.58715


In [79]:
classifier_rf = RandomForestClassifier(class_weight={1:6})
%time classifier_rf.fit(X_train, y_train)
print('Accuracy: {}'.format(round(classifier_rf.score(X_test, y_test), 5)))

CPU times: user 2.46 s, sys: 11 ms, total: 2.47 s
Wall time: 2.47 s
Accuracy: 0.5305


In [82]:
from xgboost.sklearn import XGBClassifier

In [87]:
xgb = XGBClassifier(max_depth=2)
%time xgb.fit(X_train.toarray(), y_train, eval_metric='auc')
print('Accuracy: {}'.format(round(xgb.score(X_test.toarray(), y_test), 5)))

CPU times: user 4min 37s, sys: 2.33 s, total: 4min 40s
Wall time: 4min 40s
Accuracy: 0.6122


In [88]:
preds = xgb.predict_proba(X_test.todense())
roc_auc_score(y_test, [item[1] for item in preds])

0.60466165340199751

In [89]:
classifier_nb = GaussianNB()
%time classifier_nb.fit(X_train.toarray(), y_train)
print('Accuracy: {}'.format(round(classifier_nb.score(X_test.toarray(), y_test), 5)))

CPU times: user 2.83 s, sys: 2.19 s, total: 5.02 s
Wall time: 5.09 s
Accuracy: 0.53486


In [90]:
X_test[:20]

<20x50903 sparse matrix of type '<type 'numpy.float64'>'
	with 4106 stored elements in Compressed Sparse Row format>

In [91]:
xgb.predict(X_test)

array([1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 0,
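
Most of the predictions shown above are 1, so it is worth checking accuracy against a trivial baseline that always predicts the most common class. A minimal sketch follows; `majority_baseline_accuracy`, `accuracy`, and the toy labels are illustrative assumptions, and the class balance of the real test set is not shown here.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label in y_true."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / float(len(y_true))

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / float(len(y_true))

# Toy labels, not the project's data.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1]

print(accuracy(y_true, y_pred))
print(majority_baseline_accuracy(y_true))
```

A classifier is only adding information when its accuracy clearly exceeds the baseline, which is why the ROC AUC computed above is a useful companion metric.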

# Latent Dirichlet Allocation 

In [92]:
# stringz = [each.encode('utf-8') for each in texts ]
# stringz

In [93]:
# # we add some words to the stop word list
# strangs, article = [], []
# for w in strangs:
#     # if it's not a stop word or punctuation mark, add it to our article!
#     if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num:
#         # we add the lematized version of the word
#         article.append(w.lemma_)
#     # if it's a new line, it means we're onto our next document
#     if w.text == '\n':
#         strangs.append(article)
#         article = []

In [94]:
stringz = [each.split(" ") for each in texts]
stringz = [each for sublist in stringz for each in sublist]
stringz[:5]

[u'american', u'made', u'tell', u'story', u'barry']

In [95]:
vocab = vectorizer.vocabulary_
texts = [[token for token in tokens.split() if token in vocab] for tokens in data.Tokenized]
# myarray = np.asarray(stringz)
# myarray

In [96]:
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

In [97]:
lsimodel = LsiModel(corpus=corpus, num_topics=10, id2word=id2word)
lsimodel.show_topics()

[(0,
  u'0.204*"find" + 0.181*"tell" + 0.168*"one" + 0.158*"take" + 0.151*"back" + 0.150*"get" + 0.128*"house" + 0.127*"two" + 0.120*"home" + 0.120*"father"'),
 (1,
  u'0.202*"soldier" + 0.185*"escape" + 0.184*"kill" + 0.168*"suraj" + -0.156*"tell" + 0.134*"camp" + -0.133*"mother" + -0.128*"father" + 0.125*"indian" + -0.125*"house"'),
 (2,
  u'-0.927*"john" + -0.124*"sarah" + -0.106*"sam" + 0.067*"jack" + 0.065*"billy" + -0.064*"max" + -0.057*"mark" + -0.054*"house" + -0.054*"kate" + -0.053*"henri"'),
 (3,
  u'0.599*"jack" + -0.219*"suraj" + -0.161*"indian" + -0.159*"pakistani" + -0.147*"billy" + -0.142*"george" + 0.141*"kill" + 0.131*"baker" + -0.131*"gurtu" + -0.122*"kabir"'),
 (4,
  u'-0.607*"jack" + -0.214*"suraj" + 0.164*"kill" + -0.157*"indian" + -0.156*"pakistani" + -0.143*"baker" + -0.143*"john" + -0.128*"gurtu" + -0.120*"kabir" + 0.117*"police"'),
 (5,
  u'0.428*"charlie" + 0.379*"billy" + 0.264*"george" + -0.196*"father" + -0.189*"mother" + -0.134*"family" + 0.129*"jack" + 0.

In [98]:
hdpmodel = HdpModel(corpus=corpus, id2word=id2word)
hdpmodel.show_topics()

[(0,
  u'0.005*find + 0.004*one + 0.004*tell + 0.004*take + 0.003*get + 0.003*back + 0.003*two + 0.003*life + 0.003*friend + 0.003*father + 0.003*home + 0.003*house + 0.003*new + 0.003*time + 0.003*later + 0.002*family + 0.002*day + 0.002*see + 0.002*return + 0.002*mother'),
 (1,
  u'0.004*find + 0.003*one + 0.003*take + 0.003*tell + 0.003*get + 0.003*back + 0.003*father + 0.003*kill + 0.002*house + 0.002*two + 0.002*friend + 0.002*later + 0.002*time + 0.002*man + 0.002*life + 0.002*new + 0.002*home + 0.002*day + 0.002*family + 0.002*return'),
 (2,
  u'0.004*find + 0.003*one + 0.003*take + 0.003*tell + 0.003*back + 0.003*get + 0.003*life + 0.003*two + 0.002*return + 0.002*home + 0.002*kill + 0.002*friend + 0.002*later + 0.002*time + 0.002*house + 0.002*father + 0.002*try + 0.002*max + 0.002*new + 0.002*police'),
 (3,
  u'0.003*find + 0.003*one + 0.002*jerry + 0.002*get + 0.002*two + 0.002*mireu + 0.002*take + 0.002*new + 0.002*kill + 0.002*also + 0.002*back + 0.002*man + 0.002*escape +

In [99]:
ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=id2word)
ldamodel.show_topics()

[(0,
  u'0.005*"life" + 0.005*"one" + 0.005*"take" + 0.004*"find" + 0.004*"man" + 0.004*"father" + 0.004*"home" + 0.003*"new" + 0.003*"story" + 0.003*"mother"'),
 (1,
  u'0.004*"find" + 0.004*"one" + 0.004*"david" + 0.003*"two" + 0.003*"back" + 0.003*"time" + 0.003*"man" + 0.003*"life" + 0.003*"tell" + 0.003*"take"'),
 (2,
  u'0.005*"johnny" + 0.004*"one" + 0.004*"tom" + 0.004*"tell" + 0.003*"later" + 0.003*"take" + 0.003*"time" + 0.003*"find" + 0.003*"back" + 0.003*"home"'),
 (3,
  u'0.004*"one" + 0.004*"later" + 0.003*"life" + 0.003*"find" + 0.003*"new" + 0.003*"max" + 0.003*"film" + 0.003*"michael" + 0.003*"two" + 0.003*"time"'),
 (4,
  u'0.005*"bobby" + 0.005*"marty" + 0.004*"find" + 0.003*"one" + 0.003*"tell" + 0.003*"take" + 0.003*"time" + 0.003*"back" + 0.003*"two" + 0.002*"diana"'),
 (5,
  u'0.004*"find" + 0.004*"house" + 0.004*"one" + 0.004*"back" + 0.003*"take" + 0.003*"family" + 0.003*"two" + 0.003*"david" + 0.003*"kill" + 0.003*"home"'),
 (6,
  u'0.006*"find" + 0.005*"tell"

In [101]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(hdpmodel, corpus, id2word)


The HDP model shows that some of the movies can be grouped together. Above, Topic 67 is most likely a topic containing movies from the Hunger Games series. One problem with the topics is that common words like "find", "tell", and "kill" end up in many of them.
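
A common remedy is to drop tokens that occur in a large share of documents before building the bag-of-words corpus. gensim's `Dictionary.filter_extremes` offers this through its `no_below` and `no_above` arguments; the sketch below shows the same idea in plain Python so the logic is visible. The `filter_by_doc_frequency` helper, the thresholds, and the toy documents are illustrative assumptions, not values tuned for this corpus.

```python
from collections import Counter

def filter_by_doc_frequency(texts, no_below=1, no_above=0.5):
    """Keep tokens found in at least `no_below` docs and in at most `no_above` of all docs."""
    n_docs = len(texts)
    # Document frequency: number of documents containing each token.
    df = Counter(token for tokens in texts for token in set(tokens))
    keep = {t for t, c in df.items()
            if c >= no_below and c / float(n_docs) <= no_above}
    return [[t for t in tokens if t in keep] for tokens in texts]

# Toy token lists, not the project's data.
texts = [
    ['find', 'katniss', 'tribute', 'kill'],
    ['find', 'soldier', 'camp', 'kill'],
    ['find', 'king', 'throne', 'tell'],
    ['find', 'tell', 'truck', 'kill'],
]

for tokens in filter_by_doc_frequency(texts):
    print(tokens)
```

With these toy thresholds, 'find' and 'kill' are removed because they appear in more than half of the documents, while 'tell' sits exactly at the cutoff and survives.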

# Word2Vec

In [102]:
y = []
doc_vectors = []
for i, row in data.iterrows():
    doc_vectors.append(TaggedDocument(row['Tokenized'].split(), ['doc_' + str(i)]))
    y.append(row['Good'])
print(len(y), len(doc_vectors))

(4588, 4588)


In [103]:
def shuffle_docs(docs):
    random.shuffle(docs)
    return docs

In [104]:
model = Doc2Vec(size=100, window=10, min_count=1, workers=4)
model.build_vocab(doc_vectors)
for epoch in range(20):
    print('Epoch {}'.format(epoch))
    # total_examples is the number of documents seen per pass, not a token count.
    model.train(shuffle_docs(doc_vectors), total_examples=len(doc_vectors), epochs=1)
# Map each document tag to its learned vector, then rebuild X in the original row order.
d2v = {d: vec for d, vec in zip(model.docvecs.offset2doctag, model.docvecs.doctag_syn0)}
X = []
for d in range(len(doc_vectors)):
    X.append(d2v['doc_' + str(d)])
X = np.array(X)

Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19


In [106]:
model.most_similar('war')

[(u'civil', 0.5927691459655762),
 (u'b29s', 0.5464341640472412),
 (u'invasion', 0.5267268419265747),
 (u'1945', 0.5152638554573059),
 (u'iraq', 0.5118180513381958),
 (u'mastroianni', 0.48624444007873535),
 (u'stuttgart', 0.47948622703552246),
 (u'1943', 0.47625666856765747),
 (u'waging', 0.47615715861320496),
 (u'marcello', 0.4733803868293762)]

In [107]:
model.most_similar('truck')

[(u'vehicle', 0.6240042448043823),
 (u'jeep', 0.611855685710907),
 (u'suv', 0.5873789191246033),
 (u'hitch', 0.5663658380508423),
 (u'pickup', 0.5577775239944458),
 (u'van', 0.5465189814567566),
 (u'newsstand', 0.5439530611038208),
 (u'car', 0.5401383638381958),
 (u'highway', 0.5352294445037842),
 (u'cab', 0.5229488611221313)]

In [108]:
model.most_similar('king')

[(u'lear', 0.5927168130874634),
 (u'throne', 0.565933883190155),
 (u'crown', 0.5155367851257324),
 (u'kingdom', 0.5120627880096436),
 (u'savon', 0.5081352591514587),
 (u'crimson', 0.5009835362434387),
 (u'sparta', 0.4960322380065918),
 (u'idealization', 0.48955875635147095),
 (u'richelieu', 0.477540522813797),
 (u'reemerge', 0.4741384983062744)]

In [112]:
model.most_similar('katniss')

[(u'peeta', 0.8815475106239319),
 (u'haymitch', 0.7171944379806519),
 (u'tribute', 0.708436131477356),
 (u'odair', 0.6481788754463196),
 (u'finnick', 0.6466540694236755),
 (u'rue', 0.6211423873901367),
 (u'defiance', 0.6206035614013672),
 (u'cinna', 0.6032858490943909),
 (u'capitol', 0.5518869161605835),
 (u'mellark', 0.5256476998329163)]

In [113]:
model.most_similar('fake')

[(u'id', 0.5129965543746948),
 (u'underdog', 0.46679624915122986),
 (u'scolex', 0.45160406827926636),
 (u'gadget', 0.451467365026474),
 (u'popie', 0.43373537063598633),
 (u'disc', 0.42792826890945435),
 (u'setup', 0.41971760988235474),
 (u'undeniable', 0.41077888011932373),
 (u'vault', 0.40518733859062195),
 (u'utilizing', 0.40414512157440186)]

In [114]:
model.most_similar('news')

[(u'medium', 0.5657176375389099),
 (u'newspaper', 0.4859614968299866),
 (u'report', 0.46371129155158997),
 (u'anchorman', 0.4636768698692322),
 (u'coverage', 0.4575899839401245),
 (u'radio', 0.4483267366886139),
 (u'bergman', 0.43578487634658813),
 (u'adjourns', 0.41432374715805054),
 (u'actuarial', 0.4046815037727356),
 (u'corningstone', 0.4033523201942444)]
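
The scores returned by `most_similar` are cosine similarities between word vectors. Below is a minimal sketch of that computation, using made-up three-dimensional vectors rather than the 100-dimensional ones trained above; the helper functions and the `vectors` dictionary are illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    scores = [(w, cosine_similarity(vectors[word], vec))
              for w, vec in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: -pair[1])[:topn]

# Hypothetical toy embeddings, chosen only to illustrate the ranking.
vectors = {
    'truck': [0.9, 0.1, 0.0],
    'jeep': [0.8, 0.2, 0.1],
    'throne': [0.0, 0.1, 0.9],
}

print(most_similar('truck', vectors))
```

Because the score depends only on the angle between vectors, words that share contexts in the plots (such as 'katniss' and 'peeta') score high regardless of how often each one appears.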