# Project 6: IMDB

This project involves NLP, decision trees, bagging, boosting, and more!

---

## Load packages

You are likely going to need to install the `imdbpie` package:

    > pip install imdbpie

---

In [1]:
import os
import subprocess
import collections
import re
import csv
import json

import pandas as pd
import numpy as np
import scipy

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

import psycopg2
import requests
from imdbpie import Imdb
import nltk

import urllib
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

---

## Part 1: Acquire the Data

You will connect to the IMDB API to query for movies. 

See here for documentation on how to use the package:

https://github.com/richardasaurus/imdb-pie

#### 1. Connect to the IMDB API

In [2]:
imdb = Imdb(anonymize=True)

#### 2. Query the top 250 rated movies in the database

In [3]:
top_250 = imdb.top_250()
top_250

[{u'can_rate': True,
  u'image': {u'height': 1388,
   u'url': u'http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg',
   u'width': 933},
  u'num_votes': 1661769,
  u'rating': 9.3,
  u'tconst': u'tt0111161',
  u'title': u'The Shawshank Redemption',
  u'type': u'feature',
  u'year': u'1994'},
 {u'can_rate': True,
  u'image': {u'height': 500,
   u'url': u'http://ia.media-imdb.com/images/M/MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_.jpg',
   u'width': 333},
  u'num_votes': 1137373,
  u'rating': 9.2,
  u'tconst': u'tt0068646',
  u'title': u'The Godfather',
  u'type': u'feature',
  u'year': u'1972'},
 {u'can_rate': True,
  u'image': {u'height': 500,
   u'url': u'http://ia.media-imdb.com/images/M/MV5BNDc2NTM3MzU1Nl5BMl5BanBnXkFtZTcwMTA5Mzg3OA@@._V1_.jpg',
   u'width': 333},
  u'num_votes': 776871,
  u'rating': 9,
  u'tconst': u'tt0071562',
  u'title': u'The Godfather: Part II',
  u'type': u'feature',
  u'year': u'1974'},
 ...

#### 3. Make a dataframe from the movie data

Keep the fields:

    num_votes
    rating
    tconst
    title
    year
    
And discard the rest

In [4]:
df = pd.DataFrame(top_250).drop(['can_rate', 'image' , 'type'], axis=1)
df.head(10)

| | num_votes | rating | tconst | title | year |
|---|---|---|---|---|---|
| 0 | 1661769 | 9.3 | tt0111161 | The Shawshank Redemption | 1994 |
| 1 | 1137373 | 9.2 | tt0068646 | The Godfather | 1972 |
| 2 | 776871 | 9.0 | tt0071562 | The Godfather: Part II | 1974 |
| 3 | 1647436 | 9.0 | tt0468569 | The Dark Knight | 2008 |
| 4 | 850185 | 8.9 | tt0108052 | Schindler's List | 1993 |
| 5 | 438417 | 8.9 | tt0050083 | 12 Angry Men | 1957 |
| 6 | 1302412 | 8.9 | tt0110912 | Pulp Fiction | 1994 |
| 7 | 1195893 | 8.9 | tt0167260 | The Lord of the Rings: The Return of the King | 2003 |
| 8 | 495355 | 8.9 | tt0060196 | The Good, the Bad and the Ugly | 1966 |
| 9 | 1323154 | 8.9 | tt0137523 | Fight Club | 1999 |


#### 4. Select only the top 100 movies

In [5]:
df[:100]

| | num_votes | rating | tconst | title | year |
|---|---|---|---|---|---|
| 0 | 1661769 | 9.3 | tt0111161 | The Shawshank Redemption | 1994 |
| 1 | 1137373 | 9.2 | tt0068646 | The Godfather | 1972 |
| 2 | 776871 | 9.0 | tt0071562 | The Godfather: Part II | 1974 |
| 3 | 1647436 | 9.0 | tt0468569 | The Dark Knight | 2008 |
| 4 | 850185 | 8.9 | tt0108052 | Schindler's List | 1993 |
| 5 | 438417 | 8.9 | tt0050083 | 12 Angry Men | 1957 |
| 6 | 1302412 | 8.9 | tt0110912 | Pulp Fiction | 1994 |
| 7 | 1195893 | 8.9 | tt0167260 | The Lord of the Rings: The Return of the King | 2003 |
| 8 | 495355 | 8.9 | tt0060196 | The Good, the Bad and the Ugly | 1966 |
| 9 | 1323154 | 8.9 | tt0137523 | Fight Club | 1999 |


#### 5. Get the genres and runtime for each movie and add them to the dataframe

There can be multiple genres per movie, so this will need some finessing.
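
One possible way to handle several genres per movie is to one-hot encode them, so each genre gets its own 0/1 column. The sketch below is only an illustration: the `genre_list` column and the sample values are assumptions, not scraped data.

```python
import pandas as pd

# Hypothetical sample: each movie carries a list of genres (illustrative values only).
sample = pd.DataFrame({
    'tconst': ['tt0111161', 'tt0468569'],
    'genre_list': [['Crime', 'Drama'], ['Action', 'Crime', 'Drama']],
})

# Join each list into a delimited string, then expand it into one indicator column per genre.
genre_dummies = sample['genre_list'].str.join('|').str.get_dummies(sep='|')

# Attach the indicator columns to the rest of the frame.
encoded = pd.concat([sample.drop('genre_list', axis=1), genre_dummies], axis=1)
print(encoded)
```

Indicator columns like these can go straight into the tree models later in the project, whereas a single free-text genre column cannot.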

In [6]:
# Use BeautifulSoup (bs4) to scrape the genre and other details

url_1 = 'http://www.imdb.com/chart/top'
response = requests.get(url_1)
soup = BeautifulSoup(response.text, 'lxml')

In [7]:
# Collect the link to each movie's page so we can navigate from the chart page to the pages with the relevant data

links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
links

['/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=01VDMWF05KZTQ04AXTYW&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1',
 '/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=01VDMWF05KZTQ04AXTYW&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_2',
 '/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=01VDMWF05KZTQ04AXTYW&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_3',
 '/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=01VDMWF05KZTQ04AXTYW&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_4',
 '/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=01VDMWF05KZTQ04AXTYW&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_5',
 '/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=01VDMWF05KZTQ04AXTYW&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_6',
 '/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=01VDMWF05KZTQ04AXT

In [8]:
# The entries in 'links' are relative paths, not full URLs, so we prepend the site's base URL

base_url = 'http://www.imdb.com'

# Build the full URL for each movie page and store the results in next_page
next_page = [base_url + x for x in links]

In [9]:
# Visit each URL in 'next_page' and collect the genre.
# .find() returns only the first matching tag (find_all would return every match),
# so only the first listed genre is kept; .string gives the tag's text without the markup.

genres = []  # declared outside the loop so the results accumulate

for url in next_page:
    response1 = requests.get(url)
    soup3 = BeautifulSoup(response1.text, 'lxml')
    scrape = soup3.body.find('span', attrs={'class': 'itemprop'}).string
    genres.append(scrape)

print(genres)

[u'Crime', u'Crime', u'Crime', u'Action', u'Biography', u'Crime', u'Crime', u'Adventure', u'Western', u'Drama', u'Adventure', u'Action', u'Drama', u'Action', u'Action', u'Drama', u'Biography', u'Action', u'Action', u'Action', u'Crime', u'Crime', u'Crime', u'Drama', u'Crime', u'Comedy', u'Crime', u'Western', u'Animation', u'Action', u'Crime', u'Adventure', u'Drama', u'Horror', u'Comedy', u'Action', u'Biography', u'Mystery', u'Comedy', u'Crime', u'Action', u'Biography', u'Crime', u'Adventure', u'Drama', u'Action', u'Mystery', u'Drama', u'Drama', u'Animation', u'Comedy', u'Drama', u'Horror', u'Comedy', u'Drama', u'Drama', u'Drama', u'Drama', u'Drama', u'Animation', u'Animation', u'Action', u'Drama', u'Action', u'Animation', u'Drama', u'Drama', u'Adventure', u'Crime', u'Adventure', u'Mystery', u'Action', u'Crime', u'Crime', u'Crime', u'Comedy', u'Biography', u'Drama', u'Crime', u'Crime', u'Crime', u'Animation', u'Adventure', u'Crime', u'Drama', u'Drama', u'Biography', u'Comedy', u'Adventur

In [10]:
df["genres"] = genres
df

| | num_votes | rating | tconst | title | year | genres |
|---|---|---|---|---|---|---|
| 0 | 1661769 | 9.3 | tt0111161 | The Shawshank Redemption | 1994 | Crime |
| 1 | 1137373 | 9.2 | tt0068646 | The Godfather | 1972 | Crime |
| 2 | 776871 | 9.0 | tt0071562 | The Godfather: Part II | 1974 | Crime |
| 3 | 1647436 | 9.0 | tt0468569 | The Dark Knight | 2008 | Action |
| 4 | 850185 | 8.9 | tt0108052 | Schindler's List | 1993 | Biography |
| 5 | 438417 | 8.9 | tt0050083 | 12 Angry Men | 1957 | Crime |
| 6 | 1302412 | 8.9 | tt0110912 | Pulp Fiction | 1994 | Crime |
| 7 | 1195893 | 8.9 | tt0167260 | The Lord of the Rings: The Return of the King | 2003 | Adventure |
| 8 | 495355 | 8.9 | tt0060196 | The Good, the Bad and the Ugly | 1966 | Western |
| 9 | 1323154 | 8.9 | tt0137523 | Fight Club | 1999 | Drama |


#### 6. Write the results to a CSV
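
A minimal sketch of this step, assuming the movie data lives in a dataframe shaped like `df` above. The file name `imdb_top_movies.csv` and the small stand-in frame are placeholders.

```python
import pandas as pd

# Small stand-in for the movie dataframe built above.
movies = pd.DataFrame({
    'tconst': ['tt0111161', 'tt0068646'],
    'title': ['The Shawshank Redemption', 'The Godfather'],
    'rating': [9.3, 9.2],
})

# index=False keeps the row index from being written as an extra unnamed column.
movies.to_csv('imdb_top_movies.csv', index=False, encoding='utf-8')

# Read the file back to confirm the round trip.
check = pd.read_csv('imdb_top_movies.csv')
print(check.shape)
```

Leaving the index out also makes the later PostgreSQL import simpler, since every column in the file then maps to a named field.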

---

## Part 2: Wrangle the text data

#### 1. Scrape the reviews for the top 100 movies

*Hint*: Use a loop to scrape the pages one after another

In [11]:
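# Note: [:101] selects 101 IDs (rows 0 through 100), which is why len(all_reviews) is 101 below;
# use [:100] for exactly the top 100
titles = df['tconst'][:101]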
titles

0      tt0111161
1      tt0068646
2      tt0071562
3      tt0468569
4      tt0108052
5      tt0050083
6      tt0110912
7      tt0167260
8      tt0060196
9      tt0137523
10     tt0120737
11     tt0080684
12     tt0109830
13     tt1375666
14     tt0167261
15     tt0073486
16     tt0099685
17     tt0133093
18     tt0047478
19     tt0076759
20     tt0317248
21     tt0114369
22     tt0102926
23     tt0038650
24     tt0114814
25     tt0118799
26     tt0110413
27     tt0064116
28     tt0245429
29     tt0120815
         ...    
71     tt0086190
72     tt0051201
73     tt0022100
74     tt0105236
75     tt0211915
76     tt0112573
77     tt0180093
78     tt0066921
79     tt0075314
80     tt0036775
81     tt0435761
82     tt0056172
83     tt0056592
84     tt0338013
85     tt0093058
86     tt0086879
87     tt0070735
88     tt0062622
89     tt0040522
90     tt0045152
91     tt0208092
92     tt0114709
93     tt0071853
94     tt0361748
95     tt0012349
96     tt0119488
97     tt0059578
98     tt00536

In [19]:
all_reviews = []

for i in titles:
    reviews = imdb.get_title_reviews(i, max_results=1000)
    all_reviews.append(reviews)
    

#### 2. Extract the reviews and the rating per review for each movie

*Note*: "soup" from BeautifulSoup is the HTML returned from all 25 pages. You'll need to either address each page individually or break the HTML down by element.
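
One way to get from a nested list of reviews to a flat table is sketched below. The `Review` stand-in and its `text` and `rating` fields are assumptions made for illustration; inspect the objects your scraper or `imdbpie` actually returns and adjust the attribute names accordingly.

```python
from collections import namedtuple
import pandas as pd

# Stand-in for a review object; the field names are assumptions for illustration.
Review = namedtuple('Review', ['text', 'rating'])

# Hypothetical nested structure: one list of reviews per movie ID.
titles = ['tt0111161', 'tt0068646']
all_reviews = [
    [Review('placeholder review text A', 10), Review('placeholder review text B', 8)],
    [Review('placeholder review text C', 9)],
]

# Flatten to one row per review, keeping the movie ID so the rows can be joined later.
rows = [
    {'tconst': tconst, 'review': r.text, 'review_rating': r.rating}
    for tconst, reviews in zip(titles, all_reviews)
    for r in reviews
]
reviews_df = pd.DataFrame(rows)
print(reviews_df)
```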

In [26]:
len(all_reviews)

101

In [25]:
for i in all_reviews:
    print(len(i))

1000
1000
650
1000
1000
877
1000
1000
770
1000
1000
890
1000
1000
1000
752
982
1000
596
1000
740
1000
908
703
1000
1000
836
562
894
1000
1000
1000
1000
1000
216
767
362
628
210
1000
976
758
1000
801
718
1000
1000
977
1000
648
835
451
1000
195
406
388
1000
1000
344
500
1000
1000
1000
1000
564
803
1000
508
489
425
682
643
199
285
923
1000
1000
1000
1000
877
318
727
555
464
1000
687
544
250
1000
225
566
725
380
652
1000
114
621
216
237
262
474


In [37]:
ratings = [i.text for i in soup.find_all('td', attrs={'class':"ratingColumn imdbRating"})]
df['ratings'] = ratings

In [39]:
# Need to go back here and strip the '\n...\n' from the ratings
df

| | num_votes | rating | tconst | title | year | genres | ratings |
|---|---|---|---|---|---|---|---|
| 0 | 1661769 | 9.3 | tt0111161 | The Shawshank Redemption | 1994 | Crime | `\n9.2\n` |
| 1 | 1137373 | 9.2 | tt0068646 | The Godfather | 1972 | Crime | `\n9.2\n` |
| 2 | 776871 | 9.0 | tt0071562 | The Godfather: Part II | 1974 | Crime | `\n9.0\n` |
| 3 | 1647436 | 9.0 | tt0468569 | The Dark Knight | 2008 | Action | `\n8.9\n` |
| 4 | 850185 | 8.9 | tt0108052 | Schindler's List | 1993 | Biography | `\n8.9\n` |
| 5 | 438417 | 8.9 | tt0050083 | 12 Angry Men | 1957 | Crime | `\n8.9\n` |
| 6 | 1302412 | 8.9 | tt0110912 | Pulp Fiction | 1994 | Crime | `\n8.9\n` |
| 7 | 1195893 | 8.9 | tt0167260 | The Lord of the Rings: The Return of the King | 2003 | Adventure | `\n8.9\n` |
| 8 | 495355 | 8.9 | tt0060196 | The Good, the Bad and the Ugly | 1966 | Western | `\n8.9\n` |
| 9 | 1323154 | 8.9 | tt0137523 | Fight Club | 1999 | Drama | `\n8.8\n` |
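
The newline padding in the scraped `ratings` column can be removed with pandas string methods. A minimal sketch on sample values shaped like the ones above:

```python
import pandas as pd

# Sample values in the same shape as the scraped 'ratings' column.
ratings = pd.Series(['\n9.2\n', '\n9.0\n', '\n8.9\n'])

# str.strip() removes leading and trailing whitespace, including newlines;
# astype(float) then turns the cleaned strings into numbers.
clean = ratings.str.strip().astype(float)
print(clean.tolist())
```

On the real dataframe, the equivalent would be to assign the result of `df['ratings'].str.strip().astype(float)` back to the column.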


#### 3. Remove the non-alphanumeric characters from the reviews
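
A regular expression is one straightforward way to do this. The sketch below keeps ASCII letters, digits, and spaces; the helper name `clean_review` and the sample sentence are illustrative, and you may want a wider character class if the reviews contain accented letters.

```python
import re

def clean_review(text):
    """Replace runs of characters other than ASCII letters, digits, or spaces
    with a single space, then collapse repeated whitespace."""
    no_punct = re.sub(r'[^A-Za-z0-9 ]+', ' ', text)
    return re.sub(r'\s+', ' ', no_punct).strip()

print(clean_review("Brilliant!!! 10/10 -- would watch again."))
# prints: Brilliant 10 10 would watch again
```

Replacing with a space rather than an empty string avoids gluing neighbouring words together (for example, `10/10` becoming `1010`).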

#### 4. Calculate the top 200 ngrams from the user reviews

Use the `TfidfVectorizer` in sklearn.

Recommended parameters:

    ngram_range = (1, 2)
    stop_words = 'english'
    binary = False
    max_features = 200

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
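
The recommended parameters can be passed straight to the vectorizer. The sketch below runs on a tiny made-up corpus; in the project the input would be the cleaned review texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus (made-up sentences, not real reviews).
corpus = [
    'a gripping prison drama with a great ending',
    'great acting and a gripping story',
    'the ending felt slow but the acting was great',
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),     # unigrams and bigrams
    stop_words='english',   # drop common English words
    binary=False,           # use term counts rather than presence/absence
    max_features=200,       # keep at most the 200 most frequent n-grams
)
tfidf = vectorizer.fit_transform(corpus)

# vocabulary_ maps each retained n-gram to its column index in the matrix.
ngrams = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(tfidf.shape, ngrams[:5])
```

The resulting sparse matrix has one row per review and one column per retained n-gram, so it can be converted to a dataframe with `ngrams` as the column names.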

#### 5. Merge the user reviews and ratings
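
A possible shape for this merge, assuming both tables share the `tconst` movie ID as a key. The two small frames and the `review` column name are placeholders.

```python
import pandas as pd

# Stand-in movie-level table with the site rating.
movies = pd.DataFrame({
    'tconst': ['tt0111161', 'tt0068646'],
    'rating': [9.3, 9.2],
})

# Stand-in review-level table (placeholder text).
reviews = pd.DataFrame({
    'tconst': ['tt0111161', 'tt0111161', 'tt0068646'],
    'review': ['placeholder text A', 'placeholder text B', 'placeholder text C'],
})

# A left merge keeps every review and attaches the matching movie rating to it.
merged = pd.merge(reviews, movies, on='tconst', how='left')
print(merged)
```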

#### 6. Save this merged dataframe as a csv

---

## Part 3: Combine Tables in PostgreSQL

#### 1. Import your two .csv data files into your PostgreSQL database as two different tables

For ease, we can call these table1 and table2

#### 2. Connect to database and query the joined set

#### 3. Join the two tables 
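
The join itself is plain SQL. The sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a database server; with `psycopg2` you would connect to your own database instead, and the column names here are assumptions about what your two CSV files contain.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL instance.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

cur.execute('CREATE TABLE table1 (tconst TEXT, title TEXT, rating REAL)')
cur.execute('CREATE TABLE table2 (tconst TEXT, review TEXT)')
cur.executemany('INSERT INTO table1 VALUES (?, ?, ?)', [
    ('tt0111161', 'The Shawshank Redemption', 9.3),
    ('tt0068646', 'The Godfather', 9.2),
])
cur.executemany('INSERT INTO table2 VALUES (?, ?)', [
    ('tt0111161', 'placeholder text A'),
    ('tt0111161', 'placeholder text B'),
    ('tt0068646', 'placeholder text C'),
])

# Inner join on the shared movie ID: one output row per review.
join_sql = """
SELECT t1.tconst, t1.title, t1.rating, t2.review
FROM table1 AS t1
JOIN table2 AS t2 ON t1.tconst = t2.tconst
ORDER BY t1.tconst, t2.review
"""
joined_rows = cur.execute(join_sql).fetchall()
conn.close()

for row in joined_rows:
    print(row)
```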

#### 4. Select the newly joined table and save two copies of it into dataframes

---

## Part 4: Parsing and Exploratory Data Analysis

#### 1. Rename any columns you think should be renamed for clarity

#### 2. Describe anything interesting or suspicious about your data (quality assurance)

#### 3. Make four visualizations of interest to you using the data

---

## Part 5: Decision Tree Classifiers and Regressors

#### 1. What is our target attribute? 

Choose a target variable for the decision tree regressor and the classifier. 

#### 2. Prepare the X and Y matrices and preprocess data as you see fit

#### 3. Build and cross-validate your decision tree classifier
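
A minimal sketch of cross-validating a decision tree classifier. It uses synthetic data from `make_classification` so that it runs on its own; in the project, `X` and `y` would be the matrices prepared in the previous step, and `max_depth=3` is an arbitrary starting value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real predictor matrix and target.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validation; for a classifier the default score is accuracy.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())
```

Reporting the spread across folds alongside the mean gives a rough sense of how stable the score is.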

#### 4. Gridsearch optimal parameters for your classifier. Does the performance improve?

#### 5. Build and cross-validate your decision tree regressor

#### 6. Gridsearch the optimal parameters for your regressor. Does performance improve?
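
One way to answer the "does performance improve?" question is to compare a default tree's cross-validated score with the best score from a grid search over the same folds. The sketch below uses synthetic data and an illustrative parameter grid; both are assumptions to replace with your own.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real predictors and regression target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)

# Baseline: default tree, mean R^2 over 5 folds.
baseline = cross_val_score(DecisionTreeRegressor(random_state=0), X, y, cv=5).mean()

# Illustrative grid; it includes the default settings (None, 1) as one candidate.
param_grid = {'max_depth': [2, 4, 6, None], 'min_samples_leaf': [1, 5, 10]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(baseline, search.best_score_, search.best_params_)
```

Because the grid contains the default configuration, the best grid score cannot be lower than the baseline; the interesting part is how much higher it is.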

---

## Part 6: Elastic Net


#### 1. Gridsearch optimal parameters for an ElasticNet using the regression target and predictors you used for the decision tree regressor.
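
A sketch of an ElasticNet grid search, again on synthetic data. The predictors are standardized inside a pipeline because the penalty is sensitive to feature scale; the grid values are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the regression target and predictors.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Scaling happens inside each CV fold, so no information leaks from held-out data.
pipe = make_pipeline(StandardScaler(), ElasticNet(max_iter=10000))

param_grid = {
    'elasticnet__alpha': [0.01, 0.1, 1.0, 10.0],     # overall penalty strength
    'elasticnet__l1_ratio': [0.1, 0.5, 0.9, 1.0],    # 1.0 is a pure Lasso penalty
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```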


#### 2. Is cross-validated performance better or worse than with the decision trees? 

#### 3. Explain why the elastic net may have performed best at that particular l1_ratio and alpha

---

## Part 7: Bagging and Boosting: Random Forests, Extra Trees, and AdaBoost

#### 1. Load the random forest regressor, extra trees regressor, and adaboost regressor from sklearn

#### 2. Gridsearch optimal parameters for the three different ensemble methods.

#### 3. Evaluate the performance of the two bagging models and the one boosting model. Which performs best?

#### 4. Extract the feature importances from the Random Forest regressor and make a DataFrame pairing variable names with their variable importances.
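
A fitted random forest exposes its importances through the `feature_importances_` attribute. The sketch below pairs them with column names in a dataframe; the synthetic data and the `feature_0` ... `feature_5` names are placeholders for your own predictors.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with placeholder column names.
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       random_state=0)
feature_names = ['feature_%d' % i for i in range(X.shape[1])]

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# Pair each variable with its importance and rank from most to least important.
importances = pd.DataFrame({
    'variable': feature_names,
    'importance': rf.feature_importances_,
}).sort_values('importance', ascending=False).reset_index(drop=True)
print(importances)
```

From here, `importances.plot.barh(x='variable', y='importance')` is one way to produce the ranked plot asked for in the next step.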

#### 5. Plot the ranked feature importances.

#### 6.1 [BONUS] Gridsearch an optimal Lasso model and use it for variable selection (make a new predictor matrix with only the variables not zeroed out by the Lasso). 
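
A sketch of Lasso-based variable selection on synthetic data: grid search the penalty strength, then keep only the columns whose coefficients are non-zero. The alpha grid is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 predictors, only 5 of which carry signal.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Grid search over the penalty strength.
search = GridSearchCV(Lasso(max_iter=10000),
                      {'alpha': [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_scaled, y)

# Columns with a non-zero coefficient survive the selection.
kept = np.flatnonzero(search.best_estimator_.coef_ != 0)
X_selected = X_scaled[:, kept]
print(search.best_params_, X_selected.shape)
```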

#### 6.2 [BONUS] Gridsearch your best performing bagging/boosting model from above with the features retained after the Lasso. Does the score improve?

#### 7.1 [BONUS] Select a threshold for variable importance from your Random Forest regressor and use that to perform feature selection, creating a new subset predictor matrix.

#### 7.2 [BONUS] Using BaggingRegressor with a base estimator of your choice, test a model using the feature-selected dataset you made in 7.1

---

## [VERY BONUS] Part 8: PCA

#### 1. Perform a PCA on your predictor matrix

#### 2. Examine the variance explained and determine what components you want to keep based on them.

#### 3. Plot the cumulative variance explained by the ordered principal components.
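
The cumulative variance explained is the running sum of `explained_variance_ratio_`. The sketch below computes it on synthetic data and picks the smallest number of components reaching a 90% threshold; the threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictor matrix.
X, _ = make_regression(n_samples=200, n_features=10, random_state=0)

# Standardize first so no single column dominates the components.
pca = PCA().fit(StandardScaler().fit_transform(X))

# Running total of the share of variance explained by the ordered components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%.
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3), n_keep)
```

To draw the curve, something like `plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')` with labelled axes would do.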

#### 4. Gridsearch an elastic net using the principal components you selected as your predictors. Does this perform better than the elastic net you fit earlier?

#### 5. Gridsearch a bagging ensemble estimator that you fit before, this time using the principal components as predictors. Does this perform better or worse than the original? 

#### 6. Look at the loadings of the original predictor columns on the first 3 principal components. Is there any kind of intuitive meaning here?

Hint: you will probably want to sort by the absolute value of each loading, and look only at the obviously important (larger) ones!
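
A sketch of ranking columns by the absolute size of their weights on a component. The data is random with two deliberately correlated columns, and the `col_a` ... `col_d` names are placeholders; the weights come from `pca.components_`, which is what the sketch treats as loadings.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: col_a and col_b share a common signal, col_c and col_d are noise.
rng = np.random.RandomState(0)
base = rng.normal(size=200)
X = pd.DataFrame({
    'col_a': base + rng.normal(scale=0.1, size=200),
    'col_b': base + rng.normal(scale=0.1, size=200),
    'col_c': rng.normal(size=200),
    'col_d': rng.normal(size=200),
})

pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X))

# Rows are original columns, columns are principal components.
loadings = pd.DataFrame(pca.components_.T, index=X.columns,
                        columns=['PC1', 'PC2', 'PC3'])

# Rank the original columns by absolute weight on the first component.
order = loadings['PC1'].abs().sort_values(ascending=False).index
pc1_ranked = loadings.loc[order, 'PC1']
print(pc1_ranked)
```

The sign of a component is arbitrary, which is why the ranking uses absolute values.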

## [EXTREMELY BONUS] Part 9: Clustering

![](https://snag.gy/jPSZ6U.jpg)

***Bonus Bonus:***
This extended bonus question asks you to do something we never really talked about, but would like you to attempt based on the assumptions we learned during this week's clustering lesson(s).

#### 1. Import your favorite clustering module

#### 2. Encode categoricals

#### 3. Evaluate cluster metrics solely based on a range of K
If K-Means: SSE/inertia vs. silhouette (i.e., elbow), silhouette average, etc.
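
A sketch of scanning a range of K with K-Means and recording both inertia and the average silhouette score. The blob data with three well-separated centers is synthetic, chosen so the expected answer is known.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=1.0, random_state=0)

results = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances (SSE).
    results.append((k, km.inertia_, silhouette_score(X, km.labels_)))

for k, inertia, sil in results:
    print(k, round(inertia, 1), round(sil, 3))

# The K with the highest average silhouette is one reasonable choice.
best_k = max(results, key=lambda r: r[2])[0]
print(best_k)
```

Inertia keeps falling as K grows, so it is read by looking for the elbow, while the silhouette score peaks at a specific K and can be compared directly.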

#### 4. Look at your data based on the subset of your predicted clusters.
Assign the cluster predictions back to your dataframe to see them in context. You can then group by cluster to get a sense of which data clumped together.
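
A sketch of attaching the predicted labels to a dataframe and summarizing each cluster. The six rows and the `rating` and `runtime` values are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up sample with two visibly different groups.
data = pd.DataFrame({
    'rating': [9.3, 9.2, 9.0, 8.1, 8.0, 8.2],
    'runtime': [150, 160, 155, 95, 100, 98],
})

# Fit on the numeric columns and store the predicted label next to each row.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
data['cluster'] = km.fit_predict(data[['rating', 'runtime']])

# Per-cluster averages give a quick profile of what ended up together.
summary = data.groupby('cluster').mean()
print(summary)
```

Note that unscaled columns with larger ranges (here `runtime`) dominate the distance calculation, so standardizing first is worth considering on real data.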

#### 5. Describe your findings based on the predicted clusters 
_How well did it do?  What's good or bad?  How would you improve this? Does any of it make sense?_