## AC209b / CS109b Final Project - Milestone 3 Part 01
Yujiao Chen, Brian Ho, Jonathan Jay // 04/12/2017


**Traditional statistical and machine learning methods, due Wednesday, April 19, 2017**

Think about how you would address the genre prediction problem with traditional statistical or machine learning methods. This includes everything you learned about modeling in this course before the deep learning part. Implement your ideas and compare different classifiers. Report your results and discuss what challenges you faced and how you overcame them. What works and what does not? If there are parts that do not work as expected, make sure to discuss briefly what you think the cause is and how you would address it if you had more time and resources.

You do not necessarily need to use the movie posters for this step, but even without a background in computer vision, there are very simple features you can extract from the posters to help guide a traditional machine learning model. Think about the PCA lecture for example, or how to use clustering to extract color information. In addition to considering the movie posters it would be worthwhile to have a look at the metadata that IMDb provides. 

You could use Spark and the [ML library](https://spark.apache.org/docs/latest/ml-features.html#word2vec) to build your model features from the data. This may be especially beneficial if you use additional data, e.g., in text form.

You also need to think about how you are going to evaluate your classifier. Which metrics or scores will you report to show how good the performance is?

The notebook to submit this week should at least include:

- Detailed description and implementation of two different models
- Description of your performance metrics
- Careful performance evaluations for both models
- Visualizations of the metrics for performance evaluation
- Discussion of the differences between the models, their strengths, weaknesses, etc. 
- Discussion of the performances you achieved, and how you might be able to improve them in the future
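
Since the labels used in this notebook form a three-class problem, overall accuracy can hide a class that is predicted poorly. The following is a minimal sketch of per-class precision, recall, and F1, written in plain Python so the definitions are visible. The function name `per_class_scores` and the toy labels are illustrative assumptions, not results from this project; `sklearn.metrics` provides `confusion_matrix` and `classification_report` for the same purpose.

```python
from collections import Counter

def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in labels:
        tp = pairs[(c, c)]
        predicted = sum(n for (t, p), n in pairs.items() if p == c)
        actual = sum(n for (t, p), n in pairs.items() if t == c)
        precision = tp / float(predicted) if predicted else 0.0
        recall = tp / float(actual) if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores

# Toy labels, made up for illustration: 0 = romance, 1 = horror, 2 = sci-fi
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2]
scores = per_class_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / float(len(y_true))
```

In this toy case the accuracy is 0.75, yet class 1 has precision and recall of only 0.5, which is the kind of gap a single accuracy number does not show.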

In [86]:
## Import libraries
import pandas as pd
import numpy as np

import imdb
import requests
from ast import literal_eval
from xgboost import XGBClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer

In [125]:
## Read in the data
movies = pd.read_csv("Movie subset for poster analysis_990 movies_cleaned.csv")
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990 entries, 0 to 989
Data columns (total 22 columns):
Unnamed: 0           990 non-null int64
Unnamed: 0.1         990 non-null int64
X                    990 non-null int64
adult                990 non-null bool
backdrop_path        981 non-null object
genre_ids            990 non-null object
id                   990 non-null int64
original_language    990 non-null object
original_title       990 non-null object
overview             986 non-null object
popularity           990 non-null float64
poster_path          990 non-null object
release_date         990 non-null object
title                990 non-null object
video                990 non-null bool
vote_average         990 non-null float64
vote_count           990 non-null int64
genre_names          990 non-null object
date                 990 non-null object
year                 990 non-null int64
genres               990 non-null object
decade               990 non-null int64
dt

In [126]:
## Cleanup column names
movies.rename(columns={"Unnamed: 0":"result_id"}, inplace = True)
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990 entries, 0 to 989
Data columns (total 22 columns):
result_id            990 non-null int64
Unnamed: 0.1         990 non-null int64
X                    990 non-null int64
adult                990 non-null bool
backdrop_path        981 non-null object
genre_ids            990 non-null object
id                   990 non-null int64
original_language    990 non-null object
original_title       990 non-null object
overview             986 non-null object
popularity           990 non-null float64
poster_path          990 non-null object
release_date         990 non-null object
title                990 non-null object
video                990 non-null bool
vote_average         990 non-null float64
vote_count           990 non-null int64
genre_names          990 non-null object
date                 990 non-null object
year                 990 non-null int64
genres               990 non-null object
decade               990 non-null int64
dt

In [128]:
## Filter out movies with invalid information 
valid_overview = [type(i) is str for i in movies["overview"]]
movies = movies[valid_overview]
# valid_title_filter = [type(i) is str for i in movies["title"]]
# movies = movies[valid_title_filter]

movies = movies.reset_index()

movies.tail()

Unnamed: 0,level_0,index,result_id,Unnamed: 0.1,X,adult,backdrop_path,genre_ids,id,original_language,...,release_date,title,video,vote_average,vote_count,genre_names,date,year,genres,decade
981,981,985,985,8478,15,False,/4liSXBZZdURI0c1Id1zLJo6Z3Gu.jpg,"[878, 14, 28, 12]",76757,en,...,2015-02-04,Jupiter Ascending,False,5.2,2206,"[Science Fiction, Fantasy, Action, Adventure]",2015-02-04,2015,"878, 14, 28, 12",2010
982,982,986,986,8165,2,False,/cfVoH243KjWXV6JoLzwxqWNb23i.jpg,"[878, 12, 9648]",70981,en,...,2012-05-30,Prometheus,False,6.2,4135,"[Science Fiction, Adventure, Mystery]",2012-05-30,2012,"878, 12, 9648",2010
983,983,987,987,7964,1,False,/jxdSxqAFrdioKgXwgTs5Qfbazjq.jpg,"[12, 28, 878]",10138,en,...,2010-04-28,Iron Man 2,False,6.6,5601,"[Adventure, Action, Science Fiction]",2010-04-28,2010,"12, 28, 878",2010
984,984,988,988,8167,4,False,/cZkPJ0noQvcR3oCCZ4pwYZeWUYi.jpg,"[28, 53, 878]",59967,en,...,2012-09-26,Looper,False,6.6,4053,"[Action, Thriller, Science Fiction]",2012-09-26,2012,"28, 53, 878",2010
985,985,989,989,8380,17,False,/oZY3DOlEZbEZvRxWynWkFTe4UgE.jpg,"[53, 878, 18, 9648]",157353,en,...,2014-04-16,Transcendence,False,5.9,1861,"[Thriller, Science Fiction, Drama, Mystery]",2014-04-16,2014,"53, 878, 18, 9648",2010


In [129]:
## Create bag-of-words feature representation from movie summaries
corpus = movies["overview"].tolist()
vectorizer = CountVectorizer(min_df=20, stop_words="english")
words = vectorizer.fit_transform(corpus)
print words.toarray().shape
print vectorizer.get_feature_names()

## Convert to data frame
words = pd.DataFrame(words.A, columns=vectorizer.get_feature_names())
words

(986, 149)
[u'alien', u'american', u'away', u'based', u'battle', u'beautiful', u'begin', u'begins', u'best', u'black', u'body', u'boy', u'brother', u'business', u'car', u'century', u'child', u'children', u'city', u'come', u'comes', u'control', u'crew', u'dark', u'daughter', u'day', u'dead', u'deadly', u'death', u'destroy', u'discover', u'discovers', u'doctor', u'dr', u'earth', u'end', u'escape', u'evil', u'face', u'falls', u'family', u'father', u'fight', u'film', u'finds', u'forces', u'friend', u'friends', u'future', u'gets', u'girl', u'girlfriend', u'goes', u'good', u'government', u'group', u'help', u'high', u'home', u'horror', u'house', u'human', u'husband', u'including', u'island', u'job', u'john', u'just', u'killer', u'later', u'learns', u'left', u'life', u'like', u'lives', u'living', u'local', u'love', u'make', u'man', u'married', u'meet', u'meets', u'men', u'mission', u'mother', u'murder', u'mysterious', u'named', u'new', u'night', u'old', u'order', u'past', u'people', u'place', 

Unnamed: 0,alien,american,away,based,battle,beautiful,begin,begins,best,black,...,way,wife,woman,women,work,world,year,years,york,young
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2,0
9,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
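
To make the effect of `min_df` concrete, here is a small plain-Python sketch of a bag-of-words count matrix with a document-frequency cutoff. It is a simplified stand-in, not `CountVectorizer` itself: it uses the same default token pattern (words of two or more characters) but skips stop-word removal, and the helper name `bag_of_words` and the three toy summaries are assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(docs, min_df=1):
    """Return (vocabulary, count rows), keeping words found in >= min_df documents."""
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]

    # Document frequency: in how many documents each word appears at least once
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocab = sorted(w for w, n in doc_freq.items() if n >= min_df)

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[w] for w in vocab])
    return vocab, rows

# Toy summaries, made up for illustration
docs = [
    "An alien crew lands on Earth.",
    "The crew must escape the alien ship.",
    "A young woman falls in love.",
]
vocab, rows = bag_of_words(docs, min_df=2)
```

Note that "the" occurs twice in the second summary but in only one document, so it is dropped at `min_df=2`: the cutoff counts documents, not total occurrences. With `min_df=20` on 986 overviews, the same rule is what reduces the vocabulary to the 149 words shown above.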


In [130]:
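## Create predictors from metadata and bag-of-words counts
X = movies[[u'adult', u'id', u'popularity',
            u'year',
            u'vote_average', u'vote_count']]

## Join on the row index; rsuffix renames word columns that clash with metadata names (e.g. "year")
X = X.join(words, how="left", rsuffix="_word")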

In [131]:
## A function to add a label for the response variable (which genre within our three categories of interest)
def classify(ids):
    if "10749" in ids:
        return 0
    elif "27" in ids:
        return 1
    elif "878" in ids:
        return 2

## Create response from complete genre labels
movies["label"] = movies.apply(lambda x: classify(x["genre_ids"]), axis=1)
Y = movies["label"].values
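
The `classify` function above tests for substrings in the raw `genre_ids` text and returns `None` when none of the three genres is present. A hedged alternative, sketched below, parses the string into integers first so that an ID can never match inside a longer number, and makes the no-match case explicit. The names `classify_ids` and `LABELS` are hypothetical, and the sketch assumes `genre_ids` looks like `"[878, 14, 28, 12]"` as in the table output earlier.

```python
from ast import literal_eval

# (genre ID, label) in the same priority order as classify():
# 10749 -> 0, 27 -> 1, 878 -> 2
LABELS = [(10749, 0), (27, 1), (878, 2)]

def classify_ids(genre_ids):
    """Map a genre ID list (or its string form) to a label, or None if no match."""
    ids = genre_ids
    if not isinstance(ids, (list, tuple)):
        ids = literal_eval(ids)      # "[878, 14]" -> [878, 14]
    if isinstance(ids, int):
        ids = [ids]                  # a lone ID such as "878"
    for genre_id, label in LABELS:
        if genre_id in ids:
            return label
    return None                      # none of the three genres present

label = classify_ids("[878, 14, 28, 12]")
```

Rows that come back as `None` would need to be dropped or handled before the labels are passed to cross-validation.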

In [123]:
'''
## Encode labels
le = LabelEncoder()
le.fit(movies["genre_names"])
Y = le.transform(movies["genre_names"])
'''

In [133]:
## Set up K-fold cross validation
kf = StratifiedKFold(Y, n_folds=5, shuffle=True)
print len(kf)

for train_index, test_index in kf:
    x_train, x_test = X.iloc[train_index,:], X.iloc[test_index,:]
    y_train, y_test = Y[train_index,], Y[test_index,]
    
    model = XGBClassifier()
    model.fit(x_train, y_train)
    
    # predict() already returns class labels, so no rounding is needed
    y_pred = model.predict(x_test)
    
    # Evaluate predictions
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))

5
[  0   1   2   4   5   6   9  10  11  12  13  15  16  17  18  19  21  22
  23  25  26  27  28  29  30  31  32  33  34  35  36  37  39  40  41  42
  44  45  46  47  49  50  51  52  53  54  56  58  59  60  61  63  64  65
  66  67  68  69  71  72  73  74  75  76  77  78  79  82  83  85  86  87
  88  89  91  92  94  95  96  97  98  99 100 101 104 105 106 108 110 111
 112 113 114 115 117 118 119 120 121 122 124 125 126 129 131 132 134 135
 137 138 139 140 141 142 143 144 146 148 149 150 152 153 155 156 157 158
 159 161 162 163 164 165 166 167 168 169 170 171 172 174 175 177 178 179
 180 181 184 185 186 187 188 189 190 191 192 194 196 197 198 199 200 201
 203 204 205 206 207 209 210 211 212 213 214 215 216 217 218 219 220 222
 223 224 226 227 228 229 230 232 233 234 235 236 237 238 239 240 242 244
 246 247 249 251 252 253 254 255 257 259 260 261 263 264 265 266 267 268
 269 270 271 272 273 274 275 276 278 279 280 283 284 285 286 287 288 290
 291 292 293 294 296 297 298 299 300 301 302 303 