## AC209b / CS109b Final Project - Milestone 3 Part 01
Yujiao Chen, Brian Ho, Jonathan Jay // 04/12/2017


**Traditional statistical and machine learning methods, due Wednesday, April 19, 2017**

Think about how you would address the genre prediction problem with traditional statistical or machine learning methods. This includes everything you learned about modeling in this course before the deep learning part. Implement your ideas and compare different classifiers. Report your results and discuss what challenges you faced and how you overcame them. What works and what does not? If some parts do not work as expected, briefly discuss what you think the cause is and how you would address it if you had more time and resources.

You do not necessarily need to use the movie posters for this step, but even without a background in computer vision, there are very simple features you can extract from the posters to help guide a traditional machine learning model. Think about the PCA lecture for example, or how to use clustering to extract color information. In addition to considering the movie posters it would be worthwhile to have a look at the metadata that IMDb provides. 
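
As a rough illustration of the clustering idea mentioned above, the sketch below runs a basic k-means over a list of RGB tuples and returns the cluster centers as "dominant colors". The function name `dominant_colors`, its parameters, and the toy pixel list are all hypothetical; loading real poster pixels is not shown, and a library implementation of k-means would normally replace this hand-written loop.

```python
import random


def dominant_colors(pixels, k=2, n_iter=10, seed=0):
    """Cluster RGB tuples with a basic k-means and return the k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(pixels, k)  # start from k randomly chosen pixels
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            # assign each pixel to the nearest centroid (squared distance)
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(
                    sum(channel) / float(len(members)) for channel in zip(*members)
                )
    return sorted(centroids)


# Toy "poster" (made-up data): mostly dark red with some near-white pixels
pixels = [(200, 10, 10)] * 6 + [(250, 250, 250)] * 4
colors = dominant_colors(pixels, k=2)
```

The resulting centroids could then be flattened into a handful of numeric columns (for example, the R, G, B values of the top few colors) and appended to the metadata features.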

You could use Spark and the [ML library](https://spark.apache.org/docs/latest/ml-features.html#word2vec) to build your model features from the data. This may be especially beneficial if you use additional data, e.g., in text form.

You also need to think about how you are going to evaluate your classifier. Which metrics or scores will you report to show how good the performance is?

The notebook to submit this week should at least include:

- Detailed description and implementation of two different models
- Description of your performance metrics
- Careful performance evaluations for both models
- Visualizations of the metrics for performance evaluation
- Discussion of the differences between the models, their strengths, weaknesses, etc. 
- Discussion of the performances you achieved, and how you might be able to improve them in the future

In [94]:
import pandas as pd
import numpy as np

import imdb
import requests
from ast import literal_eval
from xgboost import XGBClassifier
from sklearn.cross_validation import KFold
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

In [2]:
## Get the genre codes from TMDb (The Movie Database)
payload = {'api_key': '9290a6fe9125b32e7bbe5512036be0d0'}
r = requests.get('https://api.themoviedb.org/3/genre/movie/list', params=payload)

genres = pd.DataFrame.from_dict(r.json()["genres"])
genres = genres.set_index("id")

genres = genres["name"].to_dict()
genres

{12: u'Adventure',
 14: u'Fantasy',
 16: u'Animation',
 18: u'Drama',
 27: u'Horror',
 28: u'Action',
 35: u'Comedy',
 36: u'History',
 37: u'Western',
 53: u'Thriller',
 80: u'Crime',
 99: u'Documentary',
 878: u'Science Fiction',
 9648: u'Mystery',
 10402: u'Music',
 10749: u'Romance',
 10751: u'Family',
 10752: u'War',
 10770: u'TV Movie'}

In [36]:
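# Fill in missing genre names by looking up each genre id in the `genres` dict
def get_names(existing, ids):
    # pandas reads a missing value as NaN; only those rows need a lookup
    if str(existing).lower() == "nan":
        ids = literal_eval(ids)  # genre_ids is stored as a string such as "[35, 18]"
        names = []
        for id_ in ids:
            if id_ in genres:
                names.append(genres[id_])
            else:
                # id is absent from the API's genre list, so report and skip it
                print id_, " not found"
        return names

    else:
        return existing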
    
get_names("NaN", str([14, 28, 878, 10769]))

10769  not found


[u'Fantasy', u'Action', u'Science Fiction']

In [37]:
movies = pd.read_csv("Movie subset for poster analysis_990 movies.csv")
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990 entries, 0 to 989
Data columns (total 21 columns):
Unnamed: 0           990 non-null int64
X                    990 non-null int64
adult                990 non-null bool
backdrop_path        981 non-null object
genre_ids            990 non-null object
id                   990 non-null int64
original_language    990 non-null object
original_title       990 non-null object
overview             986 non-null object
popularity           990 non-null float64
poster_path          990 non-null object
release_date         990 non-null object
title                990 non-null object
video                990 non-null bool
vote_average         990 non-null float64
vote_count           990 non-null int64
genre_names          985 non-null object
date                 990 non-null object
year                 990 non-null int64
genres               990 non-null object
decade               990 non-null int64
dtypes: bool(2), float64(2), int64(6), obj

In [38]:
movies["genre_names"] = movies.apply(lambda x: get_names(x["genre_names"], x["genre_ids"]), axis=1)

10769  not found
10769  not found
10769  not found
10769  not found
10769  not found


In [32]:
movies.to_csv("Movie subset for poster analysis_990 movies_cleaned.csv")

In [70]:
movies.rename(columns={"Unnamed: 0":"result_id"}, inplace = True)

In [73]:
movies.head()

Unnamed: 0,result_id,X,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,...,release_date,title,video,vote_average,vote_count,genre_names,date,year,genres,decade
0,3063,0,False,/haHULz59jdG4OVmqHjZ01z7HEit.jpg,"[35, 18, 10749]",164,en,Breakfast at Tiffany's,Fortune hunter Holly Golightly finds herself c...,3.978771,...,1961-10-05,Breakfast at Tiffany's,False,7.5,769,"[Comedy, Drama, Romance]",1961-10-05,1961,"35, 18, 10749",1960
1,3463,0,False,/evAe6OMQgRkrVWxjLktYy1tIARW.jpg,"[18, 10749, 10752]",907,en,Doctor Zhivago,Doctor Zhivago is the filmed adapation of the ...,3.292234,...,1965-12-22,Doctor Zhivago,False,7.3,193,"[Drama, Romance, War]",1965-12-22,1965,"18, 10749, 10752",1960
2,2964,1,False,/vF4d4hAKMcrDE4Y6zLwUSijp32g.jpg,"[35, 18, 10749]",284,en,The Apartment,Bud Baxter is a minor clerk in a huge New York...,3.270765,...,1960-06-15,The Apartment,False,8.0,383,"[Comedy, Drama, Romance]",1960-06-15,1960,"35, 18, 10749",1960
3,3663,0,False,/sEWKMIlgUCCb2n5mT0Mf71DRKLR.jpg,"[35, 18, 10749]",37247,en,The Graduate,Recent college graduate Benjamin Braddock is s...,3.257459,...,1967-12-21,The Graduate,False,7.5,637,"[Comedy, Drama, Romance]",1967-12-21,1967,"35, 18, 10749",1960
4,3363,0,False,/soKwqxI1j5rNkVfDeuarmTHetcH.jpg,"[18, 10751, 10402, 10749]",11113,en,My Fair Lady,A misogynistic and snobbish phonetics professo...,3.09274,...,1964-10-21,My Fair Lady,False,7.4,269,"[Drama, Family, Music, Romance]",1964-10-21,1964,"18, 10751, 10402, 10749",1960


In [76]:
movies.genre_names

0                               [Comedy, Drama, Romance]
1                                  [Drama, Romance, War]
2                               [Comedy, Drama, Romance]
3                               [Comedy, Drama, Romance]
4                        [Drama, Family, Music, Romance]
5                              [Drama, History, Romance]
6                        [Drama, Family, Music, Romance]
7                               [Comedy, Drama, Romance]
8                       [Comedy, Family, Music, Romance]
9                   [Comedy, Mystery, Romance, Thriller]
10                           [Adventure, Music, Romance]
11                  [Adventure, Drama, History, Romance]
12                            [Comedy, Romance, Western]
13                     [Romance, Crime, Thriller, Drama]
14                                      [Drama, Romance]
15                                      [Romance, Drama]
16             [Drama, History, Romance, War, Adventure]
17                          [Ac

In [147]:
X = movies[[u'adult', u'id', u'popularity',
            u'year',
            u'vote_average', u'vote_count']]

In [144]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990 entries, 0 to 989
Data columns (total 7 columns):
adult           990 non-null bool
id              990 non-null int64
overview        986 non-null object
popularity      990 non-null float64
year            990 non-null int64
vote_average    990 non-null float64
vote_count      990 non-null int64
dtypes: bool(1), float64(2), int64(3), object(1)
memory usage: 47.4+ KB


In [114]:
# Encode each distinct genre combination as a single integer class label
le = LabelEncoder()
le.fit(movies["genre_names"])
Y = le.transform(movies["genre_names"])
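
Encoding whole combinations turns every distinct genre set into its own class, so movies that share most of their genres still get unrelated labels. One alternative, sketched here as an assumption rather than as what this notebook does, is a multi-label indicator matrix with one 0/1 column per genre. The helper `binarize_genres` is hypothetical; scikit-learn's `MultiLabelBinarizer` provides the same kind of transformation.

```python
def binarize_genres(genre_lists):
    """Return (sorted genre names, list of 0/1 rows with one column per genre)."""
    classes = sorted(set(g for genres in genre_lists for g in genres))
    column = dict((name, j) for j, name in enumerate(classes))
    matrix = []
    for genres in genre_lists:
        row = [0] * len(classes)
        for g in genres:
            row[column[g]] = 1  # mark every genre the movie belongs to
        matrix.append(row)
    return classes, matrix


# Three genre lists taken from the `movies.genre_names` output above
sample = [
    ["Comedy", "Drama", "Romance"],
    ["Drama", "Romance", "War"],
    ["Drama", "Romance"],
]
classes, Y_multi = binarize_genres(sample)
```

With this layout, a separate binary classifier can be fit per genre column, and predictions can be scored genre by genre.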

In [81]:
kf = KFold(len(X), n_folds=2)
print len(kf)

2
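
The first rows of `movies` are all 1960s romance titles, which suggests the file is ordered. If so, contiguous folds may train and test on different kinds of movies. The sketch below shows the idea of shuffling row positions before splitting; the function `shuffled_kfold_indices` is a hypothetical stand-in, and scikit-learn's `KFold` accepts a `shuffle` option that serves the same purpose.

```python
import random


def shuffled_kfold_indices(n_samples, n_folds=2, seed=0):
    """Yield (train_idx, test_idx) lists after shuffling the row positions."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # deal the shuffled positions round-robin into n_folds groups
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            j for f, fold in enumerate(folds) if f != i for j in fold
        )
        yield train_idx, test_idx


# Small example with 10 rows and 2 folds
splits = list(shuffled_kfold_indices(10, n_folds=2))
```

Each row lands in exactly one test fold, and fixing `seed` keeps the split reproducible between runs.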


In [149]:
for train_index, test_index in kf:
    x_train, x_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = Y[train_index], Y[test_index]

    model = XGBClassifier()
    model.fit(x_train, y_train)

    # predict() already returns encoded class labels, so no rounding is needed
    y_pred = model.predict(x_test)

    # evaluate predictions
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))

KeyboardInterrupt: 
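
Plain accuracy on combination labels is strict: a prediction only counts when the entire genre set matches. As a sketch of metrics that give partial credit, the functions below compute exact-match accuracy, Hamming loss, and micro-averaged F1 on 0/1 indicator rows with one column per genre. The function names and the toy `y_true` / `y_pred` values are hypothetical and are not results from this notebook.

```python
def exact_match(y_true, y_pred):
    """Fraction of rows whose full genre set is predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / float(len(y_true))


def hamming_loss(y_true, y_pred):
    """Fraction of individual genre slots that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        1
        for t_row, p_row in zip(y_true, y_pred)
        for t, p in zip(t_row, p_row)
        if t != p
    )
    return wrong / float(total)


def micro_f1(y_true, y_pred):
    """F1 score computed from counts pooled over all genres and rows."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0


# Made-up toy labels: the first row misses one genre, the second is exact
y_true = [[1, 1, 1, 0], [0, 1, 1, 1]]
y_pred = [[1, 1, 0, 0], [0, 1, 1, 1]]
scores = (exact_match(y_true, y_pred),
          hamming_loss(y_true, y_pred),
          micro_f1(y_true, y_pred))
```

In this toy case exact match penalizes the first row fully, while Hamming loss and micro F1 reflect that only one of eight genre slots is wrong.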