### Milestone 3: Traditional statistical and machine learning methods, due Wednesday, April 19, 2017

Think about how you would address the genre prediction problem with traditional statistical or machine learning methods. This includes everything you learned about modeling in this course before the deep learning part. Implement your ideas and compare different classifiers. Report your results and discuss what challenges you faced and how you overcame them. What works and what does not? If some parts do not work as expected, briefly discuss what you think the cause is and how you would address it if you had more time and resources.

You do not necessarily need to use the movie posters for this step, but even without a background in computer vision, there are very simple features you can extract from the posters to help guide a traditional machine learning model. Think about the PCA lecture, for example, or how to use clustering to extract color information. In addition to the movie posters, it is worth looking at the metadata that IMDb provides.
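As one illustration of the clustering idea, the sketch below runs a plain k-means over a poster's pixels and returns the cluster centers together with the share of pixels in each cluster, which could serve as a handful of color features. It is a minimal sketch under assumptions: the pixels are assumed to be already loaded as a list of `(r, g, b)` tuples, the name `dominant_colors` is hypothetical, and the farthest-first initialization is a simplification chosen to keep the example deterministic. In practice a library implementation such as `sklearn.cluster.KMeans` would be the more natural choice.

```python
from __future__ import division


def _sq_dist(p, q):
    # squared Euclidean distance between two colors
    return sum((a - b) ** 2 for a, b in zip(p, q))


def _assign(pixels, centroids):
    # put every pixel into the cluster of its nearest centroid
    clusters = [[] for _ in centroids]
    for p in pixels:
        nearest = min(range(len(centroids)), key=lambda i: _sq_dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def dominant_colors(pixels, k=3, n_iter=20):
    """Hypothetical helper: k-means over (r, g, b) tuples.

    Returns a list of (centroid, share_of_pixels), largest cluster first.
    """
    # farthest-first initialization (an assumption made for determinism):
    # start at the first pixel, then repeatedly add the pixel that is
    # farthest from all centroids chosen so far
    centroids = [tuple(pixels[0])]
    while len(centroids) < k:
        centroids.append(tuple(max(pixels, key=lambda p: min(_sq_dist(p, c) for c in centroids))))

    for _ in range(n_iter):
        clusters = _assign(pixels, centroids)
        # move each centroid to the mean of its members; keep it if the cluster is empty
        updated = [
            tuple(sum(channel) / len(members) for channel in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated

    clusters = _assign(pixels, centroids)
    ranked = sorted(zip(centroids, clusters), key=lambda pair: len(pair[1]), reverse=True)
    return [(centroid, len(members) / len(pixels)) for centroid, members in ranked]
```

Calling `dominant_colors(pixels, k=3)` on a poster's pixel list yields three colors and their shares; flattening these into a fixed-length vector gives a small set of numeric predictors that can be combined with the metadata features.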

You could use Spark and the [ML library](https://spark.apache.org/docs/latest/ml-features.html#word2vec) to build your model features from the data. This may be especially beneficial if you use additional data, e.g., in text form.

You also need to think about how you are going to evaluate your classifier. Which metrics or scores will you report to show how good the performance is?
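Because a movie can belong to several genres at once, the metrics have to handle multi-label predictions. The sketch below shows two common choices on binary indicator matrices (one row per movie, one column per genre): the Hamming loss, and precision, recall and F1 per genre. The function names and the toy data are hypothetical and only meant to make the definitions concrete; `sklearn.metrics` provides `hamming_loss` and `classification_report` for the same purpose.

```python
from __future__ import division


def hamming_loss_score(y_true, y_pred):
    """Fraction of (movie, genre) cells that are predicted incorrectly."""
    wrong = sum(t != p for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p))
    total = sum(len(row) for row in y_true)
    return wrong / total


def per_label_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)}, computed separately for each genre."""
    scores = {}
    for j, name in enumerate(labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        # report 0.0 when a score is undefined (no predicted or no true positives)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[name] = (precision, recall, f1)
    return scores


# toy example: 4 movies, 3 genres (made-up data)
genres = ['Action', 'Comedy', 'Drama']
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
loss = hamming_loss_score(y_true, y_pred)               # 3 wrong cells out of 12
genre_scores = per_label_scores(y_true, y_pred, genres)
```

The Hamming loss summarizes all genres in a single number, while the per-genre scores reveal which genres a model handles poorly, which is useful when the genres are strongly imbalanced.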

The notebook to submit this week should at least include:

- Detailed description and implementation of two different models
- Description of your performance metrics
- Careful performance evaluations for both models
- Visualizations of the metrics for performance evaluation
- Discussion of the differences between the models, their strengths, weaknesses, etc. 
- Discussion of the performances you achieved, and how you might be able to improve them in the future

#### Preliminary Peer Assessment

It is important to provide positive feedback to people who truly worked hard for the good of the team and to also make suggestions to those you perceived not to be working as effectively on team tasks. We ask you to provide an honest assessment of the contributions of the members of your team, including yourself. The feedback you provide should reflect your judgment of each team member’s:

- Preparation – were they prepared during team meetings?
- Contribution – did they contribute productively to the team discussion and work?
- Respect for others’ ideas – did they encourage others to contribute their ideas?
- Flexibility – were they flexible when disagreements occurred?

Your teammates’ assessments of your contributions and the accuracy of your self-assessment will be considered as part of your overall project score.

Preliminary Peer Assessment: [https://goo.gl/forms/WOYC7pwRCSU0yV3l1](https://goo.gl/forms/WOYC7pwRCSU0yV3l1)

In [1]:
import cPickle

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import make_scorer, hamming_loss, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

In [2]:
def load_part(file_name):
    with open(file_name, 'rb') as handle:
        return cPickle.load(handle)

In [3]:
def get_sample(text_features, y):
    n_elements = len(text_features)

    np.random.seed(109)
    sample_indices = np.random.choice(n_elements, size=10000, replace=False)

    text_features_sample = text_features[sample_indices]
    y_sample = y[sample_indices]

    return text_features_sample, y_sample

def cutoff_labels(labels, cutoff):
    mlb = MultiLabelBinarizer()
    label_df = pd.DataFrame(mlb.fit_transform(labels))
    label_df.columns = mlb.classes_
    label_number_df = pd.DataFrame({'cnt': label_df.sum(axis=0)})
    major_genres = set(label_number_df[label_number_df['cnt'] > cutoff].index)
    return major_genres

In [4]:
root_folder = '..'

In [5]:
# load TMDB movies dataset
tmdb_movies = load_part(root_folder + '/data/tmdb_info.pickle')

In [6]:
def prepare_text_data(tmdb_dict):
    plot_dict = load_part(root_folder + '/data/plot.pickle')

    # add 'overview' from TMDB to 'plot' from IMDB (it is a list)
    for tmdb_id, imdb_movie in plot_dict.iteritems():
        if ('plot' in imdb_movie and tmdb_id in tmdb_dict and 'overview' in tmdb_dict[tmdb_id].__dict__ and
                    tmdb_dict[tmdb_id].__dict__['overview'] is not None):
            imdb_movie['plot'].append(tmdb_dict[tmdb_id].__dict__['overview'])

    labels = np.array([d['genres'] for d in plot_dict.values() if 'genres' in d and 'plot' in d])
    # keep only genres that appear in more than 2000 movies
    major_genres = cutoff_labels(labels, 2000)
    labels = np.array(
        [major_genres.intersection(d['genres']) for d in plot_dict.values() if 'genres' in d and 'plot' in d])    
    # create the labels vector with only major genres
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(labels)
    # the plot consists of a few parts; join them with a space so that words at part boundaries do not fuse
    features = np.array([' '.join(d['plot']) for d in plot_dict.values() if 'genres' in d and 'plot' in d])
    features_sample, y_sample = get_sample(features, y)
    
    vectorizer = TfidfVectorizer(
        stop_words=stopwords.words("english"),
        token_pattern='[a-zA-Z]+[0-9]*',
        max_df=0.9,
        min_df=0.0001,
        dtype=np.float32,
    )
    return (features_sample, y_sample, mlb.classes_, vectorizer)

In [7]:
def prepare_cast_data(tmdb_dict):
    columns = [
        'director',
        'cast',
        'casting director',
        'miscellaneous crew',
        'original music',
        'producer',
        'cinematographer',
        'costume designer',
        'art direction']
    
    cast_dict = load_part(root_folder + '/data/cast10K.pickle')
    
    # array of list of genres for every movie
    labels = np.array([d['genres'] for d in cast_dict.values() if 'genres' in d])
    # keep only genres that appear in more than 100 movies
    major_genres = cutoff_labels(labels, 100)
    labels = np.array(
        [major_genres.intersection(d['genres']) for d in cast_dict.values() if 'genres' in d])    
    # create the labels vector with only major genres
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(labels)

    # combine all names separated by '|'
    features = []
    for tmdb_id, imdb_movie in cast_dict.iteritems():
        if 'genres' not in imdb_movie:
            continue
        names = []
        for column in columns:
            if column in imdb_movie:
                # use a separate loop variable so the column name is not shadowed
                names += [person['name'].encode('utf-8') for person in imdb_movie[column]]
        # add crew and cast from TMDB
        if tmdb_id in tmdb_dict:
            tmdb_movie = tmdb_dict[tmdb_id].__dict__
            if 'crew' in tmdb_movie:
                names += [person['name'].encode('utf-8') for person in tmdb_movie['crew']]
            if 'cast' in tmdb_movie:
                names += [person['name'].encode('utf-8') for person in tmdb_movie['cast']]
        # remove duplicates before joining
        features.append('|'.join(set(names)))
        
    vectorizer = CountVectorizer(
        max_df=0.99,
        min_df=0.0002,
        stop_words=stopwords.words("english"),
        tokenizer=lambda x: x.split('|'),  # one token per full name
        dtype=np.float32,
    )
        
    return features, y, mlb.classes_, vectorizer

In [8]:
# get labels / features from the cast / crew data
features, y, mlb_classes, vectorizer = prepare_cast_data(tmdb_movies)

In [9]:
# get labels / features from the text data
#features, y, mlb_classes, vectorizer = prepare_text_data(tmdb_movies)

In [10]:
# split into test / train data
F_train, F_test, y_train, y_test = train_test_split(features, y, test_size=0.25, random_state=42)
# convert into bag of words
X_train = vectorizer.fit_transform(F_train)
X_test = vectorizer.transform(F_test)
print 'Train label matrix shape:', y_train.shape
print 'Train predictor matrix shape:', X_train.shape
print 'Test label matrix shape:', y_test.shape
print 'Test predictor matrix shape:', X_test.shape

Train label matrix shape: (7286L, 22L)
Train predictor matrix shape: (7286, 39143)
Test label matrix shape: (2429L, 22L)
Test predictor matrix shape: (2429, 39143)


In [None]:
def sgd(X_test, X_train, y_test, y_train, mlb_classes):
    param_grid = {
        'estimator__alpha': np.logspace(-5, -3, num=30),
    }
    model = OneVsRestClassifier(SGDClassifier(random_state=761, class_weight='balanced'))
    model_tuning = GridSearchCV(
        model,
        param_grid=param_grid,
        scoring=make_scorer(hamming_loss, greater_is_better=False),
        n_jobs=-1,
        verbose=1,
    )
    model_tuning.fit(X_train, y_train)
    print model_tuning.best_params_
    print classification_report(y_train, model_tuning.predict(X_train), target_names=mlb_classes)
    print classification_report(y_test, model_tuning.predict(X_test), target_names=mlb_classes)

    
sgd(X_test, X_train, y_test, y_train, mlb_classes)

Fitting 3 folds for each of 30 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   42.5s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  1.8min finished


{'estimator__alpha': 0.001}
             precision    recall  f1-score   support

     Action       0.95      0.95      0.95       797
  Adventure       0.93      0.94      0.93       524
  Animation       0.96      0.85      0.90       471
  Biography       0.76      0.91      0.83       255
     Comedy       0.99      0.88      0.93      2023
      Crime       0.97      0.96      0.96       656
Documentary       0.52      1.00      0.68       880
      Drama       1.00      0.87      0.93      3006
     Family       0.94      0.95      0.94       499
    Fantasy       0.84      0.93      0.88       340
    History       0.72      0.91      0.81       211
     Horror       0.92      0.92      0.92       632
      Music       0.86      0.90      0.88       305
    Musical       0.68      0.96      0.79       218
    Mystery       0.87      0.94      0.90       338
    Romance       0.96      0.93      0.95       915
     Sci-Fi       0.85      0.94      0.90       325
      Short      

In [None]:
def random_forest(X_test, X_train, y_test, y_train, mlb_classes):
    param_grid = {
        'min_samples_leaf': (1, 2, 50),
        # note: for classifiers 'auto' means sqrt(n_features), so it duplicates 'sqrt'
        'max_features': ('auto', 'sqrt', 'log2', 0.2),
    }
    model = RandomForestClassifier(n_estimators=50, random_state=761, class_weight='balanced')
    model_tuning = GridSearchCV(
        model,
        param_grid=param_grid,
        scoring=make_scorer(hamming_loss, greater_is_better=False),
        cv=3,
        n_jobs=-1,
        verbose=3,
    )
    model_tuning.fit(X_train, y_train)
    print model_tuning.best_params_
    print classification_report(y_train, model_tuning.predict(X_train), target_names=mlb_classes)
    print classification_report(y_test, model_tuning.predict(X_test), target_names=mlb_classes)

    
random_forest(X_test, X_train, y_test, y_train, mlb_classes)

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   56.2s
