### Milestone 3: Traditional statistical and machine learning methods, due Wednesday, April 19, 2017

Think about how you would address the genre prediction problem with traditional statistical or machine learning methods. This includes everything you learned about modeling in this course before the deep learning part. Implement your ideas and compare different classifiers. Report your results and discuss the challenges you faced and how you overcame them. What works and what does not? If some parts do not work as expected, briefly discuss what you think the cause is and how you would address it if you had more time and resources.

You do not necessarily need to use the movie posters for this step, but even without a background in computer vision, there are very simple features you can extract from the posters to help guide a traditional machine learning model. Think about the PCA lecture, for example, or how to use clustering to extract color information. In addition to the movie posters, it is worth looking at the metadata that IMDb provides.
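
One simple way to turn a poster into color features is to cluster its pixels and keep the cluster centers. The following is a minimal sketch with scikit-learn's `KMeans`, assuming the poster is already loaded as an RGB array. A synthetic image stands in for a real poster here, and `dominant_colors` is a hypothetical helper name, not part of any library.

```python
import numpy as np
from sklearn.cluster import KMeans


def dominant_colors(image, n_colors=3, random_state=0):
    """Return (colors, shares): RGB cluster centers and the pixel share of each.

    `image` is assumed to be a (height, width, 3) uint8 RGB array.
    """
    pixels = image.reshape(-1, 3).astype(np.float64)
    kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=random_state)
    labels = kmeans.fit_predict(pixels)

    # Share of pixels assigned to each cluster, largest cluster first.
    counts = np.bincount(labels, minlength=n_colors)
    order = np.argsort(counts)[::-1]
    colors = kmeans.cluster_centers_[order]
    shares = counts[order] / float(len(pixels))
    return colors, shares


# Synthetic stand-in for a poster: 75% red, 25% blue.
poster = np.zeros((40, 40, 3), dtype=np.uint8)
poster[:, :30] = [255, 0, 0]
poster[:, 30:] = [0, 0, 255]

colors, shares = dominant_colors(poster, n_colors=2)

# Flatten into one feature vector: scaled RGB values followed by their shares.
color_features = np.concatenate([colors.ravel() / 255.0, shares])
```

Sorting the clusters by size gives the feature vector a stable column order across posters, which matters if it is fed into a linear model.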

You could use Spark and the [ML library](https://spark.apache.org/docs/latest/ml-features.html#word2vec) to build your model features from the data. This may be especially beneficial if you use additional data, e.g., in text form.

You also need to think about how you are going to evaluate your classifier. Which metrics or scores will you report to show how good the performance is?
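
As one possible starting point for choosing metrics, the sketch below computes per-genre, macro-averaged and micro-averaged F1 plus the Hamming loss for a multi-label problem with scikit-learn. The indicator matrices and genre names are made up for illustration, and `zero_division=0` assumes a reasonably recent scikit-learn version.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Hypothetical indicator matrices: rows are movies, columns are genres.
genre_names = ["Drama", "Comedy", "Western"]
y_true = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
])
y_pred = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
])

# One F1 value per genre column.
per_genre_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
# Macro: unweighted mean of the per-genre values.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
# Micro: pool true/false positives and negatives over all genres first.
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
# Fraction of individual movie/genre decisions that are wrong.
label_error_rate = hamming_loss(y_true, y_pred)

for name, score in zip(genre_names, per_genre_f1):
    print("%-8s f1: %.3f" % (name, score))
print("macro f1: %.3f, micro f1: %.3f, hamming loss: %.3f"
      % (macro_f1, micro_f1, label_error_rate))
```

Macro F1 weights every genre equally, so a rare genre that is never predicted pulls it down sharply. Micro F1 pools all decisions and is dominated by the frequent genres. Reporting both makes that gap visible.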

The notebook to submit this week should at least include:

- Detailed description and implementation of two different models
- Description of your performance metrics
- Careful performance evaluations for both models
- Visualizations of the metrics for performance evaluation
- Discussion of the differences between the models, their strengths, weaknesses, etc. 
- Discussion of the performances you achieved, and how you might be able to improve them in the future
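
A trivial baseline helps put the numbers from a performance evaluation into context. The sketch below compares a `DummyClassifier` with an `SGDClassifier` on synthetic, imbalanced data that stands in for a single rare genre. It does not use the course dataset, and the sample sizes and parameters are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one rare genre: roughly 10% positives.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=109)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=109)

models = {
    "most_frequent": DummyClassifier(strategy="most_frequent"),
    "sgd_balanced": SGDClassifier(class_weight="balanced", random_state=109),
}

results = {}
for name, model in sorted(models.items()):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # zero_division=0 avoids a warning when no positives are predicted.
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }
    print("%-14s accuracy: %.3f  f1: %.3f"
          % (name, results[name]["accuracy"], results[name]["f1"]))
```

The baseline never predicts the rare class, so it scores a high accuracy and an F1 of zero. That contrast is one reason to report F1 (or precision and recall) next to accuracy for imbalanced genres.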

#### Preliminary Peer Assessment

It is important to provide positive feedback to people who truly worked hard for the good of the team and to also make suggestions to those you perceived not to be working as effectively on team tasks. We ask you to provide an honest assessment of the contributions of the members of your team, including yourself. The feedback you provide should reflect your judgment of each team member’s:

- Preparation – were they prepared during team meetings?
- Contribution – did they contribute productively to the team discussion and work?
- Respect for others’ ideas – did they encourage others to contribute their ideas?
- Flexibility – were they flexible when disagreements occurred?

Your teammates’ assessments of your contributions and the accuracy of your self-assessment will be considered as part of your overall project score.

Preliminary Peer Assessment: [https://goo.gl/forms/WOYC7pwRCSU0yV3l1](https://goo.gl/forms/WOYC7pwRCSU0yV3l1)

In [5]:
import cPickle
import time
from collections import defaultdict, Counter

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score, accuracy_score, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler


def load_movie_df():
    start = time.time()
    with open(r"../data/tmdb_df_5k.pickle", "rb") as input_file:
        movie_df = cPickle.load(input_file)
    elapsed = time.time() - start
    print "load: %.1f secs" % elapsed
    return movie_df


movie_df = load_movie_df()

load: 2.1 secs
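
Genre prediction is a multi-label problem here, since each movie carries a list of genres. One common way to encode such lists as 0/1 indicator columns is scikit-learn's `MultiLabelBinarizer`. The sketch below uses a made-up toy frame and assumes, as the code in this notebook does, that each genre entry is a dict with a `'name'` key.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows: each movie has a list of genre dicts (possibly empty).
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [
        [{"name": "Drama"}],
        [{"name": "Comedy"}, {"name": "Drama"}],
        [],
    ],
})

# Reduce each list of dicts to a list of genre names.
genre_names = toy_df["genres"].apply(lambda genres: [g["name"] for g in genres])

# fit_transform returns one 0/1 column per distinct genre, sorted by name.
binarizer = MultiLabelBinarizer()
indicators = binarizer.fit_transform(genre_names)
genre_df = pd.DataFrame(
    indicators,
    columns=["genre_" + name for name in binarizer.classes_],
    index=toy_df.index,
)

# Replace the list-valued column with the indicator columns.
encoded_df = toy_df.drop("genres", axis=1).join(genre_df)
print(encoded_df)
```

The fitted `binarizer` keeps the genre order in `classes_`, which is handy later for labelling per-genre scores.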


In [7]:
def get_reduced_movie_df(movie_df):
    movie_attribute_name_list = [
        'popularity',
        'revenue',
        'budget',
        'vote_count',
        'vote_average',
        'cast',
        'crew',

        'genres',
    ]
    return movie_df[movie_attribute_name_list]


def prepare_genre_columns(movie_df):
    num_movies = len(movie_df)
    genre_df_dict = defaultdict(lambda: np.zeros((num_movies,), dtype=np.uint8))

    for i, genre_list in enumerate(movie_df['genres']):
        for genre in genre_list:
            genre_name = genre['name']
            genre_df_dict['genre_' + genre_name][i] = 1

    new_movie_df = movie_df.drop("genres", axis=1)

    for key, column in genre_df_dict.iteritems():
        new_movie_df[key] = column

    return new_movie_df


def prepare_cast(movie_df, attribute_name):
    # attribute_name is the list-valued column to expand ('cast' or 'crew').
    # Each person with enough appearances becomes one 0/1 indicator column.
    cast_list = [cast_member['name'] for movie_cast_list in movie_df[attribute_name]
                 for cast_member in movie_cast_list]
    cast_counter = Counter(cast_list)

    # Keep only people who appear in at least this many movies.
    appearances_limit = 3
    included_cast_list = [cast_name for cast_name, num_movies in cast_counter.iteritems()
                          if num_movies >= appearances_limit]
    included_cast_set = set(included_cast_list)

    num_movies = len(movie_df)
    print num_movies
    movie_attribute_dict = defaultdict(lambda: np.zeros((num_movies,), dtype=np.uint8))

    for i, movie_cast_list in enumerate(movie_df[attribute_name]):
        for cast_member in movie_cast_list:
            cast_name = cast_member['name']
            if cast_name in included_cast_set:
                movie_attribute_dict[attribute_name + '_' + cast_name][i] = 1

    new_movie_df = movie_df.drop(attribute_name, axis=1)

    for key, column in movie_attribute_dict.iteritems():
        new_movie_df[key] = column

    print new_movie_df.shape
    return new_movie_df


def apply_pca(X_train, X_test):
    # Standardize with statistics from the training split only, then project
    # both splits onto the leading principal components.
    print "before scaling:"
    standard_scaler = StandardScaler()
    X_train_scaled = standard_scaler.fit_transform(X_train)
    X_test_scaled = standard_scaler.transform(X_test)

    print "before pca"
    pca = PCA()
    pca.fit(X_train_scaled)

    # argmin over a boolean array returns the index of the first False, i.e. the
    # number of leading components whose cumulative variance ratio is <= 0.9.
    cutoff_index = np.argmin(np.cumsum(pca.explained_variance_ratio_) <= 0.9)

    pca_X_train = pca.transform(X_train_scaled)
    pca_X_test = pca.transform(X_test_scaled)

    pca_X_train = pca_X_train[:, :cutoff_index]
    pca_X_test = pca_X_test[:, :cutoff_index]

    print pca_X_train.shape
    print pca_X_test.shape

    return pca_X_train, pca_X_test


def prepare_movie_df(movie_df):
    reduced_movie_df = get_reduced_movie_df(movie_df)
    reduced_movie_df = prepare_genre_columns(reduced_movie_df)
    reduced_movie_df = prepare_cast(reduced_movie_df, 'cast')
    reduced_movie_df = prepare_cast(reduced_movie_df, 'crew')

    return reduced_movie_df


def default_score(classifier, X, y):
    y_pred = classifier.predict(X)
    return accuracy_score(y, y_pred)


def f1_score_f(classifier, X, y):
    y_pred = classifier.predict(X)
    return f1_score(y, y_pred)


def run_model_with_y_matrix(train_df, test_df, classifier, score_f):
    X_columns = [column for column in train_df.columns if not column.startswith('genre_')]
    y_columns = [column for column in train_df.columns if column.startswith('genre_')]

    X_train = train_df[X_columns]
    X_test = test_df[X_columns]

    X_train, X_test = apply_pca(X_train, X_test)

    y_train = train_df[y_columns]
    y_test = test_df[y_columns]

    classifier.fit(X_train, y_train)

    train_score = score_f(classifier, X_train, y_train)
    test_score = score_f(classifier, X_test, y_test)

    # print classifier.best_params_
    print "train: %.3f, test: %.3f" % (train_score, test_score)


def run_model(train_df, test_df, classifier, score_f):
    X_columns = [column for column in train_df.columns if not column.startswith('genre_')]
    y_columns = [column for column in train_df.columns if column.startswith('genre_')]

    X_train = train_df[X_columns]
    X_test = test_df[X_columns]

    X_train, X_test = apply_pca(X_train, X_test)

    y_train = train_df[y_columns]
    y_test = test_df[y_columns]

    print 'y_train shape:'
    print y_train.shape
    print y_test.shape

    train_score_list = []
    test_score_list = []

    print classifier

    y_train_pred_list = []
    y_test_pred_list = []

    for y_column in y_columns:
        y_train_col = train_df[y_column]
        y_test_col = test_df[y_column]

        train_score, test_score, y_train_pred, y_test_pred = run_model_one_y(y_column, X_train, X_test, y_train_col,
                                                                             y_test_col, classifier, score_f)

        train_score_list.append(train_score)
        test_score_list.append(test_score)

        y_train_pred_list.append(y_train_pred)
        y_test_pred_list.append(y_test_pred)

    y_train_pred_result = np.array(y_train_pred_list).T
    y_test_pred_result = np.array(y_test_pred_list).T

    print classification_report(y_train, y_train_pred_result)
    print 'test'
    print classification_report(y_test, y_test_pred_result)

    print "train: %.3f, test: %.3f" % (np.mean(train_score_list), np.mean(test_score_list))


def run_model_one_y(genre, X_train, X_test, y_train, y_test, classifier, score_f):
    classifier.fit(X_train, y_train)

    train_score = score_f(classifier, X_train, y_train)
    test_score = score_f(classifier, X_test, y_test)

    # Only search wrappers such as GridSearchCV expose best_params_; plain
    # estimators (e.g. SVC or DummyClassifier) do not.
    if hasattr(classifier, 'best_params_'):
        print classifier.best_params_
    print "train: %.3f, test: %.3f (%s)" % (train_score, test_score, genre)

    return train_score, test_score, classifier.predict(X_train), classifier.predict(X_test)


reduced_movie_df = prepare_movie_df(movie_df)
train_df, test_df = train_test_split(reduced_movie_df, random_state=109)

# run_model(train_df, test_df, classifier=DummyClassifier(strategy="most_frequent"), score_f=default_score)

# run_model_2(train_df, test_df, classifier=DummyClassifier(strategy="most_frequent"), score_f=f1_score_f)
# run_model_2(train_df, test_df, classifier=DummyClassifier(strategy="stratified"), score_f=f1_score_f)
# run_model_2(train_df, test_df, classifier=DummyClassifier(strategy="uniform"), score_f=f1_score_f)

# run_model_2(train_df, test_df, classifier=DummyClassifier(strategy="most_frequent"), score_f=default_score)
# run_model_2(train_df, test_df, classifier=DummyClassifier(strategy="stratified"), score_f=default_score)
# run_model_2(train_df, test_df, classifier=DummyClassifier(strategy="uniform"), score_f=default_score)

# run_model(train_df, test_df, classifier=SVC(class_weight='balanced', kernel='linear'), score_f=f1_score_f)

estimator = GridSearchCV(
    estimator=SGDClassifier(class_weight='balanced'),
    param_grid={
        'alpha': np.logspace(-6, 3, num=10),
    },
    scoring=f1_score_f,
    cv=3,
    verbose=1,
)

# estimator = SGDClassifier(class_weight='balanced', alpha=10 ** -2)

# run_model_2(train_df, test_df, classifier=estimator, score_f=default_score)
run_model(train_df, test_df, classifier=estimator, score_f=f1_score_f)

# run_model(train_df, test_df, classifier=SVC(class_weight='balanced', kernel='linear'), score_f=f1_score_f)

5000
(5000, 1492)
5000
(5000, 2553)
before scaling:
before pca
(3750L, 1058L)
(1250L, 1058L)
y_train shape:
(3750, 20)
(1250, 20)
GridSearchCV(cv=3, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight='balanced',
       epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': array([  1.00000e-06,   1.00000e-05,   1.00000e-04,   1.00000e-03,
         1.00000e-02,   1.00000e-01,   1.00000e+00,   1.00000e+01,
         1.00000e+02,   1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=<function f1_score_f at 0x000000001A147668>, verbose=1)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.4s finished


{'alpha': 10.0}
train: 0.432, test: 0.120 (genre_TV Movie)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.3s finished


{'alpha': 1.0}
train: 0.494, test: 0.177 (genre_Mystery)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.4s finished


{'alpha': 1.0}
train: 0.504, test: 0.163 (genre_Fantasy)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.4s finished


{'alpha': 1.0}
train: 0.134, test: 0.128 (genre_Family)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.4s finished


{'alpha': 1.0000000000000001e-05}
train: 0.511, test: 0.168 (genre_Horror)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.3s finished


{'alpha': 1.0}
train: 0.145, test: 0.119 (genre_Crime)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.5s finished


{'alpha': 0.001}
train: 0.298, test: 0.142 (genre_Adventure)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.8s finished


{'alpha': 0.001}
train: 0.253, test: 0.193 (genre_Music)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.5s finished


{'alpha': 1.0}
train: 0.299, test: 0.068 (genre_History)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.4s finished


{'alpha': 1.0000000000000001e-05}
train: 0.369, test: 0.179 (genre_Thriller)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.6s finished


{'alpha': 0.001}
train: 0.283, test: 0.188 (genre_Romance)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.4s finished


{'alpha': 0.10000000000000001}
train: 0.480, test: 0.098 (genre_Science Fiction)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.6s finished


{'alpha': 0.001}
train: 0.633, test: 0.462 (genre_Drama)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.4s finished


{'alpha': 100.0}
train: 0.667, test: 0.411 (genre_Western)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.6s finished


{'alpha': 0.10000000000000001}
train: 0.416, test: 0.140 (genre_Foreign)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.4s finished


{'alpha': 9.9999999999999995e-07}
train: 0.494, test: 0.377 (genre_Comedy)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.4s finished


{'alpha': 0.10000000000000001}
train: 0.579, test: 0.346 (genre_Action)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.5s finished


{'alpha': 10.0}
train: 0.120, test: 0.134 (genre_Animation)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.8s finished


{'alpha': 0.01}
train: 0.419, test: 0.380 (genre_Documentary)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
{'alpha': 1.0}
train: 0.366, test: 0.126 (genre_War)
             precision    recall  f1-score   support

          0       0.41      0.46      0.43        94
          1       0.40      0.66      0.49       129
          2       0.44      0.60      0.50       151
          3       0.07      1.00      0.13       228
          4       0.53      0.49      0.51       333
          5       0.08      1.00      0.15       243
          6       0.19      0.68      0.30       208
          7       0.15      0.94      0.25       290
          8       0.20      0.61      0.30        92
          9       0.23      0.87      0.37       359
         10       0.17      0.97      0.28       399
         11       0.38      0.66      0.48       157
         12       0.74      0.56      0.63      1367
         13       0.59      0.77      0.67        94
         14       0.33      

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.5s finished
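
The `apply_pca` helper in the cell above truncates the components by hand. scikit-learn's `PCA` can also pick the count itself when `n_components` is a fraction between 0 and 1. The sketch below compares the two rules on synthetic data (not the movie features). It suggests that the manual rule keeps one component fewer: it counts components while the cumulative ratio is still at or below 0.9, whereas `PCA(n_components=0.9)` keeps enough components to exceed 0.9.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(109)

# Synthetic stand-in for the feature matrix: 200 rows, 30 correlated columns
# generated from 5 latent factors plus a little noise.
latent = rng.normal(size=(200, 5))
X = latent.dot(rng.normal(size=(5, 30))) + 0.1 * rng.normal(size=(200, 30))
X_scaled = StandardScaler().fit_transform(X)

# Manual rule as in apply_pca: number of leading components whose
# cumulative explained variance ratio is still <= 0.9.
full_pca = PCA(svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
manual_count = int(np.argmin(cumulative <= 0.9))

# Built-in rule: smallest number of components explaining more than 0.9.
auto_pca = PCA(n_components=0.9, svd_solver="full").fit(X_scaled)
auto_count = auto_pca.n_components_

print("manual: %d components, built-in: %d components"
      % (manual_count, auto_count))
```

One edge case is worth keeping in mind: if the first component alone explained more than 90% of the variance, the manual count would be zero and the slice in `apply_pca` would return no columns at all.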
