# Classification: Predicting a movie genre

Statistical classification techniques automatically find rules that assign each data instance to one of a set of predefined categories.

Example cases for classification:
* Detecting spam messages
* Automatically labeling news articles with a topic
* Recognizing if an X-ray scan contains an anomaly
* Optical character recognition

Our goal here is to find a way to guess a movie's genre based on its budget, viewer rating, and other information about the movie.

## Classification on a computer

Let's load the dataset from the Internet and preprocess it.

In [None]:
import graphviz
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
def load_data():
    return pd.read_csv('https://github.com/hadley/ggplot2movies/blob/master/data-raw/movies.csv?raw=true')

def preprocess(movies, genre='Drama'):
    # These are the output variables
    genres = ['Action', 'Animation', 'Comedy', 'Drama', 'Documentary', 'Romance', 'Short']

    # All remaining columns are candidate predictors (only a subset is used below)
    input_columns = [x for x in movies.columns.values if x != 'title' and x not in genres]

    # One-hot encode the MPAA rating; missing ratings get their own column
    mpaa_one_hot = pd.get_dummies(movies.mpaa, prefix='mpaa', dummy_na=True)

    non_mpaa_predictor_columns = ['year', 'length', 'budget', 'rating', 'votes']
    X = pd.concat([movies[non_mpaa_predictor_columns], mpaa_one_hot], axis='columns')

    # Replace missing values with the column mean. SimpleImputer replaces the
    # Imputer class that was removed from sklearn.preprocessing.
    imputer = SimpleImputer(strategy='mean')
    X_imputed = imputer.fit_transform(X)
    X_imputed = pd.DataFrame(X_imputed, columns=X.columns)

    y = pd.Series(np.where(movies[genre], genre, 'Non-' + genre))

    X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2)
    
    return (X_train, X_test, y_train, y_test)

def draw_decision_tree(clf, feature_names):
    # Plain lists are passed because some scikit-learn releases reject NumPy arrays here
    dot_data = tree.export_graphviz(clf, out_file=None, feature_names=list(feature_names),
                                    class_names=list(clf.classes_), filled=True, rounded=True,
                                    proportion=False)
    return graphviz.Source(dot_data)

def evaluate_classification_results(clf, X_test, y_test):
    y_pred = clf.predict(X_test)

    print('Accuracy: {:.2f}'.format(
        accuracy_score(y_test, y_pred)))
    print()

    C = confusion_matrix(y_test, y_pred)
    cm_row_labels = ['True ' + x for x in clf.classes_]
    cm_column_labels = ['Predicted ' + x for x in clf.classes_]
    print(pd.DataFrame(C, index=cm_row_labels, columns=cm_column_labels))

In [None]:
raw_data = load_data()

In [None]:
X_train, X_test, y_train, y_test = preprocess(raw_data)
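
To make the preprocessing step more concrete, here is a minimal pure-Python sketch of the two transformations that `preprocess` applies: one-hot encoding of a categorical column and mean imputation of missing values. The helper names `one_hot` and `impute_mean` and the toy values are illustrative assumptions; they are not part of the notebook or of any library.

```python
def one_hot(values, prefix):
    """Turn a list of category labels into one 0/1 column per category.

    Missing labels (None) get their own column, similar in spirit to
    pd.get_dummies(..., dummy_na=True).
    """
    categories = sorted({v for v in values if v is not None})
    columns = {f"{prefix}_{c}": [int(v == c) for v in values] for c in categories}
    columns[f"{prefix}_nan"] = [int(v is None) for v in values]
    return columns


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Toy data, made up for illustration
mpaa = ["PG", None, "R", "PG"]
budget = [1000.0, None, 3000.0, None]

encoded = one_hot(mpaa, prefix="mpaa")
filled = impute_mean(budget)
print(encoded)
print(filled)
```

After these two steps every column is numeric and has no gaps, which is what the decision tree classifier expects as input.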

Here are a few sample rows from the dataset. `X_train` contains the input features and `y_train` the corresponding class labels.

In [None]:
X_train.head()

In [None]:
y_train.head()
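
Before training anything, it can help to know how often each label occurs, because a classifier that always predicts the most common label already reaches that share as its accuracy. The sketch below computes this majority-class baseline in plain Python; `majority_baseline` and the toy labels are hypothetical and only serve as an illustration.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels, made up for illustration
toy_labels = ["Non-Drama", "Drama", "Non-Drama", "Non-Drama", "Drama"]
baseline_label, baseline_accuracy = majority_baseline(toy_labels)
print(baseline_label, baseline_accuracy)
```

In the notebook itself, `y_train.value_counts(normalize=True)` should give the same kind of information for the real labels.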

Next, we fit a single-layer decision tree classifier (`max_depth=1`) and draw the learned model to see what it looks like.

In [None]:
clf = DecisionTreeClassifier(max_depth=1)
clf.fit(X_train, y_train)

draw_decision_tree(clf, X_train.columns.values)
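
A tree with a single layer makes one decision: it picks one feature and one threshold and splits the data in two. The following is a simplified sketch of how such a split could be chosen for a single numeric feature, using Gini impurity, which is the default criterion of `DecisionTreeClassifier`. The functions `gini` and `best_stump` and the toy data are assumptions made for illustration; scikit-learn's actual implementation is considerably more elaborate.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_stump(xs, ys):
    """Find the threshold on one numeric feature that minimizes the
    size-weighted Gini impurity of the two resulting groups."""
    best_threshold, best_impurity = None, float("inf")
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Toy data, made up for illustration: movie lengths in minutes and their labels
lengths = [5, 8, 12, 90, 95, 110]
labels = ["Non-Drama", "Non-Drama", "Non-Drama", "Drama", "Drama", "Non-Drama"]
threshold, impurity = best_stump(lengths, labels)
print(threshold, impurity)
```

A real tree repeats this search over every feature and keeps the best feature and threshold pair; deeper trees apply the same search again inside each resulting group.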

We can also compute evaluation statistics on the held-out test data:

In [None]:
evaluate_classification_results(clf, X_test, y_test)
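
To show what the two statistics printed above mean, here is a small hand-rolled version of accuracy and the confusion matrix. The function names and toy predictions are hypothetical; in the notebook the values come from `accuracy_score` and `confusion_matrix`.

```python
def confusion_counts(y_true, y_pred, classes):
    """Count predictions; rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Share of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels and predictions, made up for illustration
classes = ["Drama", "Non-Drama"]
y_true = ["Drama", "Drama", "Non-Drama", "Non-Drama", "Non-Drama"]
y_pred = ["Drama", "Non-Drama", "Non-Drama", "Non-Drama", "Drama"]

matrix = confusion_counts(y_true, y_pred, classes)
acc = accuracy(y_true, y_pred)
print(matrix)
print(acc)
```

The diagonal of the matrix holds the correct predictions, so accuracy is the sum of the diagonal divided by the total number of instances.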

Because that wasn't a very good classifier, we increase the number of layers in the decision tree and try again.

In [None]:
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)

draw_decision_tree(clf, X_train.columns.values)

In [None]:
evaluate_classification_results(clf, X_test, y_test)
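
Accuracy alone can be hard to interpret when one class is much more common than the other. Per-class precision and recall, which the imported `classification_report` also reports, can be read off the confusion matrix. The sketch below assumes a matrix laid out like the one printed above (rows are true classes, columns are predicted classes); `precision_recall` and the counts are made up for illustration.

```python
def precision_recall(matrix, class_index):
    """Precision and recall for one class.

    matrix[i][j] counts instances of true class i predicted as class j.
    """
    true_positive = matrix[class_index][class_index]
    predicted = sum(row[class_index] for row in matrix)  # column total
    actual = sum(matrix[class_index])                    # row total
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


# Toy confusion matrix, made up for illustration:
# row 0 = true Drama, row 1 = true Non-Drama
toy_matrix = [[30, 20],
              [10, 40]]
drama_precision, drama_recall = precision_recall(toy_matrix, 0)
print(drama_precision, drama_recall)
```

Precision answers how many of the movies predicted as Drama really are dramas, while recall answers how many of the actual dramas the classifier found.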