In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

np.random.seed(42)

## Train-Validation Split

To estimate how well our models will generalize, we split the data into a training set (`train_data`) and a validation set (`val_data`). We fit the TF-IDF vectorizer _only_ on the training set, and then apply the fitted transform to the validation data.

We hold out approximately 20% of the dataset for validation. Further, we use stratified sampling, which ensures that both splits have roughly the same proportion of genres (the output variable) as the original dataset, so that the split does not skew the class distribution further.
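
To illustrate what stratification buys us, here is a minimal sketch on a toy, imbalanced dataset. The genre names, plot strings, and `toy_*` variables are hypothetical stand-ins, not the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced stand-in for the real data (hypothetical genres and plots).
toy = pd.DataFrame({
    "Genre": ["drama"] * 60 + ["comedy"] * 30 + ["horror"] * 10,
    "Plot": [f"plot {i}" for i in range(100)],
})

toy_train, toy_val = train_test_split(
    toy, test_size=0.2, stratify=toy.Genre, random_state=0
)

# Class proportions in the full data and in each split.
proportions = pd.DataFrame({
    "all": toy.Genre.value_counts(normalize=True),
    "train": toy_train.Genre.value_counts(normalize=True),
    "val": toy_val.Genre.value_counts(normalize=True),
})
print(proportions)
```

Without `stratify`, a rare class such as the toy `horror` genre could end up over- or under-represented in the validation split purely by chance.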

In [2]:
all_data = pd.read_csv('movie-plots-student.csv')[['Genre', 'Plot']]
train_data, val_data = train_test_split(all_data, test_size=.2, stratify=all_data.Genre)

## Feature generation

We generate features with a TF-IDF vectorizer, the simplest approach, and try to extract as much predictive power as possible from this featurization.

In [3]:
vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, stop_words='english')
train_X = vectorizer.fit_transform(train_data.Plot.values.tolist())
train_y = train_data.Genre.values.tolist()

val_X = vectorizer.transform(val_data.Plot.values.tolist())
val_y = val_data.Genre.values.tolist()
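
The following sketch shows why fitting on the training set only matters: words that appear only in validation text are simply ignored, and both matrices share the same columns. The two miniature corpora and the `toy_*` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpora standing in for the plot summaries.
toy_train_plots = ["a detective hunts a killer", "two friends fall in love"]
toy_val_plots = ["a detective falls in love with an alien"]

toy_vectorizer = TfidfVectorizer(
    strip_accents="unicode", lowercase=True, stop_words="english"
)
# fit_transform learns the vocabulary and IDF weights from the training text.
toy_train_X = toy_vectorizer.fit_transform(toy_train_plots)
# transform reuses them and learns nothing from the validation text.
toy_val_X = toy_vectorizer.transform(toy_val_plots)

vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))
print("alien" in vocab)                            # unseen word is not in the vocabulary
print(toy_train_X.shape[1] == toy_val_X.shape[1])  # same feature columns
```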

## Classification Models

For each of the following models, we report the training accuracy on `train_data`, the validation accuracy on `val_data`, the macro F-1 score on `val_data`, and the confusion matrix. The confusion matrix gives a sense of the classes where the classifier fails: each row represents the true label and each column represents the predicted label. That is, for a $K$-way classification task with confusion matrix $C \in \mathbb{Z}^{K\times K}$, the entry $C[i,j]$ is the total number of inputs of class $i$ predicted as class $j$.
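
As a small worked example of these metrics, the sketch below builds a confusion matrix for six hypothetical labels and recomputes the macro F-1 score from it by hand. The labels and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical true and predicted labels for six inputs.
y_true = ["comedy", "comedy", "drama", "drama", "drama", "horror"]
y_pred = ["comedy", "drama", "drama", "drama", "horror", "horror"]
labels = ["comedy", "drama", "horror"]

# C[i, j] counts inputs of true class i predicted as class j.
C = confusion_matrix(y_true, y_pred, labels=labels)
print(C)

# Per-class precision and recall read off the columns and rows of C.
diag = np.diag(C)
precision = diag / C.sum(axis=0)  # correct / everything predicted as the class
recall = diag / C.sum(axis=1)     # correct / everything truly in the class
per_class_f1 = 2 * precision * recall / (precision + recall)

# Macro F-1 is the unweighted mean of the per-class F-1 scores.
macro_f1 = per_class_f1.mean()
print(macro_f1, f1_score(y_true, y_pred, average="macro"))
```

Because the macro average weights every class equally, a classifier that ignores a rare genre is penalized more heavily than accuracy alone would suggest.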

### Random Forests

In [None]:
model = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=100)
model.fit(train_X, train_y)
print(f'Training Accuracy: {model.score(train_X, train_y)}')
print(f'Validation Accuracy: {model.score(val_X, val_y)}')
print(f'Macro F-1 Score: {f1_score(val_y, model.predict(val_X), average="macro")}')
print(f'Confusion Matrix:\n{confusion_matrix(val_y, model.predict(val_X))}')

We first try a Random Forest classifier with a maximum tree depth of 100 using the `gini` criterion. While we are able to get high training accuracy, we see that the classifier misclassifies many inputs of class index 0 as class index 2.

### Gradient Boosting Classifier

In [None]:
model = GradientBoostingClassifier(n_estimators=100, max_depth=10, learning_rate=1.0)
model.fit(train_X, train_y)
print(f'Training Accuracy: {model.score(train_X, train_y)}')
print(f'Validation Accuracy: {model.score(val_X, val_y)}')
print(f'Macro F-1 Score: {f1_score(val_y, model.predict(val_X), average="macro")}')
print(f'Confusion Matrix:\n{confusion_matrix(val_y, model.predict(val_X))}')

The gradient boosting classifier improves upon the random forest classifier, especially in terms of the macro F-1 score, since its errors are now spread across classes instead of being concentrated in a single class. This is a consequence of how gradient boosting works: each subsequent estimator is fit explicitly to correct the errors of the ensemble built so far.
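
The sequential nature of boosting can be observed with `staged_predict`, which yields the ensemble's predictions after each boosting round. This is a sketch on synthetic data from `make_classification`, with smaller hyperparameters than the cell above so that it runs quickly; it is not a rerun of our experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic 3-class problem (an assumption, not the movie-plot data).
toy_X, toy_y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

toy_gb = GradientBoostingClassifier(
    n_estimators=50, max_depth=2, learning_rate=0.1, random_state=0
)
toy_gb.fit(toy_X, toy_y)

# Training accuracy of the partial ensemble after each boosting round.
staged_acc = [accuracy_score(toy_y, pred) for pred in toy_gb.staged_predict(toy_X)]
print(f"after 1 round: {staged_acc[0]:.3f}")
print(f"after {len(staged_acc)} rounds: {staged_acc[-1]:.3f}")
```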

### Support Vector Classifier

In [None]:
model = SVC(C=1.0, kernel='rbf')
model.fit(train_X, train_y)
print(f'Training Accuracy: {model.score(train_X, train_y)}')
print(f'Validation Accuracy: {model.score(val_X, val_y)}')
print(f'Macro F-1 Score: {f1_score(val_y, model.predict(val_X), average="macro")}')
print(f'Confusion Matrix:\n{confusion_matrix(val_y, model.predict(val_X))}')

The support vector classifier handles multiple classes by training one-vs-one classifiers (one for each pair of classes), and is costly. Nevertheless, it is able to fit the training data well, while retaining much of the validation performance of the gradient boosting classifier. While it may appear that we have overfit, different settings of the regularization parameter $C$ did not seem to improve the score.
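
The one-vs-one scheme can be made visible by asking `SVC` for its pairwise decision values via `decision_function_shape="ovo"`. The sketch below uses a synthetic 4-class problem (an assumption for illustration), for which there are $4 \cdot 3 / 2 = 6$ pairwise classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 4-class problem (an assumption, not the movie-plot data).
toy_X, toy_y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=4, random_state=0
)

toy_svc = SVC(C=1.0, kernel="rbf", decision_function_shape="ovo")
toy_svc.fit(toy_X, toy_y)

n_classes = len(toy_svc.classes_)
n_pairs = n_classes * (n_classes - 1) // 2

# One decision value per pair of classes.
pairwise_scores = toy_svc.decision_function(toy_X[:1])
print(pairwise_scores.shape, n_pairs)
```

Since the number of pairwise classifiers grows quadratically with the number of classes, this is one contributor to the training cost.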

### Nearest Neighbors Classifier

In [None]:
model = KNeighborsClassifier(n_neighbors=30)
model.fit(train_X, train_y)
print(f'Training Accuracy: {model.score(train_X, train_y)}')
print(f'Validation Accuracy: {model.score(val_X, val_y)}')
print(f'Macro F-1 Score: {f1_score(val_y, model.predict(val_X), average="macro")}')
print(f'Confusion Matrix:\n{confusion_matrix(val_y, model.predict(val_X))}')

The nearest-neighbor classifier is among the worst-performing classifiers. This is potentially due to the fact that the inputs are very high-dimensional. In a high-dimensional space, the Euclidean distance suffers from the curse of dimensionality: distances between points concentrate, so the nearest and farthest neighbors of a point are almost equally far away, and a nearest-neighbor classifier cannot distinguish significantly between the various vectors. This would explain the significantly lower training and validation accuracy.
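
The concentration of distances can be illustrated with random points. The sketch below uses dense, uniformly random vectors, which is an assumption: our TF-IDF features are sparse and normalized, so this only illustrates the general phenomenon. The helper `relative_contrast` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=500):
    """Return (farthest - nearest) / nearest distance from a random query."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# As the dimension grows, the nearest and farthest points become
# almost equally distant from the query.
for dim in (2, 100, 5000):
    print(dim, round(relative_contrast(dim), 3))
```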

### Logistic Regression

In [None]:
model = LogisticRegression(C=2.)
model.fit(train_X, train_y)
print(f'Training Accuracy: {model.score(train_X, train_y)}')
print(f'Validation Accuracy: {model.score(val_X, val_y)}')
print(f'Macro F-1 Score: {f1_score(val_y, model.predict(val_X), average="macro")}')
print(f'Confusion Matrix:\n{confusion_matrix(val_y, model.predict(val_X))}')

Logistic Regression is our best-performing model.

In [None]:
def test_model(test_data):
    '''Predict genres for unseen plots with the most recently fitted `model`.
    We assume `test_data` is an iterable of strings (e.g. a list or a pandas Series).
    '''
    return model.predict(vectorizer.transform(test_data))

In [None]:
# Load the held-out test set; the first CSV column is used as the index.
data = pd.read_csv("movie-plots-test.csv", index_col=0)
test_y = data["Genre"]
preds = test_model(data["Plot"])

In [None]:
from sklearn.metrics import classification_report

# Rows are true genres, columns are predicted genres.
confusion_matrix(test_y, preds)

In [None]:
# Per-class precision, recall and F-1 on the test set.
print(classification_report(test_y, preds))