In [30]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import nltk
import math
import pickle
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
porter=PorterStemmer()

In [31]:
def remove_proper_nouns(text):
    tagged_sentence = nltk.tag.pos_tag(text.split())
    edited_sentence = [word if tag != 'NNP' and tag != 'NNPS' else "NNP" for word,tag in tagged_sentence]
    return ' '.join(edited_sentence)

In [36]:
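def test_model(texts):
    # Collapse proper nouns to the placeholder "NNP", as in training.
    texts = [remove_proper_nouns(i) for i in texts]
    # Load the vectorizers and models persisted by the training phase.
    with open("full-vectorizers.pkl", "rb") as f:
        vectorizers = pickle.load(f)
    with open("full-models.pkl", "rb") as f:
        models = pickle.load(f)
    # Build the stopword set once; calling stopwords.words() per token is slow.
    stop_words = set(stopwords.words("english"))
    tokens=[[token for token in nltk.tokenize.word_tokenize(text) if token.isalpha()] for text in texts]
    tokens=[[token for token in doc if token not in stop_words and token!='NNP'] for doc in tokens]
    #tokens=[[porter.stem(token) for token in doc] for doc in tokens]
    # Transform the cleaned documents with every stored vectorizer.
    vec_X={}
    for name, vectorizer in vectorizers.items():
        vec_X[name]=vectorizer.transform([" ".join(doc) for doc in tokens])
    # Predict with the model that matches each representation.
    predictions={}
    for name,model in models.items():
        print(name, model)
        predictions[name]=model.predict(vec_X[name])
    # Bag of words performed best on validation, so return its predictions.
    return predictions['bow']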

In [37]:
data=pd.read_csv("movie-plots-test.csv",index_col=0)
test_y=data["Genre"]

In [None]:
from sklearn.metrics import confusion_matrix as cm
from sklearn.metrics import classification_report as cr
preds=test_model(data["Plot"])
cm(test_y,preds)

In [None]:
print(cr(test_y,preds))

### Approach

To start, I randomly split the movie plots 4:1 into training and validation sets so that training can be evaluated on held-out data.
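A minimal sketch of such a 4:1 split, assuming scikit-learn's `train_test_split` is used. The toy data, variable names, and the `random_state` seed are illustrative assumptions, not the values from the original training run.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the plot texts and genre labels (hypothetical data).
plots = [f"plot {i}" for i in range(10)]
genres = ["comedy", "horror"] * 5

# test_size=0.2 gives the 4:1 train/validation ratio. random_state is an
# arbitrary seed chosen here only to make the split reproducible.
train_X, val_X, train_y, val_y = train_test_split(
    plots, genres, test_size=0.2, random_state=0
)

print(len(train_X), len(val_X))
```

Passing `stratify=genres` would additionally keep the genre proportions equal in both splits, which may matter when some classes are rare.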

The training plots vary widely in length (the standard deviation is high), so I keep only those whose length falls within two standard deviations of the mean in order to remove outliers. This discards only a small number of examples.
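The length filter can be sketched as follows. This assumes length is measured in whitespace-separated words (the write-up does not say whether words or characters were used), and `filter_by_length` is a hypothetical helper name.

```python
import numpy as np

def filter_by_length(texts, n_std=2.0):
    """Keep texts whose word count lies within n_std standard deviations of the mean."""
    # Assumption: length is the number of whitespace-separated words.
    lengths = np.array([len(t.split()) for t in texts])
    mean, std = lengths.mean(), lengths.std()
    keep = np.abs(lengths - mean) <= n_std * std
    return [t for t, k in zip(texts, keep) if k]

# Twenty short documents and one extreme outlier (hypothetical data).
docs = ["word " * 10] * 20 + ["word " * 1000]
filtered = filter_by_length(docs)
print(len(docs), len(filtered))
```

The statistics are computed on the training split only, so the validation and test data do not influence which examples are dropped.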

The next step is to convert the labels from categorical strings (horror, comedy, etc.) to integer values. This is done with a simple dictionary, but the same map must be available at inference time, so I store it persistently.
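One way to build and persist such a map, as a sketch. The file name `label-map.pkl`, the temporary directory, and the sorted ordering of labels are assumptions made for illustration; the original notebook may have used a different file or ordering.

```python
import os
import pickle
import tempfile

genres = ["horror", "comedy", "horror", "drama"]  # hypothetical labels

# Sorting makes the label-to-integer assignment deterministic across runs.
label_to_id = {label: idx for idx, label in enumerate(sorted(set(genres)))}
encoded = [label_to_id[g] for g in genres]

# Persist the map so inference uses exactly the same encoding.
path = os.path.join(tempfile.mkdtemp(), "label-map.pkl")  # assumed file name
with open(path, "wb") as f:
    pickle.dump(label_to_id, f)

# At inference time: reload the map and invert it to decode predictions.
with open(path, "rb") as f:
    restored = pickle.load(f)
id_to_label = {idx: label for label, idx in restored.items()}
decoded = [id_to_label[i] for i in encoded]
```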

Looking at the plots, a large share of the words in them are names of people and places. These add many distinct words to the vocabulary but don't necessarily convey much additional information. In light of this, I use NLTK to identify proper nouns (POS tag NNP or NNPS) and replace each of them with a single NNP token. (I validated with and without this modification and noticed an improvement.) This could lose some information, but the goal is to build a robust classifier, so it is probably best not to learn from proper nouns, since they can lead to spurious correlations in classification.

The next step is to tokenize and filter out stopwords, as we did in the lab; again, NLTK helps here. I also filter out the NNP token at this stage because it is clearly the most common token.

I experimented with Porter stemming and lemmatization, but neither improved the results, and since we only have ~10^5 examples we can train without this step.

With the tokens in hand, we can move on to vectorization. As in the lab, I try three approaches: binary, bag of words, and TF-IDF. For this I used a sparse matrix setup similar to the lab.
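The three representations can be sketched with scikit-learn's built-in vectorizers, which return sparse matrices. This is an assumption about tooling: the lab setup may have built the matrices by hand. The dictionary key `bow` matches the inference code, while `binary` and `tfidf` are assumed names.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Already-tokenized documents joined back into strings (hypothetical data).
docs = ["ghost haunts house", "ghost ghost laughs", "couple laughs"]

vectorizers = {
    "binary": CountVectorizer(binary=True),  # 1 if the word occurs, else 0
    "bow": CountVectorizer(),                # raw word counts
    "tfidf": TfidfVectorizer(),              # counts reweighted by rarity
}

# Fit on the training documents only; reuse the fitted objects at inference.
vec_X = {name: v.fit_transform(docs) for name, v in vectorizers.items()}

ghost = vectorizers["bow"].vocabulary_["ghost"]
print(vec_X["bow"].toarray()[1, ghost], vec_X["binary"].toarray()[1, ghost])
```

Keeping the fitted vectorizers in one dictionary makes it straightforward to persist them together and to loop over them at inference time.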

For classification from these features, I tried several models: logistic regression, SVM, random forest, and Gaussian Naive Bayes (GNB). For each I experimented with a few parameter variations using the validation set. GNB obtained the highest validation accuracy/F1 score, so I chose to proceed with it.
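A comparison loop of this kind might look like the sketch below. It is illustrative only: the toy data, the two candidate models, the hyperparameters, and the macro-averaged F1 metric are all assumptions. `MultinomialNB` is used as the Naive Bayes variant here because it accepts sparse count matrices directly and is imported at the top of this notebook; it stands in for whichever variant was actually trained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 0 = horror, 1 = comedy.
train_docs = ["ghost haunts house", "killer stalks victim",
              "couple laughs wedding", "friends joke party"]
train_y = [0, 0, 1, 1]
val_docs = ["ghost stalks house", "friends laughs party"]
val_y = [0, 1]

vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_docs)
val_X = vectorizer.transform(val_docs)

# Candidate models; the settings are placeholders, not tuned values.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "mnb": MultinomialNB(),
}

scores = {}
for name, model in candidates.items():
    model.fit(train_X, train_y)
    scores[name] = f1_score(val_y, model.predict(val_X), average="macro")

best = max(scores, key=scores.get)
print(scores, best)
```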

Interestingly, the bag-of-words representation performed best: TF-IDF was unable to classify the two minority classes well, and the binary approach could not match bag of words.

I persistently store both the models and the vectorizers needed for classification so that they can be reused at inference time.

### Inference

At inference time, we accept a list of strings and labels. First we load the label dictionary from persistent storage and encode the labels accordingly. Then we replace proper nouns, tokenize, and remove stopwords and NNP tokens, as in the training phase. Next we load the persistently stored vectorizers from the training phase to convert the tokens to their vectorized representations. The models are loaded as well, and classification is performed to obtain the predictions as a simple array.
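The label-encoding step matters for evaluation: if the models predict integer ids, the string genres in the test set have to be mapped through the same dictionary before they are compared. A small sketch with a hypothetical two-class map and made-up predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical map; in practice it is loaded from persistent storage.
label_to_id = {"comedy": 0, "horror": 1}

test_labels = ["horror", "comedy", "horror"]  # string genres from the test set
preds = [1, 0, 0]                             # made-up integer predictions

# Encode the true labels with the same map used during training.
encoded_y = [label_to_id[label] for label in test_labels]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(encoded_y, preds, labels=sorted(label_to_id.values()))
print(matrix)
```

If the stored models were instead trained on string labels, this encoding step is unnecessary and the raw genre column can be compared to the predictions directly.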