### Approach

To start, I randomly split the plots into training and validation sets in a roughly 4:1 ratio, so that the validation set can be used to evaluate training.
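
The random mask used for the split in the cells below is unseeded, so the partition changes on every run and the ratio is only approximately 4:1. A minimal sketch of a reproducible alternative, assuming scikit-learn's `train_test_split`; the toy frame, the seed and the choice to stratify on `Genre` are illustrative assumptions, not what was run for the results reported here:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the plots table (hypothetical rows, same column names).
toy_df = pd.DataFrame({
    "Genre": ["drama"] * 5 + ["comedy"] * 5,
    "Plot": [f"plot {i}" for i in range(10)],
})

# test_size=0.2 gives an exact 4:1 split, random_state makes it repeatable,
# and stratify keeps the genre proportions the same in both parts.
toy_train, toy_val = train_test_split(
    toy_df, test_size=0.2, random_state=0, stratify=toy_df["Genre"]
)
print(toy_train.shape, toy_val.shape)
```

Stratifying may matter here because two of the four genres are much rarer than the others.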

The training plots have a fairly high standard deviation in length, so I keep only those whose length falls within two standard deviations of the mean in order to remove outliers. This discards only a small fraction of the data (in the run shown below, 8,546 training plots before filtering and 8,251 after).

The next step is to convert the labels from categorical strings (horror, comedy, etc.) to integer values. This is done with a simple dictionary, but the same map must be used at inference time, so I store it persistently.
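
The cells below only load the stored map, so here is a minimal sketch of how such a map could be built and saved. The helper `build_label_map`, the toy genre list and the file name are hypothetical, and the ids it assigns need not match the ones in the stored map:

```python
import os
import pickle
import tempfile

def build_label_map(labels):
    """Assign each distinct label an integer id, in order of first appearance."""
    label_map = {}
    for label in labels:
        if label not in label_map:
            label_map[label] = len(label_map)
    return label_map

genres = ["drama", "comedy", "comedy", "horror", "action", "drama"]
label_map = build_label_map(genres)

# Store the map so that inference can reuse exactly the same encoding.
# (A temporary directory is used here; the file name is hypothetical.)
map_path = os.path.join(tempfile.mkdtemp(), "label-map.pkl")
with open(map_path, "wb") as f:
    pickle.dump(label_map, f)

with open(map_path, "rb") as f:
    restored = pickle.load(f)

encoded = [restored[g] for g in genres]
print(restored, encoded)
```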

Looking at the plots, a large share of the words in them are names of people and places. These add many distinct words to the vocabulary without necessarily conveying much additional information. In light of this, I use NLTK to identify proper nouns (POS tag NNP or NNPS) and replace each of them with a single NNP token. (I validated with and without this modification and noticed an improvement.) This could lose some information, but the goal is to build a robust classifier, so it is probably better not to learn from proper nouns, which can lead to spurious correlations in classification.

The next step is to tokenize and filter out stopwords, as we did in the lab; again NLTK helps here. I also filter out the NNP token at this point, because it is clearly the most common token.

I experimented with Porter stemming and lemmatization, but neither gave an improvement, and since there are only ~10^4 examples, training without this step is feasible.
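
For reference, a minimal sketch of what the stemming variant of this step could look like, assuming NLTK's `PorterStemmer`. The helper `stem_docs` and the toy tokens are hypothetical, and this step is not part of the final pipeline:

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def stem_docs(token_docs):
    """Apply Porter stemming to every token of every tokenized document."""
    return [[porter.stem(token) for token in doc] for doc in token_docs]

toy_tokens = [["running", "murders"], ["haunted", "houses"]]
print(stem_docs(toy_tokens))
```

Stemming merges inflected forms into one feature, which shrinks the vocabulary but can also conflate words with different meanings.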

Once we have the tokens, we can move on to vectorization. As in the lab, I try three approaches: binary, bag of words and tf-idf. For this I use a sparse matrix setup similar to the one from the lab.
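
To make the difference between the three representations concrete, here is a small sketch on a toy corpus, assuming scikit-learn's `CountVectorizer` and `TfidfVectorizer` with default settings. The documents and variable names are illustrative only:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["ghost haunts ghost", "cop chases thief"]

binary_vec = CountVectorizer(binary=True)  # 1 if the word occurs, else 0
bow_vec = CountVectorizer()                # raw occurrence counts
tfidf_vec = TfidfVectorizer()              # idf-weighted counts, rows L2-normalized

binary_X = binary_vec.fit_transform(toy_docs)
bow_X = bow_vec.fit_transform(toy_docs)
tfidf_X = tfidf_vec.fit_transform(toy_docs)

# Value stored for the word "ghost" in the first document under each scheme.
ghost_binary = binary_X[0, binary_vec.vocabulary_["ghost"]]
ghost_bow = bow_X[0, bow_vec.vocabulary_["ghost"]]
ghost_tfidf = tfidf_X[0, tfidf_vec.vocabulary_["ghost"]]
print(ghost_binary, ghost_bow, ghost_tfidf)

# Each tf-idf row has unit Euclidean length by default.
row_norm = np.linalg.norm(tfidf_X[0].toarray())
print(row_norm)
```

All three calls return sparse matrices, so the full vocabulary never has to be stored densely.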

For classification from these features, I try several models: logistic regression, SVM, random forest and Naive Bayes (the multinomial and Bernoulli variants from scikit-learn, as used in the code below). Naive Bayes obtained the highest validation accuracy and F1 score, so I chose to proceed with it. For each model I experimented with a few parameter variations using the validation set.

Interestingly, the bag-of-words representation performed best: tf-idf was unable to classify the two minority classes well, and the binary approach could not match bag of words in performance.

I persistently store both the models and the vectorizers needed for classification so that they can be reused at inference time.

### Inference

At inference time, we accept a list of strings and, for evaluation, their labels. First we load the label dictionary from persistent storage and encode the labels. Then we replace proper nouns, tokenize, and remove stopwords and NNP tokens, as in the training phase. Next we load the stored vectorizers from the training phase to convert the tokens to their vectorized representations. Finally the models are loaded and classification is performed, giving the predictions as a simple array for each representation.
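
Since the predictions come back as integer ids, the stored label map can also be inverted to recover genre names. A minimal sketch, using the map as printed in the training cells; the helper `decode_predictions` is hypothetical and not part of the notebook:

```python
# Label map as printed in the training cells.
ref = {"action": 0, "drama": 1, "horror": 2, "comedy": 3}

# Invert the map once: integer id -> genre name.
id_to_genre = {idx: genre for genre, idx in ref.items()}

def decode_predictions(pred_ids):
    """Convert an iterable of integer class ids into genre names."""
    return [id_to_genre[int(i)] for i in pred_ids]

print(decode_predictions([1, 3, 2, 0]))
```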

In [271]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import nltk
import math
import pickle

In [272]:
df = pd.read_csv("movie-plots-student.csv")

In [273]:
mask = np.random.rand(len(df)) < 0.8
train = df[mask]
val = df[~mask]

In [274]:
train.shape, val.shape

((8546, 3), (2170, 3))

In [275]:
train.head()

Unnamed: 0.1,Unnamed: 0,Genre,Plot
0,0,drama,A Bill of Divorcement describes a day in the l...
1,1,comedy,Dr. Clitterhouse (Edward G. Robinson) is a wea...
2,2,comedy,"Three young couples, all having financial stru..."
3,3,comedy,Hollywood studio mogul Joe Mulholland (Matthau...
4,4,drama,In a working class South London district lives...


In [276]:
# Lengths (in characters) of the shortest and the longest plots.
# Note: " ".join(plot) on a string inserts a space between every character and
# would report 2*n - 1 instead of n, so the lengths are taken directly.
lengths = train["Plot"].str.len().to_numpy()
print("Min ", lengths.min())
print("Max ", lengths.max())
mean_len = np.mean(lengths)
std_len = np.std(lengths)
print("Mean ", mean_len, " Std ", std_len)

Min  20
Max  16517
Mean  2104.7892581324595  Std  1738.3117201057998


In [277]:
train=train[train["Plot"].apply(lambda x: len(x)<=mean_len+2*std_len)]
train=train[train["Plot"].apply(lambda x: len(x)>=mean_len-2*std_len)]
print(train.shape)

(8251, 3)


In [278]:
train_texts = train["Plot"]
# Load the stored label map (genre string -> integer id).
with open("ref.pkl", "rb") as f:
    ref = pickle.load(f)
train_labels = [ref[label] for label in train["Genre"]]
print(ref)

{'action': 0, 'drama': 1, 'horror': 2, 'comedy': 3}


In [279]:
val_texts = list(val["Plot"])
val_labels = [ref[label] for label in val["Genre"]]

In [280]:
def remove_proper_nouns(text):
    tagged_sentence = nltk.tag.pos_tag(text.split())
    edited_sentence = [word if tag != 'NNP' and tag != 'NNPS' else "NNP" for word,tag in tagged_sentence]
    return ' '.join(edited_sentence)

In [281]:
train_texts = [remove_proper_nouns(i) for i in train_texts]
val_texts = [remove_proper_nouns(i) for i in val_texts]

In [282]:
train_tokens=[[token for token in nltk.tokenize.word_tokenize(text) if token.isalpha()] for text in train_texts]

In [283]:
val_tokens=[[token for token in nltk.tokenize.word_tokenize(text) if token.isalpha()] for text in val_texts]

In [285]:
from nltk.corpus import stopwords
# Build the stopword set once; calling stopwords.words() for every token is very slow.
stop_words = set(stopwords.words("english"))
train_tokens=[[token for token in doc if token not in stop_words and token!='NNP'] for doc in train_tokens]
val_tokens=[[token for token in doc if token not in stop_words and token!='NNP'] for doc in val_tokens]

In [286]:
from nltk import WordNetLemmatizer
from nltk.stem import PorterStemmer,LancasterStemmer
porter=PorterStemmer()
lancaster=LancasterStemmer()

In [288]:
# Vectorize text in documents in three different ways:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Note: TfidfVectorizer(binary=True) sets the term frequency to 0/1 before idf
# weighting and normalization, so the 'binary' features are not strictly 0/1.
vectorizers={
    'binary':TfidfVectorizer(analyzer='word',binary=True),
    'bow':CountVectorizer(analyzer='word',binary=False),
    'tfidf':TfidfVectorizer(analyzer='word',binary=False),
}
# Join the tokens back into strings once, then reuse them for every vectorizer.
train_docs=[" ".join(doc) for doc in train_tokens]
val_docs=[" ".join(doc) for doc in val_tokens]
vec_train_X,vec_val_X={},{}
for name,vectorizer in vectorizers.items():
    vec_train_X[name]=vectorizer.fit_transform(train_docs)
    vec_val_X[name]=vectorizer.transform(val_docs)

In [289]:
# Note the type of vectorization:
print(type(vec_train_X['binary']))
print(type(vec_train_X['bow']))
print(type(vec_train_X['tfidf']))

<class 'scipy.sparse.csr.csr_matrix'>
<class 'scipy.sparse.csr.csr_matrix'>
<class 'scipy.sparse.csr.csr_matrix'>


In [290]:
# Create and fit three NB models:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
models={'binary':BernoulliNB(),'bow':MultinomialNB(),'tfidf':MultinomialNB()}
predictions={}
for name,model in models.items():
    print(name, model)
    model.fit(vec_train_X[name],train_labels)
    predictions[name]=model.predict(vec_val_X[name])

binary BernoulliNB()
bow MultinomialNB()
tfidf MultinomialNB()


In [291]:
from sklearn.metrics import classification_report
print(classification_report(val_labels, predictions['binary'], digits=4))
print(classification_report(val_labels, predictions['bow'], digits=4))
print(classification_report(val_labels, predictions['tfidf'], digits=4))

              precision    recall  f1-score   support

           0     0.6132    0.3779    0.4676       172
           1     0.6224    0.7894    0.6961      1040
           2     0.5321    0.7015    0.6052       201
           3     0.7188    0.4557    0.5578       757

    accuracy                         0.6323      2170
   macro avg     0.6216    0.5811    0.5817      2170
weighted avg     0.6469    0.6323    0.6213      2170

              precision    recall  f1-score   support

           0     0.6510    0.5640    0.6044       172
           1     0.7263    0.7808    0.7525      1040
           2     0.7639    0.8209    0.7914       201
           3     0.7234    0.6565    0.6884       757

    accuracy                         0.7240      2170
   macro avg     0.7162    0.7055    0.7092      2170
weighted avg     0.7228    0.7240    0.7220      2170

              precision    recall  f1-score   support

           0     0.0000    0.0000    0.0000       172
           1     0.51

  _warn_prf(average, modifier, msg_start, len(result))


In [292]:
import pickle
# Use context managers so the files are flushed and closed.
with open("train-vectorizers.pkl", "wb") as f:
    pickle.dump(vectorizers, f)
with open("train-models.pkl", "wb") as f:
    pickle.dump(models, f)

In [294]:
def predict(texts):
    """Return a dict of predicted label arrays, one per vectorizer/model pair."""
    texts = [remove_proper_nouns(i) for i in texts]
    with open("full-vectorizers.pkl", "rb") as f:
        vectorizers = pickle.load(f)
    with open("full-models.pkl", "rb") as f:
        models = pickle.load(f)
    stop_words = set(stopwords.words("english"))  # build once, not per token
    tokens=[[token for token in nltk.tokenize.word_tokenize(text) if token.isalpha()] for text in texts]
    # Drop stopwords and the NNP placeholder, as in the training phase.
    tokens=[[token for token in doc if token not in stop_words and token!='NNP'] for doc in tokens]
    #tokens=[[porter.stem(token) for token in doc] for doc in tokens]
    docs=[" ".join(doc) for doc in tokens]
    vec_X={}
    for name, vectorizer in vectorizers.items():
        vec_X[name]=vectorizer.transform(docs)
    predictions={}
    for name,model in models.items():
        print(name, model)
        predictions[name]=model.predict(vec_X[name])
    return predictions

In [296]:
preds = predict(val_texts)
print(classification_report(val_labels, preds['bow']))

binary BernoulliNB()
bow MultinomialNB()
tfidf MultinomialNB()
              precision    recall  f1-score   support

           0       0.71      0.59      0.64       172
           1       0.79      0.75      0.77      1040
           2       0.84      0.79      0.81       201
           3       0.70      0.79      0.74       757

    accuracy                           0.75      2170
   macro avg       0.76      0.73      0.74      2170
weighted avg       0.75      0.75      0.75      2170



In [297]:
from sklearn.linear_model import LogisticRegression as LR

In [298]:
log_model = LR(multi_class='multinomial', penalty='l2', max_iter=2000)
log_model.fit(vec_train_X['bow'], train_labels)
preds = log_model.predict(vec_val_X['bow'])
print(classification_report(val_labels, preds, digits=4))

              precision    recall  f1-score   support

           0     0.6377    0.5116    0.5677       172
           1     0.6937    0.7644    0.7274      1040
           2     0.8242    0.7463    0.7833       201
           3     0.6676    0.6209    0.6434       757

    accuracy                         0.6926      2170
   macro avg     0.7058    0.6608    0.6804      2170
weighted avg     0.6923    0.6926    0.6906      2170



In [299]:
from sklearn.svm import SVC
svm = SVC(C=5)
svm.fit(vec_train_X['bow'], train_labels)
preds = svm.predict(vec_val_X['bow'])
print(classification_report(val_labels, preds, digits=4))

              precision    recall  f1-score   support

           0     0.6667    0.3837    0.4871       172
           1     0.6761    0.8067    0.7356      1040
           2     0.8873    0.6269    0.7347       201
           3     0.6846    0.6222    0.6519       757

    accuracy                         0.6922      2170
   macro avg     0.7287    0.6099    0.6523      2170
weighted avg     0.6979    0.6922    0.6866      2170



In [300]:
from sklearn.ensemble import RandomForestClassifier as RF
rf = RF()
rf.fit(vec_train_X['bow'], train_labels)
preds = rf.predict(vec_val_X['bow'])
print(classification_report(val_labels, preds, digits=4))

              precision    recall  f1-score   support

           0     1.0000    0.1163    0.2083       172
           1     0.5961    0.9096    0.7202      1040
           2     0.8738    0.4478    0.5921       201
           3     0.7370    0.4478    0.5571       757

    accuracy                         0.6429      2170
   macro avg     0.8017    0.4804    0.5194      2170
weighted avg     0.7030    0.6429    0.6109      2170

