## Text Classification of Rotten Tomatoes Movie Reviews
Author: James Fung

In [62]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
import xgboost as xgb

from sklearn import metrics
from sklearn.metrics import classification_report

In [2]:
#Load in the data.

reviews = pd.read_csv("/Users/james.fung/Desktop/Test Scripts/Rotten Tomato Reviews/rotten_tomatoes_reviews.csv")

In [3]:
#Let's check out an example of a good and a bad review.

#Good review. Index row and column together with iloc[row, col].
print("Good review:" + reviews.iloc[0, 1])
print("")
#Bad review.
print("Bad review:" + reviews.iloc[2, 1])

Good review: Manakamana doesn't answer any questions, yet makes its point: Nepal, like the rest of our planet, is a picturesque but far from peaceable kingdom.

Bad review: It would be difficult to imagine material more wrong for Spade than Lost & Found.


From these two examples, the line between a positive and a negative review is not obvious. I do notice, however, that the bad review contains more negative words, such as "difficult" and "wrong".

Let's clean up the reviews a bit by removing stop words and punctuation with the TfidfVectorizer.
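
To make the cleanup step concrete, here is a minimal pure-Python sketch of what tokenizing, dropping stop words, and building 1- to 3-grams does to the bad review above. The `STOP_WORDS` set and the helper names `tokenize` and `ngrams` are hypothetical and exist only for this illustration; scikit-learn's built-in English stop-word list is much longer, so the real vectorizer output may differ.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"it", "would", "be", "to", "more", "for", "than", "a", "the"}

def tokenize(text):
    """Lowercase, keep runs of word characters, and drop stop words."""
    tokens = re.findall(r"\w{1,}", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n_min=1, n_max=3):
    """Join every run of n_min..n_max consecutive tokens into one feature."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

bad_review = ("It would be difficult to imagine material more wrong "
              "for Spade than Lost & Found.")
tokens = tokenize(bad_review)
features = ngrams(tokens)

print(tokens)
print(len(features), "n-gram features")
```

Because stop words are removed before the n-grams are built, words that were not adjacent in the original sentence (such as "material" and "wrong") can end up in the same bigram.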

In [4]:
#Split the data.

Y = reviews.iloc[:,0]
X = reviews.iloc[:,1]

X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=.3)

## TF-IDF and Count Vectorizers

In [5]:
#TF-IDF vectorizer. It uses the following parameters:
#min_df = 3: ignore terms that appear in fewer than 3 documents
#strip_accents = 'unicode': remove accents and apply other character normalization
#analyzer = 'word': build features from whole words
#token_pattern = r'\w{1,}': treat any run of one or more word characters as a token
#ngram_range = (1, 3): use unigrams, bigrams, and trigrams
#use_idf = True: enable inverse-document-frequency reweighting
#smooth_idf = True: add one to the document frequencies to avoid division by zero
#sublinear_tf = True: replace the term frequency tf with 1 + log(tf)
#stop_words = 'english': remove English stop words

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
tfv = TfidfVectorizer(min_df=3, max_features=None,
            strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True,
            stop_words='english')
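
As a rough illustration of what `use_idf`, `smooth_idf`, and `sublinear_tf` do to a single term, the sketch below computes the unnormalized weight by hand. The helper name `tfidf_weight` and the corpus numbers are made up for this example, and the sketch leaves out the L2 normalization that the vectorizer applies to each row by default.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """Unnormalized TF-IDF weight with smooth_idf and sublinear_tf enabled.

    tf: how often the term occurs in one document
    df: how many documents contain the term
    n_docs: total number of documents
    """
    if tf == 0:
        return 0.0
    # smooth_idf: add one to numerator and denominator, then add 1 to the log
    idf = math.log((1 + n_docs) / (1 + df)) + 1.0
    # sublinear_tf: dampen repeated occurrences within a document
    scaled_tf = 1.0 + math.log(tf)
    return scaled_tf * idf

n_docs = 1000  # hypothetical corpus size
common = tfidf_weight(tf=2, df=900, n_docs=n_docs)
rare = tfidf_weight(tf=2, df=5, n_docs=n_docs)
print(f"common term: {common:.3f}, rare term: {rare:.3f}")
```

With the same count in a document, a term that appears in few documents receives a larger weight than one that appears almost everywhere.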

In [6]:
#Fit the TF-IDF vectorizer on the training set, then transform the training and test sets.
tfv.fit(list(X_train))
X_train_tfv = tfv.transform(X_train)
X_test_tfv = tfv.transform(X_test)

### Logistic Regression

In [7]:
#Fit the model and make predictions.
clf = LogisticRegression(C=1.0)
clf.fit(X_train_tfv, y_train)
y_pred = clf.predict(X_test_tfv)



In [8]:
#What is the accuracy?
print('Accuracy:',metrics.accuracy_score(y_test,y_pred))

print(classification_report(y_test, y_pred))

Accuracy: 0.8289861111111111
              precision    recall  f1-score   support

           0       0.83      0.82      0.83     72140
           1       0.82      0.83      0.83     71860

   micro avg       0.83      0.83      0.83    144000
   macro avg       0.83      0.83      0.83    144000
weighted avg       0.83      0.83      0.83    144000
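
To make the columns of the report concrete, here is a small sketch that computes accuracy, precision, recall, and F1 for the positive class from raw labels. The function name `binary_report` and the toy labels are hypothetical and unrelated to the review data.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy plus precision, recall, and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
report = binary_report(toy_true, toy_pred)
print(report)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.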



Nearly 83% accuracy on the held-out reviews! Can we do better?

In [60]:
#What about a simpler text model, such as a simple count of words?

ctv = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english')

# Fit the count vectorizer on the training set only, then transform both the training and test sets.
ctv.fit(list(X_train))
X_train_ctv = ctv.transform(X_train)
X_test_ctv = ctv.transform(X_test)
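
The fit-then-transform pattern in the cell above learns the vocabulary from the training reviews only. The pure-Python sketch below mimics that behaviour on two toy documents; `fit_vocabulary` and `transform` are hypothetical helpers written for this illustration, not scikit-learn functions.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Map each word seen in the training documents to a column index."""
    words = sorted({w for doc in train_docs for w in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def transform(docs, vocabulary):
    """Count vocabulary words per document; unseen words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(w for w in doc.lower().split() if w in vocabulary)
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

toy_train = ["great film", "dull film"]
toy_test = ["great great plot"]

vocabulary = fit_vocabulary(toy_train)
print(vocabulary)
print(transform(toy_test, vocabulary))
```

The word "plot" never appears in the toy training set, so it gets no column and is silently dropped from the test row. This keeps information from the test set out of the features.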

In [61]:
#Refit the logistic regression on the count features, then predict.
clf.fit(X_train_ctv, y_train)
y_pred = clf.predict(X_test_ctv)

print('Accuracy:',metrics.accuracy_score(y_test,y_pred))

print(classification_report(y_test, y_pred))

Accuracy: 0.8802430555555556
              precision    recall  f1-score   support

           0       0.87      0.89      0.88     72140
           1       0.89      0.87      0.88     71860

   micro avg       0.88      0.88      0.88    144000
   macro avg       0.88      0.88      0.88    144000
weighted avg       0.88      0.88      0.88    144000



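Seems simplicity wins out here: 88% vs. 83% accuracy. The comparison is not strictly like for like, though, since the TF-IDF vectorizer dropped terms that appear in fewer than 3 reviews (min_df=3) while the count vectorizer kept every n-gram. But... can we do better?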

### XGBoost

In [None]:
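#Parameters:
#max_depth = 7: limit the depth of each tree to 7
#n_estimators = 200: number of boosted trees to fit
#colsample_bytree = 0.8: sample 80% of the features for each tree
#subsample = 0.8: sample 80% of the training rows for each tree
#learning_rate = 0.1: shrink each tree's contribution
#n_jobs = 10: number of parallel threads (replaces the deprecated nthread alias)

clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
                        subsample=0.8, n_jobs=10, learning_rate=0.1)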

clf.fit(X_train_tfv, y_train)
y_pred = clf.predict(X_test_tfv)
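
To give some intuition for how `n_estimators` and `learning_rate` interact, here is a deliberately simplified sketch of plain gradient boosting with squared error and one-split trees (stumps) on toy 1-D data. It is not XGBoost's actual algorithm, which uses a logistic loss for classification, second-order gradients, and regularization; the helpers `fit_stump`, `boost`, and `mse` and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep the right side non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_estimators, learning_rate):
    """Add stumps one at a time, each fitted to the current residuals."""
    predictions = [0.0] * len(y)
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        # learning_rate shrinks how much each new stump changes the model
        predictions = [
            p + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, p in zip(x, predictions)
        ]
    return predictions

def mse(y, predictions):
    """Mean squared error between targets and predictions."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)

# Toy data for illustration only.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [0, 0, 1, 1, 0, 1, 1, 1]

mse_few = mse(toy_y, boost(toy_x, toy_y, n_estimators=5, learning_rate=0.1))
mse_many = mse(toy_y, boost(toy_x, toy_y, n_estimators=200, learning_rate=0.1))
print(f"5 stumps: {mse_few:.4f}, 200 stumps: {mse_many:.4f}")
```

A small learning rate means each tree only nudges the model, so more trees are needed to fit the training data. The training error here keeps falling as trees are added, which is also why the tree count is usually tuned against held-out data.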

In [None]:
#What is the accuracy?
print('Accuracy:',metrics.accuracy_score(y_test,y_pred))

print(classification_report(y_test, y_pred))