# IMDB Review Sentiment Classification

This notebook demonstrates how to use Natural Language Processing (NLP) techniques to classify IMDB movie reviews as either positive or negative. It follows a similar structure to a guide by "The PyCoach" on Towards Data Science.

We are going to compare several supervised classifiers to determine which best separates positive from negative movie reviews when the text is represented with Term Frequency-Inverse Document Frequency (TF-IDF) features.

In [2]:
# Firstly, import the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import time

In [3]:
# We will import the csv file containing the reviews 
df = pd.read_csv('IMDB Dataset.csv')
# Print info to see what we are working with
# (df.info() prints its summary itself, so it needs no print() wrapper)
df.info()
df.head()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


                                              review sentiment
0  One of the other reviewers has mentioned that ...   positive
1  A wonderful little production. <br /><br />The...   positive
2  I thought this was a wonderful way to spend ti...   positive
3  Basically there's a family where a little boy ...   negative
4  Petter Mattei's "Love in the Time of Money" is...   positive


In [6]:
# Check if the sentiments are equally distributed
print('Imported CSV review counts:')
print(df.groupby(['sentiment']).count())

# Luckily the sentiments are equally distributed (25,000 reviews each).
# This matters because a model trained on an unbalanced set tends to favour the majority class,
# and plain accuracy then becomes a misleading metric.
# Processing all 25,000 reviews per sentiment would take some extra time,
# so we take a 10% subset of each sentiment and draw our train/test sets from that.

print('\nSubset review counts:')
sample_df = df.groupby('sentiment').apply(lambda x: x.sample(frac=0.1))
sample_df.reset_index(inplace=True, drop = True)
print(sample_df.groupby(['sentiment']).count())

Imported CSV review counts:
           review
sentiment        
negative    25000
positive    25000

Subset review counts:
           review
sentiment        
negative     2500
positive     2500
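The subset above is drawn without a fixed seed, so each run picks different reviews. A minimal sketch of a reproducible per-class sample is shown below. It assumes a pandas version that provides `DataFrameGroupBy.sample`; `toy_df` and the seed value are illustrative stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the IMDB frame: 10 reviews per sentiment.
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive", "negative"] * 10,
})

# Sample the same fraction from each sentiment group with a fixed seed,
# so the subset stays balanced and is identical on every run.
toy_sample = (
    toy_df.groupby("sentiment")
    .sample(frac=0.5, random_state=5)
    .reset_index(drop=True)
)
print(toy_sample["sentiment"].value_counts())
```

Passing `random_state` is the only change that affects reproducibility; the grouping itself does the same job as the `apply` call above.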


In [7]:
# Split the subset into training and test sets at an 80/20 ratio.
train, test = train_test_split(sample_df, test_size = 0.2, random_state = 5)
train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']
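A plain random split does not guarantee that each part keeps the same class balance as the subset it came from. If exact balance matters, `train_test_split` accepts a `stratify` argument. The sketch below uses hypothetical toy data and `toy_`-prefixed names rather than the notebook's variables.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 50 positive and 50 negative labels.
toy_reviews = [f"review {i}" for i in range(100)]
toy_labels = ["positive"] * 50 + ["negative"] * 50

# stratify=toy_labels keeps the 50/50 class ratio in both parts of the split.
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_reviews, toy_labels, test_size=0.2, random_state=5, stratify=toy_labels
)
print(Counter(toy_test_y))
```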

In [16]:
# While more NLP preprocessing (custom tokenisation, stemming, lemmatisation) could improve our predictions,
# let's simply start off with TF-IDF. This creates a sparse matrix of the features (words) within the reviews,
# each weighted by how frequent the word is in a review and how rare it is across reviews.
# We'll throw out the generic English stop words as well.
TfIdf = TfidfVectorizer(stop_words = 'english')
train_TfIdf = TfIdf.fit_transform(train_x)
test_TfIdf = TfIdf.transform(test_x)
print(f'Training set TF-IDF sparse matrix shape: {train_TfIdf.shape}')
print(f'Test set TF-IDF sparse matrix shape: {test_TfIdf.shape}')

Training set TF-IDF sparse matrix shape: (4000, 35311)
Test set TF-IDF sparse matrix shape: (1000, 35311)
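To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It follows the formula scikit-learn documents for `TfidfVectorizer` defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation), but it uses naive whitespace tokenisation and no stop-word removal, so it is an illustration rather than a drop-in replacement. The function `tfidf` and the toy reviews are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Term count multiplied by the smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Scale each document vector to unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


toy_reviews = ["great film great cast", "terrible film", "great fun"]
toy_weights = tfidf(toy_reviews)
for review, weights in zip(toy_reviews, toy_weights):
    print(review, "->", {t: round(w, 3) for t, w in weights.items()})
```

In the second toy review, "terrible" appears in only one document while "film" appears in two, so "terrible" receives the larger weight.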


# Model Selection

Now that we have our reviews broken down into TF-IDF weighted sparse matrices, it's time to fit models on the training set. 
To choose a model we will fit several classifiers and compare them on the held-out test set (a lighter-weight check than full k-fold cross-validation). 
While there are many options, some of the most common supervised classifiers, and the ones we will test out, are:

- Support Vector Machines (SVM)
- Decision Tree Classifiers
- Naive Bayesian Classifiers
- Logistic Regressions
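Scores from a single train/test split depend on which reviews happen to land in the test set. K-fold cross-validation averages over several splits instead. The sketch below shows one way to do that with `cross_val_score` and a pipeline, so the vectorizer is refit inside every fold. The toy corpus and all variable names are hypothetical, and logistic regression stands in for any of the models listed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: six positive and six negative snippets.
toy_texts = [
    "loved this wonderful film", "great acting and a brilliant story",
    "a beautiful moving masterpiece", "wonderful cast great fun",
    "brilliant film loved it", "a great story beautifully told",
    "terrible film awful acting", "boring plot and a dull cast",
    "awful waste of time", "dull boring terrible story",
    "a poor script badly acted", "awful plot poor film",
]
toy_labels = ["positive"] * 6 + ["negative"] * 6

# Putting the vectorizer in the pipeline means each fold learns its
# vocabulary from that fold's training portion only.
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())

# Three folds; for classifiers the folds are stratified by label.
cv_scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=3)
print(f"Fold accuracies: {cv_scores}, mean: {cv_scores.mean():.3f}")
```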

In [9]:
svc = SVC()
start = time.time()
svc.fit(train_TfIdf, train_y)
end = time.time()

print(f'SVM fit time elapsed: {end - start}s')

SVM fit time elapsed: 9.277125358581543s


In [11]:
regression = LogisticRegression()
start = time.time()
regression.fit(train_TfIdf, train_y)
end = time.time()

print(f'Logistic Regression fit time elapsed: {end - start}s')

Logistic Regression fit time elapsed: 0.2298266887664795s


In [12]:
Tree = DecisionTreeClassifier()
start = time.time()
Tree.fit(train_TfIdf, train_y)
end = time.time()

print(f'Decision Tree fit time elapsed: {end - start}s')

Decision Tree fit time elapsed: 1.339094638824463s


In [13]:
NB = GaussianNB()
start = time.time()
NB.fit(train_TfIdf.toarray(), train_y)
end = time.time()

print(f'Naive Bayes fit time elapsed: {end - start}s')

Naive Bayes fit time elapsed: 2.965023994445801s
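The four timing cells above repeat one pattern, so the same measurement can be sketched as a loop. This version uses `time.perf_counter`, which is intended for measuring short durations, and a hypothetical toy corpus; the `candidates` dictionary and its `needs_dense` flag are illustrative names only.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the review subset.
toy_texts = [
    "loved this wonderful film", "great acting brilliant story",
    "wonderful cast great fun", "terrible film awful acting",
    "boring plot dull cast", "awful waste of time",
]
toy_labels = ["positive"] * 3 + ["negative"] * 3
toy_matrix = TfidfVectorizer().fit_transform(toy_texts)

# (model, needs_dense): GaussianNB does not accept a sparse matrix.
candidates = {
    "SVM": (SVC(), False),
    "Logistic Regression": (LogisticRegression(), False),
    "Decision Tree": (DecisionTreeClassifier(), False),
    "Naive Bayes": (GaussianNB(), True),
}

fit_times = {}
for name, (model, needs_dense) in candidates.items():
    features = toy_matrix.toarray() if needs_dense else toy_matrix
    start = time.perf_counter()
    model.fit(features, toy_labels)
    fit_times[name] = time.perf_counter() - start
    print(f"{name} fit time elapsed: {fit_times[name]:.4f}s")
```

The dense conversion happens before the timer starts, so only the fit itself is measured, as in the cells above.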


In [14]:
# Let's check the mean accuracy of each model on the test set
print(f'Support Vector Machine Score: {svc.score(test_TfIdf, test_y)}')
print(f'Decision Tree Score: {Tree.score(test_TfIdf, test_y)}')
print(f'Naive Bayes Score: {NB.score(test_TfIdf.toarray(), test_y)}')
print(f'Logistic-regression: {regression.score(test_TfIdf, test_y)}')

Support Vector Machine Score: 0.86
Decision Tree Score: 0.674
Naive Bayes Score: 0.611
Logistic-regression: 0.862


In [15]:
# We can also get a little more info using the classification_report function from sklearn. 
print("SVM Classification Report")
print(classification_report(test_y, svc.predict(test_TfIdf), labels=['positive', 'negative']))

print("\nDecision Tree Classification Report")
print(classification_report(test_y, Tree.predict(test_TfIdf), labels=['positive', 'negative']))

print("\nNaive Bayes Classification Report")
print(classification_report(test_y, NB.predict(test_TfIdf.toarray()), labels=['positive', 'negative']))

print("\nLogistic-regression Classification Report")
print(classification_report(test_y, regression.predict(test_TfIdf), labels=['positive', 'negative']))

SVM Classification Report
              precision    recall  f1-score   support

    positive       0.83      0.91      0.87       495
    negative       0.90      0.81      0.85       505

    accuracy                           0.86      1000
   macro avg       0.86      0.86      0.86      1000
weighted avg       0.86      0.86      0.86      1000


Decision Tree Classification Report
              precision    recall  f1-score   support

    positive       0.67      0.66      0.67       495
    negative       0.67      0.69      0.68       505

    accuracy                           0.67      1000
   macro avg       0.67      0.67      0.67      1000
weighted avg       0.67      0.67      0.67      1000


Naive Bayes Classification Report
              precision    recall  f1-score   support

    positive       0.60      0.62      0.61       495
    negative       0.62      0.60      0.61       505

    accuracy                           0.61      1000
   macro avg       0.61      0.61

Based on our results, the SVM and logistic regression models are neck and neck for top performance (0.860 and 0.862 accuracy). The decision tree and naive Bayes classifiers do a lot worse in comparison (0.674 and 0.611). This isn't too surprising. With only 4,000 training reviews and more than 35,000 features, an unconstrained decision tree overfits easily, and `GaussianNB` assumes normally distributed features, which sparse TF-IDF weights are not. Both would likely need more data, or a better-suited variant, to score well. These two models do fit faster than the SVM, which may make them more attractive for large data sets: the decision tree fit almost 7x faster than the SVM (about 1.3 s against 9.3 s).

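Based on score and speed, though, logistic regression is the winner for our case: it matches the SVM's accuracy while fitting in about 0.23 s, roughly 40x faster. Its speed comes from the fact that it only fits a linear decision boundary, with the weights found by (regularised) maximum likelihood.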

Since we are dealing with only two classes (positive/negative) in roughly equal numbers, the accuracy values are fairly representative of how well the models perform. However, the classification report gives more information: the true positive rate (recall) and the positive predictive value (precision) for each class. The F1-score is the harmonic mean of these two values, which lets us compare the models using a single metric.
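As a worked example of how these three metrics relate, the sketch below computes them from confusion-matrix counts. The counts are hypothetical round numbers chosen for easy arithmetic; they are not taken from the reports above.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 for one class from raw counts."""
    # Precision: of everything predicted as this class, how much was right.
    precision = true_pos / (true_pos + false_pos)
    # Recall: of everything truly in this class, how much was found.
    recall = true_pos / (true_pos + false_neg)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the "positive" class.
precision, recall, f1 = precision_recall_f1(true_pos=90, false_pos=10, false_neg=30)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because the harmonic mean is pulled towards the smaller value, a model cannot reach a high F1-score by doing well on precision alone or on recall alone.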

From here, if we weren't happy with the performance, we could grid search the SVM model to fine-tune its hyperparameters, or introduce some more NLP preprocessing steps. This is a general guide to how I would approach these types of problems. Hope you enjoyed!
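As a pointer for that next step, here is a minimal sketch of a grid search over SVM hyperparameters with `GridSearchCV`. The toy corpus, the parameter grid and the variable names are assumptions made for illustration. A real search would run on the training reviews and would probably cover a wider grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: six positive and six negative snippets.
toy_texts = [
    "loved this wonderful film", "great acting and a brilliant story",
    "a beautiful moving masterpiece", "wonderful cast great fun",
    "brilliant film loved it", "a great story beautifully told",
    "terrible film awful acting", "boring plot and a dull cast",
    "awful waste of time", "dull boring terrible story",
    "a poor script badly acted", "awful plot poor film",
]
toy_labels = ["positive"] * 6 + ["negative"] * 6

# make_pipeline names each step after its lowercased class name,
# so SVC parameters are addressed with the "svc__" prefix.
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

# Every parameter combination is scored with 3-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(toy_texts, toy_labels)
print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```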