# Movie review sentiment analysis

Given the text of a movie review, we want to predict its sentiment. For this task we will use the NLTK package, in particular its 'movie_reviews' corpus.

The movie reviews dataset contains positive and negative movie reviews and can be downloaded with the following code:

In [1]:
import nltk
nltk.download('movie_reviews')
nltk.download('stopwords')

In [58]:
# import libraries for data exploration

import numpy as np
import pandas as pd

## Data import and exploration

In [164]:
import nltk
from nltk.corpus import movie_reviews

In [3]:
movie_reviews.fileids()[:5]

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt']

In [4]:
# selecting negative and positive reviews

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
print("Positive reviews: {} \nNegative reviews: {}".format(len(posids), len(negids)))

Positive reviews: 1000 
Negative reviews: 1000


In [14]:
# Creating a list of words of positive/negative reviews
negfeats = [list(movie_reviews.words(fileids=[f])) for f in negids]
posfeats = [list(movie_reviews.words(fileids=[f])) for f in posids]

In [33]:
totalreviews = negfeats + posfeats
labels = len(negfeats) * [0] + len(posfeats) * [1]

In [34]:
print("Total reviews in dataset: {}".format(len(totalreviews)))

Total reviews in dataset: 2000


As we can see, our dataset is perfectly balanced: half of the reviews are positive and half are negative.

## Model building

Let's first try to build a simple model without any preprocessing of review words.

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

In [36]:
# join each review's list of words back into a single string, since the vectorizer expects raw text
totalreviews = [" ".join(review) for review in totalreviews]

In [49]:
vectorizer = CountVectorizer()
vectorizer.fit(totalreviews)
print("Feature count: {}".format(len(vectorizer.get_feature_names_out())))

Feature count: 39659


Our vectorizer creates a *documents × tokens* matrix, where each cell holds the number of times a token occurs in a document.
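To make the matrix layout concrete, here is a minimal sketch on a two-document toy corpus. The documents and the `toy_*` names are made up for illustration and are not part of the movie_reviews data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from movie_reviews
toy_docs = ["good good movie", "bad movie"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index; sort tokens by column
toy_tokens = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_counts = toy_matrix.toarray()

print(toy_tokens)  # ['bad', 'good', 'movie']
print(toy_counts)  # row 0 -> [0 2 1], row 1 -> [1 0 1]
```

Each row is one document and each column is one token, so "good" is counted twice in the first row.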

Let's build a pipeline with CountVectorizer and LogisticRegression.

In [55]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

In [141]:
clf_pipeline = Pipeline(steps=[
    ('vectorizer', CountVectorizer()),
    ('estimator', LogisticRegression())
])

In [61]:
accuracy_scores = cross_val_score(clf_pipeline, totalreviews, labels, scoring='accuracy')
roc_auc_scores = cross_val_score(clf_pipeline, totalreviews, labels, scoring='roc_auc')

In [62]:
print("Cross-validation accuracy score: {}".format(np.mean(accuracy_scores)))
print("Cross-validation roc_auc score: {}".format(np.mean(roc_auc_scores)))

Cross-validation accuracy score: 0.8360216503929078
Cross-validation roc_auc score: 0.9107764937833774
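The two calls above fit the pipeline once per fold for each metric. As an alternative, `cross_validate` can compute several metrics in one pass, and an explicit `StratifiedKFold` pins down the number of folds instead of relying on the library default. The sketch below runs on a hypothetical toy corpus (`toy_reviews`, `toy_labels`), so its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for totalreviews / labels
toy_reviews = [
    "great fun film", "loved this great movie", "fun and clever plot",
    "great acting and fun story", "clever fun movie", "loved the great plot",
    "bad boring film", "hated this bad movie", "boring and dull plot",
    "bad acting and dull story", "dull boring movie", "hated the bad plot",
]
toy_labels = [1] * 6 + [0] * 6

toy_pipeline = Pipeline(steps=[
    ('vectorizer', CountVectorizer()),
    ('estimator', LogisticRegression())
])

# Fixed, stratified folds so every test fold contains both classes
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# One cross-validation run, two metrics
results = cross_validate(toy_pipeline, toy_reviews, toy_labels,
                         scoring=['accuracy', 'roc_auc'], cv=folds)

print("accuracy per fold:", results['test_accuracy'])
print("roc_auc per fold:", results['test_roc_auc'])
```

On the real data, the same call would take `clf_pipeline`, `totalreviews`, and `labels` as arguments.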


## Model interpretation

Let's look at the five most important parameters (words) according to our model, ranked by the absolute value of their coefficients.

In [142]:
clf_pipeline.fit(totalreviews, labels)
print("Top 5 important words according to coefficients:\n")

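# rank words by absolute coefficient value, so both strongly negative and strongly positive words are included

indices = np.argsort(np.abs(clf_pipeline.named_steps['estimator'].coef_[0]))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
feature_names = clf_pipeline.named_steps['vectorizer'].get_feature_names_out()

for index in reversed(indices[-5:]):
    print(feature_names[index])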

Top 5 important words according to coefficients:

bad
unfortunately
worst
fun
waste


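So far the coefficients of our model make sense: four of the five words (bad, unfortunately, worst, waste) are clearly negative in meaning, and one (fun) is clearly positive.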
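Ranking by absolute value hides the direction of each word's effect. A minimal sketch of how the sign could be shown alongside the word is given below. It uses a hypothetical toy corpus (`toy_docs`, `toy_labels`) rather than the real reviews, with label 1 meaning positive as in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 0 = negative, 1 = positive
toy_docs = ["bad boring film", "bad plot", "boring bad acting",
            "great fun film", "great plot", "fun great acting"]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_pipeline = Pipeline(steps=[
    ('vectorizer', CountVectorizer()),
    ('estimator', LogisticRegression())
])
toy_pipeline.fit(toy_docs, toy_labels)

# Map every token to its signed coefficient
vocab = toy_pipeline.named_steps['vectorizer'].vocabulary_
coefs = toy_pipeline.named_steps['estimator'].coef_[0]
signed = {token: float(coefs[col]) for token, col in vocab.items()}

# Largest absolute weights first, with the direction spelled out
ranked = sorted(signed.items(), key=lambda kv: abs(kv[1]), reverse=True)
for token, weight in ranked[:4]:
    direction = "positive" if weight > 0 else "negative"
    print("{:>8} {:+.3f} ({})".format(token, weight, direction))
```

A negative coefficient pushes the prediction towards class 0 (negative review), a positive one towards class 1.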

## Parameter tuning

Instead of CountVectorizer we can use the TF-IDF method.

In [145]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [146]:
tfidf_pipeline = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer()),
    ('estimator', LogisticRegression())
])

In [147]:
# simple function for estimating pipeline on our data

def estimate_pipeline(pipeline, X=totalreviews, y=labels):
    scores = cross_val_score(pipeline, X, y, cv=5)
    print("CV mean score: {} Standard deviation: {}".format(np.mean(scores), np.std(scores)))

In [148]:
print("Estimating pipeline with count vectorizer:")
estimate_pipeline(clf_pipeline)

print("Estimating pipeline with Tfidf vectorizer:")
estimate_pipeline(tfidf_pipeline)

Estimating pipeline with count vectorizer:
CV mean score: 0.8415000000000001 Standard deviation: 0.01677796173556255
Estimating pipeline with Tfidf vectorizer:
CV mean score: 0.8210000000000001 Standard deviation: 0.004062019202317978


Switching the vectorizer to TF-IDF does not improve performance: mean accuracy drops slightly (0.821 vs. 0.8415), although the scores vary less across folds.
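For intuition about what TF-IDF changes, the sketch below compares weights inside a single document of a hypothetical toy corpus. With `TfidfVectorizer` defaults, a token that occurs in every document ("movie") gets the minimum inverse document frequency, so it ends up with a lower weight than a rarer token that has the same raw count.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "movie" occurs in every document
toy_docs = ["movie good", "movie bad", "movie boring"]

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_docs).toarray()

vocab = toy_tfidf.vocabulary_

# Learned inverse document frequency per token
idf = {token: float(toy_tfidf.idf_[col]) for token, col in vocab.items()}

# TF-IDF weights of the first document ("movie good")
doc0 = {token: float(toy_weights[0, col]) for token, col in vocab.items()}

print("idf:", idf)
print("weights in first document:", doc0)
```

Both tokens occur once in the first document, yet "good" receives the larger weight because it is rarer across the corpus.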

Let's try varying the document-frequency cut-off parameter.

In [150]:
# min_df sets a threshold: terms that appear in fewer than min_df documents are ignored

df10_pipeline = Pipeline(steps=[
    ('vectorizer', CountVectorizer(min_df=10)),
    ('estimator', LogisticRegression())
])

df50_pipeline = Pipeline(steps=[
    ('vectorizer', CountVectorizer(min_df=50)),
    ('estimator', LogisticRegression())
])

In [151]:
print("Estimating pipeline with count vectorizer and min_df = 10:")
estimate_pipeline(df10_pipeline)

print("Estimating pipeline with count vectorizer and min_df = 50:")
estimate_pipeline(df50_pipeline)

Estimating pipeline with count vectorizer and min_df = 10:
CV mean score: 0.8390000000000001 Standard deviation: 0.011895377253370336
Estimating pipeline with count vectorizer and min_df = 50:
CV mean score: 0.813 Standard deviation: 0.013453624047073712
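The effect of `min_df` on the vocabulary is easy to see on a small example. The sketch below uses a hypothetical three-document corpus and `min_df=2`, so only tokens that occur in at least two documents survive.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_docs = ["good movie", "good plot", "good movie twist"]

full_vocab = sorted(CountVectorizer().fit(toy_docs).vocabulary_)
pruned_vocab = sorted(CountVectorizer(min_df=2).fit(toy_docs).vocabulary_)

print(full_vocab)    # ['good', 'movie', 'plot', 'twist']
print(pruned_vocab)  # ['good', 'movie']
```

"plot" and "twist" each occur in a single document and are dropped. On the real corpus a high threshold also removes rare but informative words, which may be why accuracy falls at `min_df=50`.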


Next, let's compare different classifiers:

In [161]:
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
import warnings
warnings.filterwarnings('ignore')

In [162]:
classifiers = {"Logistic Regression": LogisticRegression(),
               "SVM Classifier": LinearSVC(),
               "SGD Classifier": SGDClassifier()}

for name, clf in classifiers.items():
    pipeline = Pipeline(steps=[
        ('vectorizer', CountVectorizer()),
        ('estimator', clf)
    ])
    print("Estimating {}".format(name))
    estimate_pipeline(pipeline)

Estimating Logistic Regression
CV mean score: 0.8415000000000001 Standard deviation: 0.01677796173556255
Estimating SVM Classifier
CV mean score: 0.8325000000000001 Standard deviation: 0.0162788205960997
Estimating SGD Classifier
CV mean score: 0.7505 Standard deviation: 0.06777536425575299


Removing stop words is a common way to try to improve a bag-of-words model. Lists of stop words are available in both the sklearn and nltk packages.

We will estimate model performance with both lists.

In [170]:
stop_words_nltk = nltk.corpus.stopwords.words('english')
print(stop_words_nltk[:5])

['i', 'me', 'my', 'myself', 'we']


In [172]:
stopwords_nltk_pipeline = Pipeline(steps=[
        ('vectorizer', CountVectorizer(stop_words=stop_words_nltk)),
        ('estimator', LogisticRegression())
])

stopwords_sklearn_pipeline = Pipeline(steps=[
        ('vectorizer', CountVectorizer(stop_words='english')),
        ('estimator', LogisticRegression())
])

In [173]:
print("Estimating pipeline with NLTK stopwords:")
estimate_pipeline(stopwords_nltk_pipeline)

print("Estimating pipeline with SKLearn stopwords list:")
estimate_pipeline(stopwords_sklearn_pipeline)

Estimating pipeline with NLTK stopwords:
CV mean score: 0.8414999999999999 Standard deviation: 0.010440306508910566
Estimating pipeline with SKLearn stopwords list:
CV mean score: 0.8385 Standard deviation: 0.009823441352194272
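Neither list improves on the baseline here. One thing worth keeping in mind for sentiment tasks is that general-purpose stop lists can contain negation words. The sketch below checks this for sklearn's built-in English list on a hypothetical example sentence. It only illustrates the mechanism and does not prove that this explains the scores above.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical example sentence
toy_docs = ["this is not a good movie"]

plain_vocab = sorted(CountVectorizer().fit(toy_docs).vocabulary_)
filtered_vocab = sorted(
    CountVectorizer(stop_words='english').fit(toy_docs).vocabulary_
)

print("'not' is a stop word:", 'not' in ENGLISH_STOP_WORDS)
print("without filtering:", plain_vocab)
print("with filtering:", filtered_vocab)
```

After filtering, the negation is gone, so this sentence looks more like a positive review to a bag-of-words model.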


Next, we will estimate the performance of the classifier with different n-gram ranges:

In [175]:
ngram_words_pipeline = Pipeline(steps=[
        ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
        ('estimator', LogisticRegression())
])

ngram_chars_pipeline = Pipeline(steps=[
        ('vectorizer', CountVectorizer(ngram_range=(3, 5), analyzer='char_wb')),
        ('estimator', LogisticRegression())
])

In [176]:
# ngram_range=(1, 2) keeps single words and adds word pairs
print("Estimating pipeline with word unigrams and bigrams:")
estimate_pipeline(ngram_words_pipeline)

print("Estimating pipeline with 3-5 character n-grams:")
estimate_pipeline(ngram_chars_pipeline)

Estimating pipeline with word unigrams and bigrams:
CV mean score: 0.852 Standard deviation: 0.016537835408541222
Estimating pipeline with 3-5 character n-grams:
CV mean score: 0.82 Standard deviation: 0.010606601717798201
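To see which features these two settings actually generate, here is a minimal sketch on hypothetical toy inputs. Word n-grams with `ngram_range=(1, 2)` keep the single words and add adjacent pairs. The `char_wb` analyzer builds character n-grams inside word boundaries and pads each word with a space on both sides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word unigrams and bigrams for a hypothetical two-word review
word_vectorizer = CountVectorizer(ngram_range=(1, 2))
word_features = sorted(word_vectorizer.fit(["not good"]).vocabulary_)

# Character trigrams within word boundaries for a single hypothetical word
char_vectorizer = CountVectorizer(ngram_range=(3, 3), analyzer='char_wb')
char_features = sorted(char_vectorizer.fit(["bad"]).vocabulary_)

print(word_features)  # ['good', 'not', 'not good']
print(char_features)  # [' ba', 'ad ', 'bad']
```

The bigram "not good" becomes a feature of its own, so the model can treat it differently from "good" alone. That may help explain the small gain from word bigrams above.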
