# Movie review sentiment analysis

Given the text of a movie review, the goal is to predict its sentiment. For this task we use the NLTK package, and its 'movie_reviews' corpus in particular.

The movie reviews dataset contains positive and negative movie reviews and can be downloaded with the following code:

In [1]:
# Uncomment and run once to download the corpus:
# import nltk
# nltk.download('movie_reviews')

In [58]:
# import libraries for data exploration

import numpy as np
import pandas as pd

## Data import and exploration

In [2]:
from nltk.corpus import movie_reviews

In [3]:
movie_reviews.fileids()[:5]

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt']

In [4]:
# selecting negative and positive reviews

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
print("Positive reviews: {} \nNegative reviews: {}".format(len(posids), len(negids)))

Positive reviews: 1000 
Negative reviews: 1000


In [14]:
# Creating a list of words of positive/negative reviews
negfeats = [list(movie_reviews.words(fileids=[f])) for f in negids]
posfeats = [list(movie_reviews.words(fileids=[f])) for f in posids]

In [33]:
totalreviews = negfeats + posfeats
labels = len(negfeats) * [0] + len(posfeats) * [1]

In [34]:
print("Total reviews in dataset: {}".format(len(totalreviews)))

Total reviews in dataset: 2000


As we can see, the dataset is perfectly balanced: half of the reviews are positive and the other half are negative.
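
Balance matters because it sets the baseline: a model that always predicts the same class would score 50% accuracy here. A minimal sketch of that check, which rebuilds the `labels` list from the counts printed above rather than loading the corpus:

```python
from collections import Counter

# Rebuild the label list from the counts printed above
# (1000 negative reviews labelled 0, 1000 positive reviews labelled 1).
labels = 1000 * [0] + 1000 * [1]

class_counts = Counter(labels)
# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = max(class_counts.values()) / len(labels)

print("Class counts:", dict(class_counts))
print("Majority-class baseline accuracy:", majority_baseline)
```

Any accuracy reported later should be read against this 0.5 baseline.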

Let's first try to build a simple model without any preprocessing of review words.

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

In [36]:
# join each review's list of words into a single string, the input format the vectorizer expects
totalreviews = [" ".join(review) for review in totalreviews]

In [49]:
vectorizer = CountVectorizer()
vectorizer.fit(totalreviews)
# vocabulary_ maps each token to its column index, so its size is the feature count
print("Feature count: {}".format(len(vectorizer.vocabulary_)))

Feature count: 39659


Our vectorizer creates a *documents × tokens* matrix, where each cell holds the number of times a token occurs in a document.
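
To make the matrix concrete, here is a pure-Python sketch of the same idea on a toy corpus. The three documents are made up for illustration, and the code only imitates the counting step: it lowercases the text and uses the same regular expression as CountVectorizer's default `token_pattern`, but it is not the library's implementation.

```python
import re
from collections import Counter

# Toy corpus (made up for illustration, not taken from movie_reviews).
docs = ["a good good movie", "a bad movie", "not bad , not good"]

# Same regular expression as CountVectorizer's default token_pattern:
# tokens of two or more word characters, so "a" and "," are dropped.
token_pattern = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return token_pattern.findall(text.lower())

counts_per_doc = [Counter(tokenize(doc)) for doc in docs]
# Columns are the sorted vocabulary, one column per distinct token.
vocabulary = sorted(set().union(*counts_per_doc))
# Rows are documents; each cell is how often the token occurs in that document.
matrix = [[counts[token] for token in vocabulary] for counts in counts_per_doc]

print(vocabulary)
for row in matrix:
    print(row)
```

On the real corpus the matrix has 2000 rows and one column per feature, and almost all cells are zero, which is why scikit-learn stores it as a sparse matrix.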

Let's build a pipeline with CountVectorizer and LogisticRegression.

In [55]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

In [54]:
clf_pipeline = Pipeline(steps=[
    ('vectorizer', CountVectorizer()),
    ('estimator', LogisticRegression())
])

In [61]:
accuracy_scores = cross_val_score(clf_pipeline, totalreviews, labels, scoring='accuracy')
roc_auc_scores = cross_val_score(clf_pipeline, totalreviews, labels, scoring='roc_auc')

In [62]:
print("Cross-validation accuracy score: {}".format(np.mean(accuracy_scores)))
print("Cross-validation roc_auc score: {}".format(np.mean(roc_auc_scores)))

Cross-validation accuracy score: 0.8360216503929078
Cross-validation roc_auc score: 0.9107764937833774
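
`cross_val_score` splits the data into folds, trains on all but one fold, scores on the held-out fold, and returns one score per fold; the cell above averages them. For a classifier and an integer number of folds, scikit-learn stratifies the split so each fold keeps the class proportions. The following is a simplified pure-Python sketch of that splitting idea, not scikit-learn's exact algorithm. `stratified_folds` is a hypothetical helper, and the fold count of 5 is an assumption, since the default depends on the scikit-learn version.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold, round-robin within each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

# Same label layout as in the notebook: negatives first, then positives.
labels = 1000 * [0] + 1000 * [1]
folds = stratified_folds(labels, n_folds=5)

for fold_number, test_indices in enumerate(folds):
    positives = sum(labels[i] for i in test_indices)
    print("Fold {}: {} test reviews, {} positive".format(
        fold_number, len(test_indices), positives))
```

Stratification matters here because the labels are ordered by class: a naive split into contiguous blocks would produce folds containing only one class.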


Let's have a look at the five most important features (words) according to our model, ranked by the absolute value of their coefficients.

In [138]:
clf_pipeline.fit(totalreviews, labels)
print("Top 5 important words according to coefficients:\n")

# look for both negative and positive coefficients

indices = np.argsort(np.abs(clf_pipeline.named_steps['estimator'].coef_[0]))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0+
feature_names = clf_pipeline.named_steps['vectorizer'].get_feature_names_out()

for index in reversed(indices[-5:]):
    print(feature_names[index])

Top 5 important words according to coefficients: 
bad
unfortunately
worst
fun
waste
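
Ranking by absolute value hides which way each word pushes the prediction. With the labels used here (0 for negative, 1 for positive), a positive coefficient raises the predicted probability of a positive review and a negative coefficient lowers it. A small sketch of reading the sign back out; the coefficient values below are invented purely for illustration and are not the fitted model's weights:

```python
# Hypothetical (word, coefficient) pairs; the numbers are made up for this example.
coefficients = {"bad": -0.9, "fun": 0.6, "waste": -0.5, "the": 0.01, "great": 0.4}

# Rank words by coefficient magnitude, largest first.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

for word, weight in ranked[:3]:
    direction = "positive" if weight > 0 else "negative"
    print("{:<6} {:+.2f}  pushes towards {}".format(word, weight, direction))
```

Applied to the fitted pipeline, the same idea amounts to printing the coefficient next to each feature name instead of the name alone.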
