# Movies classification exercise

NLTK contains a "Movie Review" corpus of 2000 movie reviews, each labeled as "positive" or "negative".

In the demo presented in class, and partially copied below, we used a simple vectorizer and a Naive Bayes classifier to learn to tell good movies from bad ones based on their reviews.

For this exercise, give your best shot at improving the score from the demo (an accuracy of 0.83 on the test data), using every means you can think of! (Apart from cheating ;-) ).
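
One way to compare ideas without peeking at the test set (which would count as cheating) is to cross-validate on the training data only. The sketch below illustrates that pattern; it is not the demo's method. The TF-IDF plus logistic regression combination is just one candidate workflow, and `toy_reviews` / `toy_labels` are made-up stand-ins for `reviews_train` / `t_train`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for reviews_train / t_train.
toy_reviews = [
    'a wonderful, moving film with great acting',
    'great story and wonderful characters',
    'brilliant direction and a moving story',
    'a dull, boring film with terrible acting',
    'terrible plot and boring characters',
    'awful direction and a dull story',
]
toy_labels = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']

# Bundle vectorizer and classifier so that each fold refits both
# on its own training part only.
workflow = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Default scoring for a classifier is accuracy; cv=3 gives three folds.
scores = cross_val_score(workflow, toy_reviews, toy_labels, cv=3)
print('Cross-validated accuracy: {:.2f}'.format(scores.mean()))
```

With the real data, the same call would take `reviews_train` and `t_train`, leaving `reviews_test` untouched until the final evaluation.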

In [3]:
# A few useful imports.
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

from nltk.corpus import movie_reviews

In [4]:
# Collect all reviews and their label ("pos" or "neg") from the corpus.
reviews = []
labels = []
for file_id in movie_reviews.fileids():
    reviews.append(movie_reviews.raw(file_id))
    labels.append(movie_reviews.categories([file_id])[0])

In [5]:
# We set aside 30% of the data set to test the classifier. The random state is
# fixed, so that everybody will have the same test set.

# Divide the data into a training and a test set.
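# train_test_split lives in sklearn.model_selection (the old
# sklearn.cross_validation module has been removed).
from sklearn.model_selection import train_test_split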
# Fixed to make the notebook reproducible.
random_state = np.random.RandomState(3939)

test_set_fraction = 0.3
reviews_train, reviews_test, t_train, t_test = train_test_split(
    reviews, labels, test_size=test_set_fraction, random_state=random_state)

print('# train data points = {}, # test = {}'.format(len(reviews_train), len(reviews_test)))

# train data points = 1400, # test = 600


In [8]:
# Here we give the vectorizer / classifier combination shown in the demo.
# *** SUBSTITUTE THIS WITH YOUR OWN WORKFLOW ***

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer(
    stop_words='english', lowercase=True, binary=False,
    min_df=0.01)
features_train = vectorizer.fit_transform(reviews_train)
classifier = MultinomialNB()
classifier.fit(features_train, t_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
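
To make the vectorizer settings above more concrete, here is a simplified, pure-Python sketch of what a count vectorizer with a fractional `min_df` does: it counts word occurrences per document and drops words that appear in too small a share of the documents. This is an illustration only, not scikit-learn's implementation; the function name `count_features` is made up, tokenization is reduced to lowercasing and whitespace splitting, and stop-word removal is left out.

```python
from collections import Counter

def count_features(documents, min_df=0.01):
    """Return (vocabulary, count rows) for a list of text documents."""
    # Simplified tokenization: lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # A float min_df is a proportion of documents; keep terms at or above it.
    min_docs = min_df * len(documents)
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_docs)

    # One row per document, one column per vocabulary term (raw counts,
    # which corresponds to binary=False).
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocabulary])
    return vocabulary, rows

docs = ['good good movie', 'bad movie', 'good plot']
vocabulary, rows = count_features(docs, min_df=0.5)
print(vocabulary)
print(rows)
```

In this toy example only `good` and `movie` occur in at least half of the documents, so `bad` and `plot` are dropped from the vocabulary.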




In [9]:
# Transform test data and display performance.
from sklearn.metrics import classification_report, confusion_matrix

features_test = vectorizer.transform(reviews_test)
# Note: score() returns the mean accuracy, not the precision.
print('Accuracy {}'.format(classifier.score(features_test, t_test)))
y_test = classifier.predict(features_test)
# Rows are true labels, columns are predicted labels (order: 'neg', 'pos').
print(confusion_matrix(t_test, y_test))

Accuracy 0.83
[[240  46]
 [ 56 258]]
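
The 0.83 reported above can be recovered from the confusion matrix, which also gives per-class figures. The sketch below does the arithmetic by hand. It assumes scikit-learn's usual layout (rows are true labels, columns are predicted labels, classes sorted alphabetically, so `neg` comes before `pos`). The variable names are chosen for illustration.

```python
# Confusion matrix copied from the output above.
confusion = [[240, 46],
             [56, 258]]

# Row 0: true 'neg' reviews; row 1: true 'pos' reviews.
(true_neg, false_pos), (false_neg, true_pos) = confusion
total = true_neg + false_pos + false_neg + true_pos

# Accuracy: share of all reviews that were classified correctly.
accuracy = (true_neg + true_pos) / float(total)
# Precision for 'pos': share of predicted-positive reviews that are positive.
precision_pos = true_pos / float(true_pos + false_pos)
# Recall for 'pos': share of truly positive reviews that were found.
recall_pos = true_pos / float(true_pos + false_neg)

print('accuracy      = {:.3f}'.format(accuracy))
print('precision pos = {:.3f}'.format(precision_pos))
print('recall pos    = {:.3f}'.format(recall_pos))
```

The accuracy works out to 498 / 600 = 0.83, matching the score printed by the demo.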
