Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.

Dataset

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.

In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5.
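The labeling rule above can be sketched as a small function. This is only an illustration and is not code shipped with the dataset; the name `rating_to_label` and the choice of returning `None` for the excluded neutral ratings are assumptions made here.

```python
def rating_to_label(rating):
    """Map a 1-10 star rating to a binary sentiment label.

    Returns "neg" for ratings <= 4 and "pos" for ratings >= 7.
    Ratings 5 and 6 are neutral and excluded from the labeled
    train/test sets, so None is returned for them (an illustrative
    convention, not part of the dataset).
    """
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return None


for r in (1, 4, 5, 6, 7, 10):
    print(r, rating_to_label(r))
```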

Files

There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset.
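As a minimal sketch of the naming convention, the id and rating can be recovered from a file path as follows. The helper name `parse_review_filename` is hypothetical and not part of the dataset distribution.

```python
import os


def parse_review_filename(path):
    """Split a review path such as 'test/pos/200_8.txt' into (id, rating)."""
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext != ".txt":
        raise ValueError("expected a .txt review file, got: " + path)
    # The stem has the form [id]_[rating]
    review_id, rating = stem.split("_")
    return int(review_id), int(rating)


review_id, rating = parse_review_filename("test/pos/200_8.txt")
print(review_id, rating)  # 200 8
```

For files under [train/unsup/] the same function would return a rating of 0, since ratings are omitted for that portion.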

We also include the IMDb URLs for each review in a separate [urls_[pos, neg, unsup].txt] file. A review with unique id 200 will have its URL on line 200 of this file. Due to the ever-changing IMDb, we are unable to link directly to the review, but only to the movie's review page.

In addition to the review text files, we include already-tokenized bag of words (BoW) features that were used in our experiments. These are stored in .feat files in the train/test directories. Each .feat file is in LIBSVM format, an ASCII sparse-vector format for labeled data. The feature indices in these files start from 0, and the text token corresponding to a feature index is found in [imdb.vocab]. So a line with 0:7 in a .feat file means the first word in [imdb.vocab] (the) appears 7 times in that review.

LIBSVM page for details on .feat file format: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
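To make the sparse format concrete, here is a rough sketch of parsing one line by hand. The function name `parse_feat_line` and the sample line are hypothetical, and treating the leading label as an integer is an assumption; only the `index:count` pairs with 0-based indices come from the description above.

```python
def parse_feat_line(line):
    """Parse one LIBSVM-style line into (label, {feature_index: count})."""
    parts = line.split()
    label = int(parts[0])  # assumption: the label is an integer
    counts = {}
    for item in parts[1:]:
        index, value = item.split(":")
        counts[int(index)] = int(value)
    return label, counts


# Hypothetical line: feature 0 (the first word in imdb.vocab) appears 7 times
label, counts = parse_feat_line("8 0:7 3:2 15:1")
print(label, counts)
```

scikit-learn's `load_svmlight_file` reads the same format directly into a sparse matrix, which is usually more convenient for whole files.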

We also include [imdbEr.txt] which contains the expected rating for each token in [imdb.vocab] as computed by (Potts, 2011). The expected rating is a good way to get a sense for the average polarity of a word in the dataset.

Citing the dataset

When using this dataset please cite our ACL 2011 paper which introduces it. This paper also contains classification results which you may want to compare against.

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20, 636-659.

Contact

For questions/comments/corrections please contact Andrew Maas amaas@cs.stanford.edu

    1  Data Import
    2  Data Preprocessing
    3  Feature Engineering
    4  Feature Selection
    5  Model Architecture
    6  Model Training
    7  Model Evaluation
    7.1  Accuracy & Loss
    7.2  Error Analysis
    8  Model Application
    8.1  Test Predictions
    8.2  Custom Reviews

In [1]:
import pandas as pd
import os

In [18]:
path = "./data_movie_reviews/"
positiveFiles = [x for x in os.listdir(path+"train/pos/") if x.endswith(".txt")]
negativeFiles = [x for x in os.listdir(path+"train/neg/") if x.endswith(".txt")]
# testFiles = [x for x in os.listdir(path+"test/") if x.endswith(".txt")]

In [19]:
test_positiveFiles = [x for x in os.listdir(path+"test/pos/") if x.endswith(".txt")]
test_negativeFiles = [x for x in os.listdir(path+"test/neg/") if x.endswith(".txt")]
# test_testFiles = [x for x in os.listdir(path+"test/") if x.endswith(".txt")]

In [20]:
positiveReviews, negativeReviews, testReviews = [], [], []
for pfile in positiveFiles:
    with open(path+"train/pos/"+pfile, encoding="latin1") as f:
        positiveReviews.append(f.read())
for nfile in negativeFiles:
    with open(path+"train/neg/"+nfile, encoding="latin1") as f:
        negativeReviews.append(f.read())
# for tfile in testFiles:
#     with open(path+"test/"+tfile, encoding="latin1") as f:
#         testReviews.append(f.read())

In [21]:
test_positiveReviews, test_negativeReviews, test_testReviews = [], [], []
for pfile in test_positiveFiles:
    with open(path+"test/pos/"+pfile, encoding="latin1") as f:
        test_positiveReviews.append(f.read())
for nfile in test_negativeFiles:
    with open(path+"test/neg/"+nfile, encoding="latin1") as f:
        test_negativeReviews.append(f.read())
# for tfile in test_testFiles:
#     with open(path+"test/"+tfile, encoding="latin1") as f:
#         testReviews.append(f.read())

In [4]:
# reviews = pd.concat([
#     pd.DataFrame({"review":positiveReviews, "label":1, "file":positiveFiles}),
#     pd.DataFrame({"review":negativeReviews, "label":0, "file":negativeFiles}),
#     pd.DataFrame({"review":testReviews, "label":-1, "file":testFiles})
# ], ignore_index=True).sample(frac=1, random_state=1)
# reviews.head()

Unnamed: 0,file,label,review
21939,7246_4.txt,0,Gwyneth Paltrow is absolutely great in this mo...
24113,9202_1.txt,0,"I own this movie. Not by choice, I do. I was r..."
4633,2920_8.txt,1,Well I guess it supposedly not a classic becau...
17240,3016_1.txt,0,"I am, as many are, a fan of Tony Scott films. ..."
4894,3155_10.txt,1,"I wish ""that '70s show"" would come back on tel..."


In [72]:
reviews = pd.concat([
    pd.DataFrame({"review":positiveReviews, "label":1, "file":positiveFiles}),
    pd.DataFrame({"review":negativeReviews, "label":0, "file":negativeFiles}),
    pd.DataFrame({"review":test_positiveReviews, "label":1, "file":test_positiveFiles}),
    pd.DataFrame({"review":test_negativeReviews, "label":0, "file":test_negativeFiles})
], ignore_index=True).sample(frac=1, random_state=1)
reviews.head()

Unnamed: 0,file,label,review
26247,11122_8.txt,1,Fame is one of the best movies I've seen about...
35067,7811_10.txt,1,This movie fully deserves to be one of the top...
34590,7382_10.txt,1,"in a time of predictable movies, in which abou..."
16668,2501_1.txt,0,I saw this on TV the other night or rather I...
12196,9728_7.txt,1,I am a huge fan of Simon Pegg and have watched...


In [73]:
reviews.shape

(50000, 3)

In [23]:
# test_reviews = pd.concat([
#     pd.DataFrame({"review":test_positiveReviews, "label":1, "file":test_positiveFiles}),
#     pd.DataFrame({"review":test_negativeReviews, "label":0, "file":test_negativeFiles})
# ], ignore_index=True).sample(frac=1, random_state=1)
# test_reviews.head()

Unnamed: 0,file,label,review
21492,6844_2.txt,0,A movie theater with a bad history of past gru...
9488,7290_10.txt,1,"""Here On Earth"" is a surprising beautiful roma..."
16933,2740_3.txt,0,I just watched Descent. Gawds what an awful mo...
12604,10094_4.txt,0,In a nutshell the movie is about a gang war in...
8222,6150_7.txt,1,"Instead of watching the recycled history of ""P..."


### Feature engineering

Text preprocessing, tokenizing and filtering of stopwords are included in a high level component that is able to build a dictionary of features and transform documents to feature vectors.

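We remove stop words because very common words such as "the" and "and" carry little sentiment information and would otherwise take up slots among the most frequent features.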

In [117]:
from sklearn import feature_extraction
import numpy as np


In [118]:
f = feature_extraction.text.CountVectorizer(stop_words = 'english', ngram_range=(1, 1),max_features=5000 )
X = f.fit_transform(reviews["review"])


In [119]:
np.shape(X)

(50000, 5000)

In [120]:
# f.get_feature_names_out()  # replaces get_feature_names(), which was removed in scikit-learn 1.2

In [121]:
y = reviews['label']

In [122]:
X.shape

(50000, 5000)

In [123]:
y.shape

(50000,)

In [124]:
from sklearn.model_selection import train_test_split

In [125]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)


In [126]:
print([np.shape(X_train), np.shape(X_test)])

[(25000, 5000), (25000, 5000)]


In [127]:
y_train.shape, y_test.shape

((25000,), (25000,))

In [128]:
from sklearn.naive_bayes import MultinomialNB 
model = MultinomialNB()
model.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [129]:
model.predict(X_train)

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [130]:
model.score(X_train, y_train)

0.85224

In [131]:
model.predict(X_test)

array([0, 1, 1, ..., 1, 1, 1], dtype=int64)

In [132]:
model.score(X_test, y_test)

0.84784000000000004

## Testing Data

In [133]:
from sklearn import metrics

In [134]:
list_alpha = np.arange(1/100000, 20, 0.11)
score_train = np.zeros(len(list_alpha))
score_test = np.zeros(len(list_alpha))
recall_test = np.zeros(len(list_alpha))
precision_test= np.zeros(len(list_alpha))
for count, alpha in enumerate(list_alpha):
    bayes = MultinomialNB(alpha=alpha)
    bayes.fit(X_train, y_train)
    # Predict once and reuse the result for both recall and precision
    test_pred = bayes.predict(X_test)
    score_train[count] = bayes.score(X_train, y_train)
    score_test[count] = bayes.score(X_test, y_test)
    recall_test[count] = metrics.recall_score(y_test, test_pred)
    precision_test[count] = metrics.precision_score(y_test, test_pred)

In [135]:
# np.c_ stacks the 1-D arrays as columns; a plain 2-D array is enough for DataFrame
matrix = np.c_[list_alpha, score_train, score_test, recall_test, precision_test]
models = pd.DataFrame(data=matrix, columns=
             ['alpha', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision'])
models.head(n=10)

Unnamed: 0,alpha,Train Accuracy,Test Accuracy,Test Recall,Test Precision
0,1e-05,0.85236,0.84776,0.84714,0.849305
1,0.11001,0.85232,0.84784,0.84722,0.849385
2,0.22001,0.85228,0.84784,0.84714,0.849441
3,0.33001,0.8522,0.84796,0.8473,0.849533
4,0.44001,0.85212,0.848,0.8473,0.849601
5,0.55001,0.8522,0.848,0.84722,0.849656
6,0.66001,0.8522,0.848,0.84714,0.849712
7,0.77001,0.85208,0.84796,0.846981,0.849756
8,0.88001,0.85212,0.84792,0.846901,0.849744
9,0.99001,0.8522,0.84788,0.846822,0.849732


Let's select the model with the highest test precision.

In [136]:
best_index = models['Test Precision'].idxmax()
models.iloc[best_index, :]

alpha             19.910010
Train Accuracy     0.851280
Test Accuracy      0.846760
Test Recall        0.836785
Test Precision     0.854969
Name: 181, dtype: float64

In [137]:
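# Refit with the selected alpha instead of relying on whichever model the loop fitted last
best_model = MultinomialNB(alpha=models.loc[best_index, 'alpha'])
best_model.fit(X_train, y_train)
m_confusion_test = metrics.confusion_matrix(y_test, best_model.predict(X_test))
m_confusion_test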

array([[10664,  1782],
       [ 2049, 10505]], dtype=int64)

In [138]:
pd.DataFrame(data = m_confusion_test, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,10664,1782
Actual 1,2049,10505
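As a quick check, the accuracy, recall and precision reported for the selected model can be recomputed by hand from the four cells of the confusion matrix. This sketch uses only the counts shown above; the variable names are chosen here for illustration.

```python
# Counts from the confusion matrix above (rows: actual, columns: predicted)
tn, fp = 10664, 1782   # actual 0: predicted 0, predicted 1
fn, tp = 2049, 10505   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all reviews classified correctly
recall = tp / (tp + fn)        # share of actual positives that were found
precision = tp / (tp + fp)     # share of predicted positives that were right

print("accuracy  %.6f" % accuracy)
print("recall    %.6f" % recall)
print("precision %.6f" % precision)
```

The three values agree with the Test Accuracy, Test Recall and Test Precision entries for alpha = 19.91001.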


In [None]:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # needs a one-time nltk.download("stopwords")

def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML (name the parser explicitly to avoid a bs4 warning)
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space,
    # and return the result.
    return " ".join(meaningful_words)

In [None]:
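# `train` is the tutorial's DataFrame of labeled reviews; the `reviews`
# frame built earlier has the same "review" column
clean_review = review_to_words(train["review"][0])
print(clean_review)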

This should give you exactly the same output as running each of the individual cleaning steps separately. Now let's loop through and clean all of the training set at once (this might take a few minutes depending on your computer):

In [None]:
# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review; create an index i that goes from 0 to the length
# of the movie review list 
for i in range(0, num_reviews):
    # Call our function for each one, and add the result to the list of
    # clean reviews
    clean_train_reviews.append( review_to_words( train["review"][i] ) )

Sometimes it can be annoying to wait for a lengthy piece of code to run. It can be helpful to write code so that it gives status updates. To have Python print a status update after every 1000 reviews that it processes, try adding a line or two to the code above:

In [None]:
print("Cleaning and parsing the training set movie reviews...\n")
clean_train_reviews = []
for i in range(0, num_reviews):
    # If the index is evenly divisible by 1000, print a message
    if (i + 1) % 1000 == 0:
        print("Review %d of %d\n" % (i + 1, num_reviews))
    clean_train_reviews.append(review_to_words(train["review"][i]))

### Creating Features from a Bag of Words (Using scikit-learn)

Now that we have our training reviews tidied up, how do we convert them to some kind of numeric representation for machine learning? One common approach is called a Bag of Words. The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. For example, consider the following two sentences:

Sentence 1: "The cat sat on the hat"

Sentence 2: "The dog ate the cat and the hat"

From these two sentences, our vocabulary is as follows:

{ the, cat, sat, on, hat, dog, ate, and }

To get our bags of words, we count the number of times each word occurs in each sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each appear once, so the feature vector for Sentence 1 is:

{ the, cat, sat, on, hat, dog, ate, and }

Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }

Similarly, the features for Sentence 2 are: { 3, 1, 0, 0, 1, 1, 1, 1}
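The two feature vectors above can be reproduced with a few lines of plain Python. This is a sketch for illustration only: the helper `tokenize` is hypothetical, and the vocabulary is ordered by first appearance to match the listing above, which is not how scikit-learn orders its features.

```python
import re
from collections import Counter

sentences = ["The cat sat on the hat", "The dog ate the cat and the hat"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Build the vocabulary in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocabulary:
            vocabulary.append(word)

# Count how often each vocabulary word occurs in each sentence
vectors = []
for sentence in sentences:
    counts = Counter(tokenize(sentence))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```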

In the IMDb data, we have a very large number of reviews, which will give us a large vocabulary. To limit the size of the feature vectors, we should choose some maximum vocabulary size. Below, we use the 5000 most frequent words (remembering that stop words have already been removed).
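The idea of capping the vocabulary can be sketched without any library: count every word across all documents and keep only the most frequent ones. The function name `top_k_vocabulary` and the toy documents are assumptions for illustration; this is not how scikit-learn implements its `max_features` option internally.

```python
from collections import Counter


def top_k_vocabulary(tokenized_docs, k):
    """Return the k most frequent words across all tokenized documents."""
    totals = Counter()
    for tokens in tokenized_docs:
        totals.update(tokens)
    return [word for word, _ in totals.most_common(k)]


# Toy corpus of already-cleaned, tokenized reviews
docs = [
    ["good", "movie", "good"],
    ["bad", "movie"],
    ["good", "plot"],
]
print(top_k_vocabulary(docs, 2))
```

Words outside the kept vocabulary are simply ignored when the documents are turned into count vectors, so every vector has at most k entries.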

We'll be using the feature_extraction module from scikit-learn to create bag-of-words features. If you did the Random Forest tutorial in the Titanic competition, you should already have scikit-learn installed; otherwise you will need to install it.

In [None]:
print("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

# fit_transform() does two things: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# Numpy arrays are easy to work with, so convert the result to an
# array
train_data_features = train_data_features.toarray()