# Working with Data: Movie Review Classification

This program trains a classifier on movie reviews and predicts the sentiment of each review.
Input: the text of a movie review
Output: "positive review" or "negative review"

We'll use this example to get familiar with how we use data to build a machine learning model.
Try exploring how modifying the training and test data sets affects the behavior of the classifier.

Data from: http://www.nltk.org/nltk_data/

Example based on: http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html

In [1]:
import sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
import nltk

### Import Training Data

In [2]:
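# Path to the review data (one subfolder per class: neg/ and pos/)
moviedir = '../data/movie_reviews'
# Load every file; the split into training and test sets happens later.
# encoding='utf-8' makes load_files return str instead of bytes under Python 3.
movie_train = load_files(moviedir, shuffle=True, encoding='utf-8')
print("Loaded " + str(len(movie_train.data)) + " training examples")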

Loaded 2000 training examples
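
`load_files` takes its class labels from the directory layout: each subfolder becomes one class. The sketch below shows this on a throwaway corpus. The two review texts and the file name `review_0.txt` are made up for illustration and are not part of the NLTK data.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical two-review corpus, laid out as one subfolder per class.
samples = [("neg", "dull and far too long"), ("pos", "a sharp , funny script")]

with tempfile.TemporaryDirectory() as root:
    for label, text in samples:
        class_dir = Path(root) / label
        class_dir.mkdir()
        (class_dir / "review_0.txt").write_text(text, encoding="utf-8")

    # shuffle=False keeps the files in sorted folder order for this demo
    toy = load_files(root, encoding="utf-8", shuffle=False)

print(toy.target_names)  # ['neg', 'pos'] (sorted subfolder names)
print(toy.target)        # [0 1] (index into target_names for each file)
```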


In [50]:
###### Optional:
# Make sure that the data is loaded correctly

print("Inspecting data: \n")
# target names ("classes") are automatically generated from subfolder names
print("Target names: " + str(movie_train.target_names) + "\n")
# Print the first 500 characters of the first file, which seems to be a review of a Schwarzenegger movie.
print("First line: " + movie_train.data[0][:500] + "\n")
# first file is in "neg" folder
print("First file: " + movie_train.filenames[0] + "\n")
# first file is a negative review and is mapped to 0 index 'neg' in target_names
print("First file mapped to class: " + str(movie_train.target[0]) + "\n")

Inspecting data: 

Target names: ['neg', 'pos']

First line: arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . 
it's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? 
once again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . 
in this so cal

First file: movie_reviews/neg/cv405_21868.txt

First file mapped to class: 0



### Format Training Data

In [51]:
# Initialize the CountVectorizer, then turn the review texts into a document-term matrix of token counts.
# min_df=2 drops tokens that appear in fewer than two documents.
movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize)         # use all 25K words. 82.2% acc.
# movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features = 3000) # use top 3000 words only. 78.5% acc.
movie_counts = movie_vec.fit_transform(movie_train.data)

In [52]:
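# If the cell above fails because the tokenizer data is missing, uncomment and run the following line
# nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead; the error message names the missing resource)
# or, download manually from http://nltk.org/nltk_data/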

In [53]:
###### Optional:
# Examine formatted training data, what does it look like now?

# 'screen' is found in the corpus, mapped to index 19603
print("index of word 'screen' " + str(movie_vec.vocabulary_.get('screen')))
# huge dimensions! 2,000 documents, 25K unique terms. 
print("dimensions: " + str(movie_counts.shape))

index of word 'screen' 19603
dimensions: (2000, 25279)
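
To make the document-term matrix less abstract, here is a minimal sketch on a two-document corpus. The documents are invented, and the sketch uses `CountVectorizer`'s default tokenizer rather than `nltk.word_tokenize`, so it drops the single-character token "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, for illustration only
docs = ["a great great movie", "a terrible movie"]

vec = CountVectorizer()          # default tokenizer, not nltk.word_tokenize
counts = vec.fit_transform(docs)

# Each vocabulary entry maps a token to its column index
print(sorted(vec.vocabulary_.items(), key=lambda kv: kv[1]))
# [('great', 0), ('movie', 1), ('terrible', 2)]

# One row per document, one column per token
print(counts.toarray())
# [[2 1 0]
#  [0 1 1]]
```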


In [54]:
from sklearn.feature_extraction.text import TfidfTransformer

# Convert raw frequency counts into TF-IDF values
tfidf_transformer = TfidfTransformer()
movie_tfidf = tfidf_transformer.fit_transform(movie_counts)


# Same dimensions, now with tf-idf values instead of raw frequency counts
print(movie_tfidf.shape)

(2000, 25279)
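
As a rough check of what `TfidfTransformer` computes with its default settings, the sketch below rebuilds the weights by hand on a tiny count matrix: a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw counts, followed by L2 normalization of each row. The count matrix is a made-up example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 2 documents x 3 terms
counts = np.array([[2, 1, 0],
                   [0, 1, 1]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                  # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed IDF (sklearn default)
weighted = counts * idf                        # tf * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

auto = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
print(np.allclose(manual, auto))  # True
```

A term that occurs in every document (the middle column here) gets the smallest IDF, so it contributes less to each row than a term that is specific to one document.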


### Training and testing a Naive Bayes classifier

In [55]:
# Now we are ready to build a classifier.
# We will use Multinomial Naive Bayes as our model
from sklearn.naive_bayes import MultinomialNB

In [56]:
# Split data into training and test sets
# from sklearn.cross_validation import train_test_split  # deprecated in 0.18, removed in 0.20
from sklearn.model_selection import train_test_split
docs_train, docs_test, y_train, y_test = train_test_split(
    movie_tfidf, movie_train.target, test_size = 0.20, random_state = 12)
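
With `test_size = 0.20`, 2,000 documents split into 1,600 for training and 400 for testing. The sketch below reproduces those sizes on placeholder arrays (the data here are stand-ins, not the review matrix) and shows the optional `stratify` argument, which the notebook does not use, keeping the class balance identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 2,000 samples with perfectly balanced labels
X_demo = np.arange(2000).reshape(-1, 1)
y_demo = np.array([0, 1] * 1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=12)
print(X_tr.shape, X_te.shape)  # (1600, 1) (400, 1)

# Optional: stratify=y_demo forces the same class proportions in both parts
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=12, stratify=y_demo)
print(np.bincount(y_te_strat))  # [200 200]
```

Fixing `random_state` makes the split reproducible, so accuracy figures can be compared between runs.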

In [57]:
# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB().fit(docs_train, y_train)
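
In this notebook the vocabulary and IDF weights are fit on all 2,000 documents before the split, so the test documents have a small influence on the features. One possible alternative, sketched here on an invented eight-review corpus, is to split the raw texts first and chain the three steps in a `Pipeline`, so that everything is fit on the training part only. The step names `counts`, `tfidf`, and `nb` are arbitrary labels chosen for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus: 1 = positive, 0 = negative
texts = ["great fun", "loved it", "wonderful acting", "a joy to watch",
         "terrible plot", "boring and slow", "awful acting", "a waste of time"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts first, so the test reviews stay unseen
raw_train, raw_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=12, stratify=labels)

text_clf = Pipeline([
    ("counts", CountVectorizer()),   # vocabulary learned from raw_train only
    ("tfidf", TfidfTransformer()),   # IDF weights learned from raw_train only
    ("nb", MultinomialNB()),
])
text_clf.fit(raw_train, lab_train)

# predict() applies the same fitted transformations to the raw test texts
demo_pred = text_clf.predict(raw_test)
print(list(zip(raw_test, demo_pred)))
```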

In [58]:
# Predict labels for the test set and compute the accuracy
from sklearn.metrics import accuracy_score
y_pred = clf.predict(docs_test)
print("Accuracy: ")
print(accuracy_score(y_test, y_pred))

Accuracy: 
0.82


In [59]:
# Build the confusion matrix (rows: true class, columns: predicted class)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[175  31]
 [ 41 153]]
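
The confusion matrix contains everything needed to recompute the accuracy and to see which class is harder for the model. The sketch below copies the matrix printed above and derives accuracy, per-class recall, and per-class precision from it. Rows are true classes and columns are predicted classes, in the order `neg`, `pos`.

```python
import numpy as np

# Values copied from the confusion matrix printed above
cm_demo = np.array([[175, 31],
                    [41, 153]])

accuracy = np.trace(cm_demo) / cm_demo.sum()         # correct / total
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)      # correct / true count per class
precision = np.diag(cm_demo) / cm_demo.sum(axis=0)   # correct / predicted count per class

print(round(accuracy, 2))    # 0.82
print(np.round(recall, 3))   # [0.85  0.789]
print(np.round(precision, 3))  # [0.81  0.832]
```

The 400 test reviews hold 206 negative and 194 positive examples. The model recovers about 85% of the negative reviews but only about 79% of the positive ones.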


### Try the classifier on your own data


In [60]:
# Very short, made-up movie reviews
reviews_new = ['This movie was excellent', 'Absolute joy ride',
               'The acting was terrible', 'Tom Hardy shined through.',
               'This was certainly a movie', 'Two thumbs up', 'I fell asleep halfway through',
               "We can't wait for the sequel!!", '!', '?', 'I cannot recommend this highly enough',
               'instant classic.', 'Simon Abkarian was amazing. His performance was Oscar-worthy.']
reviews_new_counts = movie_vec.transform(reviews_new)
reviews_new_tfidf = tfidf_transformer.transform(reviews_new_counts)

In [61]:
# have classifier make a prediction
pred = clf.predict(reviews_new_tfidf)

In [62]:
# print out results
for review, category in zip(reviews_new, pred):
    print('%r => %s' % (review, movie_train.target_names[category]))

'This movie was excellent' => pos
'Absolute joy ride' => pos
'The acting was terrible' => neg
'Tom Hardy shined through.' => pos
'This was certainly a movie' => neg
'Two thumbs up' => neg
'I fell asleep halfway through' => neg
"We can't wait for the sequel!!" => neg
'!' => neg
'?' => neg
'I cannot recommend this highly enough' => pos
'instant classic.' => pos
'Simon Abkarian was amazing. His performance was Oscar-worthy.' => pos
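
A hard label hides how confident the model is, which matters for very short inputs. One way to look behind the label is `predict_proba`. The sketch below uses an invented four-review training set, not the movie review corpus, so the numbers only illustrate the mechanism: a text with no known words falls back to the class priors.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training set: 1 = positive, 0 = negative
train_texts = ["excellent film", "excellent acting", "terrible film", "terrible plot"]
train_labels = [1, 1, 0, 0]

toy_vec = CountVectorizer()
nb = MultinomialNB().fit(toy_vec.fit_transform(train_texts), train_labels)

new_texts = ["excellent", "zzz"]   # "zzz" is not in the vocabulary
proba = nb.predict_proba(toy_vec.transform(new_texts))

for text, row in zip(new_texts, proba):
    print(f"{text!r}: P(neg)={row[0]:.2f}, P(pos)={row[1]:.2f}")
# 'excellent': P(neg)=0.25, P(pos)=0.75
# 'zzz': P(neg)=0.50, P(pos)=0.50
```

Running `clf.predict_proba(reviews_new_tfidf)` on the reviews above would show in the same way which predictions are close calls.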
