# Lab2 - Movie Reviews Sentiment Analysis with Scikit-Learn

The focus of this notebook is on performing sentiment analysis using the scikit-learn package. Material from [this notebook](http://www.pitt.edu/~naraehan/presentation/Movie+Reviews+sentiment+analysis+with+Scikit-Learn.html) was used.

**At the end of this notebook, you will be able to**:
* load the training data, i.e., the movie reviews
* inspect the training data, i.e., the movie reviews
* extract features from the training data
* train and evaluate a Naive Bayes classifier (scikit-learn's *MultinomialNB*)
* apply the classifier to fake movie reviews

**If you want to learn more, you might find the following link useful:**
* [documentation on dataset loading](http://scikit-learn.org/stable/datasets/)

In [155]:
import sklearn
import numpy
import nltk
from nltk.corpus import stopwords
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

## Loading the dataset
We are first going to load and inspect the movie reviews dataset.
Rather than a corpus-specific reader, we are going to use a general-purpose function to load the dataset.
This is useful when you want to train the same algorithm on other datasets.
Let's first inspect what the help message of the function **load_files** states.

In [156]:
help(load_files)

Help on function load_files in module sklearn.datasets.base:

load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
    Load text files with categories as subfolder names.
    
    Individual samples are assumed to be files stored a two levels folder
    structure such as the following:
    
        container_folder/
            category_1_folder/
                file_1.txt
                file_2.txt
                ...
                file_42.txt
            category_2_folder/
                file_43.txt
                file_44.txt
                ...
    
    The folder names are used as supervised signal label names. The individual
    file names are not important.
    
    This function does not try to extract features into a numpy array or scipy
    sparse matrix. In addition, if load_content is false it does not try to
    load the files in memory.
    
    To use text files in a scikit

Ok, so the function requires the following structure in order for it to work:
* container_folder/
    * category_1_folder/ (e.g., 'pos')
        * file_1.txt
        * file_2.txt
        * ...
        * file_42.txt
    * category_2_folder/ (e.g., 'neg')
        * file_43.txt
        * file_44.txt
        * ...
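
The structure check can also be done programmatically. The following is a minimal sketch, not part of the original lab: the helper name `summarize_corpus_layout` is hypothetical, and a throwaway temporary folder with made-up file names stands in for a real corpus.

```python
import tempfile
from pathlib import Path


def summarize_corpus_layout(container_path):
    """Return {category_folder_name: number_of_files} for a two-level corpus folder."""
    layout = {}
    for category_dir in sorted(Path(container_path).iterdir()):
        if category_dir.is_dir():
            # Count only regular files directly inside the category folder
            layout[category_dir.name] = sum(1 for f in category_dir.iterdir() if f.is_file())
    return layout


# Build a tiny stand-in corpus (hypothetical file names) to try the helper on
with tempfile.TemporaryDirectory() as tmp:
    for category, texts in {"pos": ["great film"], "neg": ["dull plot", "bad acting"]}.items():
        folder = Path(tmp) / category
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"file_{i}.txt").write_text(text, encoding="utf-8")
    layout = summarize_corpus_layout(tmp)

print(layout)
```

Pointing the helper at the folder of a real corpus should list one entry per category folder. Top-level entries that are not folders are skipped.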

Let's check whether our **movie_reviews** corpus has this structure.

In [157]:
from nltk.corpus import movie_reviews

In [158]:
print('path to corpus on your computer', movie_reviews._root)

path to corpus on your computer /Users/marten/nltk_data/corpora/movie_reviews


Using the command line, Mac Finder, or the equivalent on Windows, please go to the movie_reviews folder and check whether the corpus has the required structure.

....

....

Hopefully, it does! Let's now load it using the function.

In [159]:
# Please change the path here if you want to train it on a different corpus!
moviedir = str(movie_reviews._root)
print(moviedir)

/Users/marten/nltk_data/corpora/movie_reviews


In [161]:
# loading all files as training data. 
movie_train = load_files(moviedir, shuffle=True)

## Inspecting dataset

How many files do we have?

In [162]:
len(movie_train.data)

2000

In [163]:
# target names ("classes") are automatically generated from subfolder names
movie_train.target_names

['neg', 'pos']

In [164]:
# Let's inspect one file
movie_train.data[0][:500]

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so cal"
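
The `b` prefix in the output shows that the review was loaded as raw bytes, which is what `load_files` returns when `encoding` is left at `None` (see the signature in the help message). Below is a minimal sketch of turning such a value into text. It assumes the files are UTF-8 encoded, which the notebook does not state.

```python
# A short bytes value standing in for movie_train.data[0]
raw_review = b"arnold schwarzenegger has been an icon for action enthusiasts"

# Decode bytes into a str; 'replace' substitutes undecodable bytes instead of raising
text_review = raw_review.decode("utf-8", errors="replace")

print(type(raw_review).__name__, "->", type(text_review).__name__)
print(text_review.split()[:3])
```

Alternatively, passing an `encoding` argument to `load_files` should yield strings directly.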

In [165]:
# first file is in "neg" folder
movie_train.filenames[0]

'/Users/marten/nltk_data/corpora/movie_reviews/neg/cv405_21868.txt'

In [167]:
# first file is a negative review and is mapped to 0 index 'neg' in target_names
movie_train.target[0]

0
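
To see how the documents are distributed over the classes, the integer labels can be counted per class. This is a minimal sketch with a small made-up label array standing in for `movie_train.target`.

```python
import numpy as np

# Hypothetical stand-ins for movie_train.target_names and movie_train.target
target_names = ["neg", "pos"]
target = np.array([0, 1, 1, 0, 1, 0])

# bincount counts how often each integer label occurs
counts_per_class = np.bincount(target, minlength=len(target_names))
class_balance = {name: int(count) for name, count in zip(target_names, counts_per_class)}
print(class_balance)
```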

## Extracting features from training data (see notebook Lab2-Feature representation.ipynb for more information)

In [171]:
# initialize the CountVectorizer, then turn the training documents into a document-term count matrix

movie_vec = CountVectorizer(min_df=2, 
                            tokenizer=nltk.word_tokenize,
                            stop_words=stopwords.words('english'))
movie_counts = movie_vec.fit_transform(movie_train.data)

In [172]:
# 'screen' is found in the corpus, mapped to index 19509
movie_vec.vocabulary_.get('screen')

19509

In [173]:
# Likewise, Mr. Steven Seagal is present...
movie_vec.vocabulary_.get('seagal')

19562

In [174]:
# huge dimensions! 2,000 documents, 25K unique terms. 
movie_counts.shape

(2000, 25138)
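
To make the mapping from words to column indices concrete, here is a minimal sketch on three made-up documents. It uses the default tokenizer of `CountVectorizer` rather than `nltk.word_tokenize`, so it runs without any NLTK data.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good movie", "bad movie", "good good film"]

# Default settings: lowercase, keep tokens of two or more word characters
toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index (terms are sorted alphabetically)
print(sorted(toy_vec.vocabulary_.items(), key=lambda item: item[1]))
# Each row is a document, each column holds the count of one term
print(toy_counts.toarray())

# min_df=2 keeps only terms that occur in at least two documents
pruned_vec = CountVectorizer(min_df=2).fit(toy_docs)
print(sorted(pruned_vec.vocabulary_))
```

With `min_df=2`, only `good` and `movie` survive here, which illustrates why the vocabulary of the movie reviews is smaller than the total number of distinct tokens in the corpus.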

In [177]:
# Convert raw frequency counts into TF-IDF values
tfidf_transformer = TfidfTransformer()
movie_tfidf = tfidf_transformer.fit_transform(movie_counts)

In [178]:
# Same dimensions, now with tf-idf values instead of raw frequency counts
movie_tfidf.shape

(2000, 25138)
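
What the transformer computes can be reproduced by hand. The sketch below assumes the default settings of `TfidfTransformer` (smoothed IDF, L2 row normalization), under which `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, with `n` the number of documents and `df(t)` the number of documents containing term `t`. The count matrix is made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up count matrix: 3 documents (rows) x 4 terms (columns)
toy_count_matrix = np.array([[0, 0, 1, 1],
                             [1, 0, 0, 1],
                             [0, 1, 2, 0]])

n_docs = toy_count_matrix.shape[0]
doc_freq = (toy_count_matrix > 0).sum(axis=0)       # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1     # smoothed inverse document frequency

weighted = toy_count_matrix * idf                   # term frequency times idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

sklearn_tfidf = TfidfTransformer().fit_transform(csr_matrix(toy_count_matrix)).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Terms that occur in many documents get a lower IDF and therefore contribute less to each document vector.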

## Training and testing a Naive Bayes classifier
To train the classifier, we will first split the data into train and test.

In [179]:
# Now ready to build a classifier. 
# We will use Multinomial Naive Bayes as our model
from sklearn.naive_bayes import MultinomialNB

In [180]:
# Split data into training and test sets
# from sklearn.cross_validation import train_test_split  # deprecated in 0.18
from sklearn.model_selection import train_test_split

We choose 80% training and 20% test. 

In [182]:
docs_train, docs_test, y_train, y_test = train_test_split(
    movie_tfidf, 
    movie_train.target, 
    test_size = 0.20) 
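
Because the split is random, the scores later in this notebook can change from run to run. Below is a minimal sketch of a reproducible, class-balanced split on made-up data. `random_state=42` is an arbitrary choice, and `stratify` is an optional extra that the split above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up feature matrix (10 documents, 2 features) and balanced labels
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0] * 5 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.20,      # 20% of the documents go to the test set
    random_state=42,     # fixed seed: the same split on every run
    stratify=toy_y)      # keep the class proportions equal in both parts

print(X_tr.shape, X_te.shape)
print(sorted(y_te.tolist()))
```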

One instance looks like this:

In [183]:
docs_train[0].toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

Its label is:

In [184]:
y_train[0]

0

In [185]:
# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB().fit(docs_train, y_train)

In [186]:
# Predict the labels for the test set
y_pred = clf.predict(docs_test)

In [190]:
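# Caution: movie_train.data[0] and movie_train.target[0] refer to the first
# document of the full dataset, whereas y_pred[0] is the prediction for the
# first document of the shuffled test split, so they need not be the same review.
print('one movie review', movie_train.data[0])
print('label (0=negative, 1=positive)', movie_train.target[0])
print('classifier predicted', y_pred[0])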

one movie review b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \ny
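
To print the review that actually belongs to a given test prediction, an index array can be split alongside the features and labels. This is a minimal sketch on made-up data. All `toy_` and `idx_` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_reviews = ["loved it", "hated it", "fine film", "dull plot",
               "great cast", "weak script", "superb", "awful"]
toy_labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])
toy_features = np.arange(len(toy_reviews)).reshape(-1, 1)  # placeholder features
indices = np.arange(len(toy_reviews))

# train_test_split splits any number of arrays consistently
(f_train, f_test,
 l_train, l_test,
 idx_train, idx_test) = train_test_split(
    toy_features, toy_labels, indices, test_size=0.25, random_state=0)

# idx_test[i] is the position in the full dataset of the i-th test document
for i, original_position in enumerate(idx_test):
    print(toy_reviews[original_position], '=> gold label', l_test[i])
```

With the movie reviews, `movie_train.data[idx_test[0]]` would then be the text that `y_pred[0]` was predicted for.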

See notebook **Lab2-Evaluation** for information on the evaluation. The micro-averaged recall of our trained classifier on the test set is:

In [193]:
sklearn.metrics.recall_score(y_true=y_test,
                             y_pred=y_pred,
                             average='micro')

0.845
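
For single-label classification, micro-averaged recall counts all correct predictions over all instances and therefore equals accuracy, while macro-averaged recall is the mean of the per-class recalls. Here is a minimal sketch on made-up labels.

```python
from sklearn.metrics import accuracy_score, recall_score

toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

micro = recall_score(toy_true, toy_pred, average='micro')
macro = recall_score(toy_true, toy_pred, average='macro')
accuracy = accuracy_score(toy_true, toy_pred)

# 3 of 5 predictions are correct, so micro recall and accuracy are both 0.6.
# Per-class recall is 1/2 for class 0 and 2/3 for class 1; macro recall is their mean.
print(micro, accuracy, round(macro, 4))
```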

We can also inspect the least and most important features. The left two columns show the lowest-weighted features (coefficient and word) and the right two columns show the highest-weighted ones. As you can see, punctuation marks and token fragments dominate both lists, so preprocessing the data might lead to much better results.

In [195]:
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
        
show_most_informative_features(movie_vec, clf)

	-10.4520	              		-4.9148	,              
	-10.4520	              		-5.1215	.              
	-10.4520	'94            		-6.2809	``             
	-10.4520	'bout          		-6.3436	's             
	-10.4520	'carry         		-6.7562	)              
	-10.4520	'comedy        		-6.7605	(              
	-10.4520	'end           		-6.8935	film           
	-10.4520	'ghetto        		-7.3433	movie          
	-10.4520	'matron        		-7.4024	one            
	-10.4520	'new           		-7.4101	n't            
	-10.4520	'romeo         		-7.6748	?              
	-10.4520	'rumble        		-7.7432	--             
	-10.4520	'secret        		-7.7566	:              
	-10.4520	't             		-7.7587	like           
	-10.4520	'three         		-7.8243	-              
	-10.4520	-1             		-7.8635	'              
	-10.4520	-but           		-7.8911	;              
	-10.4520	-ed            		-7.8997	story          
	-10.4520	-give          		-7.9143	life           
	-10.4520	-reviewed      		-7.9
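
In newer scikit-learn releases, `get_feature_names()` has been replaced by `get_feature_names_out()` and `MultinomialNB` no longer exposes `coef_`, so the function above may fail there. The following minimal sketch avoids both by reading `feature_log_prob_` directly. Note that it ranks words by the difference in log-probability between the two classes, which is a swapped-in criterion and not the one used above. The toy documents are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_docs = ["great great film", "wonderful great acting",
            "boring bad film", "bad bad plot"]
toy_labels = [1, 1, 0, 0]  # 0 = negative, 1 = positive

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(toy_docs), toy_labels)

# Fall back to the older method name when the newer one is not available
if hasattr(vec, "get_feature_names_out"):
    names = np.asarray(vec.get_feature_names_out())
else:
    names = np.asarray(vec.get_feature_names())

# log P(word | positive) - log P(word | negative); rows follow nb.classes_ = [0, 1]
log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
order = np.argsort(log_ratio)

most_negative = names[order[:2]]
most_positive = names[order[::-1][:2]]
print('most negative:', list(most_negative))
print('most positive:', list(most_positive))
```

Ranking by the difference between classes pushes words that are frequent in both classes, such as punctuation, towards the middle of the list instead of the top.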

## Applying the classifier to fake movie reviews
Now we can apply our classifier to new data.

In [201]:
# very short and fake movie reviews
reviews_new = ['This movie was excellent', 
               'Absolute joy ride', 
               'Steven Seagal was terrible', 
               'Steven Seagal shined through.', 
               'This was certainly a movie', 
               'Two thumbs up', 
               'I fell asleep halfway through', 
               "We can't wait for the sequel!!", 
               'I cannot recommend this highly enough', 
               'instant classic.', 
               'Steven Seagal was amazing.']
len(reviews_new)

11

In [202]:
# We re-use movie_vec and tfidf_transformer so the new reviews are transformed in the same way as the training data
reviews_new_counts = movie_vec.transform(reviews_new)
reviews_new_tfidf = tfidf_transformer.transform(reviews_new_counts)

In [203]:
reviews_new_tfidf.shape

(11, 25138)
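
The number of columns stays at 25138 because `transform` only counts words that are already in the fitted vocabulary. Words not seen during fitting are ignored. Here is a minimal sketch on made-up text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on made-up "training" documents
small_vec = CountVectorizer().fit(["good movie", "bad movie"])

# 'sequel' was never seen during fitting, so it gets no column
new_counts = small_vec.transform(["good sequel"])

print(sorted(small_vec.vocabulary_))   # ['bad', 'good', 'movie']
print(new_counts.toarray())            # [[0 1 0]]
```

This is one reason why very short reviews can be hard to classify: if few of their words are in the vocabulary, the classifier has little evidence to work with.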

In [204]:
# have classifier make a prediction
pred = clf.predict(reviews_new_tfidf)

In [205]:
# print out results (0 is negative and 1 is positive)
for review, predicted_label in zip(reviews_new, pred):
    print('%s => %s' % (review, predicted_label))

This movie was excellent => 1
Absolute joy ride => 1
Steven Seagal was terrible => 0
Steven Seagal shined through. => 0
This was certainly a movie => 0
Two thumbs up => 0
I fell asleep halfway through => 0
We can't wait for the sequel!! => 0
I cannot recommend this highly enough => 1
instant classic. => 1
Steven Seagal was amazing. => 0
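
The three steps (counting, TF-IDF weighting, classification) can also be bundled so that new text is always transformed exactly like the training data. Below is a minimal sketch using scikit-learn's `make_pipeline` on made-up training reviews. It also maps the numeric predictions back to label names and prints the class probabilities. All variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data; 0 = negative, 1 = positive
toy_train = ["excellent film", "wonderful acting", "terrible plot", "boring film"]
toy_target = [1, 1, 0, 0]
label_names = ["neg", "pos"]

sentiment_pipeline = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
sentiment_pipeline.fit(toy_train, toy_target)

toy_new = ["excellent acting", "terrible and boring"]
predicted = sentiment_pipeline.predict(toy_new)
probabilities = sentiment_pipeline.predict_proba(toy_new)

for review, label, probs in zip(toy_new, predicted, probabilities):
    # probs holds one probability per class, in the order of label_names
    print('%s => %s (P(neg)=%.2f, P(pos)=%.2f)' % (review, label_names[label], probs[0], probs[1]))
```

Looking at the probabilities rather than only the predicted label helps to spot reviews for which the classifier is close to undecided.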
