# Lab2 - Movie Reviews Sentiment Analysis with Scikit-Learn

The focus of this notebook is on performing sentiment analysis using the scikit-learn package. Material from [this notebook](http://www.pitt.edu/~naraehan/presentation/Movie+Reviews+sentiment+analysis+with+Scikit-Learn.html) was used.

**At the end of this notebook, you will be able to**:
* load the training data, i.e., the movie reviews
* inspect the training data, i.e., the movie reviews
* extract features from the training data
* train and evaluate a Naive Bayes classifier (*MultinomialNB*)
* apply the classifier to fake movie reviews
* create your own gold standard and evaluate a system on it

**If you want to learn more, you might find the following links useful:**
* [documentation on dataset loading](http://scikit-learn.org/stable/datasets/)
* [introduction to TF-IDF](https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3)

## Loading the dataset
We are first going to load and inspect the movie reviews dataset.
However, we are going to use a general method to load the dataset.
This is relevant for when you want to use other datasets as training data for the same algorithm.
Let's first inspect what the help message of the function **load_files** states.

In [None]:
import sklearn
import sklearn.metrics  # imported explicitly; later cells call sklearn.metrics.*
import numpy
from sklearn.datasets import load_files

In [None]:
help(load_files)

Ok, so the function requires the following structure in order for it to work:
* container_folder/
    * category_1_folder/ (e.g., 'pos')
        * file_1.txt
        * file_2.txt
        * ...
        * file_42.txt
    * category_2_folder/ (e.g., 'neg')
        * file_43.txt
        * file_44.txt
        * ...
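
To make the required layout concrete, here is a minimal sketch that builds a tiny corpus with this folder structure in a temporary directory and loads it with `load_files`. The review texts, the file names, and the variable names (`toy_reviews`, `toy_data`) are made up for illustration and are not part of the movie reviews corpus.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical two-class toy corpus (illustration only).
toy_reviews = {
    "pos": ["A wonderful film.", "Great acting and a great story."],
    "neg": ["A dull film.", "Terrible pacing and a weak story."],
}

with tempfile.TemporaryDirectory() as container:
    # One subfolder per category, one text file per document.
    for category, texts in toy_reviews.items():
        category_dir = Path(container) / category
        category_dir.mkdir()
        for i, text in enumerate(texts):
            (category_dir / f"file_{i}.txt").write_text(text, encoding="utf-8")

    # encoding="utf-8" decodes the files to str; without it, data holds bytes.
    toy_data = load_files(container, encoding="utf-8", shuffle=False)

print(toy_data.target_names)  # class names come from the subfolder names
print(len(toy_data.data))     # number of loaded documents
```

The same call should work on any corpus that follows this layout, which is why the notebook uses it instead of a corpus-specific reader.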

Let's check whether our **movie_reviews** corpus has this structure.

In [None]:
from nltk.corpus import movie_reviews

In [None]:
print('path to corpus on your computer', movie_reviews._root)

Using the command line, Mac Finder, or the equivalent on Windows, please go to the movie_reviews folder and check whether the corpus has the required structure.

....

....

Hopefully, it does! Let's now load it using the function.

In [None]:
# Please change the path here if you want to train it on a different corpus!
moviedir = str(movie_reviews._root)
print(moviedir)

In [None]:
# loading all files as training data. 
movie_train = load_files(moviedir, shuffle=True)

## Inspecting dataset

How many files do we have?

In [None]:
len(movie_train.data)

In [None]:
# target names ("classes") are automatically generated from subfolder names
movie_train.target_names

In [None]:
# Let's inspect one file
movie_train.data[0][:500]

In [None]:
# first file is in "neg" folder
movie_train.filenames[0]

In [None]:
# first file is a negative review and is mapped to 0 index 'neg' in target_names
movie_train.target[0]

## A detour: try out CountVectorizer & TF-IDF

In [None]:
# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Turn off pretty printing of jupyter notebook... it generates long lines
%pprint

In [None]:
import nltk

In [None]:
sents = ['A rose is a rose is a rose is a rose.',
         'Oh, what a fine day it is.',
        "It ain't over till it's over, I tell you!!"]

In [None]:
# Initialize a CountVectorizer to use NLTK's tokenizer instead of its 
# default one (which ignores punctuation and single-character tokens). 
# Minimum document frequency set to 1. 
foovec = CountVectorizer(min_df=1, tokenizer=nltk.word_tokenize)

In [None]:
# sents turned into sparse vector of word frequency counts
sents_counts = foovec.fit_transform(sents)
# foovec now contains vocab dictionary which maps unique words to indexes
foovec.vocabulary_

In [None]:
# sents_counts has a dimension of 3 (document count) by 19 (# of unique words)
sents_counts.shape

In [None]:
# this vector is small enough to view in full! 
sents_counts.toarray()
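
To see what the count matrix contains, the sketch below rebuilds a bag-of-words representation by hand. It is an illustration under simplifying assumptions, not scikit-learn's implementation: tokens are obtained with a plain `str.split()`, and the names `toy_sents`, `vocabulary`, and `count_matrix` are hypothetical. As in `CountVectorizer`, each unique term gets a column index in alphabetical order.

```python
from collections import Counter

# Hypothetical toy sentences, already lowercased and without punctuation.
toy_sents = ["a rose is a rose", "what a fine day it is"]
tokenized = [sent.split() for sent in toy_sents]

# Map every unique term to a column index, in alphabetical order.
unique_terms = sorted({term for doc in tokenized for term in doc})
vocabulary = {term: idx for idx, term in enumerate(unique_terms)}

# One row per document, one column per term, cell = raw frequency.
count_matrix = []
for doc in tokenized:
    term_counts = Counter(doc)
    count_matrix.append([term_counts.get(term, 0) for term in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row has as many cells as there are unique terms, so most cells are zero for realistic corpora. That is why scikit-learn returns a sparse matrix.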

In [None]:
# Convert raw frequency counts into TF-IDF (Term Frequency -- Inverse Document Frequency) values
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_counts)

In [None]:
# TF-IDF values
# raw counts have been normalized against document length, 
# terms that are found across many docs are weighted down
sents_tfidf.toarray()
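
The weighting can also be worked through by hand. The sketch below follows the formula that `TfidfTransformer` uses with its default settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The toy documents and all variable names are hypothetical.

```python
import math

# Hypothetical toy corpus; each document is already tokenized.
toy_docs = [["rose", "is", "rose"], ["day", "is", "fine"], ["rose", "day"]]
terms = sorted({term for doc in toy_docs for term in doc})
n_docs = len(toy_docs)

# Document frequency: in how many documents does each term occur?
doc_freq = {term: sum(term in doc for doc in toy_docs) for term in terms}

# Smoothed inverse document frequency (scikit-learn's default variant).
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in terms}

def tfidf_row(doc):
    """Return the L2-normalized tf-idf vector of one tokenized document."""
    raw = [doc.count(term) * idf[term] for term in terms]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

toy_tfidf = [tfidf_row(doc) for doc in toy_docs]
for doc, row in zip(toy_docs, toy_tfidf):
    print(doc, [round(value, 3) for value in row])
```

A term that occurs in only one document ('fine') receives a higher idf than a term that occurs in two ('is'), and every row has unit length after normalization.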

## Extracting features from training data
Instead of using the words themselves as features, we represent each movie review as a vector. The length of this vector equals the number of unique terms in all the movie reviews combined. Instead of the raw frequency of occurrence, each cell holds the term's tf-idf value (as shown above).

In [None]:
# initialize movie_vector object, and then turn movie train data into a vector 
movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize)
movie_counts = movie_vec.fit_transform(movie_train.data)

In [None]:
# 'screen' is found in the corpus, mapped to index 19637 (the exact index may differ on your installation)
movie_vec.vocabulary_.get('screen')

In [None]:
# Likewise, Mr. Steven Seagal is present...
movie_vec.vocabulary_.get('seagal')

In [None]:
# huge dimensions! 2,000 documents, 25K unique terms. 
movie_counts.shape

In [None]:
# Convert raw frequency counts into TF-IDF values
tfidf_transformer = TfidfTransformer()
movie_tfidf = tfidf_transformer.fit_transform(movie_counts)

In [None]:
# Same dimensions, now with tf-idf values instead of raw frequency counts
movie_tfidf.shape
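
As a side note, scikit-learn also offers `TfidfVectorizer`, which combines the counting and the tf-idf step. The sketch below checks on a hypothetical toy corpus (`toy_docs`) that both routes give the same matrix. It uses the default tokenizer rather than NLTK's, so it is a simplified illustration and not a drop-in replacement for the cells above.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy documents (illustration only).
toy_docs = ["a fine day", "a fine rose", "the rose is a rose"]

# Two steps: raw counts first, then tf-idf weighting.
toy_counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(toy_counts)

# One step: TfidfVectorizer does both with the same default settings.
one_step = TfidfVectorizer().fit_transform(toy_docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

The two-step version used in this notebook has the advantage that the intermediate counts stay available for inspection.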

## Training and testing a Naive Bayes classifier
To train the classifier, we will first split the data into train and test.

In [None]:
# Now ready to build a classifier. 
# We will use Multinomial Naive Bayes as our model
from sklearn.naive_bayes import MultinomialNB

In [None]:
# Split data into training and test sets
# from sklearn.cross_validation import train_test_split  # deprecated in 0.18
from sklearn.model_selection import train_test_split

We again choose 80% training and 20% test. 

In [None]:
docs_train, docs_test, y_train, y_test = train_test_split(
    movie_tfidf, 
    movie_train.target, 
    test_size = 0.20) 
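
Because the split is random, the accuracy reported further down will vary from run to run. A minimal sketch of a reproducible split, on hypothetical toy arrays (`X_toy`, `y_toy`): `random_state` fixes the shuffle and `stratify` keeps the class proportions equal in both parts. Both are optional parameters of `train_test_split`; the value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 instances with 2 features, 5 per class.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 5 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.20,    # 80% training, 20% test
    random_state=42,   # arbitrary seed, makes the split repeatable
    stratify=y_toy,    # same class balance in train and test
)

print(X_tr.shape, X_te.shape)
print(y_te)
```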

One instance looks like this:

In [None]:
docs_train[0]

Its label is:

In [None]:
y_train[0]

In [None]:
# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB().fit(docs_train, y_train)

In [None]:
# Predicting the Test set results, find accuracy
y_pred = clf.predict(docs_test)
sklearn.metrics.accuracy_score(y_test, y_pred)

## Applying classifier on fake movie reviews

In [None]:
# very short and fake movie reviews
reviews_new = ['This movie was excellent', 'Absolute joy ride', 
            'Steven Seagal was terrible', 'Steven Seagal shined through.', 
              'This was certainly a movie', 'Two thumbs up', 'I fell asleep halfway through', 
              "We can't wait for the sequel!!", '!', '?', 'I cannot recommend this highly enough', 
              'instant classic.', 'Steven Seagal was amazing. His performance was Oscar-worthy.']
len(reviews_new)

In [None]:
# We re-use movie_vec to transform it in the same way as the training data
reviews_new_counts = movie_vec.transform(reviews_new)
reviews_new_tfidf = tfidf_transformer.transform(reviews_new_counts)

In [None]:
# The dimensions for the 13 reviews are the same
reviews_new_tfidf.shape

In [None]:
# have classifier make a prediction
pred = clf.predict(reviews_new_tfidf)

In [None]:
# Note that the predictions are just an array with the classification result for the 13 reviews
print(pred)

In [None]:
# print out results
for review, category in zip(reviews_new, pred):
    print('%s => %s' % (review, movie_train.target_names[category]))
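
The predicted label alone does not show how confident the classifier is. `MultinomialNB` also provides `predict_proba`, which returns one probability per class. The sketch below uses a hypothetical toy training set and raw counts (no tf-idf) to keep it short. It also shows what happens to a review such as '!' or '?' when none of its tokens are in the vocabulary: the classifier can only fall back on the class priors.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: 1 = positive, 0 = negative.
toy_texts = ["great film", "great acting", "terrible film", "terrible plot"]
toy_labels = [1, 1, 0, 0]

toy_vec = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_vec.fit_transform(toy_texts), toy_labels)

# 'xyzzy' is not in the vocabulary, so its feature vector is all zeros.
new_texts = ["great plot", "xyzzy"]
proba = toy_clf.predict_proba(toy_vec.transform(new_texts))

for text, row in zip(new_texts, proba):
    # Columns follow the order of toy_clf.classes_.
    print(text, dict(zip(toy_clf.classes_.tolist(), row.round(3).tolist())))
```

With balanced classes, an out-of-vocabulary review gets equal probability for both classes, so its predicted label carries no real information.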

## Define your own gold standard and evaluate a system on it
You can make your own gold standard in the following way.

In [None]:
reviews_new = ['This movie was excellent', 'Absolute joy ride', 
            'Steven Seagal was terrible', 'Steven Seagal shined through.', 
              'This was certainly a movie', 'Two thumbs up', 'I fell asleep halfway through', 
              "We can't wait for the sequel!!", '!', '?', 'I cannot recommend this highly enough', 
              'instant classic.', 'Steven Seagal was amazing. His performance was Oscar-worthy.']

For each review in **reviews_new**, we determine whether it is positive (1) or negative (0).

In [None]:
gold_values = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]

We turn it into the data type needed by scikit-learn.

In [None]:
gold = numpy.array(gold_values, numpy.int64)
print(type(gold))
print(gold.shape)
print(gold.dtype)

In [None]:
# print out the gold standard
for review, category in zip(reviews_new, gold):
    print('%s => %s' % (review, movie_train.target_names[category]))

In [None]:
# We can now directly get the accuracy
sklearn.metrics.accuracy_score(gold, pred)

In [None]:
# We can now directly get the recall, precision and f1-score
print("recall:", sklearn.metrics.recall_score(gold, pred))
print("precision:", sklearn.metrics.precision_score(gold, pred))
print("f-measure:", sklearn.metrics.f1_score(gold, pred))

In [None]:
print("confusion-matrix:",sklearn.metrics.confusion_matrix(gold, pred))

In [None]:
print(sklearn.metrics.classification_report(gold, pred))
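
To see where these numbers come from, the sketch below computes accuracy, precision, recall and F1 by hand for the positive class (1). The labels and predictions (`gold_toy`, `pred_toy`) are hypothetical and are not the output of the classifier above.

```python
# Hypothetical gold labels and system predictions (1 = pos, 0 = neg).
gold_toy = [1, 1, 0, 1, 0, 0]
pred_toy = [1, 0, 0, 1, 1, 0]

pairs = list(zip(gold_toy, pred_toy))
tp = sum(g == 1 and p == 1 for g, p in pairs)  # true positives
fp = sum(g == 0 and p == 1 for g, p in pairs)  # false positives
fn = sum(g == 1 and p == 0 for g, p in pairs)  # false negatives
tn = sum(g == 0 and p == 0 for g, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # how many predicted positives are correct
recall = tp / (tp + fn)      # how many gold positives were found
f1 = 2 * precision * recall / (precision + recall)

print("confusion counts (tn, fp, fn, tp):", tn, fp, fn, tp)
print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
print("f-measure:", f1)
```

For binary labels 0 and 1, `sklearn.metrics.confusion_matrix` arranges these counts as `[[tn, fp], [fn, tp]]`.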