# Lab2 - Movie Reviews Sentiment Analysis with Scikit-Learn

This notebook shows how to perform sentiment analysis with the scikit-learn package. It is based on material from [this notebook](http://www.pitt.edu/~naraehan/presentation/Movie+Reviews+sentiment+analysis+with+Scikit-Learn.html).

**At the end of this notebook, you will be able to**:
* load the training data, i.e., the movie reviews
* inspect the training data
* extract features from the training data
* train and evaluate a Naive Bayes classifier (*MultinomialNB*)
* apply the classifier to fake movie reviews
* create your own gold standard

**If you want to learn more, you might find the following links useful:**
* [documentation on dataset loading](http://scikit-learn.org/stable/datasets/)
* [introduction to TF-IDF](https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3)

## Loading the dataset
We first load and inspect the movie reviews dataset.
We use a general loading method, so the same steps work when you want to train the same algorithm on a different dataset.
Let's first look at the help message of the function **load_files**.

In [136]:
import sklearn
import sklearn.metrics  # makes sklearn.metrics.* available in the evaluation cells
import numpy
from sklearn.datasets import load_files

In [7]:
help(load_files)

Help on function load_files in module sklearn.datasets.base:

load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
    Load text files with categories as subfolder names.
    
    Individual samples are assumed to be files stored a two levels folder
    structure such as the following:
    
        container_folder/
            category_1_folder/
                file_1.txt
                file_2.txt
                ...
                file_42.txt
            category_2_folder/
                file_43.txt
                file_44.txt
                ...
    
    The folder names are used as supervised signal label names. The individual
    file names are not important.
    
    This function does not try to extract features into a numpy array or scipy
    sparse matrix. In addition, if load_content is false it does not try to
    load the files in memory.
    
    To use text files in a scikit

Ok, so the function requires the following structure in order for it to work:
* container_folder/
    * category_1_folder/ (e.g., 'pos')
        * file_1.txt
        * file_2.txt
        * ...
        * file_42.txt
    * category_2_folder/ (e.g., 'neg')
        * file_43.txt
        * file_44.txt
        * ...
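
To make the required layout concrete, here is a minimal sketch that writes a tiny corpus to a temporary folder and loads it with **load_files**. The review texts and the file names (`review_0.txt`, ...) are made up for illustration; only the folder names matter, since they become the class labels.

```python
import os
import tempfile
from sklearn.datasets import load_files

# Hypothetical mini-corpus: two invented reviews per category.
toy_corpus = {
    "neg": ["a dull and lifeless film", "boring from start to finish"],
    "pos": ["a warm and funny film", "a joy to watch"],
}

with tempfile.TemporaryDirectory() as container:
    # Build container/neg/*.txt and container/pos/*.txt
    for category, texts in toy_corpus.items():
        os.makedirs(os.path.join(container, category))
        for i, text in enumerate(texts):
            path = os.path.join(container, category, f"review_{i}.txt")
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(text)

    # Without `encoding`, the documents come back as raw bytes.
    toy_bytes = load_files(container, shuffle=False)
    # With `encoding`, they are decoded to Python strings.
    toy_text = load_files(container, encoding="utf-8", shuffle=False)

print(toy_bytes.target_names)
print(type(toy_bytes.data[0]), type(toy_text.data[0]))
print(toy_text.target)
```

Passing `encoding` is optional. The cells below load the corpus without it, which is why the reviews are shown as bytes (`b"..."`).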

Let's check whether our **movie_reviews** corpus has this structure.

In [9]:
from nltk.corpus import movie_reviews

In [15]:
print('path to corpus on your computer', movie_reviews._root)

path to corpus on your computer /Users/marten/nltk_data/corpora/movie_reviews


Using the command line, Mac Finder, or the equivalent on Windows, please go to the movie_reviews folder and check whether the corpus has the required structure.

....

....

Hopefully, it does! Let's now load it using the function.

In [22]:
# Please change the path here if you want to train it on a different corpus!
moviedir = str(movie_reviews._root)
print(moviedir)

/Users/marten/nltk_data/corpora/movie_reviews


In [27]:
# loading all files as training data. 
movie_train = load_files(moviedir, shuffle=True)

## Inspecting dataset

How many files do we have?

In [25]:
len(movie_train.data)

2000

In [28]:
# target names ("classes") are automatically generated from subfolder names
movie_train.target_names

['neg', 'pos']

In [29]:
# Let's inspect one file
movie_train.data[0][:500]

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so cal"

In [30]:
# first file is in "neg" folder
movie_train.filenames[0]

'/Users/marten/nltk_data/corpora/movie_reviews/neg/cv405_21868.txt'

In [31]:
# first file is a negative review and is mapped to 0 index 'neg' in target_names
movie_train.target[0]

0

## A detour: try out CountVectorizer & TF-IDF

In [71]:
# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [72]:
# Turn off pretty printing of jupyter notebook... it generates long lines
%pprint

Pretty printing has been turned OFF


In [73]:
import nltk

In [74]:
sents = ['A rose is a rose is a rose is a rose.',
         'Oh, what a fine day it is.',
        "It ain't over till it's over, I tell you!!"]

In [75]:
# Initialize a CountVectorizer that uses NLTK's tokenizer instead of its
# default one (which drops punctuation and single-character tokens).
# Minimum document frequency set to 1.
foovec = CountVectorizer(min_df=1, tokenizer=nltk.word_tokenize)
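
For comparison, the sketch below shows what the default tokenizer does with the third sentence. It keeps only runs of two or more word characters, so punctuation, the pieces after an apostrophe, and one-letter words such as "I" disappear. Stop words are not removed unless `stop_words` is set.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "It ain't over till it's over, I tell you!!"

# build_tokenizer() returns the function CountVectorizer uses to split text
# when no custom `tokenizer` is passed. Lowercasing happens in a separate step.
default_tokenize = CountVectorizer().build_tokenizer()
default_tokens = default_tokenize(sentence)
print(default_tokens)
```

With NLTK's `word_tokenize`, tokens such as `'!'`, `','`, `"n't"` and `'i'` are kept, which is why they appear in the vocabulary below.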

In [76]:
# sents turned into a sparse matrix of word frequency counts
sents_counts = foovec.fit_transform(sents)
# foovec now contains vocab dictionary which maps unique words to indexes
foovec.vocabulary_

{'a': 4, 'rose': 14, 'is': 9, '.': 3, 'oh': 12, ',': 2, 'what': 17, 'fine': 7, 'day': 6, 'it': 10, 'ai': 5, "n't": 11, 'over': 13, 'till': 16, "'s": 1, 'i': 8, 'tell': 15, 'you': 18, '!': 0}

In [77]:
# sents_counts has a dimension of 3 (document count) by 19 (# of unique words)
sents_counts.shape

(3, 19)

In [78]:
# this matrix is small enough to view in full!
sents_counts.toarray()

array([[0, 0, 0, 1, 4, 0, 0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
       [2, 1, 1, 0, 0, 1, 0, 0, 1, 0, 2, 1, 0, 2, 0, 1, 1, 0, 1]],
      dtype=int64)
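
The columns of the count matrix are ordered by the indexes in `vocabulary_`, which makes the raw array hard to read. The following sketch labels each column with its term. It assumes the default tokenizer (to avoid depending on NLTK), so its vocabulary is smaller than the one shown above.

```python
from sklearn.feature_extraction.text import CountVectorizer

sents = ['A rose is a rose is a rose is a rose.',
         'Oh, what a fine day it is.',
         "It ain't over till it's over, I tell you!!"]

# Assumption: default tokenizer, so the vocabulary differs from the NLTK one above.
vec = CountVectorizer()
counts = vec.fit_transform(sents).toarray()

# vocabulary_ maps term -> column index; sort terms by index to label columns.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

for row in counts:
    # Keep only the terms that actually occur in this sentence.
    print({term: int(n) for term, n in zip(terms, row) if n > 0})
```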

In [80]:
# Convert raw frequency counts into TF-IDF (Term Frequency -- Inverse Document Frequency) values
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_counts)

In [81]:
# TF-IDF values
# each row is scaled to unit (L2) length, so document length no longer matters,
# and terms that are found across many docs are weighted down
sents_tfidf.toarray()

array([[0.        , 0.        , 0.        , 0.13650997, 0.54603988,
        0.        , 0.        , 0.        , 0.        , 0.40952991,
        0.        , 0.        , 0.        , 0.        , 0.71797683,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.28969526, 0.28969526, 0.28969526,
        0.        , 0.38091445, 0.38091445, 0.        , 0.28969526,
        0.28969526, 0.        , 0.38091445, 0.        , 0.        ,
        0.        , 0.        , 0.38091445, 0.        ],
       [0.47282517, 0.23641258, 0.17979786, 0.        , 0.        ,
        0.23641258, 0.        , 0.        , 0.23641258, 0.        ,
        0.35959573, 0.23641258, 0.        , 0.47282517, 0.        ,
        0.23641258, 0.23641258, 0.        , 0.23641258]])
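
To see where these numbers come from, the sketch below recomputes the values by hand from the count matrix printed earlier. It assumes the default `TfidfTransformer` settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Count matrix copied from the sents_counts output above (3 documents x 19 terms).
counts = np.array([
    [0, 0, 0, 1, 4, 0, 0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
    [2, 1, 1, 0, 0, 1, 0, 0, 1, 0, 2, 1, 0, 2, 0, 1, 1, 0, 1],
])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
weighted = counts * idf                           # term frequency times IDF
# Scale every row to unit L2 length.
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))

# Nonzero entries of the first sentence: '.', 'a', 'is', 'rose'
print(manual_tfidf[0, [3, 4, 9, 14]].round(8))
```

In the first sentence, 'a' and 'rose' both occur four times, but 'rose' gets the higher weight because it appears in only one of the three sentences.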

## Extracting features from training data
Instead of using the words themselves as features, we represent each movie review as a vector. The length of this vector equals the number of distinct terms in all the movie reviews combined. Instead of the raw frequency of occurrence, each cell holds the term's tf-idf value (as shown above).

In [82]:
# initialize the movie_vec object, and then turn the movie train data into a count matrix
movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize)
movie_counts = movie_vec.fit_transform(movie_train.data)

In [83]:
# 'screen' is found in the corpus, mapped to index 19603
movie_vec.vocabulary_.get('screen')

19603

In [84]:
# Likewise, Mr. Steven Seagal is present...
movie_vec.vocabulary_.get('seagal')

19656

In [85]:
# huge dimensions! 2,000 documents, 25K unique terms. 
movie_counts.shape

(2000, 25279)

In [86]:
# Convert raw frequency counts into TF-IDF values
tfidf_transformer = TfidfTransformer()
movie_tfidf = tfidf_transformer.fit_transform(movie_counts)

In [87]:
# Same dimensions, now with tf-idf values instead of raw frequency counts
movie_tfidf.shape

(2000, 25279)
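
The two steps used here (counting, then TF-IDF weighting) can also be done in one step with scikit-learn's `TfidfVectorizer`. The sketch below checks on a made-up mini-corpus that both routes give the same matrix. The documents are hypothetical stand-ins for `movie_train.data`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical mini-corpus standing in for movie_train.data.
docs = ["a great film with a great cast",
        "a dull film with a weak plot",
        "great plot , weak cast"]

# Two steps, as in the cells above.
two_step_counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(two_step_counts)

# One step with TfidfVectorizer (same default settings).
one_step = TfidfVectorizer().fit_transform(docs)

print(two_step.shape, one_step.shape)
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

Keeping the two steps separate, as this notebook does, is convenient when you also want to inspect the raw counts.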

## Training and testing a Naive Bayes classifier
To train and evaluate the classifier, we first split the data into a training set and a test set.

In [104]:
# Now ready to build a classifier.
# We will use Multinomial Naive Bayes as our model
from sklearn.naive_bayes import MultinomialNB

In [92]:
# Split data into training and test sets
# from sklearn.cross_validation import train_test_split  # deprecated in 0.18
from sklearn.model_selection import train_test_split

We again choose 80% training and 20% test. 

In [97]:
docs_train, docs_test, y_train, y_test = train_test_split(
    movie_tfidf, 
    movie_train.target, 
    test_size = 0.20) 
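
The split above is random, so the exact accuracy reported further down can change from run to run. The sketch below shows one way to make a split repeatable. The `random_state=42` value and the `stratify` argument are assumptions for illustration and are not used in the original cell. The arrays are toy stand-ins for `movie_tfidf` and `movie_train.target`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: 10 feature rows and balanced 0/1 labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```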

One instance looks like this:

In [99]:
docs_train[0]

<1x25279 sparse matrix of type '<class 'numpy.float64'>'
	with 310 stored elements in Compressed Sparse Row format>

Its label is:

In [106]:
y_train[0]

0

In [107]:
# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB().fit(docs_train, y_train)

In [115]:
# Predict the labels of the test set and compute the accuracy
y_pred = clf.predict(docs_test)
sklearn.metrics.accuracy_score(y_test, y_pred)

0.835
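
The three steps (counting, TF-IDF weighting, Naive Bayes) can be chained with scikit-learn's `Pipeline`, so that raw text goes in and a label comes out. This is a minimal sketch on an invented toy training set, not the setup used in this notebook. The step names `"counts"`, `"tfidf"` and `"nb"` are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training set; 0 = neg, 1 = pos as in movie_train.target_names.
train_docs = ["an excellent and wonderful film",
              "great acting and a wonderful story",
              "an awful and boring film",
              "terrible acting and a boring story"]
train_labels = [1, 1, 0, 0]

sentiment_clf = Pipeline([
    ("counts", CountVectorizer()),   # text -> term counts
    ("tfidf", TfidfTransformer()),   # counts -> TF-IDF values
    ("nb", MultinomialNB()),         # TF-IDF values -> class label
])
sentiment_clf.fit(train_docs, train_labels)

toy_pred = sentiment_clf.predict(["wonderful", "boring"])
print(toy_pred)
```

One possible benefit of this design is that the vocabulary and the IDF weights are learned from the training documents only, whereas the cells above compute them on all 2,000 reviews before splitting.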

## Applying classifier on fake movie reviews

In [127]:
# very short and fake movie reviews
reviews_new = ['This movie was excellent', 'Absolute joy ride', 
            'Steven Seagal was terrible', 'Steven Seagal shined through.', 
              'This was certainly a movie', 'Two thumbs up', 'I fell asleep halfway through', 
              "We can't wait for the sequel!!", '!', '?', 'I cannot recommend this highly enough', 
              'instant classic.', 'Steven Seagal was amazing. His performance was Oscar-worthy.']
len(reviews_new)

13

In [128]:
# We re-use movie_vec to transform it in the same way as the training data
reviews_new_counts = movie_vec.transform(reviews_new)
reviews_new_tfidf = tfidf_transformer.transform(reviews_new_counts)

In [129]:
# 13 rows (one per review) and the same 25279 columns as the training matrix
reviews_new_tfidf.shape

(13, 25279)

In [130]:
# have classifier make a prediction
pred = clf.predict(reviews_new_tfidf)

In [131]:
# Note that the predictions are just an array with the class result for the 13 reviews
print(pred)

[1 1 0 0 0 1 0 0 0 0 1 1 0]


In [133]:
# print out results
for review, category in zip(reviews_new, pred):
    print('%s => %s' % (review, movie_train.target_names[category]))

This movie was excellent => pos
Absolute joy ride => pos
Steven Seagal was terrible => neg
Steven Seagal shined through. => neg
This was certainly a movie => neg
Two thumbs up => pos
I fell asleep halfway through => neg
We can't wait for the sequel!! => neg
! => neg
? => neg
I cannot recommend this highly enough => pos
instant classic. => pos
Steven Seagal was amazing. His performance was Oscar-worthy. => neg
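
Very short reviews give the classifier little to work with. Words that are not in the vocabulary learned from the training data contribute nothing at all. The sketch below uses an invented toy training set to show the extreme case. When no word is known, the row is empty and Multinomial Naive Bayes falls back on the class priors. Whether this explains any particular prediction above is not checked here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set with more negative than positive reviews.
train_docs = ["awful film", "boring plot", "dull acting", "wonderful film"]
train_labels = [0, 0, 0, 1]

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(train_docs), train_labels)

# None of these words was seen during fitting, so the row has no nonzero entries.
unseen = vec.transform(["Oscar-worthy performance"])
print(unseen.nnz)

# With no evidence, the predicted probabilities equal the class priors.
unseen_proba = nb.predict_proba(unseen)
print(unseen_proba)
print(np.exp(nb.class_log_prior_))
```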


## Define your own gold standard
You can make your own gold standard in the following way.

In [134]:
reviews_new = ['This movie was excellent', 'Absolute joy ride', 
            'Steven Seagal was terrible', 'Steven Seagal shined through.', 
              'This was certainly a movie', 'Two thumbs up', 'I fell asleep halfway through', 
              "We can't wait for the sequel!!", '!', '?', 'I cannot recommend this highly enough', 
              'instant classic.', 'Steven Seagal was amazing. His performance was Oscar-worthy.']

For each review in **reviews_new**, we determine whether it is positive (1) or negative (0).

In [138]:
gold_values = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]

We turn it into the data type needed by scikit-learn.

In [139]:
gold = numpy.array(gold_values, numpy.int64)
print(type(gold))
print(gold.shape)
print(gold.dtype)

<class 'numpy.ndarray'>
(13,)
int64


In [140]:
# print out the gold standard
for review, category in zip(reviews_new, gold):
    print('%s => %s' % (review, movie_train.target_names[category]))

This movie was excellent => pos
Absolute joy ride => pos
Steven Seagal was terrible => neg
Steven Seagal shined through. => pos
This was certainly a movie => pos
Two thumbs up => neg
I fell asleep halfway through => pos
We can't wait for the sequel!! => neg
! => neg
? => neg
I cannot recommend this highly enough => neg
instant classic. => pos
Steven Seagal was amazing. His performance was Oscar-worthy. => pos


In [142]:
# We can now directly get the accuracy
sklearn.metrics.accuracy_score(gold, pred)

0.5384615384615384
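
The accuracy is simply the fraction of reviews for which the prediction equals the gold label. A small sketch that recomputes it with plain NumPy, using the gold and predicted values copied from the cells above:

```python
import numpy as np

# Values copied from the cells above.
gold = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1])
pred = np.array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0])

n_correct = int((gold == pred).sum())   # reviews where prediction matches gold
accuracy = n_correct / len(gold)
print(n_correct, len(gold), accuracy)
```

Here 7 of the 13 predictions match the gold standard.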

In [143]:
# We can now directly get the recall, precision and f1-score for the positive class (label 1)
print("recall:",sklearn.metrics.recall_score(gold, pred))
print("precision:",sklearn.metrics.precision_score(gold, pred))
print("f-measure:",sklearn.metrics.f1_score(gold, pred))

recall: 0.42857142857142855
precision: 0.6
f-measure: 0.5


In [144]:
print("confusion-matrix:",sklearn.metrics.confusion_matrix(gold, pred))

confusion-matrix: [[4 2]
 [4 3]]
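
In the confusion matrix, rows are gold labels and columns are predicted labels, both in the order neg (0), pos (1). The sketch below unpacks the four cells and derives precision, recall and f-measure for the positive class from them, using the gold and predicted values copied from the cells above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Values copied from the cells above.
gold = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1])
pred = np.array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0])

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = confusion_matrix(gold, pred).ravel()

precision = tp / (tp + fp)   # of the reviews predicted pos, how many are pos?
recall = tp / (tp + fn)      # of the reviews that are pos, how many were found?
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, f1)
```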


In [145]:
print(sklearn.metrics.classification_report(gold, pred))

              precision    recall  f1-score   support

           0       0.50      0.67      0.57         6
           1       0.60      0.43      0.50         7

   micro avg       0.54      0.54      0.54        13
   macro avg       0.55      0.55      0.54        13
weighted avg       0.55      0.54      0.53        13
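
The last rows of the report average the per-class scores in different ways. As a sketch for the f1-score column: the macro average treats both classes equally, while the weighted average weights each class by its support (the number of gold items in that class). The gold and predicted values are copied from the cells above.

```python
import numpy as np
from sklearn.metrics import f1_score

# Values copied from the cells above.
gold = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1])
pred = np.array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0])

per_class_f1 = f1_score(gold, pred, average=None)   # one value per class: [neg, pos]
support = np.bincount(gold)                         # gold items per class

macro_f1 = per_class_f1.mean()                            # every class counts equally
weighted_f1 = np.average(per_class_f1, weights=support)   # classes weighted by support

print(per_class_f1.round(2), round(macro_f1, 2), round(weighted_f1, 2))
```

For single-label data like this, the micro average equals the accuracy.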

