# Sentiment Analysis on Movie Reviews

This notebook trains a neural network to classify movie reviews as positive or negative.

---

<h1>Content<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Import" data-toc-modified-id="Data-Import-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Import</a></span></li><li><span><a href="#Data-Preprocessing" data-toc-modified-id="Data-Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Preprocessing</a></span></li><li><span><a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Feature Engineering</a></span></li><li><span><a href="#Feature-Selection" data-toc-modified-id="Feature-Selection-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Feature Selection</a></span></li><li><span><a href="#Model-Architecture" data-toc-modified-id="Model-Architecture-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Model Architecture</a></span></li><li><span><a href="#Model-Training" data-toc-modified-id="Model-Training-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Model Training</a></span></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Model Evaluation</a></span><ul class="toc-item"><li><span><a href="#Accuracy-&amp;-Loss" data-toc-modified-id="Accuracy-&amp;-Loss-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Accuracy &amp; Loss</a></span></li><li><span><a href="#Error-Analysis" data-toc-modified-id="Error-Analysis-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Error Analysis</a></span></li></ul></li><li><span><a href="#Model-Application" data-toc-modified-id="Model-Application-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Model Application</a></span><ul class="toc-item"><li><span><a href="#Test-Predictions" data-toc-modified-id="Test-Predictions-8.1"><span class="toc-item-num">8.1&nbsp;&nbsp;</span>Test Predictions</a></span></li><li><span><a href="#Custom-Reviews" data-toc-modified-id="Custom-Reviews-8.2"><span class="toc-item-num">8.2&nbsp;&nbsp;</span>Custom 
Reviews</a></span></li></ul></li></ul></div>

In [37]:
import pandas as pd
import numpy as np
import re
import os
from IPython.display import HTML

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text 
from sklearn.decomposition import PCA

from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout  # used in the model architecture
from tensorflow.keras import optimizers             # used when compiling the model

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import words
from nltk.corpus import wordnet 
allEnglishWords = words.words() + [w for w in wordnet.words()]
allEnglishWords = np.unique([x.lower() for x in allEnglishWords])

import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

ModuleNotFoundError: No module named 'tensorflow'

---

## Data Import
First, we need to import the data.

In [11]:
path = "data/"
positiveFiles = [x for x in os.listdir(path+"train/pos/") if x.endswith(".txt")]
negativeFiles = [x for x in os.listdir(path+"train/neg/") if x.endswith(".txt")]
testFiles = [x for x in os.listdir(path+"test/") if x.endswith(".txt")]

In [12]:
positiveReviews, negativeReviews, testReviews = [], [], []
for pfile in positiveFiles:
    with open(path+"train/pos/"+pfile, encoding="latin1") as f:
        positiveReviews.append(f.read())
for nfile in negativeFiles:
    with open(path+"train/neg/"+nfile, encoding="latin1") as f:
        negativeReviews.append(f.read())
for tfile in testFiles:
    with open(path+"test/"+tfile, encoding="latin1") as f:
        testReviews.append(f.read())

In [13]:
reviews = pd.concat([
    pd.DataFrame({"review":positiveReviews, "label":1, "file":positiveFiles}),
    pd.DataFrame({"review":negativeReviews, "label":0, "file":negativeFiles}),
    pd.DataFrame({"review":testReviews, "label":-1, "file":testFiles})
], ignore_index=True).sample(frac=1, random_state=1)
reviews.head()

| | review | label | file |
|---|---|---|---|
| 21939 | "National Lampoon Goes to the Movies" is the w... | 0 | 7277_1.txt |
| 24113 | I can't believe that so much talent can be was... | 0 | 1073_2.txt |
| 4633 | This is a wonderful film. The non-stop patter ... | 1 | 10574_10.txt |
| 17240 | Did anyone who was making this movie, particul... | 0 | 1085_2.txt |
| 4894 | While a bit preachy on the topic of progress a... | 1 | 8127_8.txt |


With everything in a single DataFrame, we split the labeled reviews into a training set (60%) and a validation set (40%). The unlabeled reviews (label -1) form the test set.

In [14]:
reviews = reviews[["review", "label", "file"]].sample(frac=1, random_state=1)
train = reviews[reviews.label!=-1].sample(frac=0.6, random_state=1)
valid = reviews[reviews.label!=-1].drop(train.index)
test = reviews[reviews.label==-1]

In [15]:
print(train.shape)
print(valid.shape)
print(test.shape)

(15000, 3)
(10000, 3)
(2, 3)


In [16]:
HTML(train.review.iloc[0])

---

## Data Preprocessing
The next step is data preprocessing. The following class follows the `fit`/`transform` interface of a typical scikit-learn vectorizer.

It can perform the following operations:
* Strip HTML tags and discard non-alphanumeric characters.
* Convert everything to lower case.
* Stem all words with `PorterStemmer`, then map each stem back to the most frequent word that shares it.
* Discard non-English words (off by default; the corresponding code is commented out in this version).
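
The stem-to-word mapping from the third step can be sketched on a toy corpus. This is a simplified illustration rather than the class defined below: `toy_stem` is a hypothetical suffix stripper standing in for NLTK's `PorterStemmer`, the sample sentences are made up, and words with an unknown stem are simply dropped.

```python
import re
from collections import Counter, defaultdict

def toy_stem(word):
    """Hypothetical stand-in for PorterStemmer: strip a few common suffixes."""
    for suffix in ("ments", "ment", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text):
    """Drop HTML tags and non-alphanumeric characters, then lower-case."""
    text = re.sub(r"<.*?>", " ", text)
    text = re.sub(r"[^a-zA-Z0-9 ]+", " ", text)
    return text.lower()

def fit_stem_map(texts):
    """Map every stem to the most frequent surface form that produces it."""
    counts = Counter(word for t in texts for word in clean(t).split())
    by_stem = defaultdict(Counter)
    for word, n in counts.items():
        by_stem[toy_stem(word)][word] = n
    return {stem: forms.most_common(1)[0][0] for stem, forms in by_stem.items()}

def transform(texts, stem_map):
    """Replace each word by its stem's representative; drop unknown stems."""
    out = []
    for t in texts:
        stems = [toy_stem(w) for w in clean(t).split()]
        out.append(" ".join(stem_map[s] for s in stems if s in stem_map))
    return out

corpus = [
    "Disappointed!<br />I was so disappointed.",
    "What a disappointing ending, truly disappointed.",
]
stem_map = fit_stem_map(corpus)
result = transform(["It disappoints"], stem_map)
print(stem_map["disappoint"], result)
```

Mapping stems back to real words keeps the vocabulary readable, which helps when inspecting the selected features later on.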

In [17]:
class Preprocessor(object):
    ''' Preprocess data for NLP tasks. '''

    def __init__(self, alpha=True, lower=True, stemmer=True, english=False):
        self.alpha = alpha
        self.lower = lower
        self.stemmer = stemmer
        self.english = english
        
        self.uniqueWords = None
        self.uniqueStems = None
        
    def fit(self, texts):
        texts = self._doAlways(texts)

        allwords = pd.DataFrame({"word": np.concatenate(texts.apply(lambda x: x.split()).values)})
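        self.uniqueWords = allwords.groupby(["word"]).size().rename("count").reset_index()
        # Keep only words that occur more than once; copy to avoid writing to a slice.
        self.uniqueWords = self.uniqueWords[self.uniqueWords["count"]>1].copy()
        if self.stemmer:
            porter = PorterStemmer()  # reuse one stemmer instead of creating one per word
            self.uniqueWords["stem"] = self.uniqueWords.word.apply(porter.stem).values
            # Sort so that the most frequent word of each stem comes first.
            self.uniqueWords.sort_values(["stem", "count"], inplace=True, ascending=False)
            self.uniqueStems = self.uniqueWords.groupby("stem").first()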
        
        #if self.english: self.words["english"] = np.in1d(self.words["mode"], allEnglishWords)
        print("Fitted.")
            
    def transform(self, texts):
        texts = self._doAlways(texts)
        if self.stemmer:
            allwords = np.concatenate(texts.apply(lambda x: x.split()).values)
            uniqueWords = pd.DataFrame(index=np.unique(allwords))
            uniqueWords["stem"] = pd.Series(uniqueWords.index).apply(PorterStemmer().stem).values
            uniqueWords["mode"] = uniqueWords.stem.apply(lambda x: self.uniqueStems.loc[x, "word"] if x in self.uniqueStems.index else "")
            texts = texts.apply(lambda x: " ".join([uniqueWords.loc[y, "mode"] for y in x.split()]))
        #if self.english: texts = self.words.apply(lambda x: " ".join([y for y in x.split() if self.words.loc[y,"english"]]))
        print("Transformed.")
        return(texts)

    def fit_transform(self, texts):
        # fit() and transform() each clean the texts themselves.
        self.fit(texts)
        return self.transform(texts)
    
    def _doAlways(self, texts):
        # Remove parts between <>'s
        texts = texts.apply(lambda x: re.sub('<.*?>', ' ', x))
        # Keep letters and digits only.
        if self.alpha: texts = texts.apply(lambda x: re.sub('[^a-zA-Z0-9 ]+', ' ', x))
        # Set everything to lower case
        if self.lower: texts = texts.apply(lambda x: x.lower())
        return texts  

In [18]:
train.head()

| | review | label | file |
|---|---|---|---|
| 6011 | This documentary explores a story covered in P... | 1 | 1251_9.txt |
| 9653 | Released two years before I was born, this Osc... | 1 | 11396_10.txt |
| 15040 | THE ZOMBIE CHRONICLES <br /><br />Aspect ratio... | 0 | 94_1.txt |
| 6029 | it's amazing that so many people that i know h... | 1 | 8654_9.txt |
| 9729 | This final entry in George Lucas's STAR WARS m... | 1 | 820_10.txt |


In [19]:
preprocess = Preprocessor(alpha=True, lower=True, stemmer=True)

In [20]:
%%time
trainX = preprocess.fit_transform(train.review)
validX = preprocess.transform(valid.review)

Fitted.
Transformed.
Transformed.
CPU times: user 1min 16s, sys: 2.79 s, total: 1min 19s
Wall time: 1min 21s


In [21]:
trainX.head()

6011     this documentary explore a story cover in pilg...
9653     released two years before i was born this osca...
15040    the zombie chronicles aspect ratio 1 33 1 nu v...
6029     it s amazing that so many people that i know h...
9729     this finally entry in george lucas s star war ...
Name: review, dtype: object

In [22]:
print(preprocess.uniqueWords.shape)
preprocess.uniqueWords[preprocess.uniqueWords.word.str.contains("disappoint")]

(38298, 3)


| | word | count | stem |
|---|---|---|---|
| 15173 | disappointingly | 9 | disappointingli |
| 15171 | disappointed | 527 | disappoint |
| 15174 | disappointment | 271 | disappoint |
| 15172 | disappointing | 235 | disappoint |
| 15170 | disappoint | 62 | disappoint |
| 15177 | disappoints | 19 | disappoint |
| 15176 | disappointments | 14 | disappoint |


In [23]:
print(preprocess.uniqueStems.shape)
preprocess.uniqueStems[preprocess.uniqueStems.word.str.contains("disappoint")]

(25381, 2)


| stem | word | count |
|---|---|---|
| disappoint | disappointed | 527 |
| disappointingli | disappointingly | 9 |


---

## Feature Engineering
Next, we take the preprocessed texts as input and compute their TF-IDF vectors ([info](http://www.tfidf.com)). We keep the 10,000 most frequent terms as features.
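
As a rough illustration of what a TF-IDF weight measures, here is a minimal sketch of the textbook formula, term frequency times `log(N / document frequency)`, on three made-up documents. It is not what `TfidfVectorizer` computes exactly: scikit-learn uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["great movie great cast", "bad movie", "great fun"]
weights = tfidf(docs)
for doc, w in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

In the second document, "bad" receives a higher weight than "movie" because "movie" also appears in another document and is therefore less distinctive.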

In [24]:
stop_words = text.ENGLISH_STOP_WORDS.union(["thats","weve","dont","lets","youre","im","thi","ha",
    "wa","st","ask","want","like","thank","know","susan","ryan","say","got","ought","ive","theyre"])
# list(): recent scikit-learn versions expect stop_words as a list, not a frozenset
tfidf = TfidfVectorizer(min_df=2, max_features=10000, stop_words=list(stop_words)) #, ngram_range=(1,3)

In [25]:
%%time
trainX = tfidf.fit_transform(trainX).toarray()
validX = tfidf.transform(validX).toarray()

CPU times: user 4.59 s, sys: 892 ms, total: 5.49 s
Wall time: 6.16 s


In [26]:
print(trainX.shape)
print(validX.shape)

(15000, 10000)
(10000, 10000)


In [27]:
trainY = train.label
validY = valid.label

In [28]:
print(trainX.shape, trainY.shape)
print(validX.shape, validY.shape)

(15000, 10000) (15000,)
(10000, 10000) (10000,)


---

## Feature Selection
Next, we take the 10,000-dimensional TF-IDF vectors as input and keep 2,000 dimensions: the 1,000 most positively and the 1,000 most negatively correlated with the sentiment label. The corresponding words, printed below, make sense.
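
The selection rule can be sketched in plain Python on a toy example. The feature values and labels here are invented for illustration, and `pearson` and `select_features` are hypothetical helper names, not part of this notebook.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0

def select_features(columns, labels, k):
    """Indices of the k most positively and k most negatively correlated columns."""
    corr = [pearson(col, labels) for col in columns]
    order = sorted(range(len(columns)), key=lambda i: -corr[i])
    return order[:k] + order[-k:], corr

labels = [1, 1, 0, 0]  # 1 = positive review, 0 = negative review
features = {
    "great": [0.9, 0.7, 0.0, 0.1],
    "movie": [0.3, 0.2, 0.3, 0.2],
    "bad":   [0.0, 0.1, 0.8, 0.9],
}
names = list(features)
selected, corr = select_features(list(features.values()), labels, k=1)
chosen = [names[i] for i in selected]
print(chosen)
```

Keeping both ends of the ranking retains words that signal a positive review as well as words that signal a negative one, while uninformative words such as "movie" drop out.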

In [29]:
from scipy.stats import pearsonr

In [30]:
getCorrelation = np.vectorize(lambda x: pearsonr(trainX[:,x], trainY)[0])
correlations = getCorrelation(np.arange(trainX.shape[1]))
print(correlations)

[-0.01394664 -0.02379348  0.01151353 ...  0.01492592  0.02289527
  0.00199103]


In [31]:
allIndeces = np.argsort(-correlations)
bestIndeces = allIndeces[np.concatenate([np.arange(1000), np.arange(-1000, 0)])]

In [32]:
vocabulary = np.array(tfidf.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
print(vocabulary[bestIndeces][:10])
print(vocabulary[bestIndeces][-10:])

['great' 'love' 'excellent' 'beautiful' 'best' 'perfect' 'favorite'
 'enjoy' 'amazing' 'performance']
['minutes' 'poor' 'horrible' 'worse' 'terrible' 'boring' 'awful' 'waste'
 'worst' 'bad']


In [33]:
trainX = trainX[:,bestIndeces]
validX = validX[:,bestIndeces]

In [34]:
print(trainX.shape, trainY.shape)
print(validX.shape, validY.shape)

(15000, 2000) (15000,)
(10000, 2000) (10000,)


---

## Model Architecture
We choose a simple dense network for binary classification: six hidden `tanh` layers, each followed by dropout, and a single sigmoid output unit.
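
To get a feel for the size of such a network, the number of trainable parameters in a stack of dense layers can be worked out by hand. The sketch below assumes 2,000 input features, as produced by the feature selection step, and the layer widths used in the following cell. `dense_param_count` is a hypothetical helper name.

```python
def dense_param_count(input_dim, widths):
    """Count weights and biases of consecutive fully connected layers."""
    total = 0
    fan_in = input_dim
    for width in widths:
        total += fan_in * width + width  # weight matrix plus bias vector
        fan_in = width
    return total

input_dim = 2000
# Six hidden layers followed by the single output unit.
widths = [input_dim // 2, input_dim // 2, input_dim // 4, 100, 20, 5, 1]
total_params = dense_param_count(input_dim, widths)
print(total_params)
```

Dropout layers add no parameters, so almost all of the capacity sits in the first two layers.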

In [36]:
DROPOUT = 0.5
ACTIVATION = "tanh"

model = Sequential([    
    Dense(int(trainX.shape[1]/2), activation=ACTIVATION, input_dim=trainX.shape[1]),
    Dropout(DROPOUT),
    Dense(int(trainX.shape[1]/2), activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(int(trainX.shape[1]/4), activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(100, activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(20, activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(5, activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(1, activation='sigmoid'),
])

NameError: name 'Sequential' is not defined

In [None]:
model.compile(optimizer=optimizers.Adam(0.00005), loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

---

## Model Training
We train for 30 epochs with a batch size of 1,500, evaluating on the validation set after every epoch.

In [None]:
EPOCHS = 30
BATCHSIZE = 1500

In [None]:
model.fit(trainX, trainY, epochs=EPOCHS, batch_size=BATCHSIZE, validation_data=(validX, validY))

In [None]:
x = np.arange(EPOCHS)
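history = model.history.history
# Older Keras versions log "acc"/"val_acc", newer ones "accuracy"/"val_accuracy".
acc = history.get("acc", history.get("accuracy"))
val_acc = history.get("val_acc", history.get("val_accuracy"))

data = [
    go.Scatter(x=x, y=acc, name="Train Accuracy", marker=dict(size=5), yaxis='y2'),
    go.Scatter(x=x, y=val_acc, name="Valid Accuracy", marker=dict(size=5), yaxis='y2'),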
    go.Scatter(x=x, y=history["loss"], name="Train Loss", marker=dict(size=5)),
    go.Scatter(x=x, y=history["val_loss"], name="Valid Loss", marker=dict(size=5))
]
layout = go.Layout(
    title="Model Training Evolution", font=dict(family='Palatino'), xaxis=dict(title='Epoch', dtick=1),
    yaxis1=dict(title="Loss", domain=[0, 0.45]), yaxis2=dict(title="Accuracy", domain=[0.55, 1]),
)
py.iplot(go.Figure(data=data, layout=layout), show_link=False)

---

## Model Evaluation

### Accuracy & Loss
Let's first add the predicted probabilities and class predictions to the original train and validation DataFrames. Then we can print the respective losses and accuracies.
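
For reference, both metrics can be computed by hand from predicted probabilities. The following is a minimal sketch on made-up numbers, using the same 0.5 decision threshold as the cells below. The clipping constant `eps` is an assumption to avoid `log(0)`, not necessarily the value Keras uses.

```python
import math

def binary_crossentropy(probs, labels, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clip to keep the logarithm finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

def accuracy(probs, labels, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    hits = sum((p > threshold) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)

probs = [0.9, 0.2, 0.6, 0.4]   # predicted probability of "positive"
labels = [1, 0, 0, 1]          # true sentiment
loss = binary_crossentropy(probs, labels)
acc = accuracy(probs, labels)
print(round(loss, 4), acc)
```

The loss penalizes confident mistakes more heavily than hesitant ones, which is why it can keep improving while the accuracy stays flat.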

In [None]:
train["probability"] = model.predict(trainX).ravel()  # flatten the (n, 1) output
train["prediction"] = train.probability > 0.5
train["truth"] = train.label==1
train.tail()

In [None]:
print(model.evaluate(trainX, trainY))
print((train.truth==train.prediction).mean())

In [None]:
valid["probability"] = model.predict(validX).ravel()  # flatten the (n, 1) output
valid["prediction"] = valid.probability > 0.5
valid["truth"] = valid.label==1
valid.tail()

In [None]:
print(model.evaluate(validX, validY))
print((valid.truth==valid.prediction).mean())

### Error Analysis
Error analysis shows how the model makes its errors. Often, it reveals data quality issues such as mislabeled samples.
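
The four outcome types used in this kind of analysis can be pinned down with a small sketch on made-up predictions. `confusion_counts` is a hypothetical helper, and "positive" refers to the predicted class: a false positive is a negative review predicted as positive.

```python
from collections import Counter

def confusion_counts(predictions, truths):
    """Count the four combinations of predicted and true class."""
    names = {
        (True, True): "true_positive",
        (False, False): "true_negative",
        (True, False): "false_positive",   # predicted positive, actually negative
        (False, True): "false_negative",   # predicted negative, actually positive
    }
    counts = Counter(names[(bool(p), bool(t))] for p, t in zip(predictions, truths))
    return {name: counts.get(name, 0) for name in names.values()}

predictions = [True, True, False, False, True]
truths = [True, False, False, True, True]
result = confusion_counts(predictions, truths)
print(result)
```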

In [None]:
trainCross = train.groupby(["prediction", "truth"]).size().unstack()
trainCross

In [None]:
validCross = valid.groupby(["prediction", "truth"]).size().unstack()
validCross

In [None]:
truepositives = valid[(valid.truth==True)&(valid.truth==valid.prediction)]
print(len(truepositives), "true positives.")
truepositives.sort_values("probability", ascending=False).head(3)

In [None]:
truenegatives = valid[(valid.truth==False)&(valid.truth==valid.prediction)]
print(len(truenegatives), "true negatives.")
truenegatives.sort_values("probability", ascending=True).head(3)

In [None]:
# Positive reviews that the model predicted as negative.
falsenegatives = valid[(valid.truth==True)&(valid.truth!=valid.prediction)]
print(len(falsenegatives), "false negatives.")
falsenegatives.sort_values("probability", ascending=True).head(3)

In [None]:
# Negative reviews that the model predicted as positive.
falsepositives = valid[(valid.truth==False)&(valid.truth!=valid.prediction)]
print(len(falsepositives), "false positives.")
falsepositives.sort_values("probability", ascending=False).head(3)

This is the review that the model most confidently predicted as positive while it is labeled as negative. On reading it, however, we can easily recognize it as a mislabeled sample.

In [None]:
HTML(valid.loc[22148].review)

---

## Model Application

### Custom Reviews
To use this model, we would store it together with the fitted preprocessor, the TF-IDF vectorizer and the selected feature indices, and run unseen texts through the following pipeline.

In [None]:
unseen = pd.Series("this movie very good")

In [None]:
unseen = preprocess.transform(unseen)       # Text preprocessing
unseen = tfidf.transform(unseen).toarray()  # Feature engineering
unseen = unseen[:,bestIndeces]              # Feature selection
probability = model.predict(unseen)[0,0]  # Network feedforward

In [None]:
print(probability)
print("Positive!" if probability > 0.5 else "Negative!")