# Sentiment Analysis Using Deep Learning - Classifying IMDB Movie Reviews

This is a Natural Language Processing (NLP) classification project that aims to predict whether a movie review is positive or negative.

The data consists of 25,000 movie reviews scraped from [IMDB](http://www.imdb.com/), stored one per line in the text file `reviews.txt`. The pre-assigned labels used for training (whether each human-written review is positive or negative) are stored in the text file `labels.txt`.

# Overview

The first step in any machine learning problem is to curate the relevant dataset. Next, it is important to form a rudimentary predictive theory of how the input data correlates with the output data, since this is the correlation the neural network will have to learn.

The predictive theory in this project is validated using simple count-based heuristics. This method was able to identify words with both positive and negative correlation to the output data.
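As a rough illustration of such a count-based heuristic, the sketch below scores each word by the log of its positive-to-negative count ratio. The toy corpus, the `log_ratio` helper, and the symmetric add-one smoothing are assumptions made for this example; the notebook itself uses a slightly different smoothing, shown in the cells further down.

```python
from collections import Counter
import math

# Hypothetical toy corpus; the real project reads reviews.txt and labels.txt.
toy_reviews = [
    "great fun great cast",
    "great story",
    "terrible plot terrible acting",
    "terrible cast",
]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

positive_counts = Counter()
negative_counts = Counter()
for review, label in zip(toy_reviews, toy_labels):
    target = positive_counts if label == "POSITIVE" else negative_counts
    target.update(review.split(" "))

def log_ratio(word):
    """Log of the smoothed positive-to-negative count ratio for a word."""
    # Add-one smoothing on both counts is an assumption of this sketch.
    return math.log((positive_counts[word] + 1) / (negative_counts[word] + 1))

scores = {w: log_ratio(w) for w in set(positive_counts) | set(negative_counts)}
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}\t{score:+.2f}")
```

Words seen mostly in positive reviews get a positive score, words seen mostly in negative reviews get a negative score, and words that appear equally often in both sit near zero.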

![title](img/sa_net.png)

It is also important to amplify the signal and reduce the noise in the input data, so that the neural network can converge much faster.
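One way to reduce noise is to shrink the vocabulary before training. The sketch below drops words that are too rare or too weakly polarized. The `build_vocab` helper and the sample numbers are hypothetical; the thresholds are named after the `min_count` and `polarity_cutoff` parameters used by the network class in this project.

```python
from collections import Counter

def build_vocab(total_counts, polarity, min_count=10, polarity_cutoff=0.1):
    """Keep words that are frequent enough and, when scored, polarized enough."""
    vocab = set()
    for word, count in total_counts.items():
        if count <= min_count:
            continue  # too rare to carry a reliable signal
        score = polarity.get(word)
        if score is not None and abs(score) < polarity_cutoff:
            continue  # appears about equally often in both classes
        vocab.add(word)
    return vocab

# Hypothetical counts and polarity scores, for illustration only.
total_counts = Counter({"the": 5000, "excellent": 120, "awful": 90, "zyzzyva": 1})
polarity = {"the": 0.02, "excellent": 1.4, "awful": -1.6}

print(sorted(build_vocab(total_counts, polarity)))
```

A smaller vocabulary means fewer input weights to update per review, which is where the speed-up comes from.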

# Curating the Dataset

Parse the data from the two `.txt` files, which contain the movie reviews and their respective labels, "positive" or "negative".



In [1]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we want to know: one label per line, upper-cased
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]
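Before training, it can help to confirm that the two files line up. The following is a minimal sketch of such a check; the `load_lines` helper and the two temporary stand-in files are assumptions made so the example runs without the real dataset.

```python
import os
import tempfile

def load_lines(path, transform=str):
    """Read a text file and return one entry per line, without the newline."""
    with open(path, "r") as handle:
        return [transform(line.rstrip("\n")) for line in handle]

# Hypothetical stand-in files so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    reviews_path = os.path.join(tmp, "reviews.txt")
    labels_path = os.path.join(tmp, "labels.txt")
    with open(reviews_path, "w") as handle:
        handle.write("a wonderful little film\nflat and lifeless\n")
    with open(labels_path, "w") as handle:
        handle.write("positive\nnegative\n")

    demo_reviews = load_lines(reviews_path)
    demo_labels = load_lines(labels_path, transform=str.upper)

# Every review needs exactly one label, and labels must be one of two values.
assert len(demo_reviews) == len(demo_labels)
assert set(demo_labels) <= {"POSITIVE", "NEGATIVE"}
print(list(zip(demo_labels, demo_reviews)))
```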

# Building the Neural Network

The neural network that we are trying to implement consists of 3 layers: 1 input layer, 1 hidden layer, and 1 output layer.

![title](img/sa_net2.png)

While classification neural networks typically use non-linearity in hidden layers, for this neural network we will assume there is no non-linear activation function in the hidden layer.
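Because the hidden layer is linear and the input is a binary word-presence vector, the hidden activations are simply the sum of the weight rows for the words in the review. The sketch below checks this equivalence on a toy network; the sizes, random weights, and word indices are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_nodes = 6, 3

# Hypothetical small random weights, used here only to make the check non-trivial.
weights_0_1 = rng.normal(0.0, 0.1, (vocab_size, hidden_nodes))
weights_1_2 = rng.normal(0.0, 0.1, (hidden_nodes, 1))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# A review containing the words at vocabulary indices 1 and 4.
word_indices = [1, 4]

# Dense version: multiply a binary input vector by the weight matrix.
layer_0 = np.zeros((1, vocab_size))
layer_0[0, word_indices] = 1
hidden_dense = layer_0.dot(weights_0_1)

# Sparse version: add up only the rows for words that are present.
hidden_sparse = weights_0_1[word_indices].sum(axis=0, keepdims=True)

assert np.allclose(hidden_dense, hidden_sparse)
prediction = sigmoid(hidden_sparse.dot(weights_1_2))
print(prediction.shape, float(prediction[0, 0]))
```

The sparse form avoids multiplying by the many zeros in the input vector, which matters when the vocabulary has tens of thousands of words.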

We will use the class defined below to build the network and create the training data. Its `pre_process_data` method creates the vocabulary used by the training-data functions, and its `train` method trains over the entire corpus.

In [2]:
import time
import sys
import numpy as np
from collections import Counter

class SentimentNetwork:
    def __init__(self, reviews,labels,min_count = 10,polarity_cutoff = 0.1, \
                 hidden_nodes = 10, learning_rate = 0.1):
        np.random.seed(1)
        self.pre_process_data(reviews, labels, polarity_cutoff, min_count)
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)
           
    def pre_process_data(self, reviews, labels, polarity_cutoff, min_count):
        # labels is passed in explicitly rather than read from a global variable
        positive_counts = Counter()
        negative_counts = Counter()
        total_counts = Counter()

        for i in range(len(reviews)):
            if(labels[i] == 'POSITIVE'):
                for word in reviews[i].split(" "):
                    positive_counts[word] += 1
                    total_counts[word] += 1
            else:
                for word in reviews[i].split(" "):
                    negative_counts[word] += 1
                    total_counts[word] += 1

        pos_neg_ratios = Counter()

        for term,cnt in list(total_counts.most_common()):
            if(cnt >= 50):
                pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
                pos_neg_ratios[term] = pos_neg_ratio

        for word,ratio in pos_neg_ratios.most_common():
            if(ratio > 1):
                pos_neg_ratios[word] = np.log(ratio)
            else:
                pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))
        
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                if(total_counts[word] > min_count):
                    if(word in pos_neg_ratios.keys()):
                        if((pos_neg_ratios[word] >= polarity_cutoff) \
                           or (pos_neg_ratios[word] <= -polarity_cutoff)):
                            review_vocab.add(word)
                    else:
                        review_vocab.add(word)
        self.review_vocab = list(review_vocab)
        
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        
        self.label_vocab = list(label_vocab)
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        self.word2index = {}
        
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
          
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, 
                                                (self.hidden_nodes, self.output_nodes))
        
        self.learning_rate = learning_rate
        
        self.layer_0 = np.zeros((1,input_nodes))
        self.layer_1 = np.zeros((1,hidden_nodes))
        
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))  
    
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)
    
    def update_input_layer(self,review):
        self.layer_0 *= 0
        for word in review.split(" "):
            # Skip words that were filtered out of the vocabulary
            if word in self.word2index:
                self.layer_0[0][self.word2index[word]] = 1

    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0
        
    def train(self, training_reviews_raw, training_labels):
        training_reviews = list()
        for review in training_reviews_raw:
            indices = set()
            for word in review.split(" "):
                if(word in self.word2index.keys()):
                    indices.add(self.word2index[word])
            training_reviews.append(list(indices))
        
        assert(len(training_reviews) == len(training_labels))      
        correct_so_far = 0   
        start = time.time()
        
        for i in range(len(training_reviews)):        
            review = training_reviews[i]
            label = training_labels[i]
            
            ### Forward pass ###
            # Hidden layer
            self.layer_1 *= 0
            for index in review:
                self.layer_1 += self.weights_0_1[index]        
            # Output layer
            layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))


            ### Backward pass ###
            # Output error
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) 
            layer_1_delta = layer_1_error 
            
            # Update hidden-to-output weights with gradient descent step
            self.weights_1_2 -= self.layer_1.T.dot(layer_2_delta) * self.learning_rate
                   
            for index in review:
                self.weights_0_1[index] -= layer_1_delta[0] * self.learning_rate # update input-to-hidden weights with gradient descent step

            if(layer_2 >= 0.5 and label == 'POSITIVE'):
                correct_so_far += 1
            if(layer_2 < 0.5 and label == 'NEGATIVE'):
                correct_so_far += 1
            
            elapsed = time.time() - start
            reviews_per_second = i / elapsed if elapsed > 0 else 0.0  # guard against a zero-length interval
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
        
    
    def test(self, testing_reviews, testing_labels):        
        correct = 0       
        start = time.time()        
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            elapsed = time.time() - start
            reviews_per_second = i / elapsed if elapsed > 0 else 0.0  # guard against a zero-length interval
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                            + "% #Correct:" + str(correct) + " #Tested:" + str(i+1) + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        # Hidden layer
        self.layer_1 *= 0
        unique_indices = set()
        for word in review.lower().split(" "):
            if word in self.word2index.keys():
                unique_indices.add(self.word2index[word])
        for index in unique_indices:
            self.layer_1 += self.weights_0_1[index]
        
        # Output layer
        layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))
        
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
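The backward pass in `train` can be read as one step of gradient descent on a squared-error loss. This is an interpretation of the update rules in the code, not something the project states explicitly. Let $x$ be the binary input row vector, $h = x W_{01}$ the linear hidden layer, $z = h W_{12}$, $\hat{y} = \sigma(z)$ the prediction, $y$ the target, and $E = \tfrac{1}{2}(\hat{y} - y)^2$ the assumed loss. Then:

$$
\delta_2 = \frac{\partial E}{\partial z} = (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})
$$

$$
\frac{\partial E}{\partial W_{12}} = h^{\top} \delta_2,
\qquad
\delta_1 = \frac{\partial E}{\partial h} = \delta_2\, W_{12}^{\top}
$$

$$
\frac{\partial E}{\partial W_{01}} = x^{\top} \delta_1
$$

Here $\delta_2$ corresponds to `layer_2_delta` and $\delta_1$ to `layer_1_delta`. Since the hidden layer has no activation function, $\delta_1$ carries no extra derivative factor. The product $x^{\top} \delta_1$ is non-zero only in the rows where $x_i = 1$, which is why the code updates `weights_0_1` only at the indices of words present in the review.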
        

In [3]:
mlp_full = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=0,polarity_cutoff=0,learning_rate=0.01)

In [4]:
mlp_full.train(reviews[:-1000],labels[:-1000])

Progress:99.9% Speed(reviews/sec):2231. #Correct:20335 #Trained:24000 Training Accuracy:84.7%

In [5]:
mlp_full.test(reviews[-1000:],labels[-1000:])

Progress:99.9% Speed(reviews/sec):2577.% #Correct:856 #Tested:1000 Testing Accuracy:85.6%

In [6]:
def get_most_similar_words(focus = "horrible"):
    most_similar = Counter()

    for word in mlp_full.word2index.keys():
        most_similar[word] = np.dot(mlp_full.weights_0_1[mlp_full.word2index[word]],mlp_full.weights_0_1[mlp_full.word2index[focus]])
    
    return most_similar.most_common()

In [7]:
get_most_similar_words("excellent")

[('excellent', 0.13672950757352467),
 ('perfect', 0.12548286087225943),
 ('amazing', 0.091827633925999713),
 ('today', 0.090223662694414203),
 ('wonderful', 0.089355976962214589),
 ('fun', 0.087504466674206832),
 ('great', 0.087141758882292017),
 ('best', 0.085810885617880611),
 ('liked', 0.077697629123843426),
 ('definitely', 0.076628781406966009),
 ('brilliant', 0.073423858769279024),
 ('loved', 0.073285428928122121),
 ('favorite', 0.072781136036160751),
 ('superb', 0.071736207178505054),
 ('fantastic', 0.07092219191626617),
 ('job', 0.069160617207634001),
 ('incredible', 0.066424077952614402),
 ('enjoyable', 0.065632560502888806),
 ('rare', 0.064819212662615033),
 ('highly', 0.063889453350970501),
 ('enjoyed', 0.062127546101812939),
 ('wonderfully', 0.062055178604090142),
 ('perfectly', 0.061093208811887359),
 ('fascinating', 0.060663547937493859),
 ('bit', 0.059655427045653006),
 ('gem', 0.059510859296156779),
 ('outstanding', 0.058860808147082985),
 ('beautiful', 0.058613934703162

In [8]:
get_most_similar_words("terrible")

[('worst', 0.16966107259049851),
 ('awful', 0.12026847019691247),
 ('waste', 0.11945367265311009),
 ('poor', 0.092758887574435525),
 ('terrible', 0.091425387197728011),
 ('dull', 0.084209271678223618),
 ('poorly', 0.081241544516042041),
 ('disappointment', 0.080064759621368747),
 ('fails', 0.07859977372333754),
 ('disappointing', 0.077339485480323378),
 ('boring', 0.077127858748012909),
 ('unfortunately', 0.075502449705859079),
 ('worse', 0.070601835364194676),
 ('mess', 0.07056429962359044),
 ('stupid', 0.06948482283254305),
 ('badly', 0.0668889036662286),
 ('annoying', 0.065687021903374179),
 ('bad', 0.063093814537572165),
 ('save', 0.062880597495865748),
 ('disappointed', 0.062692353812072887),
 ('wasted', 0.061387183028051295),
 ('supposed', 0.060985452957725172),
 ('horrible', 0.060121772339380139),
 ('laughable', 0.058698406285467658),
 ('crap', 0.058104528667884611),
 ('basically', 0.057218840369636176),
 ('nothing', 0.057158220043034211),
 ('ridiculous', 0.056905481068931479),
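In the `get_most_similar_words("terrible")` output above, "worst" scores higher than "terrible" itself. This is possible because a raw dot product grows with the length of the vectors, not only with how closely they point in the same direction. As a hedged alternative, the sketch below ranks by cosine similarity, which normalizes the length away. The `rank_by_cosine` helper and the toy weights are hypothetical and not part of the original notebook.

```python
import numpy as np

def rank_by_cosine(weights, word2index, focus, top_n=5):
    """Rank words by cosine similarity between hidden-weight rows."""
    norms = np.linalg.norm(weights, axis=1)
    norms[norms == 0] = 1.0  # avoid dividing by zero for untrained words
    unit = weights / norms[:, None]
    scores = unit.dot(unit[word2index[focus]])
    ranked = sorted(word2index, key=lambda w: scores[word2index[w]], reverse=True)
    return [(w, float(scores[word2index[w]])) for w in ranked[:top_n]]

# Hypothetical 2-D weights; "worst" has the largest norm, as a strongly
# negative word might after training.
toy_index = {"terrible": 0, "worst": 1, "great": 2}
toy_weights = np.array([[-1.0, 0.2], [-3.0, 0.3], [1.0, 0.1]])

raw = toy_weights.dot(toy_weights[toy_index["terrible"]])
print("raw dot products:", dict(zip(toy_index, raw.round(2))))
print("cosine ranking:  ", rank_by_cosine(toy_weights, toy_index, "terrible"))
```

With the trained network, the same helper could be called as `rank_by_cosine(mlp_full.weights_0_1, mlp_full.word2index, "terrible")`.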


In [9]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [10]:
for i in range(len(reviews)):
    if(labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1

In [11]:
pos_neg_ratios = Counter()

for term,cnt in list(total_counts.most_common()):
    if(cnt > 100):
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
        pos_neg_ratios[term] = pos_neg_ratio

for word,ratio in pos_neg_ratios.most_common():
    if(ratio > 1):
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log((1 / (ratio+0.01)))

In [12]:
import matplotlib.colors as colors

words_to_visualize = list()
for word, ratio in pos_neg_ratios.most_common(500):
    if(word in mlp_full.word2index.keys()):
        words_to_visualize.append(word)
    
for word, ratio in list(reversed(pos_neg_ratios.most_common()))[0:500]:
    if(word in mlp_full.word2index.keys()):
        words_to_visualize.append(word)
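The cell above picks the most positive and the most negative words from `pos_neg_ratios`. The sketch below shows the same selection on a small, made-up `Counter`, using a reversed slice of `most_common()` for the negative end.

```python
from collections import Counter

# Hypothetical polarity scores standing in for pos_neg_ratios.
toy_ratios = Counter({"superb": 2.1, "fine": 0.3, "okay": -0.1, "dull": -1.2, "awful": -2.4})

n = 2
most_positive = [word for word, _ in toy_ratios.most_common(n)]
# most_common() sorts from highest to lowest, so the tail holds the most
# negative scores; the reversed slice returns them most negative first.
most_negative = [word for word, _ in toy_ratios.most_common()[:-n - 1:-1]]

print(most_positive, most_negative)
```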

In [13]:
pos = 0
neg = 0

colors_list = list()
vectors_list = list()
for word in words_to_visualize:
    if word in pos_neg_ratios.keys():
        vectors_list.append(mlp_full.weights_0_1[mlp_full.word2index[word]])
        if(pos_neg_ratios[word] > 0):
            pos+=1
            colors_list.append("#00ff00")
        else:
            neg+=1
            colors_list.append("#000000")
    

In [14]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(vectors_list)

In [15]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="vector T-SNE for most polarized words")

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_to_visualize,
                                    colors=colors_list))

# Colors are read from the data source by column name
p.scatter(x="x1", y="x2", size=8, source=source, color="colors")

word_labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(word_labels)

show(p)

# green indicates positive words, black indicates negative words

## Further Analysis: T-SNE Dimensionality Reduction

The visualization from Bokeh looks as follows:

![title](img/bokeh_plot.png)

We can see that the polarized words are grouped together. The black points indicate words more associated with negative movie reviews. Here we can see that words such as "horrible", "stupid", and "terrible" are grouped together.

![title](img/bokeh_plot2.png)

In another region of the plot, "obnoxious", "disgusting", and "rubbish" are clustered together.

![title](img/bokeh_plot3.png)

The green points, on the other hand, signify words that are more associated with positive IMDB movie reviews. As we can see below, words such as "beauty", "stunning", and "friendship" are grouped together.

![title](img/bokeh_plot4.png)

In another cluster, words like "fascinating", "masterpiece", and "awesome" are also plotted close together.

![title](img/bokeh_plot5.png)