# Analyze text sentiment:The machine learning approach

This project is based on Andrew Trask 
[Sentiment project](https://github.com/udacity/deep-learning/tree/master/sentiment-network).

The dataset is part of the [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/) publication.

In [54]:
from collections import Counter
import os
import math
from random import randint
import sys
import time
from IPython.display import Image


import numpy as np

from lib.reviews.load_reviews import load_reviews
from lib.reviews.get_words_indexes import get_words_indexes
from lib.activation_functions.sigmoid import sigmoid
from lib.derivatives.sigmoid_derivative import sigmoid_derivative

### Load the reviews and labels data

In [55]:
POSITIVE_DATASET_PATH = "dataset/positive_reviews.txt"
positive_reviews = load_reviews(POSITIVE_DATASET_PATH)

positive_reviews[0]

'I find it remarkable that so little was actually done with the story of the abomb and its development for decades after the Manhattan Project was completed My suspicion is that this was due to serious fears in the movie and entertainment industries in the s through the s with McCarthyism and related national security phobias including the Hollywood blacklist There was one film in the s with Robert Taylor about Col Paul Tibbits who flew the Enola Gay in the Hiroshima bombing but otherwise nothing else One could glance at a side issue tragedy the sinking of the USS Indianapolis soon after the delivery of the bombs to Tinian in Robert Shaws description of the shark attacks on the survivors in JAWS But the actual trials and tribulations of Groves Oppenheimer and their team was not considered filmablebr br And then in  two films appeared I have reviewed one already DAY ONE which I feel is the better of the two in discussing the lengthy technical and emotional and political problems in the 

In [56]:
NEGATIVE_DATASET_PATH = "dataset/negative_reviews.txt"
negative_reviews = load_reviews(NEGATIVE_DATASET_PATH)

negative_reviews[0]

'Boring movie Poor plot Poor actors The movie happens in a room supposed to be in Morocco but actually in some American city The Arab terrorists are the patriots the blonde patriot is the Arab terroristDAMNbr br There is something good about this movie though thats why the score is  out of  The director turns the ridiculous stereotype about terrorism the media feeds us every day into the real thing the terrorists are Americans or western people if you likebr br The movie is divided into two parts The first part of the movie concerns the Dutchman travel  seconds while the second part is about the staying in the amazing dark brown room  hour and somethingbr br The Dutch guy is going to deliver money in Morocco to some charity organization gets off the plain takes the bus and ends up kidnapped in a dark brown room He is kidnapped with another guy that is shot after telling They will not shoot at us The Dutch survivor is forced to play chess with a Morpheuslike Arab guy for so long that yo

### Create the words counters

We'll create three `Counter` objects, one for words from postive reviews, one for words from negative reviews, and one for all the words.

In [57]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [58]:
def get_words_count(reviews):
    words_counts = Counter()
    
    for index in range(len(reviews)):
        words = reviews[index].split(' ')
        
        for word in words:
            words_counts[word] += 1
            
    return words_counts

positive_counts = get_words_count(positive_reviews)
negative_counts = get_words_count(negative_reviews)

total_counts = positive_counts + negative_counts

Examine the most common words in positive reviews

In [59]:
positive_counts.most_common()

[('the', 144410),
 ('and', 82816),
 ('a', 76738),
 ('of', 74140),
 ('to', 63529),
 ('is', 53902),
 ('in', 45347),
 ('I', 33703),
 ('that', 31574),
 ('it', 31043),
 ('br', 27999),
 ('this', 26696),
 ('as', 22970),
 ('with', 21696),
 ('The', 21696),
 ('', 21219),
 ('was', 21213),
 ('for', 20539),
 ('film', 19232),
 ('movie', 17308),
 ('but', 16154),
 ('on', 15865),
 ('his', 15596),
 ('are', 14242),
 ('you', 13167),
 ('not', 12817),
 ('be', 11861),
 ('have', 11859),
 ('one', 11631),
 ('by', 11367),
 ('he', 11196),
 ('an', 10725),
 ('at', 10378),
 ('who', 10294),
 ('all', 9683),
 ('from', 9634),
 ('its', 9154),
 ('has', 8715),
 ('her', 8475),
 ('like', 8010),
 ('about', 7903),
 ('so', 7509),
 ('out', 7485),
 ('they', 7474),
 ('very', 7462),
 ('This', 7388),
 ('or', 7275),
 ('more', 6958),
 ('good', 6651),
 ('just', 6391),
 ('some', 6359),
 ('It', 6144),
 ('what', 5955),
 ('their', 5926),
 ('great', 5923),
 ('when', 5847),
 ('see', 5774),
 ('story', 5707),
 ('which', 5577),
 ('time', 5570),

And the respective most common words in negative reviews

In [60]:
negative_counts.most_common()

[('the', 139017),
 ('a', 75880),
 ('and', 68762),
 ('of', 67174),
 ('to', 66524),
 ('is', 48337),
 ('in', 40437),
 ('I', 37952),
 ('that', 33232),
 ('this', 32683),
 ('it', 31994),
 ('br', 29148),
 ('was', 25662),
 ('', 25219),
 ('movie', 23157),
 ('The', 22535),
 ('with', 20073),
 ('for', 19983),
 ('as', 18377),
 ('but', 16949),
 ('film', 16785),
 ('on', 16097),
 ('have', 15290),
 ('are', 14314),
 ('not', 14212),
 ('be', 13966),
 ('you', 13785),
 ('one', 11217),
 ('at', 11151),
 ('his', 11078),
 ('like', 10583),
 ('all', 10196),
 ('they', 10037),
 ('an', 9846),
 ('just', 9749),
 ('by', 9746),
 ('he', 9507),
 ('or', 9446),
 ('from', 9353),
 ('so', 9314),
 ('who', 8854),
 ('about', 8531),
 ('out', 8509),
 ('its', 8392),
 ('some', 7673),
 ('has', 7363),
 ('This', 7034),
 ('her', 6994),
 ('good', 6727),
 ('would', 6725),
 ('even', 6619),
 ('bad', 6550),
 ('if', 6519),
 ('no', 6473),
 ('more', 6266),
 ('up', 6249),
 ('only', 6238),
 ('what', 6096),
 ('were', 5989),
 ('really', 5804),
 ('th

As you can see, common words like "the" appear very often in both positive and negative reviews. Instead of finding the most common words in positive or negative reviews, what you really want are the words found in positive reviews more often than in negative reviews, and vice versa. To accomplish this, you'll need to calculate the ratios of word usage between positive and negative reviews.

In [61]:
pos_neg_ratios = Counter()

for word in positive_counts:
    if(positive_counts[word] > 100 or negative_counts[word] > 100):
        pos_neg_ratios[word] = math.log(positive_counts[word] / (negative_counts[word] + 1))

Examine the calculated ratios for a few words:

In [62]:
print(positive_counts["the"])
print(negative_counts["the"])
print(pos_neg_ratios["the"])

print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

144410
139017
0.038053054988839714
Pos-to-neg ratio for 'the' = 0.038053054988839714
Pos-to-neg ratio for 'amazing' = 1.3419941022233106
Pos-to-neg ratio for 'terrible' = -2.074219597698684


Neutral word have a ratio value close to 0. Words expected to see more often in positive reviews – like "amazing" – have a ratio greater than 0. Words with a ratio lower than 0 were expected to be more often in negative reviews.
Extremely positive and extremely negative words will have positive-to-negative ratios with similar magnitudes but opposite signs.

### Build the neural network

Assign a seed to our random number generator to ensure we get reproducable results during development.

In [63]:
np.random.seed(1)

Define the hyperparameters

In [64]:
# The network learning rate.
learning_rate = 0.001

# The polarity cutoff to exclude values very close to 0.
POLARITY_CUTOFF = 0.02

# The early stopping value expressed in percentage for the validation 
EARLY_STOPPING_VALUE = 80

# The number of single pass through whole training dataset
EPOCHS = 3

Create the words indexes dictionary processing the positive and negative reviews and keeping only the words with a ratio greater than the polarity cutoff.

In [65]:
word_index = 0
words_indexes_dictionary = {}

for word in pos_neg_ratios:
    if(abs(pos_neg_ratios[word]) > POLARITY_CUTOFF):
        words_indexes_dictionary[word] = word_index
        word_index += 1

Define the data sets for training and testing the neural network.

In [66]:
NEGATIVE = 0
POSITIVE = 1

reviews = []
labels = []

# Insert positive reviews

reviews = positive_reviews[:]
labels = [POSITIVE] * len(reviews)

# Insert randomly negative reviews

for review_index in range(len(negative_reviews)):
    index = randint(0, len(reviews))
    reviews.insert(index, negative_reviews[review_index])
    labels.insert(index, NEGATIVE)

train_reviews = reviews[:16000]
valid_reviews = reviews[16000:17000]
test_reviews = reviews[-5000:]

train_labels = labels[:16000]
valid_labels = labels[16000:17000]
test_labels = labels[-5000:]

Build the neural network structure having only an hidden layer.

In [67]:
INPUT_LAYER_NODES = len(words_indexes_dictionary)
HIDDEN_LAYER_NODES = 10
OUTPUT_LAYER_NODES = 1

input_to_hidden_weights = np.zeros((INPUT_LAYER_NODES, HIDDEN_LAYER_NODES))
hidden_to_output_weights = np.random.normal(0.0, HIDDEN_LAYER_NODES ** -0.5, 
                                            (HIDDEN_LAYER_NODES, OUTPUT_LAYER_NODES))

hidden_layer = np.zeros((1, HIDDEN_LAYER_NODES))

### Train the neural network

Loop through all the given reviews and run a forward and backward pass, updating weights for every item.

In [68]:
for epoch in range(EPOCHS):
    correct_predictions = 0

    for review_index in range(len(train_reviews)):
        review = train_reviews[review_index]
        label = train_labels[review_index]

        # Prepare the list of unique word indexes found on current review

        words_indexes = get_words_indexes(words_indexes_dictionary, review)

        ## The forward pass through the network

        # Calculate the hidden layer values with the input to hidden weights

        hidden_layer = np.zeros((OUTPUT_LAYER_NODES, HIDDEN_LAYER_NODES))

        for word_index in words_indexes:
            hidden_layer += input_to_hidden_weights[word_index]

        # Calculate the output value multiplying the hidden layer values by the hidden to output weights

        output = hidden_layer.dot(hidden_to_output_weights)
        output = sigmoid(output)

        ## The network validation

        valid_correct_predictions = 0

        for valid_index in range(len(valid_reviews)):
            valid_review = valid_reviews[valid_index]
            valid_label = valid_labels[valid_index]

            words_indexes = get_words_indexes(words_indexes_dictionary, valid_review)

            hidden_layer = np.zeros((OUTPUT_LAYER_NODES, HIDDEN_LAYER_NODES))

            for word_index in words_indexes:
                hidden_layer += input_to_hidden_weights[word_index]

            valid_output = hidden_layer.dot(hidden_to_output_weights)
            valid_output = sigmoid(valid_output)

            valid_error = valid_output - valid_label

            if(np.abs(valid_error) < 0.5):
                valid_correct_predictions += 1

        valid_accuracy = valid_correct_predictions * 100 / len(valid_reviews)

        # The training will stop when chosen performance measure stops improving
        # to avoid overfitting

        if(valid_accuracy > EARLY_STOPPING_VALUE):
            print("The early stopping value has been reached during validation.")
            break

        ## The back propagation pass

        # Calculate the output error and delta

        error = output - label
        
        output_delta = error * sigmoid_derivative(output)

        # Calculate the hidden error and delta

        hidden_errors = output_delta.dot(hidden_to_output_weights.T)
        hidden_deltas = hidden_errors

        # Update the network weights using the calculated deltas

        hidden_to_output_weights -= hidden_layer.T.dot(output_delta) * learning_rate

        for word_index in words_indexes:
            input_to_hidden_weights[word_index] -= hidden_deltas[0] * learning_rate

        # Keep track of errors and correct predictions 
        
        if(np.abs(error) < 0.5):
            correct_predictions += 1

        accuracy = correct_predictions * 100 / float(review_index + 1)

        sys.stdout.write("\rCorrect predictions: " + str(correct_predictions) + 
                         " - Trained: " + str(review_index) +
                         # " - Valid accuracy: " + str(valid_accuracy) +
                         " - Testing Accuracy:" + str(accuracy)[:4] + "%")

Correct predictions: 430 - Trained: 902 - Testing Accuracy:47.6%

KeyboardInterrupt: 

### Test the neural network

Use the test_labels to calculate the accuracy of previous predictions

In [48]:
correct_predictions = 0

for review_index in range(len(test_reviews)):
    review = test_reviews[review_index]
    label = test_labels[review_index]
    
    # Prepare the list of unique word indexes found on current review
    
    words_indexes = get_words_indexes(words_indexes_dictionary, review)
            
    ## The forward pass through the network
            
    # Calculate the hidden layer values with the input to hidden weights
        
    hidden_layer = np.zeros((OUTPUT_LAYER_NODES, HIDDEN_LAYER_NODES))
    
    for word_index in words_indexes:
        hidden_layer += input_to_hidden_weights[word_index]
    
    # Calculate the output value multiplying the hidden layer values by the hidden to output weights
    
    output = hidden_layer.dot(hidden_to_output_weights)
    output = sigmoid(output)
    
    error = output - label
    
    # Keep track of correct predictions
    
    if(np.abs(error) < 0.5):
        correct_predictions += 1
     
    sys.stdout.write("\rCorrect predictions: " + str(correct_predictions) \
                     + " - Trained: " + str(review_index) \
                     + " - Testing Accuracy:" \
                     + str(correct_predictions * 100 / float(review_index + 1))[:4] + "%")

Correct predictions: 1 - Trained: 0 - Testing Accuracy:100.%Correct predictions: 2 - Trained: 1 - Testing Accuracy:100.%Correct predictions: 3 - Trained: 2 - Testing Accuracy:100.%Correct predictions: 3 - Trained: 3 - Testing Accuracy:75.0%Correct predictions: 4 - Trained: 4 - Testing Accuracy:80.0%Correct predictions: 4 - Trained: 5 - Testing Accuracy:66.6%Correct predictions: 5 - Trained: 6 - Testing Accuracy:71.4%Correct predictions: 6 - Trained: 7 - Testing Accuracy:75.0%Correct predictions: 7 - Trained: 8 - Testing Accuracy:77.7%Correct predictions: 8 - Trained: 9 - Testing Accuracy:80.0%Correct predictions: 9 - Trained: 10 - Testing Accuracy:81.8%Correct predictions: 9 - Trained: 11 - Testing Accuracy:75.0%Correct predictions: 10 - Trained: 12 - Testing Accuracy:76.9%Correct predictions: 11 - Trained: 13 - Testing Accuracy:78.5%Correct predictions: 12 - Trained: 14 - Testing Accuracy:80.0%Correct predictions: 13 - Trained: 15 - Testing Accuracy:81.2%Correct predi

Correct predictions: 589 - Trained: 709 - Testing Accuracy:82.9%Correct predictions: 590 - Trained: 710 - Testing Accuracy:82.9%Correct predictions: 591 - Trained: 711 - Testing Accuracy:83.0%Correct predictions: 592 - Trained: 712 - Testing Accuracy:83.0%Correct predictions: 593 - Trained: 713 - Testing Accuracy:83.0%Correct predictions: 594 - Trained: 714 - Testing Accuracy:83.0%Correct predictions: 595 - Trained: 715 - Testing Accuracy:83.1%Correct predictions: 596 - Trained: 716 - Testing Accuracy:83.1%Correct predictions: 596 - Trained: 717 - Testing Accuracy:83.0%Correct predictions: 597 - Trained: 718 - Testing Accuracy:83.0%Correct predictions: 598 - Trained: 719 - Testing Accuracy:83.0%Correct predictions: 599 - Trained: 720 - Testing Accuracy:83.0%Correct predictions: 599 - Trained: 721 - Testing Accuracy:82.9%Correct predictions: 600 - Trained: 722 - Testing Accuracy:82.9%Correct predictions: 601 - Trained: 723 - Testing Accuracy:83.0%Correct predictions: 602

Correct predictions: 1157 - Trained: 1418 - Testing Accuracy:81.5%Correct predictions: 1158 - Trained: 1419 - Testing Accuracy:81.5%Correct predictions: 1159 - Trained: 1420 - Testing Accuracy:81.5%Correct predictions: 1159 - Trained: 1421 - Testing Accuracy:81.5%Correct predictions: 1160 - Trained: 1422 - Testing Accuracy:81.5%Correct predictions: 1161 - Trained: 1423 - Testing Accuracy:81.5%Correct predictions: 1162 - Trained: 1424 - Testing Accuracy:81.5%Correct predictions: 1162 - Trained: 1425 - Testing Accuracy:81.4%Correct predictions: 1163 - Trained: 1426 - Testing Accuracy:81.4%Correct predictions: 1164 - Trained: 1427 - Testing Accuracy:81.5%Correct predictions: 1165 - Trained: 1428 - Testing Accuracy:81.5%Correct predictions: 1166 - Trained: 1429 - Testing Accuracy:81.5%Correct predictions: 1167 - Trained: 1430 - Testing Accuracy:81.5%Correct predictions: 1168 - Trained: 1431 - Testing Accuracy:81.5%Correct predictions: 1169 - Trained: 1432 - Testing Accuracy:

Correct predictions: 1675 - Trained: 2062 - Testing Accuracy:81.1%Correct predictions: 1676 - Trained: 2063 - Testing Accuracy:81.2%Correct predictions: 1677 - Trained: 2064 - Testing Accuracy:81.2%Correct predictions: 1678 - Trained: 2065 - Testing Accuracy:81.2%Correct predictions: 1679 - Trained: 2066 - Testing Accuracy:81.2%Correct predictions: 1679 - Trained: 2067 - Testing Accuracy:81.1%Correct predictions: 1680 - Trained: 2068 - Testing Accuracy:81.1%Correct predictions: 1681 - Trained: 2069 - Testing Accuracy:81.2%Correct predictions: 1682 - Trained: 2070 - Testing Accuracy:81.2%Correct predictions: 1683 - Trained: 2071 - Testing Accuracy:81.2%Correct predictions: 1684 - Trained: 2072 - Testing Accuracy:81.2%Correct predictions: 1685 - Trained: 2073 - Testing Accuracy:81.2%Correct predictions: 1686 - Trained: 2074 - Testing Accuracy:81.2%Correct predictions: 1687 - Trained: 2075 - Testing Accuracy:81.2%Correct predictions: 1688 - Trained: 2076 - Testing Accuracy:

Correct predictions: 2189 - Trained: 2695 - Testing Accuracy:81.1%Correct predictions: 2190 - Trained: 2696 - Testing Accuracy:81.2%Correct predictions: 2191 - Trained: 2697 - Testing Accuracy:81.2%Correct predictions: 2192 - Trained: 2698 - Testing Accuracy:81.2%Correct predictions: 2193 - Trained: 2699 - Testing Accuracy:81.2%Correct predictions: 2194 - Trained: 2700 - Testing Accuracy:81.2%Correct predictions: 2195 - Trained: 2701 - Testing Accuracy:81.2%Correct predictions: 2196 - Trained: 2702 - Testing Accuracy:81.2%Correct predictions: 2196 - Trained: 2703 - Testing Accuracy:81.2%Correct predictions: 2196 - Trained: 2704 - Testing Accuracy:81.1%Correct predictions: 2197 - Trained: 2705 - Testing Accuracy:81.1%Correct predictions: 2198 - Trained: 2706 - Testing Accuracy:81.1%Correct predictions: 2199 - Trained: 2707 - Testing Accuracy:81.2%Correct predictions: 2200 - Trained: 2708 - Testing Accuracy:81.2%Correct predictions: 2201 - Trained: 2709 - Testing Accuracy:

Correct predictions: 2808 - Trained: 3464 - Testing Accuracy:81.0%Correct predictions: 2809 - Trained: 3465 - Testing Accuracy:81.0%Correct predictions: 2810 - Trained: 3466 - Testing Accuracy:81.0%Correct predictions: 2811 - Trained: 3467 - Testing Accuracy:81.0%Correct predictions: 2811 - Trained: 3468 - Testing Accuracy:81.0%Correct predictions: 2812 - Trained: 3469 - Testing Accuracy:81.0%Correct predictions: 2813 - Trained: 3470 - Testing Accuracy:81.0%Correct predictions: 2814 - Trained: 3471 - Testing Accuracy:81.0%Correct predictions: 2815 - Trained: 3472 - Testing Accuracy:81.0%Correct predictions: 2816 - Trained: 3473 - Testing Accuracy:81.0%Correct predictions: 2817 - Trained: 3474 - Testing Accuracy:81.0%Correct predictions: 2818 - Trained: 3475 - Testing Accuracy:81.0%Correct predictions: 2819 - Trained: 3476 - Testing Accuracy:81.0%Correct predictions: 2819 - Trained: 3477 - Testing Accuracy:81.0%Correct predictions: 2820 - Trained: 3478 - Testing Accuracy:

Correct predictions: 3340 - Trained: 4132 - Testing Accuracy:80.8%Correct predictions: 3341 - Trained: 4133 - Testing Accuracy:80.8%Correct predictions: 3342 - Trained: 4134 - Testing Accuracy:80.8%Correct predictions: 3343 - Trained: 4135 - Testing Accuracy:80.8%Correct predictions: 3343 - Trained: 4136 - Testing Accuracy:80.8%Correct predictions: 3344 - Trained: 4137 - Testing Accuracy:80.8%Correct predictions: 3344 - Trained: 4138 - Testing Accuracy:80.7%Correct predictions: 3345 - Trained: 4139 - Testing Accuracy:80.7%Correct predictions: 3346 - Trained: 4140 - Testing Accuracy:80.8%Correct predictions: 3347 - Trained: 4141 - Testing Accuracy:80.8%Correct predictions: 3348 - Trained: 4142 - Testing Accuracy:80.8%Correct predictions: 3349 - Trained: 4143 - Testing Accuracy:80.8%Correct predictions: 3350 - Trained: 4144 - Testing Accuracy:80.8%Correct predictions: 3350 - Trained: 4145 - Testing Accuracy:80.8%Correct predictions: 3351 - Trained: 4146 - Testing Accuracy:

Correct predictions: 3937 - Trained: 4867 - Testing Accuracy:80.8%Correct predictions: 3938 - Trained: 4868 - Testing Accuracy:80.8%Correct predictions: 3939 - Trained: 4869 - Testing Accuracy:80.8%Correct predictions: 3940 - Trained: 4870 - Testing Accuracy:80.8%Correct predictions: 3941 - Trained: 4871 - Testing Accuracy:80.8%Correct predictions: 3941 - Trained: 4872 - Testing Accuracy:80.8%Correct predictions: 3942 - Trained: 4873 - Testing Accuracy:80.8%Correct predictions: 3943 - Trained: 4874 - Testing Accuracy:80.8%Correct predictions: 3944 - Trained: 4875 - Testing Accuracy:80.8%Correct predictions: 3944 - Trained: 4876 - Testing Accuracy:80.8%Correct predictions: 3945 - Trained: 4877 - Testing Accuracy:80.8%Correct predictions: 3946 - Trained: 4878 - Testing Accuracy:80.8%Correct predictions: 3946 - Trained: 4879 - Testing Accuracy:80.8%Correct predictions: 3947 - Trained: 4880 - Testing Accuracy:80.8%Correct predictions: 3948 - Trained: 4881 - Testing Accuracy: