# Analyze text sentiment: The machine learning approach

This project is based on Andrew Trask's [Sentiment project](https://github.com/udacity/deep-learning/tree/master/sentiment-network).

The dataset is part of the [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/) publication.

In [None]:
from collections import Counter
import os
import math
from random import randint
import sys
import time
from IPython.display import Image


import numpy as np

from lib.reviews.load_reviews import load_reviews
from lib.reviews.get_words_indexes import get_words_indexes
from lib.activation_functions.sigmoid import sigmoid
from lib.derivatives.sigmoid_derivative import sigmoid_derivative
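
The `lib` helpers imported above are not shown in this notebook. As a rough sketch of what the two activation helpers might look like, assuming `sigmoid_derivative` receives the sigmoid's output rather than its raw input (which matches how it is called during training), consider:

```python
import numpy as np


def sigmoid(x):
    """Squash any real value (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_derivative(output):
    """Derivative of the sigmoid, expressed in terms of its output.

    Assumption: `output` is already sigmoid(x), not x itself.
    """
    return output * (1.0 - output)


print(sigmoid(0.0))                      # 0.5
print(sigmoid_derivative(sigmoid(0.0)))  # 0.25
```

These definitions are an assumption about the library code, not a copy of it.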

### Load the reviews and labels data

In [None]:
POSITIVE_DATASET_PATH = "dataset/positive_reviews.txt"
positive_reviews = load_reviews(POSITIVE_DATASET_PATH)

positive_reviews[0]

In [None]:
NEGATIVE_DATASET_PATH = "dataset/negative_reviews.txt"
negative_reviews = load_reviews(NEGATIVE_DATASET_PATH)

negative_reviews[0]

### Create the words counters

We'll create three `Counter` objects: one for words from positive reviews, one for words from negative reviews, and one for all the words.

In [None]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [None]:
def get_words_count(reviews):
    words_counts = Counter()
    
    for index in range(len(reviews)):
        words = reviews[index].split(' ')
        
        for word in words:
            words_counts[word] += 1
            
    return words_counts

positive_counts = get_words_count(positive_reviews)
negative_counts = get_words_count(negative_reviews)

total_counts = positive_counts + negative_counts

Examine the most common words in positive reviews:

In [None]:
positive_counts.most_common()

Do the same for the most common words in negative reviews:

In [None]:
negative_counts.most_common()

As you can see, common words like "the" appear very often in both positive and negative reviews. Instead of finding the most common words in positive or negative reviews, what you really want are the words found in positive reviews more often than in negative reviews, and vice versa. To accomplish this, you'll need to calculate the ratios of word usage between positive and negative reviews.

In [None]:
pos_neg_ratios = Counter()

for word in positive_counts:
    if(positive_counts[word] > 100 or negative_counts[word] > 100):
        pos_neg_ratios[word] = math.log(positive_counts[word] / (negative_counts[word] + 1))

Examine the calculated ratios for a few words:

In [None]:
print(positive_counts["the"])
print(negative_counts["the"])
print(pos_neg_ratios["the"])

print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

Neutral words have a ratio close to 0. Words that appear more often in positive reviews, like "amazing", have a ratio greater than 0, while words that appear more often in negative reviews, like "terrible", have a ratio lower than 0.
Extremely positive and extremely negative words will have positive-to-negative ratios with similar magnitudes but opposite signs.
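
To see why the logarithm gives this symmetry, here is a small worked example. The counts are invented purely for illustration and are not taken from the dataset; `log_ratio` is a hypothetical helper that applies the same formula as the cell above.

```python
import math

# Toy counts, invented purely for illustration (not taken from the dataset).
toy_positive_counts = {"the": 1000, "amazing": 400, "terrible": 10}
toy_negative_counts = {"the": 999, "amazing": 9, "terrible": 399}


def log_ratio(positive_count, negative_count):
    """Same formula as the cell above: log(positive / (negative + 1))."""
    return math.log(positive_count / (negative_count + 1))


toy_ratios = {
    word: log_ratio(toy_positive_counts[word], toy_negative_counts[word])
    for word in toy_positive_counts
}

for word, ratio in toy_ratios.items():
    print("{}: {:.3f}".format(word, ratio))
```

With these toy counts, "the" has a raw ratio of 1 and a log ratio of 0, "amazing" has a raw ratio of 40 and a log ratio of about 3.689, and "terrible" has a raw ratio of 1/40 and a log ratio of about -3.689. Without the logarithm, the two extremes would be 40 and 0.025, which are much harder to compare.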

### Build the neural network

Seed NumPy's random number generator so that results are reproducible during development.

In [None]:
np.random.seed(1)

Define the hyperparameters

In [None]:
# The network learning rate.
learning_rate = 0.001

# The polarity cutoff to exclude values very close to 0.
POLARITY_CUTOFF = 0.02

# The early stopping threshold, expressed as a validation accuracy percentage.
EARLY_STOPPING_VALUE = 80

# The number of full passes through the whole training dataset.
EPOCHS = 3

Build the word index dictionary from the positive-to-negative ratios, keeping only the words whose absolute ratio is greater than the polarity cutoff. This drops near-neutral words.

In [None]:
word_index = 0
words_indexes_dictionary = {}

for word in pos_neg_ratios:
    if(abs(pos_neg_ratios[word]) > POLARITY_CUTOFF):
        words_indexes_dictionary[word] = word_index
        word_index += 1
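
The training and test loops rely on `get_words_indexes`, which is imported from `lib` and not shown here. A minimal sketch of what such a helper might do, assuming it splits the review on spaces (as `get_words_count` does) and returns each known word's index only once, is given below. The name `get_words_indexes_sketch` is hypothetical, chosen so that it does not shadow the real import.

```python
def get_words_indexes_sketch(words_indexes_dictionary, review):
    """Return the unique indexes of the known words found in a review.

    Assumption: words missing from the dictionary (for example,
    near-neutral words removed by the polarity cutoff) are ignored.
    """
    indexes = set()

    for word in review.split(' '):
        if word in words_indexes_dictionary:
            indexes.add(words_indexes_dictionary[word])

    return list(indexes)


toy_dictionary = {"amazing": 0, "terrible": 1, "boring": 2}
toy_review = "an amazing film , never boring , truly amazing"

print(sorted(get_words_indexes_sketch(toy_dictionary, toy_review)))  # [0, 2]
```

Returning unique indexes means a word repeated many times in one review contributes to the hidden layer only once.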

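Define the data sets for training, validating, and testing the neural network. Negative reviews are inserted at random positions among the positive ones; the first 16,000 reviews are then used for training, the next 1,000 for validation, and the last 5,000 for testing.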

In [None]:
NEGATIVE = 0
POSITIVE = 1

reviews = []
labels = []

# Insert positive reviews

reviews = positive_reviews[:]
labels = [POSITIVE] * len(reviews)

# Insert randomly negative reviews

for review_index in range(len(negative_reviews)):
    index = randint(0, len(reviews))
    reviews.insert(index, negative_reviews[review_index])
    labels.insert(index, NEGATIVE)

train_reviews = reviews[:16000]
valid_reviews = reviews[16000:17000]
test_reviews = reviews[-5000:]

train_labels = labels[:16000]
valid_labels = labels[16000:17000]
test_labels = labels[-5000:]
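
Note that `randint` comes from Python's `random` module, which `np.random.seed` does not seed, so the mix above changes from run to run. One possible alternative, sketched here under the assumption that a reproducible shuffle is wanted, pairs each review with its label and shuffles the pairs with a seeded generator. The function name `merge_and_shuffle` and its `seed` parameter are hypothetical.

```python
import random


def merge_and_shuffle(positive, negative, seed=None):
    """Label, merge, and shuffle two lists of reviews.

    Positive reviews are labelled 1 and negative reviews 0, matching
    the POSITIVE and NEGATIVE constants used in this notebook.
    """
    pairs = [(review, 1) for review in positive]
    pairs += [(review, 0) for review in negative]

    # A dedicated generator keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(pairs)

    shuffled_reviews = [review for review, _ in pairs]
    shuffled_labels = [label for _, label in pairs]

    return shuffled_reviews, shuffled_labels


toy_reviews, toy_labels = merge_and_shuffle(["good", "great"], ["bad", "awful"], seed=1)
print(list(zip(toy_reviews, toy_labels)))
```

Shuffling pairs keeps every review attached to its label and avoids the repeated `list.insert` calls, each of which has to shift part of the list.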

Build the neural network structure with a single hidden layer.

In [None]:
INPUT_LAYER_NODES = len(words_indexes_dictionary)
HIDDEN_LAYER_NODES = 10
OUTPUT_LAYER_NODES = 1

input_to_hidden_weights = np.zeros((INPUT_LAYER_NODES, HIDDEN_LAYER_NODES))
hidden_to_output_weights = np.random.normal(0.0, HIDDEN_LAYER_NODES ** -0.5, 
                                            (HIDDEN_LAYER_NODES, OUTPUT_LAYER_NODES))

hidden_layer = np.zeros((1, HIDDEN_LAYER_NODES))
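
The loops that follow never build an explicit input vector. Instead, they add up the rows of `input_to_hidden_weights` for the words present in a review. The sketch below, which uses small made-up dimensions and random weights, illustrates why this gives the same hidden layer as multiplying a binary input vector by the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes, for illustration only.
vocabulary_size, hidden_nodes = 6, 3
weights = rng.normal(size=(vocabulary_size, hidden_nodes))

# Indexes of the words present in a hypothetical review.
words_indexes = [1, 4]

# Dense version: a binary input vector multiplied by the weight matrix.
input_layer = np.zeros((1, vocabulary_size))
input_layer[0, words_indexes] = 1
dense_hidden = input_layer.dot(weights)

# Sparse version: sum only the weight rows of the words that are present.
sparse_hidden = np.zeros((1, hidden_nodes))
for word_index in words_indexes:
    sparse_hidden += weights[word_index]

print(np.allclose(dense_hidden, sparse_hidden))  # True
```

Because a review contains only a small fraction of the vocabulary, the sparse version skips all the multiplications by zero.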

### Train the neural network

Loop through all the given reviews and run a forward and backward pass, updating weights for every item.

In [None]:
for epoch in range(EPOCHS):
    correct_predictions = 0

    for review_index in range(len(train_reviews)):
        review = train_reviews[review_index]
        label = train_labels[review_index]

        # Prepare the list of unique word indexes found in the current review

        words_indexes = get_words_indexes(words_indexes_dictionary, review)

        ## The forward pass through the network

        # Calculate the hidden layer values with the input to hidden weights

        hidden_layer = np.zeros((OUTPUT_LAYER_NODES, HIDDEN_LAYER_NODES))

        for word_index in words_indexes:
            hidden_layer += input_to_hidden_weights[word_index]

        # Calculate the output value multiplying the hidden layer values by the hidden to output weights

        output = hidden_layer.dot(hidden_to_output_weights)
        output = sigmoid(output)

        ## The network validation

        valid_correct_predictions = 0

        for valid_index in range(len(valid_reviews)):
            valid_review = valid_reviews[valid_index]
            valid_label = valid_labels[valid_index]

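            # Use separate names so the training review's word indexes and
            # hidden layer stay intact for the back propagation pass

            valid_words_indexes = get_words_indexes(words_indexes_dictionary, valid_review)

            valid_hidden_layer = np.zeros((OUTPUT_LAYER_NODES, HIDDEN_LAYER_NODES))

            for word_index in valid_words_indexes:
                valid_hidden_layer += input_to_hidden_weights[word_index]

            valid_output = valid_hidden_layer.dot(hidden_to_output_weights)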
            valid_output = sigmoid(valid_output)

            valid_error = valid_output - valid_label

            if(np.abs(valid_error) < 0.5):
                valid_correct_predictions += 1

        valid_accuracy = valid_correct_predictions * 100 / len(valid_reviews)

        # Stop the current epoch once the validation accuracy exceeds the
        # early stopping threshold, to limit overfitting

        if(valid_accuracy > EARLY_STOPPING_VALUE):
            print("The early stopping value has been reached during validation.")
            break

        ## The back propagation pass

        # Calculate the output error and delta

        error = output - label
        
        output_delta = error * sigmoid_derivative(output)

        # Calculate the hidden error and delta

        hidden_errors = output_delta.dot(hidden_to_output_weights.T)
        hidden_deltas = hidden_errors

        # Update the network weights using the calculated deltas

        hidden_to_output_weights -= hidden_layer.T.dot(output_delta) * learning_rate

        for word_index in words_indexes:
            input_to_hidden_weights[word_index] -= hidden_deltas[0] * learning_rate

        # Keep track of errors and correct predictions 
        
        if(np.abs(error) < 0.5):
            correct_predictions += 1

        accuracy = correct_predictions * 100 / float(review_index + 1)

        sys.stdout.write("\rCorrect predictions: " + str(correct_predictions) + 
                         " - Trained: " + str(review_index + 1) +
                         # " - Valid accuracy: " + str(valid_accuracy) +
                         " - Training Accuracy:" + str(accuracy)[:4] + "%")

### Test the neural network

Run the test reviews through the trained network and compare each prediction with `test_labels` to calculate the test accuracy.

In [None]:
correct_predictions = 0

for review_index in range(len(test_reviews)):
    review = test_reviews[review_index]
    label = test_labels[review_index]
    
    # Prepare the list of unique word indexes found in the current review
    
    words_indexes = get_words_indexes(words_indexes_dictionary, review)
            
    ## The forward pass through the network
            
    # Calculate the hidden layer values with the input to hidden weights
        
    hidden_layer = np.zeros((OUTPUT_LAYER_NODES, HIDDEN_LAYER_NODES))
    
    for word_index in words_indexes:
        hidden_layer += input_to_hidden_weights[word_index]
    
    # Calculate the output value multiplying the hidden layer values by the hidden to output weights
    
    output = hidden_layer.dot(hidden_to_output_weights)
    output = sigmoid(output)
    
    error = output - label
    
    # Keep track of correct predictions
    
    if(np.abs(error) < 0.5):
        correct_predictions += 1
     
    sys.stdout.write("\rCorrect predictions: " + str(correct_predictions)
                     + " - Tested: " + str(review_index + 1)
                     + " - Testing Accuracy:"
                     + str(correct_predictions * 100 / float(review_index + 1))[:4] + "%")
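
The validation pass inside the training loop and the test loop above repeat the same forward pass. As a sketch of one way to factor it out, the hypothetical helper `evaluate_accuracy` below scores a set of reviews without changing any weights. It assumes each review has already been converted to a list of word indexes, and it defines its own `sigmoid` so the example is self-contained.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def evaluate_accuracy(indexed_reviews, labels, input_to_hidden, hidden_to_output):
    """Return the percentage of reviews classified correctly.

    `indexed_reviews` holds, for each review, the list of its word indexes.
    The weights are only read, never updated.
    """
    correct = 0

    for words_indexes, label in zip(indexed_reviews, labels):
        # Sum the weight rows of the words present in the review.
        rows = input_to_hidden[np.asarray(words_indexes, dtype=int)]
        hidden = rows.sum(axis=0)

        output = sigmoid(hidden.dot(hidden_to_output))

        if abs(output[0] - label) < 0.5:
            correct += 1

    return correct * 100 / len(labels)


# Toy weights: word 0 pushes towards positive, word 1 towards negative.
toy_input_to_hidden = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]])
toy_hidden_to_output = np.array([[2.0], [0.0]])

print(evaluate_accuracy([[0], [1]], [1, 0],
                        toy_input_to_hidden, toy_hidden_to_output))  # 100.0
```

With a helper like this, validation could be run once per epoch instead of after every training review, which would make each epoch considerably cheaper. Whether that fits the intended early stopping behaviour is a design choice left to the reader.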