# Sentiment Analysis end-to-end example

This example is brought to you by Udacity - consider doing the great Udacity Deep Learning course. Find out more [here](https://www.udacity.com/course/deep-learning-nanodegree--nd101). 

> These are my own personal notes

----

In this notebook, the aim is to build `TODO`

We begin by looking at the dataset we have:
- `reviews.txt`: movie reviews, one review per line
- `labels.txt`: the positive/negative label associated with each review

We will use Python's built-in `open()` function with the mode `'r'` to read each file. Calling `readlines()` returns a list in which each item is one line of the file, so each review (or label) becomes one item in the list. The `x[:-1]` slice then removes the trailing newline character from each line.

In [1]:
review_file = open('sentiment_data/reviews.txt', 'r')
reviews = list(map(lambda x : x[:-1], review_file.readlines()))
review_file.close()

label_file = open('sentiment_data/labels.txt', 'r')
labels = list(map(lambda x : x[:-1].upper(), label_file.readlines()))
label_file.close()
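
The same loading step can also be written with a `with` block, which closes the file even if reading fails, and with `rstrip('\n')`, which only removes a trailing newline (whereas `x[:-1]` would drop a real character if the last line had no newline). This is a minimal sketch: `read_lines` is a helper name introduced here for illustration, and the demo file is a throwaway rather than the real dataset.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of a text file without their trailing newlines."""
    with open(path, 'r') as handle:
        return [line.rstrip('\n') for line in handle]


# Demo on a temporary two-line file (the last line has no trailing newline).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('a great film\nterrible plot')
    demo_path = tmp.name

demo_lines = read_lines(demo_path)
os.remove(demo_path)
print(demo_lines)
```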

Now let's find out some information about our data.

In [2]:
print(f'Size of our data: {len(reviews)}')
print(f'No of labels: {len(labels)}')
print('\nNow, lets see one row of our data. First feature in our data:')
print(reviews[0])
print('\nPrediction:')
print(labels[0])

Size of our data: 25000
No of labels: 25000

Now, lets see one row of our data. First feature in our data:
bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   

Prediction:
POSITIVE


---

## Now, let's build up a hypothesis

We will begin by looking at our data and trying to see what conclusions we can draw. This is often called the exploratory phase. We will start by printing a few reviews alongside their labels...

In [3]:
def print_review_with_label(ith_row):
    print(labels[ith_row] + '\t:\t' + reviews[ith_row][:80] + '...')

Using the function above, we can neatly print our data: each review (the feature) along with its label.

In [4]:
print("labels.txt \t : \t reviews.txt\n")
print_review_with_label(2137)
print_review_with_label(12816)
print_review_with_label(6267)
print_review_with_label(21934)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...


We will be using Python's `Counter` class throughout this section, as it provides a nice way to count the occurrences of words.

In [5]:
from collections import Counter
import numpy as np 

In [6]:
positive_words_counter = Counter()
negative_words_counter = Counter()
total_words_counter = Counter()
example_counter_with_stuff = Counter([1,2,3,4,4,4])

def counter_pretty_print():
    print('positive counter: ', positive_words_counter)
    print('negative counter: ', negative_words_counter)
    print('total words counter: ', total_words_counter)

print('At this stage, our counters are empty...')
counter_pretty_print()
print('Here is a test counter: ', example_counter_with_stuff)

At this stage, our counters are empty...
positive counter:  Counter()
negative counter:  Counter()
total words counter:  Counter()
Here is a test counter:  Counter({4: 3, 1: 1, 2: 1, 3: 1})
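
Two `Counter` behaviours are worth keeping in mind for the cells that follow. The sketch below uses a made-up word list, not the review data.

```python
from collections import Counter

# Toy word list, made up for illustration.
words = ['good', 'bad', 'good', 'film', 'good']
toy_counter = Counter(words)

# A key that was never counted reads as 0 instead of raising KeyError,
# which is why `counter[word] += 1` works on an empty Counter.
missing = toy_counter['unseen']

# most_common(n) returns the n highest counts as (item, count) pairs.
top_one = toy_counter.most_common(1)
print(missing, top_one)
```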


Now, let's fill in our three counters.

In [7]:
# for each row in our dataset
for sentence_no in range(len(reviews)):
    # for each word in our sentence
    for word in reviews[sentence_no].split(' '):
        # if it is positive - add a positive counter
        if labels[sentence_no] == 'POSITIVE':
            positive_words_counter[word] +=1
        # if it is negative - add to negative counter
        if labels[sentence_no] == 'NEGATIVE':
            negative_words_counter[word] +=1
        # regardless, add to total word counter
        total_words_counter[word] +=1
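
As a side note, the same counting can be sketched more compactly with `zip` and `Counter.update`. The toy reviews and labels below are made up for illustration. This variant also uses `split()` with no argument, which drops the empty strings that `split(' ')` produces when spaces repeat; that is an assumption about what is wanted, and it would change the vocabulary compared with the loop above.

```python
from collections import Counter

# Toy stand-ins for `reviews` and `labels`, made up for illustration.
toy_reviews = ['great  great film', 'bad film']
toy_labels = ['POSITIVE', 'NEGATIVE']

toy_positive = Counter()
toy_negative = Counter()
toy_total = Counter()

for review, label in zip(toy_reviews, toy_labels):
    # split() with no argument ignores repeated spaces,
    # so no empty-string "word" is counted
    words = review.split()
    if label == 'POSITIVE':
        toy_positive.update(words)
    else:
        toy_negative.update(words)
    toy_total.update(words)

print(toy_total)
```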

In [8]:
# lets take a look at the most common words.
print('Most common positive words:\n')
positive_words_counter.most_common()

Most common positive words:



[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815),
 ('as', 26308),
 ('with', 23247),
 ('for', 22416),
 ('was', 21917),
 ('film', 20937),
 ('but', 20822),
 ('movie', 19074),
 ('his', 17227),
 ('on', 17008),
 ('you', 16681),
 ('he', 16282),
 ('are', 14807),
 ('not', 14272),
 ('t', 13720),
 ('one', 13655),
 ('have', 12587),
 ('be', 12416),
 ('by', 11997),
 ('all', 11942),
 ('who', 11464),
 ('an', 11294),
 ('at', 11234),
 ('from', 10767),
 ('her', 10474),
 ('they', 9895),
 ('has', 9186),
 ('so', 9154),
 ('like', 9038),
 ('about', 8313),
 ('very', 8305),
 ('out', 8134),
 ('there', 8057),
 ('she', 7779),
 ('what', 7737),
 ('or', 7732),
 ('good', 7720),
 ('more', 7521),
 ('when', 7456),
 ('some', 7441),
 ('if', 7285),
 ('just', 7152),
 ('can', 7001),
 ('story', 6780),
 ('time', 6515),
 ('my', 6488),
 ('g

In [9]:
# lets take a look at the most common words.
print('\nMost common negative words:\n')
negative_words_counter.most_common()


Most common negative words:



[('', 561462),
 ('.', 167538),
 ('the', 163389),
 ('a', 79321),
 ('and', 74385),
 ('of', 69009),
 ('to', 68974),
 ('br', 52637),
 ('is', 50083),
 ('it', 48327),
 ('i', 46880),
 ('in', 43753),
 ('this', 40920),
 ('that', 37615),
 ('s', 31546),
 ('was', 26291),
 ('movie', 24965),
 ('for', 21927),
 ('but', 21781),
 ('with', 20878),
 ('as', 20625),
 ('t', 20361),
 ('film', 19218),
 ('you', 17549),
 ('on', 17192),
 ('not', 16354),
 ('have', 15144),
 ('are', 14623),
 ('be', 14541),
 ('he', 13856),
 ('one', 13134),
 ('they', 13011),
 ('at', 12279),
 ('his', 12147),
 ('all', 12036),
 ('so', 11463),
 ('like', 11238),
 ('there', 10775),
 ('just', 10619),
 ('by', 10549),
 ('or', 10272),
 ('an', 10266),
 ('who', 9969),
 ('from', 9731),
 ('if', 9518),
 ('about', 9061),
 ('out', 8979),
 ('what', 8422),
 ('some', 8306),
 ('no', 8143),
 ('her', 7947),
 ('even', 7687),
 ('can', 7653),
 ('has', 7604),
 ('good', 7423),
 ('bad', 7401),
 ('would', 7036),
 ('up', 6970),
 ('only', 6781),
 ('more', 6730),
 ('

Instead of looking at raw word counts, let's now look at the ratio between a word's positive and negative counts. Raw counts do not really give us what we are looking for: you can see that many of the most common words (such as `the`, `and` and `a`) sit near the top of both the positive and the negative counter. By looking at a ratio instead, we will find the words that appear more often in positive reviews than in negative ones, and vice versa.

This will tell us how many times more often a word is seen in positive reviews than in negative ones. For example, we can imagine that positive reviews use the word "love" more, so its ratio should be larger. Hence:
- Positive words will have a large ratio, bigger than 1
- Negative words will have a small ratio, between 0 and 1
- Words that are neither positive nor negative, but neutral, will have a ratio close to 1

In [10]:
positive_to_negative_ratio = Counter()

for word, count in total_words_counter.most_common():
    if count > 100:
        positive_to_negative_ratio[word] = positive_words_counter[word] / (negative_words_counter[word] + 1) # +1 so we dont divide by 0
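
To make the arithmetic concrete, here is the same ratio computed on a few made-up counts (these numbers are not taken from the dataset).

```python
from collections import Counter

# Made-up counts, chosen only to make the arithmetic easy to follow.
toy_positive = Counter({'love': 30, 'the': 100, 'awful': 1})
toy_negative = Counter({'love': 9, 'the': 99, 'awful': 19})

toy_ratio = {}
for word in toy_positive.keys() | toy_negative.keys():
    # +1 in the denominator avoids dividing by zero for words
    # that never appear in a negative review
    toy_ratio[word] = toy_positive[word] / (toy_negative[word] + 1)

print(toy_ratio)
```

Here `love` comes out at 30 / 10, `the` at 100 / 100 and `awful` at 1 / 20, so the positive word lands above 1, the neutral word at 1 and the negative word close to 0.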

Now, lets take a look at some words...

In [11]:
print(f'positive:negative ratio for the word and: {round(positive_to_negative_ratio["and"],2)}')
print(f'positive:negative ratio for the word best: {round(positive_to_negative_ratio["best"],2)}')
print(f'positive:negative ratio for the word bad: {round(positive_to_negative_ratio["bad"],2)}')

positive:negative ratio for the word and: 1.21
positive:negative ratio for the word best: 2.06
positive:negative ratio for the word bad: 0.26


Okay, but is a score of 2 twice as good as other scores? With the ratios as they are now, it is difficult to compare scores: positive words range from 1 upwards with no limit, while negative words are all squeezed between 0 and 1. So instead, we will do what every computer scientist loves to do, which is to take the log of the numbers.

To find out more about why computer scientists love logs, feel free to watch the Machine Learning lecture series by Kilian Weinberger [here](https://www.youtube.com/watch?v=MrLPzBxG95I&list=PLl8OlHZGYOQ7bkVbuRthEsaLr7bONzbXS).

In [12]:
for word, ratio in positive_to_negative_ratio.most_common():
    positive_to_negative_ratio[word] = np.log(ratio)

Now, let's take a look at the log ratios...

In [13]:
print(f'positive:negative ratio for the word and: {round(positive_to_negative_ratio["and"],2)}')
print(f'positive:negative ratio for the word best: {round(positive_to_negative_ratio["best"],2)}')
print(f'positive:negative ratio for the word bad: {round(positive_to_negative_ratio["bad"],2)}')

positive:negative ratio for the word and: 0.19
positive:negative ratio for the word best: 0.72
positive:negative ratio for the word bad: -1.36


You can see now that:
- positive words have scores above 0
- negative words have scores below 0
- neutral words are centered around 0
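
The reason the log makes scores comparable can be checked on a few made-up ratios: a word that is twice as common in positive reviews and a word that is twice as common in negative reviews end up the same distance from 0, just with opposite signs. The sketch uses `math.log` rather than `np.log` only to stay dependency-free.

```python
import math

# Made-up ratios: one word twice as common in positive reviews,
# one twice as common in negative reviews, and one neutral word.
ratios = {'twice_positive': 2.0, 'twice_negative': 0.5, 'neutral': 1.0}

log_ratios = {word: math.log(ratio) for word, ratio in ratios.items()}

# 2.0 and 0.5 are not the same distance from 1.0,
# but their logs are the same distance from 0.
print(log_ratios)
```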

Now, to close this section, where we wanted to draw a hypothesis from the data, we will take a peek at our ratio data.

In [14]:
positive_to_negative_ratio.most_common()[0:20]

[('edie', 4.6913478822291435),
 ('paulie', 4.07753744390572),
 ('felix', 3.152736022363656),
 ('polanski', 2.8233610476132043),
 ('matthau', 2.80672172860924),
 ('victoria', 2.681021528714291),
 ('mildred', 2.6026896854443837),
 ('gandhi', 2.538973871058276),
 ('flawless', 2.451005098112319),
 ('superbly', 2.26002547857525),
 ('perfection', 2.159484249353372),
 ('astaire', 2.1400661634962708),
 ('captures', 2.038619547159581),
 ('voight', 2.030170492673053),
 ('wonderfully', 2.0218960560332353),
 ('powell', 1.978345424808467),
 ('brosnan', 1.9547990964725592),
 ('lily', 1.9203768470501485),
 ('bakshi', 1.9029851043382795),
 ('lincoln', 1.9014583864844796)]

As we expected, some positive words like `flawless` and `perfection` have high scores, but so does `lincoln`. Interesting.

In [15]:
list(reversed(positive_to_negative_ratio.most_common()))[0:20]

[('boll', -4.969813299576001),
 ('uwe', -4.624972813284271),
 ('seagal', -3.644143560272545),
 ('unwatchable', -3.258096538021482),
 ('stinker', -3.2088254890146994),
 ('mst', -2.9502698994772336),
 ('incoherent', -2.9368917735310576),
 ('unfunny', -2.6922395950755678),
 ('waste', -2.6193845640165536),
 ('blah', -2.5704288232261625),
 ('horrid', -2.4849066497880004),
 ('pointless', -2.4553061800117097),
 ('atrocious', -2.4259083090260445),
 ('redeeming', -2.3682390632154826),
 ('prom', -2.3608540011180215),
 ('drivel', -2.3470368555648795),
 ('lousy', -2.307572634505085),
 ('worst', -2.286987896180378),
 ('laughable', -2.264363880173848),
 ('awful', -2.227194247027435)]

There are some funny negative words, including `lousy` and `unwatchable`. But again, there are some interesting words like `prom`.

## Transforming words into numbers

We now need to prepare our words so that we can feed them into our neural network. To do that, we want to transform them into numbers so we can do the maths of neural networks.

What we want to do for our network is build a dictionary. With this dictionary, we will count each word in our input review, and feed those counts into the network.

As we have already built a `Counter` object that holds every word in our training data, we can now compare each review from our dataset against it and see how often each word occurs per review. This will allow us to feed our reviews into the network whilst maintaining consistency between inputs.

We will begin by building a `vocab`, a set that contains all the words.

In [16]:
vocab = set(total_words_counter.keys())

`vocab` is a `set`, similar to a mathematical set. This means that each word appears in it only once.

Now, let's take a look at how our neural network will look.

![image of our neural network](sentiment_network.png)

You can see that our NN will have:
- one input layer: 
    - This will be the Vocab
    - we will represent this as a np array
- one hidden layer
- one output layer that has one output neuron

In [17]:
layer_0 = np.zeros(shape=(1,len(vocab)))

Let's take a look at the first layer...

In [18]:
layer_0.shape

(1, 74074)

This first layer now has one neuron/input per word in our vocab, with each input being a count of how many times that word occurs in the review. However, to pass the words from a review into this first layer, we need a way to feed a new review in with its words organised in the same order as the first layer in our network.

In [19]:
word_to_index_translator = {}
# lets map each word in our vocab to an index, and capture that as a dictionary
for index, word in enumerate(vocab):
    word_to_index_translator[word] = index

# lets temporarily use a Counter object to peek at our dictionary
# (most_common sorts by value, so this shows the five highest indices)
Counter(word_to_index_translator).most_common(5)

[('offfice', 74073),
 ('airborne', 74072),
 ('abused', 74071),
 ('trip', 74070),
 ('gcse', 74069)]
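
Because `vocab` is a set, the order in which `enumerate` visits the words is not guaranteed to be the same from one Python session to the next, so the indices printed above may differ on a rerun. If stable indices matter (for example, to reuse saved weights), one option is to sort the words first. This is a sketch on a toy vocabulary and a variation on the cell above, not what the notebook does.

```python
# Toy vocabulary, made up for illustration.
toy_vocab = {'film', 'good', 'bad'}

# Sorting gives every word the same index on every run.
toy_word_to_index = {word: index for index, word in enumerate(sorted(toy_vocab))}
print(toy_word_to_index)
```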

Now, let's build a function that can take a new review and produce a vector that matches the input layer.

In [20]:
def input_for_input_layer(review):
    '''Fill the input layer, layer_0, with the word counts of one review.

    layer_0 represents how many times each word occurs in the review.

    Args:
        review (str) : a review for a movie
    Returns:
        None (the global layer_0 is updated in place)
    '''
    global layer_0
    # clear out previous layer 0
    layer_0 *= 0
    for word in review.split(' '):
        # find index location of the word from our vocab
        index_of_word = word_to_index_translator[word]
        # add it to our layer 0
        layer_0[:, index_of_word] += 1
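
A possible variation on the function above returns a fresh vector instead of mutating a global, and skips words that are not in the mapping (a plain dictionary lookup raises `KeyError` for a word outside the vocabulary). The name `review_to_vector` and the toy mapping are introduced here for illustration only.

```python
import numpy as np

# Toy mapping, made up for illustration.
toy_word_to_index = {'bad': 0, 'film': 1, 'good': 2}


def review_to_vector(review, word_to_index):
    """Return a (1, vocab_size) array of word counts for one review."""
    vector = np.zeros(shape=(1, len(word_to_index)))
    for word in review.split(' '):
        index = word_to_index.get(word)
        # words that are not in the vocabulary are skipped
        if index is not None:
            vector[0, index] += 1
    return vector


toy_vector = review_to_vector('good good film unknownword', toy_word_to_index)
print(toy_vector)
```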

Let's test this by feeding it a review.
Before we test it, let's look at layer_0

In [21]:
layer_0

array([[0., 0., 0., ..., 0., 0., 0.]])

In [22]:
input_for_input_layer(reviews[200])
layer_0

array([[34.,  0.,  0., ...,  0.,  0.,  0.]])

Great, it has updated layer_0.

Now, we will build a function that can take a label (e.g. POSITIVE or NEGATIVE) and return either 1 or 0. This is needed because our network works on numbers, not strings.

In [23]:
def translate_label(label):
    '''Converts label to 0 or 1.

    Args:
        label (str) : POSITIVE or NEGATIVE label for a review
    Returns:
        0 : if negative
        1 : if positive
    '''
    if label == 'POSITIVE':
        return 1
    else:
        return 0

Again, let's test this by passing a label into our function.

In [24]:
print(f'testing +ve label: {labels[200]}')
print(f'This is the output from our function: {translate_label(labels[200])}')
print(f'\ntesting -ve label: {labels[1]}')
print(f'This is the output from our function: {translate_label(labels[1])}')

testing +ve label: POSITIVE
This is the output from our function: 1

testing -ve label: NEGATIVE
This is the output from our function: 0


Great, so it works.

Now it is finally time to build our Neural Network!

We will:
- build a basic neural network that has an input layer, hidden layer and an output layer
- we will not be adding non-linearity in our hidden layer
- we will use the same functions we defined above, to build up our training data set
- we will create a vocab from our training data
- we will train over the entire corpus
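
The shape of such a network's forward pass can be sketched as follows. The source of `sentiment_network.py` is not shown in these notes, so everything here is an assumption made for illustration: the tiny layer sizes, the zero-initialised input weights, the random output weights and the sigmoid on the output neuron. The only parts taken from the list above are the three layers, the single output neuron and the absence of a non-linearity in the hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_nodes = 5, 3  # tiny sizes, for illustration only

# Weight shapes follow the pattern (input nodes, hidden nodes)
# and (hidden nodes, 1 output node). Initial values are assumptions.
weights_input_to_hidden = np.zeros((vocab_size, hidden_nodes))
weights_hidden_to_output = rng.normal(0.0, hidden_nodes ** -0.5, (hidden_nodes, 1))


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def forward(layer_0):
    """Forward pass: linear hidden layer, sigmoid on the single output."""
    layer_1 = layer_0 @ weights_input_to_hidden  # no activation here
    layer_2 = sigmoid(layer_1 @ weights_hidden_to_output)
    return layer_2


toy_input = np.array([[2.0, 0.0, 1.0, 0.0, 0.0]])  # word counts for one review
toy_output = forward(toy_input)
print(toy_output.shape, toy_output)
```

With the input weights at zero, the hidden layer is all zeros and the untrained output is exactly 0.5, i.e. the network has no opinion yet.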

In [129]:
import sentiment_network
import importlib
importlib.reload(sentiment_network)

<module 'sentiment_network' from '/Users/abditimer/Documents/development/abdis-ml-handbook/abdis_machine_learning_handbook/abdis_machine_learning_handbook/4_1_dl_basics_implementations/sentiment_network.py'>

In [113]:
mlp = sentiment_network.SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)

processing input...
100%|██████████| 72810/72810 [00:00<00:00, 1627031.34it/s]
100%|██████████| 2/2 [00:00<00:00, 8516.35it/s]

Created a Neural Network with:
- 72810 input nodes
- 10 hidden nodes
- 1 output nodes

Our weights have the following shapes:
input to hidden: (72810, 10)
hidden to output: (10, 1)




In [114]:
mlp.test(reviews[-1000:],labels[-1000:])

Progress:99.9% Speed(reviews/sec):1040. #Correct:500 #Tested:1000 Testing Accuracy:50.0%

In [115]:
mlp.train(reviews[:-1000],labels[:-1000])

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):425.8 #Correct:1251 #Trained:2501 Training Accuracy:50.0%
Progress:20.8% Speed(reviews/sec):425.3 #Correct:2501 #Trained:5001 Training Accuracy:50.0%
Progress:31.2% Speed(reviews/sec):413.7 #Correct:3751 #Trained:7501 Training Accuracy:50.0%
Progress:41.6% Speed(reviews/sec):413.7 #Correct:5001 #Trained:10001 Training Accuracy:50.0%
Progress:52.0% Speed(reviews/sec):416.6 #Correct:6251 #Trained:12501 Training Accuracy:50.0%
Progress:62.5% Speed(reviews/sec):417.6 #Correct:7501 #Trained:15001 Training Accuracy:50.0%
Progress:72.9% Speed(reviews/sec):417.6 #Correct:8751 #Trained:17501 Training Accuracy:50.0%
Progress:83.3% Speed(reviews/sec):417.9 #Correct:10001 #Trained:20001 Training Accuracy:50.0%
Progress:93.7% Speed(reviews/sec):417.7 #Correct:11251 #Trained:22501 Training Accuracy:50.0%
Progress:99.9% Speed(reviews/sec):418.0 #Correct:12000 #Trained:24000 Training Ac

In [130]:
mlp2 = sentiment_network.SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.001)
mlp2.train(reviews[:-1000],labels[:-1000])

processing input...
100%|██████████| 72810/72810 [00:00<00:00, 1624391.75it/s]
100%|██████████| 2/2 [00:00<00:00, 8097.11it/s]


Created a Neural Network with:
- 72810 input nodes
- 10 hidden nodes
- 1 output nodes

Our weights have the following shapes:
input to hidden: (72810, 10)
hidden to output: (10, 1)

Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):334.9 #Correct:1256 #Trained:2501 Training Accuracy:50.2%
Progress:20.8% Speed(reviews/sec):356.6 #Correct:2661 #Trained:5001 Training Accuracy:53.2%
Progress:31.2% Speed(reviews/sec):363.2 #Correct:4056 #Trained:7501 Training Accuracy:54.0%
Progress:41.6% Speed(reviews/sec):364.7 #Correct:5585 #Trained:10001 Training Accuracy:55.8%
Progress:52.0% Speed(reviews/sec):364.2 #Correct:7156 #Trained:12501 Training Accuracy:57.2%
Progress:62.5% Speed(reviews/sec):363.7 #Correct:8765 #Trained:15001 Training Accuracy:58.4%
Progress:72.9% Speed(reviews/sec):362.6 #Correct:10