# Sentiment Analysis with an RNN

In this notebook, you'll implement a recurrent neural network that performs sentiment analysis. 
>An RNN can be more accurate than a strictly feedforward network here because it can use information about the *sequence* of words. 

Here we'll use a dataset of movie reviews, accompanied by sentiment labels: positive or negative.

<img src="rnn_img/reviews_ex.png" width=40%>

### Network Architecture

The architecture for this network is shown below.

<img src="rnn_img/network_diagram.png" width=40%>

>**First, we'll pass in words to an embedding layer.** We need an embedding layer because we have tens of thousands of words, so we need a more efficient representation for our input data than one-hot encoded vectors. You should have seen this before in the Word2Vec lesson. You could train an embedding with the Skip-gram Word2Vec model and use those embeddings as input here. However, it's good enough to just have an embedding layer and let the network learn its own embedding table. *In this case, the embedding layer is for dimensionality reduction, rather than for learning semantic representations.*

>**After input words are passed to an embedding layer, the new embeddings will be passed to LSTM cells.** The LSTM cells will add *recurrent* connections to the network and give us the ability to include information about the *sequence* of words in the movie review data. 

>**Finally, the LSTM outputs will go to a sigmoid output layer.** We're using a sigmoid function because positive and negative reviews are labeled 1 and 0, respectively, and a sigmoid outputs a predicted sentiment value between 0 and 1. 

We don't care about the sigmoid outputs except for the **very last one**; we can ignore the rest. We'll calculate the loss by comparing the output at the last time step and the training label (pos or neg).
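
To make the "last output only" idea concrete, here is a minimal sketch in plain Python. The per-step values in `step_logits` are made up, and binary cross-entropy is an assumption on my part: the text above only says the last output is compared against the label, not which loss is used.

```python
import math

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def last_step_loss(step_logits, label):
    """Binary cross-entropy (assumed loss) on the final time step only.

    step_logits: one pre-sigmoid value per word in the review (hypothetical).
    label: 1 for a positive review, 0 for a negative one.
    """
    p = sigmoid(step_logits[-1])  # every earlier step is ignored
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Made-up network outputs for a four-word review
step_logits = [-0.3, 0.1, 0.8, 2.0]
loss_if_positive = last_step_loss(step_logits, 1)
loss_if_negative = last_step_loss(step_logits, 0)
```

Because the final value is well above zero, the predicted sentiment is close to 1, so the loss is small when the label is positive and larger when it is negative.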

In [1]:
### Imports

import numpy as np
from string import punctuation
from collections import Counter

---
### Load in and visualize the data

In [2]:
with open('sentiment/reviews.txt', 'r') as opins:
    reviews = opins.read()

with open('sentiment/labels.txt', 'r') as label:
    labels = label.read()

In [3]:
print(reviews[0:1000])
print()
print(labels[:27])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

## Data pre-processing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean the text up a bit.

You can see an example of the reviews data above. Here are the processing steps we'll want to take:
>* We'll want to get rid of periods and extraneous punctuation.
>* Also, you might notice that the reviews are delimited with newline characters `\n`. To deal with those, I'm going to split the text into individual reviews using `\n` as the delimiter. 
>* Then I can combine all the reviews back together into one big string.

First, let's remove all punctuation. Then we'll get all the text without the newlines and split it into individual words.

In [4]:
print('Punctuation characters that will be removed:', punctuation)

reviews = reviews.lower()  # convert to lowercase
# keep every character that is not a punctuation mark
all_text = ''.join([char for char in reviews if char not in punctuation])

Punctuation characters that will be removed: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [5]:
print('Here is a sample of our cleaned up data:')
print()
print(all_text[0:1000]) 

Here is a sample of our cleaned up data:

bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   
story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra aud
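
The cleaned sample above still contains the newline that separates the first two reviews, because `\n` is not part of `punctuation`. The remaining steps from the list (split on `\n`, then rejoin) could be sketched as follows. This runs on a toy string rather than the real `reviews.txt`, and names such as `sample` and `reviews_split` are mine, not from the notebook.

```python
from string import punctuation

# Toy stand-in for the file contents: two reviews, one per line
sample = "A great film . Loved it !\nDull plot , weak acting .\n"

# Step 1: lowercase and drop punctuation; '\n' is not in `punctuation`, so it survives
cleaned = ''.join(ch for ch in sample.lower() if ch not in punctuation)

# Step 2: split into individual reviews on the newline delimiter,
# dropping the empty string left behind by the trailing newline
reviews_split = [review for review in cleaned.split('\n') if review.strip()]

# Step 3: join the reviews back into one big string, then split into words
all_text_sample = ' '.join(reviews_split)
words_sample = all_text_sample.split()
```

Dropping empty entries is a choice made for this sketch. With real data, any review that is removed needs its label removed too, or the two lists fall out of step.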

### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.
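
Why integers? An embedding layer is essentially a table with one row per word, and the integer picks the row. The plain-Python sketch below uses a made-up four-row table (the values and the names `embedding_table`, `embed`, and `embed_one_hot` are all hypothetical) to show that indexing gives the same result as multiplying a one-hot vector by the table, only with far less work.

```python
# Hypothetical table: 4 rows, 3-dimensional embeddings (values are made up)
embedding_table = [
    [0.0, 0.0, 0.0],   # row 0: left unused by real words
    [0.2, -0.1, 0.5],  # row 1
    [0.7, 0.3, -0.4],  # row 2
    [-0.6, 0.9, 0.1],  # row 3
]

def embed(word_ids):
    """Look up one embedding row per integer word ID."""
    return [embedding_table[i] for i in word_ids]

def embed_one_hot(word_ids):
    """Same lookup done the slow way: one-hot vector times the table."""
    n_rows, n_dims = len(embedding_table), len(embedding_table[0])
    out = []
    for i in word_ids:
        one_hot = [1.0 if r == i else 0.0 for r in range(n_rows)]
        out.append([
            sum(one_hot[r] * embedding_table[r][d] for r in range(n_rows))
            for d in range(n_dims)
        ])
    return out

vectors = embed([2, 1, 3])  # a three-word "review" encoded as integers
```

With tens of thousands of words, the one-hot version multiplies by a vector that is almost entirely zeros, which is why the integer lookup is preferred.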

> **Exercise:** Now you're going to encode the words with integers. Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers **start at 1, not 0**.
> Also, convert the reviews to integers and store the reviews in a new list called `reviews_ints`. 

In [10]:
words = all_text.split()
word_counts = Counter(words)  # maps each unique word to its number of occurrences

# sort the vocabulary from most frequent to least frequent
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
print(sorted_vocab[:10])  # ten most frequent words

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'it', 'in', 'i']


In [12]:
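# Build the integer-to-word mapping; indices start at 1 so that 0 stays free for padding
int_to_vocab = {index: word for index, word in enumerate(sorted_vocab, 1)}

# Print the nine most frequent words (indices 1 through 9)
for i in range(1, 10):
    print(int_to_vocab[i])

# Invert the mapping to get word-to-integer lookups
vocab_to_int = {word: index for index, word in int_to_vocab.items()}

# Encode the full word list as integers
int_words = [vocab_to_int[word] for word in words]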


the
and
a
of
to
is
br
it
in


In [9]:
print(int_words[:10]) # First ten words in the reviews text converted to integers

[21024, 307, 5, 2, 1049, 206, 7, 2137, 31, 0]
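
The exercise also asks for a per-review list called `reviews_ints`, which the cells above do not build (`int_words` is one flat list over the whole text). One possible way to get there is sketched below on a toy string. The `sample` text and the intermediate names are assumptions for illustration, not the notebook's own solution.

```python
from collections import Counter
from string import punctuation

# Toy stand-in for reviews.txt: one review per line
sample = "Great film . Loved the cast !\nDull film , weak plot .\n"

cleaned = ''.join(ch for ch in sample.lower() if ch not in punctuation)
reviews_split = cleaned.split('\n')

# Vocabulary over all words, most frequent first, integers starting at 1
words = ' '.join(reviews_split).split()
counts = Counter(words)
sorted_vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: index for index, word in enumerate(sorted_vocab, 1)}

# One list of integers per review
reviews_ints = [
    [vocab_to_int[word] for word in review.split()]
    for review in reviews_split
]
```

In this sketch the trailing newline leaves an empty final entry in `reviews_ints`. Whether to drop it depends on how the labels are aligned, so it is kept here rather than silently removed.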
