# Sentiment Analysis with an RNN

In this notebook, you'll implement a recurrent neural network that performs sentiment analysis. Using an RNN rather than a feedfoward network is more accurate since we can include information about the *sequence* of words. Here we'll use a dataset of movie reviews, accompanied by labels.

The architecture for this network is shown below.

<img src="assets/network_diagram.png" width=400px>

Here, we'll pass in words to an embedding layer. We need an embedding layer because we have tens of thousands of words, so we'll need a more efficient representation for our input data than one-hot encoded vectors. You should have seen this before from the word2vec lesson. You can actually train up an embedding with word2vec and use it here. But it's good enough to just have an embedding layer and let the network learn the embedding table on it's own.

From the embedding layer, the new representations will be passed to LSTM cells. These will add recurrent connections to the network so we can include information about the sequence of words in the data. Finally, the LSTM cells will go to a sigmoid output layer here. We're using the sigmoid because we're trying to predict if this text has positive or negative sentiment. The output layer will just be a single unit then, with a sigmoid activation function.

We don't care about the sigmoid outputs except for the very last one, we can ignore the rest. We'll calculate the cost from the output of the last step and the training label.

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
with open('./data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('./data/labels.txt', 'r') as f:
    labels = f.read()

In [3]:
reviews[:2000]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   \nstory of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is tu

## Data preprocessing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can see an example of the reviews data above. We'll want to get rid of those periods. Also, you might notice that the reviews are delimited with newlines `\n`. To deal with those, I'm going to split the text into each review using `\n` as the delimiter. Then I can combined all the reviews back together into one big string.

First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.

In [4]:
from string import punctuation
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [5]:
# A list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}

### Removing stop words

Note on downloading nltk corpus:

nltk is installed in the anaconda version of python but not the default python.  So first activate root environment.
```
source activate root
```

Create folder nltk_data in the home directory.  Then run this on command line to download the corpus

```
python -m nltk.downloader all
```
So now I should be able to download stopwords

In [6]:
from nltk.corpus import stopwords
stopWords = stopwords.words("english")

In [7]:
len(stopWords)

153

In [8]:
'''
for all characters in text, add to list if not in punctuation.  Join with empty string to convert list to string

get a list of reviews by splitting on newline.
Combine all reviews into one string using join on empty space.

Get a list of all words by splitting on all text
'''

'''
This runs through the data twice. Better to walk thru it once.
#all_text = ''.join([c for c in reviews if c not in punctuation])
#all_text = ' '.join([word for word in reviews.split() if word not in stopWords])
'''
     
#all_text = ' '.join([''.join(c for c in word if c not in punctuation) for word in reviews.split() if word not in stopWords])
all_text = ' '.join([ contractions[word] if word in contractions else word for word in reviews.split('[ \t]+') if word not in stopWords])
all_text = ''.join([c for c in all_text if c not in punctuation])

In [9]:
reviews = all_text.split('\n')
all_text = ' '.join(reviews)
words = all_text.split()

In [10]:
len(all_text)

33351075

In [11]:
'''
There are 33,678,267 reviews
'''
len(reviews)

25001

In [12]:
words[:10]

['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the']

### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

> **Exercise:** Now you're going to encode the words with integers. Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers **start at 1, not 0**.
> Also, convert the reviews to integers and store the reviews in a new list called `reviews_ints`. 

In [13]:
# Create your dictionary that maps vocab words to integers here
from collections import Counter
counts = Counter(words) #dictionary where words are keys, counts are values

In [14]:
vocab_sorted = sorted(counts, key=counts.get, reverse=True) #list with most common words first

In [15]:
vocab_to_int = { word: ii for ii, word in enumerate(vocab_sorted, start=1)}

In [16]:
int_to_vocab = { ii: word for ii, word in enumerate(vocab_sorted, start=1)}

In [17]:
# Convert the reviews to integers, same shape as reviews list, but with integers
'''
If sep is not specified or is None, a different splitting algorithm is applied: 
runs of consecutive whitespace are regarded as a single separator, 
and the result will contain no empty strings at the start or end if 
the string has leading or trailing whitespace. 
Consequently, splitting an empty string or a string consisting of 
just whitespace with a None separator returns [].
'''
reviews_ints = []
for review in reviews:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])

### Encoding the labels

Our labels are "positive" or "negative". To use these labels in our network, we need to convert them to 0 and 1.

> **Exercise:** Convert labels from `positive` and `negative` to 1 and 0, respectively.

In [18]:
# Convert labels to 1s and 0s for 'positive' and 'negative'
labels_str = labels

labels_l = labels_str.split('\n')

labels = [1 if label == 'positive' else 0 for label in labels_str.split('\n')]
    

In [19]:
labels_str.split('\n')[0:4]

['positive', 'negative', 'positive', 'negative']

In [20]:
labels[0:4]

[1, 0, 1, 0]

If you built `labels` correctly, you should see the next output.

In [21]:
from collections import Counter
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 1
Maximum review length: 2514


In [22]:
review_lens[0]

1

Okay, a couple issues here. We seem to have one review with zero length. And, the maximum review length is way too many steps for our RNN. Let's truncate to 200 steps. For reviews shorter than 200, we'll pad with 0s. For reviews longer than 200, we can truncate them to the first 200 characters.

> **Exercise:** First, remove the review with zero length from the `reviews_ints` list.

In [23]:
# Filter out that review with 0 length
'''
I need to remove any labels at the same position for any features (reviews) removed.
So I need the indices where reviews are zero length
'''
zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) == 0 ]

In [24]:
zero_idx

[25000]

In [25]:
reviews_ints = [review for ii, review in enumerate(reviews_ints) if ii not in zero_idx]

In [26]:
len(reviews_ints)

25000

In [27]:
labels = np.array( [label for ii, label in enumerate(labels) if ii not in zero_idx ])

In [28]:
labels = np.array(labels)

In [29]:
len(labels)

25000

In [88]:
labels.shape

(25000,)

> **Exercise:** Now, create an array `features` that contains the data we'll pass to the network. The data should come from `review_ints`, since we want to feed integers to the network. Each row should be 200 elements long. For reviews shorter than 200 words, left pad with 0s. That is, if the review is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`. For reviews longer than 200, use on the first 200 words as the feature vector.

This isn't trivial and there are a bunch of ways to do this. But, if you're going to be building your own deep learning networks, you're going to have to get used to preparing your data.



In [30]:
seq_len = 10
tmp_review = [1,2,3] #len(review) is 3
zero_pad = [0] * (seq_len - len(tmp_review))
zero_pad.extend(tmp_review)

In [31]:
zero_pad

[0, 0, 0, 0, 0, 0, 0, 1, 2, 3]

In [32]:
len(zero_pad)

10

In [33]:
seq_len = 200
features = []
for review in reviews_ints:
    if len(review) >= seq_len:
        #print("review >= ", str(seq_len))
        features.append(review[0:seq_len])
    else:
        #print("review len ", str(len(review)), " < ", str(seq_len))
        zero_pad = [0] * (seq_len - len(review))
        padded = zero_pad + review
        #print(padded)
        features.append(padded)

In [34]:
len(features[0])

200

In [35]:
features = np.array(features)

In [36]:
features[:10,:100]

array([[    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0, 21844,   308,     6,
            3,  1050,   207,     8,  2142,    32,     1,   171,    57,
           15,    49,    81,  5791,    44,   382,   110,   140,    15,
         5215,    60,   154,     9,     1,  5017,  5855,   475,    71,
            5,   260,    12, 21844,   308,    13,  1978,     6,    74,
         2406],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     

If you build features correctly, it should look like that cell output below.

In [13]:
features[:10,:100]

array([[    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0, 21282,   308,     6,
            3,  1050,   207,     8,  2143,    32,     1,   171,    57,
           15,    49,    81,  5832,    44,   382,   110,   140,    15,
         5236,    60,   154,     9,     1,  5014,  5899,   475,    71,
            5,   260,    12, 21282,   308,    13,  1981,     6,    74,
         2396],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     

## Training, Validation, Test



With our data in nice shape, we'll split it into training, validation, and test sets.

> **Exercise:** Create the training, validation, and test sets here. You'll need to create sets for the features and the labels, `train_x` and `train_y` for example. Define a split fraction, `split_frac` as the fraction of data to keep in the training set. Usually this is set to 0.8 or 0.9. The rest of the data will be split in half to create the validation and testing data.

In [37]:
from sklearn.model_selection import train_test_split
split_frac = 0.8

train_x, val_x, train_y, val_y = train_test_split(features, labels, test_size=0.2, random_state=0)

idx_half = len(val_y)//2

val_x, test_x = val_x[0:idx_half], val_x[idx_half:]
val_y, test_y = val_y[0:idx_half], val_y[idx_half:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
('Train set: \t\t(20000, 200)', '\nValidation set: \t(2500, 200)', '\nTest set: \t\t(2500, 200)')


With train, validation, and text fractions of 0.8, 0.1, 0.1, the final shapes should look like:
```
                    Feature Shapes:
Train set: 		 (20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		  (2500, 200)
```

## Build the graph

Here, we'll build the graph. First up, defining the hyperparameters.

* `lstm_size`: Number of units in the hidden layers in the LSTM cells. Usually larger is better performance wise. Common values are 128, 256, 512, etc.
* `lstm_layers`: Number of LSTM layers in the network. I'd start with 1, then add more if I'm underfitting.
* `batch_size`: The number of reviews to feed the network in one training pass. Typically this should be set as high as you can go without running out of memory.
* `learning_rate`: Learning rate

In [38]:
lstm_size = 512
lstm_layers = 1
batch_size = 64#500
learning_rate = 0.001

For the network itself, we'll be passing in our 200 element long review vectors. Each batch will be `batch_size` vectors. We'll also be using dropout on the LSTM layer, so we'll make a placeholder for the keep probability.

> **Exercise:** Create the `inputs_`, `labels_`, and drop out `keep_prob` placeholders using `tf.placeholder`. `labels_` needs to be two-dimensional to work with some functions later.  Since `keep_prob` is a scalar (a 0-dimensional tensor), you shouldn't provide a size to `tf.placeholder`.

In [39]:
n_words = len(vocab_to_int) + 1 # Adding 1 because we use 0's for padding, dictionary started at 1

# Create the graph object
tf.reset_default_graph()
graph = tf.Graph()

# Add nodes to the graph
with graph.as_default():
    inputs_ = tf.placeholder(dtype=tf.int32, shape=[None, seq_len], name='inputs')
    labels_ = tf.placeholder(dtype=tf.int32, shape=[None, 1], name='labels')
    keep_prob = tf.placeholder(dtype=tf.float32, name='keep_prob')

### Embedding

Now we'll add an embedding layer. We need to do this because there are 74000 words in our vocabulary. It is massively inefficient to one-hot encode our classes here. You should remember dealing with this problem from the word2vec lesson. Instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table. You could train an embedding layer using word2vec, then load it here. But, it's fine to just make a new layer and let the network learn the weights.

> **Exercise:** Create the embedding lookup matrix as a `tf.Variable`. Use that embedding matrix to get the embedded vectors to pass to the LSTM cell with [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup). This function takes the embedding matrix and an input tensor, such as the review vectors. Then, it'll return another tensor with the embedded vectors. So, if the embedding layer has 200 units, the function will return a tensor with size [batch_size, 200].



### Use Global Vector of words to set embedding layer

In [40]:
'''
Note: I'll try setting the vectors to word2vec or GloVe vectors.
embedding is the weight matrix that converts from the one-hot encoded input words
into vectorized representations of the words. 

Since the input is one-hot encoded, for each word (each row of the input),
only one column has '1' and the others are all zero.
So only one row of the weight matrix is multiplied by 1 for a given word.

So embedding_lookup saves time by looking up each row of the weight matrix
instead of multiplying all the other rows of the weight matrix by zero.

According to the Stanford NLP course, embedding sizes of 200 to 300 work best.
There is little improvement when going past 300.

I downloaded the glove weights from https://github.com/stanfordnlp/GloVe
I'm using glove.6B.300d.txt because it has vectors of length 300 per word.
Each line is the word, followed by the vector values, separated by a space.

I'll set a dictionary to lookup the GloVe vector based on the word.
key is the word as a string, value is the word vector as a list of floats
'''
glove_file = 'data/glove.6B.300d.txt'
glove_n_symbols_str = ! wc -l < {glove_file} #number of lines is number of words in the GloVe file

In [41]:
glove_n_symbols = int(glove_n_symbols_str[0].strip())
glove_n_symbols

400000

In [42]:
embedding_dim = 300

In [43]:
glove_index_dict = {} #key is word, value is index (row) in glove_embedding_weights matrix
glove_embedding_weights = np.empty((glove_n_symbols, embedding_dim)) #each row is one word's vector representation
with open(glove_file, 'r') as file_reader:
    idx = 0
    for line in file_reader:
        tokens = line.strip().split()
        word = tokens[0]
        glove_index_dict[word] = idx
        word_lower = word.lower()
        if word_lower != word: #save lower case version of word with the same vector representation
            glove_index_dict[word_lower] = idx
        glove_embedding_weights[idx,:] = np.array([float(v) for v in tokens[1:]])
        idx += 1

In [44]:
'''
create embedding for the words that exist in the data
'''
np.random.seed(0)
shape = (n_words, embedding_dim)
scale = glove_embedding_weights.std()*np.sqrt(12)/2 # uniform and not normal
embedding_glove = np.random.uniform(low=-scale, high=scale, size=shape)

In [45]:
c = 0
for word, i in vocab_to_int.items():
    glove_idx = glove_index_dict.get(word, glove_index_dict.get(word.lower()))
    if glove_idx is not None:
        embedding_glove[i,:] = glove_embedding_weights[glove_idx,:]
        c+=1
        
print("Number of words found in GloVe and in the data {}".format(c))

Number of words found in GloVe and in the data 59629


### Option 1: use GloVe to pre-set embedding layer

In [46]:
#Set the tensor with the pre-trained GloVe vectors
with graph.as_default():
    embedding = tf.Variable(embedding_glove, dtype=tf.float32, name='embedding')    
    embed = tf.nn.embedding_lookup(embedding, inputs_)

### Option 2: using random weights for embedding layer

For comparison, I'll see how the model does with random weight initialization

In [59]:
# Size of the embedding vectors (number of units in the embedding layer)
embed_size = 300 

with graph.as_default():
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)

### LSTM cell

<img src="assets/network_diagram.png" width=400px>

Next, we'll create our LSTM cells to use in the recurrent network ([TensorFlow documentation](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn)). Here we are just defining what the cells look like. This isn't actually building the graph, just defining the type of cells we want in our graph.

To create a basic LSTM cell for the graph, you'll want to use `tf.contrib.rnn.BasicLSTMCell`. Looking at the function documentation:

> **Exercise:** Below, use `tf.contrib.rnn.BasicLSTMCell` to create an LSTM cell. Then, add drop out to it with `tf.contrib.rnn.DropoutWrapper`. Finally, create multiple LSTM layers with `tf.contrib.rnn.MultiRNNCell`.

Here is [a tutorial on building RNNs](https://www.tensorflow.org/tutorials/recurrent) that will help you out.


In [47]:
 '''
    Starting with tensorflow 1.1, MultiRNNCell requires us to build a new cell object
    for each cell in the list, so we need a function to create an individual cell,
    and we'll call it in a list comprehension (a loop) once for each unique cell that we want
    
    The lstm's num_units are num of hidden units for each of the gates/pathways within the lstm;
    it's also the size of the tensor that is output from the lstm.

'''

def make_cell(lstm_size, keep_prob):
    '''
    Use this to create a single lstm cell, which represents one layer of the RNN.
    This can be used for the encoding and decoding layers.
    '''
    lstm = tf.contrib.rnn.LSTMCell(num_units=lstm_size,
                                       initializer=tf.random_uniform_initializer(minval=-0.1,
                                                                                 maxval=0.1,
                                                                                 seed=0))
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    return drop

In [48]:
with graph.as_default():
    
    # Stack up multiple LSTM layers, for deep learning
    cell = tf.contrib.rnn.MultiRNNCell([make_cell(lstm_size, keep_prob) for _ in range(lstm_layers)])
    
    
    '''
    The initial state is the "memory" that gets modified by the gates within each lstm cell.
    '''
    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)

### RNN forward pass

<img src="assets/network_diagram.png" width=400px>

Now we need to actually run the data through the RNN nodes. You can use [`tf.nn.dynamic_rnn`](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) to do this. You'd pass in the RNN cell you created (our multiple layered LSTM `cell` for instance), and the inputs to the network.

```
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state)
```

Above I created an initial state, `initial_state`, to pass to the RNN. This is the cell state that is passed between the hidden layers in successive time steps. `tf.nn.dynamic_rnn` takes care of most of the work for us. We pass in our cell and the input to the cell, then it does the unrolling and everything else for us. It returns outputs for each time step and the final_state of the hidden layer.

> **Exercise:** Use `tf.nn.dynamic_rnn` to add the forward pass through the RNN. Remember that we're actually passing in vectors from the embedding layer, `embed`.



In [49]:
'''
outputs is a list of the outputs as each word is fed into the RNN.
Since we set the input to have 200 words, it should have 200 outputs.
'''
with graph.as_default():
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                             initial_state=initial_state)

### Output

We only care about the final output, we'll be using that as our sentiment prediction. So we need to grab the last output with `outputs[:, -1]`, the calculate the cost from that and `labels_`.

In [50]:
with graph.as_default():
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

### Validation accuracy

Here we can add a few nodes to calculate the accuracy which we'll use in the validation pass.

In [51]:
with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_) #true if rounded pred equals label
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32)) #true = 1, false = 0

### Batching

This is a simple function for returning batches from our data. First it removes data such that we only have full batches. Then it iterates through the `x` and `y` arrays and returns slices out of those arrays with size `[batch_size]`.

In [52]:
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]

## Training

Below is the typical training code. If you want to do this yourself, feel free to delete all this code and implement it yourself. Before you run this, make sure the `checkpoints` directory exists.

In [53]:
epochs = 3

with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration%5==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%25==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
    saver.save(sess, "checkpoints/sentiment.ckpt")

('Epoch: 0/3', 'Iteration: 5', 'Train loss: 0.228')
('Epoch: 0/3', 'Iteration: 10', 'Train loss: 0.289')
('Epoch: 0/3', 'Iteration: 15', 'Train loss: 0.271')
('Epoch: 0/3', 'Iteration: 20', 'Train loss: 0.216')
('Epoch: 0/3', 'Iteration: 25', 'Train loss: 0.247')
Val acc: 0.532
('Epoch: 0/3', 'Iteration: 30', 'Train loss: 0.263')
('Epoch: 0/3', 'Iteration: 35', 'Train loss: 0.216')
('Epoch: 0/3', 'Iteration: 40', 'Train loss: 0.253')
('Epoch: 0/3', 'Iteration: 45', 'Train loss: 0.244')
('Epoch: 0/3', 'Iteration: 50', 'Train loss: 0.225')
Val acc: 0.596
('Epoch: 0/3', 'Iteration: 55', 'Train loss: 0.226')
('Epoch: 0/3', 'Iteration: 60', 'Train loss: 0.203')
('Epoch: 0/3', 'Iteration: 65', 'Train loss: 0.246')
('Epoch: 0/3', 'Iteration: 70', 'Train loss: 0.257')
('Epoch: 0/3', 'Iteration: 75', 'Train loss: 0.253')
Val acc: 0.581
('Epoch: 0/3', 'Iteration: 80', 'Train loss: 0.243')
('Epoch: 0/3', 'Iteration: 85', 'Train loss: 0.236')
('Epoch: 0/3', 'Iteration: 90', 'Train loss: 0.226')
('

Val acc: 0.837
('Epoch: 2/3', 'Iteration: 730', 'Train loss: 0.044')
('Epoch: 2/3', 'Iteration: 735', 'Train loss: 0.023')
('Epoch: 2/3', 'Iteration: 740', 'Train loss: 0.053')
('Epoch: 2/3', 'Iteration: 745', 'Train loss: 0.025')
('Epoch: 2/3', 'Iteration: 750', 'Train loss: 0.044')
Val acc: 0.859
('Epoch: 2/3', 'Iteration: 755', 'Train loss: 0.067')
('Epoch: 2/3', 'Iteration: 760', 'Train loss: 0.066')
('Epoch: 2/3', 'Iteration: 765', 'Train loss: 0.035')
('Epoch: 2/3', 'Iteration: 770', 'Train loss: 0.035')
('Epoch: 2/3', 'Iteration: 775', 'Train loss: 0.044')
Val acc: 0.857
('Epoch: 2/3', 'Iteration: 780', 'Train loss: 0.076')
('Epoch: 2/3', 'Iteration: 785', 'Train loss: 0.032')
('Epoch: 2/3', 'Iteration: 790', 'Train loss: 0.020')
('Epoch: 2/3', 'Iteration: 795', 'Train loss: 0.050')
('Epoch: 2/3', 'Iteration: 800', 'Train loss: 0.017')
Val acc: 0.858
('Epoch: 2/3', 'Iteration: 805', 'Train loss: 0.050')
('Epoch: 2/3', 'Iteration: 810', 'Train loss: 0.043')
('Epoch: 2/3', 'Iterat

Trial 5: 1 lstm layer does just as well as 2, and runs faster.  3 epochs is enough. LSTM size 512

```
Val acc: 0.853
('Epoch: 2/3', 'Iteration: 930', 'Train loss: 0.019')
('Epoch: 2/3', 'Iteration: 935', 'Train loss: 0.045')
```

Trial 5
Removed stop words and replaced any remaining contractions with full words
Used LSTM size 512 and 2 lstm layers

Not much difference after removing stopwords and replacing contractions.
It reaches a peak of 0.854 validation accuracy at epoch 3 and stays around 0.83 and 0.84 afterwards (for 10 epochs)
```
Val acc: 0.854
('Epoch: 3/10', 'Iteration: 1005', 'Train loss: 0.053')
('Epoch: 3/10', 'Iteration: 1010', 'Train loss: 0.014')
('Epoch: 3/10', 'Iteration: 1015', 'Train loss: 0.057')
('Epoch: 3/10', 'Iteration: 1020', 'Train loss: 0.012')
('Epoch: 3/10', 'Iteration: 1025', 'Train loss: 0.022')
Val acc: 0.851
```

Trial 4

Used 3 lstm layers and GloVe embedding.
Mostly no improvement over using 2 lstm layers, and it trains much more slowly.

```
Val acc: 0.847
('Epoch: 3/10', 'Iteration: 1055', 'Train loss: 0.087')
('Epoch: 3/10', 'Iteration: 1060', 'Train loss: 0.061')
('Epoch: 3/10', 'Iteration: 1065', 'Train loss: 0.036')
('Epoch: 3/10', 'Iteration: 1070', 'Train loss: 0.088')
('Epoch: 3/10', 'Iteration: 1075', 'Train loss: 0.052')
Val acc: 0.853
```

Trial 3

Used 2 lstm layers and GLoVe embedding weights.  It hovers around 0.84 most of the time after epoch 2.
So 2 lstm layers is slightly better than 1.
```
('Epoch: 2/10', 'Iteration: 730', 'Train loss: 0.082')
('Epoch: 2/10', 'Iteration: 735', 'Train loss: 0.049')
('Epoch: 2/10', 'Iteration: 740', 'Train loss: 0.090')
('Epoch: 2/10', 'Iteration: 745', 'Train loss: 0.037')
('Epoch: 2/10', 'Iteration: 750', 'Train loss: 0.065')
Val acc: 0.851
```

Trial 2:
I used random weights for embedding.  It took longer to reach 0.80 validation accuracy, and was at most 0.82
So using GloVe weights performs better.

Used 1 lstm layer
```
('Epoch: 7/10', 'Iteration: 2455', 'Train loss: 0.001')
('Epoch: 7/10', 'Iteration: 2460', 'Train loss: 0.001')
('Epoch: 7/10', 'Iteration: 2465', 'Train loss: 0.008')
('Epoch: 7/10', 'Iteration: 2470', 'Train loss: 0.002')
('Epoch: 7/10', 'Iteration: 2475', 'Train loss: 0.015')
Val acc: 0.821
```


Trial 1: used GloVe for embedding.
By epoch 2, validation accuracy reached its peak around .84 and stayed around .83 afterwards.

1 lstm layer
GloVe vectors for embedding

```
Val acc: 0.848
('Epoch: 2/20', 'Iteration: 630', 'Train loss: 0.068')
('Epoch: 2/20', 'Iteration: 635', 'Train loss: 0.110')
('Epoch: 2/20', 'Iteration: 640', 'Train loss: 0.047')
('Epoch: 2/20', 'Iteration: 645', 'Train loss: 0.063')
('Epoch: 2/20', 'Iteration: 650', 'Train loss: 0.039')
Val acc: 0.849

```

## Testing

In [54]:
test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))

INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt
Test accuracy: 0.856


## Prediction

### I'm using a review about the tv show "The Good Place". 
First I'll try a positive review

In [143]:
#sample_review = "I'm only 4 episodes in, but it's already my favorite comedy on television right now. The premise is that the characters are in some \"Good place\" afterlife, but it's not quite endless bliss--in fact, the show wants you to think about what that phrase really means--are you endlessly happy because you are with a bunch of other do-gooder people that make you feel inferior? Are you happy eating frozen yogurt everyday? Or maybe, just maybe, endless bliss is a journey, kind of like life. Meanwhile, you're getting this while listening to snappy dialog and interesting and different characters. ('Toughen up, nerd' was one of Kristen Bell's lines from ep. 4.) Seriously, this is not your average comedy."
sample_review = "Just watched the double-episode pilot of The Good Place and I would say it's a Good Comedy. First of all the premise is original. Another newcomer of the season Kevin Can Wait is the epitome of family sitcoms. This is almost in every way new. Only downside: it would be more fitting for a movie than a (long-running?) series. At some point might seem repetitive or loose its initial spark. Second, it seems like it has high production values, I mean a wide set and good visuals. All bright and cheerful. The script holds a few surprises to keep you interested and the blunt editing from afterlife to life and back adds to the entertainment. I like Kristen Bell as the main character. The part suits her. The otherwise great Ted Danson is OK in this one, but I think the writers have to improve his lines. Overall: Because of its very distinct premise, some might love it, some might hate it. Personally I find it entertaining enough to continue watching. "
sample_review

"Just watched the double-episode pilot of The Good Place and I would say it's a Good Comedy. First of all the premise is original. Another newcomer of the season Kevin Can Wait is the epitome of family sitcoms. This is almost in every way new. Only downside: it would be more fitting for a movie than a (long-running?) series. At some point might seem repetitive or loose its initial spark. Second, it seems like it has high production values, I mean a wide set and good visuals. All bright and cheerful. The script holds a few surprises to keep you interested and the blunt editing from afterlife to life and back adds to the entertainment. I like Kristen Bell as the main character. The part suits her. The otherwise great Ted Danson is OK in this one, but I think the writers have to improve his lines. Overall: Because of its very distinct premise, some might love it, some might hate it. Personally I find it entertaining enough to continue watching. "

In [144]:
sample_label = np.zeros(shape=(batch_size,1))
sample_label[0,0] = 1 #positive review
sample_label.shape

(64, 1)

In [145]:
def prep_feature(text, seq_len=200, batch_size=64):
    text = ' '.join([ contractions[word] if word in contractions else word for word in text.split('[ \t]+') if word not in stopWords])
    print("remove stop words and replace contractions", text)
    text = ''.join([c for c in text if c not in punctuation])
    print("remove punctuation", text)
    text_int = [vocab_to_int.get(word,0) for word in text.split()]
    print("text to int", text_int)
    if len(text_int) >= seq_len:
        text_int = text_int[0:seq_len]
    else:
        zero_pad = [0] * (seq_len - len(text_int))
        text_int = zero_pad + text_int
    print("text padded ", text_int)
    text_int = np.array(text_int).reshape([1,-1])
    
    feature_matrix = np.zeros(shape=(batch_size,seq_len))
    feature_matrix[0,:] = text_int
    return feature_matrix

In [146]:
sample_feature = prep_feature(sample_review)

('remove stop words and replace contractions', "Just watched the double-episode pilot of The Good Place and I would say it's a Good Comedy. First of all the premise is original. Another newcomer of the season Kevin Can Wait is the epitome of family sitcoms. This is almost in every way new. Only downside: it would be more fitting for a movie than a (long-running?) series. At some point might seem repetitive or loose its initial spark. Second, it seems like it has high production values, I mean a wide set and good visuals. All bright and cheerful. The script holds a few surprises to keep you interested and the blunt editing from afterlife to life and back adds to the entertainment. I like Kristen Bell as the main character. The part suits her. The otherwise great Ted Danson is OK in this one, but I think the writers have to improve his lines. Overall: Because of its very distinct premise, some might love it, some might hate it. Personally I find it entertaining enough to continue watchin

In [147]:
sample_feature.shape

(64, 200)

In [148]:
sample_label.shape

(64, 1)

In [149]:
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(64, tf.float32)) #batch size, just one review
    feed = {inputs_: sample_feature,
            labels_: sample_label,
            keep_prob: 1,
            initial_state: test_state}
    predictions_out = sess.run([predictions], feed_dict=feed)

INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt


In [151]:
predictions_out[0][0]

array([ 0.80797893], dtype=float32)

In this sample, the prediction was 0.81. Since predictions 0.5 or greater are labeled as positive reviews, this review was labeled as positive.

### Try a negative review

In [152]:
sample_review = "When I saw this show advertised it looked good so I thought I'd give it a go. After three episodes, I couldn't watch it any more. Its just not funny. The main character played by Kristen Bell is so unlikeable and IMHO its William Jackson Harper who plays her matched partner who should have the lead. English actress Jameela Jamil who plays Bell's neighbour might look good but her character is annoying and you can tell she doesn't do much acting. Considering I watched The Good Place with a great deal of anticipation, to find out it was as funny as an emergency appointment at the dentist was disappointing. If you want to watch a new comedy that's genuinely funny, avoid this one. They may be in the good place but they're not in a funny place."
sample_review

"When I saw this show advertised it looked good so I thought I'd give it a go. After three episodes, I couldn't watch it any more. Its just not funny. The main character played by Kristen Bell is so unlikeable and IMHO its William Jackson Harper who plays her matched partner who should have the lead. English actress Jameela Jamil who plays Bell's neighbour might look good but her character is annoying and you can tell she doesn't do much acting. Considering I watched The Good Place with a great deal of anticipation, to find out it was as funny as an emergency appointment at the dentist was disappointing. If you want to watch a new comedy that's genuinely funny, avoid this one. They may be in the good place but they're not in a funny place."

In [153]:
sample_label = np.zeros(shape=(batch_size,1))
sample_label[0,0] = 0 #negative review
sample_label.shape

(64, 1)

In [154]:
sample_feature = prep_feature(sample_review)

('remove stop words and replace contractions', "When I saw this show advertised it looked good so I thought I'd give it a go. After three episodes, I couldn't watch it any more. Its just not funny. The main character played by Kristen Bell is so unlikeable and IMHO its William Jackson Harper who plays her matched partner who should have the lead. English actress Jameela Jamil who plays Bell's neighbour might look good but her character is annoying and you can tell she doesn't do much acting. Considering I watched The Good Place with a great deal of anticipation, to find out it was as funny as an emergency appointment at the dentist was disappointing. If you want to watch a new comedy that's genuinely funny, avoid this one. They may be in the good place but they're not in a funny place.")
('remove punctuation', 'When I saw this show advertised it looked good so I thought Id give it a go After three episodes I couldnt watch it any more Its just not funny The main character played by Kris

In [155]:
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(64, tf.float32)) #batch size, just one review
    feed = {inputs_: sample_feature,
            labels_: sample_label,
            keep_prob: 1,
            initial_state: test_state}
    predictions_out = sess.run([predictions], feed_dict=feed)

INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt


In [156]:
predictions_out[0][0]

array([ 0.44261706], dtype=float32)

Since the prediction was 0.44, it's rounded to 0, or a negative review.