# Sentiment Analysis with an RNN

In this notebook, we will implement a recurrent neural network that performs sentiment analysis. Using an RNN rather than a feedfoward network is more accurate since we can include information about the sequence of words. Here we'll use a dataset of movie reviews, accompanied by labels.
The architecture for this network is shown below.

![Imgur](http://i.imgur.com/1elbD2U.png?1)


Here, we'll pass in words to an embedding layer. We need an embedding layer because we have tens of thousands of words, so we'll need a more efficient representation for our input data than one-hot encoded vectors. You should have seen this before from the word2vec lesson. You can actually train up an embedding with word2vec and use it here. But it's good enough to just have an embedding layer and let the network learn the embedding table on it's own.
From the embedding layer, the new representations will be passed to LSTM cells. These will add recurrent connections to the network so we can include information about the sequence of words in the data. Finally, the LSTM cells will go to a sigmoid output layer here. We're using the sigmoid because we're trying to predict if this text has positive or negative sentiment. The output layer will just be a single unit then, with a sigmoid activation function.
We don't care about the sigmoid outputs except for the very last one, we can ignore the rest. We'll calculate the cost from the output of the last step and the training label.

### Load Data from the binary files
Highly recommeded to look at the Data_Provider notebook attached in the same folder. 

In [1]:
import numpy as np

train_x = np.load("train_x.npy")
train_y = np.load("train_y.npy")

val_x = np.load("val_x.npy")
val_y = np.load("val_y.npy")

test_x = np.load("test_x.npy")
test_y = np.load("test_y.npy")

import pickle

vocab_to_int = pickle.load(open('vocab_to_int','rb'))

In [2]:
def get_batches(x, y, batch_size=100):    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]


### Create the input, output placeholders
First, as always, start by defining some input data placeholders. Give you a place to anchor your ideas..

input\_ : [batch\_size, seq\_length]

output\_ : [batch\_size, 1] 

The 1 denotes a one dimensional value describing if it is a positive (value=1) or negative (value=0) sentiment. Also, I suppose the batch_size placeholder is something I haven't used before in CNNs or regular dense architectures, but is required for LSTMs.

In [3]:
import tensorflow as tf
input_ = tf.placeholder(tf.int32, shape=[None, None])
output_ = tf.placeholder(tf.int32, shape=[None])
batch_size_ = tf.placeholder(tf.int32)
dropout_ = tf.placeholder(tf.float32)

In [4]:
n_words = len(vocab_to_int) # Evaluate the total num of words we'll have
seq_length = 200 # The length of the timestep we will be using
embed_size = 300 # The size of the embedding vector for each word

### Create the embedding layer, with the embedding lookup method

In [5]:
embedding_matrix = tf.Variable(tf.truncated_normal(shape=[n_words, embed_size], mean=0., stddev=0.1))

In [6]:
embedding_vector = tf.nn.embedding_lookup(embedding_matrix, input_)

Again, the shape of the embedding_vector shape is:  [None, None, 300] (batch_size, seq_length, embed_size). I.e. it effectively takes each word and projects that in a vector of size 'embed_size'.

### Create the LSTM Cell, apply dropout, and create a multi-layered unit
Very standard way of creating a multi-layered LSTM architecture, also using dropout.

In [7]:
lstm = tf.contrib.rnn.BasicLSTMCell(num_units=100)

In [8]:
dropout_lstm = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=dropout_) 

In [9]:
num_lstm_layers = 2
cell = tf.contrib.rnn.MultiRNNCell([dropout_lstm]*num_lstm_layers)

### Use dynamic_rnn to perform dynamic time unfolding. 
This also requires that we create an initial_state (which we will set to zeros). At the end of this, we collect the outputs and final_state from the dynamic_rnn method. These values: *outputs* and *final_state* are quite interesting. It is worth checking them out in the original docs. A little bit of additional information is listed here: 
https://github.com/apiltamang/tensorflow_rtp_materials/blob/master/week-7/Tensorflow_BasicLSTMCell_and_Embeddings.ipynb. 

In order to really understand what they [outputs,final_states] are, look at the method: step(..) in the notebook: https://github.com/apiltamang/tensorflow_rtp_materials/blob/master/week-7/Vanilla_LSTM_Using_Embeddings.ipynb. What they are is a tuple containing the exposed state ($s_t$) and internal state ($c_t$)

In [10]:
initial_state = cell.zero_state(batch_size_, tf.float32)

In [11]:
outputs, final_state = tf.nn.dynamic_rnn(cell, embedding_vector, initial_state=initial_state, dtype=tf.float32)

### Output: We'll only keep the output from the final time-step. 
The outputs vector contains the encoded information from all batches, across all time-steps, projected in the size of the LSTM layer, i.e (batch_size, time_step, num_units). Only get the last time unit.

In [12]:
end_time_state = outputs[:,-1,:] # (batch_size, num_units)

### Connect LSTM to a Dense layer
Connect the output vector from LSTM to a dense layer, and squeeze it to a value between 0 and 1 using the sigmoid activation. Because this problem has only two labels (positive, negative) we can afford to have a single unit for the dense layer.


In [13]:
predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
cost = tf.losses.mean_squared_error(output_, predictions)

Massage the targets (or labels) vectors so that it is a two dimensional tensor. At the data source-level, the targets is a 1-D array, e.g. [1 0 0 0 1 0 1 0 1 0 0 1 ... 0 0 1]. Instead reshape it so that it looks as follows:
[ [1]
  [0]
  [0]
  .
  .
  .
  [0]
  [1]
]. We do this to cast it to the output placeholder (output\_) defined at the very beginning

In [14]:
output_reshape = tf.reshape(output_, [-1,1])

### Define the loss functions
Very standard way of defining costs/optimizer etc..

In [15]:
cost = tf.losses.mean_squared_error(output_reshape, predictions)

In [16]:
learning_rate = tf.placeholder(tf.float32)

In [17]:
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

In [18]:
correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), output_)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

### Start training

In [26]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_loss = 0.
    n_epochs = 10
    for i in range(n_epochs):
        counter = 0
        for xs, ys in get_batches(train_x[0:2000], train_y[0:2000], batch_size=100):
            counter += 1
            _, train_loss_val = sess.run([optimizer, cost], 
                                         feed_dict={input_: xs, output_: ys, dropout_: 0.8, batch_size_:100, learning_rate: 0.01})
            if(counter%10==0):
                print("epochs: ",i, "batch: ",counter, ", loss evaluated: ",train_loss_val)
            
        # do validation at the end of epochs
        val_acc = []
        val_state = sess.run(cell.zero_state(batch_size_, tf.float32), feed_dict={batch_size_:100})
        for xs, ys in get_batches(val_x, val_y, batch_size=100):

            batch_acc, val_state = sess.run([accuracy, final_state], 
                                        feed_dict={input_: xs, output_: ys, dropout_: 1, initial_state: val_state})
            val_acc.append(batch_acc)
        print("Val acc: {:.3f}".format(np.mean(val_acc)))
            
    saver.save(sess, "checkpoints/sentiment_rnn.ckpt")            
 

epochs:  0 batch:  10 , loss evaluated:  0.226663
epochs:  0 batch:  20 , loss evaluated:  0.284504
Val acc: 0.500
epochs:  1 batch:  10 , loss evaluated:  0.0556807
epochs:  1 batch:  20 , loss evaluated:  0.208611
Val acc: 0.500
epochs:  2 batch:  10 , loss evaluated:  0.041939
epochs:  2 batch:  20 , loss evaluated:  0.0575101
Val acc: 0.500
epochs:  3 batch:  10 , loss evaluated:  0.0292734
epochs:  3 batch:  20 , loss evaluated:  0.056139
Val acc: 0.500
epochs:  4 batch:  10 , loss evaluated:  0.00526734
epochs:  4 batch:  20 , loss evaluated:  0.02884
Val acc: 0.500
epochs:  5 batch:  10 , loss evaluated:  0.000238991
epochs:  5 batch:  20 , loss evaluated:  0.0200906
Val acc: 0.500
epochs:  6 batch:  10 , loss evaluated:  0.0159842
epochs:  6 batch:  20 , loss evaluated:  0.00475118
Val acc: 0.500
epochs:  7 batch:  10 , loss evaluated:  9.49389e-05
epochs:  7 batch:  20 , loss evaluated:  0.000919348
Val acc: 0.500
epochs:  8 batch:  10 , loss evaluated:  0.0047845
epochs:  8 b

NameError: name 'saver' is not defined