# Sentiment Analysis Using RNN
- [Introduction](#intro)
- [Part 1: Data Preprocessing](#part1)
- [Part 2: RNN Building](#part2)
- [Part 3: RNN Training and Testing](#part3)

<a id='intro'></a>
## Introduction

In this project, I'll build and train an RNN using TensorFlow to distinguish between positive and negative movie reviews from the Internet Movie Database (IMDB). The code relies on the TensorFlow 1.x graph API (`tf.placeholder`, `tf.Session`, `tf.contrib`).

The notebook consists of three parts: Part 1 preprocesses the data, Part 2 builds the network, and Part 3 trains and tests it. The final model reaches a test accuracy of about 84%.

In [1]:
import numpy as np
import tensorflow as tf
from collections import Counter
from string import punctuation

<a id='part1'></a>
## Part 1: Data Preprocessing

First, a bit of preprocessing: removing punctuation, mapping each word and label to an integer code, dropping any empty reviews, and truncating or padding every review to a fixed length.
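
As a toy-sized illustration of the word-to-integer step, the sketch below applies the same `Counter`-based mapping to a two-review string. The sample text and the `toy_*` names are made up for this example and are not part of the dataset.

```python
from collections import Counter
from string import punctuation

# Hypothetical toy corpus: one review per line, like reviews.txt
toy_text = "good movie, really good!\nbad movie.\n"

# Strip punctuation, then split into reviews and into individual words
clean = ''.join(c for c in toy_text if c not in punctuation)
toy_reviews = clean.split('\n')
toy_words = clean.split()

# The most frequent word receives the smallest integer code
counts = Counter(toy_words)
toy_vocab = sorted(counts, key=counts.get, reverse=True)
toy_vocab_to_int = {word: num for num, word in enumerate(toy_vocab)}

# Encode every review as a list of integer codes
toy_ints = [[toy_vocab_to_int[w] for w in r.split()] for r in toy_reviews]

print(toy_vocab_to_int)  # {'good': 0, 'movie': 1, 'really': 2, 'bad': 3}
print(toy_ints)          # [[0, 1, 2, 0], [3, 1], []]
```

The trailing newline produces an empty last review, which is the kind of entry the filtering step is meant to remove.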

In [2]:
with open('reviews.txt', 'r') as f:
    reviews = f.read()
with open('labels.txt', 'r') as f:
    labels = f.read()

# Remove punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])

# Split reviews using delimiter \n
reviews = all_text.split('\n')

# Extract individual words
all_text = ' '.join(reviews)
words = all_text.split()

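# Create dictionary that maps each word to an integer (most frequent word first)
# Note: codes start at 0, so the most frequent word shares code 0 with the padding value used below
tmp = Counter(words)
tmp_sorted = sorted(tmp, key=tmp.get, reverse=True)
vocab_to_int = {word: num for num, word in enumerate(tmp_sorted)}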

# Convert the reviews to lists of integers using the vocab_to_int code
reviews_ints = [[vocab_to_int[word] for word in review.split()] for review in reviews]
    
# Convert labels to 1s and 0s for 'positive' and 'negative'
labels = labels.split('\n')
labels = np.array([1 if label == 'positive' else 0 for label in labels])

# Filter out any empty review
non_zero_idx = [idx for idx, review in enumerate(reviews_ints) if len(review) != 0]
reviews_ints = [reviews_ints[idx] for idx in non_zero_idx]
labels = np.array([labels[idx] for idx in non_zero_idx])

# Truncate reviews to their first <seq_len> words, and left-pad shorter reviews with 0s
seq_len = 200
features = np.zeros((len(reviews_ints), seq_len), dtype=int)
for i, row in enumerate(reviews_ints):
    features[i, -len(row):] = np.array(row)[:seq_len]
    
# Split data into training (80%), validation (10%), and test (10%) sets
split_frac = 0.8
split_idx = int(len(features)*split_frac)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels[:split_idx], labels[split_idx:]

test_idx = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_idx], val_x[test_idx:]
val_y, test_y = val_y[:test_idx], val_y[test_idx:]
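
To make the padding rule and the split sizes concrete, here is a small pure-Python sketch. The helper `pad_or_truncate` and the dataset size of 1,000 are hypothetical. The helper is meant to mirror the slicing used above, not replace it.

```python
def pad_or_truncate(row, seq_len):
    """Keep the first seq_len codes and left-pad shorter rows with 0s."""
    kept = row[:seq_len]
    return [0] * (seq_len - len(kept)) + kept

# One short and one long toy review, with a fixed length of 5
toy_features = [pad_or_truncate(r, 5) for r in ([7, 8, 9], [1, 2, 3, 4, 5, 6, 7])]
print(toy_features)  # [[0, 0, 7, 8, 9], [1, 2, 3, 4, 5]]

# 80/10/10 split sizes for a hypothetical dataset of 1,000 reviews
n = 1000
split_idx = int(n * 0.8)
test_idx = int((n - split_idx) * 0.5)
print(split_idx, test_idx, n - split_idx - test_idx)  # 800 100 100
```

The split is sequential and unshuffled, so it implicitly assumes that positive and negative reviews are mixed throughout the input files.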

<a id='part2'></a>
## Part 2: RNN Building

Next, let's build the RNN. The network uses an embedding layer for the words, LSTM cells, and dropout.
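
Conceptually, the embedding layer is a lookup table with one row per word code. The sketch below imitates that lookup in plain Python to show how the shapes change. The sizes and `*_toy` names are assumptions chosen for illustration, and no TensorFlow is involved.

```python
import random

random.seed(0)
n_words_toy, embed_size_toy = 6, 3  # hypothetical vocabulary and embedding sizes

# One row of embed_size_toy random values between -1 and 1 per word code
embedding_toy = [[random.uniform(-1, 1) for _ in range(embed_size_toy)]
                 for _ in range(n_words_toy)]

# A batch of 2 encoded reviews, each 4 codes long
batch = [[0, 0, 5, 2], [1, 3, 3, 4]]

# The lookup replaces every integer code with its embedding row
embedded = [[embedding_toy[code] for code in review] for review in batch]

# Shape goes from (batch, length) to (batch, length, embed_size)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 4 3
```

In the network, the rows of this table are trainable variables, so they are adjusted together with the LSTM weights.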

In [17]:
lstm_size = 256 # Number of units in LSTM cell
lstm_layers = 1
batch_size = 500
learning_rate = 0.001

n_words = len(vocab_to_int)

tf.reset_default_graph()

# Define placeholders for inputs and labels
inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs') # shape: (batch_size, review length)
labels_ = tf.placeholder(tf.int32, [None, None], name='labels') # shape: (batch_size, 1)

# Define the dropout keep probability (fraction of LSTM outputs that are kept)
keep_prob = tf.placeholder(tf.float32, name='keep_prob')

# Define the embedding layer
embed_size = 300 # Size of the embedding vectors (number of units in the embedding layer)
# Initialize embedding layer with random values between -1 and 1
embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
embed = tf.nn.embedding_lookup(embedding, inputs_)
    
# Build LSTM cell
def build_cell(lstm_size, keep_prob):
    
    # Define LSTM cell
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    
    # Add dropout to the cell
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    return drop

# Stack up multiple LSTM layers
cell = tf.contrib.rnn.MultiRNNCell([build_cell(lstm_size, keep_prob) for _ in range(lstm_layers)])
    
# Set initial state to 0s
initial_state = cell.zero_state(batch_size, tf.float32)
    
# Pass embedded data through RNN
outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
    
# Feed the RNN output at the last time step into a single sigmoid unit
predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)

# Define cost
cost = tf.losses.mean_squared_error(labels_, predictions)
    
# Define optimizer
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    
# Define accuracy
correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
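
The cost and accuracy definitions can be checked by hand on a few values. This is a minimal pure-Python sketch of the same two formulas. The function names and the toy numbers are hypothetical.

```python
def mse(labels, preds):
    """Mean squared error between 0/1 labels and sigmoid outputs."""
    return sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)

def rounded_accuracy(labels, preds):
    """Fraction of predictions that match the label after rounding."""
    return sum(int(round(p) == y) for y, p in zip(labels, preds)) / len(labels)

toy_labels = [1, 0, 1, 0]
toy_preds = [0.9, 0.2, 0.4, 0.7]

print("{:.3f}".format(mse(toy_labels, toy_preds)))  # 0.225
print(rounded_accuracy(toy_labels, toy_preds))      # 0.5

# A model that always outputs 0.5 has a loss of exactly 0.25
print(mse(toy_labels, [0.5] * 4))                   # 0.25
```

The 0.25 baseline is a useful reference when reading training logs: a loss only slightly below 0.25 means the network is barely better than an uninformative constant prediction.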

<a id='part3'></a>
## Part 3: RNN Training and Testing

Finally, let's train and test the model:


In [18]:
epochs = 20

# Yield successive full batches; leftover samples that do not fill a batch are dropped
def get_batches(x, y, batch_size):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]

saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration % 5 == 0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            iteration += 1
                
        # Compute validation accuracy for the current epoch
        val_acc = []
        # Start validation from a zero cell state
        val_state = sess.run(cell.zero_state(batch_size, tf.float32))
        for x, y in get_batches(val_x, val_y, batch_size):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 1,
                    initial_state: val_state}
            batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
            val_acc.append(batch_acc)
        print("Val acc: {:.3f}".format(np.mean(val_acc)))

    saver.save(sess, "checkpoints/sentiment.ckpt")

Epoch: 0/20 Iteration: 5 Train loss: 0.243
Epoch: 0/20 Iteration: 10 Train loss: 0.244
Epoch: 0/20 Iteration: 15 Train loss: 0.210
Epoch: 0/20 Iteration: 20 Train loss: 0.223
Epoch: 0/20 Iteration: 25 Train loss: 0.229
Epoch: 0/20 Iteration: 30 Train loss: 0.223
Epoch: 0/20 Iteration: 35 Train loss: 0.206
Epoch: 0/20 Iteration: 40 Train loss: 0.182
Val acc: 0.726
Epoch: 1/20 Iteration: 45 Train loss: 0.184
Epoch: 1/20 Iteration: 50 Train loss: 0.193
Epoch: 1/20 Iteration: 55 Train loss: 0.177
Epoch: 1/20 Iteration: 60 Train loss: 0.182
Epoch: 1/20 Iteration: 65 Train loss: 0.224
Epoch: 1/20 Iteration: 70 Train loss: 0.210
Epoch: 1/20 Iteration: 75 Train loss: 0.209
Epoch: 1/20 Iteration: 80 Train loss: 0.218
Val acc: 0.658
Epoch: 2/20 Iteration: 85 Train loss: 0.178
Epoch: 2/20 Iteration: 90 Train loss: 0.172
Epoch: 2/20 Iteration: 95 Train loss: 0.147
Epoch: 2/20 Iteration: 100 Train loss: 0.137
Epoch: 2/20 Iteration: 105 Train loss: 0.143
Epoch: 2/20 Iteration: 110 Train loss: 0.161
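
In the log, a validation accuracy is printed after every 40 iterations, which suggests that each epoch runs 40 full batches of 500 reviews. The `get_batches` helper only yields full batches because the initial LSTM state is built for a fixed `batch_size`. Its behavior on a toy list (the `toy_*` data is hypothetical) can be sketched as follows:

```python
def get_batches(x, y, batch_size):
    # Drop the samples that would not fill a complete batch
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]

toy_x = list(range(10))
toy_y = [i % 2 for i in toy_x]

batches = list(get_batches(toy_x, toy_y, 4))
print(len(batches))    # 2
print(batches[-1][0])  # [4, 5, 6, 7]
```

With 10 samples and a batch size of 4, the last two samples are never seen. The same applies to the validation and test sets whenever their size is not a multiple of `batch_size`.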


Testing the model:

In [19]:
test_acc = []
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))

INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt
Test accuracy: 0.839
