# Sentiment Analysis with an RNN

The architecture for this network is shown below.

<img src="images/RNNArch.png" width=600px>


In [1]:
import numpy as np
import tensorflow as tf


## Collecting Data

In [2]:
with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

In [3]:
reviews[:2000]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   \nstory of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is tu

## Normalization and Tokenization

Normalize the reviews by removing punctuation, split them on the line feed `\n` that separates one review from the next, and build a list of all words.

In [4]:
from string import punctuation
# Exclude characters that are punctuation.
all_text = ''.join([c for c in reviews if c not in punctuation])

# Reviews are separated by line feed \n 
print("Does all_text has linefeed: {}".format('\n' in all_text[:2000]),'\n')

# Get rid of the line feed by separating reviews using it as separator
reviews_nolinefeed = all_text.split('\n')
print("First review: {} \n".format(reviews_nolinefeed[0]))
print("Second review: {} \n".format(reviews_nolinefeed[1]))

# Put all the reviews back in one big string for word counting
all_text = ' '.join(reviews_nolinefeed)

# Split the text into a list of words
words = all_text.split()

Does all_text has linefeed: True 

First review: bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t    

Second review: story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy
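
As a side note, the character-by-character filter above can also be written with `str.translate`. The sketch below uses a made-up two-review string (`sample` is hypothetical, not taken from the dataset) and checks that both approaches produce the same text before splitting on `\n`.

```python
from string import punctuation

# Hypothetical two-review sample in the same layout as reviews.txt
sample = "a classic line . i'm here !\nwhat a pity , isn't it ?"

# Character-by-character filter, as in the cell above
filtered = ''.join([c for c in sample if c not in punctuation])

# Same result using a translation table that deletes punctuation
table = str.maketrans('', '', punctuation)
translated = sample.translate(table)

# '\n' is not punctuation, so the review boundaries survive
sample_reviews = translated.split('\n')
print(sample_reviews)
```

Either form keeps the line feeds intact, which is what lets the reviews be separated afterwards.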

In [5]:
all_text[:2000]

'bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t    story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent m

In [6]:
words[:100]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such',
 'as',
 'teachers',
 'my',
 'years',
 'in',
 'the',
 'teaching',
 'profession',
 'lead',
 'me',
 'to',
 'believe',
 'that',
 'bromwell',
 'high',
 's',
 'satire',
 'is',
 'much',
 'closer',
 'to',
 'reality',
 'than',
 'is',
 'teachers',
 'the',
 'scramble',
 'to',
 'survive',
 'financially',
 'the',
 'insightful',
 'students',
 'who',
 'can',
 'see',
 'right',
 'through',
 'their',
 'pathetic',
 'teachers',
 'pomp',
 'the',
 'pettiness',
 'of',
 'the',
 'whole',
 'situation',
 'all',
 'remind',
 'me',
 'of',
 'the',
 'schools',
 'i',
 'knew',
 'and',
 'their',
 'students',
 'when',
 'i',
 'saw',
 'the',
 'episode',
 'in',
 'which',
 'a',
 'student',
 'repeatedly',
 'tried',
 'to',
 'burn',
 'down',
 'the',
 'school',
 'i',
 'immediately',
 'recalled',
 'at',
 'high']

### Encoding the words


In [7]:
from collections import Counter

# Count how often each word occurs
counts = Counter(words)
#print(list(counts.items())[:3])

# Sort the vocabulary by frequency, most common word first
vocab = sorted(counts, key=counts.get, reverse=True)
print("Vocabulary size: {}".format(len(vocab)))
# Assign an integer to each word, starting at 1 (0 is reserved for padding)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
#print(list(vocab_to_int.keys())[1],vocab_to_int[list(vocab_to_int.keys())[1]])

# Use reviews_nolinefeed and vocab_to_int to encode each review as a list of integers
reviews_ints = []
for each in reviews_nolinefeed:
    reviews_ints.append([vocab_to_int[word] for word in each.split()])
print(reviews_ints[0])

Vocabulary size: 74072
[22151, 308, 6, 3, 1051, 207, 8, 2143, 32, 1, 171, 57, 15, 49, 81, 5832, 44, 382, 110, 140, 15, 5242, 60, 154, 9, 1, 5002, 5899, 475, 71, 5, 260, 12, 22151, 308, 13, 1980, 6, 74, 2402, 5, 614, 73, 6, 5242, 1, 24833, 5, 1988, 10379, 1, 5850, 1501, 36, 51, 66, 204, 145, 67, 1204, 5242, 19904, 1, 44453, 4, 1, 221, 883, 31, 3005, 71, 4, 1, 5827, 10, 686, 2, 67, 1501, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3703, 784, 5, 3505, 180, 1, 382, 10, 1212, 13832, 32, 308, 3, 349, 341, 2921, 10, 143, 127, 5, 7816, 30, 4, 129, 5242, 1406, 2336, 5, 22151, 308, 10, 528, 12, 109, 1449, 4, 60, 543, 102, 12, 22151, 308, 6, 227, 4173, 48, 3, 2217, 12, 8, 215, 23]
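
To see how the mapping behaves on a small input, here is a minimal sketch on three made-up reviews (all `toy_*` names are hypothetical). The most frequent word receives the integer 1, and 0 is never assigned, which leaves it free to mark padding later.

```python
from collections import Counter

# Hypothetical mini-corpus, already stripped of punctuation
toy_reviews = ["the movie was great", "the plot was thin", "the end"]
toy_words = ' '.join(toy_reviews).split()

# Same steps as the cell above: count, rank by frequency, number from 1
toy_counts = Counter(toy_words)
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: ii for ii, word in enumerate(toy_vocab, 1)}

# Encode each review as a list of integers
toy_ints = [[toy_vocab_to_int[w] for w in r.split()] for r in toy_reviews]
print(toy_vocab_to_int)
print(toy_ints)
```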


### Encoding the labels and making all reviews the same size

Our labels are "positive" or "negative". To use these labels in our network, we need to convert them to 0 and 1.


In [8]:
labels = labels.split('\n')
labels = np.array([1 if each == 'positive' else 0 for each in labels])
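
A small sketch of the same conversion on a made-up label string (`raw_labels` is hypothetical). It assumes the file ends with a line feed, in which case `split('\n')` yields a trailing empty string that is encoded as 0 like any other non-"positive" value.

```python
# Hypothetical label file contents, ending with a line feed
raw_labels = "positive\nnegative\npositive\n"

toy_labels = raw_labels.split('\n')
encoded = [1 if each == 'positive' else 0 for each in toy_labels]

print(toy_labels)
print(encoded)
```

Under that assumption the extra trailing entry lines up with an empty review, so both can be dropped together by index.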

In [9]:
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 1
Maximum review length: 2514


### Truncate reviews to 200 or pad them to 200

In [10]:
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
len(non_zero_idx)

25000

In [11]:
reviews_ints[-1]

[]

It turns out that the final review is the one with zero length. That might not always be the case, though, so let's filter by index to keep it general.

In [12]:
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
labels = np.array([labels[ii] for ii in non_zero_idx])

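Use `reviews_ints` to build the feature matrix: reviews shorter than 200 words are left-padded with zeros, and longer reviews are truncated to their first 200 words.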



In [13]:
seq_len = 200
features = np.zeros((len(reviews_ints), seq_len), dtype=int)
for i, row in enumerate(reviews_ints):
    features[i, -len(row):] = np.array(row)[:seq_len]
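
The slicing in the loop above handles both cases at once. A minimal sketch with a hypothetical helper `pad_features` and a sequence length of 4 instead of 200 shows a short row being left-padded and a long row being cut to its first 4 entries.

```python
import numpy as np

def pad_features(rows, seq_len):
    """Left-pad short rows with zeros; keep the first seq_len items of long rows."""
    features = np.zeros((len(rows), seq_len), dtype=int)
    for i, row in enumerate(rows):
        # For long rows, -len(row) reaches past the start, so the slice is the whole row
        features[i, -len(row):] = np.array(row)[:seq_len]
    return features

toy_features = pad_features([[5, 6], [1, 2, 3, 4, 5, 6]], seq_len=4)
print(toy_features)
```

A zero-length row does not fit this pattern, because `-0:` selects the whole row and the empty assignment would likely raise a broadcasting error. That is one reason to remove the empty review first.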

In [14]:
print("The shape on one review: {}".format(features[0].shape))
print("One review: \n {}".format(features[0]))

The shape on one review: (200,)
One review: 
 [    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
 22151   308     6     3  1051   207     8  2143    32     1   171    57
    15    49    81  5832    44   382   110   140    15  5242    60   154
     9     1  5002  5899   475    71     5   260    12 22151   308    13
  1980     6    74  2402     5   614    73     6  5242     1 24833     5
  1988 10379     1  5850  1501    36    51    66   204   145    67  1204
  5242 19904     1 44453     4     1   221   883    31  3005    71     4
     1  5827    10   686     2    67  1501    54    10   216     1   383
     9    62     3  1406  3703   784     5  3505   180     1   382    10
  121

## Training, Validation, Test



With our data in nice shape, we'll split it into training, validation, and test sets.

In [15]:
split_frac = 0.8
split_idx = int(len(features)*split_frac)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels[:split_idx], labels[split_idx:]

test_idx = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_idx], val_x[test_idx:]
val_y, test_y = val_y[:test_idx], val_y[test_idx:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		(2500, 200)


With train, validation, and test fractions of 0.8, 0.1, and 0.1, the final shapes should look like:
```
                    Feature Shapes:
Train set: 		 (20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		  (2500, 200)
```
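
The boundary arithmetic behind those shapes can be sketched on its own. `split_indices` is a hypothetical helper, not part of the notebook; it mirrors the two-step split (80% for training, then the remainder halved).

```python
def split_indices(n_samples, split_frac=0.8):
    """Return (train_end, val_end) boundaries for a train/validation/test split."""
    train_end = int(n_samples * split_frac)
    # The remaining samples are divided evenly between validation and test
    val_end = train_end + int((n_samples - train_end) * 0.5)
    return train_end, val_end

n_samples = 25000
train_end, val_end = split_indices(n_samples)
sizes = (train_end, val_end - train_end, n_samples - val_end)
print(sizes)
```

Note that the split takes consecutive slices without shuffling, so it relies on the reviews not being ordered by label.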

## Build the graph

Here, we'll build the graph. First up, defining the hyperparameters.

* `lstm_size`: Number of units in the hidden layers in the LSTM cells. Larger values usually perform better, at the cost of more computation and memory. Common values are 128, 256, 512, etc.
* `lstm_layers`: Number of LSTM layers in the network. I'd start with 1, then add more if I'm underfitting.
* `batch_size`: The number of reviews to feed the network in one training pass. Typically this should be set as high as you can go without running out of memory.
* `learning_rate`: Step size used by the optimizer.

In [16]:
lstm_size = 256
lstm_layers = 1
batch_size = 100
learning_rate = 0.001

For the network itself, we'll be passing in our 200 element long review vectors. Each batch will be `batch_size` vectors. We'll also be using dropout on the LSTM layer, so we'll make a placeholder for the keep probability.

- INPUT x_train
- LABEL y_train
- PARAMETERS weights, biases, embedding
- PREDICTION logistic regression (a single sigmoid output unit)
- LOSS function mean squared error
- ACCURACY
- OPTIMIZATION function Adam

In [17]:
n_words = len(vocab_to_int) + 1 # Adding 1 because we use 0's for padding, dictionary started at 1

# Create the graph object
graph = tf.Graph()
# Add nodes to the graph
with graph.as_default():
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

### Embedding

Now we'll add an embedding layer. We need to do this because there are about 74,000 words in our vocabulary, so one-hot encoding the inputs would be massively inefficient. We will create an embedding lookup matrix as a `tf.Variable`, then use that matrix to get the embedded vectors to pass to the LSTM cell with `tf.nn.embedding_lookup`. This function takes the embedding matrix and an input tensor, such as the review vectors, and returns another tensor with the embedded vectors. So, if the embedding layer has 300 units and each review has 200 words, the function will return a tensor with size [batch_size, 200, 300].
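
The lookup itself amounts to row indexing. Here is a minimal NumPy sketch, not the TensorFlow implementation, with made-up sizes (`toy_vocab_size` and `toy_embed_size` stand in for `n_words` and `embed_size`). The result has one embedding vector per word id, so its shape is [batch size, sequence length, embedding size].

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embed_size = 10, 4   # hypothetical stand-ins
toy_embedding = rng.uniform(-1, 1, size=(toy_vocab_size, toy_embed_size))

# A batch of 2 "reviews", each 3 word ids long (0 is the padding id)
toy_inputs = np.array([[0, 3, 7],
                       [2, 2, 9]])

# Integer-array indexing picks one embedding row per word id
toy_embed = toy_embedding[toy_inputs]
print(toy_embed.shape)  # (2, 3, 4)
```

Repeated word ids map to the same row, so the two occurrences of id 2 in the second review share one vector.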



In [18]:
# Size of the embedding vectors (number of units in the embedding layer)
#            300
#          --------
#         +       +
# 74072   +       +
#         +       +
#          --------
embed_size = 300 

with graph.as_default():
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)

In [19]:
with graph.as_default():
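    def build_cell():
        # Your basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        # Add dropout to the cell
        return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    
    # Stack up multiple LSTM layers, for deep learning.
    # Each layer gets its own cell object: `[drop] * lstm_layers` would reuse
    # one cell, which can fail in newer TF 1.x releases when lstm_layers > 1.
    cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(lstm_layers)])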
    
    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)
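
To make the role of `keep_prob` concrete, here is a rough NumPy sketch of dropout applied to a cell's outputs. It assumes the usual "inverted dropout" scheme, where kept units are scaled by `1 / keep_prob`; the `dropout` helper is hypothetical and is not the `DropoutWrapper` code itself.

```python
import numpy as np

def dropout(outputs, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the rest."""
    mask = rng.random(outputs.shape) < keep_prob
    return outputs * mask / keep_prob

rng = np.random.default_rng(0)
toy_outputs = np.ones((2, 5))

train_out = dropout(toy_outputs, keep_prob=0.5, rng=rng)  # entries are 0.0 or 2.0
eval_out = dropout(toy_outputs, keep_prob=1.0, rng=rng)   # unchanged
print(train_out)
print(eval_out)
```

With `keep_prob` set to 1 nothing is dropped, which is why the validation and test passes feed `keep_prob: 1`.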

In [20]:
with graph.as_default():
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                             initial_state=initial_state)

### Output

We only care about the final output, since we'll use it as our sentiment prediction. So we grab the last output with `outputs[:, -1]`, then calculate the cost from that and `labels_`.

In [21]:
with graph.as_default():
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
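
The shapes involved can be traced with a NumPy sketch. All values here are made up, and the fixed weight matrix `w` only stands in for the fully connected layer's trainable weights. The point is that `outputs[:, -1]` keeps one vector per review, the sigmoid unit turns it into a [batch, 1] prediction, and the mean squared error compares that with labels of the same shape.

```python
import numpy as np

# Hypothetical RNN outputs: 2 reviews, 3 time steps, 4 LSTM units
toy_outputs = np.arange(24, dtype=float).reshape(2, 3, 4) / 24.0
last = toy_outputs[:, -1]                 # shape (2, 4): final time step only

# Stand-in for the fully connected layer: fixed weights, zero bias
w = np.full((4, 1), 0.5)
logits = last @ w                         # shape (2, 1)
toy_predictions = 1.0 / (1.0 + np.exp(-logits))   # sigmoid

# Labels need shape (2, 1) to line up with the predictions
toy_targets = np.array([[0], [1]])
toy_cost = np.mean((toy_targets - toy_predictions) ** 2)
print(last.shape, toy_predictions.shape, toy_cost)
```

This is also why the training loop feeds `y[:, None]`: it turns the flat label vector into a column that matches the prediction shape.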

### Validation accuracy

Here we add a few nodes to calculate the accuracy, which we'll use in the validation pass. Each prediction is rounded to 0 or 1 and compared with its label, and the accuracy of a batch is the fraction of reviews predicted correctly. During training, the validation accuracy is the mean of this value over all validation batches.

In [22]:
with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
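
The same calculation in NumPy, on four made-up predictions (the `toy_*` arrays are hypothetical):

```python
import numpy as np

toy_predictions = np.array([[0.9], [0.2], [0.6], [0.4]])
toy_labels = np.array([[1], [0], [0], [0]])

# Round to the nearest class, compare with the labels, average the matches
correct = np.round(toy_predictions).astype(int) == toy_labels
toy_accuracy = correct.astype(float).mean()
print(toy_accuracy)  # 0.75
```

Three of the four rounded predictions match their labels; the third one (0.6 rounded to 1 against a label of 0) is the miss.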

### Batching

`get_batches` is a generator that returns the features and labels in batches of `batch_size` samples. It first keeps only as many samples as fill complete batches, dropping any remainder, and then yields consecutive slices.

In [23]:
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
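
A quick usage sketch on made-up data shows the remainder being dropped. The function is repeated here so the snippet runs on its own, and plain lists stand in for the NumPy arrays.

```python
def get_batches(x, y, batch_size=100):
    n_batches = len(x) // batch_size
    # Keep only as many samples as fill complete batches
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]

toy_x = list(range(10))
toy_y = [v % 2 for v in toy_x]

batches = list(get_batches(toy_x, toy_y, batch_size=3))
for bx, by in batches:
    print(bx, by)
```

Ten samples with a batch size of 3 give three batches, and the last sample is left out. Full batches matter here because the graph's `initial_state` is built for exactly `batch_size` sequences.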

## Training

Below is the typical training code.

In [24]:
epochs = 10

with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            # Run training
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            # Every 5 iterations of training, print the loss
            if iteration%5==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))
            
            # Every 25 iterations, start from a zero LSTM state and compute the
            # accuracy on the validation batches instead of a training batch
            if iteration%25==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
    saver.save(sess, "checkpoints/fake_sentiment.ckpt")

Epoch: 0/10 Iteration: 5 Train loss: 0.246
Epoch: 0/10 Iteration: 10 Train loss: 0.233
Epoch: 0/10 Iteration: 15 Train loss: 0.234
Epoch: 0/10 Iteration: 20 Train loss: 0.236
Epoch: 0/10 Iteration: 25 Train loss: 0.215
Val acc: 0.629
Epoch: 0/10 Iteration: 30 Train loss: 0.252
Epoch: 0/10 Iteration: 35 Train loss: 0.251
Epoch: 0/10 Iteration: 40 Train loss: 0.235
Epoch: 0/10 Iteration: 45 Train loss: 0.244
Epoch: 0/10 Iteration: 50 Train loss: 0.235
Val acc: 0.588
Epoch: 0/10 Iteration: 55 Train loss: 0.203
Epoch: 0/10 Iteration: 60 Train loss: 0.229
Epoch: 0/10 Iteration: 65 Train loss: 0.214
Epoch: 0/10 Iteration: 70 Train loss: 0.292
Epoch: 0/10 Iteration: 75 Train loss: 0.250
Val acc: 0.571
Epoch: 0/10 Iteration: 80 Train loss: 0.239
Epoch: 0/10 Iteration: 85 Train loss: 0.243
Epoch: 0/10 Iteration: 90 Train loss: 0.210
Epoch: 0/10 Iteration: 95 Train loss: 0.245
Epoch: 0/10 Iteration: 100 Train loss: 0.236
Val acc: 0.581
Epoch: 0/10 Iteration: 105 Train loss: 0.242
Epoch: 0/10 Ite

Epoch: 4/10 Iteration: 865 Train loss: 0.100
Epoch: 4/10 Iteration: 870 Train loss: 0.131
Epoch: 4/10 Iteration: 875 Train loss: 0.129
Val acc: 0.740
Epoch: 4/10 Iteration: 880 Train loss: 0.107
Epoch: 4/10 Iteration: 885 Train loss: 0.108
Epoch: 4/10 Iteration: 890 Train loss: 0.091
Epoch: 4/10 Iteration: 895 Train loss: 0.112
Epoch: 4/10 Iteration: 900 Train loss: 0.118
Val acc: 0.798
Epoch: 4/10 Iteration: 905 Train loss: 0.059
Epoch: 4/10 Iteration: 910 Train loss: 0.068
Epoch: 4/10 Iteration: 915 Train loss: 0.089
Epoch: 4/10 Iteration: 920 Train loss: 0.074
Epoch: 4/10 Iteration: 925 Train loss: 0.052
Val acc: 0.830
Epoch: 4/10 Iteration: 930 Train loss: 0.069
Epoch: 4/10 Iteration: 935 Train loss: 0.065
Epoch: 4/10 Iteration: 940 Train loss: 0.036
Epoch: 4/10 Iteration: 945 Train loss: 0.114
Epoch: 4/10 Iteration: 950 Train loss: 0.056
Val acc: 0.776
Epoch: 4/10 Iteration: 955 Train loss: 0.115
Epoch: 4/10 Iteration: 960 Train loss: 0.124
Epoch: 4/10 Iteration: 965 Train loss: 0

Val acc: 0.940
Epoch: 8/10 Iteration: 1705 Train loss: 0.007
Epoch: 8/10 Iteration: 1710 Train loss: 0.000
Epoch: 8/10 Iteration: 1715 Train loss: 0.000
Epoch: 8/10 Iteration: 1720 Train loss: 0.000
Epoch: 8/10 Iteration: 1725 Train loss: 0.001
Val acc: 0.941
Epoch: 8/10 Iteration: 1730 Train loss: 0.001
Epoch: 8/10 Iteration: 1735 Train loss: 0.000
Epoch: 8/10 Iteration: 1740 Train loss: 0.000
Epoch: 8/10 Iteration: 1745 Train loss: 0.001
Epoch: 8/10 Iteration: 1750 Train loss: 0.001
Val acc: 0.935
Epoch: 8/10 Iteration: 1755 Train loss: 0.000
Epoch: 8/10 Iteration: 1760 Train loss: 0.000
Epoch: 8/10 Iteration: 1765 Train loss: 0.000
Epoch: 8/10 Iteration: 1770 Train loss: 0.000
Epoch: 8/10 Iteration: 1775 Train loss: 0.000
Val acc: 0.924
Epoch: 8/10 Iteration: 1780 Train loss: 0.001
Epoch: 8/10 Iteration: 1785 Train loss: 0.002
Epoch: 8/10 Iteration: 1790 Train loss: 0.000
Epoch: 8/10 Iteration: 1795 Train loss: 0.000
Epoch: 8/10 Iteration: 1800 Train loss: 0.000
Val acc: 0.922
Epoch

## Testing

In [25]:
test_acc = []
with graph.as_default():
    saver = tf.train.Saver()
    
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))

INFO:tensorflow:Restoring parameters from checkpoints/fake_sentiment.ckpt
Test accuracy: 0.863
