# Sentiment Analysis with RNN and `word2vec` Embeddings

In this notebook, we are going to build RNN model with pre-trained embedding `word2vec`. Let's first load some required packages.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import nltk, re
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from datetime import datetime
from gensim.models import *
import logging
import time
from rnn_utils import *

Using TensorFlow backend.


## Data Preparation

#### Load train data

Same with our previous RNN model and `GloVe`, we will load data and clean it up.

In [2]:
train = pd.read_csv("./data/labeledTrainData.tsv", delimiter="\t")

#### Cleaning data

After finishing loading the dataset, we will now define a function <span style="color:blue; font-family:Courier"> format_train_review</span> to clean up our data. This function will cut the review longer than our maximal length and looking up index for each word in our `word_list`. Padding will be done after we are done cleaning.

In [3]:
def format_train_review(train_review, max_seq_len, word_list):
    '''
    Converting review into array of index corresponding to its position in the vocabulary
    
    Arguments:
        train_review: review data
        max_seq_len: maximal sequence length
        word_list: words array containing vocabularies from the learned embedding
    
    Return:
        review array with the same max_seq_len containing index for each word
    '''
    count = 0
    review_clean = clean_sentence(train_review)
    review_split = review_clean.split()
    if len(review_split) > max_seq_len:
        review_max_len = review_split[-max_seq_len:]
    else: 
        review_max_len = review_split

    len_rev = len(review_max_len)
    temp_rvw = np.zeros(len_rev, dtype = 'int32')
    for word in review_max_len:
        try:
            temp_rvw[count] = word_list.index(word)
        except ValueError:
            temp_rvw[count] = word_list.index('unk') # if not found, values will be zero
        count += 1
    
    return temp_rvw

Let's load our saved `word_list` and `word_vector` to prepare for the clean up.

In [4]:
load_word_list = np.load('./data/word_list_gensim_w2v.npy')
load_word_vector = np.load('./data/word_vector_gensim_w2v.npy')
load_word_list = load_word_list.tolist()

Next, we will loop through all reviews in our dataset and call the function <span style="color:blue; font-family:Courier"> format_train_review</span> to clean up.

In [5]:
max_seq_len = 230
train_seq = []
tic = time.time()
for review in train.review:
    f = format_train_review(train_review = review, max_seq_len = max_seq_len, word_list = load_word_list)
    train_seq.append(f)
time = np.round((time.time() - tic)/60)
print("Processing time: {} minutes.".format(time))

Processing time: 3.0 minutes.


Example of review after formatting

In [40]:
train_seq[9]

array([   10,    16,     6,   366,     4,  2068,    37,  1204,  2183,
        1337,     1,  1301,    26,     2,   106,   371,     1, 24801,
         383,    28,     3,   726,  2710,    40,  3156,     5,   816,
       10481,    10,    16,     6,     3,   978,    68,   699,    71,
          49,    41,     7,     1,   679], dtype=int32)

#### Reviews truncating

When we clean up data as above, we only truncate reviews that are longer than our maximal sequence length but we dind't do padding for shorter reviews. In order to make all of them into the same length, we have to apply padding.

In [7]:
train_pad = pad_sequences(train_seq, maxlen = max_seq_len, padding='pre')
print("Length of review 9th: {}\n".format(len(train_pad[9])))
print("Total number of reviews: {}".format(len(train_pad)))

Length of review 9th: 230

Total number of reviews: 25000


In [11]:
print("Total number of words in our vocabulary: {}".format(len(load_word_list)))

Total number of words in our vocabulary: 24802


Let's check for indices of random review using our `load_word_list`:

In [10]:
print(clean_sentence(train.review[9]).split())

['this', 'movie', 'is', 'full', 'of', 'references', 'like', 'mad', 'max', 'ii', 'the', 'wild', 'one', 'and', 'many', 'others', 'the', 'ladybugs', 'face', 'its', 'a', 'clear', 'reference', 'or', 'tribute', 'to', 'peter', 'lorre', 'this', 'movie', 'is', 'a', 'masterpiece', 'well', 'talk', 'much', 'more', 'about', 'in', 'the', 'future']


In [8]:
print("Index of word 'this':       {}\n".format(load_word_list.index('this')))
print("Index of word 'movie':      {}\n".format(load_word_list.index('movie')))
print("Index of word 'is':         {}\n".format(load_word_list.index('is')))
print("Index of word 'full':       {}\n".format(load_word_list.index('full')))
print("Index of word 'of':         {}\n".format(load_word_list.index('of')))
print("Index of word 'references': {}\n".format(load_word_list.index('references')))
print("Index of word 'like':       {}\n".format(load_word_list.index('like')))
print("Index of word 'mad':        {}\n".format(load_word_list.index('mad')))

Index of word 'this':       10

Index of word 'movie':      16

Index of word 'is':         6

Index of word 'full':       366

Index of word 'of':         4

Index of word 'references': 2068

Index of word 'like':       37

Index of word 'mad':        1204



Below is a sample review in the form of index, we can see that each word is now converted into index corresponding to its position in the `load_word_list`. 

In [9]:
train_pad[9]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,

#### Splitting data

Original training clean data will be splitted into training set, validation set and test set because our test data doesn't have labels. First, I will take around 2,000 examples as our test data and the rest will be used as training and validation data.

In [10]:
x_test = train_pad[0:1984]
y_test = train.sentiment[0:1984]
print("Shape of x_test: "+ str(x_test.shape))
print("Shape of y_test: "+ str(y_test.shape))

Shape of x_test: (1984, 230)
Shape of y_test: (1984,)


In [11]:
x_train, x_valid, y_train, y_valid = train_test_split(train_pad[1984:], 
                                                    train.sentiment[1984:], 
                                                    test_size = 0.15, 
                                                    random_state = 789)

print("Length of x_train: "+ str(x_train.shape))
print("Length of y_train: "+ str(y_train.shape) +" \n")
print("Shape of x_valid: "+ str(x_valid.shape))
print("Shape of y_valid: "+ str(y_valid.shape))

Length of x_train: (19563, 230)
Length of y_train: (19563,) 

Shape of x_valid: (3453, 230)
Shape of y_valid: (3453,)


#### Batching training data

Spliting them into batches. Function <span style="color:blue; font-family:Courier">mini_batch</span> is defined in our file `rnn_utils` because they are just the same with the function we used in previous RNN model so I copied them into another python file for later use.

In [12]:
batch_size = 64
mini_batches_train = mini_batch(x_train, y_train, batch_size)
mini_batches_valid = mini_batch(x_valid, y_valid, batch_size)
mini_batches_test = mini_batch(x_test, y_test, batch_size)
print("Number of train batches: {}\n".format(len(mini_batches_train)))
print("Number of validation batches: {}\n".format(len(mini_batches_valid)))
print("Number of test batches: {}\n".format(len(mini_batches_test)))

Number of train batches: 305

Number of validation batches: 53

Number of test batches: 31



In [15]:
print("First x train mini batch: (Shape: {1}) \n{0}\n".format(mini_batches_train[0][0], mini_batches_train[0][0].shape))
print("First y train mini batch: (Shape: {1}) \n{0}".format(mini_batches_train[0][1], mini_batches_train[0][1].shape))

First x train mini batch: (Shape: (64, 230)) 
[[    0     0     0 ...,   153 13430  1218]
 [  474     2    21 ...,     3  5809   561]
 [  524    10   208 ...,   473    25   669]
 ..., 
 [    0     0     0 ...,   675    11  2388]
 [    0     0     0 ...,     2    64    89]
 [    0     0     0 ...,  5129  2051 22874]]

First y train mini batch: (Shape: (64,)) 
7136     0
8636     0
13793    0
17384    0
16192    1
12175    1
7585     1
3196     0
23543    1
22862    0
17241    1
17189    1
20505    1
17857    0
4012     1
9368     1
21231    1
22919    0
22428    1
2913     1
5762     1
3893     1
11735    0
2198     0
3293     1
15584    0
11576    1
14954    0
10586    0
9157     0
        ..
11532    0
12445    1
13802    1
20905    1
4659     1
15565    0
20667    1
6950     0
10922    1
6686     1
4427     1
10779    1
22493    0
4315     0
2726     0
22101    1
4628     1
7226     1
6263     0
19972    0
16865    1
11839    1
12638    0
16780    1
19065    0
19045    0
22109    0
5

## Building RNN Model

### TensorFlow Graph & Embeddings

Values we used here are the same with our RNN with `GloVe`.

In [13]:
n_words = len(load_word_list)
embed_size = 50
num_layers = 1
lstm_size = 64
n_epochs = 80
prob = 0.5
seq_len = 230
batch_size = 64

#### Build TensorGraph

The only difference here is that we will use `load_word_vector` as our embedding vector which created as a result of learning embedding with `Gensim word2vec` before.

In [14]:
g = tf.Graph()
with g.as_default():
    # Define Placeholders
    tf_x = tf.placeholder(dtype = tf.int32, shape = (batch_size, seq_len), name = "tf_x")
    tf_y = tf.placeholder(dtype = tf.float32, shape = (batch_size), name = "tf_y")
    tf_keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    # Create Embedded layer
    embedding = tf.nn.embedding_lookup(tf.cast(load_word_vector, tf.float32), tf_x, name='embedding')

    # Define LSTM cells
    drop_prob = tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.BasicLSTMCell(lstm_size), output_keep_prob=tf_keep_prob)
    lstm_cells = tf.contrib.rnn.MultiRNNCell([drop_prob] * num_layers)

    # Set Initial state
    init_state = lstm_cells.zero_state(batch_size, tf.float32)
    lstm_outputs, final_state = tf.nn.dynamic_rnn(lstm_cells, embedding, initial_state=init_state)
    
    logits = tf.squeeze(tf.layers.dense(inputs=lstm_outputs[:,-1], units = 1, activation=None, name = 'logits'))
    y_prob = tf.nn.sigmoid(logits, name = 'probabilities')
    
    predictions = {'probabilities': y_prob,
                   'labels' : tf.cast(tf.round(y_prob), tf.int32,name='labels')}
    # Cost
    cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels = tf_y))
    tf.summary.scalar('cost', cost)
    
    # Optimizer
    global_step = tf.Variable(0, trainable=False)
    starter_learning_rate = 0.1
    learning_rt = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           1000, 0.96, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rt)
    train_op = optimizer.minimize(cost, name = 'train_op')
    
    # Accuracy
    correct_pred = tf.equal(tf.round(y_prob), tf_y)
    acc = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    tf.summary.scalar('accuracy', acc)
    
    merged = tf.summary.merge_all()

## Train RNN model

It's time to train the model. Here we will build single layer RNN with 64 LSTM hidden units. The output activation function will be sigmoid function as we only need `1` (for positive review) or `0` (negative review).

Changes in the accuracy and loss can be monitored via TensorBoard. Later when the training is done, I will display the screenshots of them.

In [51]:
with tf.Session(graph = g) as sess:
    saver = tf.train.Saver(max_to_keep=None)
    dt = datetime.now().strftime("%Y%m%d-%H%M%S")
    logdir = "tensorboard/" + dt + "/"
    writer = tf.summary.FileWriter(logdir, g)
    writer_valid = tf.summary.FileWriter('tensorboard/'+ dt +'_valid', g)
    writer_train = tf.summary.FileWriter('tensorboard/'+ dt +'_train', g)
    iteration = 1
    sess.run(tf.global_variables_initializer())
    
    for epoch in range(n_epochs):
        tic = datetime.now()
        # Running train data
        state = sess.run(init_state)
        for batch_x_train, batch_y_train in mini_batches_train:
            summary, c, _, state, a = sess.run([merged, cost, train_op, final_state, acc],
                                     feed_dict = {'tf_x:0' : batch_x_train,
                                                 'tf_y:0' : batch_y_train,
                                                 init_state : state,
                                                 tf_keep_prob : prob})

            writer.add_summary(summary, iteration)
            iteration +=1
        writer_train.add_summary(summary, epoch+1)
        print("Epoch: {0}/{1} | Train loss: {2:.4f} | Train accuracy: {3:.4f}".format(epoch+1, 
                                                                                      n_epochs, 
                                                                                      c, 
                                                                                      a))
        # Running validation data
        valid_state = sess.run(init_state)
        for batch_x_valid, batch_y_valid in mini_batches_valid:
            summary, c_valid, valid_state, a_valid = sess.run([merged, cost, final_state, acc],
                                     feed_dict = {'tf_x:0' : batch_x_valid,
                                                 'tf_y:0' : batch_y_valid,
                                                 init_state : valid_state,
                                                 tf_keep_prob : 1})
        writer_valid.add_summary(summary, epoch+1)
        print("Epoch: {0}/{1} | Validation loss: {2:.4f} | Validation accuracy: {3:.4f}".format(epoch+1, 
                                                                                                n_epochs, 
                                                                                                c_valid, 
                                                                                                a_valid))
        # Save model every epoch
        saver.save(sess,"./model/w2vgensim/w2v_review_sentiment_epoch_{}.ckpt".format(epoch+1))
        
        toc = datetime.now()
        time = (toc - tic)
        print("Time: {}".format(time.seconds))
writer.close()
writer_train.close()
writer_valid.close()

Epoch: 1/80 | Train loss: 0.4844 | Train accuracy: 0.7812
Epoch: 1/80 | Validation loss: 0.4805 | Validation accuracy: 0.7656
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_1.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 179
Epoch: 2/80 | Train loss: 0.4664 | Train accuracy: 0.7812
Epoch: 2/80 | Validation loss: 0.4742 | Validation accuracy: 0.7656
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_2.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 190
Epoch: 3/80 | Train loss: 0.3852 | Train accuracy: 0.8438
Epoch: 3/80 | Validation loss: 0.3588 | Validation accuracy: 0.8438
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_3.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 183
Epoch: 4/80 | Train loss: 0.3879 | Train accuracy: 0.8281
Epoch: 4/80 | Validation loss: 0.3939 | Validation accuracy: 0.7969
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_4.ckpt is not in all_model_

Epoch: 32/80 | Validation loss: 0.2665 | Validation accuracy: 0.8906
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_32.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 232
Epoch: 33/80 | Train loss: 0.1605 | Train accuracy: 0.9219
Epoch: 33/80 | Validation loss: 0.2272 | Validation accuracy: 0.9219
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_33.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 219
Epoch: 34/80 | Train loss: 0.1079 | Train accuracy: 0.9688
Epoch: 34/80 | Validation loss: 0.3111 | Validation accuracy: 0.8750
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_34.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 226
Epoch: 35/80 | Train loss: 0.1905 | Train accuracy: 0.9219
Epoch: 35/80 | Validation loss: 0.2512 | Validation accuracy: 0.8750
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_35.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 266

Epoch: 63/80 | Validation loss: 0.2833 | Validation accuracy: 0.9375
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_63.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 174
Epoch: 64/80 | Train loss: 0.0585 | Train accuracy: 1.0000
Epoch: 64/80 | Validation loss: 0.3403 | Validation accuracy: 0.9062
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_64.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 177
Epoch: 65/80 | Train loss: 0.0974 | Train accuracy: 0.9375
Epoch: 65/80 | Validation loss: 0.1970 | Validation accuracy: 0.9688
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_65.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 176
Epoch: 66/80 | Train loss: 0.0927 | Train accuracy: 0.9688
Epoch: 66/80 | Validation loss: 0.3519 | Validation accuracy: 0.9219
INFO:tensorflow:./model/w2vgensim/w2v_review_sentiment_epoch_66.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 175

### Training accuracy and loss via TensorBoard

TensorBoard for Accuracy and the Loss during the training time after 80 epochs. 


![](./images/acc_w2v.png)


![](./images/cost_w2v.png)

Below is our graph for detecting over-fitting during training. We will pick model 19 as our last model. However, it may not clearly because of the smoothing value in the tensorgraph that after epoch 19, model began to over-fit trianing data.

- <span style="color:rgb(70,173,193)">Blue line </span>: training cost
- <span style="color:rgb(173,73,190)">Purple line</span>: validation cost

![](./images/cost_trvd_w2v.png)

## Testing Model

### Test with our test dataset

Let's use our test set to test the trained model and see what is the accuracy rate. Same with the training, we also create batches for test set and run several iterations through it and keep track of accuracy in every iteration and the average of them.

Below is the number of test batches we have:

In [20]:
len(mini_batches_test)

31

Now, let's pick our 19th model and restore it in TensorFlow session:

In [43]:
with tf.Session(graph = g) as sess:
    saver = tf.train.Saver()
    accuracy = []
    saver.restore(sess, './data/model/w2vgensim/w2v_review_sentiment_epoch_19.ckpt')
    test_state = sess.run(init_state)
    for batch_x, batch_y in mini_batches_test:
        feed = {'tf_x:0': batch_x, 
            'tf_y:0': batch_y,
            'keep_prob:0' : 1, 
            init_state : test_state}
        a, test_state = sess.run([acc, final_state], feed_dict=feed)
        accuracy.append(a)
    print("Overall accuracy: {0:.4f}".format(np.mean(accuracy)*100))

Overall accuracy: 88.4073


### Test model with our own review

Same reviews will be used to test the model.

In [36]:
str_review_neg = "Content is very boring and this is a waste of time to see it."
str_review_pos = "Movie is about a spy, which is not new subject. Content is very good and this is a great time to see it."
user_review_neg = format_user_review(train_review = str_review_neg, batch_size = 64,  max_seq_len = 230, word_list = load_word_list)
user_review_pos = format_user_review(train_review = str_review_pos, batch_size = 64,  max_seq_len = 230, word_list = load_word_list)

Before running this code, we have to make sure that our TensorGraph <span style="color:red">was already created</span>. 
Let's try with negative review first:
- `1` is for positive
- `0` is for negative

In [42]:
with tf.Session(graph = g) as sess:
    saver.restore(sess, './data/model/w2vgensim/w2v_review_sentiment_epoch_19.ckpt')
    own_state = sess.run(init_state)
    
    feed = {'tf_x:0': user_review_neg,
            'keep_prob:0' : 1,
            init_state : own_state}
    
    lbl, own_state = sess.run(['labels:0', final_state], feed_dict=feed)
    if lbl[-1] == 0:
        print("This is a negative review.")
    else:
        print("This is a positive review.")

This is a negative review.


Below is the result with our positive review:

In [41]:
with tf.Session(graph = g) as sess:
    saver.restore(sess, './data/model/w2vgensim/w2v_review_sentiment_epoch_19.ckpt')
    own_state = sess.run(init_state)
    
    feed = {'tf_x:0': user_review_pos,
            'keep_prob:0' : 1,
            init_state : own_state}
    
    lbl, own_state = sess.run(['labels:0', final_state], feed_dict=feed)
    if lbl[-1] == 0:
        print("This is a negative review.")
    else:
        print("This is a positive review.")

This is a positive review.


This is the end of our notebook for RNN with `word2vec` embedding vectors. In the next part, we will look at the models for our own trained embedding vectors using `fastText`.