# Sentiment Analysis with RNN and `fastText` Embeddings

In this notebook, we build an RNN model on top of `fastText` embeddings. Let's first load the required packages.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import nltk, re
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from datetime import datetime
from gensim.models import *
import logging
import time
from rnn_utils import *

Using TensorFlow backend.


## Data Preparation

#### Load train data

As with our previous RNN model with `GloVe`, we will load the data and clean it up.

In [4]:
train = pd.read_csv("./data/labeledTrainData.tsv", delimiter="\t")

#### Cleaning data

After loading the dataset, we use the function <span style="color:blue; font-family:Courier"> format_train_review</span> (defined in the `rnn_utils` file) to clean up our data. This function cuts any review longer than our maximal length and looks up the index of each word in our `word_list`. Padding is done after the cleaning step.
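
The real function lives in `rnn_utils` and is not shown in this notebook. As a rough illustration of the two steps described above (cutting and index lookup), here is a minimal sketch. The name `format_review_sketch`, the cleaning rules, and the choice to skip unknown words are all assumptions made for this example, not the actual implementation.

```python
import re

def format_review_sketch(review, max_seq_len, word_list):
    """Hypothetical stand-in for format_train_review (the real one is in rnn_utils)."""
    # Map each word to its position in the vocabulary list.
    word_to_index = {word: idx for idx, word in enumerate(word_list)}
    # Assumed cleaning: drop HTML tags, lowercase, keep alphabetic tokens only.
    text = re.sub(r"<[^>]+>", " ", review).lower()
    tokens = re.findall(r"[a-z]+", text)
    # Assumed policy: words missing from the vocabulary are skipped.
    indices = [word_to_index[t] for t in tokens if t in word_to_index]
    # Cut reviews that are longer than the maximal length.
    return indices[:max_seq_len]

# Toy vocabulary; index 0 is assumed to be reserved for padding.
vocab = ["<pad>", "the", "movie", "was", "great"]
print(format_review_sketch("The movie was GREAT!<br />Really great.", 4, vocab))
# [1, 2, 3, 4]
```

The word "really" is not in the toy vocabulary, so it is dropped, and the fifth index is cut because `max_seq_len` is 4 here.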

Let's load our saved `word_list` and `word_vector` to prepare for the cleanup.

In [5]:
load_word_list = np.load('./data/word_list_gensim_fT.npy')
load_word_vector = np.load('./data/word_vector_gensim_fT.npy')
load_word_list = load_word_list.tolist()

Next, we loop through all reviews in our dataset and call the function <span style="color:blue; font-family:Courier"> format_train_review</span> to clean each one.

In [6]:
max_seq_len = 230
train_seq = []
tic = time.time()
for review in train.review:
    f = format_train_review(train_review = review, max_seq_len = max_seq_len, word_list = load_word_list)
    train_seq.append(f)
# Use a separate name so the `time` module is not overwritten
elapsed = np.round((time.time() - tic)/60)
print("Processing time: {} minutes.".format(elapsed))

Processing time: 3.0 minutes.


Example of review after formatting:

In [12]:
train_seq[9]

array([   10,    16,     6,   366,     4,  2068,    37,  1204,  2183,
        1337,     1,  1301,    26,     2,   106,   371,     1, 24801,
         383,    28,     3,   726,  2710,    40,  3156,     5,   816,
       10481,    10,    16,     6,     3,   978,    68,   699,    71,
          49,    41,     7,     1,   679], dtype=int32)

#### Padding and truncating reviews

This is the same step we used for the `word2vec` RNN model.

In [10]:
train_pad = pad_sequences(train_seq, maxlen = max_seq_len, padding='pre')
print("Length of review 9th: {}\n".format(len(train_pad[9])))
print("Total number of reviews: {}".format(len(train_pad)))

Length of review 9th: 230

Total number of reviews: 25000


As the output below shows, the review at index 9 has been padded with zeros at the beginning because it is shorter than `max_seq_len`.

In [15]:
print("Review 9th after padding: \n{}".format(train_pad[9]))

Review 9th after padding: 
[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0
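
To make the effect of `padding='pre'` concrete, here is a small pure-Python sketch of what happens to a single sequence. The helper `pad_pre` is hypothetical and only mimics the behaviour for one list; it assumes `maxlen` is positive. Sequences that are too long lose items from the front, which matches the default `truncating='pre'` of `pad_sequences`.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad (or left-truncate) one sequence to exactly maxlen items."""
    seq = list(seq)[-maxlen:]                    # keep the last maxlen items if too long
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the pad value

print(pad_pre([10, 16, 6], 5))           # [0, 0, 10, 16, 6]
print(pad_pre([1, 2, 3, 4, 5, 6], 5))    # [2, 3, 4, 5, 6]
```

In this notebook the reviews were already cut to `max_seq_len` during cleaning, so only the padding branch matters.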

#### Splitting data

We will split the data into three parts: training, validation, and test sets.

In [18]:
x_test = train_pad[0:1984]
y_test = train.sentiment[0:1984]
print("Shape of x_test: "+ str(x_test.shape))
print("Shape of y_test: "+ str(y_test.shape))

Shape of x_test: (1984, 230)
Shape of y_test: (1984,)


In [16]:
x_train, x_valid, y_train, y_valid = train_test_split(train_pad[1984:], 
                                                    train.sentiment[1984:], 
                                                    test_size = 0.15, 
                                                    random_state = 789)

print("Length of x_train: "+ str(x_train.shape))
print("Length of y_train: "+ str(y_train.shape) +" \n")
print("Shape of x_valid: "+ str(x_valid.shape))
print("Shape of y_valid: "+ str(y_valid.shape))

Length of x_train: (19563, 230)
Length of y_train: (19563,) 

Shape of x_valid: (3453, 230)
Shape of y_valid: (3453,)
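
The sizes printed above can be checked by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole sample, and that the batch size of 64 set later in this notebook is the reason for holding out 1984 rows.

```python
import math

n_total = 25000          # reviews in the padded training data
n_test = 1984            # first rows held out as the test set
n_rest = n_total - n_test

# Assumption: the validation share is rounded up to a whole sample.
n_valid = math.ceil(0.15 * n_rest)
n_train = n_rest - n_valid

print(n_rest, n_train, n_valid)   # 23016 19563 3453
print(n_test % 64)                # 0, so the test set fills 31 whole batches
```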


#### Batching training data

In [19]:
batch_size = 64
mini_batches_train = mini_batch(x_train, y_train, batch_size)
mini_batches_valid = mini_batch(x_valid, y_valid, batch_size)
mini_batches_test = mini_batch(x_test, y_test, batch_size)
print("Number of train batches: {}\n".format(len(mini_batches_train)))
print("Number of validation batches: {}\n".format(len(mini_batches_valid)))
print("Number of test batches: {}\n".format(len(mini_batches_test)))

Number of train batches: 305

Number of validation batches: 53

Number of test batches: 31
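
The batch counts above are consistent with keeping only full batches of 64 rows, which the fixed placeholder shape used later also requires. The function below is a hypothetical sketch of such a helper and not the `mini_batch` from `rnn_utils`; whether the original shuffles the rows first is not known from this notebook.

```python
def mini_batch_sketch(x, y, batch_size):
    """Hypothetical stand-in for mini_batch: returns full batches only."""
    n_full = len(x) // batch_size   # a trailing partial batch is dropped
    return [(x[i * batch_size:(i + 1) * batch_size],
             y[i * batch_size:(i + 1) * batch_size])
            for i in range(n_full)]

for n_rows in (19563, 3453, 1984):
    rows = list(range(n_rows))
    print(n_rows, len(mini_batch_sketch(rows, rows, 64)))
# 19563 305
# 3453 53
# 1984 31
```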



## Building RNN Model

### TensorFlow Graph & Embeddings

The values used here are the same as in our RNN with `GloVe`.

In [25]:
n_words = len(load_word_list)
embed_size = 50
num_layers = 1
lstm_size = 64
n_epochs = 80
prob = 0.5
seq_len = 230
batch_size = 64

#### Build TensorGraph

The only difference here is that we use `load_word_vector` as our embedding matrix, which was created earlier by learning embeddings with `Gensim fastText`.
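
To make the role of the embedding matrix concrete: an embedding lookup replaces every word index with the matching row of the matrix. The sketch below uses plain lists and made-up toy vectors; `embedding_lookup_sketch` is a hypothetical helper for illustration, not a TensorFlow function.

```python
def embedding_lookup_sketch(embedding_matrix, index_batch):
    """Replace every word index by the matching row of the embedding matrix."""
    return [[embedding_matrix[idx] for idx in review] for review in index_batch]

# Toy values: 4 words with 3-dimensional vectors (the notebook sets embed_size = 50).
toy_vectors = [
    [0.0, 0.0, 0.0],   # index 0, assumed here to be the padding row
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]
batch = [[0, 2, 3], [1, 1, 0]]   # 2 reviews, 3 word indices each
embedded = embedding_lookup_sketch(toy_vectors, batch)

print(len(embedded), len(embedded[0]), len(embedded[0][0]))   # 2 3 3
print(embedded[0][1])                                         # [0.4, 0.5, 0.6]
```

The result has one more dimension than the input: (batch, sequence) becomes (batch, sequence, embedding size), which is the shape the LSTM consumes.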

In [26]:
g = tf.Graph()
with g.as_default():
    # Define Placeholders
    tf_x = tf.placeholder(dtype = tf.int32, shape = (batch_size, seq_len), name = "tf_x")
    tf_y = tf.placeholder(dtype = tf.float32, shape = (batch_size), name = "tf_y")
    tf_keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    # Create Embedded layer
    embedding = tf.nn.embedding_lookup(tf.cast(load_word_vector, tf.float32), tf_x, name='embedding')

    # Define LSTM cells
    drop_prob = tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.BasicLSTMCell(lstm_size), output_keep_prob=tf_keep_prob)
    lstm_cells = tf.contrib.rnn.MultiRNNCell([drop_prob] * num_layers)

    # Set Initial state
    init_state = lstm_cells.zero_state(batch_size, tf.float32)
    lstm_outputs, final_state = tf.nn.dynamic_rnn(lstm_cells, embedding, initial_state=init_state)
    
    logits = tf.squeeze(tf.layers.dense(inputs=lstm_outputs[:,-1], units = 1, activation=None, name = 'logits'))
    y_prob = tf.nn.sigmoid(logits, name = 'probabilities')
    
    predictions = {'probabilities': y_prob,
                   'labels' : tf.cast(tf.round(y_prob), tf.int32,name='labels')}
    # Cost
    cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels = tf_y))
    tf.summary.scalar('cost', cost)
    
    # Optimizer
    global_step = tf.Variable(0, trainable=False)
    starter_learning_rate = 0.1
    learning_rt = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           1000, 0.96, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rt)
    train_op = optimizer.minimize(cost, name = 'train_op')
    
    # Accuracy
    correct_pred = tf.equal(tf.round(y_prob), tf_y)
    acc = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    tf.summary.scalar('accuracy', acc)
    
    merged = tf.summary.merge_all()
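
The learning-rate schedule in the cell above follows the staircase form of exponential decay: the rate is multiplied by 0.96 once every 1000 steps. The sketch below reproduces that formula in plain Python; `staircase_decay` is a hypothetical helper name. Note that the schedule is a function of `global_step`, and in TensorFlow 1.x that counter only advances when it is passed to `minimize` through its `global_step` argument, so this is worth checking when reusing the cell.

```python
def staircase_decay(start_rate, step, decay_steps=1000, decay_rate=0.96):
    """Learning rate after `step` updates, using integer division for the staircase."""
    return start_rate * decay_rate ** (step // decay_steps)

# 24400 = 305 batches x 80 epochs, the total number of training steps here.
for step in (0, 999, 1000, 24400):
    print(step, round(staircase_decay(0.1, step), 6))
```

The rate stays at 0.1 for the first 999 steps and drops to 0.096 at step 1000.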

## Train RNN model

It's time to train the model. Here we build a single-layer RNN with 64 LSTM hidden units. The output activation is a sigmoid function, since we only need `1` (positive review) or `0` (negative review).
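
As a small worked example of the output layer, the sketch below shows how a single logit turns into a probability, a rounded label, and a loss value. It uses the numerically stable form of the sigmoid cross-entropy, `max(x, 0) - x * z + log(1 + exp(-|x|))`, where `x` is the logit and `z` the label. The logits are made-up values and the function names are chosen for this example.

```python
import math

def sigmoid(logit):
    """Squash a real-valued logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

def sigmoid_cross_entropy(logit, label):
    """Stable cross-entropy between a logit and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

for logit, label in [(2.0, 1.0), (2.0, 0.0), (-3.0, 0.0)]:
    prob = sigmoid(logit)
    predicted = int(round(prob))   # 1 = positive review, 0 = negative review
    print(round(prob, 4), predicted, round(sigmoid_cross_entropy(logit, label), 4))
```

A confident correct prediction gives a small loss, while the same logit paired with the opposite label is penalised heavily.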

Changes in accuracy and loss can be monitored via TensorBoard. Once training is done, I will show screenshots of both.

In [21]:
with tf.Session(graph = g) as sess:
    saver = tf.train.Saver(max_to_keep=None)
    dt = datetime.now().strftime("%Y%m%d-%H%M%S")
    logdir = "tensorboard/" + dt + "/"
    writer = tf.summary.FileWriter(logdir, g)
    writer_valid = tf.summary.FileWriter('tensorboard/'+ dt +'_valid', g)
    writer_train = tf.summary.FileWriter('tensorboard/'+ dt +'_train', g)
    iteration = 1
    sess.run(tf.global_variables_initializer())
    
    for epoch in range(n_epochs):
        tic = datetime.now()
        # Running train data
        state = sess.run(init_state)
        for batch_x_train, batch_y_train in mini_batches_train:
            summary, c, _, state, a = sess.run([merged, cost, train_op, final_state, acc],
                                     feed_dict = {'tf_x:0' : batch_x_train,
                                                 'tf_y:0' : batch_y_train,
                                                 init_state : state,
                                                 tf_keep_prob : prob})

            writer.add_summary(summary, iteration)
            iteration +=1
        writer_train.add_summary(summary, epoch+1)
        print("Epoch: {0}/{1} | Train loss: {2:.4f} | Train accuracy: {3:.4f}".format(epoch+1, 
                                                                                      n_epochs, 
                                                                                      c, 
                                                                                      a))
        # Running validation data
        valid_state = sess.run(init_state)
        for batch_x_valid, batch_y_valid in mini_batches_valid:
            summary, c_valid, valid_state, a_valid = sess.run([merged, cost, final_state, acc],
                                     feed_dict = {'tf_x:0' : batch_x_valid,
                                                 'tf_y:0' : batch_y_valid,
                                                 init_state : valid_state,
                                                 tf_keep_prob : 1})
        writer_valid.add_summary(summary, epoch+1)
        print("Epoch: {0}/{1} | Validation loss: {2:.4f} | Validation accuracy: {3:.4f}".format(epoch+1, 
                                                                                                n_epochs, 
                                                                                                c_valid, 
                                                                                                a_valid))
        # Save model every epoch
        saver.save(sess,"./model/fastTextgensim/fastText_review_sentiment_epoch_{}.ckpt".format(epoch+1))
        
        toc = datetime.now()
        # Use a separate name so the `time` module is not overwritten
        elapsed = (toc - tic)
        print("Time: {}".format(elapsed.seconds))
writer.close()
writer_train.close()
writer_valid.close()

Epoch: 1/80 | Train loss: 0.5945 | Train accuracy: 0.6406
Epoch: 1/80 | Validation loss: 0.5668 | Validation accuracy: 0.7500
INFO:tensorflow:./model/fastTextgensim/fastText_review_sentiment_epoch_1.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 175
Epoch: 2/80 | Train loss: 0.4950 | Train accuracy: 0.7656
Epoch: 2/80 | Validation loss: 0.4356 | Validation accuracy: 0.7969
INFO:tensorflow:./model/fastTextgensim/fastText_review_sentiment_epoch_2.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 173
Epoch: 3/80 | Train loss: 0.5239 | Train accuracy: 0.6875
Epoch: 3/80 | Validation loss: 0.4325 | Validation accuracy: 0.7812
INFO:tensorflow:./model/fastTextgensim/fastText_review_sentiment_epoch_3.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 174
Epoch: 4/80 | Train loss: 0.5376 | Train accuracy: 0.7031
Epoch: 4/80 | Validation loss: 0.4367 | Validation accuracy: 0.8281
INFO:tensorflow:./model/fastTextgensim/fastText_review_sen

Time: 220
Epoch: 31/80 | Train loss: 0.2967 | Train accuracy: 0.8906
Epoch: 31/80 | Validation loss: 0.2448 | Validation accuracy: 0.9062
INFO:tensorflow:./model/fastTextgensim/fastText_review_sentiment_epoch_31.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 188
Epoch: 32/80 | Train loss: 0.2405 | Train accuracy: 0.9062
Epoch: 32/80 | Validation loss: 0.2480 | Validation accuracy: 0.8906
INFO:tensorflow:./model/fastTextgensim/fastText_review_sentiment_epoch_32.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 182
Epoch: 33/80 | Train loss: 0.2356 | Train accuracy: 0.9062
Epoch: 33/80 | Validation loss: 0.2357 | Validation accuracy: 0.8906
INFO:tensorflow:./model/fastTextgensim/fastText_review_sentiment_epoch_33.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 181
Epoch: 34/80 | Train loss: 0.2549 | Train accuracy: 0.9219
Epoch: 34/80 | Validation loss: 0.2408 | Validation accuracy: 0.8594
INFO:tensorflow:./model/fastTextgensi

Time: 190
Epoch: 61/80 | Train loss: 0.1810 | Train accuracy: 0.9219
Epoch: 61/80 | Validation loss: 0.3805 | Validation accuracy: 0.8594
INFO:tensorflow:./model/fastTextgensim/fastText_review_sentiment_epoch_61.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 194
Epoch: 62/80 | Train loss: 0.1590 | Train accuracy: 0.9219
Epoch: 62/80 | Validation loss: 0.2534 | Validation accuracy: 0.9062
INFO:tensorflow:./model/fastTextgensim/fastText_review_sentiment_epoch_62.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 223
Epoch: 63/80 | Train loss: 0.2176 | Train accuracy: 0.9062
Epoch: 63/80 | Validation loss: 0.2708 | Validation accuracy: 0.8906
INFO:tensorflow:./model/fastTextgensim/fastText_review_sentiment_epoch_63.ckpt is not in all_model_checkpoint_paths. Manually adding it.
Time: 203
Epoch: 64/80 | Train loss: 0.2191 | Train accuracy: 0.9062
Epoch: 64/80 | Validation loss: 0.2933 | Validation accuracy: 0.9219
INFO:tensorflow:./model/fastTextgensi

### Training accuracy and loss via TensorBoard

TensorBoard plots of the accuracy and the loss over the 80 training epochs.


![](./images/ft_acc.png)


![](./images/ft_cost.png)

Below is the graph we use to detect over-fitting during training. We will pick the checkpoint from epoch 37 as our final model.

- <span style="color:rgb(70,173,193)">Blue line </span>: training cost
- <span style="color:rgb(173,73,190)">Purple line</span>: validation cost

![](./images/ft_trvd_cost.png)
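
In this notebook the checkpoint was chosen by reading the plot. One simple way to automate that kind of choice, assuming the per-epoch validation costs are available as a list, is to take the epoch with the lowest value. The numbers below are made up for illustration and are not taken from this training run.

```python
def best_epoch(valid_costs):
    """Return the 1-based epoch whose validation cost is lowest."""
    return min(range(len(valid_costs)), key=valid_costs.__getitem__) + 1

toy_valid_costs = [0.57, 0.44, 0.31, 0.26, 0.29, 0.35]   # made-up values
print(best_epoch(toy_valid_costs))   # 4
```

After the best epoch the toy validation cost rises again while training would continue to improve, which is the over-fitting pattern the plot is meant to reveal.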

## Testing Model

### Test with our test dataset

In [27]:
with tf.Session(graph = g) as sess:
    saver = tf.train.Saver()
    accuracy = []
    saver.restore(sess, './data/model/fastTextgensim/fastText_review_sentiment_epoch_37.ckpt')
    test_state = sess.run(init_state)
    for batch_x, batch_y in mini_batches_test:
        feed = {'tf_x:0': batch_x, 
            'tf_y:0': batch_y,
            'keep_prob:0' : 1, 
            init_state : test_state}
        a, test_state = sess.run([acc, final_state], feed_dict=feed)
        accuracy.append(a)
    print("Overall accuracy: {0:.4f}".format(np.mean(accuracy)*100))

Overall accuracy: 86.9960
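
The figure above is the mean of the per-batch accuracies. Because every test batch holds the same number of reviews, that mean equals the pooled accuracy over all test reviews. The sketch below checks this on made-up counts; `mean_batch_accuracy` is a hypothetical helper.

```python
def mean_batch_accuracy(correct_per_batch, batch_size):
    """Average of per-batch accuracies, as computed in the test loop."""
    batch_accs = [correct / batch_size for correct in correct_per_batch]
    return sum(batch_accs) / len(batch_accs)

toy_correct = [60, 52, 56]   # made-up counts of correct predictions per batch
pooled = sum(toy_correct) / (64 * len(toy_correct))
print(mean_batch_accuracy(toy_correct, 64), pooled)   # 0.875 0.875
```

With unequal batch sizes the two numbers could differ, so the equal-sized batches matter here.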


### Test model with our own review

The same reviews as before will be used to test the model.

In [35]:
str_review_neg = "Content is very boring and this is a waste of time to see it."
str_review_pos = "Movie is about a spy, which is not new subject. Content is very good and this is a great time to see it."
user_review_neg = format_user_review(train_review = str_review_neg, batch_size = 64,  max_seq_len = 230, word_list = load_word_list)
user_review_pos = format_user_review(train_review = str_review_pos, batch_size = 64,  max_seq_len = 230, word_list = load_word_list)
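
The placeholders in the graph have a fixed shape of `(batch_size, seq_len)`, so a single review has to be fed as part of a full batch of 64 rows. The cells that follow read `lbl[-1]`, which suggests the review sits in the last row. The sketch below shows one way such a batch could be built; it is an assumption about what `format_user_review` does, not its actual code, and it starts from word indices that have already been looked up.

```python
def review_to_batch(indices, batch_size, max_seq_len):
    """Hypothetical: place one pre-padded review in the last row of an all-zero batch."""
    kept = list(indices[:max_seq_len])            # cut an over-long review
    row = [0] * (max_seq_len - len(kept)) + kept  # pad with zeros at the front
    batch = [[0] * max_seq_len for _ in range(batch_size - 1)]
    batch.append(row)
    return batch

toy_batch = review_to_batch([5, 6, 7], batch_size=64, max_seq_len=230)
print(len(toy_batch), len(toy_batch[-1]))   # 64 230
print(toy_batch[-1][-3:])                   # [5, 6, 7]
```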

Let's try the negative review first:

In [36]:
with tf.Session(graph = g) as sess:
    saver.restore(sess, './data/model/fastTextgensim/fastText_review_sentiment_epoch_37.ckpt')
    own_state = sess.run(init_state)
    
    feed = {'tf_x:0': user_review_neg,
            'keep_prob:0' : 1,
            init_state : own_state}
    
    lbl, own_state = sess.run(['labels:0', final_state], feed_dict=feed)
    if lbl[-1] == 0:
        print("This is a negative review.")
    else:
        print("This is a positive review.")

This is a negative review.


Now the positive review:

In [37]:
with tf.Session(graph = g) as sess:
    saver.restore(sess, './data/model/fastTextgensim/fastText_review_sentiment_epoch_37.ckpt')
    own_state = sess.run(init_state)
    
    feed = {'tf_x:0': user_review_pos,
            'keep_prob:0' : 1,
            init_state : own_state}
    
    lbl, own_state = sess.run(['labels:0', final_state], feed_dict=feed)
    if lbl[-1] == 0:
        print("This is a negative review.")
    else:
        print("This is a positive review.")

This is a positive review.


This is the end of our notebook. For more information, please see the report.