# Sentiment analysis on IMDB reviews: TensorFlow GloVe and LSTM (improved 1)

In this notebook I will try to perform sentiment analysis using TensorFlow. Most of the notebook is a copy of what was done on this blog:
https://www.oreilly.com/learning/perform-sentiment-analysis-with-lstms-using-tensorflow

Everything is implemented "manualy" and it will be a good basis to go toward something more refined. The difference with the original notebook is that at the end of this notebook, we test the accuracy of the model on the whole testing set (once). This allows us to compare its performance with other models.

## Libraries

In [1]:
import numpy as np
import csv
import io
import tensorflow as tf
import pickle

## Preprocessing and Data exploration

The Data exploration part (measuring the average number of words in the reviews) and the data preprocessing, turning texts into sequence of indexes corresponding the GloVes word embeddings, are done in another notebook called IMDB_sent_an_data_preprocessing. The variable created there are then loaded in the next sections.

## Loading matrices of embedding indexes and lists of labels

The pretrained embeddins from GloVe can be downloaded here: https://nlp.stanford.edu/projects/glove/

There are different **word embedding sizes**. The possibilities are 50, 100, 200, 300. We define the one we use next.

In [2]:
word_emb_size = '100'

In [3]:
prepr_dir = '/home/aritz/Documents/CS_Programming_Machine_Learning/Projects/IMDB_sentiment_analysis/IMDB_sent_an_data_preprocessing/'

In [4]:
ids_train = np.load(prepr_dir+'Saved_embeddings/idsMatrixTrain'+word_emb_size+'.npy')
ids_test = np.load(prepr_dir+'Saved_embeddings/idsMatrixTest'+word_emb_size+'.npy')

In [5]:
max_seq_len = ids_train.shape[1]

Next we load the **labels** with and without **one-hot-encoding** ([1, 0] for positive and [0, 1] for negative).

In [6]:
with open(prepr_dir+"y_train_ord.txt", "rb") as fp:
    y_train_ord = pickle.load(fp)

with open(prepr_dir+"y_test_ord.txt", "rb") as fp:
    y_test_ord = pickle.load(fp)

with open(prepr_dir+"y_train.txt", "rb") as fp:
    y_train = pickle.load(fp)

with open(prepr_dir+"y_test.txt", "rb") as fp:
    y_test = pickle.load(fp)

Next we load the **list of words in the GloVe table** and a numpy array containing the **GloVe look-up table**:

In [7]:
with open(prepr_dir+"words_list.txt", "rb") as fp:
    words_list = pickle.load(fp)

In [8]:
word_vectors = np.load(prepr_dir+'word_vectors.npy')

## Creating batching functions

In this section we create two functions which will help feeding the model with batches of samples.

In [9]:
from random import randint

The original implementation of getTrainBatch would only use part of the training data for the training phase (indices 0 to 11498 for positive, instead of 0 to 12498, and indices 13499 to 24999 for negatives instead of 12499 to 24999). Since it is not clear exactly where the indices of the positive reviews stop and when the indices of the negative reviews start, I take stop at 12499 for positive and starts at 12502 for negative

In [10]:
def getTrainBatch():
    labels = []
    arr = np.zeros([batch_size, max_seq_len])
    for i in range(batch_size):
        if (i % 2 == 0): 
            num = randint(1,12499)
            labels.append([1,0])
        else:
            num = randint(12502,24999)
            labels.append([0,1])
        arr[i] = ids_train[num-1:num]
    return arr, labels

The next function is actually pretty useless. I will replace it by a function which tests the model against the whole test data (and not just against some part of the training data).

In [11]:
def getTestBatch():
    labels = []
    arr = np.zeros([batch_size, max_seq_len])
    for i in range(batch_size):
        num = randint(11499,13499)
        if (num <= 12499):
            labels.append([1,0])
        else:
            labels.append([0,1])
        arr[i] = ids_train[num-1:num]
    return arr, labels

## Definition of the model

In [12]:
batch_size = 24
lstm_units = 64
num_classes = 2

In [13]:
import tensorflow as tf
tf.reset_default_graph()

These placeholders are here to take the input of the model (labels and samples turned into arrays of indices).

In [14]:
labels = tf.placeholder(tf.float32, [batch_size, num_classes])
input_data = tf.placeholder(tf.int32, [batch_size, max_seq_len])

Then we embed the indices into vectors. The next cell is commented out because I think that it is useless. It was in the tutorial but I suspect that its author forgot to remove it.
As explained in the tutorial, we were using pretrained embeddings where vectors have length 50. But here numDimensions are of length 300. And in the following cell, 'data' defined again... I ran the notebook with and without it and it gives similar results so I commented it out.

In [15]:
#data = tf.Variable(tf.zeros([batch_size, max_seq_len, numDimensions]),dtype=tf.float32)

In [16]:
data = tf.nn.embedding_lookup(word_vectors, input_data)

The following cell is a fix comming from
https://github.com/tgjeon/TensorFlow-Tutorials-for-Time-Series/issues/2
If I don't put it, errors appear in the cell after this one.

In [17]:
data = tf.cast(data, tf.float32)

Next we define the LSTM with dropout layer. According to the tutorial, the parameter lstm_units needs some tuning to find the optimal value.

In [18]:
lstmCell = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.75)

Instructions for updating:
Use the retry module or similar alternatives.


If I understood it right, 'value' in the next cell represents the outputs of the lstm (for each sample of the batch and each word of each sample). According to the documentation it should have dimensions equal to [batch_size, max_time, cell.output_size].

In [19]:
value, _ = tf.nn.dynamic_rnn(lstmCell, data, dtype=tf.float32)

Next we add some afine transformation.

In [20]:
weight = tf.Variable(tf.truncated_normal([lstm_units, num_classes]))
bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))

If I'm not mistaken the next cell swaps the two first dimensions so it has dimensions [max_time, batch_size, cell.output_size].

In [21]:
value = tf.transpose(value, [1, 0, 2])

If I'm not mistaken the next cell slices the part of the output which corresponds to the last output of the lstm, or in other words the output corresponding to the last word for every sample (if I'm right we used 0 padding and cut everything which goes beyound 250 words, so technically it is the 250th output). My guess is that last has dimensions [batch_size, cell.output_size] which we can then use to do matrix multiplication with weight which has dimensions [cell.output_size, num_classes] (remember that cell.output_size=lstm_units).

In [22]:
last = tf.gather(value, int(value.get_shape()[0]) - 1)

In [23]:
prediction = (tf.matmul(last, weight) + bias)

Next we compute accuracy.

In [24]:
correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))

Next we compute the cross-entropy loss using the logits (i.e. unnormalized probabilities), and we define the optimizer. Note that I replaced tf.nn.softmax_cross_entropy_with_logits (as in the original script) by tf.nn.softmax_cross_entropy_with_logits_v2 as indicated by a warning.

In [25]:
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)

The next cell allows us to use TensorBoard to visualize the loss and accuracy.

In [26]:
sess = tf.InteractiveSession()

In [27]:
import datetime

tf.summary.scalar('Loss', loss)
tf.summary.scalar('Accuracy', accuracy)
merged = tf.summary.merge_all()
logdir = "tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
writer = tf.summary.FileWriter(logdir, sess.graph)

## Training

iterations gives the number of batches against whom we are going to train our model.

In [28]:
iterations = 10001 #originally 100'000

In [29]:
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())

In [30]:
for i in range(iterations):
    #Next Batch of reviews
    nextBatch, nextBatchLabels = getTrainBatch();
    sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels})

    #Write summary to Tensorboard
    if (i % 50 == 0):
        summary = sess.run(merged, {input_data: nextBatch, labels: nextBatchLabels})
        writer.add_summary(summary, i)

    #Save the network every 10,000 training iterations
    if (i % 10000 == 0 and i != 0):
        save_path = saver.save(sess, "models/pretrained_lstm.ckpt", global_step=i)
        print("saved to %s" % save_path)
writer.close()

saved to models/pretrained_lstm.ckpt-10000


## TensorBoard

The author of the tutorial also mentions the possibility of tracking the progress of the model on TensorBoard by entering "tensorboard --logdir=tensorboard" in a terminal, and visiting http://localhost:6006/ with a browser.

## Using a pretrained model

After one has trained a first time the model, one can reuse it during the next executions.

In [28]:
sess = tf.InteractiveSession()
saver = tf.train.Saver()
saver.restore(sess, tf.train.latest_checkpoint('models'))

INFO:tensorflow:Restoring parameters from models/pretrained_lstm.ckpt-10000


## Testing our model on test data

To make sure one improves the quality of the model without overfitting, one has to test it agains test data. In the tutorial, they advise to alternate training phases on training data and testing phases on test data, and stop when the accuracy on test data starts decreasing.

Below we test the accuracy of the model against the whole test data.

In [29]:
num_files_test = ids_test.shape[0]

In [30]:
n_iter_test = int(num_files_test/batch_size)

In [31]:
labels_test = y_test
accuracy_test = np.zeros(n_iter_test)

In [32]:
def get_test_batch_order(i):
    next_batch_labels = labels_test[(i):(i+batch_size)]
    next_batch = np.zeros([batch_size, max_seq_len])
    for j in range(batch_size):
        next_batch[j] = ids_test[(i+j):(i+j+1)]
    return next_batch, next_batch_labels

In [33]:
for i in range(n_iter_test):
    next_batch, next_batch_labels = get_test_batch_order(i)
    accuracy_test[i] = (sess.run(accuracy, {input_data: next_batch, labels: next_batch_labels})) * 100

In [34]:
print("Accuracy on test set = ", accuracy_test.mean())

Accuracy on test set =  85.11847582261913
