# Capstone Project III: Sentiment Analysis of IMDB reviews

The following is a sentiment analysis of movie reviews from the IMDB data set. The data set contains 50,000 reviews, evenly split between positive and negative reviews (25,000 each). A negative review is defined as having a score of <= 4 out of 10, and a positive review has a score of >= 7 out of 10. Neutral reviews (scores between 4 and 7) are not included in the data set.

The aim of this analysis is to create a model which will be able to distinguish between positive and negative reviews. To accomplish this, we will be creating a Recurrent Neural Network in Keras.


In [1]:
# Import the relevant libraries
import tensorflow as tf
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM

# Import the data set from Keras
from keras.datasets import imdb


Using TensorFlow backend.


In [2]:
# Load the data set
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.imdb.load_data()


The data has already been pre-processed. By printing one review, we see that all the words in the review have been mapped to integers. Each integer encodes a word's frequency rank, so smaller integers correspond to more frequent words. The label, which is also an integer, indicates whether the review was negative (0) or positive (1).

In [16]:
print('---review---')
print(X_train[15000])
print('---label---')
print(y_train[15000])


---review---
[1, 13, 391, 1786, 8, 97, 6, 20, 15, 9, 5057, 5, 275, 13, 391, 4, 960, 2215, 795, 15, 14, 9, 6, 717, 2717, 20, 61, 213, 9, 17, 515, 17, 14, 20, 1054, 61, 86, 931, 16, 14, 9, 51, 571, 54, 6, 1026, 2549, 494, 8, 30, 6, 20, 323, 5, 47, 57, 676, 59, 1077, 65681, 278, 8, 97, 6, 20, 59, 1040, 4189, 5, 4104, 18, 540, 8, 491, 8, 4, 20, 8, 923, 50, 16, 57, 109, 943, 5, 57, 5069, 141, 17, 6, 454, 655, 5, 277, 4, 14772, 173, 16, 52, 5, 5057, 21, 51, 16, 4, 213, 13, 219, 6, 372, 140, 8, 6, 313, 169, 49, 1268, 5817, 4, 7541, 216, 46, 499, 23, 6, 55, 1043, 314, 24, 867, 8, 808, 23, 3361, 5, 1129, 120, 41, 7541, 1461, 20, 630, 49, 31, 12854, 61, 4447, 70321, 13, 66, 181, 8, 124, 51, 4, 213, 9, 51, 16, 4, 17839, 1772, 138, 57, 943, 7, 4, 351, 1461, 138, 57, 978, 23, 4, 7541, 51, 9, 4, 213, 7, 4, 314, 1772, 51, 9, 4, 213, 7, 4, 13430, 23, 4, 519, 138, 6, 351, 7541, 138, 376, 178, 44, 35, 3954, 1752, 3127, 14329, 827, 803, 19, 4, 47060, 449, 5, 4, 13155, 9, 15, 1408, 8, 24720, 61475, 313, 1

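We can also map the review back to words, as shown below. Note that `load_data()` shifts every word index by 3 by default (0, 1 and 2 are reserved for the padding, start and unknown tokens), so this direct lookup through `get_word_index()` returns shifted words. The `get_original_text()` function in the Predictions section accounts for the offset.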

In [17]:
word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}
print('---review with words---')
print([id2word.get(i, ' ') for i in X_train[15000]])
print('---label---')
print(y_train[15000])


---review with words---
['the', 'was', 'terrible', 'torture', 'in', 'could', 'is', 'on', 'for', 'it', 'greg', 'to', 'money', 'was', 'terrible', 'of', 'e', 'chosen', 'avoid', 'for', 'as', 'it', 'is', 'sequence', 'position', 'on', 'only', 'come', 'it', 'movie', 'sometimes', 'movie', 'as', 'on', 'filmmakers', 'only', 'how', 'sounds', 'with', 'as', 'it', 'when', 'involved', 'no', 'is', 'list', 'occasional', 'tries', 'in', 'at', 'is', 'on', 'idea', 'to', 'there', 'even', 'turned', 'would', 'la', "'humour'", 'sense', 'in', 'could', 'is', 'on', 'would', 'cold', 'cheating', 'to', 'dating', 'but', 'city', 'in', 'wants', 'in', 'of', 'on', 'in', 'society', 'more', 'with', 'even', 'being', 'quickly', 'to', 'even', 'spider', 'should', 'movie', 'is', '5', 'husband', 'to', 'once', 'of', "fuller's", 'lot', 'with', 'very', 'to', 'greg', 'not', 'when', 'with', 'of', 'come', 'was', 'least', 'is', 'next', 'through', 'in', 'is', 'everyone', 'same', 'good', 'appeal', 'renaissance', 'of', 'shin', 'saw', 'som
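
A toy example makes the index offset concrete. The sketch below assumes the Keras defaults for `imdb.load_data()` (`index_from=3`, with 0, 1 and 2 reserved for the padding, start and unknown tokens). The three-word `toy_word_index` and the `decode` helper are made up for illustration and are not part of the notebook.

```python
# Toy word index in the same format as imdb.get_word_index():
# word -> frequency rank, starting at 1 (hypothetical data).
toy_word_index = {"the": 1, "movie": 2, "terrible": 3}

INDEX_FROM = 3  # assumed Keras default offset applied by imdb.load_data()
SPECIAL_TOKENS = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}


def decode(encoded_review, word_index, index_from=INDEX_FROM):
    """Map an encoded review back to words, undoing the index offset."""
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    id_to_word.update(SPECIAL_TOKENS)
    # Any index without an entry falls back to the unknown token
    return [id_to_word.get(i, "<UNK>") for i in encoded_review]


# "<START> the movie terrible" as it would be encoded with the offset
encoded = [1, 4, 5, 6]
print(decode(encoded, toy_word_index))
```

Without the offset, index 1 would be read as the most frequent word rather than as the start marker, which is why a direct lookup produces words that do not form a sentence.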

## Pre-processing

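To ensure that all input documents are the same length, we will truncate each review to at most 500 words. Because we use `truncating="pre"`, the last 500 words of a longer review are kept. Reviews shorter than 500 words are padded with zeros at the front (`padding="pre"`). We will use the `pad_sequences()` function to achieve this.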

In [36]:
X_train = tf.keras.preprocessing.sequence.pad_sequences(
    sequences = X_train, 
    maxlen=500,
    dtype="int32",
    padding="pre",
    truncating="pre",
    value=0.0
)


In [37]:
X_test = tf.keras.preprocessing.sequence.pad_sequences(
    sequences = X_test, 
    maxlen=500,
    dtype="int32",
    padding="pre",
    truncating="pre",
    value=0.0
)
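
The effect of `padding="pre"` and `truncating="pre"` can be sketched in plain Python. The `pad_pre` helper below is a hypothetical stand-in for what the calls above do to a single review. It is not the Keras implementation and assumes `maxlen` is at least 1.

```python
def pad_pre(sequence, maxlen, value=0):
    """Pad at the front and truncate from the front (assumes maxlen >= 1)."""
    truncated = sequence[-maxlen:]  # keep only the last `maxlen` items
    padding = [value] * (maxlen - len(truncated))
    return padding + truncated


# Short review: zeros are added in front
print(pad_pre([5, 6, 7], maxlen=5))
# Long review: the earliest items are dropped
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))
```

With front padding, the actual words of a short review sit at the end of the sequence, so they are the last inputs the LSTM sees.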


To find the vocabulary size (the input size) of our data set, which is required for our embedding layer, we need to find the largest integer in the data.

In [38]:
print("Max for train set: ", X_train.max())


Max for train set:  88586


Our max integer is 88586. Since the indices start at 0, our vocabulary size is 88586 + 1 = 88587.
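
The same calculation can be sketched on a small, made-up set of encoded reviews. The `toy_reviews` list is hypothetical and only illustrates why one is added to the largest index.

```python
# Hypothetical encoded reviews (ragged lists of word indices)
toy_reviews = [[1, 14, 22], [1, 9, 4, 31], [1, 7]]

# Largest index that appears anywhere in the data
max_index = max(max(review) for review in toy_reviews)

# Indices run from 0 to max_index inclusive, hence the + 1
vocab_size = max_index + 1

print(max_index, vocab_size)
```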

## Build and compile model

Our first layer is an Embedding Layer, which turns positive integers (indexes) into dense vectors of fixed size.
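
Conceptually, an embedding layer is a lookup table with one row per word index. The sketch below illustrates that idea in plain Python with a tiny, randomly initialised table. The sizes and the `embed` helper are hypothetical, and in a real layer the table values are weights learned during training.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

VOCAB_SIZE = 6     # hypothetical; the model below uses 88587
EMBEDDING_DIM = 4  # hypothetical; the model below uses 64

# One row of EMBEDDING_DIM floats per word index
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(encoded_review):
    """Replace each word index by its row of the embedding table."""
    return [embedding_table[i] for i in encoded_review]


vectors = embed([0, 0, 1, 4, 5])
print(len(vectors), len(vectors[0]))  # sequence length, embedding size
```

Every occurrence of the same index maps to the same vector, so a padded review of 500 indices becomes a 500 x 64 array in the model below.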

In [26]:
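model = Sequential()
# Map each word index to a trainable 64-dimensional vector
model.add(Embedding(input_dim=88587,
                    output_dim=64,
                    input_length=500))
# Summarize the 500-step sequence in a single 64-dimensional state
model.add(LSTM(64))
model.add(Dropout(0.2))
# A single sigmoid unit outputs the probability of a positive review
model.add(Dense(1, activation='sigmoid'))
# Note: this Dropout acts on the prediction itself during training, which is
# unusual; it is kept because it is part of the model whose results are reported
model.add(Dropout(0.2))
# Note: 'binary_crossentropy' is the more common loss for a sigmoid output
model.compile(optimizer='adam',
              loss='mse',
              metrics=['accuracy'])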
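
The model above is compiled with `loss='mse'`. Binary cross-entropy is another common choice for a single sigmoid output. As a rough illustration of how the two differ, the sketch below compares the per-example losses for a positive review at several predicted probabilities. The helper functions are written for this example only, assume predictions strictly between 0 and 1, and say nothing about how the reported results would change.

```python
import math


def mse(y_true, y_pred):
    """Squared error for a single example."""
    return (y_true - y_pred) ** 2


def binary_crossentropy(y_true, y_pred):
    """Binary cross-entropy for a single example (0 < y_pred < 1)."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))


# A positive review (label 1) scored at three different probabilities
for p in (0.9, 0.5, 0.01):
    print(p, round(mse(1, p), 4), round(binary_crossentropy(1, p), 4))
```

The squared error can never exceed 1, while cross-entropy keeps growing as a prediction becomes more confidently wrong.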


In [27]:
model.summary()


Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 500, 64)           5669568   
_________________________________________________________________
lstm_7 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dropout_9 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 65        
_________________________________________________________________
dropout_10 (Dropout)         (None, 1)                 0         
Total params: 5,702,657
Trainable params: 5,702,657
Non-trainable params: 0
_________________________________________________________________
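
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM formulation with four gates, each with input weights, recurrent weights and a bias. The variable names are chosen for this example.

```python
vocab_size = 88587   # input_dim of the Embedding layer
embedding_dim = 64   # output_dim of the Embedding layer
lstm_units = 64
dense_units = 1

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a bias, per output unit
dense_params = (lstm_units + 1) * dense_units

# Dropout layers have no parameters
total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

Almost all of the parameters sit in the embedding table, which is a direct consequence of the large vocabulary size.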


## Train the model

In [28]:
model.fit(X_train, y_train, batch_size=35, epochs=10)


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x2ef0cbd7c08>

In [29]:
scores = model.evaluate(X_test, y_test, verbose=1)
scores[1]




0.8458799719810486

Our model reaches a training accuracy of about 89% and a test accuracy of about 85%, so it generalizes reasonably well to the unseen test data.

Note: Increasing the number of epochs improves accuracy up to a certain point; however, it also significantly increases the total training time.

## Predictions

We will now use our model to make predictions on the test set. We pass a few reviews through the model and check whether the predicted sentiment matches the label.

In [30]:
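# predict_classes() is not available in newer Keras releases; thresholding the
# sigmoid output at 0.5 gives the same class labels
predict = (model.predict(X_test) > 0.5).astype("int32")
predict_classes = predict.reshape(len(X_test))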
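
Turning a sigmoid output into a class label amounts to comparing it with a threshold, and accuracy is the fraction of labels that match. The sketch below illustrates both steps in plain Python, assuming a threshold of 0.5. The helpers and the toy values are hypothetical and are not model outputs.

```python
def to_classes(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probabilities]


def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)


toy_probabilities = [0.92, 0.13, 0.55, 0.40]  # hypothetical model outputs
toy_labels = [1, 0, 0, 0]                     # hypothetical true labels

toy_predictions = to_classes(toy_probabilities)
print(toy_predictions, accuracy(toy_labels, toy_predictions))
```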


In [31]:
# The function below converts an encoded (and padded) test review back to words
def get_original_text(i):
    # get_word_index() ranks start at 1; load_data() shifts them by 3
    word_to_id = imdb.get_word_index()
    word_to_id = {k: (v + 3) for k, v in word_to_id.items()}
    word_to_id["<PAD>"] = 0
    word_to_id["<START>"] = 1
    word_to_id["<UNK>"] = 2

    id_to_word = {value: key for key, value in word_to_id.items()}
    # Fall back to <UNK> for any index without an entry
    return ' '.join(id_to_word.get(idx, "<UNK>") for idx in X_test[i])


In [32]:
SentimentDict={1:'positive', 0:'negative'}
def display_test_sentiment(i):
    print(get_original_text(i))
    print('label: ', SentimentDict[y_test[i]], ', prediction: ', SentimentDict[predict_classes[i]])
    

Below we pass three separate reviews through our model. For each review, we show the decoded text, the true label and the predicted label.

In [33]:
display_test_sentiment(3)


<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD

In [34]:
display_test_sentiment(87)


<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD

In [35]:
display_test_sentiment(1206)


<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD

## Conclusion

- To ensure that all inputs had the same length, we limited each review to 500 words. Reviews with fewer than 500 words were padded with zeros using Keras' `pad_sequences()` function.
- We included an Embedding layer in our RNN, which turns positive integers (indexes) into dense vectors of fixed size.
- Using a batch size of 35 and 10 epochs, our train accuracy was about 89% and our test accuracy was about 85%.
- When using our model to predict the sentiment of 3 reviews, it predicted all 3 correctly.