# Recurrent Neural Networks (RNN)

In this notebook, we use RNNs to perform sentiment analysis with Keras, a deep learning API written in Python that runs on top of TensorFlow. We use the IMDB movie review sentiment classification dataset. Its training set contains 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative), and its test set is the same size. The reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. More information on this dataset can be found here: https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification
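
To make the frequency-based indexing concrete, here is a minimal sketch on a toy corpus. The reviews and the helper name `build_frequency_index` are made up for illustration and are not part of Keras. The real loader also reserves a few low indices for special markers by default, so its indices are shifted relative to the raw ranks shown here.

```python
from collections import Counter

def build_frequency_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

# Hypothetical toy corpus, standing in for the real reviews.
toy_reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was thin",
]
word_index = build_frequency_index(toy_reviews)
encoded = [[word_index[w] for w in review.split()] for review in toy_reviews]
print(word_index)
print(encoded)
```

Frequent words such as "the" receive small integers, while rare words receive large ones, which is what makes a frequency cutoff on the vocabulary easy to apply.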

We first import all the required libraries.

In [25]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb

And we load our dataset.

In [26]:
data = imdb.load_data()  # returns ((x_train, y_train), (x_test, y_test))

Now we split our data into training and testing sets and set the maximum number of words to 10000, so that only the 10,000 most frequent words are considered.

In [27]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
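
The effect of the vocabulary cap can be sketched in plain Python. This is an illustration and not the Keras implementation: the helper `cap_vocabulary` and the toy sequences are hypothetical, and the sketch assumes the loader's default out-of-vocabulary index of 2, which is why the value 2 shows up repeatedly in the review printed in this notebook.

```python
def cap_vocabulary(sequences, num_words, oov_index=2):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [[i if i < num_words else oov_index for i in seq] for seq in sequences]

# Made-up sequences containing indices on both sides of the cutoff.
toy_sequences = [[1, 14, 22, 9999, 10000], [1, 25000, 7]]
capped = cap_vocabulary(toy_sequences, num_words=10000)
print(capped)
```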

Let's explore our data.

In [28]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
print(x_train)
print(y_train)

(25000,)
(25000,)
(25000,)
(25000,)
[list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32])
 list([1, 194, 11

We have 25,000 movie reviews in each split, as expected. The words in the reviews have been converted to integer indices: each number in the training features represents a specific word, so every review is a list of integers rather than a string of text.

Labels are 0 or 1, corresponding to negative or positive review respectively. This makes our problem a binary sentiment classification problem.

In [29]:
y_train[0]

1

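To make things easier for our computer, we limit each review to 150 words. With its default settings, pad_sequences truncates longer reviews from the front, so the last 150 words are kept, and it pads shorter reviews with zeros at the front.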

In [30]:
x_train = sequence.pad_sequences(x_train, maxlen=150)
x_test = sequence.pad_sequences(x_test, maxlen=150)
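
The default behavior of `pad_sequences` on a single sequence can be mimicked in a few lines. This is a sketch for intuition and not the library code: the helper name `pad_or_truncate` is hypothetical, and it assumes a positive `maxlen`.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic 'pre' padding and 'pre' truncation on one sequence (maxlen > 0)."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]                    # keep the last maxlen items
    return [value] * (maxlen - len(seq)) + seq  # left-pad with the fill value

long_review = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
short_review = pad_or_truncate([7, 8], maxlen=4)
print(long_review)
print(short_review)
```

To keep the beginning of each review instead, `pad_sequences` accepts `truncating='post'`, and `padding='post'` moves the zeros to the end.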

Now we are going to build our RNN.

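We start with an embedding layer, which converts the integer word indices into dense vectors of fixed size that are better suited to a neural network. We set the vocabulary size to 25000 and the output dimension to 128. Since we loaded only the 10,000 most frequent words, a vocabulary size of 10000 would be enough; the larger value simply leaves unused rows in the embedding matrix.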

Our next layer is an LSTM. Its first argument, 128, is the number of LSTM units, which is also the size of its output. Here it happens to equal the embedding dimension, although the two do not need to match. Dropout terms are added to reduce overfitting, something that RNNs are particularly prone to.

The last layer is a dense layer with a single unit and a sigmoid activation function, which produces the output of our model: a value between 0 and 1 that we interpret as the probability that the review is positive.

In [35]:
model = Sequential()
model.add(Embedding(25000, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, None, 128)         3200000   
_________________________________________________________________
lstm_4 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 129       
Total params: 3,331,713
Trainable params: 3,331,713
Non-trainable params: 0
_________________________________________________________________
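
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector; the variable names are chosen here for illustration.

```python
vocab_size, embed_dim, lstm_units = 25000, 128, 128

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim
# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# One weight per LSTM output plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding matrix, so reducing the vocabulary size is the most direct way to shrink this model.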


We use the binary cross-entropy loss function, which computes the cross-entropy between the true labels and the predicted probabilities. This loss suits our problem because we have only two label classes. We also use the Adam optimizer.
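
For intuition, binary cross-entropy can be computed by hand on a couple of made-up predictions. This is a plain-Python sketch, not the Keras implementation; the clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Same labels, once with confident correct and once with confident wrong predictions.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong.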

In [32]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Now we train our model on a CPU. Training takes a long time this way, so a GPU would be a great help if one were available.

In [33]:
model.fit(x_train, y_train, batch_size=32, epochs=15, verbose=2, validation_data=(x_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/15
25000/25000 - 954s - loss: 0.4717 - accuracy: 0.7776 - val_loss: 0.4157 - val_accuracy: 0.8178
Epoch 2/15
25000/25000 - 1532s - loss: 0.3546 - accuracy: 0.8510 - val_loss: 0.4864 - val_accuracy: 0.7688
Epoch 3/15
25000/25000 - 890s - loss: 0.2892 - accuracy: 0.8832 - val_loss: 0.3727 - val_accuracy: 0.8514
Epoch 4/15
25000/25000 - 925s - loss: 0.2294 - accuracy: 0.9106 - val_loss: 0.3674 - val_accuracy: 0.8512
Epoch 5/15
25000/25000 - 1258s - loss: 0.1872 - accuracy: 0.9271 - val_loss: 0.4017 - val_accuracy: 0.8495
Epoch 6/15
25000/25000 - 2324s - loss: 0.1529 - accuracy: 0.9424 - val_loss: 0.4332 - val_accuracy: 0.8407
Epoch 7/15
25000/25000 - 2166s - loss: 0.1365 - accuracy: 0.9508 - val_loss: 0.4763 - val_accuracy: 0.8461
Epoch 8/15
25000/25000 - 1308s - loss: 0.0987 - accuracy: 0.9651 - val_loss: 0.4894 - val_accuracy: 0.8447
Epoch 9/15
25000/25000 - 1455s - loss: 0.0796 - accuracy: 0.9725 - val_loss: 0.5733 - val_accurac

<tensorflow.python.keras.callbacks.History at 0x646111650>

We now evaluate the accuracy of our model:

In [34]:
score, acc = model.evaluate(x_test, y_test, batch_size=32, verbose=2)
print('Test score:', score)
print('Test accuracy:', acc)

25000/1 - 212s - loss: 0.9275 - accuracy: 0.8344
Test score: 0.8324115474414825
Test accuracy: 0.83444
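
As a reminder of what the accuracy figure means, the sketch below turns sigmoid outputs into class predictions with a 0.5 threshold and counts the matches. The logits and labels are made up for illustration, and the helper names are hypothetical.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(labels, probabilities, threshold=0.5):
    """Fraction of thresholded predictions that match the labels."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

logits = [2.0, -1.0, 0.3, -0.2]  # made-up pre-activation values
labels = [1, 0, 0, 0]
probs = [sigmoid(z) for z in logits]
toy_accuracy = accuracy(labels, probs)
print(toy_accuracy)
```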


83% is not bad considering that we kept only 150 words of each review. With a GPU, training would be fast enough to experiment with larger models or longer sequences, which could improve the results.

The validation accuracy did not improve after the third epoch while the training accuracy kept rising, so the model was probably overfitting. It would have been a good idea to include early stopping.
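
The idea behind early stopping can be sketched on the validation losses logged above for the first eight epochs. This plain-Python version only illustrates a patience rule; the helper name `early_stopping_epoch` and the patience of 2 are assumptions made here. In Keras, the same idea is available as the `EarlyStopping` callback (for example with `monitor='val_loss'`, `patience=2`, and `restore_best_weights=True`), passed to `model.fit` through its `callbacks` argument.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, for a patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs
    return len(val_losses), best_epoch

# val_loss values copied from the training log (epochs 1 to 8).
logged_val_loss = [0.4157, 0.4864, 0.3727, 0.3674, 0.4017, 0.4332, 0.4763, 0.4894]
stop_epoch, best_epoch = early_stopping_epoch(logged_val_loss, patience=2)
print(stop_epoch, best_epoch)
```

Under these assumptions, training would have stopped well before epoch 15, with the lowest validation loss reached at epoch 4.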