## Recurrent Neural Networks for Movie Reviews (IMDB)

## Sentiment analysis from movie reviews



We are going to use an RNN to do sentiment analysis on movie reviews.

Since understanding written language requires keeping track of all the words in a sentence, we need a recurrent neural network to keep a "memory" of the words that have come before as it "reads" sentences over time.

We'll use LSTM (Long Short-Term Memory) cells, a kind of recurrent unit designed to retain information across many time steps, so the model can still "remember" words from much earlier in a review.
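
To make the "memory" idea concrete, here is a minimal sketch of one LSTM time step using the standard gate equations, reduced to scalars so it runs in plain Python. The weights in `params` are hypothetical, hand-picked values; a trained model would learn them, and the Keras layer works on vectors rather than single numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; p maps gate name -> (w, u, b)."""
    def pre(gate):
        w, u, b = p[gate]
        return w * x + u * h_prev + b
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # new cell state (the "memory")
    h = o * math.tanh(c)              # new hidden state (the output)
    return h, c

# Hypothetical weights: a large forget bias means the cell holds on to its state.
params = {
    "forget": (0.0, 0.0, 2.0),
    "input": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 1.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0, 0.0]:        # one "event" followed by silence
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:.1f}  c={c:.3f}  h={h:.3f}")
```

The cell state `c` only decays by the forget-gate factor at each step, which is how information from the first input can survive through later steps.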

In [2]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb

Load our training and testing data. We only consider the 20,000 most frequent words in the dataset to limit the load and run time. The dataset includes 25,000 training reviews and 25,000 testing reviews.

In [3]:
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)

Loading data...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


Let's get a feel for what this data looks like. We print the number of training and test reviews, the label of the first review in each set, and the first test review:

In [10]:
print(len(x_train), len(x_test))
print(y_train[0], y_test[0])
print(x_test[0])

25000 25000
1 0
[1, 591, 202, 14, 31, 6, 717, 10, 10, 18142, 10698, 5, 4, 360, 7, 4, 177, 5760, 394, 354, 4, 123, 9, 1035, 1035, 1035, 10, 10, 13, 92, 124, 89, 488, 7944, 100, 28, 1668, 14, 31, 23, 27, 7479, 29, 220, 468, 8, 124, 14, 286, 170, 8, 157, 46, 5, 27, 239, 16, 179, 15387, 38, 32, 25, 7944, 451, 202, 14, 6, 717]


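These reviews come already encoded: each one is a list of integers, where each integer is the index of a word ranked by how frequently it occurs in the dataset. So we don't need to tokenize the text or build a vocabulary ourselves.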
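
As a rough illustration of that kind of encoding, here is a toy encoder in plain Python. It is a simplified imitation, not the Keras implementation: `build_word_index`, `encode`, and the three toy reviews are made up for this sketch. The reserved values (1 for start-of-review, 2 for out-of-vocabulary, real words shifted by 3) follow the defaults of `imdb.load_data`.

```python
from collections import Counter

START, OOV, INDEX_FROM = 1, 2, 3  # same defaults as imdb.load_data

def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets rank 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties: alphabetical
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(text, word_index, num_words=None):
    """Turn a review into a list of integers."""
    ids = [START]
    for word in text.split():
        rank = word_index.get(word)
        idx = None if rank is None else rank + INDEX_FROM
        # Unknown words, or words outside the num_words cap, become OOV.
        if idx is None or (num_words is not None and idx >= num_words):
            idx = OOV
        ids.append(idx)
    return ids

toy_reviews = ["the movie was great", "the movie was bad", "the plot was thin"]
word_index = build_word_index(toy_reviews)
print(encode("the movie was great", word_index))               # [1, 4, 6, 5, 8]
print(encode("the movie was great", word_index, num_words=6))  # [1, 4, 2, 5, 2]
```

This is why every review in the real data starts with 1, and why a smaller `num_words` simply turns rarer words into 2.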

The review sentiment is encoded as 0 or 1: 1 means the review was positive and 0 means it was negative.

So this is a binary classification problem.

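RNNs can become computationally heavy very quickly, so finding any way to limit the amount of data to deal with helps significantly. So why don't we just try to determine the sentiment from 80 words of each review? Note that pad_sequences truncates from the front by default (truncating='pre'), so the call below keeps the last 80 words of longer reviews and pads shorter ones with leading zeros.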

In [5]:
x_train = sequence.pad_sequences(x_train, maxlen=80)
x_test = sequence.pad_sequences(x_test, maxlen=80)
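
To see what the default padding and truncation do, here is a small plain-Python stand-in. `pad_pre` is a hypothetical helper written for this sketch; it only mimics the default `padding='pre'`, `truncating='pre'` behavior on a single list.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pad_sequences defaults: padding='pre', truncating='pre'."""
    kept = seq[-maxlen:]                           # truncation drops the front
    return [value] * (maxlen - len(kept)) + kept   # padding is added at the front

print(pad_pre([1, 2, 3, 4, 5], maxlen=3))  # [3, 4, 5]
print(pad_pre([1, 2], maxlen=4))           # [0, 0, 1, 2]
```

With the defaults, a long review keeps its ending rather than its beginning. To keep the beginning instead, pad_sequences accepts truncating='post'.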

Now let's set up our neural network model.

We will start with an Embedding layer, which converts each word index into a dense vector of fixed size. These vectors are better suited for a neural network than raw integer indices, and an Embedding layer is the usual first layer for index-based text data like ours. The 20,000 is the vocabulary size, and 128 is the length of the vector produced for each word.
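
Conceptually, an embedding is just a lookup table with one row per word index. Here is a minimal sketch with toy sizes, in plain Python; the names `table` and `embed` are made up for this illustration, and in the real layer the rows are trained along with the rest of the model.

```python
import random

random.seed(0)
vocab_size, embed_dim = 10, 4   # toy sizes; the model here uses 20000 and 128

# The embedding table: one row of embed_dim floats per word index.
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

def embed(sequence_ids):
    """Look up one vector per index: result is len(sequence_ids) x embed_dim."""
    return [table[i] for i in sequence_ids]

vectors = embed([1, 4, 6, 5])
print(len(vectors), len(vectors[0]))  # 4 4
```

So a padded review of 80 indices becomes 80 vectors of length 128 before it reaches the next layer.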

Next we set up an LSTM layer for the RNN itself. We give it 128 units (the same as the embedding size here, although the two don't have to match), and dropout terms to reduce overfitting, which RNNs are particularly prone to.

Finally we just need to finish with a single neuron with a sigmoid activation function. It outputs a value between 0 and 1, which we read as the probability of a positive review, giving us our binary sentiment classification.

In [6]:
model = Sequential()
model.add(Embedding(20000, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
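
To get a sense of the model's size, we can count its trainable parameters by hand. This is a back-of-the-envelope sketch using the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias); calling model.summary() on the built model would be the way to confirm the figures.

```python
vocab_size, embed_dim, units = 20000, 128, 128

# Embedding: one vector of length embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * (embed_dim * units + units * units + units)

# Dense: one weight per LSTM unit plus a single bias
dense_params = units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 2560000 131584 129 2691713
```

Almost all of the roughly 2.7 million parameters sit in the embedding table, which is one reason for capping the vocabulary at 20,000 words.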

Since this is a binary classification problem, we use the binary_crossentropy loss function, along with the adam optimizer.

In [7]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
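
For intuition about what this loss measures, here is a plain-Python sketch of binary cross-entropy on a handful of predictions. The function below is written for illustration only; the clipping constant `eps` is an assumption made here to avoid log(0), not a value taken from the notebook.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

print(binary_crossentropy([1, 0], [0.9, 0.1]))  # confident and right: small loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```

The loss punishes confident wrong answers heavily, so it can keep rising even while accuracy stays roughly flat.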

Now we train!

In [8]:
model.fit(x_train, y_train,
          batch_size=32,
          epochs=15,
          verbose=2,
          validation_data=(x_test, y_test))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/15
 - 139s - loss: 0.6580 - acc: 0.5869 - val_loss: 0.5437 - val_acc: 0.7200
Epoch 2/15
 - 138s - loss: 0.4652 - acc: 0.7772 - val_loss: 0.4024 - val_acc: 0.8153
Epoch 3/15
 - 136s - loss: 0.3578 - acc: 0.8446 - val_loss: 0.4024 - val_acc: 0.8172
Epoch 4/15
 - 134s - loss: 0.2902 - acc: 0.8784 - val_loss: 0.3875 - val_acc: 0.8276
Epoch 5/15
 - 135s - loss: 0.2342 - acc: 0.9055 - val_loss: 0.4063 - val_acc: 0.8308
Epoch 6/15
 - 132s - loss: 0.1818 - acc: 0.9292 - val_loss: 0.4571 - val_acc: 0.8308
Epoch 7/15
 - 124s - loss: 0.1394 - acc: 0.9476 - val_loss: 0.5458 - val_acc: 0.8177
Epoch 8/15
 - 126s - loss: 0.1062 - acc: 0.9609 - val_loss: 0.5950 - val_acc: 0.8133
Epoch 9/15
 - 133s - loss: 0.0814 - acc: 0.9712 - val_loss: 0.6440 - val_acc: 0.8218
Epoch 10/15
 - 134s - loss: 0.0628 - acc: 0.9783 - val_loss: 0.6525 - val_acc: 0.8138
Epoch 11/15
 - 136s - loss: 0.0514 - acc: 0.9822 - val_loss: 0.7252 - val_acc: 0.8143
Epoch 12/15
 

<tensorflow.python.keras.callbacks.History at 0x21c29ab8630>

Checking the accuracy:

In [9]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=32,
                            verbose=2)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.9316869865119457
Test accuracy: 0.80904


81% is not perfect, but we only used 80 words of each review to save time. With more computing power, the model could be trained on longer sequences.
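
One thing the training log suggests is that most of the later epochs did not help: val_loss is lowest at epoch 4 and climbs afterwards while training accuracy keeps rising. As a rough sketch, here is the patience-based stopping rule applied to the val_loss values copied from the log (epochs 1 to 11, since the log is cut off after that). `early_stop_epoch` is a hypothetical helper for illustration; in Keras, the EarlyStopping callback provides this kind of behavior during fit.

```python
# val_loss for epochs 1-11, copied from the training log above
val_loss = [0.5437, 0.4024, 0.4024, 0.3875, 0.4063, 0.4571,
            0.5458, 0.5950, 0.6440, 0.6525, 0.7252]

def early_stop_epoch(losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; if that never happens, stop_epoch is the last epoch.
    """
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(losses), best_epoch

print(early_stop_epoch(val_loss))  # (6, 4)
```

With a patience of 2, training would have stopped after epoch 6 with the best weights coming from epoch 4, at well under half the training time.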