In [4]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb

In [5]:
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)

Loading data...


In [6]:
x_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,

This doesn't look like a movie review, but it has been formatted to spare us a lot of trouble: every word has already been converted to an integer index. The actual letters that make up a word don't really matter as far as our model is concerned. What matters are the words themselves, and our model needs numbers to work with, not letters.
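To get a feel for where integers like these come from, here is a toy sketch of the same idea in plain Python. It is an illustration, not the code that built the dataset: the three sample sentences and the helper names `build_vocab` and `encode` are made up, and the handling of `num_words` is simplified. The reserved values (1 for the start of a review, 2 for an out-of-vocabulary word, and an offset of 3) mirror the defaults of `imdb.load_data`, which is why the review above starts with 1 and contains a few 2s.

```python
from collections import Counter

# These mirror the defaults of imdb.load_data (start_char, oov_char, index_from).
START, OOV, INDEX_FROM = 1, 2, 3

def build_vocab(texts, num_words):
    """Rank words by frequency (rank 1 = most common) and keep the top num_words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked[:num_words], start=1)}

def encode(text, vocab):
    """Turn a review into a list of integers, starting with the START marker."""
    ids = [START]
    for word in text.lower().split():
        rank = vocab.get(word)
        ids.append(rank + INDEX_FROM if rank is not None else OOV)
    return ids

# Hypothetical mini-corpus, just for illustration.
reviews = ["the movie was great", "the movie was bad", "the plot was thin"]
vocab = build_vocab(reviews, num_words=3)      # keeps "the", "was", "movie"
print(encode("the movie was great", vocab))    # [1, 4, 6, 5, 2]
```

The rare word "great" falls outside the three-word vocabulary, so it comes out as 2, the same thing that happens to any word outside the top 20,000 in the real data.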

In [7]:
y_train[0]

1

The labels are just 0 or 1, indicating whether the reviewer disliked (0) or liked (1) the movie.

---

So to recap, we have a bunch of movie reviews that have been converted into vectors of words represented by integers and a binary sentiment classification to learn from.

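RNNs can blow up quickly, so again to keep things manageable on our little PC let's limit each review to 80 words. Note that pad_sequences truncates from the front by default, so it is the last 80 words of a long review that are kept, and shorter reviews are padded with zeros at the front.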

In [8]:
x_train = sequence.pad_sequences(x_train, maxlen=80)
x_test = sequence.pad_sequences(x_test, maxlen=80)
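It's worth being clear about what that call does to each review. The sketch below mimics the default behavior of `pad_sequences` (`padding='pre'`, `truncating='pre'`) in plain Python; `pad_pre` is a hypothetical helper written for this illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pad_sequences defaults: pad at the front, truncate from the front."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    kept = list(seq)[-maxlen:]                    # long input: keep the LAST maxlen items
    return [value] * (maxlen - len(kept)) + kept  # short input: zeros go in front

print(pad_pre([1, 14, 22], maxlen=5))             # [0, 0, 1, 14, 22]
print(pad_pre([1, 14, 22, 16, 43, 530], maxlen=4))  # [22, 16, 43, 530]
```

If you would rather keep the beginning of each review, `pad_sequences` accepts `truncating='post'`.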

Now let's set up our model. Considering how complicated an LSTM neural network is under the hood, it's amazing how easy this is to do with Keras.

We'll start with an embedding layer. This is just a step that converts the integer indices into dense vectors of fixed size, which are better suited to a neural network. You generally see this in conjunction with index-based text data like we have here. The 20,000 is the vocabulary size (remember we only wanted the top 20,000 words) and 128 is the output dimension, so each word becomes a vector of 128 numbers.
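Conceptually, an embedding layer is nothing more than a lookup table with one row per word index. Here is a minimal sketch with toy sizes; the random numbers stand in for weights that Keras would learn during training, and `table` and `embed` are names made up for this example.

```python
import random

VOCAB_SIZE, EMBED_DIM = 10, 4   # toy sizes; the notebook uses 20000 and 128

random.seed(0)
# One row of EMBED_DIM floats per word index (placeholder values, not learned weights).
table = [[random.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Replace each integer index with its row of the table."""
    return [table[i] for i in token_ids]

vectors = embed([1, 4, 6, 5, 2])
print(len(vectors), len(vectors[0]))   # 5 4: five words, four numbers each
```

The same index always maps to the same vector, so every occurrence of a word looks identical to the layers that follow.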

Next we just have to set up an LSTM layer for the RNN itself. It's that easy. We give it 128 units, the same as the output size of the embedding layer (a convenient choice rather than a requirement), and add dropout terms to avoid overfitting, which RNNs are particularly prone to.

Finally we just need to boil it down to a single neuron with a sigmoid activation function, whose output between 0 and 1 gives us our binary sentiment classification.

In [9]:
model = Sequential()
model.add(Embedding(20000, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
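Those three lines hide a lot of weights. As a rough sanity check, the standard parameter-count formulas for each layer type can be worked out by hand; the helper functions below are written for this illustration, and the result should agree with what `model.summary()` reports once the model is built.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per word in the vocabulary.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gate/candidate blocks (input, forget, cell candidate, output),
    # each with input weights, recurrent weights and a bias vector.
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # A weight per input plus one bias per output unit.
    return input_dim * units + units

counts = {
    "embedding": embedding_params(20000, 128),   # 2,560,000
    "lstm": lstm_params(128, 128),               # 131,584
    "dense": dense_params(128, 1),               # 129
}
print(counts, sum(counts.values()))              # total: 2,691,713
```

Almost all of the parameters live in the embedding table, which is one reason capping the vocabulary at 20,000 words matters.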



As this is a binary classification problem, we'll use the binary_crossentropy loss function. And the Adam optimizer is usually a good choice.

In [10]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
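To see what this loss actually measures, here is a hand-rolled sketch of the sigmoid output and the binary cross-entropy for a single review. It shows the textbook formulas, not Keras's internal implementation, and the clipping constant `eps` is an assumption chosen for this example.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example: small when p agrees with the 0/1 label."""
    p = min(max(p, eps), 1 - eps)   # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)                              # about 0.88: "probably positive"
print(round(p, 4))                            # 0.8808
print(round(binary_crossentropy(1, p), 4))    # 0.1269 (label agrees, small loss)
print(round(binary_crossentropy(0, p), 4))    # 2.1269 (label disagrees, big loss)
```

A confident wrong answer is punished much harder than a confident right answer is rewarded, which is what pushes the network toward well-calibrated predictions.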

Now we'll train our model. RNNs, like CNNs, are very resource heavy. Keeping the batch size relatively small is the key to enabling this to run on your PC at all. In the real world, of course, you'd be taking advantage of GPUs installed across many computers on a cluster to make this scale a lot better.
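To make the batch size concrete: the IMDB training set holds 25,000 reviews, so with a batch size of 32 each epoch takes a fixed number of weight updates. The arithmetic below is a quick sketch (the helper name `steps_per_epoch` is just for this example); the result is the step counter Keras prints in its training log.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

steps = steps_per_epoch(25000, 32)
last_batch = 25000 - (steps - 1) * 32     # the final batch is a partial one
print(steps, last_batch)                  # 782 8
```

A smaller batch means less memory per step but more steps per epoch, so there is a trade-off between fitting on your machine and how long each epoch takes.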

In [11]:
model.fit(x_train, y_train,
          batch_size=32,
          epochs=15,
          verbose=2,
          validation_data=(x_test, y_test))

Epoch 1/15
782/782 - 235s - loss: 0.4276 - accuracy: 0.7987 - val_loss: 0.3918 - val_accuracy: 0.8329
Epoch 2/15
782/782 - 235s - loss: 0.2519 - accuracy: 0.8994 - val_loss: 0.3795 - val_accuracy: 0.8329
Epoch 3/15
782/782 - 241s - loss: 0.1649 - accuracy: 0.9385 - val_loss: 0.4371 - val_accuracy: 0.8308
Epoch 4/15
782/782 - 237s - loss: 0.1069 - accuracy: 0.9608 - val_loss: 0.5495 - val_accuracy: 0.8218
Epoch 5/15
782/782 - 238s - loss: 0.0732 - accuracy: 0.9742 - val_loss: 0.6398 - val_accuracy: 0.8218
Epoch 6/15
782/782 - 243s - loss: 0.0492 - accuracy: 0.9832 - val_loss: 0.7561 - val_accuracy: 0.8187
Epoch 7/15
782/782 - 237s - loss: 0.0398 - accuracy: 0.9866 - val_loss: 0.8226 - val_accuracy: 0.8156
Epoch 8/15
782/782 - 235s - loss: 0.0312 - accuracy: 0.9904 - val_loss: 0.8085 - val_accuracy: 0.8140
Epoch 9/15
782/782 - 233s - loss: 0.0251 - accuracy: 0.9925 - val_loss: 0.8965 - val_accuracy: 0.8088
Epoch 10/15
782/782 - 236s - loss: 0.0249 - accuracy: 0.9922 - val_loss: 0.8134 - 

<tensorflow.python.keras.callbacks.History at 0x7f4ebf1085b0>

80% eh? Not too bad, considering we limited ourselves to just 80 words of each review.

Note that the validation accuracy never really improved after the first epoch, while the validation loss climbed from about 0.38 to over 0.8 even as the training loss kept falling; we're overfitting. This is a case where early stopping would have been beneficial.
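Here is a sketch of the idea behind early stopping, replayed against the validation losses from the log above. The patience of 2 is an assumption picked for this example, and `early_stop_epoch` is a hypothetical helper; in Keras itself you would normally pass a `tf.keras.callbacks.EarlyStopping` callback (for example with `monitor='val_loss'` and `restore_best_weights=True`) to `model.fit` rather than hand-rolling this.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best val_loss)."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0   # improvement: reset the counter
        else:
            wait += 1                                 # no improvement this epoch
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss values copied from the training log above (epochs 1 to 10).
val_losses = [0.3918, 0.3795, 0.4371, 0.5495, 0.6398,
              0.7561, 0.8226, 0.8085, 0.8965, 0.8134]
print(early_stop_epoch(val_losses, patience=2))       # (4, 2)
```

With these numbers training would have stopped after epoch 4 and kept the epoch 2 weights, saving most of the roughly four minutes per epoch spent on the remaining runs.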

But again - stop and think about what we just made here! A neural network that can "read" reviews and deduce whether the author liked the movie or not based on that text. And it takes the context of each word and its position in the review into account - and setting up the model itself was just a few lines of code! It's pretty incredible what you can do with Keras.