On the IMDB movie reviews dataset.

The IMDB dataset contains 50,000 movie reviews, each labeled with one of two categories: negative or positive. This makes it a typical binary sequence classification problem.

### How to represent the words

Movie reviews are sequences of words, so we first need to encode them as numbers.

We map each movie review to a sequence of word embeddings. A word embedding is a vector that represents several features of a word. In Word2Vec, the relative positions of the vectors capture relationships between words. One simple way to understand this is to look at the following image:

<img src='https://cdn-images-1.medium.com/max/1000/1*Bjtqi5sgc-pE8bB80IAkeA.jpeg'/>
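
To make the idea concrete, here is a minimal sketch with hand-made 3-dimensional vectors. The values are hypothetical and chosen only for illustration (they are not real Word2Vec output); cosine similarity is one common way to measure how close two words are in the embedding space.

```python
import numpy as np

# Hypothetical toy embeddings, picked by hand for illustration only.
toy_embeddings = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "awful": np.array([-0.7, 0.1, 0.2]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_good_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_good_awful = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])

# Words with similar meaning should end up closer together.
print(sim_good_great > sim_good_awful)  # True
```

In a trained model the vectors have hundreds of dimensions and are learned from data rather than written by hand.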

In [1]:
# CNN for sequence classification in the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, Convolution1D, Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.callbacks import TensorBoard

Using TensorFlow backend.


In [2]:
# Load the dataset with Keras, keeping only the top_words most frequent words
top_words = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
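
Each loaded review is a list of integers rather than text. The sketch below shows how such a list maps back to words, assuming the default conventions of `imdb.load_data` (0 for padding, 1 for the start of a review, 2 for out-of-vocabulary words, and real word ranks shifted by 3). The vocabulary `toy_word_index` is hypothetical; the real one can be obtained with `imdb.get_word_index()`.

```python
# Default conventions of keras.datasets.imdb.load_data:
# 0 = padding, 1 = start of review, 2 = out-of-vocabulary word,
# and real word ranks are shifted by index_from=3.
INDEX_FROM = 3
SPECIAL = {0: "<pad>", 1: "<start>", 2: "<oov>"}

# Hypothetical miniature vocabulary (rank 1 = most frequent word).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

def decode(encoded_review, word_index):
    """Turn a list of integer indices back into a readable string."""
    reverse = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    return " ".join(SPECIAL.get(i, reverse.get(i, "<oov>")) for i in encoded_review)

decoded = decode([1, 4, 5, 6, 2, 7], toy_word_index)
print(decoded)  # <start> the movie was <oov> great
```

With `num_words=top_words`, any word ranked outside the most frequent `top_words` is replaced by the out-of-vocabulary index.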

Keras returns each review as a sequence of integer word indices; the embedding layer of the model later turns these indices into vectors. Before that, we need to pad the sequences so that they all have the same length, i.e. we add zeros to the shorter sequences and truncate the longer ones.

In [4]:
# Pad the sequence to the same length
max_review_length = 1600
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
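
As a rough illustration of what happens to each review, here is a small NumPy sketch that mimics the default behavior of `pad_sequences` (zeros are added at the front, and sequences that are too long lose their first items). The helper name `pad_or_truncate` is made up for this example.

```python
import numpy as np

def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the default 'pre' padding and 'pre' truncation of pad_sequences."""
    seq = list(seq)[-maxlen:]                      # keep only the last maxlen items
    return np.array([value] * (maxlen - len(seq)) + seq)

short_seq = pad_or_truncate([5, 8, 2], maxlen=5)
long_seq = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short_seq)  # [0 0 5 8 2]
print(long_seq)   # [3 4 5 6 7]
```

Padding at the front keeps the actual words at the end of the sequence; `pad_sequences` accepts `padding='post'` and `truncating='post'` if the opposite is preferred.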

### The model
Here we use a convolutional neural network with 3 convolutional layers followed by 2 dense layers.

Why convolutional? Because it works. Convolutional layers are very good at extracting higher-level features from images and, quite remarkably, they also work well on most 2D problems. A review embedded as a (words × embedding dimensions) matrix is such a 2D input, which we process with 1D convolutions along the word axis. Another big reason that should convince you is the training time: on this problem, CNNs train 50% to 60% faster than LSTMs.
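
To show what a 1D convolution over a sequence of word vectors computes, here is a minimal NumPy sketch. It is not how Keras implements the layer, just the same arithmetic written out: a window of 3 consecutive word vectors is multiplied by each filter and summed, with zero padding so the output keeps the same number of steps. The function name `conv1d_same` and the toy sizes are assumptions made for this example.

```python
import numpy as np

def conv1d_same(x, kernels, bias):
    """x: (steps, in_channels); kernels: (width, in_channels, filters)."""
    width = kernels.shape[0]
    left = (width - 1) // 2
    right = width - 1 - left
    padded = np.pad(x, ((left, right), (0, 0)))    # zero padding along the word axis
    out = np.empty((x.shape[0], kernels.shape[2]))
    for t in range(x.shape[0]):
        window = padded[t:t + width]               # (width, in_channels)
        # Sum over window positions and input channels, one value per filter.
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))           # 10 words, 4-dimensional toy embeddings
kernels = rng.normal(size=(3, 4, 2))   # window of 3 words, 2 filters
features = conv1d_same(x, kernels, np.zeros(2))
print(features.shape)  # (10, 2)
```

Each output row summarizes a word together with its immediate neighbors, which is how the network can pick up short phrases.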

In [5]:
# Using embedding from Keras
embedding_vector_length = 300
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))

# Convolutional model (3x conv, flatten, 2x dense)
model.add(Convolution1D(64, 3, padding='same'))
model.add(Convolution1D(32, 3, padding='same'))
model.add(Convolution1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(180, activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
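
As a sanity check on the size of this architecture, the parameter counts can be worked out by hand. The sketch below uses the standard formulas for embedding, 1D convolution and dense layers (weights plus biases) with the values from the cells above; the numbers can be compared against `model.summary()`.

```python
top_words, embed_dim, steps = 10000, 300, 1600

def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel_size * in_channels * filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

params = {
    "embedding": top_words * embed_dim,
    "conv1d_64": conv1d_params(3, embed_dim, 64),
    "conv1d_32": conv1d_params(3, 64, 32),
    "conv1d_16": conv1d_params(3, 32, 16),
    "dense_180": dense_params(steps * 16, 180),  # Flatten yields 1600 * 16 = 25600 inputs
    "dense_1": dense_params(180, 1),
}
total = sum(params.values())

for name, count in params.items():
    print(f"{name:>10}: {count:,}")
print(f"{'total':>10}: {total:,}")  # total: 7,673,753
```

Almost all of the weights sit in the embedding and in the first dense layer after `Flatten`; the three convolutional layers are comparatively tiny.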

In [6]:
# Log to tensorboard
tensorBoardCallback = TensorBoard(log_dir='./logs', write_graph=True)

In [7]:
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
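
For reference, binary cross-entropy can be written out in a few lines of NumPy. This is a sketch of the formula rather than the Keras implementation; the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0, 1.0])
mostly_right = binary_crossentropy(y_true, np.array([0.9, 0.1, 0.8]))
mostly_wrong = binary_crossentropy(y_true, np.array([0.1, 0.9, 0.2]))

# Confident correct predictions give a much lower loss than confident mistakes.
print(mostly_right < mostly_wrong)  # True
```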

In [8]:
# Train for one epoch, logging to TensorBoard
model.fit(X_train, y_train, epochs=1, callbacks=[tensorBoardCallback], batch_size=64)

Epoch 1/1


<keras.callbacks.History at 0x244f0fbe780>

In [9]:
# Evaluation on the test set
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.86%
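
The accuracy reported above is the fraction of reviews whose predicted class matches the label. As a sketch of that computation, assuming the usual 0.5 threshold on the sigmoid output and using made-up probabilities:

```python
import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction equals the label."""
    y_hat = (y_prob > threshold).astype(int)
    return float(np.mean(y_hat == y_true))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.92, 0.35, 0.48, 0.71])  # made-up sigmoid outputs

acc = binary_accuracy(y_true, y_prob)
print("Accuracy: %.2f%%" % (acc * 100))  # Accuracy: 75.00%
```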
