In [84]:
import keras
import pickle
import numpy as np

In [106]:
from keras.models import Sequential, Model

from keras.layers import Embedding, Conv1D, MaxPooling1D, Dense, Flatten, Input

from keras.optimizers import Adam

In [86]:
from keras.datasets import imdb

Load the IMDB dataset

In [87]:
word_to_idx = imdb.get_word_index()

In [88]:
idx_to_word = {v:k for k,v in word_to_idx.items()}

In [89]:
path = keras.utils.data_utils.get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
with open(path, 'rb') as f:
    (x_train, labels_train), (x_test, labels_test) = pickle.load(f)

Let's have a look at one of the reviews. Each review is an array of word ids, so we need to convert the ids back to the words themselves.

In [90]:
' '.join([idx_to_word[idx] for idx in x_train[0]])

"bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell high's satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers' pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i'm here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn't"

Let's limit our vocabulary. First note that the ids are in order of frequency, so the higher the id, the rarer the word. If we map every id at or above the vocab limit to one shared id, that id becomes a stand-in for 'rare word', which we hypothesize won't have much effect on the sentiment of the review.

In [91]:
vocab_size = 5000

In [98]:
trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in r]) for r in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in r]) for r in x_test]
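To make the capping rule concrete, here is a minimal pure-Python sketch on a made-up review. The helper `cap_ids` and the toy ids are hypothetical, used only for illustration; the rule itself is the one applied in the cell above.

```python
def cap_ids(review, vocab_size):
    """Map every id >= vocab_size - 1 to the shared 'rare word' id vocab_size - 1."""
    rare_id = vocab_size - 1
    return [i if i < rare_id else rare_id for i in review]

# Toy example (assumed values): a tiny vocabulary of 10 ids, so id 9 is the stand-in.
toy_review = [3, 9, 250, 1, 4999]
capped = cap_ids(toy_review, vocab_size=10)
print(capped)  # [3, 9, 9, 1, 9]
```

With `vocab_size = 5000`, the largest id that can appear after capping is 4999, which fits an embedding table with 5000 rows.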

Our reviews are of different lengths:

In [93]:
lens = np.array([len(x) for x in trn])
lens.max(), lens.min(), lens.mean()

(2493, 10, 237.71364)

In [99]:
sequence_len = 500
trn = keras.preprocessing.sequence.pad_sequences(trn, sequence_len, value=0)
test = keras.preprocessing.sequence.pad_sequences(test, sequence_len, value=0)
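The cell above forces every review to exactly 500 ids. By default `pad_sequences` both pads and truncates at the start of the sequence (`padding='pre'`, `truncating='pre'`), so short reviews get leading zeros and long reviews keep their last 500 ids. A minimal pure-Python sketch of that default behaviour, where `pad_pre` is a hypothetical helper and not part of Keras:

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults: keep the last `maxlen` ids, left-pad with `value`."""
    kept = seq[-maxlen:]                          # truncating='pre' drops ids from the start
    return [value] * (maxlen - len(kept)) + kept  # padding='pre' fills on the left

print(pad_pre([7, 8, 9], 5))           # [0, 0, 7, 8, 9]
print(pad_pre([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```

Padding with 0 is safe here on the assumption that no real word uses id 0 in the word index.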

In [95]:
model = Sequential([
    Embedding(vocab_size, 32, input_length=sequence_len),
    Conv1D(64, 5, padding='same', activation='relu'),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dense(1, activation='sigmoid')
])

In [96]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [100]:
model.fit(trn, labels_train, validation_data=(test, labels_test), epochs=4, batch_size=128)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x149451b00>

It's overfitting, but that's to be expected as we have no regularization at all. We could easily add some dropout. Instead, just as practice with the functional API, I want to quickly try building the conv block as a group of three parallel convolutions with kernel sizes 3, 4 and 5.

In [128]:
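# Three parallel convolutions (kernel sizes 3, 4, 5) over the embedded sequence
inp = Input(shape=(sequence_len, 32))
convs = [Conv1D(64, size, padding='same', activation='relu')(inp) for size in range(3,6)]
# Concatenate along the channel axis: 3 x 64 = 192 channels
output = keras.layers.Concatenate()(convs)
ConvBlock = Model(inputs=inp, outputs=output)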



In [117]:
model2 = Sequential([
    Embedding(vocab_size, 32, input_length=sequence_len),
    ConvBlock,
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dense(1, activation='sigmoid')
])

In [118]:
model2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
model_1 (Model)              (None, 500, 192)          24768     
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 250, 192)          0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 48000)             0         
_________________________________________________________________
dense_5 (Dense)              (None, 100)               4800100   
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 101       
Total params: 4,984,969
Trainable params: 4,984,969
Non-trainable params: 0
_________________________________________________________________
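As a sanity check, the shapes and parameter counts in the summary can be reproduced by hand. The sketch below is plain arithmetic, not Keras code; the variable names are mine, and it assumes the default `MaxPooling1D` pool size of 2.

```python
vocab_size, embed_dim, seq_len = 5000, 32, 500
filters, kernel_sizes = 64, (3, 4, 5)

embedding = vocab_size * embed_dim
# Each Conv1D has kernel_size * in_channels * filters weights plus one bias per filter.
conv_block = sum(k * embed_dim * filters + filters for k in kernel_sizes)
channels = filters * len(kernel_sizes)   # the three feature maps are concatenated
pooled_len = seq_len // 2                # pooling halves the sequence length
flat = pooled_len * channels
dense_hidden = flat * 100 + 100          # weights + biases of Dense(100)
dense_out = 100 * 1 + 1                  # weights + bias of Dense(1)
total = embedding + conv_block + dense_hidden + dense_out

print(embedding, conv_block, flat, dense_hidden, dense_out, total)
# 160000 24768 48000 4800100 101 4984969
```

Almost all of the parameters sit in the first dense layer, because flattening 250 x 192 activations gives it 48000 inputs.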


In [119]:
model2.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [120]:
model2.fit(trn, labels_train, validation_data=(test, labels_test), epochs=4, batch_size=128)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4

KeyboardInterrupt: 