# Analyzing IMDB Data in Keras

In [1]:
# Imports
import numpy as np
import keras
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(84)

Using TensorFlow backend.


## 1. Loading the data
The IMDB dataset ships with Keras, so a single call to `imdb.load_data` gets us the training and testing data. The `num_words` parameter sets how many of the most frequent words to keep; rarer words are replaced by an out-of-vocabulary index. We've set it to 1000, but feel free to experiment.

In [2]:
# Loading the data (it's preloaded in Keras)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=1000)

print(x_train.shape)
print(x_test.shape)

(25000,)
(25000,)
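
To make the effect of `num_words` concrete, here is a minimal plain-Python sketch of the capping step. The helper name `cap_vocabulary` and the sample indices are made up for illustration; the out-of-vocabulary index 2 is the default used by `imdb.load_data`.

```python
OOV_CHAR = 2  # default out-of-vocabulary index in imdb.load_data


def cap_vocabulary(sequence, num_words, oov_char=OOV_CHAR):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [idx if idx < num_words else oov_char for idx in sequence]


# Made-up review: 2071 and 1000 fall outside a 1000-word vocabulary.
raw_review = [1, 14, 2071, 8, 999, 1000, 4]
capped = cap_vocabulary(raw_review, num_words=1000)
print(capped)  # [1, 14, 2, 8, 999, 2, 4]
```

A smaller `num_words` gives shorter input vectors later on, at the cost of mapping more words to the same out-of-vocabulary index.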


## 2. Examining the data
Notice that the data has already been pre-processed: every word has been replaced by an integer index, and each review arrives as a list of those indices in the order the words appear. Lower indices correspond to more frequent words, and a few small indices are reserved. With the default `load_data` settings, 1 marks the start of a review and 2 stands for any word outside the 1000-word vocabulary, which is why 2 appears so often below.

The labels come as a vector of 1s and 0s, where 1 means the review is positive and 0 means it is negative.

In [3]:
print(x_train[2])
print(y_train[2])

[1, 14, 47, 8, 30, 31, 7, 4, 249, 108, 7, 4, 2, 54, 61, 369, 13, 71, 149, 14, 22, 112, 4, 2, 311, 12, 16, 2, 33, 75, 43, 2, 296, 4, 86, 320, 35, 534, 19, 263, 2, 2, 4, 2, 33, 89, 78, 12, 66, 16, 4, 360, 7, 4, 58, 316, 334, 11, 4, 2, 43, 645, 662, 8, 257, 85, 2, 42, 2, 2, 83, 68, 2, 15, 36, 165, 2, 278, 36, 69, 2, 780, 8, 106, 14, 2, 2, 18, 6, 22, 12, 215, 28, 610, 40, 6, 87, 326, 23, 2, 21, 23, 22, 12, 272, 40, 57, 31, 11, 4, 22, 47, 6, 2, 51, 9, 170, 23, 595, 116, 595, 2, 13, 191, 79, 638, 89, 2, 14, 9, 8, 106, 607, 624, 35, 534, 6, 227, 7, 129, 113]
0
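
To read a review back as text, the indices have to be mapped to words again. The sketch below assumes the default `load_data` settings (0 for padding, 1 for start, 2 for out-of-vocabulary, and real words shifted up by 3). The tiny `toy_word_index` dictionary is hypothetical; in practice the word-to-rank mapping would come from `imdb.get_word_index()`.

```python
PAD, START, OOV = 0, 1, 2   # reserved indices under the default settings
INDEX_FROM = 3              # real words are shifted up by this offset

# Hypothetical word -> frequency-rank mapping (rank 1 = most frequent).
toy_word_index = {"the": 1, "and": 2, "movie": 3, "great": 4}


def decode(sequence, word_index):
    """Turn a list of indices back into a space-separated string."""
    inverse = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    special = {PAD: "<pad>", START: "<start>", OOV: "<unk>"}
    return " ".join(special.get(i, inverse.get(i, "<unk>")) for i in sequence)


decoded = decode([1, 4, 6, 2, 7], toy_word_index)
print(decoded)  # <start> the movie <unk> great
```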


## 3. One-hot encoding the input
Here, we'll turn each input sequence into a (0,1)-vector of length 1000. For example, if the pre-processed sequence contains the number 14, then the entry at index 14 of the processed vector will be 1, no matter how many times 14 occurs.

In [4]:
# One-hot encoding the input into binary vectors, each of length 1000
tokenizer = Tokenizer(num_words=1000)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print(x_train[0])

[ 0.  1.  1.  0.  1.  1.  1.  1.  1.  1.  0.  0.  1.  1.  1.  1.  1.  1.
  1.  1.  0.  1.  1.  0.  0.  1.  1.  0.  1.  0.  1.  0.  1.  1.  0.  1.
  1.  0.  1.  1.  0.  0.  0.  1.  0.  0.  1.  0.  1.  0.  1.  1.  1.  0.
  0.  0.  1.  0.  0.  0.  0.  0.  1.  0.  0.  1.  1.  0.  0.  0.  0.  1.
  0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.
  0.  0.  1.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  1.  1.  0.  1.  1.
  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  1.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  1.  0.  0.
  1.  0.  0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.
  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0
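
As a rough illustration of what `sequences_to_matrix` does in `binary` mode, here is a plain-Python sketch. The function name `sequences_to_binary_matrix` is made up, and it returns nested lists rather than the NumPy array Keras produces.

```python
def sequences_to_binary_matrix(sequences, num_words):
    """Build one row per sequence with a 1.0 at every index that occurs."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, sequence in zip(matrix, sequences):
        for idx in sequence:
            if 0 <= idx < num_words:   # indices outside the vocabulary are skipped
                row[idx] = 1.0         # repeated indices still give a single 1.0
    return matrix


toy_sequences = [[1, 3, 3, 5], [2, 9]]
encoded = sequences_to_binary_matrix(toy_sequences, num_words=6)
print(encoded[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(encoded[1])  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

Word order and word counts are lost in this representation: the model only sees which vocabulary entries occur in a review.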

And we'll also one-hot encode the output.

In [5]:
# One-hot encoding the output
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_test.shape)

(25000, 2)
(25000, 2)
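
The label encoding can be sketched the same way. The helper `to_one_hot` below is a hypothetical plain-Python stand-in for `keras.utils.to_categorical`, shown only to make the resulting `(25000, 2)` shape easier to picture.

```python
def to_one_hot(labels, num_classes):
    """Map each integer label to a row with a single 1.0 at that label's index."""
    return [[1.0 if c == label else 0.0 for c in range(num_classes)]
            for label in labels]


one_hot = to_one_hot([0, 1, 1], num_classes=2)
print(one_hot)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

A negative review (label 0) becomes `[1.0, 0.0]` and a positive review (label 1) becomes `[0.0, 1.0]`, which matches the two-unit softmax output used in the next section.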


## 4. Building the model architecture
Build a model here using `Sequential`. Feel free to experiment with different layers and sizes! Also, experiment with adding dropout to reduce overfitting.

In [17]:
from keras.layers import Dense, Dropout, Activation, PReLU

model = Sequential()
model.add(Dense(128, input_shape=(1000,)))
model.add(PReLU())
model.add(Dropout(0.3))
#model.add(Dense(128))
#model.add(PReLU())
#model.add(Dropout(0.25))
model.add(Dense(2))
model.add(Activation('softmax'))

# Compile the model using a loss function and an optimizer.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 128)               128128    
_________________________________________________________________
p_re_lu_8 (PReLU)            (None, 128)               128       
_________________________________________________________________
dropout_8 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_13 (Dense)             (None, 2)                 258       
_________________________________________________________________
activation_5 (Activation)    (None, 2)                 0         
Total params: 128,514.0
Trainable params: 128,514.0
Non-trainable params: 0.0
_________________________________________________________________
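
The parameter counts in the summary can be checked by hand. The short sketch below assumes a dense layer has one weight per input-unit pair plus one bias per unit, and that `PReLU` with its default settings learns one slope per unit; the helper names are made up.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


def prelu_params(n_units):
    """One learnable slope per unit (assuming no shared axes)."""
    return n_units


hidden = dense_params(1000, 128)   # first Dense layer
prelu = prelu_params(128)          # PReLU activation
output = dense_params(128, 2)      # output Dense layer
total = hidden + prelu + output

print(hidden, prelu, output, total)  # 128128 128 258 128514
```

Dropout and the softmax activation add no parameters, so the total matches the 128,514 reported above.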


## 5. Training the model
Train the model here. Experiment with different values of `batch_size` and different numbers of epochs!

In [18]:
# Train the model. Feel free to experiment with different batch sizes and numbers of epochs.
model.fit(x_train, y_train, epochs=20, batch_size=200, verbose=2)

Epoch 1/20
1s - loss: 0.4291 - acc: 0.7976
Epoch 2/20
1s - loss: 0.3204 - acc: 0.8656
Epoch 3/20
0s - loss: 0.2936 - acc: 0.8770
Epoch 4/20
1s - loss: 0.2668 - acc: 0.8896
Epoch 5/20
1s - loss: 0.2386 - acc: 0.9048
Epoch 6/20
1s - loss: 0.2061 - acc: 0.9215
Epoch 7/20
0s - loss: 0.1691 - acc: 0.9404
Epoch 8/20
0s - loss: 0.1360 - acc: 0.9567
Epoch 9/20
1s - loss: 0.1097 - acc: 0.9679
Epoch 10/20
1s - loss: 0.0868 - acc: 0.9757
Epoch 11/20
1s - loss: 0.0702 - acc: 0.9817
Epoch 12/20
1s - loss: 0.0613 - acc: 0.9840
Epoch 13/20
1s - loss: 0.0509 - acc: 0.9877
Epoch 14/20
1s - loss: 0.0458 - acc: 0.9884
Epoch 15/20
0s - loss: 0.0384 - acc: 0.9908
Epoch 16/20
1s - loss: 0.0344 - acc: 0.9920
Epoch 17/20
1s - loss: 0.0317 - acc: 0.9926
Epoch 18/20
1s - loss: 0.0304 - acc: 0.9922
Epoch 19/20
1s - loss: 0.0276 - acc: 0.9930
Epoch 20/20
1s - loss: 0.0258 - acc: 0.9933


<keras.callbacks.History at 0x12151af60>
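
The `loss` values logged above are categorical cross-entropy computed on the softmax output. As a minimal sketch of that calculation for a single example (plain Python, with made-up logits, and not the exact Keras implementation):

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


probs = softmax([2.0, 0.0])                     # made-up scores for (negative, positive)
loss = categorical_crossentropy([1.0, 0.0], probs)
print(probs, loss)
```

The loss approaches 0 as the probability assigned to the correct class approaches 1, which is the trend visible across the 20 epochs.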

## 6. Evaluating the model
This gives the accuracy of the model on the testing set. `model.evaluate` returns the loss followed by the metrics passed to `compile`, so `score[1]` is the accuracy. Can you get something over 85%?

In [19]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: ", score[1])

Accuracy:  0.8518
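
For one-hot labels, accuracy is the fraction of examples where the highest predicted probability falls on the true class. The sketch below shows that calculation in plain Python on made-up predictions; the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(y_true_one_hot, y_pred_probs):
    """Fraction of rows where the predicted class equals the true class."""
    correct = sum(argmax(t) == argmax(p)
                  for t, p in zip(y_true_one_hot, y_pred_probs))
    return correct / len(y_true_one_hot)


y_true = [[1, 0], [0, 1], [0, 1], [1, 0]]
y_pred = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]  # third prediction is wrong
toy_accuracy = accuracy(y_true, y_pred)
print(toy_accuracy)  # 0.75
```

The gap between the final training accuracy (0.9933) and the test accuracy (0.8518) suggests that the model is overfitting, so fewer epochs or more dropout may be worth trying.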
