In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.3.1'

In [None]:
from keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

In [None]:
train_data[0]

In [None]:
train_labels[0]

In [None]:
max([max(sequence) for sequence in train_data])

In [None]:
# decoding a review back to English
# word_index is a dictionary mapping words to an integer index
word_index = imdb.get_word_index()
# We reverse it, mapping integer indices to words
reverse_word_index = {value: key for (key, value) in word_index.items()}
# We decode the review; note that our indices were offset by 3
# because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])

In [None]:
decoded_review

## Preparing the data

In [None]:
# We cannot feed lists of integers into a neural network. We have to turn our lists into tensors. There are two ways we could do that:

# * We could pad our lists so that they all have the same length, and turn them into an integer tensor of shape `(samples, word_indices)`, 
# then use as the first layer of our network a layer capable of handling such integer tensors (the `Embedding` layer, which we will cover in 
# detail later in the book).
# * We could one-hot-encode our lists to turn them into vectors of 0s and 1s. Concretely, this would mean, for instance, turning the sequence 
# `[3, 5]` into a 10,000-dimensional vector that would be all zeros except for indices 3 and 5, which would be ones. Then we could use as 
# the first layer of our network a `Dense` layer, capable of handling floating-point vector data.

# We will go with the latter solution. Let's vectorize our data, which we will do manually for maximum clarity:
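To make the one-hot idea concrete before applying it to the full dataset, here is a minimal sketch that encodes the sequence `[3, 5]` into a 10-dimensional vector instead of a 10,000-dimensional one. The helper name `multi_hot` and the small dimension are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def multi_hot(sequence, dimension):
    """Return a vector of zeros with ones at every index listed in `sequence`."""
    vector = np.zeros(dimension)
    # Integer-list indexing sets all listed positions in one assignment;
    # repeated indices simply set the same position to 1 again.
    vector[sequence] = 1.0
    return vector

encoded = multi_hot([3, 5], dimension=10)
print(encoded)  # [0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]
```

Note that this encoding records only which words occur, not how often or in what order.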

In [None]:
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)

In [None]:
x_train[0]

In [None]:
# Our vectorized labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

## Building our network

In [None]:
# Our input data is simply vectors, and our labels are scalars (1s and 0s): this is the easiest setup you will ever encounter. A type of 
# network that performs well on such a problem would be a simple stack of fully-connected (`Dense`) layers with `relu` activations: 
# `Dense(16, activation='relu')`.

# The argument being passed to each `Dense` layer (16) is the number of "hidden units" of the layer. What's a hidden unit? It's a dimension 
# in the representation space of the layer. You may remember from the previous chapter that each such `Dense` layer with a `relu` activation implements 
# the following chain of tensor operations:

# `output = relu(dot(input, W) + b)`

# Having 16 hidden units means that the weight matrix `W` will have shape `(input_dimension, 16)`, i.e. the dot product with `W` will project the 
# input data onto a 16-dimensional representation space (and then we would add the bias vector `b` and apply the `relu` operation). You can 
# intuitively understand the dimensionality of your representation space as "how much freedom you are allowing the network to have when 
# learning internal representations". Having more hidden units (a higher-dimensional representation space) allows your network to learn more 
# complex representations, but it makes your network more computationally expensive and may lead to learning unwanted patterns (patterns that 
# will improve performance on the training data but not on the test data).
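The shape bookkeeping described above can be checked with a small NumPy sketch. It is not how Keras implements `Dense`; the random weights, the batch of 4 samples, and the function name `naive_dense_relu` are assumptions chosen for illustration. The input is placed on the left of the dot product so that a batch of shape `(samples, input_dimension)` lines up with a weight matrix of shape `(input_dimension, 16)`.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dimension, hidden_units = 10000, 16

# Illustrative stand-ins: a batch of 4 samples and randomly initialized weights.
x = rng.random((4, input_dimension))
W = rng.normal(scale=0.01, size=(input_dimension, hidden_units))
b = np.zeros(hidden_units)

def naive_dense_relu(x, W, b):
    """Project x onto the hidden space, add the bias, then zero out negatives."""
    return np.maximum(np.dot(x, W) + b, 0.0)

output = naive_dense_relu(x, W, b)
print(output.shape)  # (4, 16)
```

Each of the 4 samples ends up as a 16-dimensional vector, which is the "representation space" that the number of hidden units controls.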

# There are two key architecture decisions to be made about such a stack of dense layers:

# * How many layers to use.
# * How many "hidden units" to choose for each layer.

# In the next chapter, you will learn formal principles to guide you in making these choices. 
# For the time being, you will have to trust us with the following architecture choice: 
# two intermediate layers with 16 hidden units each, 
# and a third layer which will output the scalar prediction regarding the sentiment of the current review. 
# The intermediate layers will use `relu` as their "activation function", 
# and the final layer will use a sigmoid activation so as to output a probability 
# (a score between 0 and 1, indicating how likely the sample is to have the target "1", i.e. how likely the review is to be positive). 
# A `relu` (rectified linear unit) is a function meant to zero out negative values, 
# while a sigmoid "squashes" arbitrary values into the `[0, 1]` interval, thus outputting something that can be interpreted as a probability.
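As a quick illustration of the two activation functions just described, the following sketch defines both in NumPy and applies them to a handful of sample values. These are the textbook definitions written out by hand for clarity, not the Keras implementations.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative values become 0, others pass through."""
    return np.maximum(x, 0.0)

def sigmoid(x):
    """Logistic sigmoid: maps any real value to a number between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

values = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(values))     # negatives are replaced by 0
print(sigmoid(values))  # every entry lies between 0 and 1; sigmoid(0) is 0.5
```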

In [None]:
# And here's the Keras implementation

In [None]:
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [None]:
# Lastly, we need to pick a loss function and an optimizer. Since we are facing a binary classification problem and the output of our network is a probability (we end our network with a single-unit layer with a sigmoid activation), it is best to use the binary_crossentropy loss. It isn't the only viable choice: you could use, for instance, mean_squared_error. But crossentropy is usually the best choice when you are dealing with models that output probabilities. Crossentropy is a quantity from the field of information theory that measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and our predictions.
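To give a feel for what this loss measures, here is a hand-written NumPy sketch of binary crossentropy averaged over a batch. It follows the standard formula; the clipping constant `eps` and the example predictions are assumptions for illustration, and the details of the Keras implementation may differ.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    # Clip predictions away from exactly 0 and 1 so that log() stays finite.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(per_sample))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])  # made-up predictions near the labels
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])  # made-up predictions far from the labels

loss_right = binary_crossentropy(y_true, mostly_right)
loss_wrong = binary_crossentropy(y_true, mostly_wrong)
print(loss_right, loss_wrong)  # the second value is larger
```

Predictions close to the true labels give a small loss, while confident wrong predictions are penalized heavily.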

# Here's the step where we configure our model with the rmsprop optimizer and the binary_crossentropy loss function. Note that we will also monitor accuracy during training.


In [None]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

We are passing our optimizer, loss function and metrics as strings, which is possible because `rmsprop`, `binary_crossentropy` and 
`accuracy` are packaged as part of Keras. Sometimes you may want to configure the parameters of your optimizer, or pass a custom loss 
function or metric function. The former can be done by passing an optimizer class instance as the `optimizer` argument:

In [None]:
from keras import optimizers

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

The latter can be done by passing function objects as the `loss` or `metrics` arguments:

In [None]:
from keras import losses
from keras import metrics

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])

## Validating our approach

To monitor, during training, the accuracy of the model on data it has never seen before, we will create a "validation set" by 
setting apart 10,000 samples from the original training data:

In [None]:
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

In [None]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))