# Deep Learning Tutorial 2: Training

In [17]:
from keras.models import Model
from keras.layers import *
from keras.utils.visualize_util import plot

import numpy as np

from callbacks import AUCHistory

## Preparation

### Loading our data

The data has been saved in a file named *data.npz*. It can be loaded using *np.load* and then behaves similarly to a Python dictionary, where the keys *records* and *labels* correspond to numpy arrays holding our input data and our labels, respectively.

In [18]:
saved = np.load("data.npz")
data = saved["records"]
labels = saved["labels"]
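
To see the dictionary-like behaviour of an `.npz` archive in isolation, here is a minimal sketch that writes a tiny synthetic archive and reads it back. The file name `toy.npz` and the random contents are made up for illustration; only the key names `records` and `labels` come from the tutorial.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for the tutorial data: 4 drives, 3 records, 2 features.
rng = np.random.RandomState(0)
toy_records = rng.rand(4, 3, 2)
toy_labels = np.array([0, 1, 0, 0])

# np.savez stores each keyword argument as a named array in one archive.
path = os.path.join(tempfile.mkdtemp(), "toy.npz")
np.savez(path, records=toy_records, labels=toy_labels)

# np.load returns a dictionary-like NpzFile; arrays are read on key access.
with np.load(path) as toy_saved:
    keys = sorted(toy_saved.files)
    loaded_records = toy_saved["records"]
    loaded_labels = toy_saved["labels"]

print(keys)
print(loaded_records.shape, loaded_labels.shape)
```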

The input data is a 3-dimensional array, where the first dimension corresponds to the number of instances, the second dimension to the number of time steps, and the third dimension to the number of features.

In [19]:
data.shape

(74574, 70, 15)

More specifically, 74574 is the number of drives in our data, 70 is the number of records per drive, and 15 is the number of features per record.

In [20]:
n_drives = data.shape[0]
n_records = data.shape[1]
n_features = data.shape[2]

First, we define the input layer, which just takes in our data. It does not contain any neurons; it only defines the shape of our input. Since we use the functional API, this also means that all matrix shapes in the following layers will be inferred automatically.

In [21]:
input = Input(shape=(n_records, n_features), name="inputs")

Next, we add a [Masking](https://keras.io/layers/core/#masking) layer. This layer masks an input sequence by using a mask value to identify timesteps to be skipped. This is needed because we padded our sequences with vectors of zeroes whenever a drive did not have enough observations.

In [22]:
x = Masking()(input)
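
The masking rule can be illustrated with plain NumPy. This is only a sketch of the rule described in the Keras documentation (a timestep is skipped when all of its features equal the mask value, which is 0 by default), not the Keras implementation itself. The toy batch is made up for illustration.

```python
import numpy as np

# Two toy sequences with 4 time steps and 2 features each.
# The second sequence has only 2 real observations, followed by zero padding.
batch = np.array([
    [[0.5, 1.0], [0.2, 0.0], [0.1, 0.3], [0.9, 0.4]],
    [[0.7, 0.1], [0.0, 0.6], [0.0, 0.0], [0.0, 0.0]],
])

mask_value = 0.0

# A time step is kept if at least one feature differs from mask_value.
keep = np.any(batch != mask_value, axis=-1)

print(keep)
print(keep.sum(axis=1))  # number of unmasked time steps per sequence
```

Note that a record such as `[0.2, 0.0]` is kept: a single zero feature does not mask the timestep, only an all-zero vector does.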

Now, the fun part starts. We will add an LSTM layer that summarizes each drive by performing the same computation on a vector of size n_features at each of the n_records time steps.

Remember the unfolded-in-time computation graph of an RNN:

![image](http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/09/rnn.jpg)

where $x$ are the observations for a particular drive, e.g. $x_1$ is the first observation and $x_2$ is the second observation. In our case, we are only interested in the last output $o_n$ where $n = \text{n_records}$.

The state of the LSTM is reset after every drive. The output of this LSTM will be a vector of size 5. In other words, the LSTM has 5 neurons in the output layer.

In [23]:
x = LSTM(5)(x)
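
To make "the same computation at every time step, keep only the last output" concrete, here is a minimal NumPy sketch of a textbook LSTM forward pass for a single drive. It is not the Keras implementation: the gate ordering, the random weights, and the plain sigmoid gate activation are assumptions made for illustration, and masking is ignored.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_output(sequence, W, U, b):
    """Run one sequence of shape (n_steps, n_features) through an LSTM
    and return the hidden state after the last time step."""
    units = U.shape[0]
    h = np.zeros(units)  # hidden state, starts at zero for every sequence
    c = np.zeros(units)  # cell state, starts at zero for every sequence
    for x_t in sequence:
        z = x_t @ W + h @ U + b              # pre-activations of all gates
        i = sigmoid(z[0 * units:1 * units])  # input gate
        f = sigmoid(z[1 * units:2 * units])  # forget gate
        g = np.tanh(z[2 * units:3 * units])  # candidate cell values
        o = sigmoid(z[3 * units:4 * units])  # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

rng = np.random.RandomState(0)
n_steps, n_feat, n_units = 70, 15, 5
W = rng.randn(n_feat, 4 * n_units) * 0.1   # input weights
U = rng.randn(n_units, 4 * n_units) * 0.1  # recurrent weights
b = np.zeros(4 * n_units)                  # biases

drive = rng.rand(n_steps, n_feat)          # one synthetic drive
summary = lstm_last_output(drive, W, U, b)
print(summary.shape)                       # (5,)
print(W.size + U.size + b.size)            # 420
```

The same weights `W`, `U`, and `b` are reused at all 70 time steps, which is why the number of weights does not depend on the sequence length.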

We're almost done! Let's wire up the 5 output neurons of the LSTM to just a single output neuron using a [Dense](https://keras.io/layers/core/#dense) layer. A Dense layer is just your regular fully connected NN layer.

We will use sigmoid as our activation function because its output lies naturally between 0 and 1, which matches our target well.

The output of the dense layer will be $\sigma(z)$, with $\sigma(z) = \frac{1}{1 + \exp(-z)}$, where $z$ is just a linear combination of the LSTM output, i.e. $z = \sum\limits_j w_j x_j + b$, where $x_j$ are the outputs of the previous layer and $w_j$ and $b$ are the weights and the bias learnt by the Dense layer.

In [24]:
output = Dense(1, activation='sigmoid', name='output')(x)
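
The computation of this single output neuron can be written out in a few lines of NumPy. The input vector, weights, and bias below are arbitrary illustrative values, not learnt ones.

```python
import numpy as np

def dense_sigmoid(x, w, b):
    """Single sigmoid neuron: sigma(sum_j w_j * x_j + b)."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, -0.5, 0.1, 0.7, -0.3])  # stand-in for the LSTM output
w = np.array([1.0, 0.5, -2.0, 0.3, 0.0])   # one weight per LSTM output
b = 0.1                                    # bias

p = dense_sigmoid(x, w, b)
print(p)  # a value strictly between 0 and 1
```

With all weights and the bias at zero the neuron outputs exactly 0.5, i.e. it is undecided between the two classes.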

Let's define the inputs and outputs of our model.

In [25]:
model = Model(input=input, output=output)

Now, we will compile our model. Here, we specify two parameters:

- optimizer: the optimizer does the actual work of updating the weights. Given the gradients of the loss with respect to the weights, it decides in which direction and how far to move each weight. There are quite a few [optimizers available in Keras](https://keras.io/optimizers/).
- loss: the loss or objective function tells the model how well we are doing on our data. In our case, this is simply binary crossentropy, but in other cases this may be e.g. mean squared error. Note that this function needs to be differentiable, because during training we need its gradients to compute the weight updates. Hence, we cannot optimize for e.g. ROCAUC directly.

In [26]:
model.compile(optimizer="rmsprop", loss="binary_crossentropy")
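
As a rough illustration of the loss, here is binary crossentropy written in NumPy, together with a numerical check of its derivative for a single sigmoid output. The labels and predictions are made up, and the clipping constant `eps` is an assumption for numerical safety, not necessarily the value Keras uses.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([0.0, 0.0, 1.0, 0.0])
loss_good = binary_crossentropy(y_true, np.array([0.01, 0.02, 0.95, 0.03]))
loss_flat = binary_crossentropy(y_true, np.full(4, 0.5))  # equals ln(2)

# Differentiability: for one example with p = sigmoid(z), the derivative of
# the loss with respect to z is p - y. Compare with a finite difference.
z, y, h = 0.3, 1.0, 1e-6
analytic = sigmoid(z) - y
numeric = (binary_crossentropy(np.array([y]), sigmoid(np.array([z + h])))
           - binary_crossentropy(np.array([y]), sigmoid(np.array([z - h])))
           ) / (2 * h)

print(loss_good, loss_flat)
print(analytic, numeric)
```

Confident correct predictions give a loss close to 0, while always predicting 0.5 gives ln(2), roughly 0.693. The smooth gradient is what the optimizer uses to update the weights.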

Let's print out a nice summary of our model.

In [27]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
inputs (InputLayer)              (None, 70, 15)        0                                            
____________________________________________________________________________________________________
masking_5 (Masking)              (None, 70, 15)        0           inputs[0][0]                     
____________________________________________________________________________________________________
lstm_5 (LSTM)                    (None, 5)             420         masking_5[0][0]                  
____________________________________________________________________________________________________
output (Dense)                   (None, 1)             6           lstm_5[0][0]                     
Total params: 426
_________________________________________________________________________

Note that *None* stands for the batch dimension, i.e. the model does not care how many instances we input at once. The total number of parameters or weights for our model is 426.
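
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias); the helper names are made up for this example.

```python
def lstm_params(n_features, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (n_features * units + units * units + units)

def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

lstm_count = lstm_params(n_features=15, units=5)
dense_count = dense_params(n_inputs=5, n_outputs=1)

print(lstm_count)                # 420
print(dense_count)               # 6
print(lstm_count + dense_count)  # 426
```

The Input and Masking layers have no weights, so they contribute 0 parameters.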

## Let's train!

We train using a mini-batch size of 30 instances at a time. This speeds things up, because a mini-batch can be computed in parallel on a GPU. We train for three epochs, i.e. we go over our training set three times.

Conveniently, Keras will create a hold-out validation set automatically for us when given the *validation_split* parameter. Let's set it to 20% of our data. Please leave *verbose* at 2 in the following call, which prints one line per epoch; the progress bar shown with *verbose=1* may freeze your notebook.
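
The sizes of the resulting training and validation sets, and the number of mini-batches per epoch, follow from simple arithmetic. The sketch below assumes the behaviour described in the Keras documentation, where the validation samples are taken from the end of the data before any shuffling; the variable names are chosen for this example.

```python
import math

n_samples = 74574
validation_split = 0.2
batch_size = 30

# The first 80% of the samples are used for training, the rest for validation.
split_at = int(n_samples * (1.0 - validation_split))
n_train = split_at
n_val = n_samples - split_at

# The last mini-batch of an epoch may contain fewer than batch_size samples.
batches_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val)      # 59659 14915
print(batches_per_epoch)   # 1989
```

Because the split is positional, it is only a fair hold-out set if the order of the drives in *data.npz* carries no information about the labels.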

In [None]:
model.fit(data, labels, verbose=2, nb_epoch=3, batch_size=30, validation_split=0.2, callbacks=[AUCHistory()])

Train on 59659 samples, validate on 14915 samples
Epoch 1/3


A ROCAUC of roughly 0.6! That's not that great. :(
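
The `AUCHistory` callback comes from a local `callbacks` module that is not shown here, so its implementation is unknown. As a rough illustration of the metric itself: ROCAUC is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counting half. A minimal NumPy sketch with made-up labels and scores:

```python
import numpy as np

def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs that are
    ranked correctly; tied pairs count as half a correct ranking."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
print(roc_auc(toy_labels, [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(roc_auc(toy_labels, [0.5, 0.5, 0.5, 0.5]))   # 0.5
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking. The pairwise comparison uses memory proportional to the number of positive-negative pairs, so it is meant for small examples only.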

## Task 1: Increase the number of neurons

In the previous run, we could see that the loss on our training data did not decrease anymore. The reason for this could be that our model is simply too small to accommodate the patterns in our data.

Let's try to increase the number of neurons in our LSTM to 100.

For this task, please mark this chunk and select *Cell* and then *Run All Above*.

In [16]:
x = Masking()(input)
x = LSTM(100)(x)
output = Dense(1, activation='sigmoid', name='output')(x)
model = Model(input=input, output=output)
model.compile(optimizer="rmsprop", loss="binary_crossentropy")
model.fit(data, labels, verbose=2, nb_epoch=3, batch_size=30, validation_split=0.2, callbacks=[AUCHistory()])

Train on 59659 samples, validate on 14915 samples
Epoch 1/3

Epoch validation AUC: 0.591752913753

108s - loss: 0.0535 - val_loss: 0.0279
Epoch 2/3

Epoch validation AUC: 0.591752913753

98s - loss: 0.0314 - val_loss: 0.0280
Epoch 3/3

Epoch validation AUC: 0.591752913753

99s - loss: 0.0313 - val_loss: 0.0281


<keras.callbacks.History at 0x7fae2cd76110>