# Deep Learning Tutorial 1: Training

Welcome to the first deep learning tutorial!

In this notebook, we are going to apply neural networks to detect failures of hard drives based on S.M.A.R.T. status observations.

Note that you can interrupt the training process at any time by clicking on *Kernel* and then *Interrupt*.


## Framework

We will be using the [Keras](http://keras.io) framework, which abstracts away a lot of the tedious details of deep learning. There are two ways to build neural networks in Keras: the [sequential API](https://keras.io/getting-started/sequential-model-guide/) and the [functional API](https://keras.io/getting-started/functional-api-guide/).

We will only use the functional API because of its expressive power.

#### Sequential API:

```Python
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(64, input_dim=784))
model.add(Activation('relu'))
model.add(Dense(64))  # only the first layer needs input_dim; later layers infer it
model.add(Activation('relu'))
```

#### Functional API
```Python
from keras.layers import Input, Dense
from keras.models import Model

# this returns a tensor
inputs = Input(shape=(784,))

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

# this creates a model that includes
# the Input layer and three Dense layers
model = Model(input=inputs, output=predictions)
```

#### Same in both APIs

```Python
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels) 
```

#### Why is the functional API better?

It allows us to do more. For example, with the functional API we can reuse trained layers, and we can train multi-input and multi-output models.
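To make the layer-reuse point concrete, here is a minimal NumPy sketch (not Keras code) of the idea behind the functional API: a layer is an object that holds weights, and calling it on two different inputs reuses the same weights. The `TinyDense` class and all array shapes are made up for illustration.

```python
import numpy as np


class TinyDense:
    """Hypothetical stand-in for a dense layer: y = relu(x @ W + b)."""

    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def __call__(self, x):
        # Calling the layer on an input returns a new array, much like a
        # Keras layer called on a tensor returns a tensor.
        return np.maximum(0.0, x @ self.W + self.b)


shared = TinyDense(n_in=4, n_out=3)

# Two different inputs pass through the *same* layer object (same weights).
a = np.ones((2, 4))
b = np.zeros((5, 4))
out_a = shared(a)
out_b = shared(b)

print(out_a.shape, out_b.shape)
```

In Keras, the same pattern (one layer instance called on several tensors) is what allows weight sharing and multi-input models.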

## Let's start

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.cross_validation import train_test_split
np.random.seed(42)

from keras.models import Model
from keras.layers import *
from keras.layers.wrappers import *
from keras.optimizers import *
from keras.utils.visualize_util import plot, model_to_dot
from IPython.display import SVG

from callbacks import AUCHistory

Using Theano backend.


### Loading our data

Data Set Information:

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').


Input variables:

#### Bank client data:
1.  age (numeric)
2.  job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. housing: has housing loan? (categorical: 'no','yes','unknown')
7. loan: has personal loan? (categorical: 'no','yes','unknown')
#### Related to the last contact of the current campaign:
8. contact: contact communication type (categorical: 'cellular','telephone')
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
#### Other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
#### Social and economic context attributes:
16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. cons.price.idx: consumer price index - monthly indicator (numeric)
18. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)
#### Output variable (desired target):
21. y - has the client subscribed to a term deposit? (binary: 'yes','no')

### Citation:
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Description and data download location: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

In [2]:
bank = pd.read_csv('bank-additional/bank-additional-full.csv', sep=';')
bank.shape

(41188, 21)

In [3]:
bank.dtypes

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

In [5]:
bank = pd.get_dummies(bank)

In [6]:
bank.dtypes

age                        int64
duration                   int64
campaign                   int64
pdays                      int64
previous                   int64
emp.var.rate             float64
cons.price.idx           float64
cons.conf.idx            float64
euribor3m                float64
nr.employed              float64
job_admin.               float64
job_blue-collar          float64
job_entrepreneur         float64
job_housemaid            float64
job_management           float64
job_retired              float64
job_self-employed        float64
job_services             float64
job_student              float64
job_technician           float64
job_unemployed           float64
job_unknown              float64
marital_divorced         float64
marital_married          float64
marital_single           float64
marital_unknown          float64
education_basic.4y       float64
education_basic.6y       float64
education_basic.9y       float64
education_high.school    float64
          

In [7]:
X = bank.drop(['y_no', 'y_yes'], axis=1)
Y = bank[['y_no', 'y_yes']]

The data is already ordered by time, so we can split it into training, validation, and test sets manually.
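The manual split can be sketched as a small helper. This is only an illustration under the assumption that the first 60% of rows form the training set, the next 20% the validation set, and the rest the test set; `chronological_split` is a hypothetical name, and the row count is the one printed by `bank.shape` above.

```python
def chronological_split(n_rows, train_end_frac=0.6, validation_end_frac=0.8):
    """Return (train, validation, test) slices for rows that are ordered by time."""
    train_end = int(train_end_frac * n_rows)
    validation_end = int(validation_end_frac * n_rows)
    return (
        slice(0, train_end),               # oldest rows
        slice(train_end, validation_end),  # middle rows
        slice(validation_end, n_rows),     # most recent rows
    )


n_rows = 41188  # number of rows reported by bank.shape
splits = chronological_split(n_rows)
sizes = tuple(s.stop - s.start for s in splits)
print(sizes)
```

Because no shuffling takes place, the model is always evaluated on records that come later than the ones it was trained on.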

In [8]:
X_train = X[:int(0.6*X.shape[0])]
X_validation = X[int(0.6*X.shape[0]):int(0.8*X.shape[0])]
X_test = X[int(0.8*X.shape[0]):]
X_train.shape, X_validation.shape, X_test.shape

((24712, 63), (8238, 63), (8238, 63))

In [9]:
Y_train = Y[:int(0.6*X.shape[0])]
Y_validation = Y[int(0.6*X.shape[0]):int(0.8*X.shape[0])]
Y_test = Y[int(0.8*X.shape[0]):]
Y_train.shape, Y_validation.shape, Y_test.shape

((24712, 2), (8238, 2), (8238, 2))

In [10]:
Y_train['y_yes'].value_counts()

0.0    23524
1.0     1188
Name: y_yes, dtype: int64

In [11]:
Y_validation['y_yes'].value_counts()

0.0    7326
1.0     912
Name: y_yes, dtype: int64

In [12]:
Y_test['y_yes'].value_counts()

0.0    5698
1.0    2540
Name: y_yes, dtype: int64
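The counts above imply quite different positive rates in the three splits. A small worked calculation, using only the numbers printed by the three `value_counts()` calls:

```python
# (negatives, positives) per split, copied from the outputs above
counts = {
    "train": (23524, 1188),
    "validation": (7326, 912),
    "test": (5698, 2540),
}

# Share of 'yes' labels in each split
rates = {name: pos / (neg + pos) for name, (neg, pos) in counts.items()}

for name, rate in rates.items():
    print("{}: {:.1%} positive".format(name, rate))
```

The share of positive labels rises from roughly 5% in the training slice to roughly 31% in the test slice, which is worth keeping in mind when comparing metrics across splits.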

In [13]:
n_records = X_train.shape[0]
n_features = X_train.shape[1]

First, we define the input layer, which just takes in our data. It does not contain any logic other than defining the shape of our input. Since we use the functional API, this also means that all matrix shapes in the following layers will be inferred automatically.

In [14]:
inputs = Input(shape=(n_features,), name="inputs")

Note that the first dimension, *n_records*, is automatically inferred.

In [None]:
model.compile?

In [44]:
from __future__ import print_function
import keras
from sklearn.metrics import roc_auc_score, confusion_matrix
import numpy as np

        
class AUCHistory(keras.callbacks.Callback):
    """Print the ROC AUC on the validation data after every epoch."""

    def __init__(self, input_len=1, *args, **kwargs):
        self.input_len = input_len
        super(AUCHistory, self).__init__(*args, **kwargs)

    def on_epoch_end(self, epoch, logs=None):
        # Predict class probabilities for the validation inputs.
        y_pred = self.model.predict(self.model.validation_data[0])
        # Score column 1 (the 'y_yes' class) against the true labels.
        auc = roc_auc_score(self.model.validation_data[1][:, 1], y_pred[:, 1])
        print("\nEpoch validation AUC: {}\n".format(auc))


In [45]:
# inputs = Input(shape=(n_features,), name="inputs")

predictions = Dense(2, activation='softmax')(inputs)

# this creates a model that includes
# the Input layer and a single Dense (softmax) layer
model = Model(input=inputs, output=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'],)

model.fit(X_train.as_matrix(), Y_train.as_matrix(), 
          validation_data=(X_validation.as_matrix(), Y_validation.as_matrix()), 
          callbacks=[AUCHistory()])  # starts training

Train on 24712 samples, validate on 8238 samples
Epoch 1/10
Epoch validation AUC: 0.5

Epoch 2/10
Epoch validation AUC: 0.5

Epoch 3/10
Epoch validation AUC: 0.5

Epoch 4/10
Epoch validation AUC: 0.5

Epoch 5/10
Epoch validation AUC: 0.5

Epoch 6/10
Epoch validation AUC: 0.5

Epoch 7/10
Epoch validation AUC: 0.5

Epoch 8/10
Epoch validation AUC: 0.5

Epoch 9/10
Epoch validation AUC: 0.5

Epoch 10/10
Epoch validation AUC: 0.5



<keras.callbacks.History at 0x7f1b5eb46b10>
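The callback reports ROC AUC, which measures ranking quality: the fraction of (positive, negative) pairs in which the positive record receives the higher score, with ties counting as one half. Below is a minimal NumPy sketch of that definition. `pairwise_auc` is a hypothetical helper written for illustration, not the scikit-learn implementation, and it compares all pairs, so it only suits small arrays. It shows that a scorer that gives every record the same score lands at exactly 0.5; whether that is what happens in the run above is not established by the log.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))


labels_demo = [0, 0, 1, 1]  # toy labels, not taken from the bank data

auc_informative = pairwise_auc(labels_demo, [0.1, 0.4, 0.35, 0.8])
auc_constant = pairwise_auc(labels_demo, [0.3, 0.3, 0.3, 0.3])

print(auc_informative, auc_constant)
```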

In [40]:
x = Dense(64, activation='relu')(inputs)
# x = Dense(64, activation='relu')(x)
predictions = Dense(2, activation='softmax')(x)

# this creates a model that includes
# the Input layer and two Dense layers
model = Model(input=inputs, output=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train.as_matrix(), Y_train.as_matrix(), 
          validation_data=(X_validation.as_matrix(), Y_validation.as_matrix()), 
          callbacks=[AUCHistory()])  # starts training

Train on 24712 samples, validate on 8238 samples
Epoch 1/10
Epoch validation AUC: 0.515157352328

Epoch 2/10
Epoch validation AUC: 0.515157352328

Epoch 3/10
Epoch validation AUC: 0.515157352328

Epoch 4/10
Epoch validation AUC: 0.515157352328

Epoch 5/10
Epoch validation AUC: 0.515157352328

Epoch 6/10
Epoch validation AUC: 0.515157352328

Epoch 7/10
Epoch validation AUC: 0.515157352328

Epoch 8/10
Epoch validation AUC: 0.515157352328

Epoch 9/10
Epoch validation AUC: 0.515157352328

Epoch 10/10
Epoch validation AUC: 0.515157352328



<keras.callbacks.History at 0x7f1b5efbc050>

Now the fun part starts. We will add an LSTM layer that summarizes each drive by performing the same computation on a vector of size n_features at each of the n_records time steps.

Remember the unfolded-in-time computation graph of an RNN:

![image](http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/09/rnn.jpg)

where $x_t$ are observations for a particular drive, e.g. $x_1$ is the first observation and $x_2$ is the second observation. In our case, we are only interested in the last output $o_t$ where $t = \text{n_records}$. The output of this LSTM will be a vector of size 10. In other words, the LSTM has 10 neurons in the output layer.
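The unfolding in time can be sketched in a few lines of NumPy. Note that this is a plain tanh RNN, not an LSTM (an LSTM adds gates and a cell state), and that all sizes and weights below are made-up illustration values. What carries over is the structure: the same weights are applied at every time step, and only the final state is kept as the summary.

```python
import numpy as np


def simple_rnn_last_output(x_seq, W_x, W_h, b):
    """Run a plain tanh RNN over x_seq and return only the final state."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_seq:  # one step per observation x_1, x_2, ...
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h


rng = np.random.default_rng(0)
n_steps, n_feat, n_units = 5, 3, 10  # hypothetical sizes
x_seq = rng.normal(size=(n_steps, n_feat))
W_x = rng.normal(scale=0.1, size=(n_feat, n_units))
W_h = rng.normal(scale=0.1, size=(n_units, n_units))
b = np.zeros(n_units)

summary = simple_rnn_last_output(x_seq, W_x, W_h, b)
print(summary.shape)  # (10,)
```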

In [None]:
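input = Input(shape=(n_records, n_features), name="inputs")  # 3D input: (batch, n_records, n_features)
x = LSTM(10)(input)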

We're almost done! Let's wire up the 10 output neurons of the LSTM to just a single output neuron using a [Dense](https://keras.io/layers/core/#dense) layer. A Dense layer is just your regular fully connected NN layer.

The output of the dense layer will be $\sigma(z)$, with $\sigma(z) = \frac{1}{1 + \exp(-z)}$, where $z = \sum\limits_j w_j x_j$ is a linear combination of the outputs of the previous layer: $w_j$ are the learnt weights for the connections from the LSTM to the Dense layer and $x_j$ are the outputs of the LSTM. Conveniently, $\sigma(z)$ lies between $0$ and $1$ and matches our target well.
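As a small numeric illustration of the sigmoid output neuron: the weights and LSTM outputs below are made-up values, and the bias term that a Keras `Dense` layer also learns is left out, as in the formula above.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


x = np.array([0.2, -1.0, 0.5])  # hypothetical LSTM outputs x_j
w = np.array([1.5, 0.3, -0.7])  # hypothetical learnt weights w_j

z = np.dot(w, x)  # linear combination: sum over j of w_j * x_j
p = sigmoid(z)    # value between 0 and 1

print(z, p)
```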

In [None]:
output = Dense(1, activation='sigmoid', name='output')(x)

Let's wrap up the input and output of our Model.

In [None]:
model = Model(input=input, output=output)

Now, we will compile our model. Here, we specify two parameters:

- optimizer: the optimizer does the work of updating the weights for us. Given the gradients of the loss, it decides in which direction, and how far, to move each weight. There are quite a few [optimizers available in Keras](https://keras.io/optimizers/).
- loss: the loss or objective function tells the model how well we are doing on our data. In our case, this is simply binary crossentropy, but in other cases this may be e.g. mean squared error. Note that this function needs to be differentiable because during training we need to be able to compute the weight updates. Hence, we cannot optimize for e.g. ROC AUC directly.
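To give a feeling for the loss, here is a minimal NumPy sketch of binary crossentropy. It is an illustration of the formula, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all instances."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))


# Toy labels and predictions, chosen for illustration only
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.5, 0.5])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(confident_right, unsure, confident_wrong)
```

The loss changes smoothly with the predicted probabilities, which is what makes gradient-based training possible; confident wrong predictions are penalized most heavily.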

In [None]:
model.compile(optimizer=Nadam(), loss="binary_crossentropy")

Let's print out a nice plot of our model.

In [None]:
SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))

Note that *None* simply means that the model does not really care how many instances we input.

## Let's train!

We train using a mini-batch size of 20 instances at a time. This speeds things up, because a mini-batch can be computed in parallel on a GPU. We train for eight epochs, i.e. we go over our training set eight times.
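How mini-batches and epochs relate can be sketched as follows. The helper `minibatch_slices` and the sample count of 50 are hypothetical; only the batch size of 20 and the eight epochs come from the training call.

```python
import math


def minibatch_slices(n_samples, batch_size):
    """Yield consecutive slices that cover n_samples in chunks of batch_size."""
    for start in range(0, n_samples, batch_size):
        yield slice(start, min(start + batch_size, n_samples))


n_samples = 50   # hypothetical number of training instances
batch_size = 20  # as in the fit call
n_epochs = 8     # as in the fit call

batches = list(minibatch_slices(n_samples, batch_size))
updates = n_epochs * len(batches)  # one weight update per mini-batch

print(len(batches), updates)
```

The last batch of an epoch is smaller whenever the number of instances is not a multiple of the batch size.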

Conveniently, Keras will create a hold-out validation set for us automatically when we pass the *validation_split* parameter. Let's set it to 20% of our data. **Please leave *verbose* at 2** in the following call, otherwise your notebook may freeze.

In [None]:
model.fit(data, labels, verbose=2, nb_epoch=8, batch_size=20, validation_split=0.2, callbacks=[AUCHistory()])

That wasn't so bad. You can see that our network converges after the second epoch: neither the training loss nor the AUC improves after that.

## Task 1: Increase the number of neurons

Maybe the model is simply too small to accommodate the patterns in our data? Let's try to increase the number of neurons to 20.

**Your task** is to:

- Mark this chunk and select *Cell* and then *Run All Above*
- Increase the number of neurons to 20

In [None]:
input = Input(shape=(n_records, n_features), name="inputs")
x = input
x = LSTM(20)(x)
output = Dense(1, activation='sigmoid', name='output')(x)
model1 = Model(input=input, output=output)
model1.compile(optimizer=Nadam(), loss="binary_crossentropy")

In [None]:
model1.fit(data, labels, verbose=2, nb_epoch=8, batch_size=20, validation_split=0.2, callbacks=[AUCHistory()])

This is much better! If you want, you can try a different number of neurons.

## Task 2: Add another layer

Let's make our model deeper! This is deep learning after all. Note that our network is already deep in time, i.e. we take into consideration 90 time steps. But we can also make it deeper vertically.

Your task is to

- stack another LSTM layer on top of the layer we already have.

At each time step, the first LSTM will feed into the second LSTM. This is called stacking.

Note that for this, you have to set *return_sequences=True* in the first LSTM. Do you understand why this is required?
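The shapes involved can be illustrated with a small NumPy sketch. It uses a plain tanh RNN as a stand-in for the LSTM, and the function `rnn_layer`, its `seed` argument, and the sizes are assumptions made for illustration. The second recurrent layer needs one input vector per time step, so the first layer has to return its state at every step rather than only the last one.

```python
import numpy as np


def rnn_layer(x_seq, n_units, return_sequences, seed):
    """Plain tanh RNN; returns all states or only the last one."""
    rng = np.random.default_rng(seed)
    W_x = rng.normal(scale=0.1, size=(x_seq.shape[1], n_units))
    W_h = rng.normal(scale=0.1, size=(n_units, n_units))
    h = np.zeros(n_units)
    states = []
    for x_t in x_seq:
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states) if return_sequences else h


x_seq = np.ones((6, 4))  # 6 time steps, 4 features (hypothetical)

first = rnn_layer(x_seq, 30, return_sequences=True, seed=0)       # one state per step
second = rnn_layer(first, 30, return_sequences=False, seed=1)     # summary of the whole sequence
last_only = rnn_layer(x_seq, 30, return_sequences=False, seed=0)  # no time axis left

print(first.shape, second.shape, last_only.shape)
```

Without the per-step outputs, the first layer would hand over a single vector with no time axis, and there would be nothing for the second recurrent layer to iterate over.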

In [None]:
input = Input(shape=(n_records, n_features), name="inputs")
x = input

### Your code goes here:
x = LSTM(30, return_sequences=True)(x)
x = LSTM(30)(x)
##

output = Dense(1, activation='sigmoid', name='output')(x)
model2 = Model(input=input, output=output)
model2.compile(optimizer=Nadam(), loss="binary_crossentropy")

In [None]:
SVG(model_to_dot(model2, show_shapes=True).create(prog='dot', format='svg'))

In [None]:
model2.fit(data, labels, verbose=2, nb_epoch=8, batch_size=20, validation_split=0.2, callbacks=[AUCHistory()])

## Task 3: Change the architecture

This task will be a bit more challenging. We are going to use an additional attribute: the disk's model. However, because this attribute is constant in a disk's time series, we will not add it to the LSTM that summarizes the time series.

Instead, we will *merge* the $n$-dimensional vector output of the LSTM with a $m$-dimensional vector, where merging means concatenating the two vectors into a vector of dimensionality $n+m$.
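In NumPy terms, this kind of merge is a plain concatenation. The vectors below are made-up stand-ins for an LSTM output with $n = 4$ and a second input with $m = 3$.

```python
import numpy as np

lstm_summary = np.arange(4, dtype=float)   # hypothetical LSTM output, n = 4
model_one_hot = np.array([0.0, 1.0, 0.0])  # hypothetical second input, m = 3

# Concatenating gives a single vector of dimensionality n + m.
merged = np.concatenate([lstm_summary, model_one_hot])

print(merged.shape)  # (7,)
```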

Conveniently, we have already encoded the disk's model as a one-hot vector, i.e. the columns in the following matrix correspond to unique disk models and the rows to individual disks.
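For readers who have not seen one-hot encoding before, here is a minimal sketch of how such a matrix could be built. The disk model names are invented, and this is not necessarily how the saved `models` matrix was produced.

```python
import numpy as np

disk_models = ["A", "B", "A", "C"]  # hypothetical model name per disk

# Map each unique model to a column index.
unique_models = sorted(set(disk_models))
column_of = {name: i for i, name in enumerate(unique_models)}

# One row per disk, one column per unique model, a single 1 per row.
one_hot = np.zeros((len(disk_models), len(unique_models)))
rows = np.arange(len(disk_models))
cols = [column_of[name] for name in disk_models]
one_hot[rows, cols] = 1.0

print(one_hot)
```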

In [None]:
models = saved["models"]

In [None]:
models.shape

In this task, you have to do the following:

- Create a second Input with shape `(number_of_models, )`. Note that you do not need masking here, because at this step we are no longer working with a time series with missing observations.
- Introduce a [Merge](https://keras.io/getting-started/sequential-model-guide/#the-merge-layer) layer that merges `[x, your_new_input]`
- Modify the `Model` instantiation to take two inputs simultaneously, similar to what you have done in the previous step.

In [None]:
input = Input(shape=(n_records, n_features), name="inputs")
input2 = Input(shape=(models.shape[1], ))

x = input
x = LSTM(20, return_sequences=True)(x)
x = LSTM(20)(x)

x2 = input2

x = merge([x, x2], mode="concat")

output = Dense(1, activation='sigmoid', name='output')(x)
model3 = Model(input=[input, input2], output=output)
model3.compile(optimizer=Nadam(), loss="binary_crossentropy")

In [None]:
SVG(model_to_dot(model3, show_shapes=True).create(prog='dot', format='svg'))

In [None]:
model3.fit([data, models], labels, verbose=2, nb_epoch=10, batch_size=20, validation_split=0.2, callbacks=[AUCHistory()])

It's still getting better! This concludes the first deep learning tutorial.

If you still have some free time, you are welcome to experiment further with our architecture. Things you may want to try:

- Use GRU instead of LSTM units
- Introduce regularization such as dropout
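As a pointer for the dropout suggestion, here is a minimal NumPy sketch of the commonly used "inverted dropout" idea: during training a random subset of activations is zeroed and the rest are scaled up so that the expected value stays the same. The function and its arguments are illustrative assumptions, not the Keras `Dropout` layer itself.

```python
import numpy as np


def dropout(x, rate, rng, training=True):
    """Zero a fraction `rate` of the entries and rescale the survivors."""
    if not training or rate == 0.0:
        return x  # dropout is only active during training
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones(10000)  # toy activations

dropped = dropout(activations, rate=0.5, rng=rng)
unchanged = dropout(activations, rate=0.5, rng=rng, training=False)

print(dropped.mean())
```

With a rate of 0.5, about half of the entries become zero and the remaining ones are doubled, so the mean stays close to the original value.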