# Deep Learning Tutorial 1: Training

Welcome to the first deep learning tutorial!

In this notebook, we are going to apply neural networks to predict whether a bank client subscribes to a term deposit, based on data from direct marketing campaigns.

Note that you can interrupt the training process at any time by clicking on *Kernel* and then *Interrupt*.


## Framework

We will be using the [Keras](http://keras.io) framework, which abstracts away a lot of the tedious details of deep learning. There are two ways to build neural networks in Keras: the [sequential API](https://keras.io/getting-started/sequential-model-guide/) and the [functional API](https://keras.io/getting-started/functional-api-guide/).

We will only use the functional API because of its expressive power.

#### Sequential API:

```Python
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(64, input_dim=784))
model.add(Activation('relu'))
model.add(Dense(64))  # only the first layer needs input_dim
model.add(Activation('relu'))
```

#### Functional API:

```Python
from keras.layers import Input, Dense
from keras.models import Model

# this returns a tensor
inputs = Input(shape=(784,))

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

# this creates a model that includes
# the Input layer and three Dense layers
model = Model(input=inputs, output=predictions)
```

#### Same in both APIs

```Python
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels) 
```

#### Why is the functional API better?

It allows us to do more: with the functional API we can reuse trained layers, and we can train models with multiple inputs and multiple outputs.

## Let's start

In [111]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
np.random.seed(42)

from keras.models import Model
from keras.layers import *
from keras.layers.wrappers import *
from keras.optimizers import *
from keras.regularizers import l2, activity_l2
from keras.utils.visualize_util import plot, model_to_dot
from IPython.display import SVG

from callbacks import AUCHistory

Keras configuration

In [5]:
! cat ~/.keras/keras.json

{
    "image_dim_ordering": "tf", 
    "epsilon": 1e-07, 
    "floatx": "float32", 
    "backend": "theano"
}


To use TensorFlow we can edit the file and change the backend to "tensorflow". There is no need to change this for this tutorial.

### Loading our data

Data Set Information:

The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').


Input variables:

#### Bank client data:
1.  age (numeric)
2.  job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. housing: has housing loan? (categorical: 'no','yes','unknown')
7. loan: has personal loan? (categorical: 'no','yes','unknown')
#### Related to the last contact of the current campaign:
8. contact: contact communication type (categorical: 'cellular','telephone')
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
#### Other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
#### Social and economic context attributes:
16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. cons.price.idx: consumer price index - monthly indicator (numeric)
18. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)
#### Output variable (desired target):
21. y - has the client subscribed to a term deposit? (binary: 'yes','no')

### Citation:
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Description and data download location: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

In [6]:
bank = pd.read_csv('bank-additional/bank-additional-full.csv', sep=';')
bank.shape

(41188, 21)

In [3]:
bank.dtypes

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

Standardize the numerical inputs!

In [27]:
for c in bank.dtypes[bank.dtypes!='object'].index:
    bank[[c]] = StandardScaler().fit_transform(bank[[c]].as_matrix())
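
As a rough sketch of what `StandardScaler` does to each numeric column, the NumPy snippet below subtracts the column mean and divides by the population standard deviation. The function name `standardize` and the sample ages are made up for illustration.

```Python
import numpy as np

def standardize(values):
    """Return (values - mean) / std, using the population std (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

ages = np.array([30.0, 40.0, 50.0])  # hypothetical sample values
z = standardize(ages)
print(z)         # centred on 0 with unit variance
print(z.mean(), z.std())
```

After this step every numeric column lives on a comparable scale, which matters for the Bonus question at the end of the notebook.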



In [28]:
bank = pd.get_dummies(bank)
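
`pd.get_dummies` replaces every categorical column with one indicator column per category, which is also why the target `y` shows up as `y_no` and `y_yes` in the following cells. Here is a small sketch on a made-up frame. The `toy` data is hypothetical, and `dtype=float` is passed on the assumption that we want float indicators like the ones listed below (recent pandas versions return booleans by default).

```Python
import pandas as pd

# Hypothetical three-row frame: one numeric and two categorical columns.
toy = pd.DataFrame({
    "age": [30, 41, 25],
    "job": ["admin.", "student", "admin."],
    "y": ["no", "yes", "no"],
})

# Numeric columns are kept as they are; each categorical column is expanded
# into one 0/1 column per category, named <column>_<category>.
encoded = pd.get_dummies(toy, dtype=float)
print(list(encoded.columns))
print(encoded)
```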

In [29]:
bank.dtypes

age                      float64
duration                 float64
campaign                 float64
pdays                    float64
previous                 float64
emp.var.rate             float64
cons.price.idx           float64
cons.conf.idx            float64
euribor3m                float64
nr.employed              float64
job_admin.               float64
job_blue-collar          float64
job_entrepreneur         float64
job_housemaid            float64
job_management           float64
job_retired              float64
job_self-employed        float64
job_services             float64
job_student              float64
job_technician           float64
job_unemployed           float64
job_unknown              float64
marital_divorced         float64
marital_married          float64
marital_single           float64
marital_unknown          float64
education_basic.4y       float64
education_basic.6y       float64
education_basic.9y       float64
education_high.school    float64
          

In [30]:
X = bank.drop(['y_no', 'y_yes'], axis=1)
Y = bank[['y_no', 'y_yes']]

The data is already ordered by time, so we can split it into train, validation, and test sets manually.

In [31]:
X_train = X[:int(0.6*X.shape[0])]
X_validation = X[int(0.6*X.shape[0]):int(0.8*X.shape[0])]
X_test = X[int(0.8*X.shape[0]):]
X_train.shape, X_validation.shape, X_test.shape

((24712, 63), (8238, 63), (8238, 63))

In [32]:
Y_train = Y[:int(0.6*X.shape[0])]
Y_validation = Y[int(0.6*X.shape[0]):int(0.8*X.shape[0])]
Y_test = Y[int(0.8*X.shape[0]):]
Y_train.shape, Y_validation.shape, Y_test.shape

((24712, 2), (8238, 2), (8238, 2))

In [33]:
Y_train['y_yes'].value_counts()

0.0    23524
1.0     1188
Name: y_yes, dtype: int64

In [34]:
Y_validation['y_yes'].value_counts()

0.0    7326
1.0     912
Name: y_yes, dtype: int64

In [35]:
Y_test['y_yes'].value_counts()

0.0    5698
1.0    2540
Name: y_yes, dtype: int64
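
The three outputs above suggest that the share of 'yes' labels is not the same in each split. A quick calculation from those counts makes this explicit (the numbers are copied from the outputs; only the dictionary names are made up):

```Python
# Counts copied from the value_counts() outputs above: (negatives, positives).
counts = {
    "train": (23524, 1188),
    "validation": (7326, 912),
    "test": (5698, 2540),
}

positive_rate = {name: pos / (neg + pos) for name, (neg, pos) in counts.items()}
for name, rate in positive_rate.items():
    print("{0}: {1:.1%} positive".format(name, rate))
```

The positive rate grows from roughly 5% in the training period to roughly 31% in the test period. A metric such as AUC, which the callbacks below report, does not depend on the class ratio and is therefore easier to compare across the three sets than raw accuracy.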

In [36]:
n_records = X_train.shape[0]
n_features = X_train.shape[1]

First, we define the input layer, which just takes in our data. It does not contain any logic other than defining the shape of our input. Since we use the functional API, this also means that all matrix shapes in the following layers will be inferred automatically.

In [37]:
inputs = Input(shape=(n_features,), name="inputs")

Note that the first dimension, *n_records*, is automatically inferred.
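
To see why the number of records never has to be declared, here is a hand-written NumPy stand-in for a small dense network. This is not Keras code, and the layer sizes (32 hidden units, 2 outputs) are illustrative assumptions. The weight shapes depend only on the layer sizes, so the same weights work for any batch size.

```Python
import numpy as np

n_features = 63                  # number of columns in X_train, see the shapes above
hidden_units, n_classes = 32, 2  # illustrative layer sizes

rng = np.random.default_rng(0)
# Weight shapes follow from the layer sizes alone; the batch size never appears.
W1, b1 = rng.normal(size=(n_features, hidden_units)), np.zeros(hidden_units)
W2, b2 = rng.normal(size=(hidden_units, n_classes)), np.zeros(n_classes)

def forward(batch):
    """A relu layer followed by a softmax layer, written out by hand."""
    hidden = np.maximum(batch @ W1 + b1, 0.0)
    logits = hidden @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)  # 63*32 + 32 + 32*2 + 2

small = forward(rng.normal(size=(5, n_features)))
large = forward(rng.normal(size=(100, n_features)))
print(small.shape, large.shape)
```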

In [None]:
model.compile?

For more information on callbacks, see:
* https://keras.io/callbacks/
* https://keunwoochoi.wordpress.com/2016/07/16/keras-callbacks/

In [112]:
from __future__ import print_function
import keras
from sklearn.metrics import roc_auc_score, confusion_matrix
import numpy as np

        
class AUCHistory(keras.callbacks.Callback):
    def __init__(self, input_len=1, *args, **kwargs):
        self.input_len = input_len
        super(AUCHistory, self).__init__(*args, **kwargs)
 
    def on_epoch_end(self, epoch, logs={}):
        # self.model.training_data cannot be used!
        y_pred_train = self.model.predict(X_train.as_matrix())
        auc_train = roc_auc_score(Y_train['y_yes'], y_pred_train[:, 1])
        
        y_pred_val = self.model.predict(self.model.validation_data[0])
        auc_val = roc_auc_score(self.model.validation_data[1][:, 1], y_pred_val[:, 1])
        print("\nAUC train: {0}, validation: {1}\n".format(auc_train, auc_val))
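
The callback above relies on `roc_auc_score`. As an illustration of what that number means, the sketch below computes the AUC directly as the fraction of (positive, negative) pairs in which the positive record gets the higher score. The helper `pairwise_auc` and the sample data are made up, and the quadratic memory use makes this suitable for small examples only.

```Python
import numpy as np

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive against every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

labels = [0, 0, 1, 1]            # hypothetical labels
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted P(yes)
print(pairwise_auc(labels, scores))  # 3 of the 4 pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.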


In [98]:
inputs = Input(shape=(n_features,), name="inputs")

x = Dense(32, activation='relu')(inputs)
predictions = Dense(2, activation='softmax')(x)

model = Model(input=inputs, output=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
#               loss='kullback_leibler_divergence',
              metrics=['accuracy'],)

model.fit(X_train.as_matrix(), Y_train.as_matrix(), 
          validation_data=(X_validation.as_matrix(), Y_validation.as_matrix()), 
          callbacks=[AUCHistory()])  # starts training

Train on 24712 samples, validate on 8238 samples
Epoch 1/10
AUC train: 0.958223355387, validation: 0.862175722373

Epoch 2/10
AUC train: 0.961879607731, validation: 0.856278227989

Epoch 3/10
AUC train: 0.963083067397, validation: 0.830024327557

Epoch 4/10
AUC train: 0.963628484299, validation: 0.806826563406

Epoch 5/10
AUC train: 0.964460860089, validation: 0.799608445168

Epoch 6/10
AUC train: 0.964681388504, validation: 0.785223920092

Epoch 7/10
AUC train: 0.964859138772, validation: 0.785706460049

Epoch 8/10
AUC train: 0.965248042403, validation: 0.775941611468

Epoch 9/10
AUC train: 0.96534907469, validation: 0.772176333032

Epoch 10/10
AUC train: 0.965448800909, validation: 0.763061805825



<keras.callbacks.History at 0x7f45789677d0>

## Task 1: Try different configurations

Using the configuration below, play with different values for the learning rate (`lr`), `momentum`, `decay`, and `nesterov`.

In [125]:
inputs = Input(shape=(n_features,), name="inputs")
x = Dense(256, activation='tanh', init='uniform')(inputs)
x = Dense(256, activation='tanh', init='uniform')(x)
x = Dense(256, activation='tanh', init='uniform')(x)
predictions = Dense(2, activation='softmax')(x)

model = Model(input=inputs, output=predictions)
model.compile(optimizer=SGD(lr=0.0001, momentum=0.0, decay=0.0, nesterov=False),
              loss='categorical_crossentropy',
              metrics=['accuracy'],)

model.fit(X_train.as_matrix(), Y_train.as_matrix(), 
          validation_data=(X_validation.as_matrix(), Y_validation.as_matrix()), 
          nb_epoch=10,
          callbacks=[AUCHistory()])  

Train on 24712 samples, validate on 8238 samples
Epoch 1/10
AUC train: 0.780957423238, validation: 0.660926776058

Epoch 2/10
AUC train: 0.817491266889, validation: 0.670111573894

Epoch 3/10
AUC train: 0.84007252855, validation: 0.675361366151

Epoch 4/10
AUC train: 0.855723640932, validation: 0.679904336154

Epoch 5/10
AUC train: 0.867776683545, validation: 0.684175727761

Epoch 6/10
AUC train: 0.877786358455, validation: 0.688494415468

Epoch 7/10
AUC train: 0.886307171356, validation: 0.692860923124

Epoch 8/10
AUC train: 0.893766975285, validation: 0.697379646393

Epoch 9/10
AUC train: 0.900296788379, validation: 0.70187509579

Epoch 10/10
AUC train: 0.90607155197, validation: 0.706418739313



<keras.callbacks.History at 0x7f456ecafdd0>
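
To build some intuition for the four `SGD` arguments in this task, here is a plain-Python sketch of the update rule on a one-dimensional toy problem. It follows the classic momentum formulation with time-based decay; we assume Keras behaves along these lines, but the exact implementation may differ in details. The function `sgd_minimize` and the toy objective are made up.

```Python
def sgd_minimize(grad_fn, w, lr=0.01, momentum=0.0, decay=0.0,
                 nesterov=False, steps=10):
    """Run `steps` SGD updates on a scalar parameter `w`."""
    velocity = 0.0
    for t in range(steps):
        lr_t = lr / (1.0 + decay * t)      # time-based learning-rate decay
        g = grad_fn(w)
        velocity = momentum * velocity - lr_t * g
        if nesterov:
            w = w + momentum * velocity - lr_t * g  # look-ahead step
        else:
            w = w + velocity
    return w

def grad(w):
    return 2.0 * w                         # gradient of f(w) = w**2

plain = sgd_minimize(grad, 1.0, lr=0.1)
decayed = sgd_minimize(grad, 1.0, lr=0.1, decay=1.0)
with_momentum = sgd_minimize(grad, 1.0, lr=0.1, momentum=0.9)
print(plain, decayed, with_momentum)
```

With decay the steps shrink over time, so after the same number of updates `decayed` is further from the minimum at 0 than `plain`.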

## Task 2: Regularization

We will use rmsprop for simplicity from now on!

Have a look at the per-epoch performance of the network below and notice how the validation AUC degrades. Use dropout and other regularization techniques to fix this problem.

Dropout is a layer that can be added before every layer except Input.

* Dropout: https://keras.io/layers/core/#dropout
* Regularizers: https://keras.io/regularizers/

In [59]:
inputs = Input(shape=(n_features,), name="inputs")
x = Dense(64, activation='relu')(inputs)
x = Dense(32, activation='relu')(x)
predictions = Dense(2, activation='softmax')(x)

# this creates a model that includes
# the Input layer and three Dense layers
model = Model(input=inputs, output=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train.as_matrix(), Y_train.as_matrix(), 
          validation_data=(X_validation.as_matrix(), Y_validation.as_matrix()), 
          callbacks=[AUCHistory()])  

Train on 24712 samples, validate on 8238 samples
Epoch 1/10
AUC train: 0.963536487129, validation: 0.82562840951

Epoch 2/10
AUC train: 0.965037193192, validation: 0.805273799517

Epoch 3/10
AUC train: 0.965286956025, validation: 0.788884279016

Epoch 4/10
AUC train: 0.965807164057, validation: 0.780398894708

Epoch 5/10
AUC train: 0.966993519621, validation: 0.765906157354

Epoch 6/10
AUC train: 0.966196121362, validation: 0.75333938903

Epoch 7/10
AUC train: 0.967402389966, validation: 0.737466608355

Epoch 8/10
AUC train: 0.967739104615, validation: 0.721403221403

Epoch 9/10
AUC train: 0.967919395451, validation: 0.726491069419

Epoch 10/10
AUC train: 0.96827723975, validation: 0.717861701414



<keras.callbacks.History at 0x7f45923024d0>

First, add a Dropout layer before every Dense layer

Second, add more layers :)

Third, add weight regularization to each Dense layer (W_regularizer=l2(val))
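
Before trying these in Keras, it may help to see what the two techniques compute. The NumPy sketch below shows inverted dropout (randomly zero a fraction of the units and rescale the rest) and an L2 weight penalty that is added to the loss. The helper names and sample arrays are made up, and this is an illustration rather than the Keras implementation.

```Python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep
    return activations * mask / keep

def l2_penalty(weights, strength):
    """Term added to the loss: strength * sum of squared weights."""
    return strength * np.sum(weights ** 2)

rng = np.random.default_rng(42)
hidden = np.ones((4, 8))                  # hypothetical layer output
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)                            # every entry is either 0.0 or 2.0

W = np.array([[1.0, -2.0], [0.5, 0.0]])   # hypothetical weight matrix
penalty = l2_penalty(W, strength=0.01)
print(penalty)                            # 0.01 * (1 + 4 + 0.25)
```

Dropout is only applied during training; at prediction time all units are used.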

## Task 3: Architecture

We have two groups of very different features: client features and macroeconomic features.

Let's create two separate neural networks and combine them!

In [135]:
def split_X(X):
    X_train = X[:int(0.6*X.shape[0])]
    X_validation = X[int(0.6*X.shape[0]):int(0.8*X.shape[0])]
    X_test = X[int(0.8*X.shape[0]):]
    return X_train, X_validation, X_test

In [138]:
macro = ['emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']

X_macro = bank[macro]
X_rest = bank.drop(['y_no', 'y_yes']+macro, axis=1)

X_macro_train, X_macro_validation, X_macro_test = split_X(X_macro)
X_rest_train, X_rest_validation, X_rest_test = split_X(X_rest)

In [139]:
X_macro_train.shape, X_macro_validation.shape, X_macro_test.shape

((24712, 5), (8238, 5), (8238, 5))

In [140]:
X_rest_train.shape, X_rest_validation.shape, X_rest_test.shape

((24712, 58), (8238, 58), (8238, 58))

An auto-encoder example

In [154]:
dropout = 0.5
inputs_macro = Input(shape=(X_macro_train.shape[1],), name="inputs_macro")
x_macro = Dropout(dropout)(inputs_macro)
x_macro = Dense(20, activation='relu')(x_macro)
x_macro = Dropout(dropout)(x_macro)
x_macro = Dense(20, activation='relu')(x_macro)
x_macro = Dropout(dropout)(x_macro)
predictions_macro = Dense(X_macro_train.shape[1], activation='linear')(x_macro)

# this creates a model that includes
# the Input layer and three Dense layers
model = Model(input=inputs_macro, output=predictions_macro)
model.compile(optimizer='rmsprop',
              loss='mse',
              )
model.fit(X_macro_train.as_matrix(), X_macro_train.as_matrix(), 
          validation_data=(X_macro_validation.as_matrix(), X_macro_validation.as_matrix()), 
          nb_epoch=20
          )  

Train on 24712 samples, validate on 8238 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f4564800fd0>

Now let's combine the auto-encoder with the rest of the variables, in an end-to-end fashion!

In [176]:
class AUCHistoryCombined(keras.callbacks.Callback):
    def __init__(self, validation, input_len=1, *args, **kwargs):
        self.input_len = input_len
        self.validation = validation
        super(AUCHistoryCombined, self).__init__(*args, **kwargs)
 
    def on_epoch_end(self, epoch, logs={}):
        # self.model.training_data cannot be used!
        y_pred_train = self.model.predict([X_macro_train.as_matrix(), X_rest_train.as_matrix()])
        auc_train = roc_auc_score(Y_train['y_yes'], y_pred_train[1][:, 1])
        
        y_pred_val = self.model.predict(self.validation[0])
        auc_val = roc_auc_score(self.validation[1][1][:, 1], y_pred_val[1][:, 1])
        print("\nAUC train: {0}, validation: {1}\n".format(auc_train, auc_val))

In [177]:
dropout = 0.5

# First sub-network
inputs_macro = Input(shape=(X_macro_train.shape[1],), name="inputs_macro")
x_macro = Dropout(dropout)(inputs_macro)
x_macro = Dense(20, activation='relu')(x_macro)
x_macro = Dropout(dropout)(x_macro)
x_macro = Dense(20, activation='relu')(x_macro)
x_macro = Dropout(dropout)(x_macro)
predictions_macro = Dense(X_macro_train.shape[1], activation='linear')(x_macro)

# Second sub-network
inputs_rest = Input(shape=(X_rest_train.shape[1],), name="inputs_rest")
x_rest = Dropout(dropout)(inputs_rest)
x_rest = Dense(128)(x_rest)
x_rest = Dropout(dropout)(x_rest)
x_rest = Dense(128)(x_rest)

# Merging
x = merge([x_rest, x_macro], mode='concat')
predictions_rest = Dense(2, activation='softmax')(x)

model = Model(input=[inputs_macro, inputs_rest], output=[predictions_macro, predictions_rest])
model.compile(optimizer='rmsprop',
              loss=['mse', 'categorical_crossentropy'] ,
              loss_weights=[0.2, 1.]
              )

model.fit([X_macro_train.as_matrix(), X_rest_train.as_matrix()], 
          [X_macro_train.as_matrix(), Y_train.as_matrix()],
          nb_epoch=10,
          callbacks=[AUCHistoryCombined(validation=([X_macro_validation.as_matrix(), X_rest_validation.as_matrix()],
                                                    [X_macro_validation.as_matrix(), Y_validation.as_matrix()]), )]
         
         )
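
The `merge(..., mode='concat')` step simply places the two hidden representations side by side. A NumPy sketch of the shapes involved (the batch size is a made-up value and the arrays are placeholders):

```Python
import numpy as np

batch = 4                          # hypothetical batch size
x_rest = np.zeros((batch, 128))    # stands in for the output of the last Dense(128)
x_macro = np.ones((batch, 20))     # stands in for the output of the last Dense(20)

merged = np.concatenate([x_rest, x_macro], axis=-1)
print(merged.shape)                # (4, 148)

# A Dense(2) layer on top needs one weight per merged feature and class, plus biases.
n_params = merged.shape[1] * 2 + 2
print(n_params)                    # 298
```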

Epoch 1/10
AUC train: 0.949633857706, validation: 0.853528618331

Epoch 2/10
AUC train: 0.946403919745, validation: 0.794246324674

Epoch 3/10
AUC train: 0.944227243815, validation: 0.812567277205

Epoch 4/10
AUC train: 0.938834066305, validation: 0.79631919599

Epoch 5/10
AUC train: 0.937318760924, validation: 0.758412569268

Epoch 6/10
AUC train: 0.938846804925, validation: 0.786302750119

Epoch 7/10
AUC train: 0.934758584542, validation: 0.739589020839

Epoch 8/10
AUC train: 0.948588825682, validation: 0.830849839074

Epoch 9/10
AUC train: 0.94142923811, validation: 0.776135136333

Epoch 10/10
AUC train: 0.931737080463, validation: 0.734289013894



<keras.callbacks.History at 0x7f455aefe210>
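
With two outputs, the quantity being minimised is a weighted sum of the two losses. The sketch below spells that out in NumPy for a tiny made-up batch; the helper functions are simplified stand-ins for the Keras losses, and all arrays are hypothetical.

```Python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return np.mean(-np.sum(y_true * np.log(y_pred), axis=1))

# Hypothetical batch of two records.
macro_true = np.array([[1.0, 2.0], [3.0, 4.0]])
macro_pred = np.array([[1.0, 2.0], [3.0, 6.0]])
label_true = np.array([[1.0, 0.0], [0.0, 1.0]])
label_pred = np.array([[0.5, 0.5], [0.5, 0.5]])

loss_weights = [0.2, 1.0]  # same weights as in the compile call above
total = (loss_weights[0] * mse(macro_true, macro_pred)
         + loss_weights[1] * categorical_crossentropy(label_true, label_pred))
print(total)               # 0.2 * 1.0 + 1.0 * ln(2)
```

A smaller weight on the reconstruction loss means the classification output dominates the gradient.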

Exercise: add more layers after merging, and play with the architecture.

## Bonus
Comment out the standardisation of the numerical inputs and run a network with rmsprop (Task 1 or 2). What happens to the validation AUC?

You will need to re-run the preprocessing steps.