# Learning a Neural Network

Now let's actually train an ML model. As a first example, we will consider neural networks and train them using [Keras](https://keras.io) and [TensorFlow](https://www.tensorflow.org).

Given our input data (i.e. edge ratio, infection probability and so on) we want to predict the number of zombies and survivors. In other words: we have a regression task.

Let's start by loading the data:

In [None]:
import numpy as np
import keras
import pandas as pd
import os

In [None]:
# Load the data
data_fname = os.path.join('shared', 'za_processed.h5')
with pd.HDFStore(data_fname) as store:
    data_tr = store['data_tr']
    data_vl = store['data_vl']
    data_ts = store['data_ts']
    means_in = store['means_in']
    stds_in = store['stds_in']
    sim_in = store['sim_in']
    sim_out = store['sim_out']
    in_defaults = store['in_defaults']
    pop_size = store['meta']['pop_size']

data_tr.head()

## Separating Input and Output

Since we stored the series of input and output attribute names, it is easy to split each dataset into inputs and targets, ready for training:

In [None]:
# Separate input and output
X_tr = data_tr[sim_in]
Y_tr = data_tr[sim_out]
X_vl = data_vl[sim_in]
Y_vl = data_vl[sim_out]
X_ts = data_ts[sim_in]
Y_ts = data_ts[sim_out]

## Learning Networks

We are now ready to train an ML model. We will start by training one network for each output (zombies and survivors), using the Keras `Sequential` API.

Our loss function will be the Mean Squared Error (as is often the case for regression problems), and we will also evaluate our networks using the R2 score (i.e. the coefficient of determination).
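To make the two quantities concrete, here is a minimal NumPy sketch of MSE and R2. The helper names `mse` and `r2` and the target values are made up for illustration; they are not part of the notebook's pipeline, which computes the same formulas on Keras tensors.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of "always predict the mean"
    return 1.0 - ss_res / ss_tot

# Hypothetical normalized targets, for illustration only
y_true = np.array([0.1, 0.4, 0.6, 0.9])
perfect = y_true.copy()
baseline = np.full_like(y_true, y_true.mean())

print(r2(y_true, perfect))   # 1.0: a perfect fit
print(r2(y_true, baseline))  # 0.0: no better than predicting the mean
```

An R2 close to 1 means the model explains most of the variance in the target, 0 means it does no better than a constant prediction, and negative values mean it does worse.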

We will use a simple callback to stop training if no improvement is reported on the validation set for a number of epochs, and a second callback to checkpoint the best model to an HDF5 file. The architecture is saved separately as JSON, so each model ends up in a pair of files.
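The stopping rule can be sketched in plain Python. This is only an illustration of the patience idea, not the Keras implementation: the function name `early_stopping_epoch` and the list of validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used all the epochs
    return len(val_losses) - 1, best_epoch

# Made-up validation losses, one per epoch
val_losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 4 2
```

Note that the stopping epoch and the best epoch differ. With default callback settings, the model left in memory after training should likewise hold the weights of the last epoch, while the checkpoint file holds those of the best one.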

For more information about how to use all these components, see [the Keras documentation](https://keras.io/getting-started/sequential-model-guide/).

Let's start with a very simple, linear model (if we are lucky it may be enough):

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import backend as K
import tensorflow
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Seed the RNGs
np.random.seed(42)
tensorflow.set_random_seed(42)

# Custom R2 metric (courtesy of https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/34019)
def r2_score(y_true, y_pred):
    SS_res =  K.sum(K.square(y_true - y_pred)) 
    SS_tot = K.sum(K.square(y_true - K.mean(y_true))) 
    return ( 1 - SS_res/(SS_tot + K.epsilon()) )

# Input shape
input_shape = (X_tr.shape[1],)

# Handle the two outputs
max_epochs = 50
for target in sim_out:
    y_tr = Y_tr[target].values
    y_vl = Y_vl[target].values
    y_ts = Y_ts[target].values

    # Define a linear model (a single dense unit) for the current target
    model = Sequential()
    model.add(Dense(1, activation='linear', input_shape=input_shape))

    # Compile
    model.compile(optimizer='rmsprop',
                  loss='mse',
                  metrics=[r2_score])

    # Setup and perform training
    weight_fname = os.path.join('shared', 'nn_reg_%s.h5' % target)
    callbacks = [EarlyStopping(monitor='val_loss', patience=5),
                 ModelCheckpoint(filepath=weight_fname, monitor='val_loss', save_best_only=True)]
    model.fit(X_tr, y_tr, epochs=max_epochs, batch_size=32, callbacks=callbacks,
              validation_data=(X_vl, y_vl), verbose=0)
    
    # Save the model architecture
    arch_fname = os.path.join('shared', 'nn_reg_%s.json' % target)
    with open(arch_fname, 'w') as fp:
        fp.write(model.to_json())

    # Evaluate on the training, validation and test sets
    print('=== Results for target "{}"'.format(target))
    res_tr = model.evaluate(X_tr, y_tr, batch_size=len(X_tr))
    print('Loss and R2 on the training set: {}, {}'.format(*res_tr))
    res_vl = model.evaluate(X_vl, y_vl, batch_size=len(X_vl))
    print('Loss and R2 on the validation set: {}, {}'.format(*res_vl) )
    res_ts = model.evaluate(X_ts, y_ts, batch_size=len(X_ts))
    print('Loss and R2 on the test set: {}, {}'.format(*res_ts))

Not particularly good... Let's stack more layers!

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import backend as K
import tensorflow
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Seed the RNGs
np.random.seed(42)
tensorflow.set_random_seed(42)

# Custom R2 metric (courtesy of https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/34019)
def r2_score(y_true, y_pred):
    SS_res =  K.sum(K.square(y_true - y_pred)) 
    SS_tot = K.sum(K.square(y_true - K.mean(y_true))) 
    return ( 1 - SS_res/(SS_tot + K.epsilon()) )

# Input shape
input_shape = (X_tr.shape[1],)

# Handle the two outputs
max_epochs = 50
for target in sim_out:
    y_tr = Y_tr[target].values
    y_vl = Y_vl[target].values
    y_ts = Y_ts[target].values

    # Define a Neural Network model (one hidden layer) for the current target
    model = Sequential()
    model.add(Dense(16, activation='relu', input_shape=input_shape))
    model.add(Dense(1, activation='linear'))

    # Compile
    model.compile(optimizer='rmsprop',
                  loss='mse',
                  metrics=[r2_score])

    # Setup and perform training
    weight_fname = os.path.join('shared', 'nn_reg_%s.h5' % target)
    callbacks = [EarlyStopping(monitor='val_loss', patience=5),
                 ModelCheckpoint(filepath=weight_fname, monitor='val_loss', save_best_only=True)]
    model.fit(X_tr, y_tr, epochs=max_epochs, batch_size=32, callbacks=callbacks,
              validation_data=(X_vl, y_vl), verbose=0)
    
    # Save the model architecture
    arch_fname = os.path.join('shared', 'nn_reg_%s.json' % target)
    with open(arch_fname, 'w') as fp:
        fp.write(model.to_json())

    # Evaluate on the training, validation and test sets
    print('=== Results for target "{}"'.format(target))
    res_tr = model.evaluate(X_tr, y_tr, batch_size=len(X_tr))
    print('Loss and R2 on the training set: {}, {}'.format(*res_tr))
    res_vl = model.evaluate(X_vl, y_vl, batch_size=len(X_vl))
    print('Loss and R2 on the validation set: {}, {}'.format(*res_vl) )
    res_ts = model.evaluate(X_ts, y_ts, batch_size=len(X_ts))
    print('Loss and R2 on the test set: {}, {}'.format(*res_ts))

Even more...

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import backend as K
import tensorflow
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Seed the RNGs
np.random.seed(42)
tensorflow.set_random_seed(42)

# Custom R2 metric (courtesy of https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/34019)
def r2_score(y_true, y_pred):
    SS_res =  K.sum(K.square(y_true - y_pred)) 
    SS_tot = K.sum(K.square(y_true - K.mean(y_true))) 
    return ( 1 - SS_res/(SS_tot + K.epsilon()) )

# Input shape
input_shape = (X_tr.shape[1],)

# Handle outputs
max_epochs = 50
for target in sim_out:
    y_tr = Y_tr[target].values
    y_vl = Y_vl[target].values
    y_ts = Y_ts[target].values

    # Define a Neural Network model (three hidden layers) for the current target
    model = Sequential()
    model.add(Dense(16, activation='relu', input_shape=input_shape))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='linear'))

    # Compile
    model.compile(optimizer='rmsprop',
                  loss='mse',
                  metrics=[r2_score])

    # Setup and perform training
    weight_fname = os.path.join('shared', 'nn_reg_%s.h5' % target)
    callbacks = [EarlyStopping(monitor='val_loss', patience=5),
                 ModelCheckpoint(filepath=weight_fname, monitor='val_loss', save_best_only=True)]
    model.fit(X_tr, y_tr, epochs=max_epochs, batch_size=32, callbacks=callbacks,
              validation_data=(X_vl, y_vl), verbose=0)
    
    # Save the model architecture
    arch_fname = os.path.join('shared', 'nn_reg_%s.json' % target)
    with open(arch_fname, 'w') as fp:
        fp.write(model.to_json())

    # Evaluate on the training, validation and test sets
    print('=== Results for target "{}"'.format(target))
    res_tr = model.evaluate(X_tr, y_tr, batch_size=len(X_tr))
    print('Loss and R2 on the training set: {}, {}'.format(*res_tr))
    res_vl = model.evaluate(X_vl, y_vl, batch_size=len(X_vl))
    print('Loss and R2 on the validation set: {}, {}'.format(*res_vl) )
    res_ts = model.evaluate(X_ts, y_ts, batch_size=len(X_ts))
    print('Loss and R2 on the test set: {}, {}'.format(*res_ts))

It seems we have reached a plateau, but we are still not doing well.

What happened?

Well, several things. Let's start with one of them...

## About Correlations

The two outputs that we care about (zombies and survivors) are quite obviously correlated. If there are no deaths, their sum necessarily equals the total population, i.e. 500 individuals (or 1, after the normalization).

By learning individual models for the two outputs, we are failing to provide the network with this information.
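A quick synthetic check illustrates how strong this coupling is. The data below is randomly generated, and the way deaths are modeled (a random fraction of the non-infected population) is an assumption made purely for the sake of the example; it is not the simulator's behavior.

```python
import numpy as np

rng = np.random.RandomState(42)

# Hypothetical normalized zombie counts in [0, 1]
zombies = rng.uniform(0.0, 1.0, size=1000)

# No deaths: survivors are exactly the rest of the population
survivors_no_deaths = 1.0 - zombies
corr_no_deaths = np.corrcoef(zombies, survivors_no_deaths)[0, 1]

# With deaths (assumed model): only 80-100% of the non-infected survive
survival_rate = rng.uniform(0.8, 1.0, size=1000)
survivors_deaths = (1.0 - zombies) * survival_rate
corr_deaths = np.corrcoef(zombies, survivors_deaths)[0, 1]

print(corr_no_deaths)  # essentially -1: one output determines the other
print(corr_deaths)     # still strongly negative
```

When one output is almost a function of the other, two independent networks each have to rediscover the same structure, whereas a shared network can reuse its hidden representation for both.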

We can try to address this issue by predicting **both outputs at the same time**:

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import backend as K
import tensorflow
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Seed the RNGs
np.random.seed(42)
tensorflow.set_random_seed(42)

# Custom R2 metric (courtesy of https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/34019)
def r2_score(y_true, y_pred):
    SS_res =  K.sum(K.square(y_true - y_pred)) 
    SS_tot = K.sum(K.square(y_true - K.mean(y_true))) 
    return ( 1 - SS_res/(SS_tot + K.epsilon()) )

# Input shape
input_shape = (X_tr.shape[1],)

# Both outputs at the same time
max_epochs = 50

# Define a Neural Network model to predict both outputs (zombies and survivors)
model = Sequential()
model.add(Dense(16, activation='relu', input_shape=input_shape))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='linear'))

# Compile
model.compile(optimizer='rmsprop',
              loss='mse',
              metrics=[r2_score])

# Setup and perform training
weight_fname = os.path.join('shared', 'nn_reg.h5')
callbacks = [EarlyStopping(monitor='val_loss', patience=5),
             ModelCheckpoint(filepath=weight_fname, monitor='val_loss', save_best_only=True)]
model.fit(X_tr, Y_tr, epochs=max_epochs, batch_size=32, callbacks=callbacks,
          validation_data=(X_vl, Y_vl), verbose=0)

# Save the model architecture
arch_fname = os.path.join('shared', 'nn_reg.json')
with open(arch_fname, 'w') as fp:
    fp.write(model.to_json())

# Evaluate on the training, validation and test sets
res_tr = model.evaluate(X_tr, Y_tr, batch_size=len(X_tr))
print('Loss and R2 on the training set: {}, {}'.format(*res_tr))
res_vl = model.evaluate(X_vl, Y_vl, batch_size=len(X_vl))
print('Loss and R2 on the validation set: {}, {}'.format(*res_vl) )
res_ts = model.evaluate(X_ts, Y_ts, batch_size=len(X_ts))
print('Loss and R2 on the test set: {}, {}'.format(*res_ts))

Considerably better! But still not particularly good.

What is going on? Let's go back to the drawing board, i.e. to the data pre-processing step.