# Feed Forward Neural Networks in Keras

From now on, we will work with Keras and neural networks.
This notebook uses the classic [Auto MPG Dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg) and builds a model to predict the fuel efficiency of late-1970s and early-1980s automobiles. To do this, we provide the model with descriptions of many automobiles from that period. Each description includes attributes such as cylinders, displacement, horsepower, and weight.

### Import Libraries
___
To develop our prediction model, we need to import the necessary Python libraries:

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
# Use the public Keras API; the private tensorflow.python.keras modules are not stable across versions
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from IPython.display import set_matplotlib_formats

%matplotlib inline
set_matplotlib_formats('svg')
# Make numpy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)

print('Keras Version: ' + keras.__version__)
print('Tensorflow Version: ' + tf.__version__)

### Load Data
---

In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)

In [None]:
dataset = raw_dataset.copy()
dataset.tail()

### Clean the data
___
The dataset contains a few unknown values:

In [None]:
dataset.isna().sum()

In [None]:
dataset = dataset.dropna()

The `"Origin"` column is really categorical, not numeric, so convert it to a one-hot encoding with `pd.get_dummies`:

In [None]:
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

In [None]:
dataset = pd.get_dummies(dataset, columns=['Origin'], prefix='', prefix_sep='')
dataset.tail()
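To make the idea of one-hot encoding concrete, here is a minimal NumPy sketch on a toy column. The four origin codes are made up for illustration; only the code-to-name mapping is taken from the cell above.

```python
import numpy as np

# Hypothetical origin codes for four cars (not taken from the dataset)
origin_codes = np.array([1, 3, 2, 1])
labels = {1: 'USA', 2: 'Europe', 3: 'Japan'}

# One output column per category, here in alphabetical order
categories = sorted(labels.values())          # ['Europe', 'Japan', 'USA']
names = np.array([labels[c] for c in origin_codes])

# Compare every sample with every category: exactly one 1 per row
one_hot = (names[:, None] == np.array(categories)[None, :]).astype(int)

print(categories)
print(one_hot)
```

Each row contains a single 1 in the column of its category, so the network never sees an artificial ordering such as "Japan > Europe > USA".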

### Split the data into train and test
___

In [None]:
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(dataset.iloc[:,1:], dataset.iloc[:,0], test_size = 0.25, shuffle = True)

### Inspect the data
____
Have a quick look at the joint distribution of a few pairs of columns from the full dataset.

Looking at the top row, it should be clear that the fuel efficiency (MPG) depends on each of the other plotted parameters. Looking at the other rows, it should be clear that these parameters are also strongly related to each other.

In [None]:
sns.set(style="ticks")
sns.pairplot(dataset[['MPG', 'Cylinders', 'Displacement', 'Weight']], diag_kind='kde')

In [None]:
X_train_raw.describe().transpose()

### Standardization
___

In [None]:
from sklearn.preprocessing import StandardScaler
scx = StandardScaler()
scy = StandardScaler()
X_train = scx.fit_transform(X_train_raw)
X_test = scx.transform(X_test_raw)
y_train = (scy.fit_transform(y_train_raw.to_numpy().reshape(-1,1)))
y_test = (scy.transform(y_test_raw.to_numpy().reshape(-1,1)))
y_test.shape
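What the scalers do can be sketched by hand. The following is a small NumPy illustration with made-up numbers: the mean and standard deviation are computed on the training data only and then reused for the test data, which is the same pattern as `fit_transform` on the training set and `transform` on the test set above.

```python
import numpy as np

# Hypothetical feature matrices (two features on very different scales)
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)        # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std      # reuse the training statistics

# Undo the scaling, analogous to inverse_transform
test_restored = test_scaled * std + mean

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled, test_restored)
```

Because the test set is scaled with the training statistics, no information from the test data leaks into the preprocessing.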

### MLP Model
___

In Keras, the `Sequential` API can be used to define a neural network in which layers are stacked on top of each other. Each time `Sequential.add()` is called, a new layer is attached to the end of the network. You only have to define the input shape and the size of each layer; the `Sequential` API infers the shapes between the intermediate layers for you. This API is intuitive and easy to understand. In later tasks we will also cover the usage of the functional API.

The function below shows an example with three fully connected (dense) feed-forward layers with 5, 2 and 1 neurons, respectively. For quick and easy exploration of different layer sizes, we add a variable called `size_fac` that linearly scales the hidden layers. The output layer is not affected, of course.

In [None]:
def build_model(act_fun='relu', loss_fun='mean_absolute_error', size_fac=1):
    mlp = Sequential()
    # Hidden layers scale with size_fac and share the activation act_fun
    mlp.add(Dense(int(size_fac*5), input_shape=(X_train.shape[1],), activation=act_fun, name="hidden1"))
    mlp.add(Dense(int(size_fac*2), activation=act_fun, name="hidden2"))
    # Linear output layer with a single neuron for the regression target
    mlp.add(Dense(1, name="output"))
    mlp.compile(loss=loss_fun, optimizer=tf.keras.optimizers.Adam(0.001),
                metrics=[tf.keras.metrics.RootMeanSquaredError(), 'mean_squared_error', 'mean_absolute_error'])
    return mlp

model = build_model()
plot_model(model, show_shapes=True,show_layer_names=True,dpi=100)

As you can see in this example, the above call builds a neural network that has an input layer for 9 features, followed by two hidden layers and an output layer. Wrapping the construction in the `build_model` function lets us parametrize the model and perform grid searches.
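To see what such a stack of dense layers computes, here is a minimal NumPy sketch of the forward pass for layer sizes 9, 5, 2 and 1. The random weights and the input batch are placeholders for illustration; Keras initializes and trains its own weights.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [9, 5, 2, 1]   # input features, hidden1, hidden2, output

# One weight matrix and one bias vector per dense layer (random placeholders)
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Apply each dense layer; ReLU on the hidden layers, linear output."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

batch = rng.normal(size=(4, 9))      # four hypothetical samples with 9 features
out = forward(batch)

# Trainable parameters: (9*5 + 5) + (5*2 + 2) + (2*1 + 1)
n_params = sum(W.size + b.size for W, b in zip(weights, biases))
print(out.shape, n_params)
```

The parameter count obtained this way is the kind of number that `model.summary()` reports for each layer.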
 
_Remark: We chose 3 layers somewhat arbitrarily. For this particular dataset, it might also be an idea to use a shallower net with more neurons per layer. In feed-forward networks, ReLU is often chosen as the activation function and serves as a good starting point when developing a neural net. The Mean Squared Error (MSE) and Mean Absolute Error (MAE) are common loss functions for regression problems. MAE is less sensitive to outliers than MSE. Different loss functions are used for classification problems._
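The different sensitivity of the two losses to outliers can be checked with a few made-up residuals. The numbers below are purely illustrative.

```python
import numpy as np

# Hypothetical prediction errors, without and with a single outlier
errors_clean = np.array([1.0, -1.0, 1.0, -1.0])
errors_outlier = np.array([1.0, -1.0, 1.0, -10.0])

def mae(e):
    return np.abs(e).mean()

def mse(e):
    return (e ** 2).mean()

print("MAE:", mae(errors_clean), "->", mae(errors_outlier))
print("MSE:", mse(errors_clean), "->", mse(errors_outlier))
```

A single large residual raises the MAE from 1.0 to 3.25, but the MSE from 1.0 to 25.75, because the error is squared before averaging.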

For teaching and small experiments, it is always somewhat hard to provide deep learning examples because of the computational cost. Feel free to adapt parameters and/or code so that the examples run in a reasonable time!

Now that we have the build function, we can build and train a model.

In [None]:
model = build_model(act_fun='relu',loss_fun='mean_absolute_error',size_fac=4)
model.summary()

In [None]:
%%time
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    batch_size=32,
    verbose=1, epochs=40)
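According to the Keras documentation, `validation_split` sets apart the last fraction of the passed samples, before any shuffling, and does not train on it. The NumPy sketch below mimics that behaviour on a toy array; the 8 samples and the split of 0.25 are arbitrary values chosen for illustration.

```python
import math
import numpy as np

X = np.arange(8).reshape(-1, 1)     # eight hypothetical samples
validation_split = 0.25

# The last 25% of the rows are held out for validation
split_at = math.floor(len(X) * (1.0 - validation_split))
X_fit, X_val = X[:split_at], X[split_at:]

print(X_fit.ravel(), X_val.ravel())
```

Since the held-out rows are always the last ones, the data should not be sorted in a meaningful way. Here this is taken care of by the shuffling in `train_test_split`.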

In [None]:
def plot_model_history(history, ax=None, metric='loss', ep_start=1, ep_stop=None,monitor='val_loss', mode='min',plttitle=None):
    if ax is None:
        fig,ax = plt.subplots()
    if ep_stop is None:
        ep_stop = len(history.epoch)
    if plttitle is None:
        plttitle = metric[0].swapcase() + metric[1:] + ' During Training'
    ax.plot(np.arange(ep_start,ep_stop+1, dtype='int'),history.history[metric][ep_start-1:ep_stop])
    ax.plot(np.arange(ep_start,ep_stop+1, dtype='int'),history.history['val_' + metric][ep_start-1:ep_stop])
    ax.set(title=plttitle)
    ax.set(ylabel=metric[0].swapcase() + metric[1:])
    ax.set(xlabel='Epoch')
    ax.legend(['train', 'val'], loc='upper right')

In [None]:
fig, ax = plt.subplots(1,2,figsize=(10,3))
plot_model_history(history,ax=ax[0])
plot_model_history(history, metric='mean_squared_error',ax=ax[1])

### Performance
____
Now that the model is trained, check the test-set performance and see how it did:

In [None]:
test_results = model.evaluate(X_test, y_test, verbose=0)
print('test loss, test metrics (RMSE, MSE, MAE):', test_results)
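Keep in mind that the targets were standardized, so the reported loss and metrics are in units of the target's standard deviation, not in MPG. The following sketch, with made-up values, shows how such numbers could be converted back. In the notebook, the standard deviation would be available as `scy.scale_[0]`.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])   # hypothetical MPG values
y_pred = np.array([12.0, 18.0, 33.0, 39.0])   # hypothetical predictions

mean, sigma = y_true.mean(), y_true.std()

# Errors on the standardized scale, as the model sees them
err_scaled = (y_pred - mean) / sigma - (y_true - mean) / sigma
mae_scaled = np.abs(err_scaled).mean()
mse_scaled = (err_scaled ** 2).mean()

# MAE and RMSE scale linearly with sigma, the MSE with sigma squared
mae_mpg = mae_scaled * sigma
rmse_mpg = np.sqrt(mse_scaled) * sigma
mse_mpg = mse_scaled * sigma ** 2

print(mae_mpg, rmse_mpg, mse_mpg)
```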

Finally, have a look at the errors made by the model when making predictions on the training and test sets:

In [None]:
y_pred_train = scy.inverse_transform(model.predict(X_train))
y_pred_test =  scy.inverse_transform(model.predict(X_test))


# One panel each for the training set (left) and the test set (right)
fig, axes = plt.subplots(1, 2, figsize=(10, 3))

sns.regplot(x=y_train_raw.to_numpy(), y=y_pred_train.reshape(-1), ax=axes[0])
axes[0].set(xlabel='True Values [MPG]', ylabel='Predictions [MPG]')

sns.regplot(x=y_test_raw.to_numpy(), y=y_pred_test.reshape(-1), color='r', ax=axes[1])
axes[1].set(xlabel='True Values [MPG]', ylabel='Predictions [MPG]');

### Monitoring Training, Preventing Overfitting, and Callbacks
___

Overfitting is something that you have to pay close attention to when designing and training neural networks. Especially for simpler problems, bigger networks tend to overfit quickly.
An important tool for monitoring this is to evaluate the model on validation data after each epoch and to plot the training and validation curves over the epochs.

In [None]:
model = build_model(act_fun='relu',loss_fun='mean_absolute_error',size_fac=30)
history = model.fit(X_train, y_train,validation_split=0.25, batch_size=64, epochs=250,verbose=0)

In [None]:
fig, ax = plt.subplots(1,2,figsize=(10,3))
plot_model_history(history, ax=ax[0])
ax[0].set_ylim(0.1,.35)
plot_model_history(history, metric='root_mean_squared_error',ax=ax[1])
ax[1].set_ylim(0.2,.7)

As we see, there is usually a sweet spot between training too little and training too much, and ideally we would want to stop training after reaching it. In Keras, this can be done using a so-called `EarlyStopping` callback. The `callbacks` argument of the `model.fit()` function takes a list of such callback objects, which Keras invokes during training, for example after each epoch. Below we also introduce the `ModelCheckpoint` callback, which enables us to save the best model seen so far to disk. You can also define your very own callbacks.

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

model = build_model(act_fun='relu',loss_fun='mean_absolute_error',size_fac=3)
es = EarlyStopping(monitor='val_loss', mode='min', restore_best_weights=True, verbose=1, patience=15)
mc = ModelCheckpoint("best_epoch.h5", monitor='val_loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [es,mc]
history = model.fit(X_train, y_train, validation_split=0.25, epochs=200, batch_size=32, callbacks=callbacks_list, verbose=0)
results = model.evaluate(X_test, y_test)
print('test loss, test metrics (RMSE, MSE, MAE):', results)
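The patience logic behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras implementation: the function name `early_stopping_epoch` and the validation losses are made up, and options such as `min_delta` are ignored.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None if the patience is never exhausted.
    """
    best = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses per epoch
val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.7]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

In this toy run the best loss occurs in epoch 3 and training stops in epoch 6, after three epochs without improvement. With `restore_best_weights=True`, the weights of the best epoch would then be restored.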

In [None]:
y_pred_train = scy.inverse_transform(model.predict(X_train))
y_pred_test =  scy.inverse_transform(model.predict(X_test))


# One panel each for the training set (left) and the test set (right)
fig, axes = plt.subplots(1, 2, figsize=(10, 3))

sns.regplot(x=y_train_raw.to_numpy(), y=y_pred_train.reshape(-1), ax=axes[0])
axes[0].set(xlabel='True Values [MPG]', ylabel='Predictions [MPG]')

sns.regplot(x=y_test_raw.to_numpy(), y=y_pred_test.reshape(-1), color='r', ax=axes[1])
axes[1].set(xlabel='True Values [MPG]', ylabel='Predictions [MPG]');

### Batch Size and Convergence
___

You may have noticed by now that `model.fit()` in Keras takes a parameter called `batch_size`. The batch size is the number of samples that are passed through the network before a weight update is performed, and it can greatly affect your model's training. On the one hand, increasing the batch size can considerably speed up each epoch; on the other hand, the model can need significantly more epochs to converge if the batch size is too large, because there are fewer weight updates per epoch.
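The relation between batch size and the number of weight updates is simple arithmetic. The sketch below assumes a hypothetical training set of 220 samples; the actual number depends on the split used.

```python
import math

n_samples = 220                    # hypothetical number of training samples
batch_sizes = [1, 8, 32, 128]

# One weight update per batch; the last batch of an epoch may be smaller
updates_per_epoch = {bs: math.ceil(n_samples / bs) for bs in batch_sizes}

for bs, n_updates in updates_per_epoch.items():
    print(f"batch size {bs:>3}: {n_updates:>3} updates per epoch")
```

With a batch size of 1 the weights are updated 220 times per epoch, with a batch size of 128 only twice, which is why larger batches usually need more epochs to reach the same loss.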


In [None]:
import time

batch_sizes = [1,8,32,128]

fig, axs = plt.subplots(nrows=2, ncols=2)
axs = axs.reshape(-1)

ep = 100
training_times = []

for i,bs in enumerate(batch_sizes,0):
    start = time.time()
    print("Training model with batch size: %s" %(bs))
    model = build_model(act_fun='relu',loss_fun='mean_absolute_error',size_fac=3)
    history = model.fit(X_train, y_train,validation_split=0.25, batch_size=bs, epochs=ep,verbose=0)
    plt.sca(axs[i])
    plot_model_history(history, ax=axs[i], ep_stop=ep, plttitle='Loss During Training, BS=' + str(bs))
    training_times.append(time.time()-start)

fig.set_figheight(6)
fig.set_figwidth(10)
fig.tight_layout(pad=1.0)

fig2 = plt.figure(figsize=(5.5,3))
sns.barplot(x=np.arange(0,4),y=training_times)
plt.xticks(ticks=[0,1,2,3], labels=["1","8","32","128"])
plt.xlabel("batch size")
plt.ylabel("Training time in seconds")
plt.title("Training times in relation to batch size");

As you can see, the models behave quite differently. While training times are obviously longer with smaller batch sizes, you can also see how a smaller batch size allows for convergence in fewer epochs. A downside is that the model trained with the smaller batch size tends to overfit more strongly: each update for a larger batch size averages over more different samples, which might lead to better generalization.

There is no right or wrong here. Choosing the right batch size depends on your computing power and the size of your dataset, but common values are 4-16 for smaller datasets and CPU processing, and 64, 128 or 256 for larger datasets and GPU processing.

## Sklearn Wrappers

To integrate our Keras models into the scikit-learn workflow, Keras provides the `KerasRegressor()` wrapper, which enables us to use scikit-learn methods with Keras models.

Here we build a `KerasRegressor()` from our Keras model and train it using early stopping.

In [None]:
from tensorflow.python.keras.wrappers.scikit_learn import KerasRegressor
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


es = EarlyStopping(monitor='val_loss', mode='min', restore_best_weights=True, verbose=1, patience=15)
mc = ModelCheckpoint("best_epoch.h5", monitor='val_loss', verbose=0, save_best_only=True, mode='min')
callbacks_list = [es,mc]

# evaluate model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=build_model, validation_split=0.25, epochs=200, batch_size=32, verbose=1,size_fac=6,act_fun='relu',loss_fun='mean_absolute_error')))
pipeline = Pipeline(estimators)

wrapped_model = TransformedTargetRegressor(
    regressor=pipeline,
    transformer=StandardScaler())

wrapped_model.fit(X_train_raw, y_train_raw, mlp__callbacks=callbacks_list)

Neural networks are mostly used in deep learning, where huge amounts of training data are processed. In that setting, cross-validation or an exhaustive grid search is usually not feasible. With the sklearn wrapper and our lightweight model, however, we can even afford a cross-validation.

In [None]:
kfold = KFold(n_splits=5)

results = cross_val_score(wrapped_model, X_train_raw, y_train_raw, scoring='neg_root_mean_squared_error', cv=kfold, fit_params={'mlp__callbacks' : callbacks_list})
print("RMSE: %.2f (+/- %.2f)" % (-(results.mean()), results.std()))
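How a 5-fold split partitions the data can be sketched with NumPy. The example assumes contiguous, unshuffled folds, as with `KFold(n_splits=5)` above, and uses a hypothetical set of 12 samples. It illustrates the idea and is not scikit-learn's implementation.

```python
import numpy as np

n_samples, n_splits = 12, 5        # hypothetical sizes for illustration
indices = np.arange(n_samples)

# Contiguous folds; if the data does not divide evenly, the first folds get one extra sample
folds = np.array_split(indices, n_splits)

splits = []
for k in range(n_splits):
    val_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
    splits.append((train_idx, val_idx))

for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample is used for validation exactly once, and the model is retrained from scratch for each of the five splits, which is why cross-validation multiplies the training cost.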

In [None]:
wrapped_model.regressor_['mlp'].get_params()