This notebook provides a tool to **visualize the neural network prediction map during backpropagation**. It lets you change the training hyperparameters and see their impact on the model's convergence.

For any suggestions or comments, please contact: *etienne.demontalivet(at)gmail.com*

#### How to use:

Some of the hyperparameters are set in the *hp* dictionary. After executing a few cells to initialize the callbacks and the neural network, you can execute the main cell to visualize how backpropagation shapes the decision boundary of the NN. After updating the *hp* parameters, just execute the main cell again to see the impact of the update.

At the end of the notebook you'll find some interesting sets of hyperparameters.

Enjoy.

In [None]:
# hyperparameters
hp = {
    'lr':0.1,            # learning rate
    'decay':0,           # decay 
    'momentum':0.9,      # momentum
    'patience':50,       # number of epochs to wait before reducing the learning rate (see callback)
    'epochs':100,        # number of epochs for the training
    'n1':10,             # number of neurons in the first hidden layer
    'n2':10,             # number of neurons in the second hidden layer
    'n3':0,              # number of neurons in the third hidden layer
    'kr':0,              # kernel regularization
    'activation':'lrelu',# Type of activation to use in the hidden layers
    'dataset':1          # dataset to use (see below)
}

In [None]:
%matplotlib notebook

There are 4 datasets:
- 1: Moons
- 2: Circles
- 3: Blobs
- 4: Tricky one (two `make_classification` sets concatenated)

In [None]:
# Generate fake dataset:
import numpy as np
from sklearn.datasets import make_moons, make_circles, make_blobs, make_classification

def get_dataset(case: int):
    if case == 1:
        X, y = make_moons(noise=0.1, random_state=42)
    elif case == 2:
        X, y = make_circles(noise=0.3, factor=0.5, random_state=40)
    elif case == 3:
        X, y = make_blobs(centers=2, n_features=2, random_state=42)
    else:
        X1, y1 = make_classification(n_classes=2, n_features=2, n_redundant=0, n_informative=2,
                                   random_state=1, n_clusters_per_class=2)
        X2, y2 = make_classification(n_classes=2, n_features=2, n_redundant=0, n_informative=2,
                                   random_state=2, n_clusters_per_class=2)
        X = np.concatenate([X1, X2])
        y = np.concatenate([y1, y2])
    
    return X, y

We use a Dense Neural Network:
- 2 inputs (2 dimensions for visualization purpose)
- up to 3 hidden layers (the number of neurons per layer can be changed; the second and third layers are skipped when set to 0 neurons)
- 1 output: we only use 1 output with a sigmoid activation as we only have 2 classes

We use the common *binary-crossentropy* / *sigmoid* pair for the output. We set the SGD *decay* to 0, so that the *learning rate* is decreased only by the callback described further down.
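
As a rough illustration of what this pair computes, here is a minimal pure-Python sketch for a single sample. It is not the Keras implementation: the helper names `sigmoid` and `binary_crossentropy` and the clipping value `eps` are assumptions made for this example.

```python
import math

def sigmoid(z: float) -> float:
    """Map a raw output score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true: int, p: float, eps: float = 1e-7) -> float:
    """Loss for one sample: -[y*log(p) + (1-y)*log(1-p)].

    p is clipped away from 0 and 1 to avoid log(0); eps is an assumed value.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

p = sigmoid(2.0)                        # the network leans towards class 1
loss_right = binary_crossentropy(1, p)  # true label agrees: small loss
loss_wrong = binary_crossentropy(0, p)  # true label disagrees: large loss
print(f"p={p:.3f}  loss if y=1: {loss_right:.3f}  loss if y=0: {loss_wrong:.3f}")
```

The loss grows quickly when the predicted probability is confidently wrong, which is what pushes the decision boundary during training.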

In [None]:
# MODEL
# We use a Dense Neural Net with 2 inputs, up to 3 hidden layers and one output
import keras  # needed below for keras.optimizers.SGD
from keras.models import Sequential
from keras.layers import Dense, LeakyReLU, ReLU, Activation
from keras import regularizers

N_HIDDEN = 3

def create_model(l1_neurons: int, l2_neurons: int, l3_neurons: int, kernel_reg: float, activation: str='relu'):
    # Get model parameters
    kr = regularizers.l2(kernel_reg)
    if activation == 'relu':
        act_layer = [ Activation('relu') for i in range(N_HIDDEN) ]
    elif activation == 'lrelu':
        act_layer = [ LeakyReLU(alpha=0.3) for i in range(N_HIDDEN) ]
    elif activation == 'sigmoid':
        act_layer = [ Activation('sigmoid') for i in range(N_HIDDEN) ]
    elif activation == 'tanh':
        act_layer = [ Activation('tanh') for i in range(N_HIDDEN) ]
    else:
        raise ValueError(str(activation) + " is not yet supported. Please implement it or choose between ['relu', 'lrelu', 'sigmoid', 'tanh']")
    
    # Create model
    model = Sequential()
    model.add(Dense(l1_neurons, input_dim=2, kernel_regularizer=kr))
    model.add(act_layer[0])
    if l2_neurons > 0:
        model.add(Dense(l2_neurons, kernel_regularizer=kr))
        model.add(act_layer[1])
    if l3_neurons > 0:
        model.add(Dense(l3_neurons, kernel_regularizer=kr))
        model.add(act_layer[2])
    model.add(Dense(1, activation='sigmoid'))
    
    # Optimizer
    optim = keras.optimizers.SGD(lr=hp['lr'], decay=hp['decay'], momentum=hp['momentum'], nesterov=False)
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=optim, metrics=['accuracy'])
    
    return model


The callback below evaluates the predicted probability at each mesh-grid point and associates a color with it. It is called at the end of each epoch, so we can visualize the convergence.
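
The mesh-grid evaluation can be sketched on its own, without Keras or a figure. In the snippet below, `toy_proba` is a hypothetical stand-in for the model's probability output, and the grid bounds are arbitrary values chosen for the example.

```python
import numpy as np

def toy_proba(points: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the model: sigmoid of x0 + x1, shape (n, 1)."""
    z = points[:, 0] + points[:, 1]
    return (1.0 / (1.0 + np.exp(-z))).reshape(-1, 1)

h = 0.5  # step size in the mesh
xx, yy = np.meshgrid(np.arange(-1.0, 1.0, h),
                     np.arange(-2.0, 2.0, h))

# Flatten the grid into a list of (x, y) points, one row per point
grid = np.c_[xx.ravel(), yy.ravel()]

# Predict once for all points, then fold the result back into the grid shape
Z = toy_proba(grid).reshape(xx.shape)

print(grid.shape, Z.shape)
```

`Z` has the same shape as `xx` and `yy`, which is what `contourf` expects when drawing the colored map.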

In [None]:
import keras
import keras.backend as K
import time

class PlotCallback(keras.callbacks.Callback):
    '''
    Keras callback to draw the prediction map at the end of each epoch
    '''
    def __init__(self, total_epochs: int):
        super().__init__()
        self.tot_epochs = total_epochs
        self.contours = None
        self.text = None
        self.lr = K.eval(model.optimizer.lr)  # relies on the global `model` created in the main cell

    def on_train_begin(self, logs=None):
        self.nb_epochs = 0
        
        # Get the proba of each mesh grid point
        Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])  # [:, 1]
        # Put the result into a color plot
        Z = Z.reshape(xx.shape)

        self.contours = ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)        

        # Evaluate the current score
        score = model.evaluate(X_test, y_test)

        # Update text on figure
        if self.text is not None:
            self.text.remove()
        self.text = ax.text(xx.max() - 0.1, yy.min() - 0.5, ('epochs: %d/%d - loss:%.2f - acc:%.2f - lr:%.2e' % (self.nb_epochs, self.tot_epochs, score[0], score[1], K.eval(model.optimizer.lr))).lstrip('0'),
                size=12, horizontalalignment='right')
        
        fig.canvas.draw()
        # Wait for keyboard input before starting fitting
        input()

    def on_epoch_end(self, epoch, logs=None):
        self.nb_epochs = self.nb_epochs + 1

        # Get the proba of each mesh grid point
        Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])  # [:, 1]
        # Put the result into a color plot
        Z = Z.reshape(xx.shape)

        # Remove the old contour map
        for co in self.contours.collections:
            co.remove()
        self.contours = ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)        

        # Evaluate the current score
        score = model.evaluate(X_test, y_test)

        # Update text on figure
        if self.text is not None:
            self.text.remove()
        self.text = ax.text(xx.max() - 0.1, yy.min() - 0.5, ('epochs: %d/%d - loss:%.2f - acc:%.2f - lr:%.2e' % (self.nb_epochs, self.tot_epochs, score[0], score[1], K.eval(model.optimizer.lr))).lstrip('0'),
                size=12, horizontalalignment='right')
        
        fig.canvas.draw()
        
        if self.nb_epochs == hp['epochs']:
            time.sleep(2) # for video recording purpose

The following callback multiplies the *learning rate* by *factor* when the monitored loss has not improved by at least *min_delta* for *patience* epochs. It can also help limit overfitting (try removing it to see the difference).
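
To make the rule concrete, here is a simplified pure-Python sketch of the reduce-on-plateau logic. It is an approximation written for this notebook, not the Keras source: the function name `reduce_on_plateau` is hypothetical, and the `cooldown` and `mode` options are ignored.

```python
def reduce_on_plateau(losses, lr, factor=0.5, patience=3,
                      min_delta=1e-4, min_lr=4e-5):
    """Return the learning rate in effect after each epoch.

    Simplified sketch: the loss must drop by more than min_delta to count
    as an improvement; after `patience` epochs without improvement the
    learning rate is multiplied by `factor` (never going below min_lr).
    """
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        if loss < best - min_delta:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# The loss stalls at 0.7 for three epochs, so the rate is halved once
lr_history = reduce_on_plateau([1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6], lr=0.1)
print(lr_history)
```

With a large *patience* (for example equal to the number of epochs), the reduction never triggers and the learning rate stays constant.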

In [None]:
# CALLBACKS
from keras.callbacks import ReduceLROnPlateau

def get_callbacks(n_epochs: int):
    reduceOnPlateau = ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=hp['patience'],
        verbose=0,
        mode='auto',
        min_delta=0.0001,
        cooldown=0,
        min_lr=4e-5
    )
    
    return [PlotCallback(n_epochs), reduceOnPlateau]

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
from mpl_toolkits.axes_grid1.inset_locator import inset_axes

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

Here comes the **main cell**. There are a lot of settings... Note that the last few lines contain the other NN fitting parameters (number of epochs, batch size, ...). Change the hyperparameters and look at the impact on the convergence.

In [None]:
%%capture --no-display

h = .2  # step size in the mesh. Increase it if your CPU load is too high

X, y = get_dataset(hp['dataset'])

fig = plt.figure(figsize=(10,5))
ax = fig.add_axes([0.05, 0.1, 0.7, 0.8])

X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=.4, random_state=42)

# Get the mesh coordinates
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Color(map) settings
cm = plt.cm.RdBu
cm_r = '#FF0000'
cm_b = '#0000FF'
cm_bright = ListedColormap([cm_r, cm_b])

# Plot the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
           edgecolors='k', zorder=2.5)
# Plot the testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
           edgecolors='k', marker="s", zorder=2.5)

# Set axis limits
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())

# Legend
legend_elements = [Line2D([0], [0], marker='o', color='black', markerfacecolor=cm_b, label='class 1 X_train'),
                   Line2D([0], [0], marker='s', color='black', markerfacecolor=cm_b, label='class 1 X_test'),
                   Line2D([0], [0], marker='o', color='black', markerfacecolor=cm_r, label='class 2 X_train'),
                   Line2D([0], [0], marker='s', color='black', markerfacecolor=cm_r, label='class 2 X_test')]
ax.legend(handles=legend_elements, bbox_to_anchor=(1.28, 0.95))

# Title
ax.set(title='Neural Network prediction map')
norm = mpl.colors.Normalize(vmin=0, vmax=100)

# Personal text
ax.text(xx.min()+0.01*(xx.max()-xx.min()), yy.min()+0.01*(yy.max()-yy.min()),('github.com/etiennedemontalivet/ML-tools'))

# Colorbars
axins = inset_axes(ax,
    width="2%",  # width = 2% of parent_bbox width
    height="50%",  # height : 50%
    loc='lower left',
    bbox_to_anchor=(1.2, 0.12, 1, 1),
    bbox_transform=ax.transAxes,
    borderpad=0,
)
axins2 = inset_axes(ax,
    width="2%",  # width = 2% of parent_bbox width
    height="50%",  # height : 50%
    loc='lower left',
    bbox_to_anchor=(1.04, 0.12, 1, 1),
    bbox_transform=ax.transAxes,
    borderpad=0,
)
plt.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=plt.cm.Blues), cax=axins2, orientation='vertical', label='% prediction class 1')
plt.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=plt.cm.Reds), cax=axins, orientation='vertical', label='% prediction class 2')

# Show figure
fig.show()
fig.canvas.draw()

# create and fit the model
model = create_model(l1_neurons=hp['n1'], l2_neurons=hp['n2'], l3_neurons=hp['n3'], kernel_reg=hp['kr'], activation=hp['activation'])
n_epochs = hp['epochs']
# model.load_weights('test')
# model.save_weights('test')
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=10, epochs=n_epochs, verbose=0, shuffle=True, callbacks=get_callbacks(n_epochs=n_epochs))

# Close figure afterwards
plt.close()

In [None]:
# Change hyperparameters here and execute main cell again
hp = {
    'lr':0.07,           # learning rate
    'decay':0.,          # decay 
    'momentum':0.99,     # momentum
    'patience':50,       # number of epochs to wait before reducing the learning rate (see callback)
    'epochs':400,        # number of epochs for the training
    'n1':10,             # number of neurons in the first hidden layer
    'n2':10,             # number of neurons in the second hidden layer
    'n3':0,              # number of neurons in the third hidden layer
    'kr':0,              # kernel regularization
    'activation':'lrelu',# Type of activation to use in the hidden layers
    'dataset':2          # dataset to use
}

**Note:**

This notebook is meant as a tool to visualize backpropagation and is not 100% mathematically rigorous. Some shortcuts have been taken for simplicity.

### Specific set of parameters 

Here are a few examples of hyperparameters with their effects. To visualize one, execute the corresponding cell and then execute the main cell.

#### Make your gradient explode

If the learning rate is too high, the updates overshoot, the gradients explode and the network stops learning.
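
The same effect can be seen on a toy problem, far simpler than the network above. The sketch below runs plain gradient descent on the 1D function f(w) = 0.5 * curvature * w², where the `curvature` value and the function name `gradient_descent` are assumptions made for this illustration.

```python
def gradient_descent(lr: float, curvature: float = 10.0,
                     w0: float = 1.0, steps: int = 20) -> float:
    """Minimize f(w) = 0.5 * curvature * w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = curvature * w   # derivative of f at w
        w = w - lr * grad      # each step multiplies w by (1 - lr * curvature)
    return w

w_small = gradient_descent(lr=0.05)  # |1 - 0.5| < 1: w shrinks towards 0
w_large = gradient_descent(lr=0.25)  # |1 - 2.5| > 1: w grows at every step
print(f"lr=0.05 -> w={w_small:.2e}   lr=0.25 -> w={w_large:.2e}")
```

On this toy function the iteration diverges as soon as the learning rate exceeds 2 / curvature. A real network has no single such threshold, but the mechanism is similar.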

In [None]:
hp = {
    'lr':0.5,            # learning rate
    'decay':0,           # decay 
    'momentum':0.9,      # momentum
    'patience':50,       # number of epochs to wait before reducing the learning rate (see callback)
    'epochs':150,        # number of epochs for the training
    'n1':10,             # number of neurons in the first hidden layer
    'n2':10,             # number of neurons in the second hidden layer
    'n3':0,              # number of neurons in the third hidden layer
    'kr':0,              # kernel regularization
    'activation':'lrelu',# Type of activation to use in the hidden layers
    'dataset':1          # dataset to use
}

#### Momentum

Compare without and with momentum. All other parameters are fixed.
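
Before running the cells, the idea can be sketched on a toy 1D quadratic with a small curvature (a flat valley). This is an illustration under assumed values, using the classical momentum update `v = momentum * v - lr * grad; w = w + v`; the function name `sgd` and the `curvature` value are hypothetical.

```python
def sgd(lr: float, momentum: float, curvature: float = 0.1,
        w0: float = 1.0, steps: int = 150) -> float:
    """Minimize f(w) = 0.5 * curvature * w**2 with (optional) momentum."""
    w, v = w0, 0.0
    for _ in range(steps):
        grad = curvature * w
        v = momentum * v - lr * grad  # velocity accumulates past gradients
        w = w + v
    return w

w_plain = sgd(lr=0.07, momentum=0.0)
w_mom = sgd(lr=0.07, momentum=0.9)
print(f"without momentum: {w_plain:.4f}   with momentum: {w_mom:.6f}")
```

With the same learning rate and number of steps, the run with momentum ends much closer to the minimum at 0, because the accumulated velocity makes larger steps along directions where the gradient keeps the same sign.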

In [None]:
# without
hp = {
    'lr':0.07,           # learning rate
    'decay':0,           # decay 
    'momentum':0,        # WITHOUT momentum
    'patience':50,       # number of epochs to wait before reducing the learning rate (see callback)
    'epochs':150,        # number of epochs for the training
    'n1':10,             # number of neurons in the first hidden layer
    'n2':10,             # number of neurons in the second hidden layer
    'n3':0,              # number of neurons in the third hidden layer
    'kr':0,              # kernel regularization
    'activation':'lrelu',# Type of activation to use in the hidden layers
    'dataset':1          # dataset to use
}

In [None]:
# with
hp = {
    'lr':0.07,           # learning rate
    'decay':0,           # decay 
    'momentum':0.9,      # WITH momentum
    'patience':50,       # number of epochs to wait before reducing the learning rate (see callback)
    'epochs':150,        # number of epochs for the training
    'n1':10,             # number of neurons in the first hidden layer
    'n2':10,             # number of neurons in the second hidden layer
    'n3':0,              # number of neurons in the third hidden layer
    'kr':0,              # kernel regularization
    'activation':'lrelu',# Type of activation to use in the hidden layers
    'dataset':1          # dataset to use
}

Now, to achieve good convergence without momentum, we have to increase the number of epochs and the patience (so that the learning rate is not decreased too quickly)!

In [None]:
# without
hp = {
    'lr':0.07,           # learning rate
    'decay':0,           # decay 
    'momentum':0,        # WITHOUT momentum
    'patience':500,      # number of epochs to wait before reducing the learning rate (see callback)
    'epochs':500,        # number of epochs for the training
    'n1':10,             # number of neurons in the first hidden layer
    'n2':10,             # number of neurons in the second hidden layer
    'n3':0,              # number of neurons in the third hidden layer
    'kr':0,              # kernel regularization
    'activation':'lrelu',# Type of activation to use in the hidden layers
    'dataset':1          # dataset to use
}

#### Overfitting

In [None]:
hp = {
    'lr':0.002,          # learning rate
    'decay':0,           # decay 
    'momentum':0.999,    # momentum
    'patience':50,       # number of epochs to wait before reducing the learning rate (see callback)
    'epochs':500,        # number of epochs for the training
    'n1':50,             # number of neurons in the first hidden layer
    'n2':50,             # number of neurons in the second hidden layer
    'n3':50,             # number of neurons in the third hidden layer
    'kr':0,              # kernel regularization
    'activation':'lrelu',# Type of activation to use in the hidden layers
    'dataset':2          # dataset to use
}

<img src="NN_overfitting.gif" width="600">