# Chapter 11 - Training Deep Neural Networks

### Vanishing or exploding gradients

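For gradients to flow properly, they need to have equal variance before and after flowing through a layer in the reverse direction. This can only be guaranteed when the layer has the same number of inputs and neurons ($fan_{in} = fan_{out}$). Glorot initialization is a good compromise in practice: the connection weights of each layer are initialized randomly, using one of the following: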

- Normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{fan_{avg}}$
- Uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$

Where $fan_{avg} = \frac{fan_{in}+fan_{out}}{2}$
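
As a quick numeric check of the two formulas above, here is a minimal sketch in plain Python. The helper name `glorot_params` and the layer sizes are illustrative assumptions, not part of Keras.

```python
import math

def glorot_params(fan_in, fan_out):
    """Return (sigma, r) for the Glorot normal and uniform distributions."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = math.sqrt(1 / fan_avg)  # standard deviation of the normal distribution
    r = math.sqrt(3 / fan_avg)      # limit of the uniform distribution
    return sigma, r

# Hypothetical layer with 784 inputs and 300 neurons
sigma, r = glorot_params(fan_in=784, fan_out=300)
print(sigma, r)
```

Both options give the weights the same variance, since a uniform distribution over $[-r, +r]$ has variance $r^2/3$.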

Other initialization strategies exist; the table below shows when to use each one:

<center><img src="img/initialization.png"></img></center>

Keras uses Glorot initialization by default. To change it, set the layer's _kernel_initializer_ argument, for example to _"he_normal"_ or _"he_uniform"_.

In [None]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")
# He initialization with a uniform distribution, based on fan_avg rather than fan_in
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg',
                                                 distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

In [None]:
# Leaky ReLU layer
model = keras.models.Sequential([
    ...
    keras.layers.LeakyReLU(alpha=0.2),
    ...
])

In [None]:
# PReLU layer: alpha is learned during training (best suited to big training sets)
model = keras.models.Sequential([
    ...
    keras.layers.PReLU(),
    ...
])

In [None]:
# SELU (scaled ELU)
# for z < 0: scale * alpha * (exp(z) - 1);  for z >= 0: scale * z
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")

### Batch Normalization

Batch Normalization (BN) adds an operation to the model just before or after the activation function of each hidden layer. The operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling and one for shifting. In other words, the model learns the optimal scale and mean of each of the layer's inputs.

__Batch Normalization Algorithm__

<center><img src="img/batchN.png"></img></center>

Where:
- $\mu_B$ - vector of input means, evaluated over the whole mini-batch $B$ (one mean per input)
- $\sigma_B$ - vector of input standard deviations
- $m_B$ - # of instances per mini-batch
- $\hat{x}^{(i)}$ - vector of zero-centered and normalized inputs for instance $i$
- $\gamma$ - output scale parameter vector for the layer
- $\bigotimes$ - element-wise multiplication
- $\beta$ - output shift (offset) parameter vector for the layer. Each input is offset by its corresponding shift parameter.
- $\epsilon$ - smoothing term, a tiny number (typically $10^{-5}$) that avoids division by zero
- $z^{(i)}$ - output of the BN operation. It is the rescaled and shifted version of the inputs.
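
The quantities listed above can be put together in a small sketch of the training-time BN computation. It is written in plain Python for a mini-batch given as a list of rows; the function name `batch_norm` and the toy batch are assumptions made for illustration, and the code is not how Keras implements the layer.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Zero-center and normalize each input over the mini-batch, then scale and shift."""
    m = len(batch)     # number of instances in the mini-batch
    n = len(batch[0])  # number of inputs
    mu = [sum(row[j] for row in batch) / m for j in range(n)]
    var = [sum((row[j] - mu[j]) ** 2 for row in batch) / m for j in range(n)]
    out = []
    for row in batch:
        x_hat = [(row[j] - mu[j]) / math.sqrt(var[j] + eps) for j in range(n)]
        out.append([gamma[j] * x_hat[j] + beta[j] for j in range(n)])
    return out

# Toy mini-batch: 3 instances, 2 inputs with very different scales
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
z = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(z)
```

With $\gamma = 1$ and $\beta = 0$ the two columns end up on the same scale even though the raw inputs differ by a factor of 10.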

Most implementations estimate the final input means and variances with a moving average of the layer's $\mu$ and $\sigma^2$ computed during training. These estimates are only used after training, in place of the batch statistics, while $\beta$ and $\gamma$ are learned through regular backpropagation.

BN also acts as a regularizer, reducing the need for other regularization techniques. It strongly reduces the vanishing gradients problem, makes the network less sensitive to the weight initialization, and allows larger learning rates and even saturating activation functions.

BN makes the model more computationally demanding, so predictions are slower. However, after training, the BN layer can be removed by replacing the previous layer's weights and biases with updated ones that directly produce outputs of the appropriate scale and offset. TFLite does this automatically.

In [2]:
import tensorflow as tf
from tensorflow import keras

In [3]:
# Keras example
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])



In [4]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1

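$\mu$ and $\sigma$ are not affected by backpropagation, so Keras counts them as non-trainable params in the summary. Each BN layer adds four parameters per input: $\gamma$, $\beta$, $\mu$ and $\sigma$. Summing the parameters of the three BN layers gives $3136 + 1200 + 400 = 4736$; half of them, $2368$, are the non-trainable $\mu$ and $\sigma$, and the other half are the trainable $\gamma$ and $\beta$.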

In [5]:
# Let's prove it
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

In [None]:
# Adding the BN layers before the activation function (depends on task)
# Keras example
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    # no activation function in the Dense layer, and no bias term (use_bias=False)
    # because the BN layer already includes one offset parameter per input
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    # add it after the BN layer
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax"),
])

One hyperparameter to tweak is the _momentum_, which the layer uses to update its exponential moving averages. A good value is typically close to 1, such as 0.9, 0.99 or 0.999 (use more 9s for larger datasets and smaller mini-batches).
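
To make the role of the momentum concrete, here is a minimal sketch of an exponential moving average update in plain Python. The function name `update_moving_average` and the constant batch mean of 5.0 are assumptions made for illustration.

```python
def update_moving_average(v_hat, v, momentum=0.99):
    """Blend the running estimate v_hat with the new batch statistic v."""
    return v_hat * momentum + v * (1 - momentum)

# The running mean starts at 0 and slowly drifts toward the batch mean
running_mean = 0.0
for step in range(500):
    running_mean = update_moving_average(running_mean, 5.0, momentum=0.99)
print(running_mean)
```

The closer the momentum is to 1, the smoother but slower the estimate, which is why small or noisy mini-batches call for more 9s.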

Another important hyperparameter is _axis_, which determines the axis to be normalized.
- 2D [batch_size, features]: with the default _axis=-1_, each feature is normalized using the mean and standard deviation computed across all instances in the batch.
- 3D [batch_size, height, width]: with the default _axis=-1_, the layer computes one mean and one standard deviation per column of pixels (across all instances and all rows in the column). _axis=[1, 2]_ treats every pixel independently.

### Gradient Clipping

In [None]:
# Clip the gradients during training so they never exceed some threshold.
optimizer = keras.optimizers.SGD(clipvalue=1.0) # every component clipped to [-1, 1]
model.compile(loss="mse", optimizer=optimizer)
# clipvalue may change the orientation: [0.9, 100] becomes [0.9, 1.0].
# clipnorm clips the whole gradient by its l2 norm instead. E.g. with clipnorm=1,
# gradient_vector=[0.9, 100] becomes [0.00899964, 0.9999595]: the orientation is
# preserved, but the first component is almost eliminated
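
The difference between the two clipping modes can be checked by hand. The sketch below reimplements both in plain Python; the helper names `clip_by_value` and `clip_by_norm` are illustrative and are not the Keras internals.

```python
import math

def clip_by_value(grad, threshold):
    """Clip every component independently to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    """Rescale the whole vector if its l2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)  # orientation changes
by_norm = clip_by_norm(grad, 1.0)    # orientation is preserved
print(by_value, by_norm)
```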

### Transfer Learning

In [None]:
# Load the pretrained model A. A model built directly on model_A.layers would
# share those layers, so training it would also modify model A
model_A = keras.models.load_model("my_model_A.h5")
# To avoid this, clone model A's architecture and copy its weights
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())
# dropping the last one
model_B_on_A = keras.models.Sequential(model_A_clone.layers[:-1])
# new output layer
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

In [None]:
# Freezing all layers except last
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False
# Always compile after freezing it
model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])

In [None]:
# First train for a few epochs with the reused layers frozen, so the new
# output layer can learn reasonable weights
history = model_B_on_A.fit(X_train, y_train, epochs=4,
                           validation_data=(X_valid, y_valid))
# Then unfreeze the reused layers to fine-tune them for task B
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

# reduce the lr to avoid damaging the reused weights,
# and compile again after unfreezing
optimizer = keras.optimizers.SGD(learning_rate=1e-4)
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
                     metrics=["accuracy"])

history = model_B_on_A.fit(X_train, y_train, epochs=16,
                           validation_data=(X_valid, y_valid))

### Unsupervised Pretraining
This was originally done with restricted Boltzmann machines; nowadays GANs and autoencoders are used instead. The model is first trained on the unlabeled data using unsupervised learning, then fine-tuned for the final task on the labeled data using supervised learning.

### Pretraining on an auxiliary task
If we don't have much labeled training data, we can train a first neural network on an auxiliary task using similar data, then reuse its first layers for the real task.

### Faster optimizers

__Momentum Optimization:__ It takes the previous gradients into account: at each iteration it subtracts the local gradient (multiplied by the learning rate $\eta$) from the _momentum vector_ $m$, and updates the weights by adding $m$. The momentum hyperparameter $\beta$ acts as friction and prevents the momentum from growing too large: it ranges from 0 (high friction) to 1 (no friction) and is usually set to 0.9.
1. $m \leftarrow \beta m - \eta\Delta_\theta J(\theta) $
2. $\theta \leftarrow \theta + m $

Momentum helps the optimizer roll past local optima and plateaus, and reach the optimum faster than regular Gradient Descent.
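
The two update steps translate almost line by line into code. This is a minimal sketch on a one-dimensional toy cost $J(\theta) = \theta^2$, whose gradient is $2\theta$; the function name `momentum_descent`, the starting point and the hyperparameter values are assumptions chosen for illustration.

```python
def momentum_descent(grad_fn, theta, eta=0.1, beta=0.9, n_steps=200):
    """Momentum optimization for a scalar parameter."""
    m = 0.0
    for _ in range(n_steps):
        m = beta * m - eta * grad_fn(theta)  # step 1: update the momentum vector
        theta = theta + m                    # step 2: update the parameter
    return theta

# Toy cost J(theta) = theta**2, gradient 2 * theta, minimum at 0
theta_final = momentum_descent(lambda t: 2 * t, theta=5.0)
print(theta_final)
```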

__Nesterov Accelerated Gradient:__ It measures the gradient of the cost function not at the local position $\theta$, but slightly ahead in the direction of the momentum, at $\theta + \beta m$. This works because the momentum vector generally points toward the optimum, so a gradient measured a bit farther in that direction is slightly more accurate.

1. $m \leftarrow \beta m - \eta\Delta_\theta J(\theta +\beta m) $
2. $\theta \leftarrow \theta + m $
<center><img src="img/momentum.png"></img></center>
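
In code, the only change with respect to plain momentum is where the gradient is evaluated. The sketch below uses the same kind of toy setup, a scalar cost $J(\theta) = \theta^2$; the function name `nesterov_descent` and the hyperparameter values are illustrative assumptions.

```python
def nesterov_descent(grad_fn, theta, eta=0.1, beta=0.9, n_steps=200):
    """Nesterov accelerated gradient for a scalar parameter."""
    m = 0.0
    for _ in range(n_steps):
        # the gradient is measured slightly ahead, at theta + beta * m
        m = beta * m - eta * grad_fn(theta + beta * m)
        theta = theta + m
    return theta

# Toy cost J(theta) = theta**2, gradient 2 * theta, minimum at 0
theta_nag = nesterov_descent(lambda t: 2 * t, theta=5.0)
print(theta_nag)
```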

__AdaGrad:__ It corrects the direction early so that it points more toward the global optimum, by scaling down the gradient vector along the steepest dimensions.
1. $s \leftarrow s + \Delta_\theta J(\theta) \bigotimes \Delta_\theta J(\theta)$
2. $\theta \leftarrow \theta - \eta\Delta_\theta J(\theta) \oslash \sqrt{s + \epsilon} $

The first step accumulates the squares of the gradients into the vector $s$: each $s_i$ accumulates the squares of the partial derivative of the cost function with regard to parameter $\theta_i$. If the cost function is steep along the $i^{th}$ dimension, then $s_i$ will get larger at each iteration.

The second step is like Gradient Descent, but the gradient vector is scaled down by $\sqrt{s + \epsilon}$ ($\oslash$ is the element-wise division). $\epsilon$ is a smoothing term that avoids division by 0, usually $10^{-10}$.
<center><img src="img/gradientd.png"></img></center>

The learning rate gets scaled down so much that the algorithm often stops too early, before reaching the global optimum, so it isn't recommended for training deep neural networks.
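
Here is a minimal sketch of the two AdaGrad steps in plain Python, applied to an elongated bowl $J(\theta) = \theta_1^2 + 10\,\theta_2^2$. The function name `adagrad`, the toy cost and the hyperparameter values are assumptions chosen for illustration.

```python
import math

def adagrad(grad_fn, theta, eta=0.5, eps=1e-10, n_steps=100):
    """AdaGrad on a list of parameters; returns the final parameters and s."""
    s = [0.0] * len(theta)
    for _ in range(n_steps):
        g = grad_fn(theta)
        # step 1: accumulate the squared gradients, element-wise
        s = [si + gi * gi for si, gi in zip(s, g)]
        # step 2: gradient step scaled down by sqrt(s + eps), element-wise
        theta = [ti - eta * gi / math.sqrt(si + eps)
                 for ti, gi, si in zip(theta, g, s)]
    return theta, s

# Gradient of the elongated bowl: steep along the second dimension
bowl_grad = lambda t: [2 * t[0], 20 * t[1]]
theta_ada, s_ada = adagrad(bowl_grad, [1.0, 1.0])
print(theta_ada, s_ada)
```

The steep dimension accumulates a much larger $s_i$, so its updates are scaled down more, which is the correction of direction described above.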

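__RMSProp:__ It fixes AdaGrad's early slowdown by accumulating only the gradients from the most recent iterations, using exponential decay in the first step. The decay rate $\beta$ is usually set to 0.9.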
1. $s \leftarrow \beta s + (1-\beta)\Delta_\theta J(\theta) \bigotimes \Delta_\theta J(\theta)$
2. $\theta \leftarrow \theta - \eta\Delta_\theta J(\theta) \oslash \sqrt{s + \epsilon} $
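
The same two steps can be sketched for a scalar parameter. The function name `rmsprop`, the toy cost $J(\theta) = \theta^2$ and the hyperparameter values are illustrative assumptions.

```python
import math

def rmsprop(grad_fn, theta, eta=0.01, beta=0.9, eps=1e-10, n_steps=500):
    """RMSProp for a scalar parameter."""
    s = 0.0
    for _ in range(n_steps):
        g = grad_fn(theta)
        s = beta * s + (1 - beta) * g * g            # decaying average of squared gradients
        theta = theta - eta * g / math.sqrt(s + eps)  # scaled gradient step
    return theta

# Toy cost J(theta) = theta**2, gradient 2 * theta, minimum at 0
theta_one_step = rmsprop(lambda t: 2 * t, theta=1.0, n_steps=1)
theta_rms = rmsprop(lambda t: 2 * t, theta=1.0)
print(theta_one_step, theta_rms)
```

Because old gradients decay away, $s$ does not keep growing as it does in AdaGrad, so the step size does not shrink toward zero over time.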

__Adam:__ Adaptive moment estimation combines the ideas of momentum optimization and RMSProp: like momentum, it keeps track of an exponentially decaying average of past gradients, and like RMSProp, of an exponentially decaying average of past squared gradients:
1. $m \leftarrow \beta_1 m - (1-\beta_1)\Delta_\theta J(\theta)$
2. $s\leftarrow \beta_2 s + (1-\beta_2)\Delta_\theta J(\theta) \bigotimes \Delta_\theta J(\theta)$
3. $\hat{m} \leftarrow \frac{m}{1 - \beta_1^t}$
4. $\hat{s} \leftarrow \frac{s}{1 - \beta_2^t}$
5. $\theta \leftarrow \theta + \eta\hat{m} \oslash \sqrt{\hat{s} + \epsilon} $

$t$ is the iteration number (starting at 1). Step 1 computes an exponentially decaying average rather than an exponentially decaying sum, hence the $(1-\beta_1)$ factor. Since $m$ and $s$ are initialized at 0, they are biased toward 0 at the beginning of training; steps 3 and 4 correct this bias by boosting them.

The momentum decay hyperparameter $\beta_1$ is typically set to 0.9 and the scaling decay hyperparameter $\beta_2$ to 0.999, with $\epsilon=10^{-7}$. Since Adam is an adaptive learning rate algorithm, $\eta=0.001$ usually works well.
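
A single Adam update can be sketched for a scalar parameter as follows. The sketch assumes the sign convention of momentum optimization, where $m$ accumulates the negative gradient and is then added to $\theta$; the function name `adam_step` and the example values are illustrative assumptions.

```python
import math

def adam_step(theta, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter; t is the iteration number, starting at 1."""
    m = beta1 * m - (1 - beta1) * grad         # decaying average of (negative) gradients
    s = beta2 * s + (1 - beta2) * grad * grad  # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for m
    s_hat = s / (1 - beta2 ** t)               # bias correction for s
    theta = theta + eta * m_hat / math.sqrt(s_hat + eps)
    return theta, m, s

# First iteration from theta = 1.0 with a gradient of 2.0
theta_adam, m_adam, s_adam = adam_step(theta=1.0, m=0.0, s=0.0, grad=2.0, t=1)
print(theta_adam, m_adam, s_adam)
```

Thanks to the bias correction, the very first step has a size of roughly $\eta$ even though $m$ and $s$ start at 0.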

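__AdaMax:__ A variant of Adam that scales the updates by the $l_\infty$ norm (the max) of the time-decayed gradients instead of their $l_2$ norm. It is sometimes more stable than Adam, so it is worth trying.
1. $m \leftarrow \beta_1 m - (1-\beta_1)\Delta_\theta J(\theta)$
2. $s\leftarrow \max(\beta_2 s, |\Delta_\theta J(\theta)|) $
3. $\hat{m} \leftarrow \frac{m}{1 - \beta_1^t}$
4. $\theta \leftarrow \theta + \eta\hat{m} \oslash (s + \epsilon) $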


In [None]:
# Momentum
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
# Nesterov
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
# RMSprop, rho is beta from the equations
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
# Adam
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
# The other optimizers (Adagrad, Adamax, Nadam...) have their own classes, check the documentation

### Learning rate scheduling
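There are many options to try: start with a high learning rate and reduce it by some factor every few epochs, increase it linearly at first and then decay it, etc.

__Power scheduling:__ $\eta (t)=\frac{\eta_0}{(1+t/s)^c}$, where $\eta_0$ is the initial learning rate, $c$ the power (typically 1) and $s$ a number of steps

__Exponential scheduling:__ $\eta(t) = \eta_0 0.1^{t/s}$, the learning rate drops by a factor of 10 every $s$ steps

__Piecewise constant scheduling:__ Use a constant learning rate for a number of epochs, then a smaller one for another number of epochs, and so on

__Performance scheduling:__ Reduce the learning rate by some factor when the validation error stops dropping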
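
The first two formulas are easy to compare numerically. This minimal sketch evaluates both in plain Python; the function names and the values $\eta_0 = 0.01$, $s = 20$ are assumptions chosen for illustration.

```python
def power_schedule(t, eta0=0.01, s=20, c=1):
    """Power scheduling: eta0 / (1 + t/s)**c."""
    return eta0 / (1 + t / s) ** c

def exponential_schedule(t, eta0=0.01, s=20):
    """Exponential scheduling: eta0 * 0.1**(t/s)."""
    return eta0 * 0.1 ** (t / s)

# Learning rates after 0, s and 2s steps
power_rates = [power_schedule(t) for t in (0, 20, 40)]
exp_rates = [exponential_schedule(t) for t in (0, 20, 40)]
print(power_rates, exp_rates)
```

After $s$ steps the power schedule has only halved the learning rate, while the exponential schedule has already divided it by 10.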

In [None]:
# Power scheduling: decay is the inverse of s, and Keras assumes c=1
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)

In [None]:
# Exponential scheduling
def exponential_decay(epoch):
    # eta0=0.01 and s=20 are hardcoded
    return 0.01 * 0.1**(epoch/20)

def exponential_decay2(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exponential_decay_fn
exponential_decay_fn = exponential_decay2(lr0=0.01, s=20)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X, y, [...], callbacks=[lr_scheduler])

In [None]:
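def exponential_decay3(epoch, lr):
    # the schedule function can also take the current lr as second argument.
    # It then relies on the optimizer's initial lr, so tune it.
    # The lr is multiplied by 0.1**(1/20) at the beginning of each epoch, starting at epoch 0
    return lr * 0.1**(1/20)
# When saving a model, the optimizer and its lr are saved, but not the epoch.
# If we load the model and continue training with a schedule that uses the
# epoch argument, the fit method restarts at epoch 0, so the lr could be too
# large and damage the weights. SOLUTION: use fit()'s initial_epoch argument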

In [None]:
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001
# Pass it to a LearningRateScheduler callback, as with exponential scheduling

In [None]:
# Performance scheduling
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

In [None]:
s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)
# use one of the schedules available in keras.optimizers.schedules
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
# pass it to any optimizer
optimizer = keras.optimizers.SGD(learning_rate)
# This is specific to tf.keras

## Avoid Overfitting through Regularization
### $l_1$ and $l_2$ regularization
Use $l_2$ regularization to constrain the connection weights, or $l_1$ regularization if we want a sparse model (many weights equal to 0).

In [None]:
# l2 regularization
layer = keras.layers.Dense(100, activation="elu",
                            kernel_initializer="he_normal",
                            kernel_regularizer=keras.regularizers.l2(0.01))
# l2() returns a regularizer that is called at each step during training to
# compute the regularization loss. l1 works the same way: keras.regularizers.l1(),
# and for both of them at once: keras.regularizers.l1_l2()

In [None]:
# Most of the time we apply the same regularizer, activation function and
# initialization strategy to all hidden layers, and repeating them makes the code ugly
from functools import partial
# partial() creates a thin wrapper for any callable, with some default argument values
RegularizedDense = partial(keras.layers.Dense,
                            activation="elu",
                            kernel_initializer="he_normal",
                            kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax", kernel_initializer="glorot_uniform")
])

### Dropout

A powerful regularization technique that usually boosts accuracy by 1-2% (that's a lot!). At every training step, each neuron has a probability $p$ of being temporarily disabled. The dropout rate $p$ is typically set to 10-50% in regular DNNs, 40-50% in CNNs and 20-30% in RNNs.

Since each neuron is connected to more active inputs after training than during training, we need to compensate: either multiply each neuron's input connection weights by the keep probability $(1-p)$ after training, or divide each neuron's output by the keep probability during training.
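
The second option, rescaling the outputs during training, can be sketched in a few lines of plain Python. The function name `dropout`, the seeded random generator and the toy activations are assumptions made for illustration; this is not the Keras implementation.

```python
import random

def dropout(activations, p, training, rng):
    """Drop each activation with probability p and rescale the survivors by 1 / (1 - p)."""
    if not training:
        return list(activations)  # dropout is inactive at inference time
    keep = 1 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)
activations = [1.0] * 10000
train_out = dropout(activations, p=0.2, training=True, rng=rng)
test_out = dropout(activations, p=0.2, training=False, rng=rng)
mean_train = sum(train_out) / len(train_out)
print(mean_train)
```

Dividing by the keep probability keeps the expected output the same with and without dropout, so no weight rescaling is needed after training.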

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
# Dropout is only active during training, so comparing the training loss with the
# validation loss can be misleading. Evaluate the training loss without dropout (after training)

If the model is overfitting, increase the dropout rate; if it is underfitting, decrease it. Another option is to increase the dropout rate for large layers and decrease it for small ones, or to use dropout only after the last hidden layer.

To regularize a self-normalizing network based on SELU, we should use _alpha dropout_, which preserves the mean and standard deviation of its inputs.

### Monte Carlo (MC) Dropout
Make many predictions with dropout active, then average them. The spread of these predictions also gives a better measure of the model's uncertainty.

In [None]:
import numpy as np
# 100 stochastic predictions with dropout active (training=True)
y_probas = np.stack([model(X_test_scaled, training=True)
                     for sample in range(100)])
y_proba = y_probas.mean(axis=0)

If the model contains other layers that behave in a special way during training (like BatchNormalization layers), we should not force training mode on the whole model. Instead, replace the Dropout layers with the following MCDropout class:

In [None]:
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

We subclass the Dropout layer and override the call() method to force its training argument to True. An MCAlphaDropout class can be defined the same way by subclassing AlphaDropout. When creating a model from scratch, we can simply use MCDropout instead of Dropout. With a model that was already trained using Dropout, we need to create a new model that is identical except that it uses MCDropout, then copy the existing model's weights to it.

### Max-Norm Regularization
For each neuron, it constrains the weights __w__ of the incoming connections such that $||w||_2 \leq r$, where $r$ is the max-norm hyperparameter and $|| w ||_2$ is the $l_2$ norm. It does not add a regularization loss term to the overall loss function; instead, after each training step it rescales the weights if needed ($w \leftarrow w \frac{r}{||w||_2}$).

In [None]:
# after each training step, the model's fit() method calls the object returned by
# max_norm(), passing it the layer's weights and getting rescaled weights in return,
# which then replace the layer's weights.
keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
                   kernel_constraint=keras.constraints.max_norm(1.))
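
The rescaling itself is simple enough to write out by hand. This is a minimal sketch in plain Python for the weight vector of a single neuron; the function name `max_norm_rescale` and the example weights are illustrative assumptions, not the Keras constraint class.

```python
import math

def max_norm_rescale(w, r=1.0):
    """Rescale w so that its l2 norm is at most r; leave it unchanged otherwise."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= r:
        return list(w)
    return [wi * r / norm for wi in w]

clipped = max_norm_rescale([3.0, 4.0], r=1.0)    # norm 5, rescaled down to norm 1
unchanged = max_norm_rescale([0.3, 0.4], r=1.0)  # norm 0.5, already within the limit
print(clipped, unchanged)
```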

## Summary and practical guidelines

<center><img src="img/dnn.png"></img></center>
<center><img src="img/dnn2.png"></img></center>

- Normalize the input features
- Use transfer learning
- Unsupervised pretraining if we have a lot of unlabeled data
- Pretraining on an auxiliary task

Exceptions:
- If we need a sparse model, use $l_1$ regularization, or the TensorFlow Model Optimization Toolkit.
- Low-latency model (fast predictions): use fewer layers, fold the Batch Normalization layers into the previous layers, use a fast activation function, and reduce the float precision.
- Risk-sensitive application, use MC Dropout to get more reliable estimates.