# Chapter 11 - Training Deep Neural Networks

### Vanishing or exploding gradients

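For signals to flow properly, gradients must have equal variance before and after flowing through a layer in the reverse direction (and, likewise, each layer's outputs should have the same variance as its inputs). This can only be guaranteed when the layer has the same number of inputs and neurons ($fan_{in} = fan_{out}$), so in practice a compromise is used: the connection weights of each layer are initialized randomly (Glorot initialization), using either: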

- Normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{fan_{avg}}$
- Uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$

Where $fan_{avg} = \frac{fan_{in}+fan_{out}}{2}$
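
The two formulas above can be checked numerically. The following is a minimal sketch in plain Python, not the Keras implementation; the helper names `glorot_params` and `glorot_uniform` are made up for illustration, and the layer sizes (784 inputs, 300 neurons) are only an example.

```python
import math
import random

def glorot_params(fan_in, fan_out):
    """Return (sigma, r) for the normal and uniform variants of Glorot initialization."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = math.sqrt(1 / fan_avg)  # standard deviation of the normal distribution
    r = math.sqrt(3 / fan_avg)      # limit of the uniform distribution U(-r, +r)
    return sigma, r

def glorot_uniform(fan_in, fan_out, seed=0):
    """Sample a fan_in x fan_out weight matrix from U(-r, +r)."""
    _, r = glorot_params(fan_in, fan_out)
    rng = random.Random(seed)
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]

sigma, r = glorot_params(fan_in=784, fan_out=300)
weights = glorot_uniform(784, 300)
```

Both variants target the same variance: a uniform distribution on $[-r, +r]$ has variance $r^2/3$, which equals $\frac{1}{fan_{avg}}$ for the $r$ given above.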

Other initialization strategies exist, each suited to particular activation functions:

<center><img src="img/initialization.png"></img></center>

Keras uses Glorot initialization (with a uniform distribution) by default. To change it, set _kernel_initializer_ when creating the layer, for example to _"he_normal"_ or _"he_uniform"_.

In [None]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")
# He initialization based on fan_avg rather than fan_in
# (the VarianceScaling argument is `scale`, not `scaling`)
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg',
                                                 distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

In [None]:
# Leaky ReLU layer
model = keras.models.Sequential([
    ...
    keras.layers.LeakyReLU(alpha=0.2),
    ...
])

In [None]:
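# PReLU layer (alpha is learned during training; best suited to big training sets)
model = keras.models.Sequential([
    ...
    # PReLU takes no `alpha` argument; the initial slope is set with alpha_initializer
    keras.layers.PReLU(alpha_initializer=keras.initializers.Constant(0.2)),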
    ...
])

In [None]:
# SELU (a scaled ELU)
# for z < 0: scale * alpha * (exp(z) - 1);  for z >= 0: scale * z
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")
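
For reference, the activation functions used in the cells above can be written in a few lines of plain Python. This is only a sketch of the math on scalars, not how Keras computes them; the two SELU constants are the standard values from the SELU paper, and PReLU has the same form as `leaky_relu` except that `alpha` is learned.

```python
import math

# Standard SELU constants (chosen so that activations self-normalize)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def leaky_relu(z, alpha=0.2):
    """Identity for z >= 0, small slope alpha for z < 0."""
    return z if z >= 0 else alpha * z

def elu(z, alpha=1.0):
    """Identity for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1)

def selu(z):
    """ELU with a fixed alpha, multiplied by a fixed scale."""
    return SELU_SCALE * elu(z, SELU_ALPHA)
```

Unlike ReLU, all three return a nonzero value for negative inputs, which is what keeps neurons from "dying" during training.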

### Batch Normalization

Batch Normalization (BN) adds an operation to the model just before or after the activation function of each hidden layer. The operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling and the other for shifting. In other words, it lets the model learn the optimal scale and mean of each of the layer's inputs.

__Batch Normalization Algorithm__

<center><img src="img/batchN.png"></img></center>

Where:
- $\mu_B$ - vector of input means, evaluated over the whole mini-batch $B$ (one mean per input)
- $\sigma_B$ - vector of input standard deviations
- $m_B$ - # of instances per mini-batch
- $\hat{x}^{(i)}$ - vector of zero-centered and normalized inputs for instance $i$
- $\gamma$ - output scale parameter vector for the layer (one scale parameter per input)
- $\otimes$ - element-wise multiplication
- $\beta$ - output shift (offset) parameter vector for the layer. Each input is offset by its corresponding shift parameter.
- $\epsilon$ - smoothing term, a tiny number (typically 1e-5) that avoids division by 0
- $z^{(i)}$ - output of the BN operation. It is the rescaled and shifted version of the inputs.
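
The steps of the algorithm can be sketched in plain Python for a mini-batch stored as a list of feature lists. This is a minimal illustration of the training-time computation only (no moving averages, no backpropagation); the function name `batch_norm_train` and the toy batch are made up for the example, and `gamma` stands for the scale vector.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Apply BN to a mini-batch (list of equal-length feature lists)."""
    m = len(batch)
    n_features = len(batch[0])
    # 1. One mean per input, evaluated over the whole mini-batch
    mu = [sum(x[j] for x in batch) / m for j in range(n_features)]
    # 2. One variance per input
    var = [sum((x[j] - mu[j]) ** 2 for x in batch) / m for j in range(n_features)]
    # 3. Zero-center and normalize each instance
    x_hat = [[(x[j] - mu[j]) / math.sqrt(var[j] + eps) for j in range(n_features)]
             for x in batch]
    # 4. Scale and shift (element-wise)
    z = [[gamma[j] * xh[j] + beta[j] for j in range(n_features)] for xh in x_hat]
    return z, mu, var

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
z, mu, var = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` set to ones and `beta` to zeros, each output column has a mean of 0 and a variance very close to 1, even though the two input features have very different scales.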

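Most implementations estimate the final input means $\mu$ and variances $\sigma^2$ during training by keeping a moving average of each layer's batch statistics. These estimates are used only after training, in place of the batch statistics, while the scale vector $\gamma$ and the shift vector $\beta$ are learned through regular backpropagation.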

BN also acts as a regularizer, reducing the need for other regularization techniques. It strongly reduces the vanishing gradients problem, makes the network much less sensitive to weight initialization, and allows larger learning rates and even saturating activation functions.

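BN makes the model more computationally demanding, so predictions are slower. After training, however, the BN layer can be removed by fusing it into the previous layer: the previous layer's weights and biases are replaced with new ones that directly produce outputs of the appropriate scale and offset. TFLite's optimizer does this automatically.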
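
The fusion follows from the algebra of the two layers. For a linear layer followed by BN at inference time, $\gamma \otimes \frac{XW + b - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta = XW' + b'$, with $W' = W \otimes s$, $b' = (b - \mu) \otimes s + \beta$ and $s = \gamma / \sqrt{\sigma^2 + \epsilon}$. Below is a minimal plain-Python sketch of that identity; the function names and the small example values are made up, and this is not the TFLite implementation.

```python
import math

def dense(x, W, b):
    """Linear layer: x has n_in values, W is n_in x n_out, b has n_out values."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def bn_inference(h, gamma, beta, mu, var, eps=1e-5):
    """BN at inference time, using the final (moving average) statistics."""
    return [gamma[j] * (h[j] - mu[j]) / math.sqrt(var[j] + eps) + beta[j]
            for j in range(len(h))]

def fuse_dense_bn(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold the BN scale and shift into the previous layer's weights and biases."""
    s = [gamma[j] / math.sqrt(var[j] + eps) for j in range(len(b))]
    W_fused = [[W[i][j] * s[j] for j in range(len(b))] for i in range(len(W))]
    b_fused = [(b[j] - mu[j]) * s[j] + beta[j] for j in range(len(b))]
    return W_fused, b_fused

W, b = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2]
gamma, beta, mu, var = [1.5, 0.5], [0.0, 1.0], [0.3, -0.1], [4.0, 0.25]
x = [1.0, 2.0]

two_layers = bn_inference(dense(x, W, b), gamma, beta, mu, var)
W_fused, b_fused = fuse_dense_bn(W, b, gamma, beta, mu, var)
one_layer = dense(x, W_fused, b_fused)
```

`two_layers` and `one_layer` agree up to floating-point rounding, so the fused layer makes the same predictions with one operation fewer.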

In [2]:
import tensorflow as tf
from tensorflow import keras

In [3]:
# Keras example
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])

In [4]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1010      

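$\mu$ and $\sigma$ are not affected by backpropagation, so Keras counts them as non-trainable parameters. Each BN layer has four parameters per input ($\gamma$, $\beta$, $\mu$ and $\sigma$), so half of each BN layer's parameter count (for example 3136 / 2 = 1568 for the first one) is non-trainable; the other half corresponds to the scale $\gamma$ and the shift $\beta$.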
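
The parameter counts in the summary can be reproduced by hand. This is a small sketch in plain Python; the helper names are made up, and the layer sizes are read off the model defined above.

```python
def bn_param_counts(n_inputs):
    """Return (trainable, non_trainable) parameter counts for one BN layer."""
    trainable = 2 * n_inputs       # gamma and beta
    non_trainable = 2 * n_inputs   # moving mean and moving variance
    return trainable, non_trainable

def dense_param_count(n_inputs, n_neurons):
    """Weights plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

bn_inputs = [784, 300, 100]  # input size of each BN layer in the model above
bn_totals = [sum(bn_param_counts(n)) for n in bn_inputs]
non_trainable_total = sum(bn_param_counts(n)[1] for n in bn_inputs)
dense_totals = [dense_param_count(784, 300),
                dense_param_count(300, 100),
                dense_param_count(100, 10)]
```

`bn_totals` matches the `Param #` column for the three BN layers, and `non_trainable_total` is half of their sum.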

In [5]:
# Let's prove it
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

In [None]:
# Adding the BN layers before the activation function
# (which placement works better depends on the task)
# Keras example
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    # no activation in the Dense layer, and no bias term (use_bias=False):
    # the BN layer already has one shift parameter per input
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    # the activation is added as a separate layer after the BN layer
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax"),
])

One hyperparameter to tweak is the _momentum_, which the BN layer uses to update its exponential moving averages. A good value is typically close to 1, such as 0.9, 0.99 or 0.999 (use more 9s for larger datasets and smaller mini-batches).
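
The update rule behind the momentum is $\hat{v} \leftarrow \hat{v} \times momentum + v \times (1 - momentum)$, where $v$ is the statistic computed on the current mini-batch. A minimal sketch on a single scalar, with a made-up helper name and a constant batch mean of 5.0 chosen only for illustration:

```python
def update_moving_average(moving, batch_stat, momentum=0.99):
    """One exponential moving average step."""
    return moving * momentum + batch_stat * (1 - momentum)

fast = 0.0  # momentum=0.9: adapts quickly, but is noisier on real data
slow = 0.0  # momentum=0.999: adapts slowly, but is much smoother
for batch_mean in [5.0] * 10:
    fast = update_moving_average(fast, batch_mean, momentum=0.9)
    slow = update_moving_average(slow, batch_mean, momentum=0.999)
```

After 10 mini-batches `fast` has covered about 65% of the distance to 5.0, while `slow` has covered about 1%. This is why a momentum closer to 1 needs many updates, that is, a large dataset or small mini-batches.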

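Another important hyperparameter is _axis_, which determines which axis is normalized (the statistics are computed across all the other axes).
- 2D [batch_size, features]: with the default _axis=-1_, each feature is normalized using the mean and standard deviation computed across all instances in the batch.
- 3D [batch_size, height, width]: the default _axis=-1_ computes one mean and standard deviation per column of pixels (across all instances and all rows). _axis=[1, 2]_ normalizes each pixel independently.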
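
One way to see the effect of _axis_ is to count how many means (and standard deviations) the layer has to keep. The sketch below assumes one statistic per combination of indices along the listed axes; the helper `bn_stats_count` is hypothetical and does not call Keras.

```python
from math import prod

def bn_stats_count(input_shape, axis=-1):
    """Number of means (and of standard deviations) kept for a given axis setting.

    input_shape includes the batch dimension, e.g. (32, 28, 28).
    """
    axes = [axis] if isinstance(axis, int) else list(axis)
    ndim = len(input_shape)
    # Negative axes count from the end, as in Python indexing
    return prod(input_shape[a % ndim] for a in axes)

per_feature = bn_stats_count((32, 784))               # 2D input, default axis
per_column = bn_stats_count((32, 28, 28))             # 3D input, default axis
per_pixel = bn_stats_count((32, 28, 28), axis=[1, 2])
```

For 28 × 28 images this gives 28 statistics with the default axis and 784 with _axis=[1, 2]_, the same number as when the images are flattened first.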

### Gradient Clipping

In [None]:
# Clip the gradients during backpropagation so they never exceed some threshold.
optimizer = keras.optimizers.SGD(clipvalue=1.0) # every component clipped to [-1, 1]
model.compile(loss="mse", optimizer=optimizer)
# clipvalue may change the orientation of the gradient vector: [0.9, 100] becomes [0.9, 1.0].
# clipnorm clips the whole vector when its l2 norm exceeds the threshold. With clipnorm=1,
# [0.9, 100] becomes [0.00899964, 0.9999595]: the orientation is preserved, but the
# first component is almost eliminated
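
The difference between the two clipping modes can be reproduced on the example vector in plain Python. This is a sketch of the arithmetic only, with made-up function names, not the Keras optimizer code.

```python
import math

def clip_by_value(grad, clip_value):
    """Clip every component independently to [-clip_value, +clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its l2 norm exceeds clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)
```

`by_value` is `[0.9, 1.0]`, which points in a very different direction from the original gradient, while `by_norm` keeps the ratio between the two components.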

### Transfer Learning

In [None]:
# Load the pretrained model A. A new model built directly on its layers would share
# them, so training the new model would also modify model A
model_A = keras.models.load_model("my_model_A.h5")
# To avoid this, clone model A's architecture and copy its weights
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())
# Reuse every layer except the output layer
model_B_on_A = keras.models.Sequential(model_A_clone.layers[:-1])
# Add a new output layer for task B
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

In [None]:
# Freezing all layers except last
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False
# Always compile the model after freezing or unfreezing layers
model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])

In [None]:
# Train for a few epochs with the reused layers frozen, so the new output layer
# learns reasonable weights first
history = model_B_on_A.fit(X_train, y_train, epochs=4,
                           validation_data=(X_valid, y_valid))
# Now we can unfreeze the reused layers and continue training to fine-tune
# them for task B
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

# Reduce the learning rate to avoid damaging the reused weights, then compile again
optimizer = keras.optimizers.SGD(learning_rate=1e-4)
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
                     metrics=["accuracy"])

history = model_B_on_A.fit(X_train, y_train, epochs=16,
                           validation_data=(X_valid, y_valid))