# **Training Deep Neural Networks**

In [1]:
import tensorflow as tf
from tensorflow import keras

## The Vanishing/Exploding Gradients Problems

Backpropagation works by going from the output layer to the input layer, propagating the error gradient along the way. After computing the gradient of the cost function with regard to each parameter in the network, it uses the gradients to update each parameter with a Gradient Descent step.

Gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is the *vanishing gradients* problem.

In some cases, the opposite can happen: the gradients grow bigger and bigger until the layers get extremely large weight updates and the algorithm diverges. This is the *exploding gradients* problem, which surfaces most often in recurrent neural networks.

Consider the logistic activation function: when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus, when backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is nothing left for the lower layers.
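To make the saturation argument concrete, here is a minimal sketch in plain Python (no Keras involved). The helper names `sigmoid` and `sigmoid_grad` are illustrative, not part of any library. As a rough illustration that ignores the weights entirely, it also multiplies the best-case derivative (0.25, reached at z = 0) across 10 layers.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and collapses as the input saturates.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.5f}  derivative={sigmoid_grad(z):.6f}")

# Even in the best case (derivative 0.25 at every layer), the factor
# contributed by the activations alone shrinks geometrically with depth.
best_case_chain = 0.25 ** 10
print(f"best-case activation factor over 10 layers: {best_case_chain:.2e}")
```

The real gradient also depends on the weights at each layer, so this only shows the contribution of the activation function.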

### Glorot and He Initialization

Glorot and Bengio proposed that the signal needs to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. For the signal to flow properly, they argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction. It's not possible to guarantee both unless the layer has as many inputs ($fan_{in}$) as outputs ($fan_{out}$), so they proposed a compromise: initialize the connection weights of each layer randomly, with a variance that depends on these two numbers, where:

$fan_{avg} = (fan_{in} + fan_{out}) / 2$

|Initialization|Activation Functions|$\sigma^{2}$ (Normal)|
|--------------|:-------------------|:-------------------|
|Glorot |None, tanh, logistic, softmax| $1/fan_{avg}$|
|He|ReLU and variants| $2/fan_{in}$|
|LeCun|SELU|$1/fan_{in}$|




Replacing $fan_{avg}$ with $fan_{in}$ in Glorot initialization yields LeCun initialization. LeCun and Glorot initialization are equivalent when $fan_{in} = fan_{out}$.
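The table can be turned into a small calculator. This is only a sketch in plain Python: `init_variance` and `uniform_limit` are hypothetical helper names, not Keras functions. The uniform limit follows from the fact that a uniform distribution on $[-r, r]$ has variance $r^2/3$.

```python
import math

def init_variance(fan_in, fan_out, scheme):
    """Target weight variance for the given scheme (see the table above)."""
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1.0 / fan_avg,
        "he": 2.0 / fan_in,
        "lecun": 1.0 / fan_in,
    }
    return variances[scheme]

def uniform_limit(variance):
    """Limit r such that U(-r, r) has the requested variance (r**2 / 3)."""
    return math.sqrt(3.0 * variance)

# Example: a dense layer with 784 inputs and 300 outputs.
for scheme in ("glorot", "he", "lecun"):
    var = init_variance(784, 300, scheme)
    print(f"{scheme:6s} sigma={math.sqrt(var):.4f}  uniform limit={uniform_limit(var):.4f}")
```

When `fan_in == fan_out`, the `"glorot"` and `"lecun"` entries return the same variance, which matches the equivalence stated above.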

By default, Keras uses Glorot initialization with a uniform distribution. This can be changed to He initialization when creating a layer:

In [None]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

For He initialization with a uniform distribution but based on $fan_{avg}$ rather than $fan_{in}$, use the `VarianceScaling` initializer:

In [None]:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

### Nonsaturating Activation Functions

Other activation functions (besides sigmoid) work better in deep networks, especially ReLU.

ReLU isn't perfect though. It suffers from *dying ReLUs*: during training, some neurons "die", meaning they stop outputting anything other than 0. A neuron dies when its weights are tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it keeps outputting zeros, and Gradient Descent does not affect it anymore because the gradient of ReLU is zero when its input is negative.

**LeakyReLU**

To solve this, use *Leaky ReLU*. Its small slope for negative inputs ensures that neurons never die. Defined as:<br>
$LeakyReLU_{\alpha}(z) = max(\alpha z, z)$

$\alpha$ defines how much the function "leaks": it's the slope of the function for z<0 and is typically set to 0.01. 
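A minimal scalar sketch of the definition, written in plain Python rather than with the Keras layer. The function names are illustrative, and the `max` form assumes $0 \le \alpha \le 1$. It contrasts the gradient of plain ReLU with that of leaky ReLU for a negative input.

```python
def relu_grad(z):
    """Gradient of ReLU: zero for negative inputs, which is why neurons can die."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    """LeakyReLU_alpha(z) = max(alpha * z, z), valid for 0 <= alpha <= 1."""
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    """Gradient of leaky ReLU: alpha (not zero) for negative inputs."""
    return 1.0 if z > 0 else alpha

for z in (-2.0, 3.0):
    print(z, leaky_relu(z), relu_grad(z), leaky_relu_grad(z))
```

Because the gradient for negative inputs is $\alpha$ rather than 0, Gradient Descent can still adjust the weights of a neuron whose weighted sum is negative.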

PReLU (parametric leaky ReLU) also outperforms ReLU in many cases. Here $\alpha$ is learned during training: instead of being a hyperparameter, it becomes a parameter that backpropagation can modify like any other.

**ELU**

In its authors' experiments, ELU outperformed all the ReLU variants.

![elu_formula](elu_form.png)

![elu](ELU.png)

Main drawback is that it is slower to compute than ReLU (due to the use of exponential function)
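Since the formula above is embedded as an image, here is a scalar sketch assuming the standard ELU definition: $\alpha(\exp(z) - 1)$ for $z < 0$ and $z$ otherwise. The function names and the default `alpha=1.0` are choices made for this illustration.

```python
import math

def elu(z, alpha=1.0):
    """ELU: identity for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def elu_grad(z, alpha=1.0):
    """Gradient of ELU: nonzero for negative inputs, approaching 0 as z -> -inf."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(f"z={z:5.1f}  elu={elu(z):8.5f}  grad={elu_grad(z):.5f}")
```

The call to `math.exp` on the negative branch is the extra cost mentioned above.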

**SELU**

A scaled variant of ELU. Its authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function, then the network will *self-normalize*: the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. As a result, SELU often outperforms other activation functions in such networks. However, some conditions must be met for self-normalization to happen:
- Input features must be standardized ($\mu$=0, $\sigma$=1)
- Every hidden layer's weights must be initialized with LeCun normal initialization. In Keras, `kernel_initializer="lecun_normal"`
- Network architecture must be sequential
- All layers are dense
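The conditions above can be checked with a small simulation. This is a rough sketch in plain Python, not a Keras model: it pushes standardized random inputs through a stack of randomly initialized dense layers (LeCun normal weights, no biases, no training) and reports the mean and standard deviation of the final activations. The layer sizes, the seed, and the function name `forward_stats` are arbitrary choices for the illustration; the two constants are the usual rounded SELU values.

```python
import math
import random
import statistics

# Usual SELU constants (rounded).
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    """SELU: a scaled ELU."""
    return SCALE * (z if z > 0 else ALPHA * (math.exp(z) - 1.0))

def forward_stats(n_layers=20, n_units=50, n_samples=100, seed=42):
    """Mean and std of the activations after n_layers dense SELU layers."""
    rng = random.Random(seed)
    # Standardized inputs: mean 0, standard deviation 1.
    batch = [[rng.gauss(0.0, 1.0) for _ in range(n_units)]
             for _ in range(n_samples)]
    lecun_std = math.sqrt(1.0 / n_units)  # LeCun normal: variance 1 / fan_in
    for _ in range(n_layers):
        # weights[j] holds the incoming weights of unit j.
        weights = [[rng.gauss(0.0, lecun_std) for _ in range(n_units)]
                   for _ in range(n_units)]
        batch = [[selu(sum(x * w for x, w in zip(row, unit)))
                  for unit in weights]
                 for row in batch]
    flat = [value for row in batch for value in row]
    return statistics.fmean(flat), statistics.pstdev(flat)

mean, std = forward_stats()
print(f"after 20 layers: mean={mean:.2f}, std={std:.2f}")
```

With narrow layers the statistics fluctuate somewhat, but they should stay in the neighborhood of 0 and 1 rather than vanishing or blowing up.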

To use leaky ReLU activation function:

In [None]:
model = keras.models.Sequential([
    [...]
    keras.layers.Dense(10, kernel_initializer='he_normal'),
    keras.layers.LeakyReLU(alpha=0.2),
    [...]
])

For PReLU, replace `LeakyReLU(alpha=0.2)` with `PReLU()`

For SELU:

In [None]:
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")

### Batch Normalization

Significantly reduces possibility of vanishing/exploding gradients. Consists of adding an operation in the model just before or after the activation function of each hidden layer. This operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting. I.e., the operation lets the model learn the optimal scale and mean of each of the layer's inputs. 
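The training-time operation described above can be sketched for a 2D batch in plain Python. This is an illustration, not the Keras implementation: `batch_norm_train` is a hypothetical name, `gamma` and `beta` stand for the scaling and shifting vectors, and `eps` is a small smoothing term (the value here is an assumption) that avoids division by zero.

```python
import math
import statistics

def batch_norm_train(batch, gamma, beta, eps=1e-3):
    """Normalize each feature (column) over the batch, then scale and shift."""
    columns = list(zip(*batch))  # one tuple per feature
    means = [statistics.fmean(col) for col in columns]
    variances = [statistics.pvariance(col, mu) for col, mu in zip(columns, means)]
    return [
        [g * (x - mu) / math.sqrt(var + eps) + b
         for x, mu, var, g, b in zip(row, means, variances, gamma, beta)]
        for row in batch
    ]

# Two features on very different scales.
batch = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
out = batch_norm_train(batch, gamma=[2.0, 2.0], beta=[5.0, 5.0])
for row in out:
    print([round(v, 3) for v in row])
```

After the operation, both features have roughly the mean given by `beta` and the standard deviation given by `gamma`, whatever their original scale. At test time, the batch statistics are replaced by moving averages estimated during training.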

Led to huge improvement in the ImageNet classification task (large database of images classified into many classes, commonly used to evaluate computer vision systems). Vanishing gradients problem was strongly reduced, to the point they could use saturating activation functions such as tanh and logistic activation function. Networks were also much less sensitive to weight initialization. They were able to use larger learning rates, significantly speeding up the learning process. It also acts like a regularizer.

**Implementing Batch Normalization with Keras**

Add `BatchNormalization` layer before or after each hidden layer's activation function; optionally add BN layer as the first layer in model

In [3]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

In [4]:
>>> model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1010      

In [5]:
>>> [(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]
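The `Param #` values of the BN layers in the summary follow directly from this variable list: each BN layer holds four vectors with one entry per input feature, and two of them (the moving averages) are not trained by backpropagation. A quick arithmetic check, with the layer sizes read off the model above:

```python
# Number of features entering each BN layer in the model above.
bn_input_sizes = [784, 300, 100]

# gamma, beta, moving_mean, moving_variance: one entry per feature each.
bn_param_counts = [4 * n for n in bn_input_sizes]

# moving_mean and moving_variance are not updated by backpropagation.
non_trainable = sum(2 * n for n in bn_input_sizes)

print(bn_param_counts)  # [3136, 1200, 400]
print(non_trainable)    # 2368
```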

To add BN layers before the activation functions, remove activation function from the hidden layers and add them as separate layers after the BN layers. Moreover, since BN layers include one offset parameter per input, you can remove the bias term from previous layer:

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer='he_normal', use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation('elu'),
    keras.layers.Dense(100, kernel_initializer='he_normal', use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation('elu'),
    keras.layers.Dense(10, activation='softmax')
])

The `momentum` hyperparameter occasionally needs tweaking. BN uses it when it updates its exponential moving averages. Good values are typically close to 1, e.g., 0.9, 0.99, or 0.999 (you want more 9s for larger datasets and smaller mini-batches).
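A small sketch of the moving-average update, assuming the usual form `moving = moving * momentum + batch_stat * (1 - momentum)`. The helper names and the starting value of 0 are choices made for this illustration. It shows how quickly the running estimate approaches a constant batch statistic of 1.0 for two momentum values.

```python
def update_moving_average(moving, batch_stat, momentum):
    """One exponential-moving-average step."""
    return moving * momentum + batch_stat * (1.0 - momentum)

def run(momentum, steps, batch_stat=1.0, start=0.0):
    """Apply the update repeatedly with a constant batch statistic."""
    moving = start
    for _ in range(steps):
        moving = update_moving_average(moving, batch_stat, momentum)
    return moving

fast = run(momentum=0.9, steps=100)
slow = run(momentum=0.999, steps=100)
print(f"momentum=0.9:   {fast:.4f}")
print(f"momentum=0.999: {slow:.4f}")
```

A higher momentum averages over many more batches, so the estimate is smoother but slower to move. That fits the advice above: small, noisy mini-batches and long training runs on large datasets favor more 9s.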

The `axis` hyperparameter determines which axis should be normalized. It defaults to -1, meaning that it will normalize the last axis (using the means and standard deviations computed across the *other* axes). When the input batch is 2D ([*batch size, features*]), each input feature will be normalized based on the mean and standard deviation computed across all the instances in the batch. E.g., the first BN layer in the previous code example will independently normalize (and rescale and shift) each of the 784 input features. If we move the first BN layer before the `Flatten` layer, the input batches will be 3D ([*batch size, height, width*]); therefore, the BN layer will compute 28 means and 28 standard deviations (1 per column of pixels, computed across all instances in the batch and across all rows in the column). If you want to treat each of the 784 pixels independently, set `axis=[1, 2]`.
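The counts in this paragraph can be reproduced with a tiny helper. `bn_stat_count` is a hypothetical function written for this note, not a Keras API: it multiplies the sizes of the normalized axes to get the number of separate means (and standard deviations) the layer would track.

```python
import math

def bn_stat_count(input_shape, axis=-1):
    """Number of means tracked by a BN layer for a given input shape and axis.

    input_shape includes the batch dimension (None), as in Keras summaries.
    """
    axes = [axis] if isinstance(axis, int) else list(axis)
    return math.prod(input_shape[a] for a in axes)

print(bn_stat_count((None, 784)))                  # BN after Flatten
print(bn_stat_count((None, 28, 28)))               # BN before Flatten, default axis
print(bn_stat_count((None, 28, 28), axis=[1, 2]))  # BN before Flatten, per pixel
```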

### Gradient Clipping

Clip gradients during backpropagation so they never exceed some threshold; another technique to mitigate the exploding gradients. Often used in RNN, since BN is tricky to use in RNNs.

Keras implementation:

In [None]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss='mse', optimizer=optimizer)

The optimizer will clip every component of the gradient vector to a value between -1.0 and 1.0. Note that this may change the direction of the gradient vector. If you want to ensure that Gradient Clipping does not change the direction, clip by norm by setting `clipnorm` instead of `clipvalue`. This will clip the whole gradient if its $\ell_2$ norm is greater than the threshold you picked. If gradients explode during training, try both clipping by value and clipping by norm, with different thresholds, and see which option performs best on the validation set.
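The difference between the two options is easy to see on a single gradient vector. The sketch below uses plain Python lists and hypothetical helper names. It mirrors the idea behind `clipvalue` and `clipnorm`, not the optimizer's actual code.

```python
import math

def clip_by_value(grad, clip_value):
    """Clip every component to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its l2 norm exceeds clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)
print(by_value)  # [0.9, 1.0]
print([round(g, 5) for g in by_norm])
```

Clipping by value turns a gradient that pointed almost entirely along the second axis into one that points roughly along the diagonal. Clipping by norm keeps the original direction but nearly eliminates the first component.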

## Reusing Pretrained Layers