# **Training Deep Neural Networks**

In [None]:
import tensorflow as tf
from tensorflow import keras

## The Vanishing/Exploding Gradients Problems

Backpropagation works by going from the output layer to the input layer, propagating the error gradient along the way. After computing the gradient of the cost function with regard to each parameter in the network, it uses the gradients to update each parameter with a Gradient Descent step.

Gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is the *vanishing gradients* problem.

In some cases, the opposite can happen: the gradients can grow bigger and bigger until layers get insanely large weight updates and the algorithm diverges. This is the *exploding gradients* problem, which is mostly encountered in recurrent neural networks.

Looking at the logistic activation function, when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus, when backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is nothing left for the lower layers.
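
The saturation described above can be checked numerically. The following is a minimal sketch in plain Python; the helper names `logistic` and `logistic_grad` and the 10-layer chain are illustrative assumptions, and the chain ignores the weights that also enter the real backpropagated product.

```python
import math

def logistic(z):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = logistic(z)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (z = 0) and collapses toward 0 as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigma={logistic(z):.5f}  derivative={logistic_grad(z):.6f}")

# Hypothetical 10-layer chain: even in the best case (0.25 per layer),
# the product of the local derivatives shrinks geometrically.
best_case_chain = 0.25 ** 10
print(f"best-case product over 10 layers: {best_case_chain:.2e}")
```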

### Glorot and He Initialization

Glorot and Bengio proposed that the signal needs to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. For the signal to flow properly, they argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction. It's not possible to guarantee both unless the layer has an equal number of inputs and outputs, so they proposed a compromise. Connection weights of each layer must be initialized randomly using:

$fan_{avg} = (fan_{in} + fan_{out}) / 2$

|Initialization|Activation Functions|$\sigma^{2}$ (Normal)|
|--------------|:-------------------|:-------------------|
|Glorot |None, tanh, logistic, softmax| $1/fan_{avg}$|
|He|ReLU and variants| $2/fan_{in}$|
|LeCun|SELU|$1/fan_{in}$|




Replacing $fan_{avg}$ with $fan_{in}$ in the Glorot formula yields LeCun initialization. LeCun and Glorot initialization are equivalent when $fan_{in}$ = $fan_{out}$.
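
As a rough illustration of the table above, the variances can be computed directly. This is a sketch in plain Python, not Keras code: the function names and the layer sizes are made up for the example, and the uniform limit relies on the fact that Uniform(-r, r) has variance $r^{2}/3$.

```python
import math

def init_variance(fan_in, fan_out, scheme):
    """Variance of the weight distribution for the given scheme (see the table)."""
    fan_avg = (fan_in + fan_out) / 2
    if scheme == "glorot":
        return 1 / fan_avg
    if scheme == "he":
        return 2 / fan_in
    if scheme == "lecun":
        return 1 / fan_in
    raise ValueError(f"unknown scheme: {scheme}")

def uniform_limit(variance):
    """Bound r such that Uniform(-r, r) has the requested variance (r**2 / 3)."""
    return math.sqrt(3 * variance)

fan_in, fan_out = 300, 100  # hypothetical layer sizes
for scheme in ("glorot", "he", "lecun"):
    var = init_variance(fan_in, fan_out, scheme)
    print(f"{scheme:7s} sigma^2={var:.5f}  uniform limit r={uniform_limit(var):.4f}")
```

With `fan_in == fan_out`, the Glorot and LeCun variances coincide, as stated above.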

By default, Keras uses Glorot initialization with a uniform distribution. To switch to He initialization, set `kernel_initializer` when creating the layer:

In [None]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

For He initialization with a uniform distribution but based on $fan_{avg}$ rather than $fan_{in}$, use the `VarianceScaling` initializer:

In [None]:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

### Nonsaturating Activation Functions

Other activation functions work better than sigmoid in deep networks, especially ReLU.

ReLU isn't perfect though. It suffers from *dying ReLUs*: during training, some neurons "die", meaning they stop outputting anything other than 0. A neuron dies when its weights are tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it keeps outputting zeros, and Gradient Descent does not affect it anymore because the gradient of ReLU is zero when its input is negative.

**LeakyReLU**

To solve this, use *Leaky ReLU*. Ensures that neurons never die. Defined as:<br>
$LeakyReLU_{\alpha}(z) = max(\alpha z, z)$

$\alpha$ defines how much the function "leaks": it's the slope of the function for z<0 and is typically set to 0.01. 

PReLU (parametric leaky ReLU) also outperforms ReLU in many cases. Here $\alpha$ is authorized to be learned during training: instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter.
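
To make the difference with plain ReLU concrete, here is a small sketch in plain Python. The function names are illustrative, and the gradient helper simply encodes the two slopes of the definition above.

```python
def relu(z):
    """Plain ReLU: zero output (and zero gradient) for negative inputs."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z): identity for z >= 0, slope alpha for z < 0."""
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope of leaky ReLU; it is never exactly zero, so updates can still flow."""
    return 1.0 if z > 0 else alpha

for z in (-3.0, -0.5, 0.0, 2.0):
    print(f"z={z:5.1f}  relu={relu(z):.3f}  leaky={leaky_relu(z):.3f}  "
          f"leaky_grad={leaky_relu_grad(z)}")
```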

**ELU**

Outperformed all the ReLU variants in its authors' experiments.

![elu_formula](elu_form.png)

![elu](ELU.png)

Main drawback is that it is slower to compute than ReLU (due to the use of exponential function)
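
For reference, a minimal sketch of ELU in plain Python, assuming the usual definition ($z$ for $z \geq 0$, $\alpha(e^{z} - 1)$ for $z < 0$). The function names are illustrative.

```python
import math

def elu(z, alpha=1.0):
    """ELU: z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * math.expm1(z)

def elu_grad(z, alpha=1.0):
    """Derivative: 1 for z >= 0, alpha * exp(z) for z < 0 (nonzero everywhere)."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(f"z={z:5.1f}  elu={elu(z):.4f}  gradient={elu_grad(z):.4f}")
```

For large negative inputs the output approaches $-\alpha$, and the call to `math.expm1` is where the extra cost relative to ReLU comes from.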

**SELU**

Scaled variant of ELU. The authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function, then the network will *self-normalize*: the output of each layer will tend to preserve a mean of 0 and standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. As a result, SELU often outperforms other activation functions for such networks. However, some conditions must be met for self-normalization to happen:
- Input features must be standardized ($\mu$=0, $\sigma$=1)
- Every hidden layer's weights must be initialized with LeCun normal initialization. In Keras, `kernel_initializer="lecun_normal"`
- Network architecture must be sequential
- All layers are dense
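
A sketch of the SELU function itself, in plain Python. The two constants are the values commonly quoted from the SELU paper; the names `SELU_ALPHA`, `SELU_SCALE`, and `selu` are assumptions made for this example.

```python
import math

# Constants as commonly quoted from the SELU paper (assumed here, not derived).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """SELU: scale * z for z >= 0, scale * alpha * (exp(z) - 1) for z < 0."""
    if z >= 0:
        return SELU_SCALE * z
    return SELU_SCALE * SELU_ALPHA * math.expm1(z)

for z in (-3.0, -1.0, 0.0, 1.0):
    print(f"z={z:5.1f}  selu={selu(z):.4f}")
```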

To use leaky ReLU activation function:

In [None]:
model = keras.models.Sequential([
    [...]
    keras.layers.Dense(10, kernel_initializer='he_normal'),
    keras.layers.LeakyReLU(alpha=0.2),
    [...]
])

For PReLU, replace `LeakyReLU(alpha=0.2)` with `PReLU()`

For SELU:

In [None]:
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")

### Batch Normalization

Significantly reduces possibility of vanishing/exploding gradients. Consists of adding an operation in the model just before or after the activation function of each hidden layer. This operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting. I.e., the operation lets the model learn the optimal scale and mean of each of the layer's inputs. 
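
The operation can be sketched for a single feature in plain Python. This only mirrors the training-time computation described above; the function name, the sample batch, and the values of `gamma` (scale) and `beta` (shift) are illustrative assumptions.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-3):
    """Batch-normalize one feature over a mini-batch (training-time behavior)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    # Zero-center and normalize; eps avoids division by zero.
    normalized = [(v - mean) / math.sqrt(var + eps) for v in values]
    # Rescale and shift with the two learned parameters.
    return [gamma * x_hat + beta for x_hat in normalized]

batch = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations for one feature
out = batch_norm(batch, gamma=2.0, beta=0.5)
out_mean = sum(out) / len(out)
print(out, out_mean)
```

After the operation, the feature has a mean equal to `beta` and a standard deviation close to `gamma`, whatever the scale of the incoming activations.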

In the authors' experiments, it led to a huge improvement in the ImageNet classification task (ImageNet is a large database of images classified into many classes, commonly used to evaluate computer vision systems). The vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as tanh and the logistic activation function. Networks were also much less sensitive to weight initialization, and the authors were able to use larger learning rates, significantly speeding up the learning process. Batch Normalization also acts like a regularizer.

**Implementing Batch Normalization with Keras**

Add `BatchNormalization` layer before or after each hidden layer's activation function; optionally add BN layer as the first layer in model

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation='softmax')
])

In [None]:
model.summary()

In [None]:
[(var.name, var.trainable) for var in model.layers[1].variables]

To add BN layers before the activation functions, remove activation function from the hidden layers and add them as separate layers after the BN layers. Moreover, since BN layers include one offset parameter per input, you can remove the bias term from previous layer:

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer='he_normal', use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation('elu'),
    keras.layers.Dense(100, kernel_initializer='he_normal', use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation('elu'),
    keras.layers.Dense(10, activation='softmax')
])

You may occasionally need to tweak the `momentum` hyperparameter of the BN layer. BN uses it when it updates its exponential moving averages. Good values are close to 1, e.g., 0.9, 0.99, 0.999 (you want more 9s for larger datasets and smaller mini-batches).
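
The moving-average update can be sketched as follows, assuming the usual form `running * momentum + new_value * (1 - momentum)`. The function name and the constant batch statistic are illustrative.

```python
def update_moving_average(running, batch_stat, momentum=0.99):
    """One exponential-moving-average step for a running statistic."""
    return running * momentum + batch_stat * (1.0 - momentum)

# With momentum close to 1, each batch only nudges the running value,
# so it takes many steps to approach a (here constant) batch statistic of 5.0.
running_mean = 0.0
for step in range(1000):
    running_mean = update_moving_average(running_mean, batch_stat=5.0, momentum=0.99)
print(running_mean)
```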

The `axis` hyperparameter determines which axis should be normalized. It defaults to -1, meaning that it will normalize the last axis (using the means and standard deviations computed across the *other* axes). When the input batch is 2D ([*batch size, features*]), each input feature will be normalized based on the mean and standard deviation computed across all the instances in the batch. E.g., the first BN layer in the previous code example will independently normalize (and rescale and shift) each of the 784 input features. If we move the first BN layer before the `Flatten` layer, the input batches will be 3D ([*batch size, height, width*]); therefore, the BN layer will compute 28 means and 28 standard deviations (1 per column of pixels, computed across all instances in the batch and across all rows in the column). If you want to treat each of the 784 pixels independently, set `axis=[1, 2]`.
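
The counting argument above can be written down as a tiny helper. This is plain Python that only reasons about shapes; `bn_stat_count` is a hypothetical name, not a Keras function, and the batch size of 32 is arbitrary.

```python
from math import prod

def bn_stat_count(batch_shape, axis=-1):
    """Number of means (and likewise standard deviations) a BN layer keeps.

    batch_shape is the full input shape, batch axis first;
    axis is an int or a list of ints naming the normalized axes.
    """
    axes = [axis] if isinstance(axis, int) else list(axis)
    return prod(batch_shape[a] for a in axes)

print(bn_stat_count((32, 784)))                  # BN after Flatten
print(bn_stat_count((32, 28, 28)))               # BN before Flatten, default axis
print(bn_stat_count((32, 28, 28), axis=[1, 2]))  # one statistic per pixel
```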

### Gradient Clipping

Clip gradients during backpropagation so they never exceed some threshold; this is another technique to mitigate exploding gradients. Often used in RNNs, since BN is tricky to use in them.

Keras implementation:

In [None]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss='mse', optimizer=optimizer)

The optimizer will clip every component of the gradient vector to a value between -1.0 and 1.0. Note that this may change the orientation of the gradient vector. If you want to ensure that Gradient Clipping does not change the direction of the gradient vector, clip by norm by setting `clipnorm` instead of `clipvalue`. This will clip the whole gradient if its $l_{2}$ norm is greater than the threshold you picked. If gradients explode during training, try both clipping by value and clipping by norm, with different thresholds, and see which option performs best on the validation set.
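
The difference between the two modes can be sketched in plain Python. The helper names and the sample gradient are illustrative; the functions imitate the behavior described above rather than calling Keras.

```python
import math

def clip_by_value(grad, threshold):
    """Clip each component independently to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, threshold):
    """Rescale the whole vector if its l2 norm exceeds the threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]

grad = [0.9, 100.0]                   # hypothetical gradient, dominated by axis 2
by_value = clip_by_value(grad, 1.0)   # components are now nearly equal
by_norm = clip_by_norm(grad, 1.0)     # same direction, unit length
print(by_value, by_norm)
```

Clipping by value turns a gradient that pointed almost entirely along the second axis into one that points roughly along the diagonal, while clipping by norm keeps the ratio between the components.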

## Reusing Pretrained Layers

*Transfer Learning*: reusing the lower layers of a preexisting network, trained on a similar task, to accomplish a new task.

The more similar the tasks are, the more layers you want to reuse.

### Transfer Learning with Keras

Suppose the Fashion MNIST dataset contained only 8 classes (all the classes except shirt and sandal). Someone built and trained a Keras model on that set and got good performance (>90% accuracy). Call this model A. You want to tackle a different task: you have images of sandals and shirts, and you want to train a binary classifier (0=shirt, 1=sandal). Your dataset is quite small (200 labeled images). When you train a new model for this task (model B) with the same architecture as model A, it performs well (97% accuracy). Since it's a similar task to model A, transfer learning might help.

In [None]:
model_A = keras.models.load_model('my_model_A.h5')
model_B_on_A = keras.models.Sequential(model_A.layers[:-1]) # reuse all the layers except output layer
model_B_on_A.add(keras.layers.Dense(1, activation='sigmoid'))

Note that the new model shares its layers with model A, so training `model_B_on_A` will also affect model A. To avoid that, clone model A before reusing its layers: clone model A's architecture with `clone_model()`, then copy its weights (`clone_model()` does not clone weights).

In [None]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

Now you can train `model_B_on_A` for task B. However, the new output layer was initialized randomly, so it will make large errors at first, and the resulting large error gradients may wreck the reused weights. To avoid this, one approach is to freeze the reused layers during the first few epochs, giving the new layer some time to learn reasonable weights. Set every reused layer's `trainable` attribute to `False` and compile the model:

In [None]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])

Now you can train the model for the first few epochs, then unfreeze the reused layers (requires compiling the model again) and continue training to fine-tune the reused layers for task B. After unfreezing layers, reduce the learning rate, once again to avoid damaging the reused weights:

In [None]:
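# Train the new output layer for a few epochs while the reused layers are frozen
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4, validation_data=(X_valid_B, y_valid_B))

# Unfreeze the reused layers
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

# Recompile (required after changing `trainable`) with a lower learning rate
optimizer = keras.optimizers.SGD(learning_rate=1e-4) # default learning rate is 1e-2
model_B_on_A.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16, validation_data=(X_valid_B, y_valid_B))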

Transfer learning works best with deep CNNs, which tend to learn feature detectors that are much more general.

### Unsupervised Pretraining

If you can gather unlabeled training data, you can use it to train an unsupervised model, such as an autoencoder or generative adversarial network. Useful when you have a complex task to solve, no similar model you can reuse, and little labeled training data but plenty of unlabeled training data.

### Pretraining on an Auxiliary Task

If you do not have much labeled training data, one option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network's lower layers will learn feature detectors that will likely be reusable by the second neural network.

## Faster Optimizers

- Momentum optimization
- Nesterov Accelerated Gradient
- AdaGrad
- RMSProp
- Adam and Nadam optimization

### Momentum Optimization

Where Gradient Descent is like taking small steps to find out which direction to go toward, momentum optimization is like rolling a bowling ball down a gentle slope. It'll pick up momentum until it finds a terminal velocity (if there is some friction).

Momentum optimization cares about what the previous gradients were: at each iteration, it subtracts the local gradient (multiplied by the learning rate $\eta$) from the *momentum vector* **m**, and it updates the weights by adding this momentum vector. The gradient is thus used for acceleration, not for speed. The algorithm introduces a hyperparameter $\beta$ that acts like friction, set between 0 (high friction) and 1 (no friction). A typical value is 0.9.

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

Drawback is that it adds another hyperparameter to tune. But 0.9 is usually fine.
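
The update rule can be sketched on a scalar parameter in plain Python. The function name and the constant gradient are assumptions made for the illustration.

```python
def momentum_step(theta, m, grad, lr=0.001, beta=0.9):
    """m <- beta * m - lr * grad, then theta <- theta + m."""
    m = beta * m - lr * grad
    return theta + m, m

# With a constant gradient, the step size converges to lr * grad / (1 - beta),
# i.e. 10 times the plain Gradient Descent step when beta = 0.9.
theta, m = 0.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m, grad=1.0)
terminal_velocity = -m
print(terminal_velocity)
```

This is the "terminal velocity" of the bowling ball analogy: friction ($\beta < 1$) is what keeps the momentum from growing without bound.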

### Nesterov Accelerated Gradient

Variant of momentum optimization. It measures the gradient of the cost function not at the local position, but slightly ahead in the direction of the momentum. Generally faster than regular momentum optimization.

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
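
A sketch of the look-ahead step in plain Python, on a toy cost $J(\theta) = \theta^{2}$. The function names, the toy cost, and the hyperparameter values are assumptions chosen for the example.

```python
def nesterov_step(theta, m, grad_fn, lr=0.001, beta=0.9):
    """Like momentum, but the gradient is evaluated at theta + beta * m."""
    lookahead = theta + beta * m
    m = beta * m - lr * grad_fn(lookahead)
    return theta + m, m

def grad_fn(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2.0 * theta

theta, m = 1.0, 0.0
for _ in range(100):
    theta, m = nesterov_step(theta, m, grad_fn, lr=0.1)
print(theta)  # close to the minimum at 0
```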

### AdaGrad

Scales down the gradient vector along the steepest dimensions

The algorithm decays the learning rate, but does so faster for steep dimensions than for dimensions with gentler slopes. This is called an *adaptive learning rate*. It helps point the resulting updates more directly toward the global optimum.

Performs well for simple quadratic problems, but often stops too early when training neural networks. 
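
A minimal sketch of the AdaGrad update in plain Python. The function name, the two-dimensional gradient, and the hyperparameter values are illustrative assumptions.

```python
import math

def adagrad_step(theta, s, grad, lr=0.1, eps=1e-10):
    """s accumulates squared gradients; each dimension is scaled by 1 / sqrt(s + eps)."""
    s = [si + g * g for si, g in zip(s, grad)]
    theta = [t - lr * g / math.sqrt(si + eps) for t, g, si in zip(theta, grad, s)]
    return theta, s

theta, s = [0.0, 0.0], [0.0, 0.0]
grad = [10.0, 0.1]  # one steep and one gentle dimension (held constant here)
theta1, s = adagrad_step(theta, s, grad)
theta2, s = adagrad_step(theta1, s, grad)
first_steps = [abs(b - a) for a, b in zip(theta, theta1)]
second_steps = [abs(b - a) for a, b in zip(theta1, theta2)]
print(first_steps, second_steps)
```

Both dimensions move by about the same amount despite the 100x difference in slope, and the second step is already smaller than the first: since `s` only ever grows, the effective learning rate keeps shrinking, which is why training can stop too early.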

### RMSProp

Fixes AdaGrad's tendency to slow down too fast and stop early by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay. The decay rate $\beta$ is usually 0.9. Again, this is another hyperparameter to tune, but the default value usually works well.

In [None]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

Note that `rho` corresponds to $\beta$. Except on very simple problems, RMSProp almost always works better than AdaGrad.
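
A sketch of the RMSProp update on a scalar parameter, in plain Python. The function name, the constant gradient, and the value of `eps` are assumptions for the example.

```python
import math

def rmsprop_step(theta, s, grad, lr=0.001, rho=0.9, eps=1e-7):
    """s is an exponentially decaying average of squared gradients."""
    s = rho * s + (1 - rho) * grad * grad
    theta = theta - lr * grad / math.sqrt(s + eps)
    return theta, s

theta, s = 0.0, 0.0
for _ in range(200):
    previous = theta
    theta, s = rmsprop_step(theta, s, grad=2.0)
last_step = abs(theta - previous)
print(s, last_step)
```

With a constant gradient, `s` settles at the squared gradient instead of growing forever, so the step size levels off near the learning rate rather than decaying to zero.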

### Adam and Nadam Optimization

*Adaptive moment estimation* combines momentum optimization and RMSProp: like momentum optimization, it keeps track of an exponentially decaying average of past gradients; and like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
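
The two running averages can be sketched on a scalar parameter in plain Python. This follows the commonly published form of Adam with bias correction; the function name and the value of `eps` are assumptions, and sign conventions differ between presentations.

```python
import math

def adam_step(theta, m, s, grad, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update; t is the iteration number, starting at 1."""
    m = beta_1 * m + (1 - beta_1) * grad          # decaying average of gradients
    s = beta_2 * s + (1 - beta_2) * grad * grad   # decaying average of squared gradients
    m_hat = m / (1 - beta_1 ** t)                 # bias correction: m and s start at 0
    s_hat = s / (1 - beta_2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s

theta, m, s = adam_step(theta=0.0, m=0.0, s=0.0, grad=5.0, t=1)
print(theta)  # the first step has size ~lr, whatever the gradient's scale
```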

*AdaMax*: Adam scales down the parameter updates by the $l_{2}$ norm of the time-decayed gradients (recall that the $l_{2}$ norm is the square root of the sum of squares). AdaMax replaces the $l_{2}$ norm with the $l_{\infty}$ norm (a fancy way of saying the max).

*Nadam*: Adam optimization plus the Nesterov trick, so it will often converge slightly faster than Adam.

## Learning Rate Scheduling

- Power scheduling
- Exponential scheduling
- Piecewise constant scheduling
- Performance scheduling
- 1cycle scheduling

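Power scheduling sets the learning rate to $\eta(t) = \eta_{0} / (1 + t/s)^{c}$, where *t* is the iteration number. In Keras, set the `decay` hyperparameter when creating the optimizer: `decay` is the inverse of *s* (the number of steps it takes to divide the learning rate by one more unit), and Keras assumes *c* is equal to 1: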

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)
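
The resulting schedule can be sketched in plain Python, assuming the power scheduling formula $\eta(t) = \eta_{0} / (1 + t/s)^{c}$ with $s = 1/\text{decay}$. The function name is made up for the example.

```python
def power_schedule(step, lr0=0.01, decay=1e-4, c=1):
    """eta(t) = lr0 / (1 + t / s)**c, with s = 1 / decay."""
    s = 1 / decay
    return lr0 / (1 + step / s) ** c

# After s steps the rate is lr0 / 2, after 2s steps lr0 / 3, and so on:
for step in (0, 10_000, 20_000, 30_000):
    print(step, power_schedule(step))
```

The learning rate drops quickly at first, then more and more slowly.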

Exponential and piecewise constant scheduling require defining a function that takes the current epoch and returns the learning rate.

In [None]:
def exponential_decay_fn(epoch):
    return 0.01 * 0.1**(epoch / 20)

To avoid hardcoding $\eta_{0}$ and *s*, create a function that returns a configured function:

In [None]:
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

Create a `LearningRateScheduler` callback, giving it the schedule function, and pass this callback to the `fit()` method:

In [None]:
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train_scaled, y_train, [...], callbacks=[lr_scheduler])

The schedule function can optionally take the current learning rate as a second argument. For example, the following schedule function multiplies the previous learning rate by $0.1^{1/20}$, which results in the same exponential decay (except the decay now starts at the beginning of epoch 0 instead of 1):

In [None]:
def exponential_decay_fn(epoch, lr):
    return lr * 0.1**(1/20)

This implementation relies on the optimizer's initial learning rate (contrary to the previous implementation), so make sure to set it appropriately.
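
The one-epoch offset between the two forms can be checked with a small simulation in plain Python. The loop imitates a callback that calls the schedule function at the start of each epoch; the name `exponential_decay_from_lr` and the initial rate of 0.01 are assumptions made for the comparison.

```python
def exponential_decay_fn(epoch):
    """One-argument form: depends only on the epoch number."""
    return 0.01 * 0.1 ** (epoch / 20)

def exponential_decay_from_lr(epoch, lr):
    """Two-argument form: rescales whatever the current learning rate is."""
    return lr * 0.1 ** (1 / 20)

lr = 0.01  # assumed initial learning rate of the optimizer
two_arg_rates = []
for epoch in range(5):
    lr = exponential_decay_from_lr(epoch, lr)  # rate used during this epoch
    two_arg_rates.append(lr)

one_arg_rates = [exponential_decay_fn(epoch) for epoch in range(5)]
print(one_arg_rates)
print(two_arg_rates)
```

The two-argument form is already decayed during epoch 0, so its rates match the one-argument form shifted by one epoch.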

The optimizer and its learning rate get saved when you save a model, but the `epoch` argument of the schedule function does not: it gets reset to 0 every time you call `fit()`. If you resume training where it left off, manually set the `fit()` method's `initial_epoch` argument so that `epoch` starts at the right value.

For piecewise constant scheduling:

In [None]:
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001
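
To avoid hardcoding the boundaries, the same idea can be written as a function that returns a configured schedule. This is a sketch using only the standard library; `piecewise_constant` and its argument names are assumptions.

```python
from bisect import bisect_right

def piecewise_constant(boundaries, values):
    """Return a schedule function; values needs one more entry than boundaries."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")

    def piecewise_constant_fn(epoch):
        # bisect_right counts how many boundaries are <= epoch.
        return values[bisect_right(boundaries, epoch)]

    return piecewise_constant_fn

piecewise_constant_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])
print([piecewise_constant_fn(epoch) for epoch in (0, 4, 5, 14, 15, 30)])
```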

For performance scheduling, use `ReduceLROnPlateau` callback. E.g., if you pass the following callback to the `fit()` method, it will multiply the learning rate by 0.5 whenever the best validation loss does not improve for five consecutive epochs:

In [None]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

tf.keras offers an alternative way to implement learning rate scheduling: define the learning rate using one of the schedules available in `keras.optimizers.schedules`, then pass this learning rate to any optimizer. This approach updates the learning rate at each step rather than at each epoch:

In [None]:
s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)

Summary: exponential decay, performance scheduling, and 1cycle scheduling can considerably speed up convergence.