## The Vanishing/Exploding Gradients Problems


### Glorot and He initialization

In [1]:
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import graphviz
import pydot

In [2]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

<tensorflow.python.keras.layers.core.Dense at 0x1a0a9dd4648>
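
As a rough sketch of the scale these initializers aim for: Glorot initialization targets a weight variance of 1 / fan_avg, and He initialization targets 2 / fan_in, where fan_in and fan_out are the numbers of inputs and outputs of the layer. The helper names below are hypothetical, and Keras draws `he_normal` weights from a truncated normal distribution, so this only illustrates the target standard deviation.

```python
import math

def glorot_normal_std(fan_in, fan_out):
    """Target standard deviation for Glorot (Xavier) normal initialization."""
    fan_avg = (fan_in + fan_out) / 2
    return math.sqrt(1.0 / fan_avg)

def he_normal_std(fan_in):
    """Target standard deviation for He normal initialization."""
    return math.sqrt(2.0 / fan_in)

# Example: a Dense layer with 784 inputs and 300 units.
print(glorot_normal_std(784, 300))
print(he_normal_std(784))
```

He initialization gives a larger spread for the same layer, which is meant to compensate for ReLU-style activations zeroing out part of their inputs.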

### Nonsaturating Activation Functions

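ReLU does not saturate for positive values, but it suffers from the dying ReLUs problem: during training, some neurons end up outputting only 0.

Variants of ReLU that address this problem:

- leaky ReLU
- parametric leaky ReLU (PReLU)
- exponential linear unit (ELU)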
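
A minimal scalar sketch of these activation functions, assuming the usual textbook definitions. The function names and default `alpha` values are illustrative choices, not Keras API. PReLU has the same form as leaky ReLU, except that `alpha` is learned during training instead of being fixed.

```python
import math

def relu(z):
    # Outputs 0 for every negative input, so its gradient there is 0 as well.
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha is the small slope used for z < 0, which keeps the gradient nonzero.
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    # Smooth for z < 0 and approaches -alpha for large negative z.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

for z in (-2.0, -0.5, 0.0, 1.5):
    print(z, relu(z), leaky_relu(z), elu(z))
```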


### Batch Normalization with Keras

BatchNormalization has become one of the most-used layers in deep neural networks, to the point that it is often omitted from diagrams, since it is assumed that BN is added after every layer.

In [3]:
model = keras.models.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(10, activation="softmax")
])


In [4]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense_1 (Dense)              (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_2 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1010      
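
The parameter counts in this summary can be checked by hand. The sketch below assumes that each BatchNormalization layer holds four values per input feature (scale, offset, moving mean, moving variance) and that each Dense layer holds one weight per input-unit pair plus one bias per unit. The helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units

def batch_norm_params(n_features):
    # Per feature: scale (gamma), offset (beta), moving mean, moving variance.
    return 4 * n_features

counts = [
    batch_norm_params(784),   # 3136
    dense_params(784, 300),   # 235500
    batch_norm_params(300),   # 1200
    dense_params(300, 100),   # 30100
    batch_norm_params(100),   # 400
    dense_params(100, 10),    # 1010
]
total = sum(counts)

# Moving means and variances are not updated by backpropagation,
# so half of each BN layer's parameters are non-trainable.
non_trainable = 2 * (784 + 300 + 100)
print(total, total - non_trainable)
```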

### Gradient Clipping
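Gradient Clipping limits the gradients during backpropagation so that they never exceed some threshold. This technique is most often used in recurrent neural networks, as Batch Normalization is tricky to use in RNNs.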


In [5]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)
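
To make the effect of `clipvalue=1.0` concrete, here is a minimal sketch of clipping by value on a plain Python list. The function name is hypothetical and this is not how Keras implements it internally; it only shows that every component of the gradient is clipped to the range [-1.0, 1.0].

```python
def clip_by_value(gradient, clipvalue):
    """Clip every component of the gradient to [-clipvalue, clipvalue]."""
    return [max(-clipvalue, min(clipvalue, g)) for g in gradient]

clipped = clip_by_value([0.9, 100.0], clipvalue=1.0)
print(clipped)  # [0.9, 1.0]
```

Note that the original vector pointed almost entirely along the second axis, while the clipped one points roughly along the diagonal: clipping by value can change the direction of the gradient.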

## Faster Optimizers

### Momentum Optimization

Imagine a ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity.

Momentum optimization cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient (multiplied by the learning rate η) from the momentum vector m (scaled by the momentum hyperparameter β), and it updates the weights by adding this momentum vector.

$$m = \beta m - \eta\nabla_\theta J (\theta) $$
$$\theta = \theta + m $$




In [6]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
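
The two update equations above can be sketched for a single scalar parameter as follows. The function name is hypothetical and the constant gradient is an artificial assumption, chosen to show the "terminal velocity": with a constant gradient, m converges to -η∇J / (1 - β).

```python
def momentum_step(theta, m, grad, lr=0.001, beta=0.9):
    """One momentum update: m <- beta*m - lr*grad, then theta <- theta + m."""
    m = beta * m - lr * grad
    theta = theta + m
    return theta, m

# Apply the same gradient 200 times and watch m level off.
theta, m = 0.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m, grad=1.0)
print(m)  # approaches -0.01, i.e. 10 times a plain Gradient Descent step
```

With β = 0.9, the terminal velocity is 1 / (1 - 0.9) = 10 times the size of a single Gradient Descent step at the same learning rate.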

### Nesterov Accelerated Gradient

This algorithm measures the gradient of the cost function not at the local position θ but slightly ahead in the direction of the momentum, at θ + βm.

$$m = \beta m - \eta\nabla_\theta J (\theta + \beta m) $$
$$\theta = \theta + m $$

NAG is generally faster than regular momentum optimization. 

In [7]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
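
A scalar sketch of the Nesterov update equations above, assuming a toy cost function J(θ) = θ² whose gradient is 2θ. The function names and the starting values are illustrative only. The only difference from regular momentum is where the gradient is evaluated.

```python
def nesterov_step(theta, m, grad_fn, lr=0.001, beta=0.9):
    """One NAG update: the gradient is measured at theta + beta*m."""
    lookahead = theta + beta * m
    m = beta * m - lr * grad_fn(lookahead)
    theta = theta + m
    return theta, m

def grad_fn(theta):
    # Gradient of the toy cost function J(theta) = theta**2.
    return 2.0 * theta

theta, m = nesterov_step(1.0, -0.1, grad_fn, lr=0.1, beta=0.9)
print(theta, m)
```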

### AdaGrad

Consider an elongated bowl-shaped cost function: Gradient Descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum, then it very slowly goes down to the bottom of the valley. It would be nice if the algorithm could correct its direction earlier to point a bit more toward the global optimum. AdaGrad achieves this by scaling down the gradient vector along the steepest dimensions.

AdaGrad frequently performs well for simple quadratic problems, but it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum. 
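
The shrinking learning rate can be seen in a scalar sketch of the standard AdaGrad update: accumulate the squared gradients in s, then divide the step by the square root of s. The function name, the hyperparameter values, and the constant gradient are assumptions made for illustration.

```python
import math

def adagrad_step(theta, s, grad, lr=0.1, eps=1e-10):
    """One AdaGrad update for a single parameter."""
    s = s + grad * grad                              # accumulate squared gradients
    theta = theta - lr * grad / math.sqrt(s + eps)   # scale the step down by sqrt(s)
    return theta, s

# With a constant gradient of 1, the step size decays like lr / sqrt(t).
theta, s = 0.0, 0.0
step_sizes = []
for _ in range(4):
    new_theta, s = adagrad_step(theta, s, grad=1.0)
    step_sizes.append(theta - new_theta)
    theta = new_theta
print(step_sizes)  # roughly 0.1, 0.0707, 0.0577, 0.05
```

Because s only ever grows, the steps keep shrinking even when the gradient does not, which is why training can stall before reaching the optimum.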

### RMSProp

As we’ve seen, AdaGrad runs the risk of slowing down a bit too fast and never converging to the global optimum. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations. It does so by using exponential decay when accumulating the squared gradients, which is the first step of the update.

In [8]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
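
A scalar sketch of the textbook RMSProp update, where `rho` plays the role of the decay rate. The function name, the `eps` value, and the constant gradient are assumptions, and the exact Keras implementation may differ in details such as where epsilon is added.

```python
import math

def rmsprop_step(theta, s, grad, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSProp update for a single parameter."""
    s = rho * s + (1.0 - rho) * grad * grad          # exponentially decayed accumulation
    theta = theta - lr * grad / math.sqrt(s + eps)
    return theta, s

# With a constant gradient of 1, s levels off near 1 instead of growing forever.
theta, s = 0.0, 0.0
for _ in range(200):
    previous = theta
    theta, s = rmsprop_step(theta, s, grad=1.0)
last_step = previous - theta
print(s, last_step)
```

Unlike the plain sum of squared gradients, the decayed accumulator stays bounded, so the step size settles near the learning rate rather than shrinking toward zero.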

### Adam and Nadam Optimization

Adam, which stands for adaptive moment estimation, combines the ideas of momentum optimization and RMSProp.

In [10]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
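
To show how the two ideas combine, here is a scalar sketch of one Adam step in its commonly published form: a decayed average of gradients (the momentum part), a decayed average of squared gradients (the RMSProp part), and a bias correction for both. The function name and the `eps` value are assumptions, and sign conventions vary between presentations.

```python
import math

def adam_step(theta, m, v, grad, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update for a single parameter; t is the step number, starting at 1."""
    m = beta_1 * m + (1.0 - beta_1) * grad           # momentum-like first moment
    v = beta_2 * v + (1.0 - beta_2) * grad * grad    # RMSProp-like second moment
    m_hat = m / (1.0 - beta_1 ** t)                  # bias correction: m and v start at 0
    v_hat = v / (1.0 - beta_2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=1.0, m=0.0, v=0.0, grad=2.0, t=1)
print(theta, m, v)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient's magnitude.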

### Learning Rate Scheduling

In [None]:
def exponential_decay_fn(epoch): 
    return 0.01 * 0.1**(epoch / 20)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train_scaled, y_train, [...], callbacks=[lr_scheduler])
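
To see what this schedule produces, the function can be evaluated on its own, without training anything. It starts at 0.01 and divides the learning rate by 10 every 20 epochs. The function is repeated here only so that the snippet is self-contained.

```python
def exponential_decay_fn(epoch):
    # Initial learning rate 0.01, multiplied by 0.1 every 20 epochs.
    return 0.01 * 0.1 ** (epoch / 20)

for epoch in (0, 10, 20, 40):
    print(epoch, exponential_decay_fn(epoch))
```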

## Avoiding Overfitting Through Regularization

The great flexibility of deep neural networks makes them prone to overfitting the training set. We need regularization.

### Dropout

At every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step.

In [4]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
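
What a Dropout layer does during one training step can be sketched on a plain list of activations. This is a simplified illustration with a hypothetical function name, not the Keras implementation. It assumes the common "inverted dropout" convention, in which the surviving activations are divided by the keep probability so that their expected sum is unchanged.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout for one training step (it is not applied at inference time)."""
    keep_prob = 1.0 - rate
    # Each activation is dropped with probability `rate`; survivors are scaled up.
    return [a / keep_prob if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(42)  # fixed seed so the run is repeatable
outputs = dropout([1.0] * 10, rate=0.2, rng=rng)
print(outputs)
```

Each output is either 0.0 (dropped) or 1.25 (kept and scaled by 1 / 0.8), and a different random subset is dropped at every training step.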

## Summary

### Default DNN configuration

Kernel initializer: He initialization

Activation function: ELU

Normalization: None if shallow; Batch Norm if deep

Regularization: Early stopping (plus ℓ2 regularization if needed)

Optimizer: Momentum optimization (or RMSProp or Nadam)

Learning rate schedule:  1cycle