# The Vanishing/Exploding Gradients Problems
These problems occur when gradients become unstable as they flow backward through the network during backpropagation: they either shrink toward zero (vanishing) or grow until the model diverges (exploding). Either way, the lower layers are very hard to train, because their weights are not updated properly.

These problems can usually be mitigated with a better activation function, a better weight initialization, or a combination of the two.
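To make the effect concrete, here is a minimal sketch of a toy chain of layers with one sigmoid unit each. It only illustrates how per-layer factors multiply during backpropagation. The helper names (`sigmoid_grad`, `backprop_factor`) and the single shared scalar `weight` are assumptions made for this illustration, not part of any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_factor(pre_activations, weight=1.0):
    # Product of the per-layer factors weight * sigmoid'(z) that the gradient
    # is multiplied by on its way back to the first layer (toy scalar chain).
    factor = 1.0
    for z in pre_activations:
        factor *= weight * sigmoid_grad(z)
    return factor

# Ten layers, small weights: the factor is 0.25 ** 10, so the gradient vanishes.
vanishing = backprop_factor([0.0] * 10, weight=1.0)
# Ten layers, large weights: the factor is (8 * 0.25) ** 10 = 2 ** 10, so it explodes.
exploding = backprop_factor([0.0] * 10, weight=8.0)
print(vanishing, exploding)
```

Even in the best case for the sigmoid (`z = 0`), each layer scales the gradient by at most a quarter of the weight, which is why the choice of activation function and initialization matters.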

## Nonsaturating activation functions
Leaky ReLU implementation: add a `LeakyReLU` layer right after the layer whose output it should activate.

In [1]:
import tensorflow as tf
from tensorflow import keras

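model = keras.models.Sequential([
    # No activation here: this Dense layer outputs the raw weighted sums.
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    # The separate LeakyReLU layer applies the activation (slope 0.2 for negative inputs).
    keras.layers.LeakyReLU(alpha=0.2)
])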
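As a rough sketch of what the `LeakyReLU` layer computes for each element, the function below is a plain-Python stand-in. The name `leaky_relu` is hypothetical, and `alpha` mirrors the layer argument used above.

```python
def leaky_relu(z, alpha=0.2):
    # Positive inputs pass through unchanged; negative inputs are scaled by
    # alpha instead of being zeroed, so the gradient there is alpha, not 0.
    return z if z > 0 else alpha * z

print([leaky_relu(z) for z in (-5.0, -1.0, 0.0, 2.0)])
```

Because the slope for negative inputs is nonzero, units cannot get stuck outputting zero with a zero gradient, which is the failure mode of plain ReLU.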

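SELU activation implementation: set `activation="selu"` and use LeCun normal initialization, which SELU relies on to self-normalize.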

In [2]:
layer = keras.layers.Dense(10, activation="selu",
                           kernel_initializer="lecun_normal")
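For reference, a minimal sketch of the SELU function itself, written in plain Python. The function name `selu` and the constant names are chosen for this illustration. The two constants are the standard values from the SELU definition.

```python
import math

# Standard SELU constants (rounded to double precision).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    # Scaled ELU: linear for positive inputs, a scaled exponential that
    # saturates at -SELU_SCALE * SELU_ALPHA for very negative inputs.
    if z > 0:
        return SELU_SCALE * z
    return SELU_SCALE * SELU_ALPHA * (math.exp(z) - 1.0)

print(selu(1.0), selu(0.0), selu(-1.0))
```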

## Batch Normalization
Adds an operation to the model just before or after the activation function of each hidden layer: it zero-centers and normalizes each input over the current mini-batch, then scales and shifts the result using two new parameter vectors per layer (one for scaling, one for shifting). This lets the model learn the optimal scale and mean of each layer's inputs.
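The operation can be sketched in plain Python as follows. This is only an illustration of the training-time computation on one mini-batch, under stated assumptions: the function name `batch_norm` is hypothetical, `gamma` and `beta` stand for the scaling and shifting vectors, and `eps` is a small smoothing term that avoids division by zero. At inference time, Keras uses moving averages of the mean and variance instead of batch statistics, which this sketch does not model.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-3):
    # batch: list of rows (instances), each a list of feature values.
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n                           # per-feature batch mean
        var = sum((x - mean) ** 2 for x in column) / n   # per-feature batch variance
        for i, x in enumerate(column):
            x_hat = (x - mean) / math.sqrt(var + eps)    # zero-center and normalize
            out[i][j] = gamma[j] * x_hat + beta[j]       # scale and shift
    return out

batch = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(normalized)
```

With `gamma` set to ones and `beta` set to zeros, each output column has a mean of zero and roughly unit variance, regardless of the original scale of the feature.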

Batch normalization has become ubiquitous.

### Implementing batch normalization with Keras
This model adds a batch normalization layer as the first layer (right after flattening the input) and after each hidden layer. That's it!

In [5]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

In [6]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 784)               3136      
_________________________________________________________________
dense_5 (Dense)              (None, 300)               235500    
_________________________________________________________________
batch_normalization_4 (Batch (None, 300)               1200      
_________________________________________________________________
dense_6 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_5 (Batch (None, 100)               400       
_________________________________________________________________
dense_7 (Dense)              (None, 10)               
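The parameter counts shown in the summary can be checked by hand. The sketch below uses two hypothetical helper functions. A batch normalization layer holds four values per input feature: the scale and shift vectors, which are trainable, plus the moving mean and moving variance, which are not trained by backpropagation. A dense layer holds one weight per input-output pair plus one bias per output.

```python
def batch_norm_params(n_features):
    # gamma, beta (trainable) + moving mean, moving variance (non-trainable)
    return 4 * n_features

def dense_params(n_inputs, n_units):
    # weight matrix + bias vector
    return n_inputs * n_units + n_units

print(batch_norm_params(784))    # 3136, matches batch_normalization_3
print(dense_params(784, 300))    # 235500, matches dense_5
print(batch_norm_params(300))    # 1200, matches batch_normalization_4
print(dense_params(300, 100))    # 30100, matches dense_6
print(batch_norm_params(100))    # 400, matches batch_normalization_5
```

Half of each batch normalization layer's parameters (the moving statistics) are therefore counted as non-trainable.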

## Gradient Clipping
Gradient clipping mitigates exploding gradients by clipping the gradients during backpropagation so that they never exceed a threshold. It is most often used in recurrent neural networks, where batch normalization is tricky to apply.

In [7]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)

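You can also use ```clipnorm``` instead of ```clipvalue```. Clipping by value clips each component of the gradient independently, which may change the direction of the gradient vector. Clipping by norm rescales the whole vector whenever its L2 norm exceeds the threshold, so its direction is preserved.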
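The difference between the two options is easy to see on a single gradient vector. The following plain-Python sketch uses hypothetical helpers `clip_by_value` and `clip_by_norm` that mimic the two behaviors on one vector. It assumes the threshold applies to one gradient vector at a time and does not reproduce the optimizer's internals.

```python
import math

def clip_by_value(grad, clipvalue):
    # Clip every component independently to [-clipvalue, clipvalue].
    return [max(-clipvalue, min(clipvalue, g)) for g in grad]

def clip_by_norm(grad, clipnorm):
    # Rescale the whole vector if its L2 norm exceeds clipnorm.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clipnorm:
        return list(grad)
    return [g * clipnorm / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)   # [0.9, 1.0]: direction changes a lot
by_norm = clip_by_norm(grad, 1.0)     # same direction, L2 norm scaled down to 1
print(by_value, by_norm)
```

After clipping by value, the vector points roughly along the diagonal between the two axes, even though the original gradient was dominated by the second component. Clipping by norm keeps that dominance but nearly eliminates the first component.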

# Reusing Pretrained Layers