# Training Deep Neural Networks (DNNs)

## Vanishing/Exploding Gradients Problem

**Vanishing Gradients:** Backpropagation carries the error gradient from the output layer back toward the input layer. The gradients often get smaller and smaller as they reach the lower layers, until the weights in those layers are barely updated at all. This becomes a problem in networks with many layers, because the error gradients no longer propagate through the network: they _vanish_.

**Exploding Gradients:** The opposite problem: the gradients grow larger and larger as they propagate, so some layers get huge weight updates that throw off the network entirely and training diverges.
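A minimal NumPy sketch of both effects, under assumptions that are not from these notes: a toy stack of 10 fully connected layers of width 50, random normal weights, and a made-up upstream gradient of all ones. The function name `gradient_norms` and its parameters are hypothetical. Small weights with sigmoid activations shrink the gradient at every layer, while unit-variance weights with no activation blow it up.

```python
import numpy as np

def gradient_norms(n_layers=10, width=50, weight_std=0.1, linear=False, seed=0):
    """Backpropagate through a toy stack of dense layers.

    Returns the norm of the gradient reaching each layer's input,
    ordered from the lowest layer (index 0) to the top layer.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=width)          # toy input vector
    weights, derivs = [], []
    for _ in range(n_layers):           # forward pass, caching derivatives
        W = rng.normal(scale=weight_std, size=(width, width))
        z = W @ a
        if linear:
            a, d = z, np.ones(width)    # identity activation, derivative 1
        else:
            a = 1.0 / (1.0 + np.exp(-z))
            d = a * (1.0 - a)           # sigmoid derivative, at most 0.25
        weights.append(W)
        derivs.append(d)

    grad = np.ones(width)               # assumed gradient at the output
    norms = []
    for W, d in zip(reversed(weights), reversed(derivs)):
        grad = W.T @ (grad * d)         # chain rule: activation, then weights
        norms.append(np.linalg.norm(grad))
    return norms[::-1]

vanishing = gradient_norms(weight_std=0.1)               # sigmoid, small weights
exploding = gradient_norms(weight_std=1.0, linear=True)  # no activation, large weights
print("vanishing:", vanishing[0], "at the bottom vs", vanishing[-1], "at the top")
print("exploding:", exploding[0], "at the bottom vs", exploding[-1], "at the top")
```

The exact numbers depend on the seed and layer sizes; the point is only the direction of the trend from the top layer to the bottom layer.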

### Saturation Problem - Weight Initialization

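This is mainly a problem with the _sigmoid_ activation function combined with initializing the weights from a normal distribution (mean 0, standard deviation 1). With that combination, the variance of each layer's outputs is much larger than the variance of its inputs. The variance keeps growing from layer to layer, so the values fed into the activation function spread further and further out. Once they spread out enough, most of the outputs saturate at the top or bottom of the function, where the derivative is close to zero and almost no gradient flows back: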


![sigmoid_saturation.PNG](attachment:sigmoid_saturation.PNG)


The sigmoid also has a mean output of 0.5, not 0 like the tanh function, which doesn't help.
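To make the saturation concrete, here is a rough NumPy sketch. The layer width (100), depth (5), and the cutoff for calling an output "saturated" (within 0.01 of 0 or 1) are arbitrary choices for illustration, and `saturated_fraction` is a hypothetical helper, not something from the notes. It compares weights drawn with standard deviation 1 against a much smaller standard deviation.

```python
import numpy as np

def saturated_fraction(weight_std, n_layers=5, width=100, n_samples=1000,
                       threshold=0.01, seed=0):
    """Fraction of final-layer sigmoid outputs within `threshold` of 0 or 1."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_samples, width))      # toy standardized inputs
    for _ in range(n_layers):
        W = rng.normal(scale=weight_std, size=(width, width))
        a = 1.0 / (1.0 + np.exp(-(a @ W)))       # sigmoid layer
    return np.mean((a < threshold) | (a > 1.0 - threshold))

frac_std_one = saturated_fraction(weight_std=1.0)   # N(0, 1) weights
frac_small = saturated_fraction(weight_std=0.1)     # much smaller weights
print("saturated with std 1.0:", frac_std_one)
print("saturated with std 0.1:", frac_small)
```

With 100 inputs per unit, a standard deviation of 0.1 corresponds to a variance of 1/100, which is why the second case stays in the near-linear part of the sigmoid.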

### Glorot and He Initialization Solution

- The variance of each layer's outputs should equal the variance of its inputs.
- The variance of the gradients should also be equal before and after flowing through a layer in the reverse direction.

![initialization_given_activation_func.PNG](attachment:initialization_given_activation_func.PNG)
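As a sketch of the idea, the snippet below uses the commonly cited variances for zero-mean normal initializers: 1/fan_avg for Glorot, 2/fan_in for He, and 1/fan_in for LeCun. These are assumed to match the table in the image above, which is not reproduced here. The helper names `init_std` and `signal_ratio` are hypothetical. It then checks how the mean squared activation changes across a toy stack of ReLU layers.

```python
import numpy as np

def init_std(fan_in, fan_out, scheme):
    """Standard deviation of a zero-mean normal initializer."""
    if scheme == "glorot":                 # variance = 1 / fan_avg
        return np.sqrt(2.0 / (fan_in + fan_out))
    if scheme == "he":                     # variance = 2 / fan_in
        return np.sqrt(2.0 / fan_in)
    if scheme == "lecun":                  # variance = 1 / fan_in
        return np.sqrt(1.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")

def signal_ratio(weight_std, n_layers=5, width=512, n_samples=200, seed=0):
    """Mean squared activation after the ReLU stack divided by that of the input."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_samples, width))
    start = np.mean(a ** 2)
    for _ in range(n_layers):
        W = rng.normal(scale=weight_std, size=(width, width))
        a = np.maximum(0.0, a @ W)         # ReLU layer
    return np.mean(a ** 2) / start

he_ratio = signal_ratio(init_std(512, 512, "he"))   # should stay near 1
naive_ratio = signal_ratio(1.0)                     # N(0, 1) weights
print("He init ratio:", he_ratio)
print("std=1 ratio:", naive_ratio)
```

With He initialization the signal stays at roughly the same scale from layer to layer, whereas unit-variance weights multiply it by a large factor at every layer.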

Keras defaults to Glorot initialization with a uniform distribution. To use He initialization instead, set `kernel_initializer` when declaring a layer:

> keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")


To change the scaling of the variance (in this case, He initialization with a uniform distribution, but based on fan_avg rather than fan_in):

> he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')

> keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)
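For intuition, here is a NumPy re-creation of what `VarianceScaling` with `distribution='uniform'` is understood to compute: samples drawn uniformly from [-limit, limit] with limit = sqrt(3 * scale / n), where n is chosen by `mode`. This is a sketch of that behavior, not Keras code, and the function `variance_scaling_uniform` and the layer sizes (300 inputs, 100 outputs) are made up for the example.

```python
import numpy as np

def variance_scaling_uniform(fan_in, fan_out, scale=2.0, mode="fan_avg", seed=0):
    """Draw a (fan_in, fan_out) weight matrix with variance scale / n."""
    n = {
        "fan_in": fan_in,
        "fan_out": fan_out,
        "fan_avg": (fan_in + fan_out) / 2.0,
    }[mode]
    # A uniform distribution on [-limit, limit] has variance limit**2 / 3,
    # so this limit gives a variance of scale / n.
    limit = np.sqrt(3.0 * scale / n)
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = variance_scaling_uniform(fan_in=300, fan_out=100)
print("target variance:", 2.0 / 200.0)
print("sample variance:", W.var())
```

Here fan_avg is (300 + 100) / 2 = 200, so the target variance is 2 / 200 = 0.01.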


### Nonsaturating Activation Functions

ReLU is often a much better choice of activation function than the sigmoid because it does not saturate at the top end: unlike a biological neuron, whose roughly sigmoid-shaped response limits how hard it can fire, a ReLU unit has no upper bound on its output.
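A small sketch comparing the two derivatives at a few hand-picked inputs; the values in `z` are arbitrary and the function names are hypothetical. For large positive inputs the sigmoid's derivative collapses toward zero while the ReLU's stays at 1.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.5, 5.0, 10.0])
print("sigmoid':", sigmoid_grad(z))
print("relu':   ", relu_grad(z))
```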

With ReLU activation, 