# Problems:

- First, the vanishing gradients problem (or the related exploding gradients problem) affects deep neural networks and makes lower layers very hard to train.
- Second, you might not have enough training data for such a large network, or it might be too costly to label.
- Third, training may be extremely slow.
- Fourth, a model with millions of parameters would severely risk overfitting the training set.

## Vanishing/Exploding Gradients Problems
- <u>Vanishing</u>: gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution.
- <u>Exploding</u>: the gradients can grow bigger and bigger until the layers get insanely large weight updates and the algorithm diverges.
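
As a rough illustration of both effects, the sketch below backpropagates through a chain of one-unit layers in plain Python. The chain, the function name `backprop_gradient_magnitudes`, and the chosen weights are assumptions made for this example, not a setup from the source. Each backward step multiplies the gradient by the layer weight times the activation derivative. The sigmoid derivative is at most 0.25, so with a weight of 1 the gradient shrinks at every layer, while a linear chain with a weight of 2 doubles it at every layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def backprop_gradient_magnitudes(weight, n_layers, x=0.5, activation="sigmoid"):
    """Return |d(output)/d(input of layer k)|, from the top layer down to the first.

    Toy model (an assumption for illustration): a chain of n_layers one-unit
    layers that all share the same weight.
    """
    # Forward pass: remember each pre-activation z = weight * a.
    pre_activations = []
    a = x
    for _ in range(n_layers):
        z = weight * a
        pre_activations.append(z)
        a = sigmoid(z) if activation == "sigmoid" else z

    # Backward pass: apply the chain rule one layer at a time.
    grad = 1.0
    magnitudes = []
    for z in reversed(pre_activations):
        if activation == "sigmoid":
            s = sigmoid(z)
            local_derivative = s * (1.0 - s)
        else:
            local_derivative = 1.0
        grad *= local_derivative * weight
        magnitudes.append(abs(grad))
    return magnitudes


vanishing = backprop_gradient_magnitudes(weight=1.0, n_layers=10)
exploding = backprop_gradient_magnitudes(weight=2.0, n_layers=10, activation="linear")
print("sigmoid chain, gradient reaching the first layer:", vanishing[-1])
print("linear chain, gradient reaching the first layer:", exploding[-1])
```

In a real network the per-layer factor involves weight matrices rather than a single scalar, but the multiplicative structure is the same.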

### Glorot and He Initialization
We need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction.

It is not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the _fan-in_ and _fan-out_ of the layer).

A compromise that works well in practice is to initialize the connection weights of each layer randomly as described in the next equation, where `fan_avg = (fan_in + fan_out) / 2`.

<img src="Xavier.jpg" alt="Xavier/Glorot initialization" />

Using Glorot initialization can speed up training considerably, and it is one of the tricks that led to the current success of Deep Learning.
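
A minimal sketch of the uniform variant, assuming the usual statement of Glorot initialization: weights drawn from U(-r, r) with r = sqrt(3 / fan_avg), which gives a variance of 1 / fan_avg. The function name `glorot_uniform`, the layer sizes, and the seed are made up for this example. It uses only the standard library, so it is not how Keras implements the initializer.

```python
import math
import random
import statistics


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix (list of rows) from U(-r, r)."""
    fan_avg = (fan_in + fan_out) / 2
    r = math.sqrt(3 / fan_avg)  # the variance of U(-r, r) is r**2 / 3 = 1 / fan_avg
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(42)  # fixed seed so the run is repeatable
weights = glorot_uniform(fan_in=300, fan_out=100, rng=rng)

# Compare the empirical variance of the samples with the target 1 / fan_avg.
flat = [w for row in weights for w in row]
empirical_variance = statistics.pvariance(flat)
target_variance = 1 / ((300 + 100) / 2)
print(f"target {target_variance:.5f}, empirical {empirical_variance:.5f}")
```

With 30,000 sampled weights the empirical variance should land close to the target, though the exact value depends on the seed.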

<img src="Table_initialization.jpg" alt="Table of initializations" />

By default, Keras uses Glorot initialization with a uniform distribution. You can change this to He initialization by setting `kernel_initializer="he_uniform"` or `kernel_initializer="he_normal"`:

```python
from tensorflow import keras

keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")
```

If you want He initialization with a uniform distribution, but based on `fan_avg` rather than `fan_in`, you can use the `VarianceScaling` initializer like this:

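```python
from tensorflow import keras

# scale=2 is the He scaling factor; mode='fan_avg' bases it on fan_avg
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg',
                                                 distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)
```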
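
To see what these arguments mean numerically, here is a small plain-Python sketch of the scaling rule as the Keras documentation describes it for `VarianceScaling`: with `distribution='uniform'`, weights are drawn from [-limit, limit] with `limit = sqrt(3 * scale / n)`, where `n` is chosen by `mode`. The helper name `uniform_limit` and the layer sizes are hypothetical. Check the documentation of your Keras version before relying on the exact formula.

```python
import math


def uniform_limit(scale, mode, fan_in, fan_out):
    """Bound of the uniform distribution for a variance-scaling initializer."""
    n = {
        "fan_in": fan_in,
        "fan_out": fan_out,
        "fan_avg": (fan_in + fan_out) / 2,
    }[mode]
    return math.sqrt(3.0 * scale / n)


fan_in, fan_out = 300, 100  # hypothetical layer sizes

glorot = uniform_limit(1.0, "fan_avg", fan_in, fan_out)  # Glorot uniform
he = uniform_limit(2.0, "fan_in", fan_in, fan_out)       # He uniform
he_avg = uniform_limit(2.0, "fan_avg", fan_in, fan_out)  # the variant used above

print(f"glorot: {glorot:.4f}  he: {he:.4f}  he_avg: {he_avg:.4f}")
```

Switching from `fan_in` to `fan_avg` only changes the denominator, so for a layer with as many inputs as neurons the two He variants coincide.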

## Nonsaturating Activation Functions
The ReLU activation function suffers from a problem known as _dying ReLUs_: during training, some neurons effectively die, meaning they stop outputting anything other than 0, especially if you use a large learning rate.

A variant of the ReLU that addresses this is the leaky ReLU, defined as LeakyReLU<sub>α</sub>(z) = max(αz, z).

The hyperparameter α defines how much the function “leaks”: it is the slope of the function for z < 0, and is typically set to 0.01. In reported experiments, setting α = 0.2 (a huge leak) seemed to result in better performance than α = 0.01 (a small leak).
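
The definition above translates directly into a few lines of Python. This is a scalar sketch for illustration only. The function names are made up here, and the `max` form assumes 0 ≤ α ≤ 1.

```python
def leaky_relu(z, alpha=0.01):
    """LeakyReLU_alpha(z) = max(alpha * z, z); assumes 0 <= alpha <= 1."""
    return max(alpha * z, z)


def leaky_relu_slope(z, alpha=0.01):
    """Slope of the leaky ReLU: 1 for z > 0, alpha for z < 0 (alpha chosen at z = 0)."""
    return 1.0 if z > 0 else alpha


for z in (-2.0, -0.5, 0.0, 3.0):
    print(z, leaky_relu(z), leaky_relu(z, alpha=0.2))
```

Because the slope for negative inputs is α rather than 0, a gradient still flows through the unit when z < 0, which is what gives a "dead" neuron a chance to recover.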

The exponential linear unit (ELU) outperformed all the ReLU variants in its authors' experiments: training time was reduced and the neural network performed better on the test set.
<img src="ELU-1.jpg" alt="Exponential Linear Unit" />
<img src="ELU-plot.jpg" alt="ELU plot" />
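
For reference, a scalar Python sketch of the ELU, assuming the figures show the usual definition: ELU<sub>α</sub>(z) = α(exp(z) − 1) for z < 0 and z for z ≥ 0. The function name `elu` and the default α = 1 are choices made for this example.

```python
import math


def elu(z, alpha=1.0):
    """ELU_alpha(z): z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    # math.expm1(z) computes exp(z) - 1 accurately for z close to 0.
    return z if z >= 0 else alpha * math.expm1(z)


for z in (-5.0, -1.0, 0.0, 2.0):
    print(z, elu(z))
```

For large negative inputs the output approaches −α instead of growing without bound, and for α = 1 the slope is continuous at z = 0.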

Major differences with ReLU:
- 