# Training deep neural networks

* More challenging problems call for deeper NNs, which are more complex to train
* Potential problems include: vanishing gradients, exploding gradients, insufficient training data, slow training, and overfitting due to the large number of parameters

Main topics:

1. Vanishing/exploding gradients
2. Reusing pretrained layers
3. Faster optimisers
4. Avoiding overfitting using regularization

### 1. Vanishing/exploding gradients

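* DNNs are trained with backpropagation, but gradients can shrink as they flow back towards the lower layers, so those layers' weights barely change and training never converges to a good solution (vanishing gradients)
* The opposite problem is where gradients grow very large, the weight updates become huge, and training diverges (exploding gradients)
* Previously DNNs used the logistic sigmoid activation function and weights initialized from a normal distribution (mean 0, standard deviation 1), which together make the variance of each layer's outputs greater than the variance of its inputs
* When inputs to the logistic function are large in magnitude (negative or positive), the function saturates at 0 or 1, so the gradient is close to zero and there is almost nothing to propagate back
* A 2010 paper (Glorot and Bengio) argued that the variance of the outputs of each layer should be similar to the variance of its inputs, otherwise chaining many layers pushes the signal to the extremes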
* ReLU behaves better in DNNs because it does not saturate for positive values
* ReLU can lead to nodes "dying": a node outputs zero for every input, its gradient is zero, and so there are no subsequent updates; "leaky" ReLU is used to avoid this
* LeakyReLU(z) = max(alpha * z, z), where 0 < alpha < 1, e.g. alpha = 0.01, so there is a small non-zero gradient (alpha) for negative inputs
* Another improvement was the exponential linear unit (ELU), which is similar to LeakyReLU but is smooth, following an exponential curve for negative inputs
* A 2017 paper proposed SELU (scaled ELU), with which the network self-normalizes under certain conditions
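The activation functions listed above can be compared in a few lines of plain Python. This is a minimal sketch rather than any library's implementation: the function names are chosen here for illustration, and the SELU constants are the rounded values from the 2017 paper. It also shows why a saturated sigmoid leaves almost nothing to propagate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the logistic function: s * (1 - s), at most 0.25 (at z = 0)
    s = sigmoid(z)
    return s * (1.0 - s)

def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    # Slope is 1 for positive inputs and alpha (small but non-zero) otherwise
    return 1.0 if z > 0 else alpha

def elu(z, alpha=1.0):
    # Linear for positive inputs, smooth exponential curve for negative inputs
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

# Rounded SELU constants (assumption: values as given in the 2017 SELU paper)
SELU_SCALE = 1.0507
SELU_ALPHA = 1.6733

def selu(z):
    return SELU_SCALE * elu(z, SELU_ALPHA)

# Best case for the sigmoid: even at its maximum slope, chaining 10 layers
# multiplies the gradient by 0.25 ten times over.
chained = sigmoid_grad(0.0) ** 10

for z in (-10.0, -1.0, 0.5, 10.0):
    print(z, sigmoid_grad(z), leaky_relu_grad(z), elu(z), selu(z))
```

The sigmoid gradient collapses towards zero at both ends of the range, while the leaky ReLU gradient never drops below `alpha`.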


In [None]:
# Using the LeakyReLU activation function

# leaky_relu = keras.layers.LeakyReLU(alpha=0.2)
# layer = keras.layers.Dense(10, activation=leaky_relu, kernel_initializer="he_normal")

* Batch normalization (BN) is another approach to dealing with vanishing/exploding gradients
* It adds an operation just before or after the activation function of each hidden layer: zero-centre and normalize each input over the mini-batch, then scale and shift the result using two learned parameters (one for scale, one for shift)
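The BN operation can be sketched for a single feature over one mini-batch. This is an illustrative plain-Python version, not the Keras layer: `gamma` (scale), `beta` (shift) and `eps` (a small constant to avoid division by zero) are names assumed here, and in a real layer `gamma` and `beta` would be learned during training.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # Zero-centre and normalize to (approximately) unit variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # Scale by gamma and shift by beta
    return [gamma * x + beta for x in x_hat]

out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((x - out_mean) ** 2 for x in out) / len(out))
print(out_mean, out_std)
```

After the operation the batch has mean close to `beta` and standard deviation close to `gamma`, whatever the scale of the original inputs.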

### 2. Reusing pretrained layers

* Generally not a good idea to train a very large DNN from scratch
* Reusing the lower layers from a pretrained network: "transfer learning"
* You'll need to preprocess the input from the new context so it's similar to the pretraining context
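* The reused layers are usually "frozen" at first (their weights are made non-trainable) so the new layers can learn reasonable weights without wrecking the pretrained ones; they can be unfrozen later for fine-tuning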

In [None]:
# model_A = keras.models.load_model("my_model_A.h5")
# model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
# model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

# model_A_clone = keras.models.clone_model(model_A)

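## ** Note: model_B_on_A is built from model_A's own layer objects, so A and B share weights and training B also changes A
## ** clone_model copies only the architecture (fresh weights, nothing shared), so the weights must be copied across explicitly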

# model_A_clone.set_weights(model_A.get_weights())

In [None]:
# Freezing layers - use 'trainable' attribute

# for layer in model_B_on_A.layers[:-1]:
#   layer.trainable = False

### 3. Faster optimisers

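* Big speed gains are possible by using a faster optimizer than plain gradient descent
* Momentum optimization keeps a momentum vector that accumulates the gradients from previous iterations; at each step the current gradient (times the learning rate) is subtracted from it, and the weights are updated by adding the momentum vector
* A hyperparameter (the momentum, typically 0.9) acts as friction, from 0 (high friction) to 1 (no friction); it stops the momentum from growing too large and damps the oscillations when the optimizer overshoots the optimum
* Nesterov accelerated gradient is similar but measures the gradient slightly ahead, in the direction of the momentum, rather than at the current position
* AdaGrad scales down the gradient vector along the steepest dimensions, but its effective learning rate can decay so fast that training stops before reaching the optimum
* RMSProp addresses this by accumulating only the squared gradients from the most recent iterations, using exponential decay
* Adam (adaptive moment estimation) combines the ideas of momentum and RMSProp: it tracks exponentially decaying averages of past gradients and of past squared gradients
* Learning rate scheduling: a good constant learning rate can be estimated and set, but it often works better to change the rate during training, e.g. start high and reduce it as training progresses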
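The update rules for the optimizers above can be written out for a single parameter. This is a rough sketch on a toy loss J(theta) = theta**2, not the Keras implementations; the names `m` (momentum), `s` (accumulated squared gradients), `beta`, `rho` and `eps` are assumptions made here for illustration.

```python
import math

def grad(theta):
    """Gradient of the toy loss J(theta) = theta ** 2."""
    return 2.0 * theta

def momentum_step(theta, m, lr=0.1, beta=0.9, nesterov=False):
    # Nesterov measures the gradient slightly ahead, at theta + beta * m
    point = theta + beta * m if nesterov else theta
    m = beta * m - lr * grad(point)
    return theta + m, m

def adagrad_step(theta, s, lr=0.1, eps=1e-7):
    g = grad(theta)
    s = s + g * g                      # accumulates over all iterations
    return theta - lr * g / math.sqrt(s + eps), s

def rmsprop_step(theta, s, lr=0.1, rho=0.9, eps=1e-7):
    g = grad(theta)
    s = rho * s + (1.0 - rho) * g * g  # exponentially decaying accumulation
    return theta - lr * g / math.sqrt(s + eps), s

def adam_step(theta, m, s, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-7):
    g = grad(theta)
    m = beta1 * m + (1.0 - beta1) * g        # decaying average of gradients
    s = beta2 * s + (1.0 - beta2) * g * g    # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)           # bias correction for early steps
    s_hat = s / (1.0 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(s_hat) + eps), m, s

# Run momentum optimization from theta = 1.0 towards the optimum at 0
theta, m = 1.0, 0.0
for _ in range(300):
    theta, m = momentum_step(theta, m)
print(theta)
```

Each function takes the current state and returns the updated state, so any of them can be dropped into the same loop to compare how quickly they approach the optimum.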

In [None]:
# Setting up momentum in keras

# optimiser = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
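The learning rate given to an optimizer does not have to stay fixed. As one possible schedule, the sketch below implements exponential decay, where the rate drops by a factor of 10 every `s` epochs; the function name and the values of `lr0` and `s` are illustrative assumptions.

```python
def exponential_decay(lr0, s):
    """Return a schedule that divides the learning rate by 10 every s epochs."""
    def schedule(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return schedule

schedule = exponential_decay(lr0=0.01, s=20)
rates = [schedule(epoch) for epoch in (0, 20, 40)]
print(rates)
```

In Keras, a function of this shape can be passed to `keras.callbacks.LearningRateScheduler`, which calls it at the start of each epoch.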

### 4. Avoiding overfitting through regularization 

* l1 and l2 regularization add a penalty on the size of the connection weights to the loss, stopping the weights from becoming too large
* dropout: highly successful; at every training step, each neuron except the output neurons has a probability p of being "dropped out", i.e. ignored during this training step
* the dropout rate p is typically set between 10% and 50%
* intuition behind dropout: neurons cannot rely too heavily on particular inputs, so they learn more robust features that generalize better
* dropout is only active during training, so train and test performance are not directly comparable; evaluate the training loss with dropout turned off before judging whether the model is overfitting
* Monte Carlo (MC) dropout: keep dropout active at prediction time, make many predictions for the same input, then stack and average them; this MC estimate is generally more reliable than a single prediction from the same model with dropout turned off
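Dropout and MC dropout can be illustrated without a framework. The sketch below uses a hypothetical one-layer `toy_model` with fixed weights, made up purely for illustration, and assumes the common "inverted" form of dropout, where surviving activations are divided by the keep probability so their expected value is unchanged.

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; scale survivors by 1 / (1 - p)."""
    if not training:
        return list(activations)  # dropout is a no-op at test time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def toy_model(x, rng, training):
    """Hypothetical model: dropout on the inputs, then a fixed weighted sum."""
    weights = [0.5, -1.0, 2.0]
    h = dropout(x, p=0.2, rng=rng, training=training)
    return sum(w * v for w, v in zip(weights, h))

def mc_dropout_predict(x, n_samples=1000, seed=0):
    """Average many stochastic predictions made with dropout left on."""
    rng = random.Random(seed)
    samples = [toy_model(x, rng, training=True) for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    std = (sum((s - mean) ** 2 for s in samples) / n_samples) ** 0.5
    return mean, std

x = [1.0, 1.0, 1.0]
single = toy_model(x, random.Random(0), training=False)
mc_mean, mc_std = mc_dropout_predict(x)
print(single, mc_mean, mc_std)
```

The spread of the sampled predictions (`mc_std`) also gives a rough indication of the model's uncertainty. With a Keras model, the equivalent step is calling the model with `training=True` several times on the same batch and averaging the stacked outputs.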