Problems that occur when training complex DNNs

- Vanishing gradients problem or the related exploding gradients problem. This is when the gradients grow smaller and smaller, or larger and larger, when flowing backward through the DNN during training. Both of these problems make lower layers very hard to train.

- You might not have enough training data for such a large network, or it might be too costly to label.

- Training may be extremely slow.

- A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or if they are too noisy.

# The Vanishing/Exploding Gradients Problems

- **Vanishing gradients** problem: Gradients often get smaller and smaller as the algorithm progresses down to the lower layers, an issue for backpropagation. As a result, the Gradient Descent update leaves the lower layers’ connection weights virtually unchanged, and training never converges to a good solution.

- **Exploding gradients** problem: the gradients can grow bigger and bigger until layers get insanely large weight updates and the algorithm diverges. This problem surfaces mostly in recurrent neural networks.

A major reason these problems existed was the way weights were initialized, which made the variance of each layer's outputs differ from the variance of its inputs. As signals flowed through the layers, the activation functions saturated in regions where their derivatives are near zero. This left almost no gradient to propagate back through the network, so training never reached a good solution.

Many smart people worked on this and found initialization schemes that control the variance of the weights and make convergence much more likely.


Table 11-1. Initialization parameters for each type of activation function


| Initialization | Activation functions          | σ² (Normal)                       |
|----------------|-------------------------------|-----------------------------------|
| Glorot         | None, tanh, logistic, softmax | $1 / \mathrm{fan}_{\mathrm{avg}}$ |
| He             | ReLU and variants             | $2 / \mathrm{fan}_{\mathrm{in}}$  |
| LeCun          | SELU                          | $1 / \mathrm{fan}_{\mathrm{in}}$  |
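
The variances in Table 11-1 can be sketched as plain arithmetic. The snippet below is only an illustration: the function names and the 784-input, 300-output layer are assumptions made for the example, and the uniform bound uses the fact that a uniform distribution on [-r, r] has variance r²/3.

```python
import math

def init_variance(scheme, fan_in, fan_out):
    """Variance of a zero-mean normal initializer, following Table 11-1."""
    fan_avg = (fan_in + fan_out) / 2
    if scheme == "glorot":
        return 1.0 / fan_avg
    if scheme == "he":
        return 2.0 / fan_in
    if scheme == "lecun":
        return 1.0 / fan_in
    raise ValueError(f"unknown scheme: {scheme}")

def uniform_limit(variance):
    """Bound r such that a uniform distribution on [-r, r] has this variance."""
    return math.sqrt(3.0 * variance)

# Hypothetical dense layer with 784 inputs and 300 outputs.
for scheme in ("glorot", "he", "lecun"):
    var = init_variance(scheme, fan_in=784, fan_out=300)
    print(scheme, var, uniform_limit(var))
```

Note that He initialization uses twice the LeCun variance for the same fan-in, which is consistent with ReLU zeroing out roughly half of its inputs.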


By default, Keras uses Glorot initialization with a uniform distribution. When creating a layer, you can change this to He initialization by setting kernel_initializer="he_uniform" or kernel_initializer="he_normal" like this



In [1]:
import tensorflow as tf
from tensorflow import keras

In [2]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

<tensorflow.python.keras.layers.core.Dense at 0x221e57671f0>

In [3]:
# If you want He initialization with a uniform distribution but based on fanavg rather than fanin, 
# you can use the VarianceScaling initializer like this:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg',
                                                 distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

<tensorflow.python.keras.layers.core.Dense at 0x221e57fe040>

## Nonsaturating Activation Functions

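- ReLU works better than Mother Nature's sigmoid function in deep networks, as it does not saturate for positive values.

- ReLUs suffer from the *dying ReLUs* problem, especially with a large learning rate. ReLU(z) = max(0, z), so if a neuron's weights get updated such that the weighted sum of its inputs is negative for every training instance, the neuron outputs only 0. The gradient of ReLU is also 0 for negative inputs, so Gradient Descent stops updating the neuron and it stays dead.

To solve this, use a variant of ReLU such as the leaky ReLU: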

- **Leaky ReLU**: LeakyReLUα(z) = max(αz, z). The hyperparameter α defines how much the function “leaks”: it is the slope of the function for z < 0 and is typically set to 0.01. 

There are variations of the leaky ReLU that have been found to outperform it.

- **Randomized leaky ReLU (RReLU)**: α is picked randomly in a given range during training and fixed to an average value during testing. It was reported to outperform ReLU and seemed to act like a regularizer, reducing the risk of overfitting the training set.

- Choosing a large leak (α = 0.2) did better than a small leak (α = 0.01).

- **Parametric leaky ReLU (PReLU)**: α is learned during training instead of being a hyperparameter. It does well on large image datasets but risks overfitting on smaller datasets.

- **Exponential linear unit (ELU)**: outperformed all the ReLU variants in its authors' experiments. Training time was reduced, and the neural network performed better on the test set.

Equation 11-2. ELU activation function

$$\mathrm{ELU}_{\alpha}(z)= \begin{cases}\alpha(\exp (z)-1) & \text { if } z<0 \\ z & \text { if } z \geq 0\end{cases}$$

The ELU activation function looks a lot like the ReLU function, with a few major differences:

- It takes on negative values when z < 0, which allows the unit to have an average output closer to 0 and helps alleviate the vanishing gradients problem. The hyperparameter α defines the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter.

- It has a nonzero gradient for z < 0, which avoids the dead neurons problem.

- If α is equal to 1 then the function is smooth everywhere, including around z = 0, which helps speed up Gradient Descent since it does not bounce as much to the left and right of z = 0.

The main drawback of the ELU activation function is that it is slower to compute than the ReLU function and its variants. Its faster convergence rate during training compensates for that slow computation, but still, at test time an ELU network will be slower than a ReLU network.
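
To make the differences concrete, here is a minimal pure-Python sketch of ReLU, leaky ReLU, and ELU for a single scalar input. The function names and sample inputs are assumptions chosen for illustration, not the Keras implementations.

```python
import math

def relu(z):
    # Outputs 0 (and has zero gradient) for every negative input.
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z): slope alpha for z < 0, slope 1 for z >= 0
    # (this form assumes 0 <= alpha <= 1).
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    # Equation 11-2: alpha * (exp(z) - 1) for z < 0, z otherwise.
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(z, relu(z), leaky_relu(z), elu(z))
```

For large negative inputs ELU approaches -α, while leaky ReLU keeps decreasing linearly and ReLU stays at 0. The call to `math.exp` is also what makes ELU slower to compute than the ReLU variants.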


- **Scaled ELU (SELU) activation function**: as its name suggests, it is a scaled variant of the ELU activation function.

 - If you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function, then the network will self-normalize: the output of each layer will tend to preserve a mean of 0 and standard deviation of 1 during training, which solves the vanishing/exploding gradients problem.
- The SELU activation function often significantly outperforms other activation functions for such neural nets, but there are a few requirements for self-normalization to happen:

 - The input features must be standardized (mean 0 and standard deviation 1).

 - Every hidden layer’s weights must be initialized with LeCun normal initialization. In Keras, this means setting kernel_initializer="lecun_normal".

 - The network’s architecture must be sequential. Unfortunately, if you try to use SELU in nonsequential architectures, such as recurrent networks (see Chapter 15) or networks with skip connections (i.e., connections that skip layers, such as in Wide & Deep nets), self-normalization will not be guaranteed, so SELU will not necessarily outperform other activation functions.

 - The paper only guarantees self-normalization if all layers are dense, but some researchers have noted that the SELU activation function can improve performance in convolutional neural nets as well
 
 
 ____
- **TIPS**: So, which activation function should you use for the hidden layers of your deep neural networks? Although your mileage will vary, 

 - in general SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. 
 
 - If the network’s architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at z = 0). 
 
 - If you care a lot about runtime latency, then you may prefer leaky ReLU. If you don’t want to tweak yet another hyperparameter, you may use the default α values used by Keras (e.g., 0.3 for leaky ReLU). 
 
 - If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, such as RReLU if your network is overfitting or PReLU if you have a huge training set. That said, because ReLU is the most used activation function (by far), many libraries and hardware accelerators provide ReLU-specific optimizations; therefore, if speed is your priority, ReLU might still be the best choice.
 ____
 
 To use the leaky ReLU activation function, create a LeakyReLU layer and add it to your model just after the layer you want to apply it to:

In [4]:
model = keras.models.Sequential([
    #Stuff 
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),
    #More stuff
    ])


For PReLU, replace LeakyReLU(alpha=0.2) with PReLU(). There is currently no official implementation of RReLU in Keras, but you can fairly easily implement your own.

For SELU activation, set activation="selu" and kernel_initializer="lecun_normal" when creating a layer:


In [5]:
layer = keras.layers.Dense(10, activation="selu",
                           kernel_initializer="lecun_normal")

## Batch Normalization

Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the danger of the vanishing/exploding gradients problems at the beginning of training, it doesn’t guarantee that they won’t come back during training.

**Batch Normalization (BN)**: addresses vanishing/exploding gradients problems.
- The technique consists of adding an operation in the model just before or after the activation function of each hidden layer
- This operation simply zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer:
 - one for scaling, the other for shifting.
 - In other words, the operation lets the model learn the optimal scale and mean of each of the layer’s inputs. In many cases, if you add a BN layer as the very first layer of your neural network, you do not need to standardize your training set (e.g., using a StandardScaler); the BN layer will do it for you (well, approximately, since it only looks at one batch at a time, and it can also rescale and shift each input feature).
 
- In order to zero-center and normalize the inputs, the algorithm needs to estimate each input’s mean and standard deviation. It does so by evaluating the mean and standard deviation of the input over the current mini-batch (hence the name “Batch Normalization”). 


The whole operation is summarized step by step in Equation 11-3.

Equation 11-3. Batch Normalization algorithm


1. $\quad \boldsymbol{\mu}_{B}=\frac{1}{m_{B}} \sum_{i=1}^{m_{B}} \mathbf{x}^{(i)}$
2. $\quad \boldsymbol{\sigma}_{B}^{2}=\frac{1}{m_{B}} \sum_{i=1}^{m_{B}}\left(\mathbf{x}^{(i)}-\boldsymbol{\mu}_{B}\right)^{2}$
3. $\quad \widehat{\mathbf{x}}^{(i)}=\frac{\mathbf{x}^{(i)}-\boldsymbol{\mu}_{B}}{\sqrt{\boldsymbol{\sigma}_{B}^{2}+\varepsilon}}$
4. $\quad \mathbf{z}^{(i)}=\boldsymbol{\gamma} \otimes \widehat{\mathbf{x}}^{(i)}+\boldsymbol{\beta}$


In this algorithm:

- $μ_B$ is the vector of input means, evaluated over the whole mini-batch B (it contains one mean per input).

- $σ_B$ is the vector of input standard deviations, also evaluated over the whole mini-batch (it contains one standard deviation per input).

- $m_B$ is the number of instances in the mini-batch.

- $\hat{x}^{(i)}$ is the vector of zero-centered and normalized inputs for instance i.

- $γ$ is the output scale parameter vector for the layer (it contains one scale parameter per input).

- $⊗$ represents element-wise multiplication (each input is multiplied by its corresponding output scale parameter).

- $β$ is the output shift (offset) parameter vector for the layer (it contains one offset parameter per input). Each input is offset by its corresponding shift parameter.

- $ε$ is a tiny number that avoids division by zero (typically $10^{-5}$). This is called a smoothing term.

- $z^{(i)}$ is the output of the BN operation. It is a rescaled and shifted version of the inputs.
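
The four steps of Equation 11-3 can be sketched in pure Python for a tiny mini-batch. This only illustrates the training-time computation: the function name and sample data are assumptions, and real implementations vectorize these operations.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Apply Equation 11-3 to a mini-batch given as a list of input vectors."""
    m = len(batch)       # m_B: number of instances in the mini-batch
    n = len(batch[0])    # number of inputs per instance
    # Step 1: one mean per input.
    mu = [sum(x[j] for x in batch) / m for j in range(n)]
    # Step 2: one variance per input.
    var = [sum((x[j] - mu[j]) ** 2 for x in batch) / m for j in range(n)]
    # Step 3: zero-center and normalize each input.
    x_hat = [[(x[j] - mu[j]) / math.sqrt(var[j] + eps) for j in range(n)]
             for x in batch]
    # Step 4: rescale and shift with the learned gamma and beta.
    z = [[gamma[j] * xh[j] + beta[j] for j in range(n)] for xh in x_hat]
    return mu, var, z

# Two inputs with very different scales (made-up values).
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
mu, var, z = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(mu, var)
print(z)
```

With γ = 1 and β = 0, both output columns end up on the same scale even though the raw inputs differ by a factor of 200.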

We may need to make predictions for individual instances rather than for batches of instances: in this case, we will have no way to compute each input’s mean and standard deviation.

Even if we do have a batch of instances, it may be too small, or the instances may not be independent and identically distributed, so computing statistics over the batch instances would be unreliable. Most implementations therefore estimate “final” input statistics during training, using a moving average of each layer’s input means and standard deviations, and use those statistics at test time.

To sum up, four parameter vectors are learned in each batch-normalized layer: γ (the output scale vector) and β (the output offset vector) are learned through regular backpropagation, and μ (the final input mean vector) and σ (the final input standard deviation vector) are estimated using an exponential moving average. Note that μ and σ are estimated during training, but they are used only after training (to replace the batch input means and standard deviations in Equation 11-3).

The creators of this method reported that Batch Normalization considerably improved all the deep neural networks they experimented with, leading to a large improvement on the ImageNet classification task.

- **ImageNet** is a large database of images classified into many classes, commonly used to evaluate computer vision systems.

The vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as the tanh and even the logistic activation function. The networks were also much less sensitive to the weight initialization. The authors were able to use much larger learning rates, significantly speeding up the learning process.

- Batch Normalization acts like a regularizer, reducing the need for other regularization techniques (such as dropout, described later in this chapter).

- However, there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer.

Fortunately, it’s often possible to fuse the BN layer with the previous layer after training, thereby avoiding the runtime penalty.
 - This is done by updating the previous layer’s weights and biases so that it directly produces outputs of the appropriate scale and offset.
 
- **NOTE**: You may find that training is rather slow, because each epoch takes much more time when you use Batch Normalization. This is usually counterbalanced by the fact that convergence is much faster with BN, so it will take fewer epochs to reach the same performance. All in all, wall time will usually be shorter (this is the time measured by the clock on your wall).

## Implementing Batch Normalization with Keras

As with most things with Keras, implementing Batch Normalization is simple and intuitive. Just add a BatchNormalization layer before or after each hidden layer’s activation function, and optionally add a BN layer as well as the first layer in your model. For example, this model applies BN after every hidden layer and as the first layer in the model (after flattening the input images):



In [6]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])
# That’s all! In this tiny example with just two hidden layers, it’s unlikely that Batch Normalization will have a very 
# positive impact; but for deeper networks it can make a tremendous difference.

In [7]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense_4 (Dense)              (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_5 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_6 (Dense)              (None, 10)               

Each BN layer adds four parameters per input: $γ$, $β$, $μ$, and $σ$ (for example, the first BN layer adds 3,136 parameters, which is 4 × 784). The last two parameters, $μ$ and $σ$, are the moving averages; they are not affected by backpropagation, so Keras calls them “non-trainable”.

If you count the total number of BN parameters, 3,136 + 1,200 + 400 = 4,736, and divide by 2, you get 2,368, which is the total number of non-trainable parameters in this model.
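
The parameter counts in the summary can be double-checked with a few lines of arithmetic. The helper names below are assumptions made for this sketch; the layer sizes come from the model above.

```python
def bn_param_count(n_inputs):
    # gamma, beta, moving mean, and moving variance: four values per input.
    return 4 * n_inputs

def dense_param_count(n_inputs, n_units, use_bias=True):
    # One weight per (input, unit) pair, plus one bias per unit.
    return n_inputs * n_units + (n_units if use_bias else 0)

bn_total = sum(bn_param_count(n) for n in (784, 300, 100))
non_trainable = bn_total // 2  # only the moving averages are non-trainable

print(bn_total, non_trainable)
print(dense_param_count(784, 300), dense_param_count(300, 100))
```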

Let’s look at the parameters of the first BN layer. Two are trainable (by backpropagation), and two are not:


In [8]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

In [9]:
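# When you create a BN layer in Keras, it also sets up the updates of the moving averages,
# which Keras applies at each iteration during training. In the TensorFlow version used here,
# the `updates` attribute returns an empty list (see the output below), which suggests the
# moving averages are updated inside the layer's call rather than through separate operations.
model.layers[1].updates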



[]

The authors of the BN paper argued in favor of adding the BN layers before the activation functions, rather than after (as we just did). There is some debate about this, as which is preferable seems to depend on the task—you can experiment with this too to see which option works best on your dataset. 

To add the BN layers before the activation functions, you must remove the activation function from the hidden layers and add them as separate layers after the BN layers. Moreover, since a Batch Normalization layer includes one offset parameter per input, you can remove the bias term from the previous layer (just pass use_bias=False when creating it):

In [10]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])

The BatchNormalization class has quite a few hyperparameters you can tweak.

- You may occasionally need to tweak the **momentum**, which the layer uses to update the exponential moving averages.

A good momentum value is typically close to 1; for example, 0.9, 0.99, or 0.999 (you want more 9s for larger datasets and smaller mini-batches).
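
As a rough sketch of how the momentum enters the picture, each new batch statistic is blended into the running estimate with weight 1 - momentum. The function name and the constant batch mean of 5.0 are assumptions for illustration.

```python
def update_moving_average(moving, batch_stat, momentum=0.99):
    # Exponential moving average: keep `momentum` of the old estimate
    # and blend in (1 - momentum) of the new batch statistic.
    return moving * momentum + batch_stat * (1.0 - momentum)

# Hypothetical input whose batch mean is always 5.0.
moving_mean = 0.0
for _ in range(1000):
    moving_mean = update_moving_average(moving_mean, batch_stat=5.0, momentum=0.99)

print(moving_mean)  # close to 5.0 after enough iterations
```

A momentum closer to 1 makes the estimate change more slowly, which averages out the noise from small mini-batches but needs more iterations to settle.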

- Another important hyperparameter is **axis**: it determines which axis should be normalized.
 It defaults to -1, meaning that by default it will normalize the last axis, using the means and standard deviations computed across the other axes.
 
For example, the first BN layer in the previous code example will independently normalize (and rescale and shift) each of the 784 input features. If we move the first BN layer before the Flatten layer, then the input batches will be 3D, with shape [batch size, height, width]; therefore, the BN layer will compute 28 means and 28 standard deviations (1 per column of pixels, computed across all instances in the batch and across all rows in the column), and it will normalize all pixels in a given column using the same mean and standard deviation. There will also be just 28 scale parameters and 28 shift parameters. If instead you still want to treat each of the 784 pixels independently, then you should set axis=[1, 2].
 
 
Notice that the BN layer does not perform the same computation during training and after training: it uses batch statistics during training and the “final” statistics after training.
 
 
BatchNormalization has become one of the most-used layers in deep neural networks, to the point that it is often omitted in the diagrams, as it is assumed that BN is added after every layer. 

# Gradient Clipping

-  **Gradient Clipping**: clip the gradients during backpropagation so that they never exceed some threshold.

This technique is most often used in recurrent neural networks, as Batch Normalization is tricky to use in RNNs. For other types of networks, BN is usually sufficient. In Keras, implementing Gradient Clipping is just a matter of setting the clipvalue or clipnorm argument when creating an optimizer, like this:

In [11]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
# This means that all the partial derivatives of the loss
# will be clipped between -1.0 and 1.0
model.compile(loss="mse", optimizer=optimizer)

If you want to ensure that Gradient Clipping does not change the direction of the gradient vector, you should clip by norm by setting clipnorm instead of clipvalue. This will clip the whole gradient if its ℓ2 norm is greater than the threshold you picked.

For example, if you set clipnorm=1.0, then the vector [0.9, 100.0] will be clipped to [0.00899964, 0.9999595], preserving its orientation but almost eliminating the first component. 
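
The difference between the two clipping modes can be sketched in pure Python on that same vector. The function names are assumptions for illustration; in Keras the optimizer applies the clipping for you.

```python
import math

def clip_by_value(grad, threshold):
    # Clip each component independently to [-threshold, threshold].
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, threshold):
    # Rescale the whole vector if its l2 norm exceeds the threshold.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]

grad = [0.9, 100.0]
print(clip_by_value(grad, 1.0))  # the direction changes
print(clip_by_norm(grad, 1.0))   # the direction is preserved
```

Clipping by value turns [0.9, 100.0] into [0.9, 1.0], which points almost along the diagonal instead of almost along the second axis. Clipping by norm keeps the ratio between the components.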

If you observe that the gradients explode during training (you can track the size of the gradients using TensorBoard), you may want to try both clipping by value and clipping by norm, with different thresholds, and see which option performs best on the validation set.

# Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then reuse the lower layers of this network. This technique is called **transfer learning**. It will not only speed up training considerably, but also require significantly less training data.

- **NOTE**: If the input pictures of your new task don’t have the same size as the ones used in the original task, you will usually have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will work best when the inputs have similar low-level features.

- The output layer of the original model should usually be replaced because it is most likely not useful at all for the new task, and it may not even have the right number of outputs for the new task.

- The upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task. 

You want to find the right number of layers to reuse.

- **TIP**: The more similar the tasks are, the more layers you want to reuse (starting with the lower layers). For very similar tasks, try keeping all the hidden layers and just replacing the output layer.

- Try freezing all the reused layers first (i.e., make their weights non-trainable so that Gradient Descent won’t modify them), then train your model and see how it performs. 

- Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data you have, the more layers you can unfreeze. It is also useful to reduce the learning rate when you unfreeze reused layers: this will avoid wrecking their fine-tuned weights.

- If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freezing all the remaining hidden layers again. You can iterate until you find the right number of layers to reuse.

- If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even adding more hidden layers.

In [12]:
model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1]) # Gets all but the output layer
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))


When you train model_B_on_A, it will also affect model_A. If you want to avoid that, you need to clone model_A before you reuse its layers.

To do this, you clone model A’s architecture with clone_model(), then copy its weights (since clone_model() does not clone the weights):



In [13]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())


Now you could train model_B_on_A for task B, but since the new output layer was initialized randomly it will make large errors (at least during the first few epochs), so there will be large error gradients that may wreck the reused weights. To avoid this, one approach is to freeze the reused layers during the first few epochs, giving the new layer some time to learn reasonable weights. To do this, set every reused layer’s trainable attribute to False and compile the model:



In [14]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd",
                     metrics=["accuracy"])


- **NOTE:** You must always compile your model after you freeze or unfreeze layers.

Now you can train the model for a few epochs, then unfreeze the reused layers (which requires compiling the model again) and continue training to fine-tune the reused layers for task B. After unfreezing the reused layers, it is usually a good idea to reduce the learning rate, once again to avoid damaging the reused weights:

In [15]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

optimizer = keras.optimizers.SGD(learning_rate=1e-4) # the default learning rate is 1e-2
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))


In the book, the code above improves the model's performance, but trying to reproduce that result yields little improvement. The author admits that he used a technique called “torturing the data until it confesses”: he tried many random seeds and many configurations until he found one that showed a strong improvement. Many scientific papers do the same. Most of the time this is not malicious at all, but it is part of the reason so many results in science can never be reproduced.

He had to cheat because transfer learning does not work very well with small dense networks, presumably because small networks learn few patterns, and dense networks learn very specific patterns, which are unlikely to be useful in other tasks.

- Transfer learning works best with deep convolutional neural networks, which tend to learn feature detectors that are much more general (especially in the lower layers).

## Unsupervised Pretraining

- If you can gather plenty of unlabeled training data, you can try to use it to train an unsupervised model, such as an autoencoder or a generative adversarial network. Then you can reuse the lower layers of the autoencoder or the lower layers of the GAN’s discriminator, add the output layer for your task on top, and fine-tune the final network using supervised learning (i.e., with the labeled training examples).

- Unsupervised pretraining used to be the norm for deep nets; only after the vanishing gradients problem was alleviated did it become much more common to train DNNs purely using supervised learning.

- **Unsupervised pretraining** is still a good option when you have a complex task to solve, no similar model you can reuse, and little labeled training data but plenty of unlabeled training data.

In the early days of Deep Learning it was difficult to train deep models, so people would use a technique called **greedy layer-wise pretraining**.

They would first train an unsupervised model with a single layer, typically an RBM (restricted Boltzmann machine), then they would freeze that layer and add another one on top of it, then train the model again (effectively just training the new layer), then freeze the new layer and add another layer on top of it, train the model again, and so on.

Nowadays, people generally train the full unsupervised model in one shot, using autoencoders or GANs rather than RBMs. The figure below illustrates unsupervised pretraining.

![](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/assets/mls2_1105.png)

## Pretraining on an Auxiliary Task

If you do not have much labeled training data, one last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task.

For example, if you want to build a system to recognize faces, you may only have a few pictures of each individual—clearly not enough to train a good classifier. Gathering hundreds of pictures of each person would not be practical. You could, however, gather a lot of pictures of random people on the web and train a first neural network to detect whether or not two different pictures feature the same person. Such a network would learn good feature detectors for faces, so reusing its lower layers would allow you to train a good face classifier that uses little training data.

For natural language processing (NLP) applications, you can download a corpus of millions of text documents and automatically generate labeled data from it. For example, you could randomly mask out some words and train a model to predict what the missing words are.  If you can train a model to reach good performance on this task, then it will already know quite a lot about language, and you can certainly reuse it for your actual task and fine-tune it on your labeled data.

- **NOTE**: Self-supervised learning is when you automatically generate the labels from the data itself, then you train a model on the resulting “labeled” dataset using supervised learning techniques. Since this approach requires no human labeling whatsoever, it is best classified as a form of unsupervised learning.
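
The word-masking idea can be sketched in a few lines: labels are generated from the text itself, with no human annotation. Everything here (the `[MASK]` placeholder, the function name, the masking probability, the sample sentence) is an assumption chosen for illustration rather than a specific library's scheme.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token

def make_masked_example(tokens, mask_prob=0.15, rng=None):
    """Return (inputs, labels): inputs with some tokens masked out, and a
    dict mapping each masked position to the original token to predict."""
    rng = rng or random.Random()
    inputs, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels[i] = tok  # the label comes from the data itself
        else:
            inputs.append(tok)
    return inputs, labels

tokens = "the cat sat on the mat".split()
inputs, labels = make_masked_example(tokens, mask_prob=0.3, rng=random.Random(42))
print(inputs)
print(labels)
```

Running this over a large corpus would yield as many (inputs, labels) training pairs as you like, which a model could then be trained on with ordinary supervised techniques.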



# Faster Optimizers

A huge speed boost comes from using a faster optimizer than the regular Gradient Descent optimizer. In this section we will present the most popular algorithms: momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam and Nadam optimization.

## Momentum Optimization

Imagine a bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity. This is the very simple idea behind momentum optimization. In contrast, regular Gradient Descent will simply take small, regular steps down the slope, so the algorithm will take much more time to reach the bottom.

Recall that Gradient Descent updates the weights $θ$ by directly subtracting the gradient of the cost function $J(θ)$ with regard to the weights ($∇_θJ(θ)$) multiplied by the learning rate $η$. The equation is: $θ ← θ – η∇_θJ(θ)$. It does not care about what the earlier gradients were. If the local gradient is tiny, it goes very slowly.


Momentum optimization cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient (multiplied by the learning rate $η$) from the **momentum vector** $m$, and it updates the weights by adding this momentum vector. In other words, the gradient is used for acceleration, not for speed.

To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter β, called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.

Momentum algorithm
$$
\begin{aligned}
&\mathbf{m} \leftarrow \beta \mathbf{m}-\eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \\
&\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\mathbf{m}
\end{aligned}
$$

You can easily verify that if the gradient remains constant, the terminal velocity (i.e., the maximum size of the weight updates) is equal to that gradient multiplied by the learning rate η multiplied by 1/(1–β), ignoring the sign.

For example, if β = 0.9, then the terminal velocity is equal to 10 times the gradient times the learning rate, so momentum optimization ends up going 10 times faster than Gradient Descent!
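
This terminal velocity can be checked numerically by iterating the momentum update with a constant gradient. The gradient value, learning rate, and function name below are assumptions picked for illustration.

```python
def terminal_velocity(gradient, eta, beta, n_steps=500):
    # Repeatedly apply m <- beta * m - eta * gradient with a constant gradient.
    m = 0.0
    for _ in range(n_steps):
        m = beta * m - eta * gradient
    return m

gradient, eta, beta = 2.0, 0.01, 0.9
m = terminal_velocity(gradient, eta, beta)
plain_step = -eta * gradient  # step taken by regular Gradient Descent

print(m, plain_step, m / plain_step)
```

The momentum vector settles at -η × gradient / (1 - β), which here is 10 times the regular Gradient Descent step.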

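When the inputs have very different scales, the cost function looks like an elongated bowl: Gradient Descent goes down the steep slope quite fast, but then it takes a very long time to go down the valley. In contrast, momentum optimization rolls down the valley faster and faster until it reaches the bottom.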

In deep neural networks that don’t use Batch Normalization, the upper layers will often end up having inputs with very different scales, so using momentum optimization helps a lot. It can also help roll past local optima.

- **NOTE**: Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons it’s good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.

Implementing momentum optimization in Keras is a no-brainer: just use the SGD optimizer and set its momentum hyperparameter, then lie back and profit!

In [16]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)



The one drawback of momentum optimization is that it adds yet another hyperparameter to tune. However, the momentum value of 0.9 usually works well in practice and almost always goes faster than regular Gradient Descent.

## Nesterov Accelerated Gradient

The Nesterov Accelerated Gradient (NAG) method, also known as Nesterov momentum optimization, is a small variant of momentum optimization: it measures the gradient of the cost function not at the local position θ but slightly ahead in the direction of the momentum, at $θ + βm$.

Nesterov Accelerated Gradient algorithm
$$
\begin{aligned}
&\mathbf{m} \leftarrow \beta \mathbf{m}-\eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}+\beta \mathbf{m}) \\
&\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\mathbf{m}
\end{aligned}
$$

This small tweak works because in general the momentum vector will be pointing in the right direction, so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position. 

![](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/assets/mls2_1106.png)

As you can see, the Nesterov update ends up slightly closer to the optimum. After a while, these small improvements add up and NAG ends up being significantly faster than regular momentum optimization.

Moreover, note that when the momentum pushes the weights across a valley, ∇1 (the gradient measured at the starting point θ) continues to push farther across the valley, while ∇2 (the gradient measured at θ + βm) pushes back toward the bottom of the valley. This helps reduce oscillations and thus NAG converges faster.
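
To make the difference concrete, here is a small sketch comparing one regular momentum step with one Nesterov step. It assumes a toy one-dimensional cost $J(θ) = θ^2$ and hand-picked starting values, so it illustrates the mechanism rather than proving anything about real networks.

```python
def grad(theta):
    """Gradient of the toy cost J(theta) = theta**2 (illustrative assumption)."""
    return 2.0 * theta

def momentum_step(theta, m, eta=0.1, beta=0.9):
    m = beta * m - eta * grad(theta)               # gradient at the current position
    return theta + m, m

def nesterov_step(theta, m, eta=0.1, beta=0.9):
    m = beta * m - eta * grad(theta + beta * m)    # gradient measured slightly ahead
    return theta + m, m

# Close to the minimum at 0, with a large momentum about to carry us past it
theta0, m0 = 0.1, -0.5
theta_mom, _ = momentum_step(theta0, m0)
theta_nag, _ = nesterov_step(theta0, m0)
print(f"momentum: {theta_mom:.2f}, Nesterov: {theta_nag:.2f}")
```

In this setup the look-ahead gradient already points back toward the minimum, so the Nesterov update overshoots less than the regular momentum update.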
 
To use NAG, simply set nesterov=True when creating the SGD optimizer:

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

## AdaGrad

Consider the elongated bowl problem again: Gradient Descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum, then it very slowly goes down to the bottom of the valley.

It would be nice if the algorithm could correct its direction earlier to point a bit more toward the global optimum. The AdaGrad algorithm achieves this correction by scaling down the gradient vector along the steepest dimensions.

Equation 11-6. AdaGrad algorithm
$$
\begin{aligned}
&\mathbf{s} \leftarrow \mathbf{s}+\nabla_{\theta} J(\boldsymbol{\theta}) \otimes \nabla_{\theta} J(\boldsymbol{\theta}) \\
&\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}-\eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \oslash \sqrt{\mathbf{s}+\varepsilon}
\end{aligned}
$$

- The first step accumulates the square of the gradients into the vector $s$ (the ⊗ symbol represents element-wise multiplication).
- Each $s_i$ accumulates the squares of the partial derivative of the cost function with regard to parameter $θ_i$.
- If the cost function is steep along the $i$-th dimension, then $s_i$ will get larger and larger at each iteration.

- The second step is almost identical to Gradient Descent, but the gradient vector is scaled down by a factor of $\sqrt{s+\varepsilon}$ (the ⊘ symbol represents element-wise division).
- ε is a smoothing term to avoid division by zero, typically set to $10^{-10}$.

In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an **adaptive learning rate**. 

- It helps point the resulting updates more directly toward the global optimum 
- it requires much less tuning of the learning rate hyperparameter η
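
As a rough illustration of the adaptive learning rate, the sketch below applies a single AdaGrad update to two parameters whose gradients differ by a factor of 100. The helper name and the gradient values are assumptions made for the example.

```python
import math

def adagrad_step(theta, s, grads, eta=0.1, eps=1e-10):
    """One AdaGrad update on plain Python lists (element-wise operations)."""
    s = [s_i + g * g for s_i, g in zip(s, grads)]          # accumulate squared gradients
    theta = [t - eta * g / math.sqrt(s_i + eps)            # step scaled per dimension
             for t, g, s_i in zip(theta, grads, s)]
    return theta, s

theta = [1.0, 1.0]
grads = [10.0, 0.1]       # steep first dimension, gentle second one (assumption)
new_theta, s = adagrad_step(theta, [0.0, 0.0], grads)
steps = [old - new for old, new in zip(theta, new_theta)]
print(steps)              # both steps are close to eta
```

Plain Gradient Descent would have moved 100 times farther along the steep dimension than along the gentle one; here both dimensions move by roughly η.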

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781492032632/files/assets/mls2_1107.png)

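- AdaGrad frequently performs well for simple quadratic problems, but it often stops too early when training neural networks: the learning rate gets scaled down so much that the algorithm stops before reaching the global optimum.
- Even though Keras has an Adagrad optimizer, you should not use it to train deep neural networks.
- Use it for simpler ML tasks, such as Linear Regression.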

## RMSProp

The RMSProp algorithm fixes AdaGrad's premature slowdown by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training).

- It does so by using exponential decay in the first step

Equation 11-7. RMSProp algorithm

$$
\begin{aligned}
&\mathbf{s} \leftarrow \beta \mathbf{s}+(1-\beta) \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \otimes \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \\
&\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}-\eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \oslash \sqrt{\mathbf{s}+\varepsilon}
\end{aligned}
$$

- The decay rate β is typically set to 0.9. Yes, it is once again a new hyperparameter, but this default value often works well, so you may not need to tune it at all.
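
The effect of the exponential decay can be seen by comparing how $s$ evolves under both rules. This is a minimal sketch under a simplifying assumption (a constant gradient of 1 for 100 steps, with the smoothing term ε left out); the helper name is hypothetical.

```python
import math

def accumulate(grads, beta=None):
    """Return s after seeing all grads: AdaGrad if beta is None, else RMSProp."""
    s = 0.0
    for g in grads:
        if beta is None:
            s = s + g * g                          # AdaGrad: ever-growing sum
        else:
            s = beta * s + (1 - beta) * g * g      # RMSProp: decaying average
    return s

grads = [1.0] * 100                        # constant gradient (assumption)
s_adagrad = accumulate(grads)
s_rmsprop = accumulate(grads, beta=0.9)
scale_adagrad = 1 / math.sqrt(s_adagrad)   # factor applied to eta in the second step
scale_rmsprop = 1 / math.sqrt(s_rmsprop)
print(scale_adagrad, scale_rmsprop)
```

After 100 identical gradients, AdaGrad has shrunk the effective learning rate tenfold, while RMSProp leaves it essentially unchanged.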

Keras has an RMSprop optimizer. Note that its rho argument corresponds to β.

RMSProp was preferred over AdaGrad until Adam optimization came around.

## Adam and Nadam Optimization

Adam, which stands for adaptive moment estimation, combines the ideas of momentum optimization and RMSProp.

Like momentum optimization, it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients

Equation Adam algorithm

1. $\mathbf{m} \leftarrow \beta_{1} \mathbf{m}-\left(1-\beta_{1}\right) \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
2. $\mathbf{s} \leftarrow \beta_{2} \mathbf{s}+\left(1-\beta_{2}\right) \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \otimes \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
3. $\widehat{\mathbf{m}} \leftarrow \frac{\mathbf{m}}{1-\beta_{1}^{t}}$
4. $\widehat{\mathbf{s}} \leftarrow \frac{\mathbf{s}}{1-\beta_{2}^{t}}$
5. $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\eta \widehat{\mathbf{m}} \oslash \sqrt{\widehat{\mathbf{s}}+\varepsilon}$

- In this equation, t represents the iteration number (starting at 1).

Step 1 computes an exponentially decaying average rather than an exponentially decaying sum, but these are actually equivalent except for a constant factor.

- Steps 3 and 4 are somewhat of a technical detail: since m and s are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps will help boost m and s at the beginning of training.

- The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often initialized to 0.999.

- The smoothing term ε is usually initialized to a tiny number such as $10^{-7}$.

Since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.
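
The five steps of the Adam algorithm listed above can be sketched for a single scalar parameter as follows. The function name and the example gradient are assumptions; the sign convention follows the equations in these notes (m accumulates the negative gradient and is then added to θ).

```python
import math

def adam_step(theta, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One scalar Adam update following steps 1 to 5."""
    m = beta1 * m - (1 - beta1) * grad                     # step 1: decaying mean of gradients
    s = beta2 * s + (1 - beta2) * grad * grad              # step 2: decaying mean of squares
    m_hat = m / (1 - beta1 ** t)                           # step 3: bias correction for m
    s_hat = s / (1 - beta2 ** t)                           # step 4: bias correction for s
    theta = theta + eta * m_hat / math.sqrt(s_hat + eps)   # step 5: scaled update
    return theta, m, s

theta, m, s = adam_step(theta=1.0, m=0.0, s=0.0, grad=4.0, t=1)
print(1.0 - theta)   # size of the first step, close to eta
```

On the very first iteration the bias-corrected estimates equal the gradient and its square, so the step has a magnitude of about η regardless of how large the gradient is.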

How to implement



In [1]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

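There are two variations of Adam optimization that are worth mentioning: AdaMax, which scales the updates by the ℓ∞ norm (the max) of the time-decayed gradients instead of their ℓ2 norm, and Nadam, which is Adam optimization plus the Nesterov trick.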

- **Warning**: Adaptive optimization methods (including RMSProp, Adam, and Nadam optimization) are often great, converging fast to a good solution. A paper showed they can lead to solutions that generalize poorly on some datasets. So when you are disappointed by your model’s performance, try using plain Nesterov Accelerated Gradient instead: your dataset may just be allergic to adaptive gradients. 

All the optimization techniques discussed so far only rely on the first-order partial derivatives (Jacobians). 

The optimization literature also contains amazing algorithms based on the second-order partial derivatives (the Hessians, which are the partial derivatives of the Jacobians). Unfortunately, these algorithms are very hard to apply to deep neural networks because there are $n^2$ Hessians per output (where n is the number of parameters), as opposed to just n Jacobians per output. Since DNNs typically have tens of thousands of parameters, the second-order optimization algorithms often don’t even fit in memory, and even when they do, computing the Hessians is just too slow.
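
A back-of-the-envelope calculation shows how quickly the memory cost grows. This sketch assumes a dense Hessian stored as 32-bit floats and a hypothetical network with 50,000 parameters.

```python
def hessian_gigabytes(n_params, bytes_per_value=4):
    """Memory needed to store a dense n x n Hessian (32-bit floats by default)."""
    return n_params ** 2 * bytes_per_value / 1e9

def gradient_gigabytes(n_params, bytes_per_value=4):
    """Memory needed to store the n first-order partial derivatives."""
    return n_params * bytes_per_value / 1e9

n = 50_000   # "tens of thousands" of parameters (assumption)
print(f"Hessian: {hessian_gigabytes(n)} GB, gradient: {gradient_gigabytes(n)} GB")
```

The gradient fits in a fraction of a megabyte, while the Hessian for the same small network already needs about 10 GB.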

## Training Sparse Models 

All the optimization algorithms just presented produce dense models, meaning that most parameters will be nonzero. If you need a blazingly fast model at runtime, or if you need it to take up less memory, you may prefer to end up with a sparse model instead.

- One easy way to achieve this is to train the model as usual, then get rid of the tiny weights (set them to zero). Note that this will typically not lead to a very sparse model, and it may degrade the model’s performance.

- A better option is to apply strong ℓ1 regularization during training. 

- If these techniques remain insufficient, check out the TensorFlow Model Optimization Toolkit (TF-MOT)
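
The first option (zeroing out tiny weights after training) can be sketched in a few lines. The threshold, the helper names, and the weight values below are made up for illustration.

```python
def prune_small_weights(weights, threshold=1e-2):
    """Set every weight whose magnitude is below the threshold to exactly zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

weights = [0.8, -0.003, 0.0005, -1.2, 0.02, 0.009]   # hypothetical trained weights
pruned = prune_small_weights(weights)
print(pruned, sparsity(pruned))
```

Choosing the threshold is the delicate part: too low and the model stays dense, too high and accuracy suffers.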


Optimizer comparison (\* is bad, \*\* is average, and \*\*\* is good):

|Class	|Convergence speed	|Convergence quality|
|-------|-------------------|-------------------|
| SGD   | *                 | ***  |
|SGD(momentum=...)| **      |***   |
|SGD(momentum=..., nesterov=True)|  ** |  *** |
|Adagrad | *** |  * (stops too early) |
|RMSprop| *** |   ** or *** | 
| Adam  | *** |  ** or ***  |
| Nadam | *** |  ** or ***  |
| AdaMax| *** |  ** or ***  |

# Learning Rate Scheduling

Finding a good learning rate is very important. If you set it much too high, training may diverge. 

If you set it too low, training will eventually converge to the optimum, but it will take a very long time.

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781492032632/files/assets/mls2_1108.png)

- You can find a good learning rate by training the model for a few hundred iterations while exponentially increasing the learning rate from a very small value to a very large one, then picking a rate slightly lower than the one at which the loss starts shooting back up. See the Chapter 10 exercise.

- If you start with a large learning rate and then reduce it once training stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate. Such strategies are called learning schedules.

Types of learning schedules. 

Power scheduling

    Set the learning rate to a function of the iteration number t: η(t) = η_0 / (1 + t/s)^c. The initial learning rate η_0, the power c (typically set to 1), and the steps s are hyperparameters. The learning rate drops at each step. After s steps, it is down to η_0 / 2. After s more steps, it is down to η_0 / 3, then it goes down to η_0 / 4, then η_0 / 5, and so on. As you can see, this schedule first drops quickly, then more and more slowly. Of course, power scheduling requires tuning η_0 and s (and possibly c).
    
Exponential scheduling

    Set the learning rate to η(t) = η_0 *0.1^{t/s}. The learning rate will gradually drop by a factor of 10 every s steps. While power scheduling reduces the learning rate more and more slowly, exponential scheduling keeps slashing it by a factor of 10 every s steps.
    
Piecewise constant scheduling

    Use a constant learning rate for a number of epochs (e.g., η_0 = 0.1 for 5 epochs), then a smaller learning rate for another number of epochs (e.g., η_1 = 0.001 for 50 epochs), and so on. Although this solution can work very well, it requires fiddling around to figure out the right sequence of learning rates and how long to use each of them.
    
Performance scheduling

    Measure the validation error every N steps (just like for early stopping), and reduce the learning rate by a factor of λ when the error stops dropping.
    
1cycle scheduling

    A newer approach to scheduling, introduced by Leslie Smith. It starts by increasing the initial learning rate η_0, growing linearly up to η_1 halfway through training. Then it decreases the learning rate linearly down to η_0 again during the second half of training, finishing the last few epochs by dropping the rate down by several orders of magnitude (still linearly). The maximum learning rate η_1 is chosen using the same approach we used to find the optimal learning rate, and the initial learning rate η_0 is chosen to be roughly 10 times lower. When using a momentum, we start with a high momentum first (e.g., 0.95), then drop it down to a lower momentum during the first half of training (e.g., down to 0.85, linearly), and then bring it back up to the maximum value (e.g., 0.95) during the second half of training, finishing the last few epochs with that maximum value. Smith did many experiments showing that this approach was often able to speed up training considerably and reach better performance. For example, on the popular CIFAR10 image dataset, this approach reached 91.9% validation accuracy in just 100 epochs, instead of 90.3% accuracy in 800 epochs through a standard approach (with the same neural network architecture).
    
Implementing power scheduling in Keras is the easiest option: just set the decay hyperparameter when creating an optimizer:    

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)

The decay is the inverse of s (the number of steps it takes to divide the learning rate by one more unit), and Keras assumes that c is equal to 1.
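
To see what this schedule does to the learning rate, the power scheduling formula can be evaluated directly. This is a plain Python sketch of η(t) = η_0 / (1 + t/s)^c with s = 1/decay; the function name is an assumption and it does not call Keras.

```python
def power_schedule(step, lr0=0.01, decay=1e-4, c=1):
    """eta(t) = eta_0 / (1 + t / s)**c, written with decay = 1 / s."""
    return lr0 / (1 + decay * step) ** c

s = round(1 / 1e-4)   # 10,000 steps for decay=1e-4
for step in (0, s, 2 * s, 3 * s):
    print(step, power_schedule(step))
```

The printed rates follow the η_0, η_0/2, η_0/3, η_0/4 pattern described for power scheduling.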

Exponential scheduling and piecewise scheduling are quite simple too. You first need to define a function that takes the current epoch and returns the learning rate. For example, let’s implement exponential scheduling:

In [3]:
def exponential_decay_fn(epoch):
    return 0.01 * 0.1**(epoch / 20)

In [2]:
# If you do not want to hardcode η0 and s, you can create a function that returns a configured function:
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

In [None]:
#Next, create a LearningRateScheduler callback, 
# giving it the schedule function, and pass this callback to the fit() method:
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train_scaled, y_train, [...], callbacks=[lr_scheduler])

The LearningRateScheduler will update the optimizer’s learning_rate attribute at the beginning of each epoch.

Updating the learning rate once per epoch is usually enough, but if you want it to be updated more often, for example at every step, you can always write your own callback (see the “Exponential Scheduling” section of the notebook for an example). Updating the learning rate at every step makes sense if there are many steps per epoch. Alternatively, you can use the keras.optimizers.schedules approach, described shortly.

The schedule function can optionally take the current learning rate as a second argument. For example, the following schedule function multiplies the previous learning rate by $0.1^{1/20}$, which results in the same exponential decay (except the decay now starts at the beginning of epoch 0 instead of 1):



In [None]:
def exponential_decay_fn(epoch, lr):
    return lr * 0.1**(1 / 20)

This implementation relies on the optimizer’s initial learning rate (contrary to the previous implementation).

- When you save a model, the optimizer and its learning rate get saved along with it. This means that with this new schedule function, you could just load a trained model and continue training where it left off, no problem

Things are not so simple if your schedule function uses the epoch argument, however: the epoch does not get saved, and it gets reset to 0 every time you call the fit() method. If you were to continue training a model where it left off, this could lead to a very large learning rate, which would likely damage your model’s weights. One solution is to manually set the fit() method’s initial_epoch argument so the epoch starts at the right value.
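
The size of the jump is easy to quantify. The sketch below reuses the exponential schedule from earlier with η_0 = 0.01 and s = 20, and assumes (hypothetically) that training was interrupted after 40 epochs.

```python
def exponential_decay_fn(epoch, lr0=0.01, s=20):
    """Epoch-based exponential schedule: divide the rate by 10 every s epochs."""
    return lr0 * 0.1 ** (epoch / s)

lr_when_stopped = exponential_decay_fn(40)     # rate reached after 40 epochs
lr_after_restart = exponential_decay_fn(0)     # fit() counts epochs from 0 again
jump = lr_after_restart / lr_when_stopped
print(f"learning rate is {jump:.0f}x larger after a naive restart")
```

Passing initial_epoch=40 to fit() in this scenario would make the schedule resume from the rate it had reached.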

For piecewise constant scheduling, you can use a schedule function like the following one (as earlier, you can define a more general function if you want; see the “Piecewise Constant Scheduling” section of the notebook for an example), then create a LearningRateScheduler callback with this function and pass it to the fit() method, just like we did for exponential scheduling:




In [4]:
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

For performance scheduling, use the ReduceLROnPlateau callback. For example, if you pass the following callback to the fit() method, it will multiply the learning rate by 0.5 whenever the best validation loss does not improve for five consecutive epochs (other options are available; please check the documentation for more details):




In [5]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
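
The core idea behind performance scheduling can be sketched without Keras. The function below is a simplified stand-in, not the actual ReduceLROnPlateau implementation (which has further options such as a minimum improvement and a cooldown); the validation losses are invented for the example.

```python
def reduce_on_plateau(val_losses, lr0, factor=0.5, patience=5):
    """Return the learning rate in effect after each epoch."""
    lr, best, wait = lr0, float("inf"), 0
    lr_history = []
    for loss in val_losses:
        if loss < best:            # improvement: remember it and reset the counter
            best, wait = loss, 0
        else:                      # no improvement on the best loss so far
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        lr_history.append(lr)
    return lr_history

val_losses = [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95, 0.8]   # hypothetical
lr_history = reduce_on_plateau(val_losses, lr0=0.01)
print(lr_history)
```

The rate is halved only once the best validation loss has gone five epochs without being beaten.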

Lastly, tf.keras offers an alternative way to implement learning rate scheduling: define the learning rate using one of the schedules available in keras.optimizers.schedules, then pass this learning rate to any optimizer.

This approach updates the learning rate at each step rather than at each epoch. For example, here is how to implement the same exponential schedule as the exponential_decay_fn() function we defined earlier:



In [6]:
s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)

This is nice and simple, plus when you save the model, the learning rate and its schedule (including its state) get saved as well. 

As for the 1cycle approach, the implementation poses no particular difficulty: just create a custom callback that modifies the learning rate at each iteration (you can update the optimizer’s learning rate by changing self.model.optimizer.lr).

- See the “1Cycle scheduling” section of the notebook for an example.

- exponential decay, performance scheduling, and 1cycle can considerably speed up convergence, so give them a try!
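
For 1cycle, the learning rate as a function of the iteration number could look like the sketch below. It only computes the rate; wiring it into a custom callback is left out. The function name, the length of the final phase, and the example rates are all assumptions.

```python
def one_cycle_lr(step, total_steps, lr0, lr1, lr_final, final_steps):
    """Piecewise-linear 1cycle rate: up to lr1, back to lr0, then down to lr_final."""
    half = (total_steps - final_steps) // 2
    if step < half:                                    # first half: linear increase
        return lr0 + (lr1 - lr0) * step / half
    if step < 2 * half:                                # second half: linear decrease
        return lr1 - (lr1 - lr0) * (step - half) / half
    # final phase: drop well below lr0, still linearly
    progress = (step - 2 * half) / (total_steps - 2 * half)
    return lr0 - (lr0 - lr_final) * min(progress, 1.0)

args = dict(total_steps=1000, lr0=0.01, lr1=0.1, lr_final=0.0001, final_steps=100)
for step in (0, 225, 450, 900, 1000):
    print(step, one_cycle_lr(step, **args))
```

The momentum could be scheduled the same way with the direction reversed (high, then low, then high again).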

# Avoiding Overfitting Through Regularization


In this section we will examine other popular regularization techniques for neural networks: ℓ1 and ℓ2 regularization, dropout, and max-norm regularization.


## ℓ1 and ℓ2 Regularization

You can use ℓ2 regularization to constrain a neural network’s connection weights, and/or ℓ1 regularization if you want a sparse model (with many weights equal to 0)

Here is how to apply ℓ2 regularization to a Keras layer’s connection weights, using a regularization factor of 0.01:

In [1]:
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

The l2() function returns a regularizer that will be called at each step during training to compute the regularization loss. This is then added to the final loss.

As you might expect, you can just use keras.regularizers.l1() if you want ℓ1 regularization; if you want both ℓ1 and ℓ2 regularization, use keras.regularizers.l1_l2() (specifying both regularization factors).
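
What these regularizers add to the loss can be written out by hand. This is a plain Python sketch of the penalty terms (factor times the sum of squared weights for ℓ2, factor times the sum of absolute weights for ℓ1); the weights and the data loss are hypothetical values.

```python
def l2_penalty(weights, factor=0.01):
    """factor * sum of squared weights."""
    return factor * sum(w * w for w in weights)

def l1_penalty(weights, factor=0.01):
    """factor * sum of absolute weights."""
    return factor * sum(abs(w) for w in weights)

weights = [1.0, -2.0, 0.5]     # hypothetical layer weights
data_loss = 0.30               # hypothetical loss before regularization
total_loss = data_loss + l2_penalty(weights)
print(l2_penalty(weights), l1_penalty(weights), total_loss)
```

The ℓ2 penalty grows quadratically with large weights, while the ℓ1 penalty grows linearly, which is what pushes small weights all the way to zero.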

- We typically want to apply the same regularizer to all layers in the network, along with the same activation function and initialization strategy. Rather than repeating these arguments for every layer (which makes the code ugly and error-prone), we can use Python’s functools.partial() function, which lets you create a thin wrapper for any callable, with some default argument values:


In [2]:
from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax",
                     kernel_initializer="glorot_uniform")
])

## Dropout

Dropout is one of the most popular regularization techniques for deep neural networks. Even state-of-the-art neural networks get a 1-2% accuracy boost simply by adding dropout. This may not sound like a lot, but when a model already has 95% accuracy, getting a 2% accuracy boost means dropping the error rate by almost 40% (going from 5% error to roughly 3%).

It is a fairly simple algorithm:

- at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step

- The hyperparameter p is called the dropout rate, and it is typically set between 10% and 50%: closer to 20–30% in recurrent neural nets and closer to 40–50% in convolutional neural networks

- After training, neurons don’t get dropped anymore. And that’s all (except for a technical detail we will discuss momentarily).

It is surprising at first that such a destructive technique works at all (it literally ignores neurons for a training step), but consider its effects:

- Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own.

- They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end, you get a more robust network that generalizes better.

Another way to understand the power of dropout is to realize that a unique neural network is generated at each training step. Since each neuron can be either present or absent, there are a total of $2^N$ possible networks (where N is the total number of droppable neurons). This number is so huge that it is virtually impossible for the same network to be sampled twice, so training visits a large number of different networks along the way. At the end, the resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.

- **TIP**: In practice, you can usually apply dropout only to the neurons in the top one to three layers (excluding the output layer).

There is one small but important technical detail. Suppose p = 50%, in which case during testing a neuron would be connected to twice as many input neurons as it would be (on average) during training. To compensate for this fact, we need to multiply each input connection weight by the keep probability (1 – p) after training. 

If we don’t, each neuron will get a total input signal roughly twice as large as what the network was trained on and will be unlikely to perform well. Alternatively, we can divide each neuron’s output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).
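
The second alternative (dividing by the keep probability during training) can be sketched in plain Python. The function name, the seed, and the all-ones input are assumptions chosen so the effect is easy to read.

```python
import random

def dropout_train(inputs, p, rng):
    """Drop each input with probability p and divide survivors by the keep probability."""
    keep_prob = 1.0 - p
    return [x / keep_prob if rng.random() >= p else 0.0 for x in inputs]

rng = random.Random(42)                 # fixed seed for repeatability
inputs = [1.0] * 100_000
outputs = dropout_train(inputs, p=0.5, rng=rng)
mean_output = sum(outputs) / len(outputs)
print(f"mean input: 1.0, mean output with dropout: {mean_output:.3f}")
```

Because survivors are scaled up, the average signal seen during training stays close to what the network will see at test time, when nothing is dropped.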

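To implement dropout using Keras, you can use the keras.layers.Dropout layer. During training, it randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep probability, i.e., it uses the alternative method with division. After training, it does nothing at all; it just passes the inputs to the next layer. The following code applies dropout before every Dense layer, using a dropout rate of 0.2: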

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

- **WARNING**: Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So make sure to evaluate the training loss without dropout (e.g., after training).

If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set. 

It can also help to increase the dropout rate for large layers, and reduce it for small ones. 

Moreover, many state-of-the-art architectures only use dropout after the last hidden layer, so you may want to try this if full dropout is too strong.

Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra time and effort (ME: If you already have a great model and need that extra boost).

- **TIP** If you want to regularize a self-normalizing network based on the SELU activation function (as discussed earlier), you should use alpha dropout: this is a variant of dropout that preserves the mean and standard deviation of its inputs (it was introduced in the same paper as SELU, as regular dropout would break self-normalization).

## Monte Carlo (MC) Dropout

In 2016, a paper by Yarin Gal and Zoubin Ghahramani added a few more good reasons to use dropout:

 - First, the paper established a profound connection between dropout networks (i.e., neural networks containing a Dropout layer before every weight layer) and approximate Bayesian inference, giving dropout a solid mathematical justification.
 
 - Second, the authors introduced a powerful technique called MC Dropout, which can boost the performance of any trained dropout model without having to retrain it or even modify it at all, provides a much better measure of the model’s uncertainty, and is also amazingly simple to implement.
 
The following code is the full implementation of MC Dropout, boosting the dropout model we trained earlier without retraining it:

In [None]:
import numpy as np

# Run 100 stochastic forward passes with dropout active, then average them
y_probas = np.stack([model(X_test_scaled, training=True)
                     for sample in range(100)])
y_proba = y_probas.mean(axis=0)

Note that model(X) is similar to model.predict(X) except it returns a tensor rather than a NumPy array, and it supports the training argument.

In this code example, setting training=True ensures that the Dropout layer remains active, so all predictions will be a bit different. We just make 100 predictions over the test set, and we stack them.

Each call to the model returns a matrix with one row per instance and one column per class. Because there are 10,000 instances in the test set and 10 classes, this is a matrix of shape [10000, 10]. We stack 100 such matrices, so y_probas is an array of shape [100, 10000, 10]. Once we average over the first dimension (axis=0), we get y_proba, an array of shape [10000, 10], like we would get with a single prediction.

- Averaging over multiple predictions with dropout on gives us a Monte Carlo estimate that is generally more reliable than the result of a single prediction with dropout off.
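
The averaging step, together with a per-class standard deviation as a simple uncertainty measure, can be illustrated on a tiny example. The three "predictions" below are made-up numbers for a single instance and three classes, standing in for the output of stochastic forward passes.

```python
from statistics import mean, pstdev

# Three hypothetical Monte Carlo predictions for one instance and three classes
y_probas = [
    [0.70, 0.20, 0.10],
    [0.50, 0.40, 0.10],
    [0.60, 0.30, 0.10],
]

# Aggregate per class across the Monte Carlo samples (the equivalent of axis=0)
y_proba = [mean(column) for column in zip(*y_probas)]
y_std = [pstdev(column) for column in zip(*y_probas)]
print(y_proba, y_std)
```

A class whose probability barely changes across samples gets a standard deviation near zero, while classes the model hesitates about stand out.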

In the book, we are given an example comparing a model without dropout, one with dropout, and one with MC Dropout. Apparently there’s quite a lot of variance in the probability estimates: if you were building a risk-sensitive system (e.g., a medical or financial system), you should probably treat such an uncertain prediction with extreme caution.

- **NOTE** The number of Monte Carlo samples you use (100 in this example) is a hyperparameter you can tweak. The higher it is, the more accurate the predictions and their uncertainty estimates will be. However, if you double it, inference time will also be doubled. Moreover, above a certain number of samples, you will notice little improvement. So your job is to find the right trade-off between latency and accuracy, depending on your application.

If your model contains other layers that behave in a special way during training (such as BatchNormalization layers), then you should not force training mode like we just did. Instead, you should replace the Dropout layers with the following MCDropout class:

In [None]:
# Here, we just subclass the Dropout layer and override the call() method to force its training argument to True
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

# Similarly, you could define an MCAlphaDropout class by subclassing AlphaDropout instead. 

But if you have a model that was already trained using Dropout, you need to create a new model that’s identical to the existing model except that it replaces the Dropout layers with MCDropout, then copy the existing model’s weights to your new model.

In short, MC Dropout is a fantastic technique that boosts dropout models and provides better uncertainty estimates. And of course, since it is just regular dropout during training, it also acts like a regularizer.

## Max-Norm Regularization

Another regularization technique that is popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights w of the incoming connections such that $∥ w ∥_2 ≤ r$, where r is the max-norm hyperparameter and $∥ · ∥_2$ is the ℓ2 norm.

Max-norm regularization does not add a regularization loss term to the overall loss function. Instead, it is typically implemented by computing $∥w∥_2$ after each training step and rescaling w if needed $(w ← w r/‖ w ‖_2)$.
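
The rescaling rule can be sketched for a single neuron's weight vector. This is a simplified illustration in plain Python, not the Keras implementation; the function name and the weight values are assumptions.

```python
import math

def apply_max_norm(w, r=1.0):
    """Rescale w to have l2 norm r if its norm exceeds r; otherwise leave it alone."""
    norm = math.sqrt(sum(w_i * w_i for w_i in w))
    if norm <= r:
        return list(w)
    return [w_i * r / norm for w_i in w]

too_large = apply_max_norm([3.0, 4.0], r=1.0)    # norm 5, gets rescaled
small_enough = apply_max_norm([0.3, 0.4], r=1.0) # norm 0.5, unchanged
print(too_large, small_enough)
```

Only the length of the weight vector changes; its direction is preserved.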

Reducing r increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the unstable gradients problems (if you are not using Batch Normalization).

To implement max-norm regularization in Keras, set the kernel_constraint argument of each hidden layer to a max_norm() constraint with the appropriate max value, like this:

In [3]:
keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
                   kernel_constraint=keras.constraints.max_norm(1.))

After each training iteration, the model’s fit() method will call the object returned by max_norm(), passing it the layer’s weights and getting rescaled weights in return, which then replace the layer’s weights.

If necessary, you can define your own custom constraint function (covered in the next chapter) and use it as the kernel_constraint. You can also constrain the bias terms by setting the bias_constraint argument.

The max_norm() function has an axis argument that defaults to 0. 

- A Dense layer usually has weights of shape [number of inputs, number of neurons], so using axis=0 means that the max-norm constraint will apply independently to each neuron’s weight vector.

- If you want to use max-norm with convolutional layers (see Chapter 14), make sure to set the max_norm() constraint’s axis argument appropriately (usually axis=[0, 1, 2]).

# Summary and Practical Guidelines

In this chapter we have covered a wide range of techniques, and you may be wondering which ones you should use. This depends on the task, and there is no clear consensus yet, but I have found the configuration in Table 11-3 to work fine in most cases, without requiring much hyperparameter tuning. That said, please do not consider these defaults as hard rules!

See [Table 11-3](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ch11.html#default_deep_neural_network_config) for rules 

If the network is a simple stack of dense layers, then it can self-normalize, and you should use the configuration in [Table 11-4](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ch11.html#self_norm_deep_neural_network_config) instead.


Don’t forget to normalize the input features! You should also try to reuse parts of a pretrained neural network if you can find one that solves a similar problem, or use unsupervised pretraining if you have a lot of unlabeled data, or use pretraining on an auxiliary task if you have a lot of labeled data for a similar task.

While the previous guidelines should cover most cases, here are some exceptions:

- If you need a sparse model, you can use ℓ1 regularization (and optionally zero out the tiny weights after training). If you need an even sparser model, you can use the TensorFlow Model Optimization Toolkit. This will break self-normalization, so you should use the default configuration in this case.

- If you need a low-latency model (one that performs lightning-fast predictions), you may need to use fewer layers, fold the Batch Normalization layers into the previous layers, and possibly use a faster activation function such as leaky ReLU or just ReLU. Having a sparse model will also help. Finally, you may want to reduce the float precision from 32 bits to 16 or even 8 bits (see “Deploying a Model to a Mobile or Embedded Device”). Again, check out TF-MOT.

- If you are building a risk-sensitive application, or inference latency is not very important in your application, you can use MC Dropout to boost performance and get more reliable probability estimates, along with uncertainty estimates.

