<a href="https://colab.research.google.com/github/adhadse/colab_repo/blob/master/homl/Ch%2011%20Training%20Deep%20Neural%20Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 11: Training Deep Neural Networks
This notebook partially combines text and code from the book [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/). It is intended only as a reference; following along with a purchased copy of the book is recommended.

# The Vanishing/Exploding Gradients Problems
Gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is called <mark>*vanishing gradients*</mark>.

Other times, the gradients can grow bigger and bigger until layers get insanely large weight updates and the algorithm diverges. This is referred to as <mark>*exploding gradients*</mark>.

**DNNs suffer from unstable gradients.**

In a 2010 paper, Xavier Glorot and Yoshua Bengio traced the problem to a few practices that were common at the time: the logistic activation function and the popular weight initialization technique (a normal distribution with a mean of 0 and a standard deviation of 1). They showed that with this combination, <mark>the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers.</mark>
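
To make the saturation argument concrete, here is a minimal framework-free sketch. It only looks at the derivative of the logistic function and ignores the weights entirely, so it is a simplification rather than a full model of backpropagation. The helper names `sigmoid` and `sigmoid_grad` are illustrative.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at z = 0) and collapses once the unit saturates.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}   sigmoid'(z) = {sigmoid_grad(z):.6f}")

# Best case for a chain of 10 logistic units (weights ignored):
# each layer scales the backpropagated signal by at most 0.25.
best_case_factor = sigmoid_grad(0.0) ** 10
print(f"upper bound on the gradient factor after 10 layers: {best_case_factor:.2e}")
```

Even in the most favorable case the factor shrinks geometrically with depth, and saturated units make it far smaller still.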


## Glorot initialization and He initialization
The paper also proposed a way to significantly alleviate the unstable gradients problem. The signal needs to flow properly in both directions, which requires the following:
- <mark>We need the variance of the outputs of each layer to be equal to the variance of its inputs.</mark>
- <mark>We also need the gradients to have equal variance before and after flowing through a layer in the reverse direction.</mark>

This is not possible unless the layer has an equal number of inputs and neurons (these numbers are called the *fan-in* and *fan-out* of the layer). But Glorot and Bengio proposed a good compromise: **initialize the connection weights of each layer randomly**, as described in Equation 11-1, where $fan_{avg} = \frac{fan_{in} + fan_{out}}{2}$. This initialization strategy is called <mark>*Xavier initialization*</mark> or <mark>*Glorot initialization*</mark>.

**Equation 11-1. Glorot initialization (when using the logistic activation function)**

- Normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{fan_{avg}}$

- Or a uniform distribution between $-r$ and $+r$, with $r=\sqrt{\frac{3}{fan_{avg}}}$

If we replace $fan_{avg}$ with $fan_{in}$ in Equation 11-1, we get an initialization strategy that Yann LeCun proposed in the 1990s. He called it *LeCun initialization*. It is equivalent to Glorot initialization when $fan_{in}=fan_{out}$. Using Glorot initialization can speed up training considerably, and it is one of the tricks that led to the success of Deep Learning.
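
The effect of Equation 11-1 can be checked numerically. The sketch below is a small pure-Python simulation, not the experiment from the paper: it feeds unit-variance inputs through a single linear layer (no activation function) and compares the output variance for a standard deviation of 1 against the Glorot standard deviation. The function name `output_variance`, the layer sizes and the seed are arbitrary choices made for the illustration.

```python
import math
import random

def output_variance(fan_in, fan_out, weight_std, n_samples=200, seed=42):
    """Empirical variance of a linear layer's outputs for unit-variance inputs."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, weight_std) for _ in range(fan_in)]
               for _ in range(fan_out)]
    outputs = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
        outputs.extend(sum(w * xi for w, xi in zip(row, x)) for row in weights)
    mean = sum(outputs) / len(outputs)
    return sum((o - mean) ** 2 for o in outputs) / len(outputs)

fan_in, fan_out = 50, 50
fan_avg = (fan_in + fan_out) / 2

sigma_glorot = math.sqrt(1.0 / fan_avg)  # normal variant of Equation 11-1
r_glorot = math.sqrt(3.0 / fan_avg)      # uniform variant of Equation 11-1

var_naive = output_variance(fan_in, fan_out, weight_std=1.0)
var_glorot = output_variance(fan_in, fan_out, weight_std=sigma_glorot)

print(f"std = 1      -> output variance ~ {var_naive:.1f}")
print(f"Glorot std   -> output variance ~ {var_glorot:.2f}")
# A uniform distribution on [-r, r] has variance r^2 / 3, so both variants agree.
print(f"uniform variance r^2/3 = {r_glorot ** 2 / 3:.4f}, 1/fan_avg = {1 / fan_avg:.4f}")
```

With a standard deviation of 1 the output variance is roughly `fan_in` times the input variance, while the Glorot scale keeps it close to 1.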

<mark>The initialization strategy for the ReLU activation function (and its variants, including the ELU activation) is sometimes called *He initialization*.</mark>

**Table 11-1. Initialization parameters for each type of activation function**

|Initialization| Activation function|$\sigma^2$ (Normal)|
|---|---|---|
|Glorot| None, tanh, logistic, softmax| $\frac{1}{fan_{avg}}$|
|He|ReLU and variants|$\frac{2}{fan_{in}}$ |
|LeCun|SELU|$\frac{1}{fan_{in}}$|

By default, Keras uses Glorot initialization with a uniform distribution. When creating a layer, we can change this to He Initialization by setting `kernel_initializer="he_uniform"` or `kernel_initializer="he_normal"` like this:

`keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")`

If we want He initialization with a uniform distribution but based on $fan_{avg}$ rather than $fan_{in}$, we can use the `VarianceScaling` initializer like this:

```Python
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg", distribution="uniform")
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)
```

## Nonsaturating Activation Functions
Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as *dying ReLUs*: during training, some neurons effectively die, meaning they stop outputting anything other than 0. <mark>A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set.</mark> When this happens, Gradient Descent can no longer affect the neuron, because the gradient of the ReLU function is zero when its input is negative.

### Solution
Use a variant of the ReLU activation function such as:
1. **Leaky ReLU**:
$$\text{LeakyReLU}_{\alpha}(z) = \max(\alpha z,\, z)$$
The hyperparameter $\alpha$ defines how much the function "leaks": it is the slope of the function for $z < 0$ and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up. Leaky variants have been reported to outperform the strict ReLU.

2. **Randomized Leaky ReLU (RReLU)**:
$\alpha$ is picked randomly in a given range during training and is fixed to an average value during testing. It also seems to act as a regularizer.

3. **Parametric Leaky ReLU (PReLU)**: 
$\alpha$ is learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation). PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.

4. **Exponential Linear Unit (ELU)**:
Outperformed all the ReLU variants in the authors' experiments.

**Equation 11-2. ELU activation function**
$$\text{ELU}_{\alpha}(z) = \begin{cases}
                            \alpha(\exp (z) -1) & \text{if } z < 0 \\
                            z  &\text{if } z \geq 0
\end{cases}$$
There are major differences with respect to ReLU activation function:
- Has nonzero gradients for $z < 0$, which <mark>avoids the dead neurons problem.</mark>
- If $\alpha$ is equal to 1 then the function is smooth everywhere, including around $z=0$, which helps speed up Gradient Descent since it <mark>does not bounce as much to the left or right of $z=0$.</mark>
- It takes on negative values when $z< 0$ which allows the unit to have an average output closer to 0 <mark> and helps alleviate the vanishing gradients problem</mark>.

The **main drawback** of the ELU activation function is that it is **slower to compute than the ReLU function** (and its variants) due to the use of the exponential function. During training, its faster convergence rate compensates for the slower computation, but at test time an ELU network will still be slower than a ReLU network.
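
The variants above are easy to compare side by side. The following is a framework-free sketch for scalar inputs (Keras provides its own layer implementations); the function names are illustrative.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z): slope alpha for z < 0, slope 1 otherwise."""
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    """Equation 11-2."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

def elu_grad(z, alpha=1.0):
    return 1.0 if z >= 0 else alpha * math.exp(z)

# For negative inputs ReLU has a zero gradient (the neuron cannot recover),
# while leaky ReLU and ELU keep a nonzero gradient.
for z in (-3.0, -1.0, 0.5):
    print(f"z = {z:4.1f}  relu' = {relu_grad(z):.2f}  "
          f"leaky' = {leaky_relu_grad(z):.2f}  elu' = {elu_grad(z):.4f}")
```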



Then, a 2017 paper by Günter Klambauer et al. introduced the <mark>Scaled ELU (SELU)</mark> activation function, which is a scaled variant of the ELU activation function. The authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function, then the network will *self-normalize*: the output of each layer tends to preserve a mean of 0 and a standard deviation of 1 during training.

There are, however, a few conditions for self-normalization to happen:
- <mark>The input features must be standardized (mean 0 and standard deviation 1).</mark>
- <mark>Every hidden layer's weights must be initialized with LeCun normal initialization (set `kernel_initializer="lecun_normal"`).</mark>
- <mark>The network's architecture must be sequential.</mark> Self-normalization is not guaranteed with *skip connections*.
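
A rough way to see self-normalization is to push standardized random inputs through a stack of randomly initialized dense SELU layers and inspect the activations at the top. The sketch below does this in pure Python under a few assumptions: the two SELU constants are the ones published in the paper, biases are zero, the weights come from an untruncated normal distribution with variance $1/fan_{in}$ (Keras's `lecun_normal` draws from a truncated normal), and the depth, width and seed are arbitrary.

```python
import math
import random

# Constants from the SELU paper (Klambauer et al., 2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * (math.exp(z) - 1.0))

def mean_and_std(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def forward_stats(n_layers=10, width=64, n_samples=64, seed=0):
    """Mean and std of the activations after a stack of dense SELU layers."""
    rng = random.Random(seed)
    # Standardized inputs: mean 0, standard deviation 1.
    batch = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(n_samples)]
    std = math.sqrt(1.0 / width)  # LeCun scaling: variance 1 / fan_in
    for _ in range(n_layers):
        weights = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        batch = [[selu(sum(w * a for w, a in zip(row, x))) for row in weights]
                 for x in batch]
    return mean_and_std([a for x in batch for a in x])

mean, std = forward_stats()
print(f"after 10 SELU layers: mean = {mean:.3f}, std = {std:.3f}")
```

In this toy setting the statistics stay in the neighborhood of mean 0 and standard deviation 1 instead of drifting; the exact values depend on the seed and the layer width.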

> 🟢 **Which one to choose?**
> 
> **SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic**
>
> - If the network's architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at $z = 0$).
> - If you care about runtime latency, go for leaky ReLU.
> - If you have spare time and computing power, try RReLU if the network is overfitting, or PReLU if you have a huge training set.

**How to use the Leaky ReLU activation function in Keras?**

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

In [None]:
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2)
])

**How to implement PReLU**

In [None]:
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.PReLU()  # alpha is learned, so no fixed slope is passed
])

There is currently no official implementation of RReLU in Keras, but it is fairly easy to implement your own.
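
As an illustration of the RReLU rule described earlier, here is a minimal scalar sketch in plain Python. It is not a Keras layer, and the default range for $\alpha$ (`lower=1/8`, `upper=1/3`) is an assumption chosen for the example.

```python
import random

def rrelu(z, lower=1/8, upper=1/3, training=False, rng=random):
    """Randomized leaky ReLU for a single value.

    During training, alpha is sampled uniformly from [lower, upper];
    at test time it is fixed to the average of the range.
    """
    if z >= 0:
        return z
    alpha = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return alpha * z

rng = random.Random(0)
print([round(rrelu(-2.0, training=True, rng=rng), 3) for _ in range(3)])  # random slopes
print(rrelu(-2.0, training=False))                                       # fixed average slope
```

Wrapping this logic in a custom Keras layer would additionally require switching behavior on the layer's `training` argument.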

**For SELU**

In [None]:
layer = keras.layers.Dense(10, activation="selu",
                           kernel_initializer="lecun_normal")

## Batch Normalization
Using He initialization along with ELU (or any variant of ReLU) can significantly reduce the danger of the vanishing/exploding gradients problems at the beginning of training, but it doesn't guarantee that they won't come back later during training.

Then, in a 2015 paper, Sergey Ioffe and Christian Szegedy proposed a technique called *Batch Normalization (BN)*. The technique consists of adding an operation in the model just before or after the activation function of each hidden layer. In simple words, the operation <mark>lets the model learn the optimal scale and mean of each of the layer's inputs.</mark>

**In a nutshell**

<mark>The operation simply zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting</mark>

**Working**

The algorithm needs to estimate each input's mean and standard deviation. It does so by evaluating the mean and standard deviation of the input over the current mini-batch. 

**Equation 11-3. Batch Normalization algorithm**
1. $\mu_B = \frac{1}{m_B}\;\sum\limits_{i=1}^{m_B}x^{(i)}$
2. $\sigma_{B}^2 = \frac{1}{m_B}\;\sum\limits_{i=1}^{m_B}\big(x^{(i)} - \mu_B\big)^2$
3. $\hat{x}^{(i)} = \frac{x^{(i)} \,-\, \mu_B}{\sqrt{\sigma_B^2\, + \,\epsilon}}$
4. $z^{(i)} = \gamma \otimes \hat{x}^{(i)} + \beta$

In this algorithm, 
- $\mu_B$ is the <mark>vector of input means</mark>, evaluated over the whole mini-batch $B$ (it contains one mean per input).
- $\sigma_B$ is the <mark>vector of input standard deviations</mark>, also evaluated over the whole mini-batch (it contains one standard deviation per input).
- $m_B$ is the <mark>number of instances in the mini-batch</mark>.
- $\hat{x}^{(i)}$ is the <mark>vector of zero-centered and normalized inputs for instance $i$</mark>.
- $\gamma$ is the <mark>output scale parameter vector for the layer (it contains one scale parameter per input).</mark>
- $\otimes$ represents <mark>element-wise multiplication</mark> (each input is multiplied by its corresponding output scale parameter).
- $\beta$ is the <mark>output shift (offset) parameter vector for the layer</mark> (it contains one offset parameter per input). Each input is offset by its corresponding shift parameter.
- $\epsilon$ is a <mark>tiny number that avoids division by zero</mark> (typically $10^{-5}$). This is called a <mark>*smoothing term*</mark>.
- $z^{(i)}$ is the <mark>output of the BN operation</mark>. It is a rescaled and shifted version of the inputs.
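
The four steps of Equation 11-3 translate almost line by line into code. The sketch below is a plain-Python version of the training-time forward pass for a small mini-batch (a list of instances, each a list of inputs); the function name `batch_norm_train` and the example values are made up for illustration.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Apply Equation 11-3 to a mini-batch given as a list of input vectors."""
    m = len(batch)
    n_inputs = len(batch[0])
    # Steps 1 and 2: per-input mean and variance over the mini-batch.
    mu = [sum(x[j] for x in batch) / m for j in range(n_inputs)]
    var = [sum((x[j] - mu[j]) ** 2 for x in batch) / m for j in range(n_inputs)]
    # Steps 3 and 4: normalize, then scale by gamma and shift by beta.
    out = [[gamma[j] * (x[j] - mu[j]) / math.sqrt(var[j] + eps) + beta[j]
            for j in range(n_inputs)]
           for x in batch]
    return out, mu, var

batch = [[1.0, 2.0], [3.0, 6.0]]
z, mu, var = batch_norm_train(batch, gamma=[2.0, 1.0], beta=[0.5, 0.0])
print("means:", mu, "variances:", var)
print("outputs:", z)
```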

When we need to make predictions for individual instances rather than batches, we have no way to compute each input's mean and standard deviation. Even if we do have a batch of test instances, it may be too small, or the instances may not be independent and identically distributed, so statistics computed over that batch would be unreliable.

One solution could be to wait until the end of training, then run the whole training set through the neural network and compute the mean and standard deviation of each input of the BN layer. However, most implementations of BN (including Keras) estimate these final statistics during training by using a moving average of the layer's input means and standard deviations.

Four parameter vectors are learned in each BN layer: $\gamma$ and $\beta$ are learned through regular backpropagation, and $\mu$ and $\sigma$ are estimated using an exponential moving average. <mark>Note that $\mu$ and $\sigma$ are estimated during training, but are used only after training (to replace the batch input means and standard deviations in Equation 11-3).</mark>

### Implementing Batch Normalization with Keras

### Implementing Batch Nomalization with Keras


In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1

In [None]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

When you create a BN layer in Keras, it also creates two operations (TF operations) that Keras calls at each iteration during training. These operations update the moving averages.

There is some debate about whether BN layers should be placed before or after the activation functions. <mark>To add the BN layers before the activation functions, you must remove the activation function from the hidden layers and add them as separate layers after the BN layers.</mark>
Since the BN layer already includes one offset parameter per input, we can remove the bias term from the previous layer.

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.BatchNormalization(),
    # No bias and no activation in the Dense layers: BN supplies the offset,
    # and the activation is applied as a separate layer after BN
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])

The `BatchNormalization` class has quite a few hyperparameters you can tweak, although the defaults are usually fine. <mark>`momentum` is one such hyperparameter, used by the `BatchNormalization` layer when it updates the exponential moving averages;</mark> given a new value <mark>$\textbf{v}$</mark> (i.e., **a new vector of input means or standard deviations** computed over the current batch), the layer updates the **running average <mark>$\hat{\textbf{v}}$</mark>** using the following equation:

$$\hat{v}\,\leftarrow\,\hat{v}\times\text{momentum} + v \times (1- \text{momentum})$$

A good momentum value is typically close to 1; for example, 0.9, 0.99, or 0.9999 (we would want more 9s for larger datasets and smaller mini-batches).
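
The update rule is short enough to trace by hand. In this sketch the running average starts at 0 and is fed the same batch statistic repeatedly, which shows how quickly it catches up for a given `momentum`; the function name and the numbers are illustrative only.

```python
def update_moving_average(running, new_value, momentum=0.99):
    """v_hat <- v_hat * momentum + v * (1 - momentum), element-wise."""
    return [r * momentum + v * (1.0 - momentum)
            for r, v in zip(running, new_value)]

running_mean = [0.0]
history = []
for _ in range(50):
    # Pretend every mini-batch reports the same mean of 10.0.
    running_mean = update_moving_average(running_mean, [10.0], momentum=0.9)
    history.append(running_mean[0])

print(f"after  1 batch:   {history[0]:.3f}")
print(f"after  2 batches: {history[1]:.3f}")
print(f"after 50 batches: {history[-1]:.3f}")
```

A momentum closer to 1 makes the running average smoother but slower to converge, which is why larger datasets and smaller mini-batches call for more 9s.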


Another important hyperparameter is `axis`, <mark>which determines which axis should be normalized.</mark> It defaults to -1, meaning that by default it will normalize the last axis (using the means and standard deviations computed across the *other* axes).




## Gradient Clipping
Another popular technique to mitigate the exploding gradients problem is to <mark>clip the gradients during backpropagation so that they never exceed some threshold.</mark> This is called *Gradient Clipping*.



In [None]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)

This means that all the partial derivatives of the loss (with regard to each and every trainable parameter) will be clipped between -1.0 and 1.0. Note that this may change the orientation of the gradient vector.

If you want to ensure that Gradient Clipping does not change the direction of the gradient vector, you should clip by norm by setting `clipnorm` instead of `clipvalue`. This will clip the whole gradient if its $\ell_2$ norm is greater than the threshold you picked.
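
The difference between the two clipping modes is easiest to see on a small vector. This is a plain-Python sketch of the two rules applied to a gradient given as a list, not the code Keras runs internally; the example gradient `[0.9, 100.0]` is chosen for illustration.

```python
import math

def clip_by_value(grad, threshold):
    """Clip every component independently to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, threshold):
    """Rescale the whole vector if its l2 norm exceeds the threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)
print("clip by value:", by_value)   # direction changes: both components are now similar
print("clip by norm: ", by_norm)    # direction preserved: first component nearly vanishes
```

Clipping by value keeps the small component intact but changes the direction; clipping by norm keeps the direction but shrinks every component by the same factor.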

# Reusing Pretrained Layers
Instead of training a very large DNN from scratch, we should always try to find an existing neural network that accomplishes a similar task to the one we are trying to tackle, and then reuse its lower layers. This is referred to as *transfer learning*.

> 🔵 If the input pictures of your task don't have the same size as the ones used in the original task, you will usually have to add a preprocessing step to resize them to the size expected by the original model.

> 🟢 The more similar the tasks are, the more layers you want to reuse (starting with the lower layers).

First, keep all the transferred layers frozen (so that Gradient Descent does not modify them) and see how the model performs. Then unfreeze one or two of the top reused layers and re-evaluate. <mark>Also, keep the learning rate low, so that the fine-tuned weights don't get wrecked.</mark>


## Transfer Learning with Keras


In [None]:
# model_B_on_A reuses model_A's layers directly, so the two models share them
model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation='sigmoid'))

Note that `model_A` and `model_B_on_A` now **share some layers**. When you train `model_B_on_A`, **it will also affect `model_A`**. To avoid that, you need to *clone* `model_A` before you reuse its layers. <mark>To do this, you clone model A's architecture with `clone_model()`, then copy its weights (since `clone_model()` does not clone the weights).</mark>

In [None]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

# Now the model we want to train shares layers with the clone, not with model_A
model_B_on_A = keras.models.Sequential(model_A_clone.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

Since the new output layer is initialized randomly, it will make large errors (at least during the first few epochs), so there will be large error gradients that may wreck the reused weights. To avoid this, we *freeze* the reused layers during the first few epochs.

In [None]:
for layer in model_B_on_A.layers[:-1]:
  layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy", 
                     optimizer="sgd",
                     metrics=["accuracy"])

> 🔵 **You must always compile your model after you freeze or unfreeze layers.**

After unfreezing the reused layers, it is usually a good idea to reduce the learning rate, once again to avoid damaging the reused weights.

In [None]:
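# Train only the new output layer for a few epochs while the reused layers are frozen
history = model_B_on_A.fit(X_train_B, y_train_B,
                           epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

# Unfreeze the reused layers
for layer in model_B_on_A.layers[:-1]:
  layer.trainable = True

# Recompile with a lower learning rate to avoid damaging the reused weights
optimizer = keras.optimizers.SGD(learning_rate=1e-4)
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=optimizer,
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B,
                           epochs=16,
                           validation_data=(X_valid_B, y_valid_B))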

## Unsupervised Pretraining
Suppose we need to tackle a complex task for which we don't have much labeled training data, and we cannot find a model trained on a similar task.

In that case we can use *unsupervised pretraining*:
- <mark>Gather unlabeled training examples</mark>
- <mark>Use them to train an unsupervised model</mark>
  
  Such as an autoencoder or a generative adversarial network (GAN).
- <mark>Reuse the lower layers of the autoencoder or of the GAN's discriminator</mark>

  Add the output layer for your task on top, and fine-tune the final network using supervised learning (i.e., with the labeled training examples).



# Faster Optimizers
So far we have seen four ways to speed up training (and reach a better solution):
- Applying a good initialization strategy for the connection weights
- Using a good activation function
- Using Batch Normalization
- Reusing parts of a pretrained network

Another huge speed boost comes from using a faster optimizer than the regular Gradient Descent optimizer. We will look at the most popular algorithms:
- [Momentum Optimization](#momentum-optimization)
- [Nesterov Accelerated Gradient](#nesterov-accelerated-gradient)
- [AdaGrad](#adagrad)
- [RMSProp](#rmsprop)
- [Adam and Nadam optimization](#adam-and-nadam-optimization)

## <a name="momentum-optimization"></a>Momentum Optimization
Imagine a bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity.

Regular Gradient Descent, in contrast, does not care about what the earlier gradients were:
$$\theta \,\leftarrow\,\theta - \eta\;\nabla_{\theta}J(\theta)$$

It updates the <mark>weights $\theta$</mark> by directly subtracting the <mark>gradient of the cost function $J(\theta)$ with regard to the weights ($\nabla_\theta J(\theta)$)</mark> multiplied by the <mark>learning rate $\eta$.</mark> If the local gradient is tiny, it goes very slowly.

### Momentum Optimization uses the gradient for acceleration, not for speed
*Equation 11-4. Momentum Algorithm*
$$\textbf{m} \leftarrow \beta \textbf{m} - \eta\;\nabla_\theta J(\theta)$$
$$\theta \leftarrow \theta + \textbf{m}$$

At each iteration, it subtracts the local gradient (multiplied by the learning rate $\eta$) from the *momentum vector* $\textbf{m}$. Then it updates the weights by adding this momentum vector.

To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a hyperparameter $\beta$, called the *momentum*, which must be set between 0 (high friction) and 1 (no friction). It is typically set to 0.9.
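
The terminal velocity can be worked out from Equation 11-4: if the gradient stays constant, $\textbf{m}$ converges to $-\eta\,\nabla_\theta J(\theta) / (1 - \beta)$, so with $\beta = 0.9$ the steps end up 10 times larger than plain Gradient Descent steps. The scalar sketch below checks this numerically; the constant gradient of 1 and the function name `momentum_step` are assumptions made for the illustration.

```python
def momentum_step(theta, m, grad, eta=0.1, beta=0.9):
    """One update of Equation 11-4 for a scalar parameter."""
    m = beta * m - eta * grad
    return theta + m, m

theta, m = 0.0, 0.0
velocities = []
for _ in range(500):
    # Constant slope: the gradient is 1.0 at every iteration.
    theta, m = momentum_step(theta, m, grad=1.0)
    velocities.append(m)

gd_step = -0.1 * 1.0  # what plain Gradient Descent would move per iteration
print(f"first momentum step:  {velocities[0]:.3f}")
print(f"terminal velocity:    {velocities[-1]:.3f}")
print(f"plain GD step:        {gd_step:.3f}")
```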

**Implementing in Keras**

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

## <a name="nesterov-accelerated-gradient"></a> Nesterov Accelerated Gradient
<mark>**Measures the gradient of the cost function not at the local position $\theta$ but slightly ahead in the direction of the momentum, at $\theta + \beta\textbf{m}$**</mark>

*Equation 11-5. Nesterov Accelerated Gradient Algorithm* 
$$\textbf{m}\leftarrow \beta\textbf{m} - \eta\;\nabla_\theta J(\theta+\beta\textbf{m})$$
$$\theta \leftarrow \theta + \textbf{m}$$
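
To see how the look-ahead changes the update, the following scalar sketch runs two steps of Equation 11-5 next to regular momentum on the toy cost function $J(\theta) = \theta^2$. The cost function, starting point and hyperparameters are assumptions chosen so the arithmetic is easy to follow by hand.

```python
def grad_J(theta):
    """Gradient of the toy cost function J(theta) = theta ** 2."""
    return 2.0 * theta

def momentum_step(theta, m, eta=0.1, beta=0.9):
    m = beta * m - eta * grad_J(theta)              # gradient at the current position
    return theta + m, m

def nesterov_step(theta, m, eta=0.1, beta=0.9):
    m = beta * m - eta * grad_J(theta + beta * m)   # gradient slightly ahead
    return theta + m, m

theta_mom, m_mom = 1.0, 0.0
theta_nag, m_nag = 1.0, 0.0
for step in (1, 2):
    theta_mom, m_mom = momentum_step(theta_mom, m_mom)
    theta_nag, m_nag = nesterov_step(theta_nag, m_nag)
    print(f"step {step}: momentum theta = {theta_mom:.3f}, Nesterov theta = {theta_nag:.3f}")
```

On the second step the Nesterov variant measures the gradient at 0.62 instead of 0.8, so it takes a slightly smaller step, which is what reduces the overshoot near the optimum.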

**Implementing in Keras**

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

## <a name="adagrad"></a>AdaGrad
Gradient Descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum. <mark>It would be nice if the algorithm could correct its direction earlier to point a bit more toward the global optimum.</mark> The *AdaGrad* algorithm achieves this correction by **scaling down the gradient vector along the steepest dimensions.**

*Equation 11-6. AdaGrad algorithm*
$$\textbf{s} \leftarrow \textbf{s} + \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$$
$$\theta \leftarrow \theta - \eta \;\nabla_\theta J(\theta)\;\oslash\sqrt{\textbf{s}+\epsilon}$$

> **NOTE**
>
> 1. $\otimes$ represents element-wise multiplication.
> 2. $\oslash$ represents element-wise division.

1. The first step accumulates the square of the gradients into the vector $\textbf{s}$. 
2. The second step is almost identical to Gradient Descent, but with one key difference: the gradient vector is scaled down by a factor of $\sqrt{\textbf{s}+\epsilon}$. $\epsilon$ is a smoothing term to avoid division by zero, typically set to $10^{-10}$. 

<mark>In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for the gentler slopes. This is called an *adaptive learning rate*.</mark>

AdaGrad's disadvantage is that the learning rate gets scaled down so much that the algorithm often stops too early, before reaching the global optimum, so it is not very helpful for training deep neural networks (although it may be efficient for simpler tasks such as Linear Regression).
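
Both properties (equalized step sizes across dimensions and an ever-shrinking learning rate) show up in a tiny experiment. The sketch below applies Equation 11-6 to a two-dimensional parameter with a constant gradient that is 40 times steeper along the first dimension; the gradient values and the function name `adagrad_step` are illustrative assumptions.

```python
import math

def adagrad_step(theta, s, grad, eta=0.1, eps=1e-10):
    """One update of Equation 11-6 on lists of parameters."""
    s = [si + g * g for si, g in zip(s, grad)]
    theta = [t - eta * g / math.sqrt(si + eps)
             for t, g, si in zip(theta, grad, s)]
    return theta, s

theta, s = [0.0, 0.0], [0.0, 0.0]
grad = [4.0, 0.1]  # steep along the first dimension, gentle along the second
step_sizes = []
for _ in range(4):
    new_theta, s = adagrad_step(theta, s, grad)
    step_sizes.append([abs(n - t) for n, t in zip(new_theta, theta)])
    theta = new_theta

for i, sizes in enumerate(step_sizes, start=1):
    print(f"iteration {i}: step sizes = {sizes[0]:.4f}, {sizes[1]:.4f}")
```

With a constant gradient, the step at iteration $t$ is $\eta/\sqrt{t}$ in every dimension: the steep dimension no longer dominates, but the steps keep shrinking even though the slope has not changed.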

## <a name="rmsprop"></a> RMSProp
The *RMSProp* algorithm fixes AdaGrad's tendency to slow down too fast by **accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training)**. It does so **by using exponential decay in the first step**.

*Equation 11-7. RMSProp algorithm*
$$\textbf{s} \leftarrow \beta\textbf{s} + (1-\beta)\nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$$
$$\theta\leftarrow\theta - \eta\nabla_\theta J(\theta) \oslash \sqrt{\textbf{s}+\epsilon}$$

The <mark>decay rate $\beta$ is typically set to 0.9.</mark>  
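
Under a constant gradient, the exponentially decaying average $\textbf{s}$ converges to the squared gradient, so the step size settles at about $\eta$ instead of shrinking forever. Here is a minimal scalar-list sketch of Equation 11-7; the constant gradient and the function name `rmsprop_step` are assumptions for the demonstration.

```python
import math

def rmsprop_step(theta, s, grad, eta=0.001, beta=0.9, eps=1e-7):
    """One update of Equation 11-7 on lists of parameters."""
    s = [beta * si + (1.0 - beta) * g * g for si, g in zip(s, grad)]
    theta = [t - eta * g / math.sqrt(si + eps)
             for t, g, si in zip(theta, grad, s)]
    return theta, s

theta, s = [0.0], [0.0]
step_sizes = []
for _ in range(200):
    new_theta, s = rmsprop_step(theta, s, grad=[4.0])
    step_sizes.append(abs(new_theta[0] - theta[0]))
    theta = new_theta

print(f"first step: {step_sizes[0]:.5f}")
print(f"last step:  {step_sizes[-1]:.5f}")
print(f"s converges to the squared gradient: {s[0]:.3f}")
```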

**Implementing in Keras**

In [None]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

## <a name="adam-and-nadam-optimization"></a> Adam and Nadam Optimization
*Adam*, which stands for ***adaptive moment estimation***, is an adaptive learning rate algorithm that combines the ideas of momentum optimization and RMSProp: 
- Just like momentum optimization, it keeps track of an exponentially decaying average of past gradients
- and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.

*Equation 11-8. Adam algorithm*
$$\textbf{m} \leftarrow \beta_1\textbf{m} - (1 - \beta_1)\nabla_\theta J(\theta)$$
$$\textbf{s} \leftarrow \beta_2\textbf{s} + (1-\beta_2)\nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$$
$$\hat{\textbf{m}} \leftarrow \frac{\textbf{m}}{1 - \beta_1^{\,t}}$$
$$\hat{\textbf{s}} \leftarrow \frac{\textbf{s}}{1 - \beta_2^{\,t}}$$
$$\theta \leftarrow \theta + \eta\,\hat{\textbf{m}}\oslash \sqrt{\hat{\textbf{s}} + \epsilon}$$

In this equation:
- <mark>$t$ represents the iteration number (starting at 1).</mark>
- <mark>$\beta_1$ is the momentum decay hyperparameter, typically set to 0.9.</mark>
- <mark>$\beta_2$ is the scaling decay hyperparameter, typically set to 0.999.</mark>
- <mark>The smoothing term $\epsilon$ is usually initialized to a tiny number such as $10^{-7}$.</mark>
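
The five steps of the Adam algorithm can be traced on a single scalar parameter. The sketch below keeps the sign convention of Equation 11-8 (the gradient is subtracted into $\textbf{m}$, which is then added to $\theta$) and uses the bias-corrected estimates in the final step; the toy cost function $J(\theta) = \theta^2$ and the function name `adam_step` are assumptions for the example.

```python
import math

def adam_step(theta, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter; t is the iteration number (from 1)."""
    m = beta1 * m - (1.0 - beta1) * grad            # step 1: decaying mean of gradients
    s = beta2 * s + (1.0 - beta2) * grad * grad     # step 2: decaying mean of squares
    m_hat = m / (1.0 - beta1 ** t)                  # step 3: bias correction for m
    s_hat = s / (1.0 - beta2 ** t)                  # step 4: bias correction for s
    theta = theta + eta * m_hat / math.sqrt(s_hat + eps)  # step 5
    return theta, m, s

theta, m, s = 1.0, 0.0, 0.0
grad = 2.0 * theta  # gradient of J(theta) = theta ** 2 at theta = 1
theta_1, m_1, s_1 = adam_step(theta, m, s, grad, t=1)
print(f"m = {m_1:.4f}, s = {s_1:.4f}, theta after one step = {theta_1:.6f}")
```

Thanks to the bias correction, the very first step has a size of about $\eta$ regardless of the gradient's scale, even though $\textbf{m}$ and $\textbf{s}$ start at 0.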



**Implementing in Keras**

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

Two variants of Adam are worth mentioning:

- *AdaMax*
  
  AdaMax replaces Adam's $\ell_2$ norm with the $\ell_\infty$ norm. Specifically, it replaces step 2 in Equation 11-8 with
  $$\mathbf{s}\leftarrow \max(\beta_2 \mathbf{s},\;|\nabla_\theta J(\theta)|)$$
  (using the element-wise absolute value of the gradient), it drops step 4, and in step 5 it scales down the gradient updates by a factor of $\mathbf{s}$, which is just the max of the time-decayed gradients.

- *Nadam*

  Nadam optimization is Adam Optimization plus the Nesterov trick, so it will often converge slightly faster than Adam. 

> 🟠 Adaptive optimization methods are often great and converge faster than regular Gradient Descent, <mark>but they can lead to solutions that generalize poorly on some datasets.</mark> So when you are disappointed by your model's performance, try plain Nesterov Accelerated Gradient instead.
 
> ## Training Sparse Models
> All the optimization algorithms just presented produce <mark>dense models, meaning that most parameters will be nonzero.</mark> If you need a blazingly fast model at runtime, or if you need it to take up less memory, you may prefer to end up with a sparse model instead, which has very few nonzero parameters.
>
><mark>A very good option is to apply strong $\ell_1$ regularization during training</mark>, as it pushes the optimizer to zero out as many weights as it can. Otherwise, check out the TensorFlow Model Optimization Toolkit (TF-MOT), which provides a pruning API.



*Table 11-2. Optimizer comparison* (🟩⬜⬜ is bad, 🟩🟩⬜ is average and 🟩🟩🟩 is good)

|Class|Convergence Speed| Convergence Quality|
|---|---|---|
|**SGD**|🟩⬜⬜|🟩🟩🟩|
|**SGD**(`momentum=...`)|🟩🟩⬜|🟩🟩🟩|
|**SGD**(`momentum`=..., `nesterov=True`)|🟩🟩⬜|🟩🟩🟩|
|**AdaGrad**|🟩🟩🟩|🟩⬜⬜(Stops too early)|
|**RMSProp**|🟩🟩🟩|🟩🟩⬜ or 🟩🟩🟩|
|**Adam**|🟩🟩🟩|🟩🟩⬜ or 🟩🟩🟩|
|**Nadam**|🟩🟩🟩|🟩🟩⬜ or 🟩🟩🟩|
|**AdaMax**|🟩🟩🟩|🟩🟩⬜ or 🟩🟩🟩|

## Learning Rate Scheduling
There are many different strategies to reduce the learning rate during training (instead of keeping it constant, for example at 0.001). These strategies are called *learning schedules*. Some of the most commonly used learning schedules are:
- ***Power Scheduling***

  Set the learning rate to a function of the <mark>iteration number $t$.</mark>
  $$\eta(t) = \frac{\eta_0}{(1+t/s)^c} $$
  where:
  - $\eta_0$ is the initial learning rate
  - $c$ is the power (typically set to 1)
  - $s$ is the steps hyperparameter: the number of iterations it takes for $1 + t/s$ to grow by one
  - $t$ is the iteration number

  With $c = 1$, after $s$ steps the learning rate drops to $\eta_0/2$, after $s$ more steps it is down to $\eta_0/3$, then to $\eta_0/4$ after another $s$ steps, and so on.

  The schedule first drops quickly, then more and more slowly.
- ***Exponential Scheduling***

  $$\eta(t) = \eta_0\; 0.1^{t/s}$$
  The learning rate will gradually drop by a factor of 10 every $s$ steps.

- ***Piecewise constant scheduling***

  Use a constant learning rate for a number of epochs then a smaller learning rate for another number of epochs, and so on.

- ***Performance Scheduling***

  Measure the validation error every $N$ steps (just like early stopping), and reduce the learning rate by a factor of $\lambda$ when the error stops dropping.

- ***1cycle Scheduling***

  Starts by increasing the initial learning rate $\eta_0$, growing linearly up to $\eta_1$ halfway through training. Then it decreases the learning rate linearly down to $\eta_0$ again during the second half of training, finishing the last few epochs by dropping the rate down by several orders of magnitude (still linearly).

  The maximum learning rate $\eta_1$ is chosen using the same approach we used to find the optimal learning rate, and the initial learning rate $\eta_0$ is chosen to be roughly 10 times lower. 
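
Two of the schedules above are sketched here as plain Python functions of the iteration number: power scheduling and 1cycle. The 1cycle sketch rests on assumptions that the text does not pin down: the last phase is given a fixed length `final_steps`, the remaining iterations are split evenly between the ramp up and the ramp down, and the final rate `eta_final` is an arbitrary small value.

```python
def power_schedule(t, eta0=0.01, s=1000, c=1.0):
    """eta(t) = eta0 / (1 + t / s) ** c"""
    return eta0 / (1.0 + t / s) ** c

def _linear(t, t0, t1, y0, y1):
    """Linear interpolation between (t0, y0) and (t1, y1)."""
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)

def one_cycle_schedule(t, total_steps, eta0=0.001, eta1=0.01,
                       final_steps=100, eta_final=1e-6):
    """Linear ramp eta0 -> eta1 -> eta0, then a final linear drop to eta_final."""
    half = (total_steps - final_steps) // 2
    if t < half:
        return _linear(t, 0, half, eta0, eta1)
    if t < 2 * half:
        return _linear(t, half, 2 * half, eta1, eta0)
    return _linear(min(t, total_steps), 2 * half, total_steps, eta0, eta_final)

for t in (0, 1000, 2000, 3000):
    print(f"power schedule, t = {t:4d}: {power_schedule(t):.5f}")
for t in (0, 225, 450, 900, 1000):
    print(f"1cycle,         t = {t:4d}: {one_cycle_schedule(t, total_steps=1000):.6f}")
```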

**Implementing power scheduling in Keras**

In [None]:
# just set the decay hyperparameter when creating the optimizer
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)

The `decay` is the inverse of $s$ (the number of steps it takes to divide the learning rate by one more unit), and Keras assumes that $c$ is equal to 1.

**Implementing Exponential Scheduling in Keras**

**Implementing Exponential Scheduling and Piecewise scheduling in Keras**

In [None]:
def exponential_decay_fn(epoch):
  """
  Returns the learning rate using exponential scheduling.
  Takes the current epoch count (t) and drops the rate by a factor of 10 every 20 epochs (s).
  """
  return 0.01 * 0.1**(epoch / 20)

# Same schedule, but with a configurable initial learning rate and step count
def exponential_decay(lr_0, s):
  def exponential_decay_fn(epoch):
    return lr_0 * 0.1**(epoch / s)
  return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr_0=0.01, s=20)

Next, create a `LearningRateScheduler` callback, giving it the schedule function, and pass this callback to the `fit()` method.

In [None]:
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train_scaled, y_train, [...], callbacks=[lr_scheduler])

The schedule function can optionally take the current learning rate as a second argument.

In [None]:
def exponential_decay_fn(epoch, lr):
  """
  Multiplies the previous learning rate by 0.1**(1/20),
  which results in the same exponential decay
  (starting from the optimizer's initial learning rate).
  """
  return lr * 0.1**(1 / 20)

When you save a model, the optimizer and its learning rate get saved along with it. This means that with this new schedule function (the one just above), you could just load a trained model and continue training where it left off.

However, things are not so simple if your schedule function uses the `epoch` argument (like the previous two functions), since <mark>the epoch does not get saved, and it gets reset to 0 every time you call the `fit()` method.</mark>

One solution is to manually set the `fit()` method's `initial_epoch` argument so that `epoch` starts at the right value.

**Implementing Piecewise constant scheduling in Keras**

In [None]:
def piecewise_constant_fn(epoch):
  if epoch < 5:
    return 0.01
  elif epoch < 15:
    return 0.005
  else:
    return 0.001
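
If the hardcoded thresholds become unwieldy, the same idea can be generalized. The following is a sketch rather than anything provided by Keras: the factory `piecewise_constant` and its `boundaries`/`values` parameters are names chosen here for illustration, and the lookup uses the standard library's `bisect_right`.

```python
from bisect import bisect_right

def piecewise_constant(boundaries, values):
    """Build a schedule from sorted epoch boundaries and len(boundaries) + 1 rates."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")

    def piecewise_constant_fn(epoch):
        # bisect_right counts how many boundaries are <= epoch,
        # which is the index of the interval the epoch falls into
        return values[bisect_right(boundaries, epoch)]

    return piecewise_constant_fn

# Same schedule as the hardcoded version: 0.01 until epoch 5, 0.005 until epoch 15, then 0.001
piecewise_constant_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])
print([piecewise_constant_fn(epoch) for epoch in (0, 4, 5, 14, 15, 30)])
```

The returned function takes a single `epoch` argument, so it can be passed to a `LearningRateScheduler` callback like the other schedule functions.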

**Implementing Performance Scheduling in Keras**

Use the `ReduceLROnPlateau` callback. The following callback will multiply the learning rate by 0.5 whenever the best validation loss does not improve for five consecutive epochs.

In [None]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
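
To see what "does not improve for five consecutive epochs" means in practice, here is a simplified pure-Python simulation of the rule. It is only a sketch of the idea: the function `simulate_reduce_on_plateau` is made up for illustration, and it ignores the real callback's other options (such as `min_delta`, `cooldown`, and `min_lr`).

```python
def simulate_reduce_on_plateau(val_losses, lr=0.01, factor=0.5, patience=5):
    """Return the learning rate used at each epoch, given the validation loss per epoch."""
    best = float("inf")
    wait = 0          # epochs since the best validation loss last improved
    lrs = []
    for loss in val_losses:
        lrs.append(lr)            # rate in effect during this epoch
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:  # plateau detected: shrink the rate and start counting again
                lr *= factor
                wait = 0
    return lrs

# The loss improves for two epochs, then stalls for five: the rate is halved afterwards
val_losses = [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95]
lrs = simulate_reduce_on_plateau(val_losses)
print(lrs)
```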

`tf.keras` offers an alternative way to implement learning rate scheduling:

**Define the learning rate using one of the schedules available in `keras.optimizers.schedules`, then pass this learning rate to any optimizer.**

This <mark>updates the learning rate at each step</mark> rather than at each epoch.

In [None]:
s = 20 * len(X_train) // 32  # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)  # pass the schedule in place of a fixed rate
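
The per-step behaviour can be checked by hand. The sketch below reproduces the exponential decay formula, `initial_learning_rate * decay_rate ** (step / decay_steps)`, in plain Python; the training set size of 55,000 is an assumption made only for this example, and `lr_at_step` is an illustrative name.

```python
n_train = 55_000                  # hypothetical training set size
batch_size = 32
s = 20 * n_train // batch_size    # number of steps in 20 epochs

def lr_at_step(step, initial_lr=0.01, decay_steps=s, decay_rate=0.1):
    """Exponential decay evaluated at every training step, not once per epoch."""
    return initial_lr * decay_rate ** (step / decay_steps)

# The rate already shrinks slightly after a single step,
# and has been divided by 10 after s steps (20 epochs)
print(lr_at_step(0), lr_at_step(1), lr_at_step(s))
```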

<mark>When you save the model, the learning rate and its schedule (including its state) get saved as well.</mark>

# Avoiding Overfitting Through Regularization
With so many parameters, a deep neural network can fit a huge variety of complex datasets, but this flexibility also makes it prone to overfitting the training set.

We have already seen one of the best regularization techniques, *early stopping*. In this section we will examine other popular regularization techniques for neural networks:
- [$\ell_1$ and $\ell_2$ regularization](#l1-l2)
- [Dropout Regularization](#dropout)
- [Max-Norm regularization](#max-norm) 


## <a name="l1-l2"></a>$\ell_1$ and $\ell_2$ regularization
You can use $\ell_2$ regularization to constrain a neural network's connection weights, and/or $\ell_1$ regularization if you want a sparse model (with many weights equal to 0).

**Applying $\ell_2$ regularization to a Keras layer's connection weights**

In [None]:
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

The `l2()` function returns a regularizer that will be called at each step during training to compute the regularization loss. This is then added to the final loss.
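
As a rough illustration of what that regularization loss looks like, the following sketch computes the $\ell_2$ and $\ell_1$ penalties for a small weight vector in plain Python. The helper names and the `data_loss` value are hypothetical; the penalties follow the usual definitions (factor times the sum of squared weights, and factor times the sum of absolute weights).

```python
def l2_penalty(weights, factor=0.01):
    """factor * sum of squared weights."""
    return factor * sum(w ** 2 for w in weights)

def l1_penalty(weights, factor=0.01):
    """factor * sum of absolute weights."""
    return factor * sum(abs(w) for w in weights)

weights = [0.5, -1.0, 2.0]   # hypothetical connection weights of one layer
data_loss = 0.30             # hypothetical loss on the current batch

# The penalty is added to the data loss, so large weights make the total loss worse
total_loss = data_loss + l2_penalty(weights)
print(l2_penalty(weights), l1_penalty(weights), total_loss)
```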

Since we typically want to apply the same regularizer, activation function, and initialization strategy to every hidden layer, we will find ourselves repeating the same arguments, which makes the code ugly and error-prone.

To avoid this, use Python's **`functools.partial()`** function, which lets us create a thin wrapper for any callable, with some default argument values:


In [None]:
from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation='elu',
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
      keras.layers.Flatten(input_shape=(28, 28)),
      RegularizedDense(300),
      RegularizedDense(100),
      RegularizedDense(10, activation="softmax",
                       kernel_initializer="glorot_uniform")
])
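
The output layer above overrides two of the defaults baked into the wrapper. A small standalone sketch shows how `partial` handles this; `make_layer` is a hypothetical stand-in for a layer constructor that simply returns its configuration.

```python
from functools import partial

def make_layer(units, activation="relu", kernel_initializer="glorot_uniform"):
    """Hypothetical stand-in for a layer constructor: returns the configuration it received."""
    return {"units": units, "activation": activation,
            "kernel_initializer": kernel_initializer}

# Freeze new default keyword arguments; positional arguments are still supplied per call
HiddenLayer = partial(make_layer, activation="elu", kernel_initializer="he_normal")

hidden = HiddenLayer(300)                        # uses the frozen defaults
output = HiddenLayer(10, activation="softmax")   # keyword given at call time wins

print(hidden)
print(output)
```

Keyword arguments passed at call time take precedence over the ones stored in the `partial` object, while the remaining stored defaults still apply.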

## <a name="dropout"></a>Dropout
At every training step, every neuron (including the input neurons, but always **excluding the output neurons**) has a probability $p$ of being temporarily "dropped out", meaning it will be entirely ignored during this training step, but it may be active during the next step.

This hyperparameter <mark>$p$ is called ***dropout rate***.</mark>

It is typically set between 10% and 50%:
- closer to 20-30% in recurrent neural networks.
- closer to 40-50% in convolutional neural networks.

<mark>Neurons trained with dropout cannot co-adapt with their neighbouring neurons; **they have to be as useful as possible on their own**.</mark>

>🟢In practice, you can usually apply dropout only to the neurons in the top one to three layers (excluding the output layer) instead of every layer in the network.

<mark>There is a small catch though. Suppose $p = 50\%$, in which case during testing a neuron would be connected to twice as many input neurons as it would be (on average) during training.</mark> To compensate for this, we need to multiply each neuron's input connection weights by 0.5 after training. If we don't, each neuron will get a total input signal roughly twice as large as what the network was trained on and will be unlikely to perform well.

**We need to multiply each input connection weight by the <mark>_keep probability_</mark> ($1-p$) after training**.

Alternatively, we can divide each neuron's output by the keep probability during training.
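
The second option (scaling during training) can be sketched in a few lines of plain Python. This is an illustrative toy, not how any library implements it: the function `dropout` and its `rng` parameter are names chosen for the example.

```python
import random

def dropout(inputs, p, training, rng):
    """Drop each input with probability p and scale survivors by 1 / (1 - p)."""
    if not training:
        return list(inputs)          # at test time the inputs pass through unchanged
    keep_prob = 1.0 - p
    return [x / keep_prob if rng.random() >= p else 0.0 for x in inputs]

rng = random.Random(42)
inputs = [1.0] * 10_000

dropped = dropout(inputs, p=0.5, training=True, rng=rng)
mean_train = sum(dropped) / len(dropped)
mean_test = sum(dropout(inputs, p=0.5, training=False, rng=rng)) / len(inputs)

# Roughly half of the values are zeroed and the rest doubled,
# so the average signal stays close to what the next layer sees at test time
print(mean_train, mean_test)
```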

**Implementing dropout in Keras**
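- Use `keras.layers.Dropout`.
- During training, it randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep probability.
- After training, it does nothing: it just passes the inputs to the next layer.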

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

>🟠 <mark>Since dropout is only active during training, comparing the training loss and the validation loss can be misleading.</mark> In particular, a model may be overfitting the training set and yet have similar training and validation loss. So **make sure to evaluate the training loss without dropout (e.g., after training)**.

If the model is overfitting, increase the dropout rate; if it is underfitting, decrease it.

>🟢 <mark>If we want to regularize a self-normalizing network based on the SELU activation function</mark>, use ***alpha dropout***, a variant of dropout that preserves the mean and standard deviation of its inputs. Regular dropout would break self-normalization.



## Monte Carlo (MC) dropout
In 2016, a paper by Yarin Gal and Zoubin Ghahramani added a few more good reasons to use dropout:
- The paper established a profound connection between dropout networks and approximate Bayesian inference: <mark>training a dropout network is mathematically equivalent to approximate Bayesian inference in a specific type of probabilistic model called a ***Deep Gaussian Process***.</mark>
- They introduced *MC Dropout*, which can boost the performance of a trained dropout network without having to retrain or modify it.

 
**Averaging over multiple predictions with dropout on gives us a Monte Carlo estimate**

**Implementation:**

In [None]:
import numpy as np

# Make 100 predictions with dropout active (training=True), then average them
y_probas = np.stack([model(X_test_scaled, training=True)
                     for sample in range(100)])
y_proba = y_probas.mean(axis=0)

>🔵 The number of Monte Carlo samples you use (100 in the previous example) is a hyperparameter you can tweak. <mark>The higher it is, the more accurate the predictions and their uncertainty estimates will be</mark>. However, inference time grows in proportion to the number of samples, so pick a trade-off that suits your application.
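
The averaging idea can be tried on a toy example without Keras. The sketch below uses a hypothetical single-neuron "model" (`stochastic_predict`, with made-up weights) that keeps dropout on at inference; `mc_predict` averages many noisy predictions and reports their spread as a crude uncertainty estimate.

```python
import random
from statistics import mean, stdev

WEIGHTS = [0.2, 0.5, 0.3]   # hypothetical weights of a single linear neuron

def stochastic_predict(x, rng, p=0.5):
    """One forward pass with dropout still active on the inputs."""
    keep_prob = 1.0 - p
    return sum(w * xi / keep_prob for w, xi in zip(WEIGHTS, x) if rng.random() >= p)

def mc_predict(x, n_samples, rng):
    """Monte Carlo estimate: mean and standard deviation over n_samples noisy passes."""
    samples = [stochastic_predict(x, rng) for _ in range(n_samples)]
    return mean(samples), stdev(samples)

rng = random.Random(0)
x = [1.0, 2.0, 3.0]
deterministic = sum(w * xi for w, xi in zip(WEIGHTS, x))   # prediction with dropout off
mc_mean, mc_std = mc_predict(x, n_samples=2000, rng=rng)

print(deterministic, mc_mean, mc_std)
```

With enough samples the Monte Carlo mean settles near the dropout-free prediction, while the standard deviation indicates how much the individual passes disagree.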

If the model contains other layers that behave in a special way during training (such as `BatchNormalization` layers), you should not force training mode as we just did. Instead, replace the `Dropout` layers with the following **`MCDropout`** class:

In [None]:
class MCDropout(keras.layers.Dropout):
  def call(self, inputs):
    return super().call(inputs, training=True)


class MCAlphaDropout(keras.layers.AlphaDropout):
  # When you have self-normalizing network 
  # based on the SELU activation function
  def call(self, inputs):
    return super().call(inputs, training=True)

<mark>If you are creating a model from scratch, it's just a matter of using `MCDropout` rather than `Dropout`.</mark>

## <a name="max-norm"></a> Max-Norm Regularization
<mark>For each neuron, max-norm regularization constrains the weights $\textbf{w}$ of the incoming connections such that $\lvert\lvert\;{\textbf{w}}\;\rvert\rvert_2 \leqslant r$, where $r$ is the max-norm hyperparameter and $\lvert\lvert \;.\; \rvert\rvert_2$ is the $\ell_2$ norm.</mark>

Max-norm regularization does not add a regularization loss term to the overall loss function. Instead, it is typically implemented by computing $\lvert\lvert\;\textbf{w}\;\rvert\rvert_2$ after each training step and rescaling $\textbf{w}$ if needed ($\textbf{w}\leftarrow\textbf{w}\frac{r}{\lvert\lvert\;\textbf{w}\;\rvert\rvert_2}$).

**Reducing $r$ increases the amount of regularization and helps reduce overfitting.**
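
The rescaling step is easy to verify on a single weight vector. The following is a minimal sketch of the rule described above, written in plain Python; `apply_max_norm` is an illustrative name, and a library implementation may differ in details such as numerical safeguards.

```python
import math

def apply_max_norm(w, r=1.0):
    """Rescale w so that its l2 norm is at most r; leave it untouched otherwise."""
    norm = math.sqrt(sum(wi ** 2 for wi in w))
    if norm <= r:
        return list(w)
    return [wi * r / norm for wi in w]

clipped = apply_max_norm([3.0, 4.0], r=1.0)     # norm 5 is above r: rescaled to norm 1
unchanged = apply_max_norm([0.3, 0.4], r=1.0)   # norm 0.5 is below r: returned as is

print(clipped, unchanged)
```

The direction of the weight vector is preserved; only its length is capped at $r$.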

**Implementing in Keras**
- Set the `kernel_constraint` argument of each hidden layer to a `max_norm()` constraint with the appropriate max value.

In [None]:
keras.layers.Dense(100, activation="elu",
                   kernel_initializer='he_normal',
                   kernel_constraint=keras.constraints.max_norm(1.))

We can also constrain the bias terms by setting the `bias_constraint` argument. 

The `max_norm()` function has an `axis` argument that defaults to `0`. A `Dense` layer usually has weights of shape [*number of inputs*, *number of neurons*], so `axis=0` means that the max-norm constraint applies independently to each neuron's weight vector.

# Summary and Practical Guidelines
*Table 11-3. Default DNN configuration (not a hard-and-fast rule)*

|**Hyperparameter**|**Default value**|
|---|---|
|**Kernel initializer**|He initialization|
|**Activation function**| ELU|
|**Normalization**| None if shallow; Batch Norm if Deep|
|**Regularization**| Early stopping (+$\ell_2$ reg if needed)|
|**Optimizer**| Momentum optimization (or RMSProp or Nadam)|
|**Learning rate schedule**| 1cycle|

If the network is a simple stack of dense layers, then it can self-normalize, and you should use the configuration in Table 11-4 instead.

*Table 11-4. DNN configuration for a self-normalizing net*

|**Hyperparameter**|**Default Value**|
|---|---|
|**Kernel initializer**|LeCun initialization|
|**Activation function**| SELU|
|**Normalization**| None (Self-Normalization) |
|**Regularization**| Alpha Dropout if needed|
|**Optimizer**| Momentum optimization (or RMSProp or Nadam)|
|**Learning rate schedule**| 1cycle |

**Don't forget to normalize the input features.**
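
One common way to normalize a feature is to standardize it to mean 0 and standard deviation 1. The sketch below does this with the standard library only; the helper names and the sample values are made up for illustration, and it assumes the feature is not constant (otherwise the standard deviation would be 0).

```python
from statistics import fmean, pstdev

def fit_scaler(train_column):
    """Compute the mean and standard deviation on the training data only."""
    return fmean(train_column), pstdev(train_column)

def transform(column, mean, std):
    """Apply the training-set statistics to any split (train, validation, or test)."""
    return [(x - mean) / std for x in column]

train = [0.0, 64.0, 128.0, 255.0]   # hypothetical raw values of one feature
test = [32.0, 200.0]

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)   # reuse the training statistics, do not refit

print(train_scaled, test_scaled)
```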