# Chapter 11 – Training Deep Neural Networks

This notebook contains all the code samples and solutions to the exercises in chapter 11 of *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition* (O'Reilly). *Note: all code examples are based on the author's original GitHub repository.*

**Assignment Instructions:**
Per the assignment guidelines, this notebook reproduces the code from Chapter 11. It also includes theoretical explanations and summaries for each concept, as required.

## Chapter Summary

This chapter tackles the challenges of training *deep* neural networks. While Chapter 10 introduced MLPs, training them when they are very deep (e.g., 10+ layers) presents several problems. This chapter provides solutions to these problems, allowing us to build and train powerful, deep models.

Key problems and solutions covered:

1.  **The Vanishing/Exploding Gradients Problem:** Gradients can get smaller and smaller (vanish) or larger and larger (explode) as they backpropagate, making lower layers very hard to train. We address this with:
    * **Weight Initialization:** Using smarter initialization like **Glorot** and **He initialization**.
    * **Nonsaturating Activation Functions:** Replacing functions like tanh or sigmoid with **ReLU** and its variants (**Leaky ReLU, ELU, SELU**), which do not saturate for positive values.
    * **Batch Normalization (BN):** Adding BN layers to zero-center and normalize the inputs at each layer, which dramatically stabilizes and accelerates training.
    * **Gradient Clipping:** Clamping the gradients during backpropagation so they never exceed a threshold.

2.  **Lack of Labeled Data:** Deep networks need lots of data. If we don't have enough, we can use **Transfer Learning** to reuse the lower layers of a network already trained on a similar, large dataset. We also briefly cover *unsupervised pretraining*.

3.  **Slow Training:** We can speed up training significantly by using more advanced optimizers instead of regular Stochastic Gradient Descent:
    * **Momentum optimization**
    * **Nesterov Accelerated Gradient (NAG)**
    * **AdaGrad, RMSProp,** and **Adam**

4.  **Overfitting:** Deep networks have millions of parameters and can easily overfit. We explore powerful regularization techniques:
    * **L1 and L2 Regularization**
    * **Dropout** and **Alpha Dropout** (for self-normalizing networks)
    * **Monte Carlo (MC) Dropout** for better uncertainty estimates.
    * **Max-Norm Regularization**

Finally, the chapter provides practical guidelines and default configurations for building a high-performance deep neural network.

## Setup

First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 3.7 or later is required for the latest versions of Scikit-Learn), as well as Scikit-Learn ≥0.20 and TensorFlow ≥2.0.

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "deep"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## The Vanishing/Exploding Gradients Problems

### Theoretical Explanation

The **backpropagation** algorithm works by propagating the error gradient from the output layer to the input layer. As it progresses down the network, the gradients often get smaller and smaller until they are almost zero. When this happens, Gradient Descent leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is the **vanishing gradients problem**.

The opposite can also happen: the gradients can grow bigger and bigger until the weights become insanely large and the algorithm diverges. This is the **exploding gradients problem**.

These problems arise because of the combination of the activation functions used and the weight initialization method. For example, the logistic (sigmoid) activation function saturates at 0 and 1, where its derivative is extremely close to 0. During backpropagation, this tiny gradient gets diluted as it passes through each layer, so there is nothing left for the lower layers.
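
To make the saturation argument concrete, here is a minimal sketch (not taken from the book) that multiplies sigmoid derivatives along a chain of layers. It assumes a toy setting with one unit per layer and all weights equal to 1, so the backpropagated gradient is simply the product of the local derivatives. The helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def chained_gradient(z, n_layers):
    """Product of n_layers identical sigmoid derivatives (toy setting: weights = 1)."""
    return sigmoid_derivative(z) ** n_layers

# The sigmoid derivative peaks at 0.25 (z = 0), so even the best case shrinks fast.
best_case = chained_gradient(0.0, n_layers=10)
# When the unit saturates (z = 5), each layer multiplies the gradient by about 0.0066.
saturated = chained_gradient(5.0, n_layers=10)

print(f"10 layers, z = 0: {best_case:.2e}")
print(f"10 layers, z = 5: {saturated:.2e}")
```

Even in the best case the gradient reaching the first layer is roughly a millionth of its original size, which is why the lower layers barely move.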

We will explore several solutions to this.

### 1. Glorot and He Initialization

### Theoretical Explanation

To alleviate unstable gradients, we need the signal to flow properly in both directions (forward for predictions, backward for gradients). We need the variance of the outputs of each layer to be equal to the variance of its inputs, and the gradients to have equal variance before and after flowing through a layer in the reverse direction.

**Glorot (or Xavier) initialization**, named after the first author of the paper that proposed it, is a practical compromise between these two requirements. The connection weights of each layer must be initialized randomly as described below, where $fan_{in}$ and $fan_{out}$ are the number of input and output connections for the layer (known as *fan-in* and *fan-out*).

**Glorot Initialization (for tanh, logistic, softmax):**
* Normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{fan_{avg}}$, where $fan_{avg} = (fan_{in} + fan_{out}) / 2$.
* Or a uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$.

**He Initialization (for ReLU and its variants):**
This strategy is similar but accounts for the fact that ReLU cuts all negative values.
* Normal distribution with mean 0 and variance $\sigma^2 = \frac{2}{fan_{in}}$.
* Or a uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{6}{fan_{in}}}$.
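
The formulas above can be checked with a few lines of arithmetic. This is a small sketch with hypothetical helper names. It relies on the fact that a uniform distribution on $[-r, +r]$ has variance $r^2/3$, so the normal and uniform forms of each scheme give the same variance.

```python
import math

def glorot_params(fan_in, fan_out):
    """Return (sigma, r) for Glorot initialization."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = math.sqrt(1 / fan_avg)   # std dev of the normal variant
    r = math.sqrt(3 / fan_avg)       # bound of the uniform variant
    return sigma, r

def he_params(fan_in):
    """Return (sigma, r) for He initialization."""
    sigma = math.sqrt(2 / fan_in)
    r = math.sqrt(6 / fan_in)
    return sigma, r

# Example: a layer with 300 inputs and 100 neurons
g_sigma, g_r = glorot_params(300, 100)
h_sigma, h_r = he_params(300)

print(f"Glorot: sigma = {g_sigma:.4f}, r = {g_r:.4f}")
print(f"He:     sigma = {h_sigma:.4f}, r = {h_r:.4f}")
```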

By default, Keras uses Glorot initialization with a uniform distribution. We can switch to He initialization by setting `kernel_initializer="he_normal"` or `"he_uniform"` in a layer.

In [2]:
# Using He initialization
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

<Dense name=dense, built=False>

In [3]:
# If you want He init with a uniform distribution but based on fan_avg
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation="relu", kernel_initializer=he_avg_init)

<Dense name=dense_1, built=False>

### 2. Nonsaturating Activation Functions

### Theoretical Explanation

The 2010 paper by Glorot and Bengio highlighted that the vanishing gradients problem was also due in part to the choice of activation function: functions like sigmoid or tanh saturate for inputs of large magnitude, where their derivative is close to 0.

**ReLU (Rectified Linear Unit):** `ReLU(z) = max(0, z)`
This function is the most popular default. It doesn't saturate for positive values and is fast to compute.
* **Problem:** ReLU suffers from the "dying ReLUs" problem. During training, some neurons can "die," meaning they stop outputting anything other than 0 (because their weights get tweaked so the weighted sum of their inputs is always negative). When this happens, Gradient Descent can't affect them anymore because the gradient of ReLU is 0 when $z < 0$.

**Leaky ReLU:** `LeakyReLU(z) = max(αz, z)`
This is a variant of ReLU. The hyperparameter $\alpha$ (alpha) defines how much the function "leaks." It's the slope of the function for $z < 0$ and is typically set to 0.01. This small slope ensures that leaky ReLUs never die.

**PReLU (Parametric Leaky ReLU):** $\alpha$ is *learned* during training, rather than being a fixed hyperparameter.

**ELU (Exponential Linear Unit):** `ELU(z) = α(exp(z) - 1)` if $z < 0$, `z` if $z ≥ 0$.
In its authors' experiments, this function outperformed the other ReLU variants: training time was reduced and the network performed better on the test set.
1.  It takes on negative values, which allows the unit's average output to be closer to 0, alleviating the vanishing gradients problem.
2.  It has a non-zero gradient for $z < 0$, which avoids the dead neurons problem.
3.  If $\alpha$ is equal to 1, it is smooth everywhere, including around $z = 0$, which helps speed up Gradient Descent.
* **Drawback:** It is slower to compute than ReLU due to the exponential function.

**SELU (Scaled ELU):**
This is a scaled variant of ELU. If you build a network composed exclusively of a stack of dense layers, and all hidden layers use the **SELU** activation function with **LeCun normal initialization**, then the network will **self-normalize**. The output of each layer will tend to preserve a mean of 0 and standard deviation of 1 during training, which solves the unstable gradients problem.
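
As a compact reference, the activation functions described above can be written in NumPy as follows. This is an illustrative sketch, not the Keras implementation. The function names are hypothetical, and the two SELU constants are the values given in the SELU paper (Klambauer et al., 2017), which the text above does not list.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z) is valid as long as alpha < 1
    return np.maximum(alpha * z, z)

def elu(z, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(z < 0, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0), z)

# Constants from the SELU paper (an addition to the text above)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    return SELU_SCALE * elu(z, SELU_ALPHA)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (relu, leaky_relu, elu, selu):
    print(f"{fn.__name__:>10}: {np.round(fn(z), 4)}")
```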

**Which activation function to use?**
In general **SELU > ELU > Leaky ReLU > ReLU > tanh > logistic**.
* If the network architecture allows for self-normalization, **SELU** is the best choice.
* If not, **ELU** is a great default.
* If you care a lot about runtime latency, **Leaky ReLU** is a good compromise.
* **ReLU** is the most used, so many libraries and hardware accelerators are optimized for it. If speed is your priority, ReLU might be the best choice.

In [4]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]), # Example input layer
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"), # Example hidden layer
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2), # alpha is the leak parameter
    keras.layers.Dense(10, activation="softmax") # Example output layer
])

In [5]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]), # Example input layer
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"), # Example hidden layer
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.PReLU(), # alpha will be learned
    keras.layers.Dense(10, activation="softmax") # Example output layer
])

In [6]:
# Code Reproduction: Using SELU for a self-normalizing network

# Note: For SELU, you must use kernel_initializer="lecun_normal"
layer = keras.layers.Dense(10, activation="selu",
                           kernel_initializer="lecun_normal")

### 3. Batch Normalization

### Theoretical Explanation

**Batch Normalization (BN)** is a technique that addresses the vanishing/exploding gradients problems, and it has become one of the most-used layers in Deep Learning.

The technique consists of adding an operation in the model just before or after the activation function of each hidden layer. This operation does the following:
1.  **Zero-centers and normalizes** each input.
2.  **Scales and shifts** the result using two new parameter vectors per layer: one for scaling (gamma, $\gamma$) and the other for shifting (beta, $\beta$).

In other words, the operation lets the model learn the optimal scale and mean of each of the layer's inputs.

**How it works (training):**
The algorithm estimates each input's mean ($\mu$) and standard deviation ($\sigma$) *over the current mini-batch*. Then it normalizes the input: $\hat{x}^{(i)} = (x^{(i)} - \mu_B) / \sqrt{\sigma_B^2 + \epsilon}$.
Finally, it scales and shifts the result: $z^{(i)} = \gamma \otimes \hat{x}^{(i)} + \beta$.

**How it works (testing):**
At test time, we may need to make predictions for individual instances rather than for batches, so we cannot rely on mini-batch statistics. Instead, the algorithm uses the *final* input means and standard deviations, which are estimated during training using a moving average.
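
The two modes can be sketched in NumPy as shown here. This is a simplified illustration, not the Keras layer itself: the function names are hypothetical, and the `momentum` and `eps` defaults are illustrative values chosen for this sketch.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, moving_mean, moving_var,
                     momentum=0.99, eps=1e-3):
    """Normalize with mini-batch statistics and update the moving averages."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    Z = gamma * X_hat + beta
    new_mean = momentum * moving_mean + (1 - momentum) * mu
    new_var = momentum * moving_var + (1 - momentum) * var
    return Z, new_mean, new_var

def batch_norm_infer(X, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """Normalize with the moving averages estimated during training."""
    X_hat = (X - moving_mean) / np.sqrt(moving_var + eps)
    return gamma * X_hat + beta

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # one mini-batch, 4 features
gamma, beta = np.ones(4), np.zeros(4)
moving_mean, moving_var = np.zeros(4), np.ones(4)

Z, moving_mean, moving_var = batch_norm_train(X, gamma, beta, moving_mean, moving_var)
print("Batch output mean:", np.round(Z.mean(axis=0), 3))
print("Batch output std: ", np.round(Z.std(axis=0), 3))
```

With $\gamma = 1$ and $\beta = 0$, the training-mode output has a mean of approximately 0 and a standard deviation of approximately 1 for every feature.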

**Advantages of BN:**
* Strongly reduces the vanishing gradients problem.
* Networks are much less sensitive to weight initialization.
* Allows the use of much larger learning rates, speeding up training.
* Acts as a **regularizer**, reducing the need for other regularization techniques (like dropout).

In [7]:
# Code Reproduction: Implementing Batch Normalization with Keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(), # Add BN layer as the first layer
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(), # Add BN layer after the hidden layer
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(), # Add BN layer after the hidden layer
    keras.layers.Dense(10, activation="softmax")
])

In [8]:
model.summary()

Note that each BN layer adds four parameters per input: $\gamma$, $\beta$, $\mu$, and $\sigma$. The last two (the moving averages) are not trainable (they are not affected by backpropagation), so Keras calls them "non-trainable params."
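
The parameter counts can be worked out by hand. The sketch below assumes the model defined in the cell above: BN layers after the 784-input Flatten layer and after the 300-unit and 100-unit Dense layers, with biases enabled in every Dense layer.

```python
# Number of inputs seen by each of the three BN layers
bn_inputs = [784, 300, 100]

# gamma, beta, mu and sigma: four parameters per input
bn_params = [4 * n for n in bn_inputs]

# Only the moving averages (mu and sigma) are non-trainable
non_trainable = sum(2 * n for n in bn_inputs)

# Dense layers: weights plus biases
dense_params = [784 * 300 + 300, 300 * 100 + 100, 100 * 10 + 10]

total = sum(bn_params) + sum(dense_params)
print("BN params per layer:", bn_params)
print("Non-trainable params:", non_trainable)
print("Total params:", total)
```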

The authors of the BN paper argued for adding BN layers *before* the activation functions. To do this, you remove the activation from the hidden layer and add it as a separate layer after the BN layer.

In [9]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False), # Set use_bias=False
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])

### 4. Gradient Clipping

### Theoretical Explanation

A popular technique to mitigate the **exploding gradients** problem is to clip the gradients during backpropagation so they never exceed some threshold. This is called **Gradient Clipping**.

In Keras, you can set the `clipvalue` or `clipnorm` argument when creating an optimizer.

In [10]:
# `clipvalue` clips every component of the gradient vector to be between -1.0 and 1.0
optimizer = keras.optimizers.SGD(clipvalue=1.0)

# `clipnorm` clips the whole gradient vector if its l2 norm is greater than 1.0
optimizer = keras.optimizers.SGD(clipnorm=1.0)
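
The difference between the two options is easiest to see on a small gradient vector. The following NumPy sketch uses hypothetical helper names and does not reproduce the Keras internals. It shows that clipping by value can change the gradient's direction, whereas clipping by norm preserves it.

```python
import numpy as np

def clip_by_value(grad, clip_value):
    """Clip each component independently to [-clip_value, clip_value]."""
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its l2 norm exceeds clip_norm."""
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        return grad * (clip_norm / norm)
    return grad

grad = np.array([0.9, 100.0])
by_value = clip_by_value(grad, 1.0)   # second component is cut, direction changes
by_norm = clip_by_norm(grad, 1.0)     # same direction, norm reduced to 1

print("clip by value:", by_value)
print("clip by norm: ", np.round(by_norm, 6))
```

After clipping by value, the two components are almost equal, so the update points mostly along the diagonal. After clipping by norm, the first component is nearly eliminated but the orientation is unchanged.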

## Reusing Pretrained Layers

### Theoretical Explanation

It is generally not a good idea to train a very large DNN from scratch. Instead, you should almost always try to find an existing neural network that accomplishes a similar task and reuse its lower layers. This technique is called **Transfer Learning**.

**Why it works:**
* It speeds up training considerably.
* It requires significantly less training data.
* Lower layers of a network learn general features (e.g., edges, textures), while upper layers learn task-specific features (e.g., cat ears, dog noses). For a new, similar task, the general features are likely to be useful.

**How to do it (with Keras):**
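1.  Load a pretrained model (e.g., `model_A`) and leave out its output layer (for `keras.applications` models, pass `include_top=False`; for a Sequential model, slice it off with `model_A.layers[:-1]`).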
2.  Create a new model (`model_B_on_A`) using `model_A`'s layers, and add your new output layer on top.
3.  **Freeze** the weights of the reused layers (by setting `layer.trainable = False` for each one) to avoid wrecking them.
4.  Compile and train the model for a few epochs. This will only train the new output layer.
5.  **Unfreeze** the reused layers (or some of them).
6.  Compile the model again, this time with a **much lower learning rate**.
7.  Continue training to fine-tune the reused layers for your new task.

In [11]:
# Code Reproduction: Transfer Learning with Keras

# Let's load the Fashion MNIST dataset again
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

# Split the data. We'll pretend we're training on a different task with classes 0-7.
X_train_A, y_train_A = X_train_full[y_train_full < 8], y_train_full[y_train_full < 8]
X_test_A, y_test_A = X_test[y_test < 8], y_test[y_test < 8]

# The new task B is binary: class 8 (Bag) vs. class 9 (Ankle boot). We pretend it has very little data.
X_train_B, y_train_B = X_train_full[y_train_full >= 8], y_train_full[y_train_full >= 8]
X_test_B, y_test_B = X_test[y_test >= 8], y_test[y_test >= 8]

# Scale the data
X_train_A = X_train_A / 255.0
X_test_A = X_test_A / 255.0
X_train_B = X_train_B / 255.0
X_test_B = X_test_B / 255.0

# Let's pretend we've already trained and saved model A
tf.random.set_seed(42)
model_A = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(8, activation="softmax") # 8 classes
])
model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])
history = model_A.fit(X_train_A, y_train_A, epochs=10, validation_split=0.1)
model_A.save("my_model_A.h5")

Epoch 1/10
1350/1350 ━━━━━━━━━━━━━━━━━━━━ 8s 5ms/step - accuracy: 0.4977 - loss: 1.5771 - val_accuracy: 0.7160 - val_loss: 0.9051
Epoch 2/10
1350/1350 ━━━━━━━━━━━━━━━━━━━━ 7s



In [12]:
# 1. Load model A
model_A = keras.models.load_model("my_model_A.h5")

# 2. Create model B based on A's layers (reusing all but the output layer)
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid")) # New output layer for binary task B

# 3. Freeze the reused layers
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

# 4. Compile and train (only the new output layer will be trained)
# Use a relatively large learning rate here, since only the new output layer is being trained
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])

# We subtract 8 from the labels to make them 0 (Bag) or 1 (Ankle boot)
history = model_B_on_A.fit(X_train_B, y_train_B - 8, epochs=4,
                            validation_split=0.1)

# 5. Unfreeze the reused layers
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

# 6. Compile again with a very low learning rate
optimizer = keras.optimizers.SGD(learning_rate=1e-4)
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=optimizer,
                     metrics=["accuracy"])

# 7. Continue training (fine-tuning)
history = model_B_on_A.fit(X_train_B, y_train_B - 8, epochs=16,
                            validation_split=0.1)



Epoch 1/4
338/338 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.8318 - loss: 0.4458 - val_accuracy: 0.9233 - val_loss: 0.3211
Epoch 2/4
338/338 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9253 - loss: 0.3009 - val_accuracy: 0.9367 - val_loss: 0.2624
Epoch 3/4
338/338 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9378 - loss: 0.2507 - val_accuracy: 0.9525 - val_loss: 0.2277
Epoch 4/4
338/338 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9446 - loss: 0.2198 - val_accuracy: 0.9583 - val_loss: 0.2038
Epoch 1/16
338/338 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9546 - loss: 0.1958 - val_accuracy: 0.9633 - val_loss: 0.1786
Epoch 2/16
338/338 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.9647 - loss: 0.1720 - val_accuracy: 0.9692 - val_loss: 0.1592
Epoch 3/16
338/338

In [13]:
model_B_on_A.evaluate(X_test_B, y_test_B - 8)

[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.9915 - loss: 0.0710


[0.06705056130886078, 0.9915000200271606]

Transfer learning tends to work best with deep convolutional neural networks, which learn more general feature detectors (especially in the lower layers) than small dense networks like this one.

## Faster Optimizers

### Theoretical Explanation

Training a large DNN can be painfully slow. A huge speed boost comes from using a faster optimizer than regular Gradient Descent.

**1. Momentum Optimization**
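Imagine a bowling ball rolling down a gentle slope: it starts slowly, but picks up *momentum* until it reaches terminal velocity. Regular Gradient Descent, by contrast, takes small, regular steps down the slope. At each iteration, momentum optimization subtracts the local gradient (multiplied by the learning rate) from a *momentum vector* **m**, then updates the weights by adding **m**. The gradient is therefore used for acceleration, not for speed, which lets the optimizer escape plateaus much faster. The `momentum` hyperparameter (typically 0.9) acts as friction and keeps the momentum from growing too large: 0 means high friction and 1 means no friction.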

**2. Nesterov Accelerated Gradient (NAG)**
A small variant of momentum optimization that is almost always faster. Instead of measuring the gradient at the current position, it measures the gradient slightly *ahead*, in the direction of the momentum. This is slightly more accurate and helps reduce oscillations.
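
The two update rules differ only in where the gradient is evaluated. Here is a minimal sketch on the toy cost function $J(\theta) = \theta^2$. The function names, learning rate, and starting point are assumptions made for illustration.

```python
def grad(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2.0 * theta

def momentum_step(theta, m, lr=0.1, beta=0.9):
    m = beta * m - lr * grad(theta)              # gradient at the current position
    return theta + m, m

def nesterov_step(theta, m, lr=0.1, beta=0.9):
    m = beta * m - lr * grad(theta + beta * m)   # gradient slightly ahead
    return theta + m, m

def run(step_fn, n_steps=200, theta=5.0):
    m = 0.0
    for _ in range(n_steps):
        theta, m = step_fn(theta, m)
    return theta

theta_momentum = run(momentum_step)
theta_nesterov = run(nesterov_step)
print(f"momentum: {theta_momentum:.2e}")
print(f"nesterov: {theta_nesterov:.2e}")
```

Both runs end close to the minimum at 0. On this toy problem the Nesterov variant damps its oscillations sooner, which matches the intuition given above.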

**3. AdaGrad (Adaptive Gradient)**
This algorithm decays the learning rate, but it does so *faster for steep dimensions* and *slower for dimensions with gentler slopes*. This is an *adaptive learning rate*. It helps point the updates more directly toward the global optimum. However, it often stops too early because the learning rate gets scaled down too much.

**4. RMSProp (Root Mean Square Propagation)**
This algorithm fixes AdaGrad's problem by accumulating only the gradients from the most *recent* iterations (using an exponential decay). It has become a very popular optimizer.
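
To see why AdaGrad tends to stop early and how the exponential decay helps, the sketch below tracks the effective learning rate (the learning rate divided by the square root of the accumulator) for a single parameter that receives a constant gradient of 1. The helper names and hyperparameter values are illustrative assumptions.

```python
import math

def adagrad_effective_lr(grads, lr=0.01, eps=1e-10):
    """AdaGrad accumulates all squared gradients, so the rate keeps shrinking."""
    s, rates = 0.0, []
    for g in grads:
        s += g * g
        rates.append(lr / math.sqrt(s + eps))
    return rates

def rmsprop_effective_lr(grads, lr=0.01, rho=0.9, eps=1e-10):
    """RMSProp keeps an exponentially decaying average of squared gradients."""
    s, rates = 0.0, []
    for g in grads:
        s = rho * s + (1 - rho) * g * g
        rates.append(lr / math.sqrt(s + eps))
    return rates

grads = [1.0] * 100
ada = adagrad_effective_lr(grads)
rms = rmsprop_effective_lr(grads)
print(f"AdaGrad after 100 steps: {ada[-1]:.5f}")
print(f"RMSProp after 100 steps: {rms[-1]:.5f}")
```

After 100 steps, AdaGrad's effective rate has fallen to a tenth of the base rate and keeps decreasing, while RMSProp's settles near the base rate.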

**5. Adam and Nadam**
* **Adam** (Adaptive Moment Estimation) combines the ideas of momentum optimization and RMSProp. It keeps track of an exponentially decaying average of past gradients (like momentum) and an exponentially decaying average of past *squared* gradients (like RMSProp). It is very popular and generally performs well, requiring less tuning of the learning rate.
* **Nadam** is Adam optimization plus the Nesterov trick. It often converges slightly faster than Adam.
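
A scalar version of the Adam update can be sketched as follows. It is written in the common form where the step is subtracted from the parameter, and it includes the usual bias-correction terms for the two moving averages. The function name and the `eps` value are assumptions made for this illustration.

```python
import math

def adam_minimize(grad_fn, theta, n_steps, lr=0.001,
                  beta_1=0.9, beta_2=0.999, eps=1e-7):
    m = s = 0.0
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta_1 * m + (1 - beta_1) * g        # decaying average of gradients
        s = beta_2 * s + (1 - beta_2) * g * g    # decaying average of squared gradients
        m_hat = m / (1 - beta_1 ** t)            # bias correction (m and s start at 0)
        s_hat = s / (1 - beta_2 ** t)
        theta -= lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta

# One step on two cost functions whose gradients differ by a factor of 1,000
after_small = adam_minimize(lambda th: 2.0 * th, theta=5.0, n_steps=1)
after_large = adam_minimize(lambda th: 2000.0 * th, theta=5.0, n_steps=1)
print(after_small, after_large)
```

Because the update divides by the root of the squared-gradient average, the first step has a size of about `lr` in both cases. This insensitivity to gradient scale is one reason Adam needs less learning rate tuning.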

In [15]:
# Code Reproduction: Using different optimizers in Keras

# Momentum
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

# Nesterov Accelerated Gradient
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

# RMSProp
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

# Adam
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

# Nadam
optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

## Learning Rate Scheduling

### Theoretical Explanation

Finding a good learning rate is crucial. If it's too high, training may diverge. If it's too low, it will take too long. If it's slightly too high, it may converge fast but be unstable around the optimum.

A better approach than a constant learning rate is to use a **learning schedule**, which reduces the learning rate during training. This can help you start with a large learning rate (for fast convergence) and then reduce it to settle at a good solution.

Common schedules include:
* **Power scheduling:** $\eta(t) = \eta_0 / (1 + t/s)^c$. Drops quickly, then more slowly.
* **Exponential scheduling:** $\eta(t) = \eta_0 \times 0.1^{t/s}$. Drops by a factor of 10 every $s$ steps.
* **Piecewise constant scheduling:** Use a constant rate for some epochs, then a smaller rate, etc.
* **Performance scheduling:** Reduce the rate by a factor of $\lambda$ when the validation error stops dropping.
* **1cycle scheduling:** Increases the rate from $\eta_0$ to $\eta_1$ during the first half of training, then decreases it back to $\eta_0$ during the second half. Often speeds up training considerably.
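
The first two formulas and a simplified 1cycle ramp can be written as plain Python functions. This is a sketch only: the function names and default values are hypothetical, and the 1cycle version implements just the linear up-then-down shape described above.

```python
def power_schedule(t, lr0=0.01, s=1000, c=1.0):
    """eta(t) = eta0 / (1 + t/s)**c"""
    return lr0 / (1 + t / s) ** c

def exponential_schedule(t, lr0=0.01, s=1000):
    """eta(t) = eta0 * 0.1**(t/s): drops by a factor of 10 every s steps."""
    return lr0 * 0.1 ** (t / s)

def one_cycle_schedule(t, total_steps, lr0=0.001, lr1=0.01):
    """Simplified 1cycle: linear ramp up for the first half, linear ramp down after."""
    half = total_steps / 2
    if t <= half:
        return lr0 + (lr1 - lr0) * t / half
    return lr1 - (lr1 - lr0) * (t - half) / half

for t in (0, 1000, 2000):
    print(t, power_schedule(t), exponential_schedule(t))
print([round(one_cycle_schedule(t, 100), 4) for t in (0, 25, 50, 75, 100)])
```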

In [18]:
# Code Reproduction: Implementing schedules in Keras

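# 1. Power scheduling with c = 1: lr = lr0 / (1 + step / decay_steps)
# (InverseTimeDecay is used here; s = 10,000 steps is an illustrative choice)
power_lr = keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.01, decay_steps=10_000, decay_rate=1.0)
optimizer = keras.optimizers.SGD(learning_rate=power_lr)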

# 2. Exponential scheduling (using a callback)
def exponential_decay_fn(epoch):
    return 0.01 * 0.1**(epoch / 20)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
# Then, pass callbacks=[lr_scheduler] to model.fit()

# 3. Piecewise constant scheduling (using a callback)
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

lr_scheduler_piecewise = keras.callbacks.LearningRateScheduler(piecewise_constant_fn)

# 4. Performance scheduling (using a callback)
lr_scheduler_perf = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

# 5. tf.keras schedules (updates at each step, not epoch)
X_train = X_train_full # Define X_train based on existing variable
s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)

## Avoiding Overfitting Through Regularization

Deep neural networks have many parameters, which gives them a lot of freedom and makes them prone to overfitting. We've already seen two regularization techniques: **Batch Normalization** and **Early Stopping**. Here are a few more.

### 1. ℓ1 and ℓ2 Regularization

### Theoretical Explanation

You can apply $\ell_1$ and $\ell_2$ regularization to constrain a neural network’s connection weights.
* **$\ell_2$ regularization** (like Ridge) penalizes large weights and encourages smaller weights.
* **$\ell_1$ regularization** (like Lasso) pushes the optimizer to zero out as many weights as it can, leading to a *sparse model*.
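
Both penalties are simple functions of the weight matrix that get added to the training loss. A small NumPy sketch follows; the helper names are hypothetical, and the factor 0.01 is an arbitrary illustrative value.

```python
import numpy as np

def l1_penalty(W, factor=0.01):
    """factor * sum of absolute weights (pushes weights to exactly zero)."""
    return factor * np.sum(np.abs(W))

def l2_penalty(W, factor=0.01):
    """factor * sum of squared weights (shrinks large weights the most)."""
    return factor * np.sum(np.square(W))

W = np.array([[0.5, -2.0],
              [0.0,  1.0]])

print("l1 penalty:", l1_penalty(W))
print("l2 penalty:", l2_penalty(W))
```

Note how the weight of -2.0 accounts for most of the $\ell_2$ penalty (4.0 out of 5.25 before scaling), whereas under $\ell_1$ it contributes in proportion to its magnitude.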

In Keras, you can apply a kernel regularizer to any layer.

In [19]:
from functools import partial

# We can apply a regularizer to a layer like this:
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

# To avoid repeating all those parameters, we can use functools.partial
RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax",
                       kernel_initializer="glorot_uniform",
                       kernel_regularizer=None) # No regularization on output layer
])

### 2. Dropout

### Theoretical Explanation

**Dropout** is one of the most popular and successful regularization techniques for deep neural networks.

**The Algorithm:** At every training step, every neuron (excluding output neurons) has a probability `p` (the *dropout rate*, typically 10-50%) of being temporarily **"dropped out."** This means it will be entirely ignored during this training step, but it may be active in the next step.

**Why it works:**
1.  **More Robust Neurons:** Neurons trained with dropout cannot co-adapt with their neighboring neurons. They are forced to be as useful as possible on their own. They become less sensitive to slight changes in the inputs, leading to a more robust network.
2.  **Ensemble Effect:** At each training step, a unique network is generated. The final neural network can be seen as an averaging ensemble of all these smaller networks.

**Implementation:** In Keras, you add a `keras.layers.Dropout` layer. During training, it randomly drops inputs and divides the remaining inputs by the *keep probability* ($1 - p$). After training (at test time), it does nothing at all.
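
The behavior described above can be sketched in NumPy. This is an illustration with a hypothetical `dropout` helper, not the Keras layer. Dividing by the keep probability during training keeps the expected value of each input unchanged, which is why nothing needs to be done at test time.

```python
import numpy as np

def dropout(X, rate, training, rng):
    """Drop each input with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        return X                                  # inference: pass-through
    keep_prob = 1.0 - rate
    mask = rng.random(X.shape) < keep_prob        # True for inputs that are kept
    return X * mask / keep_prob

rng = np.random.default_rng(42)
X = np.ones((1000, 100))

dropped = dropout(X, rate=0.2, training=True, rng=rng)
print("Values after dropout:", np.unique(dropped))
print("Mean after dropout:  ", round(dropped.mean(), 3))
```

The surviving inputs are scaled from 1.0 to 1.25, so the overall mean stays close to 1.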

**Note:** If a model is overfitting, you can increase the dropout rate. If it is underfitting, you should decrease it.

In [20]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

### 3. Monte Carlo (MC) Dropout

### Theoretical Explanation

**MC Dropout** is a technique that uses dropout to get a better measure of the model's uncertainty.

Instead of only using dropout during training, we **also activate it during inference (prediction)**. Because dropout is active, we will get a slightly different prediction every time we run it.

By making multiple predictions (e.g., 100) on the same instance and averaging them, we get a **Monte Carlo estimate** that is generally more reliable than a single prediction. More importantly, we can look at the *standard deviation* of these predictions to get a measure of the model's uncertainty.

To do this, we can't just call `model.predict()`. We have to call the model as a function with `training=True`.

In [21]:
# This code assumes a 'model' with Dropout layers has been trained
# and 'X_test' is available.

# We'll create a simple model for demonstration
model_mc = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
model_mc.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
# In a real scenario, you would fit this model first.

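# To perform MC Dropout, call the model with training=True so dropout stays active.
# X_test still holds raw pixel values (0-255), so scale it like the training data first.
X_test_scaled = X_test / 255.0
y_probas = np.stack([model_mc(X_test_scaled, training=True)
                     for _ in range(100)])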

# y_probas shape is [n_samples, n_instances, n_classes]
# We average over the samples to get the final probabilities:
y_proba = y_probas.mean(axis=0)

# We can also get the standard deviation to measure uncertainty
y_std = y_probas.std(axis=0)

### 4. Max-Norm Regularization

### Theoretical Explanation

**Max-Norm Regularization** is another popular technique. For each neuron, it constrains the weights $\mathbf{w}$ of the incoming connections such that $\|\mathbf{w}\|_2 \le r$, where $r$ is the *max-norm* hyperparameter and $\|\cdot\|_2$ is the $\ell_2$ norm.

It doesn't add a regularization loss. Instead, after each training step, it computes the $\ell_2$ norm of each neuron's weight vector and rescales the vector if needed ($\mathbf{w} \leftarrow \mathbf{w} \frac{r}{\|\mathbf{w}\|_2}$).

Reducing $r$ increases the regularization and helps reduce overfitting. It can also help alleviate unstable gradients.
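
The rescaling step can be sketched in NumPy as follows. The sketch assumes a kernel of shape `[n_inputs, n_neurons]`, so each column holds one neuron's incoming weights. The helper name and the small `eps` guard against division by zero are assumptions.

```python
import numpy as np

def apply_max_norm(W, r=1.0, eps=1e-7):
    """Rescale each column of W so that its l2 norm is at most r."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    desired = np.clip(norms, 0.0, r)      # norms above r are capped at r
    return W * desired / (norms + eps)

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])                # column norms: 5.0 and 0.5
W_clipped = apply_max_norm(W, r=1.0)

print(np.round(W_clipped, 4))
print("Column norms:", np.round(np.linalg.norm(W_clipped, axis=0), 4))
```

The first neuron's weight vector is scaled down to norm 1, while the second, which was already within the limit, is left essentially unchanged.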

To implement this, you set a layer's `kernel_constraint` argument.

In [22]:
keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
                   kernel_constraint=keras.constraints.max_norm(1.))

<Dense name=dense_28, built=False>

## Summary and Practical Guidelines

The chapter concludes with a table of default configurations for a standard DNN. This is an excellent starting point for most problems.

**Table 11-3. Default DNN configuration**

| Hyperparameter | Default value |
| --- | --- |
| Kernel initializer | He initialization |
| Activation function | ELU |
| Normalization | None if shallow; Batch Norm if deep |
| Regularization | Early stopping (+ ℓ2 reg. if needed) |
| Optimizer | Momentum optimization (or RMSProp or Nadam) |
| Learning rate schedule | 1cycle |

**Table 11-4. DNN configuration for a self-normalizing net (SELU)**

| Hyperparameter | Default value |
| --- | --- |
| Kernel initializer | LeCun initialization |
| Activation function | SELU |
| Normalization | None (self-normalization) |
| Regularization | Alpha dropout if needed |
| Optimizer | Momentum optimization |
| Learning rate schedule | 1cycle |