# Chapter 11: Training Deep Neural Networks

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)

---

## 1. Chapter Summary

Training a shallow neural network is straightforward, but training a Deep Neural Network (DNN) with tens of layers and millions of parameters comes with significant challenges. This chapter addresses the "Four Horsemen" of deep learning training difficulties and their solutions.

**Key Challenges & Solutions:**
1.  **Vanishing/Exploding Gradients:** Gradients shrink or grow uncontrollably as they propagate back through deep layers, making lower layers hard to train.
    * *Solutions:* Better initialization strategies (Glorot, He), non-saturating activation functions (ReLU, ELU, SELU), and Batch Normalization.
2.  **Lack of Data:** Large networks need a lot of labeled training data, which is often unavailable or costly to label.
    * *Solutions:* Transfer Learning (reusing parts of pretrained networks) and Unsupervised Pretraining.
3.  **Slow Training:** Training a large network with plain Gradient Descent can be extremely slow.
    * *Solutions:* Fast optimizers (Momentum, RMSProp, Adam) and Learning Rate Scheduling.
4.  **Overfitting:** Models with millions of parameters easily memorize training data.
    * *Solutions:* Regularization techniques ($l_1$/$l_2$ regularization, Dropout, Max-Norm).

## 2. Theoretical Explanations

### A. Vanishing and Exploding Gradients

**The Problem:**
During backpropagation, error gradients are propagated from the output layer back to the input layer. By the chain rule, the gradient that reaches a lower layer is a product of many per-layer derivative terms. If these terms are mostly smaller than 1 in magnitude, the product shrinks toward zero and the lower layers barely update (**Vanishing**). If they are mostly larger than 1, the product grows very large and training diverges (**Exploding**).
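
To make the effect concrete, here is a minimal sketch that multiplies one constant per-layer factor across 20 layers. The helper name `backprop_scale` and the factor 1.5 are illustrative assumptions; real networks have a different factor at every layer, so this only shows the order of magnitude involved.

```python
def backprop_scale(per_layer_factor, n_layers):
    """Multiply the same per-layer derivative term n_layers times."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_factor
    return scale

# 0.25 is the largest value the sigmoid derivative can take.
vanishing = backprop_scale(0.25, n_layers=20)
# A hypothetical per-layer factor somewhat above 1.
exploding = backprop_scale(1.5, n_layers=20)

print(f"20 layers, factor 0.25: {vanishing:.2e}")
print(f"20 layers, factor 1.5:  {exploding:.2e}")
```

Even with the most favorable sigmoid derivative, the signal reaching the first layer is smaller than the one at the output by many orders of magnitude.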

**Solution 1: Weight Initialization**
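Glorot and Bengio (2010) proposed that for signals to flow properly, the variance of the outputs of a layer should equal the variance of its inputs, and the gradients should keep the same variance before and after flowing back through the layer. Both conditions cannot hold exactly unless the layer has as many inputs as outputs ($fan_{in} = fan_{out}$), so the strategies below are practical compromises that depend on the activation function: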

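* **Glorot (Xavier) Initialization:** For no activation, Sigmoid, Tanh, or Softmax. Sample weights from a normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{fan_{avg}}$, where $fan_{avg} = (fan_{in} + fan_{out})/2$.
* **He Initialization:** For ReLU and its variants. Sample weights from a normal distribution with mean 0 and variance $\sigma^2 = \frac{2}{fan_{in}}$.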
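
The two variance formulas can be checked numerically. The sketch below assumes a hypothetical layer with 784 inputs and 300 outputs and uses plain `random.gauss`; Keras's built-in `he_normal` and `glorot_normal` initializers draw from a truncated normal instead, so this illustrates the formulas rather than reproducing the library's behavior.

```python
import math
import random

def glorot_normal_std(fan_in, fan_out):
    """Standard deviation for Glorot init: sigma^2 = 1 / fan_avg."""
    fan_avg = (fan_in + fan_out) / 2
    return math.sqrt(1.0 / fan_avg)

def he_normal_std(fan_in):
    """Standard deviation for He init: sigma^2 = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

fan_in, fan_out = 784, 300  # hypothetical layer size
glorot_std = glorot_normal_std(fan_in, fan_out)
he_std = he_normal_std(fan_in)

# Draw weights with the He standard deviation and measure their variance.
rng = random.Random(42)
sample = [rng.gauss(0.0, he_std) for _ in range(20000)]
sample_var = sum(w * w for w in sample) / len(sample)

print(f"Glorot std: {glorot_std:.4f}, He std: {he_std:.4f}")
print(f"Target He variance: {2.0 / fan_in:.5f}, sampled: {sample_var:.5f}")
```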

**Solution 2: Non-Saturating Activation Functions**
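The Sigmoid function saturates at 0 for large negative inputs and at 1 for large positive inputs. In both regions its derivative is close to zero, so almost no gradient flows back. Alternatives include: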

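* **ReLU (Rectified Linear Unit):** $ReLU(z) = \max(0, z)$. Fast to compute and non-saturating for positive values. *Issue:* Dying ReLUs: a neuron can get stuck outputting 0 for every training input, and because the gradient of ReLU is 0 for negative inputs, it then stops learning.
* **Leaky ReLU:** $LReLU_\alpha(z) = \max(\alpha z, z)$. The small slope $\alpha$ for $z < 0$ keeps the gradient nonzero, so a neuron that goes quiet can recover.
* **ELU (Exponential Linear Unit):** Smooth around $z = 0$ and often converges faster than ReLU, but slower to compute because of the exponential.
    $$ \text{ELU}_\alpha(z) = \begin{cases} \alpha(\exp(z) - 1) & \text{if } z < 0 \\ z & \text{if } z \ge 0 \end{cases} $$
* **SELU (Scaled ELU):** A scaled variant of ELU. A network made only of a stack of dense layers with SELU activations tends to self-normalize, provided the inputs are standardized and the hidden layers use LeCun Normal initialization.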
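
The formulas above translate directly into scalar functions. This is a plain-Python sketch for illustration only; the default values `alpha=0.2` and `alpha=1.0` are assumptions chosen here, and in practice you would use the vectorized Keras versions.

```python
import math

def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.2):
    """LReLU(z) = max(alpha * z, z), valid for 0 < alpha < 1."""
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    """ELU: alpha * (exp(z) - 1) for z < 0, identity otherwise."""
    return alpha * (math.exp(z) - 1.0) if z < 0 else z

for z in (-2.0, -0.5, 0.0, 1.5):
    print(f"z={z:+.1f}  relu={relu(z):.3f}  "
          f"leaky={leaky_relu(z):.3f}  elu={elu(z):.3f}")
```

For negative inputs ReLU returns exactly 0, while Leaky ReLU and ELU return small negative values, which is what keeps a gradient flowing.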

**Solution 3: Batch Normalization (BN)**
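Batch Normalization was proposed to address Internal Covariate Shift (the distribution of each layer's inputs changes during training as the parameters of earlier layers change). A BN layer zero-centers and normalizes each input using the mean $\mu_B$ and variance $\sigma_B^2$ of the current mini-batch, then scales and shifts the result using two learnable parameter vectors per layer, $\gamma$ (scale) and $\beta$ (shift), with one entry per input. In the equations, $\otimes$ is element-wise multiplication and $\epsilon$ is a small smoothing term that avoids division by zero.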

$$ \hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$
$$ z^{(i)} = \gamma \otimes \hat{x}^{(i)} + \beta $$
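
The two equations can be sketched for a single feature over one mini-batch. The function name `batch_norm` and the value `eps=1e-3` are assumptions made for this example. It covers the training-time computation only; at inference time a BN layer uses moving averages of the mean and variance instead of batch statistics.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mu = sum(batch) / len(batch)                          # mini-batch mean
    var = sum((x - mu) ** 2 for x in batch) / len(batch)  # mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat]

batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)                  # gamma=1, beta=0
shifted = batch_norm(batch, gamma=2.0, beta=0.5)

print("normalized:", [round(v, 3) for v in normalized])
print("scaled and shifted:", [round(v, 3) for v in shifted])
```

With $\gamma = 1$ and $\beta = 0$ the output has mean 0 and variance close to 1; other values let the layer learn whatever scale and offset suit the next layer.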

### B. Faster Optimizers

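Plain SGD takes small, regular steps and can be slow, especially along gentle slopes. The optimizers below speed it up using momentum, adaptive learning rates, or both.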

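* **Momentum Optimization:** Accumulates a momentum vector $\mathbf{m}$ (like a ball rolling down a hill), so the gradient acts as an acceleration rather than a speed. The hyperparameter $\beta$ is the momentum. It plays the role of friction and ranges from 0 (high friction) to 1 (no friction), with 0.9 a typical value.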
    $$ \mathbf{m} \leftarrow \beta \mathbf{m} - \eta \nabla_\theta J(\theta) $$
    $$ \theta \leftarrow \theta + \mathbf{m} $$
* **RMSProp:** Keeps an exponentially decaying average of squared gradients and divides each parameter's step by the square root of that average. This shrinks steps along steep dimensions more than along gentle ones.
* **Adam (Adaptive Moment Estimation):** Combines Momentum and RMSProp. It tracks exponentially decaying averages of both the first moment (mean) and the second moment (uncentered variance) of the gradients.
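
As a rough illustration of the momentum update rule, the sketch below minimizes the toy cost $J(\theta) = \theta^2$ with plain gradient descent and with momentum. The cost function, starting point, and hyperparameter values are assumptions chosen for this example, so the comparison says nothing general about real networks.

```python
def gradient(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2.0 * theta

def run_gd(theta, eta, steps):
    """Plain gradient descent: theta <- theta - eta * grad."""
    for _ in range(steps):
        theta -= eta * gradient(theta)
    return theta

def run_momentum(theta, eta, beta, steps):
    """Momentum: m <- beta * m - eta * grad, then theta <- theta + m."""
    m = 0.0
    for _ in range(steps):
        m = beta * m - eta * gradient(theta)
        theta = theta + m
    return theta

theta_gd = run_gd(5.0, eta=0.01, steps=300)
theta_mom = run_momentum(5.0, eta=0.01, beta=0.9, steps=300)

print(f"plain GD after 300 steps: {theta_gd:.6f}")
print(f"momentum after 300 steps: {theta_mom:.2e}")
```

On this toy problem, with the same small learning rate, the momentum run ends much closer to the minimum at 0 than plain gradient descent does.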

### C. Learning Rate Scheduling

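Instead of a constant learning rate $\eta$, we can start with a relatively high learning rate and reduce it as training progresses. Below, $t$ is the iteration number, $\eta_0$ the initial learning rate, and $s$ the number of steps that sets the decay speed.
* **Power Scheduling:** $\eta(t) = \eta_0 / (1 + t/s)^c$. With $c = 1$, the rate is $\eta_0/2$ after $s$ steps, $\eta_0/3$ after $2s$ steps, and so on, so it drops quickly at first and then more and more slowly.
* **Exponential Scheduling:** $\eta(t) = \eta_0 \cdot 0.1^{t/s}$. The rate drops by a factor of 10 every $s$ steps.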
* **Performance Scheduling:** Reduce $\eta$ when validation error stops improving (e.g., `ReduceLROnPlateau`).
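
The first two schedules are simple functions of the step count. The sketch below evaluates them for illustrative values $\eta_0 = 0.01$, $s = 20$, and $c = 1$, which are assumptions made for this example.

```python
def power_schedule(t, eta0=0.01, s=20, c=1):
    """Power scheduling: eta0 / (1 + t/s)**c."""
    return eta0 / (1 + t / s) ** c

def exponential_schedule(t, eta0=0.01, s=20):
    """Exponential scheduling: eta0 * 0.1**(t/s)."""
    return eta0 * 0.1 ** (t / s)

for t in (0, 20, 40, 60):
    print(f"t={t:2d}  power={power_schedule(t):.5f}  "
          f"exponential={exponential_schedule(t):.6f}")
```

After two periods of $s$ steps the power schedule has only reached $\eta_0/3$, while the exponential schedule is already at $\eta_0/100$.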

### D. Avoiding Overfitting: Regularization

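* **$l_1$ and $l_2$ Regularization:** Add a penalty term to the loss function: the sum of absolute weight values for $l_1$ (which pushes many weights to exactly zero, giving a sparse model), or the sum of squared weights for $l_2$.
* **Dropout:** At every training step, every neuron (including input neurons, but excluding output neurons) has a probability $p$ (the dropout rate) of being temporarily "dropped out": it is ignored during that step but may be active in the next. Neurons cannot rely on a few specific inputs, so the network becomes more robust, and the result can be viewed as an averaging ensemble of exponentially many smaller networks.
* **Max-Norm:** For each neuron, constrains the incoming weight vector $\mathbf{w}$ so that $\|\mathbf{w}\|_2 \le r$. After each training step, if the norm exceeds $r$, the weights are rescaled: $\mathbf{w} \leftarrow \mathbf{w}\, r / \|\mathbf{w}\|_2$.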
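
Dropout and Max-Norm are both short enough to sketch in plain Python. The function names below are hypothetical, and the dropout variant shown is "inverted" dropout, which rescales the surviving activations by $1/(1-p)$ during training so that nothing needs to change at test time; treat that scaling choice as an assumption of this sketch.

```python
import math
import random

def dropout(activations, p, rng):
    """Zero each activation with probability p; rescale survivors by 1/(1-p)."""
    keep_prob = 1.0 - p
    return [a / keep_prob if rng.random() >= p else 0.0 for a in activations]

def max_norm(weights, r):
    """Rescale a neuron's incoming weights so their l2 norm is at most r."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= r:
        return list(weights)
    return [w * r / norm for w in weights]

rng = random.Random(0)  # fixed seed so the run is repeatable
dropped = dropout([1.0] * 10, p=0.2, rng=rng)
clipped = max_norm([3.0, 4.0], r=1.0)     # norm 5 is rescaled to norm 1
unchanged = max_norm([0.3, 0.4], r=1.0)   # norm 0.5 is already within r

print("after dropout:", dropped)
print("after max-norm:", clipped, unchanged)
```

Each surviving activation becomes 1.25 rather than 1.0, which keeps the expected sum of activations the same as without dropout.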

## 3. Step-by-Step Implementation with Keras

### A. Initialization and Activation Functions
We will create a model using **He Normal** initialization and the **Leaky ReLU** activation function.

In [None]:
import tensorflow as tf
from tensorflow import keras

# Load Fashion MNIST for demonstration
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train, X_valid = X_train_full[:-5000] / 255.0, X_train_full[-5000:] / 255.0
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

# Building a model with He Initialization and Leaky ReLU
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    
    # He Normal initialization is well suited to ReLU-based activations.
    # LeakyReLU is added as a separate layer after each Dense layer.
    keras.layers.Dense(300, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),
    
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),
    
    keras.layers.Dense(10, activation="softmax")
])

# Compile the model (standard SGD for now); no training is run in this cell
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

print("Model with He Init and LeakyReLU built successfully.")

### B. Batch Normalization
Batch Normalization layers are usually added **after** the linear computation and **before** the activation function, although adding them after the activation also works; which placement is better depends on the task. BN tends to stabilize training and often allows higher learning rates.

In [None]:
model_bn = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    
    # A BN layer placed first roughly standardizes the inputs, so a separate scaler is not strictly needed
    keras.layers.BatchNormalization(),
    
    keras.layers.Dense(300, use_bias=False), # Bias is redundant because BN adds a shift parameter (beta)
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    
    keras.layers.Dense(100, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    
    keras.layers.Dense(10, activation="softmax")
])

# BN models often benefit from larger learning rates
model_bn.compile(loss="sparse_categorical_crossentropy",
                 optimizer=keras.optimizers.SGD(learning_rate=1e-2),
                 metrics=["accuracy"])

print("Model with Batch Normalization built successfully.")

### C. Transfer Learning
Instead of training from scratch, we often reuse the lower layers of a pretrained model (e.g., trained on ImageNet) and retrain the upper layers for our specific task. 

Here, we simulate this by taking a briefly trained `model_A` and building `model_B_on_A` on top of its layers.

In [None]:
# Simulating a pretrained model A
model_A = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(10, activation="softmax")
])
model_A.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
model_A.fit(X_train[:1000], y_train[:1000], epochs=1, verbose=0) # Quick pretraining

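# Reuse layers for Model B (Transfer Learning)
# We reuse all layers except the output layer.
# Note: the reused layers are shared with model_A, so training model_B_on_A
# also changes model_A. To avoid this, copy model_A first with
# keras.models.clone_model() and set_weights(model_A.get_weights()).
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid")) # New binary classification head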

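# Freeze the reused layers so that large error gradients from the new, randomly
# initialized output layer do not wreck their weights during the first few epochs
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

# Compile after freezing: changes to `trainable` only take effect once the model is compiled
model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])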
print("Transfer Learning model ready. Reused layers are frozen.")

### D. Optimizers and Learning Rate Scheduling
We will demonstrate the **Nadam** optimizer (Adam + Nesterov Momentum) combined with an **Exponential Decay** schedule.

In [None]:
# Learning Rate Schedule: Exponential Decay
# decayed_learning_rate = initial_learning_rate * decay_rate ^ (step / decay_steps)
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=10000,
    decay_rate=0.9)

# Nadam Optimizer: Adam with Nesterov momentum; it often converges slightly faster than Adam
optimizer = keras.optimizers.Nadam(learning_rate=lr_schedule)

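# SELU with LeCun Normal initialization. Note that self-normalization also assumes
# standardized inputs, whereas here the pixels are only scaled to [0, 1].
model_opt = keras.models.Sequential([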
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])

model_opt.compile(loss="sparse_categorical_crossentropy", 
                  optimizer=optimizer, 
                  metrics=["accuracy"])

print("Model configured with Nadam optimizer and Exponential Decay LR.")

### E. Regularization: Dropout
This section implements **Dropout**, one of the most popular regularization techniques for DNNs. With a dropout rate of 0.2, each neuron feeding a `Dropout` layer has a 20% chance of being ignored at each training step.

In [None]:
model_dropout = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    
    # Dropout is applied to the outputs of the hidden layer (after its activation).
    # Here one Dropout layer follows each hidden Dense layer.
    keras.layers.Dropout(rate=0.2),
    
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    
    keras.layers.Dense(10, activation="softmax")
])

model_dropout.compile(loss="sparse_categorical_crossentropy",
                      optimizer="adam",
                      metrics=["accuracy"])

# Training the dropout model
# Dropout is only active during training, not during evaluation/testing.
history = model_dropout.fit(X_train, y_train, epochs=5, 
                            validation_data=(X_valid, y_valid))

print("Training with Dropout complete.")