# Part 3.4: Training Deep Networks -- The Formula 1 Edition

Building a neural network architecture is only half the battle. The real challenge lies in **training** it effectively. In this notebook, we explore the techniques that make deep learning work in practice: optimizers that navigate complex loss landscapes, regularization methods that prevent overfitting, and normalization techniques that stabilize training.

**F1 analogy:** Designing the car (architecture) is one thing. Getting it to go fast on race day is another. This notebook is about the engineering science of car setup: how to tune the setup parameters (optimizers), how to ensure the car works on all tracks and not just one (regularization), and how to keep the car stable through varying conditions (normalization). The difference between a championship-winning team and a backmarker is not the car concept -- it is the quality of the development process.

---

## Learning Objectives

By the end of this notebook, you should be able to:

- [ ] Explain the intuition behind SGD, Momentum, RMSprop, and Adam optimizers
- [ ] Choose appropriate learning rate schedules for different training scenarios
- [ ] Apply regularization techniques (L1, L2, Dropout, Early Stopping) to prevent overfitting
- [ ] Understand when and why to use Batch Normalization vs LayerNorm
- [ ] Select proper weight initialization for different activation functions
- [ ] Build a complete training pipeline with best practices

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
torch.manual_seed(42)
np.random.seed(42)

---

## 1. Optimizers

### Intuitive Explanation

Imagine you're trying to find the lowest point in a mountainous landscape while blindfolded. All you can feel is the slope beneath your feet. **Gradient descent** is your basic strategy: always step downhill. But this simple approach has problems:

1. **Getting stuck in valleys**: You might oscillate back and forth across a narrow valley instead of moving along it
2. **Slow progress on plateaus**: Flat regions give tiny gradients, leading to tiny steps
3. **Different terrain scales**: Some directions might be steep, others gentle

Different optimizers address these challenges in different ways.

**F1 analogy:** Optimizers are setup tuning strategies. SGD is the conservative engineer who makes one small change at a time and re-tests. Momentum is the engineer who notices "we have been improving by adding downforce for three sessions, so keep pushing in that direction." Adam is the veteran engineer who adapts their approach to each parameter -- making big changes where the car is clearly off and small refinements where it is already close.

| Optimizer | Strategy | F1 Analogy |
|-----------|----------|------------|
| **SGD** | Follow the gradient | Conservative: one small setup change at a time |
| **Momentum** | Build up velocity | Aggressive: if the last 3 changes all went the same way, commit harder |
| **RMSprop** | Adapt step size per dimension | Smart: big changes for insensitive parameters, small for sensitive ones |
| **Adam** | Momentum + adaptive step sizes | Veteran: momentum awareness + parameter-specific tuning |

### 1.1 SGD (Stochastic Gradient Descent)

The simplest optimizer. Update rule:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

#### Breaking down the formula:

| Component | Meaning | Typical Values | F1 Analogy |
|-----------|---------|----------------|------------|
| $\theta_t$ | Current weights | - | Current car setup |
| $\eta$ | Learning rate | 0.001 to 0.1 | How big each setup adjustment is |
| $\nabla L$ | Gradient of loss | Computed via backprop | Which direction to adjust each parameter |

**What this means:** Move in the opposite direction of the gradient, scaled by learning rate. Simple but can be slow and get stuck oscillating.

**F1 analogy:** This is the most conservative setup strategy. After each session, you look at the data, identify which parameter is most responsible for the time loss, and adjust it by a small fixed amount. It works, but it is slow -- especially when one parameter needs a big change and another needs a tiny one, since SGD uses the same step size for everything.
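
To make the "same step size for everything" limitation concrete, here is a minimal sketch of the update rule above on a two-parameter quadratic bowl in which one direction is 100x steeper than the other. The toy loss, the starting point, and the helper name `sgd_step` are assumptions chosen for illustration; they are not part of the notebook's own code.

```python
# Plain SGD on f(x, y) = 0.5 * (100 * x**2 + y**2), an assumed toy loss.
# The steep direction (x) caps how large the shared learning rate can be,
# so the shallow direction (y) can only crawl.

def grad(x, y):
    """Gradient of the toy loss: (100 * x, y)."""
    return 100.0 * x, 1.0 * y

def sgd_step(params, grads, lr):
    """theta <- theta - lr * grad, with one learning rate for every parameter."""
    return tuple(p - lr * g for p, g in zip(params, grads))

params = (1.0, 1.0)
lr = 0.015  # must stay below 2 / 100 = 0.02, or the x-direction diverges
for _ in range(100):
    params = sgd_step(params, grad(*params), lr)

x_final, y_final = params
print(f"x = {x_final:.2e}  (steep direction: converged)")
print(f"y = {y_final:.3f}  (shallow direction: still far from 0)")
```

After 100 steps the steep coordinate is essentially zero while the shallow one has only shrunk by a factor of $0.985^{100}$, which is the imbalance that the adaptive optimizers below try to remove.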

### 1.2 Momentum

**Intuition: A ball rolling downhill**

Instead of just following the current gradient, momentum keeps track of which direction we've been moving. Like a ball rolling downhill, we build up velocity and can push through small bumps.

$$v_{t+1} = \beta v_t + \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \eta v_{t+1}$$

| Component | Meaning | Typical Values |
|-----------|---------|----------------|
| $v_t$ | Velocity (accumulated gradient) | - |
| $\beta$ | Momentum coefficient | 0.9 |

**What this means:** We blend the current gradient with our previous direction. This helps us:
- Move faster in consistent directions
- Dampen oscillations in inconsistent directions
- Escape shallow local minima

**F1 analogy:** This is the engineer who tracks trends across sessions. If adding downforce has improved lap time for the last three practice sessions, momentum says "keep going in that direction with confidence" rather than starting fresh each time. If one session says "add downforce" and the next says "remove downforce," momentum dampens that oscillation. The $\beta = 0.9$ means 90% of the previous direction is retained -- a strong memory of recent trends.
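
The two effects listed above (speeding up in consistent directions, damping oscillations) can be checked numerically with the velocity recurrence alone. This is a small sketch under assumed inputs: gradients that are always `+1`, and gradients that flip sign every step. The helper `run_momentum` is a hypothetical name.

```python
def run_momentum(grads, beta=0.9):
    """Accumulate v <- beta * v + g over a gradient sequence; return the final velocity."""
    v = 0.0
    for g in grads:
        v = beta * v + g
    return v

consistent = [1.0] * 100                         # gradient always points the same way
alternating = [(-1.0) ** t for t in range(100)]  # gradient flips sign every step

v_consistent = run_momentum(consistent)
v_alternating = run_momentum(alternating)

# Consistent gradients: velocity approaches 1 / (1 - beta) = 10x a single gradient.
print(f"consistent gradients:  v   = {v_consistent:.3f}")
# Alternating gradients: velocity magnitude approaches 1 / (1 + beta), about half a gradient.
print(f"alternating gradients: |v| = {abs(v_alternating):.3f}")
```

With $\beta = 0.9$, a steady gradient is amplified roughly tenfold, while a gradient that keeps reversing is cut to about half its raw size.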

### 1.3 RMSprop

**Intuition: Adaptive step sizes**

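Some parameters need big updates, others need small ones. RMSprop keeps a running average of each parameter's squared gradient and divides that parameter's step by its square root, so the effective step size adapts to the recent gradient magnitude.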

$$s_{t+1} = \beta s_t + (1-\beta)(\nabla L)^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_{t+1}} + \epsilon} \nabla L$$

| Component | Meaning | Typical Values |
|-----------|---------|----------------|
| $s_t$ | Running average of squared gradients | - |
| $\epsilon$ | Small constant for stability | $10^{-8}$ |

**What this means:**
- Parameters whose recent gradients have been large get a smaller effective learning rate
- Parameters whose recent gradients have been small get a larger effective learning rate
- This "evens out" the optimization across all parameters
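
As a rough check of the "evening out" effect, the sketch below feeds RMSprop two constant gradients that differ by four orders of magnitude and compares the resulting step sizes with plain SGD. The constant-gradient setup and the helper name `rmsprop_step_sizes` are assumptions for illustration. The update places $\epsilon$ outside the square root, which is the form used in this notebook's from-scratch `RMSpropOptimizer`.

```python
import math

def rmsprop_step_sizes(grads, lr=0.01, beta=0.9, eps=1e-8, n_steps=200):
    """Return the RMSprop update size per parameter after n_steps of constant gradients."""
    s = [0.0] * len(grads)
    steps = [0.0] * len(grads)
    for _ in range(n_steps):
        for i, g in enumerate(grads):
            s[i] = beta * s[i] + (1 - beta) * g * g      # running average of g^2
            steps[i] = lr * g / (math.sqrt(s[i]) + eps)  # step scaled by 1 / sqrt(s)
    return steps

grads = [100.0, 0.01]  # one steep parameter, one nearly flat parameter
lr = 0.01
sgd_steps = [lr * g for g in grads]
rms_steps = rmsprop_step_sizes(grads, lr=lr)

print("SGD step sizes:    ", sgd_steps)
print("RMSprop step sizes:", rms_steps)
```

SGD's two steps differ by a factor of 10,000, whereas both RMSprop steps settle close to `lr` regardless of the gradient scale.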

### 1.4 Adam (Adaptive Moment Estimation)

**Intuition: The best of both worlds**

Adam combines momentum (first moment) with RMSprop's adaptive learning rates (second moment).

$$m_{t+1} = \beta_1 m_t + (1-\beta_1) \nabla L \quad \text{(momentum)}$$
$$v_{t+1} = \beta_2 v_t + (1-\beta_2) (\nabla L)^2 \quad \text{(adaptive rates)}$$
$$\hat{m} = \frac{m_{t+1}}{1-\beta_1^{t+1}}, \quad \hat{v} = \frac{v_{t+1}}{1-\beta_2^{t+1}} \quad \text{(bias correction)}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}} + \epsilon} \hat{m}$$

| Component | Meaning | Default Value |
|-----------|---------|---------------|
| $\beta_1$ | Momentum decay | 0.9 |
| $\beta_2$ | RMSprop decay | 0.999 |
| $\epsilon$ | Stability constant | $10^{-8}$ |
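
The bias-correction step is the least intuitive part of the formulas above, so here is a small sketch of what it does in the very first updates. It assumes a constant gradient and a step counter `t` that starts at 1. The helper `adam_first_steps` and its `bias_correction` switch are hypothetical names introduced only for this comparison.

```python
import math

def adam_first_steps(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
                     n_steps=3, bias_correction=True):
    """Return the sizes of the first n_steps Adam updates for a constant gradient."""
    m, v = 0.0, 0.0
    updates = []
    for t in range(1, n_steps + 1):  # t counts updates, starting at 1
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        if bias_correction:
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
        else:
            m_hat, v_hat = m, v      # use the raw, zero-initialized averages
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

with_correction = adam_first_steps(grad=5.0)
without_correction = adam_first_steps(grad=5.0, bias_correction=False)

print("with bias correction:   ", with_correction)
print("without bias correction:", without_correction)
```

With the correction, each early update has size close to `lr`, independent of the gradient's scale. Without it, the first update in this setting is roughly three times larger, because the zero-initialized second moment (with $\beta_2 = 0.999$) starts much further below its true value than the first moment does.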

### Deep Dive: Why Adam is Usually the Default

Adam has become the go-to optimizer for most deep learning tasks. Here's why:

1. **Works well out of the box**: The default hyperparameters ($\beta_1=0.9$, $\beta_2=0.999$, $\eta=0.001$) work well for most problems

2. **Handles sparse gradients**: The adaptive learning rates help when some features appear rarely

3. **Robust to hyperparameter choices**: Less sensitive to learning rate than SGD

4. **Fast convergence**: Combines the speed benefits of momentum with adaptive rates

**F1 analogy:** Adam is like the veteran race engineer who has tuned hundreds of cars across dozens of circuits. They use momentum (remembering what worked in recent sessions) combined with adaptive step sizes (making big changes to the wing but tiny adjustments to the differential). They know the default starting point that works 90% of the time, and they adapt from there. A rookie engineer (SGD) might find a better setup given enough time, but the veteran (Adam) gets you competitive much faster.

#### Key Insight

Adam is like having an experienced hiker guide you through the mountains. They know when to speed up on clear paths and slow down on tricky terrain.

#### Common Misconceptions

| Misconception | Reality |
|---------------|--------|
| "Adam always beats SGD" | SGD+momentum often achieves better final accuracy on vision tasks |
| "Adam doesn't need LR tuning" | Still benefits from LR scheduling |
| "Use Adam for everything" | For transformers yes, but try SGD for CNNs |

### Visualization: Optimizer Paths on Loss Surface

In [None]:
def rosenbrock(x, y):
    """Rosenbrock function - a classic optimization test function.
    
    Has a narrow curved valley that's easy to find but hard to follow.
    Minimum at (1, 1).
    """
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    """Gradient of Rosenbrock function."""
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

# Optimizer implementations from scratch
class SGDOptimizer:
    def __init__(self, lr=0.001):
        self.lr = lr
        
    def step(self, params, grad):
        return params - self.lr * grad

class MomentumOptimizer:
    def __init__(self, lr=0.001, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.v = None
        
    def step(self, params, grad):
        if self.v is None:
            self.v = np.zeros_like(params)
        self.v = self.beta * self.v + grad
        return params - self.lr * self.v

class RMSpropOptimizer:
    def __init__(self, lr=0.001, beta=0.9, eps=1e-8):
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.s = None
        
    def step(self, params, grad):
        if self.s is None:
            self.s = np.zeros_like(params)
        self.s = self.beta * self.s + (1 - self.beta) * grad**2
        return params - self.lr * grad / (np.sqrt(self.s) + self.eps)

class AdamOptimizer:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None
        self.v = None
        self.t = 0
        
    def step(self, params, grad):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

In [None]:
def optimize_path(optimizer, start, grad_fn, n_steps=500):
    """Run optimization and record the path."""
    path = [start.copy()]
    params = start.copy()
    
    for _ in range(n_steps):
        grad = grad_fn(params[0], params[1])
        params = optimizer.step(params, grad)
        path.append(params.copy())
        
        # Stop if converged or diverged
        if np.linalg.norm(params - np.array([1.0, 1.0])) < 0.01:
            break
        if np.any(np.abs(params) > 10):
            break
            
    return np.array(path)

# Starting point
start = np.array([-1.0, 1.5])

# Run each optimizer with tuned learning rates
optimizers = {
    'SGD': SGDOptimizer(lr=0.0002),
    'Momentum': MomentumOptimizer(lr=0.0002, beta=0.9),
    'RMSprop': RMSpropOptimizer(lr=0.01),
    'Adam': AdamOptimizer(lr=0.05)
}

paths = {}
for name, opt in optimizers.items():
    paths[name] = optimize_path(opt, start, rosenbrock_grad, n_steps=500)

In [None]:
# Visualize the loss surface and optimization paths
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Create meshgrid for contour plot
x = np.linspace(-2, 2, 200)
y = np.linspace(-1, 3, 200)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)

# Left plot: All paths on contour
ax = axes[0]
ax.contour(X, Y, Z, levels=np.logspace(0, 3.5, 20), cmap='viridis', alpha=0.7)
ax.plot(1, 1, 'r*', markersize=15, label='Minimum (1,1)')
ax.plot(start[0], start[1], 'ko', markersize=10, label='Start')

colors = {'SGD': 'blue', 'Momentum': 'red', 'RMSprop': 'green', 'Adam': 'orange'}
for name, path in paths.items():
    ax.plot(path[:, 0], path[:, 1], '-', color=colors[name], 
            label=f'{name} ({len(path)} steps)', linewidth=2, alpha=0.8)

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Optimizer Paths on Rosenbrock Function')
ax.legend(loc='upper left')
ax.set_xlim(-2, 2)
ax.set_ylim(-1, 3)

# Right plot: Loss over steps
ax = axes[1]
for name, path in paths.items():
    losses = [rosenbrock(p[0], p[1]) for p in path]
    ax.plot(losses, color=colors[name], label=name, linewidth=2)

ax.set_xlabel('Step')
ax.set_ylabel('Loss (log scale)')
ax.set_title('Loss Convergence')
ax.set_yscale('log')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print final positions
print("Final positions and loss:")
for name, path in paths.items():
    final = path[-1]
    final_loss = rosenbrock(final[0], final[1])
    print(f"  {name:10s}: ({final[0]:.4f}, {final[1]:.4f}), loss={final_loss:.6f}")

### Optimizer Comparison Table

| Optimizer | Pros | Cons | When to Use | F1 Parallel |
|-----------|------|------|-------------|-------------|
| **SGD** | Simple, less memory | Slow, sensitive to LR | Final fine-tuning, simple problems | Conservative approach: one change at a time |
| **Momentum** | Faster than SGD, escapes local minima | Still sensitive to LR | CNNs, when SGD is too slow | Trend-following: commit to directions that keep working |
| **RMSprop** | Adaptive LR per parameter | Can diverge with wrong settings | RNNs, non-stationary problems | Parameter-specific tuning intensity |
| **Adam** | Fast, robust, adaptive | Higher memory (two extra buffers per parameter), sometimes generalizes worse than SGD + momentum | Default choice for most tasks | Veteran engineer: adapts to each parameter |

### Why This Matters in Machine Learning

| Application | Recommended Optimizer | F1 Parallel |
|-------------|----------------------|-------------|
| Computer vision (CNNs) | SGD + Momentum (best accuracy) or Adam (faster) | Fine-tuning downforce: patient, iterative |
| NLP (Transformers) | Adam or AdamW | Real-time strategy adaptation |
| GANs | Adam with low beta1 (0.5) | Aggressive, exploratory setup changes |
| Fine-tuning pretrained | Adam with small LR, or SGD | Gentle refinements to a known-good baseline |
| Quick prototyping | Adam (works out of the box) | Friday practice: get up to speed fast |

In [None]:
# PyTorch optimizer examples
model = nn.Linear(10, 1)

# SGD with momentum
sgd_optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam (most common default)
adam_optimizer = optim.Adam(model.parameters(), lr=0.001)

# AdamW (Adam with decoupled weight decay - often a better default when regularizing)
adamw_optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

print("PyTorch optimizer examples:")
print(f"  SGD: lr={sgd_optimizer.defaults['lr']}, momentum={sgd_optimizer.defaults['momentum']}")
print(f"  Adam: lr={adam_optimizer.defaults['lr']}, betas={adam_optimizer.defaults['betas']}")
print(f"  AdamW: lr={adamw_optimizer.defaults['lr']}, weight_decay={adamw_optimizer.defaults['weight_decay']}")

---

## 2. Learning Rate

### Intuitive Explanation

The **learning rate** is perhaps the single most important hyperparameter. It controls how big of a step you take with each update:

- **Too high**: You overshoot the minimum, bouncing around wildly or even diverging
- **Too low**: Training takes forever and may get stuck in poor solutions
- **Just right**: Steady progress toward the minimum

**F1 analogy:** The learning rate is how big each setup adjustment is between sessions. If you change the front wing by 5 degrees at a time (too high), you will overshoot the sweet spot and oscillate wildly between understeer and oversteer. If you change it by 0.01 degrees (too low), you will never converge on the optimal setting before the weekend is over. The art is finding the right adjustment size -- big enough to make progress, small enough not to overshoot. And as you get closer to the optimal setup, you should make smaller and smaller adjustments (learning rate scheduling).
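
For the simple quadratic loss $L(x) = x^2$, the three regimes above can be worked out by hand, which gives a useful sanity check before looking at plots. The gradient is $2x$, so one gradient descent step is:

$$x_{t+1} = x_t - \eta \cdot 2x_t = (1 - 2\eta)\,x_t \quad\Rightarrow\quad x_t = (1 - 2\eta)^t\, x_0$$

The iterate shrinks only if $|1 - 2\eta| < 1$, that is, $0 < \eta < 1$:

$$\begin{aligned}
0 < \eta < 0.5 &: \quad 0 < 1 - 2\eta < 1 && \text{(steady approach to the minimum)} \\
0.5 < \eta < 1 &: \quad -1 < 1 - 2\eta < 0 && \text{(overshoots every step, but still converges)} \\
\eta > 1 &: \quad 1 - 2\eta < -1 && \text{(each overshoot is larger than the last: divergence)}
\end{aligned}$$

For example, $\eta = 1.1$ multiplies $x$ by $-1.2$ each step, $\eta = 0.3$ multiplies it by $0.4$, and $\eta = 0.01$ multiplies it by $0.98$, so after 20 steps about two-thirds of the initial distance still remains. This exact threshold is specific to this particular quadratic; on real loss surfaces the stable range depends on the local curvature.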

### Visualization: Too High vs Too Low

In [None]:
def simple_loss(x):
    """Simple 1D loss function: x^2"""
    return x**2

def simple_grad(x):
    """Gradient: 2x"""
    return 2 * x

def gradient_descent_1d(start, lr, n_steps):
    """Run 1D gradient descent."""
    path = [start]
    x = start
    for _ in range(n_steps):
        x = x - lr * simple_grad(x)
        path.append(x)
        if abs(x) > 100:  # Diverged
            break
    return path

# Test different learning rates
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
start = 5.0
n_steps = 20

configs = [
    ('Too High (lr=1.1)', 1.1, 'red'),
    ('Just Right (lr=0.3)', 0.3, 'green'),
    ('Too Low (lr=0.01)', 0.01, 'blue')
]

for ax, (title, lr, color) in zip(axes, configs):
    path = gradient_descent_1d(start, lr, n_steps)
    
    # Plot x value over steps
    ax.plot(path, 'o-', color=color, markersize=6, linewidth=2)
    ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('Step')
    ax.set_ylabel('x value')
    ax.set_title(title)
    ax.grid(True, alpha=0.3)
    
    if lr > 1:
        ax.set_ylim(-10, 10)
    else:
        ax.set_ylim(-1, 6)

plt.tight_layout()
plt.show()

### 2.1 Learning Rate Schedulers

A common strategy is to **start with a larger learning rate** for fast initial progress, then **gradually reduce it** to fine-tune the solution.

**F1 analogy:** This is exactly how F1 teams approach a race weekend. In FP1 (Friday practice), you make bold setup changes to explore the performance landscape. In FP2, you narrow the range. By qualifying, you are making tiny refinements. Learning rate schedulers automate this progression.

| Scheduler | Strategy | Use Case | F1 Parallel |
|-----------|----------|----------|-------------|
| **StepLR** | Multiply by gamma every N epochs | Simple, predictable decay | Cut adjustment size at scheduled points (FP1 -> FP2 -> Quali) |
| **ExponentialLR** | Multiply by gamma each epoch | Smooth continuous decay | Gradually smaller changes every session |
| **CosineAnnealingLR** | Smooth cosine curve decay | Transformers, good generalization | Smooth transition from exploration to refinement |
| **ReduceLROnPlateau** | Reduce when metric stops improving | Adaptive, data-driven | "If lap time stops improving, try smaller changes" |
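
The first three strategies in the table follow simple closed-form rules when the scheduler is stepped once per epoch. The sketch below writes those rules as plain Python functions so the arithmetic is visible. The function names are hypothetical and this is not PyTorch's implementation, only the formulas the table describes. `ReduceLROnPlateau` is omitted because it depends on the observed metric rather than on the epoch number alone.

```python
import math

def step_lr(base_lr, epoch, step_size, gamma):
    """Multiply the learning rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def exponential_lr(base_lr, epoch, gamma):
    """Multiply the learning rate by gamma every epoch."""
    return base_lr * gamma ** epoch

def cosine_annealing_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Follow half a cosine wave from base_lr down to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

base_lr = 0.1
for epoch in (0, 30, 60, 99):
    print(
        f"epoch {epoch:3d}: "
        f"step={step_lr(base_lr, epoch, step_size=30, gamma=0.1):.6f}  "
        f"exp={exponential_lr(base_lr, epoch, gamma=0.95):.6f}  "
        f"cos={cosine_annealing_lr(base_lr, epoch, t_max=100):.6f}"
    )
```

Step decay holds the rate flat and then drops it abruptly, exponential decay shrinks it a little every epoch, and cosine annealing decays slowly at first, fastest in the middle, and slowly again near the end.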

In [None]:
# Visualize different learning rate schedules
epochs = 100
base_lr = 0.1

def get_scheduler_lrs(scheduler_class, epochs, **kwargs):
    """Get learning rates for a scheduler over epochs."""
    model = nn.Linear(10, 1)
    optimizer = optim.SGD(model.parameters(), lr=base_lr)
    scheduler = scheduler_class(optimizer, **kwargs)
    
    lrs = []
    for _ in range(epochs):
        lrs.append(optimizer.param_groups[0]['lr'])
        optimizer.step()   # no gradients exist, so this is a no-op; it avoids PyTorch's step-order warning
        scheduler.step()
    return lrs

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# StepLR
ax = axes[0, 0]
lrs = get_scheduler_lrs(optim.lr_scheduler.StepLR, epochs, step_size=30, gamma=0.1)
ax.plot(lrs, 'b-', linewidth=2)
ax.set_xlabel('Epoch')
ax.set_ylabel('Learning Rate')
ax.set_title('StepLR (step=30, gamma=0.1)')
ax.grid(True, alpha=0.3)

# ExponentialLR
ax = axes[0, 1]
lrs = get_scheduler_lrs(optim.lr_scheduler.ExponentialLR, epochs, gamma=0.95)
ax.plot(lrs, 'g-', linewidth=2)
ax.set_xlabel('Epoch')
ax.set_ylabel('Learning Rate')
ax.set_title('ExponentialLR (gamma=0.95)')
ax.grid(True, alpha=0.3)

# CosineAnnealingLR
ax = axes[1, 0]
lrs = get_scheduler_lrs(optim.lr_scheduler.CosineAnnealingLR, epochs, T_max=epochs)
ax.plot(lrs, 'r-', linewidth=2)
ax.set_xlabel('Epoch')
ax.set_ylabel('Learning Rate')
ax.set_title('CosineAnnealingLR')
ax.grid(True, alpha=0.3)

# MultiStepLR
ax = axes[1, 1]
lrs = get_scheduler_lrs(optim.lr_scheduler.MultiStepLR, epochs, milestones=[30, 60, 80], gamma=0.1)
ax.plot(lrs, 'purple', linewidth=2)
ax.set_xlabel('Epoch')
ax.set_ylabel('Learning Rate')
ax.set_title('MultiStepLR (milestones=[30,60,80])')
ax.grid(True, alpha=0.3)

plt.suptitle('Learning Rate Schedulers', fontsize=14)
plt.tight_layout()
plt.show()

### 2.2 Warmup Explained

**Warmup** starts training with a very small learning rate and gradually increases it. This helps because:

1. **Early gradients are unreliable**: Before the model has learned anything, gradients point in somewhat random directions
2. **Batch normalization needs time**: BatchNorm statistics aren't accurate initially
3. **Prevents early divergence**: Large initial updates can push weights to bad regions

Warmup is especially important for:
- Very deep networks
- Large batch sizes
- Transformer architectures

In [None]:
def warmup_cosine_schedule(epoch, warmup_epochs, total_epochs, max_lr):
    """Linear warmup followed by cosine decay."""
    if epoch < warmup_epochs:
        # Use epoch + 1 so the first epoch trains with a small non-zero rate instead of lr = 0
        return max_lr * (epoch + 1) / warmup_epochs
    else:
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return max_lr * 0.5 * (1 + np.cos(np.pi * progress))

# Visualize warmup
epochs = 100
warmup_epochs = 10
max_lr = 0.1

# Different schedules
warmup_cosine = [warmup_cosine_schedule(e, warmup_epochs, epochs, max_lr) for e in range(epochs)]
no_warmup = [max_lr * 0.5 * (1 + np.cos(np.pi * e / epochs)) for e in range(epochs)]
constant = [max_lr] * epochs

fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(warmup_cosine, 'b-', linewidth=2, label='Warmup + Cosine')
ax.plot(no_warmup, 'g--', linewidth=2, label='Cosine (no warmup)', alpha=0.7)
ax.plot(constant, 'gray', linestyle=':', linewidth=2, label='Constant', alpha=0.5)

ax.axvline(x=warmup_epochs, color='red', linestyle='--', alpha=0.5, label=f'End warmup (epoch {warmup_epochs})')

ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Learning Rate', fontsize=12)
ax.set_title('Learning Rate Warmup', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 2.3 Learning Rate Finder (Brief Mention)

The **learning rate finder** is a technique to automatically find a good learning rate:

1. Start with a very small learning rate
2. Train for a few iterations, gradually increasing the LR
3. Plot loss vs learning rate
4. Choose the LR where loss is decreasing fastest (steepest slope)

**Rule of thumb:** Pick a learning rate where the loss is clearly decreasing but hasn't started to explode. Often about 10x smaller than where loss starts increasing.

Libraries like `pytorch-lightning` and `fastai` implement this automatically.
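
The four steps above can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: a one-parameter problem (minimizing `w**2`), an exponential sweep of the learning rate, no loss smoothing, and a hypothetical stopping threshold of 4x the best loss seen. The function names `lr_range_test` and `suggest_lr` are illustrative and are not the API of `pytorch-lightning` or `fastai`.

```python
def lr_range_test(grad_fn, loss_fn, w0, lr_min=1e-4, lr_max=10.0, n_iters=100):
    """Sweep the learning rate exponentially while training and record the loss.

    Stops early once the loss blows up relative to the best loss seen so far.
    """
    w = w0
    lrs, losses = [], []
    best = float("inf")
    for i in range(n_iters):
        lr = lr_min * (lr_max / lr_min) ** (i / (n_iters - 1))
        loss = loss_fn(w)
        lrs.append(lr)
        losses.append(loss)
        best = min(best, loss)
        if loss > 4 * best:          # hypothetical divergence threshold
            break
        w = w - lr * grad_fn(w)      # one training step at the current rate
    return lrs, losses

def suggest_lr(lrs, losses):
    """Pick the learning rate at which the loss fell fastest between iterations."""
    drops = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    steepest = min(range(len(drops)), key=lambda i: drops[i])
    return lrs[steepest]

# Toy problem (assumed for illustration): minimize w**2 starting from w = 5.
lrs, losses = lr_range_test(grad_fn=lambda w: 2 * w, loss_fn=lambda w: w * w, w0=5.0)
suggested = suggest_lr(lrs, losses)
print(f"Tried {len(lrs)} learning rates; suggested: {suggested:.4f}")
```

Because the learning rates are evenly spaced on a log scale, the per-iteration drop in loss is proportional to the slope of the loss against log(LR), which is the quantity step 4 asks for. On real models the recorded loss is noisy, so implementations usually smooth it before looking for the steepest slope.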

---

## 3. Regularization

### Intuitive Explanation

**Overfitting** happens when your model memorizes the training data instead of learning general patterns. It's like a student who memorizes answers to practice problems but can't solve new ones.

**Regularization** techniques prevent overfitting by constraining the model's capacity.

**F1 analogy:** Overfitting is when the car is perfectly tuned for one specific track but falls apart everywhere else. A car that is overfit to Monaco (tight, slow corners) will have massive downforce and soft springs -- but it will be hopeless at Monza (long straights, fast corners). Regularization is the engineering discipline of building a car that works well across the entire calendar, not just the track you tested on. It is the difference between winning one race and winning a championship.

### Visualization: The Overfitting Problem

In [None]:
# Generate data with noise
np.random.seed(42)
n_samples = 30
X_train = np.linspace(0, 1, n_samples).reshape(-1, 1)
y_true = np.sin(2 * np.pi * X_train).ravel()
y_train = y_true + 0.3 * np.random.randn(n_samples)

# Fit polynomials of different degrees
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
X_test = np.linspace(0, 1, 100).reshape(-1, 1)

degrees = [1, 4, 15]
titles = ['Underfitting (degree=1)', 'Good Fit (degree=4)', 'Overfitting (degree=15)']
colors = ['blue', 'green', 'red']

for ax, degree, title, color in zip(axes, degrees, titles, colors):
    # Fit polynomial
    coeffs = np.polyfit(X_train.ravel(), y_train, degree)
    y_pred = np.polyval(coeffs, X_test.ravel())
    
    # Calculate errors
    train_pred = np.polyval(coeffs, X_train.ravel())
    train_mse = np.mean((y_train - train_pred)**2)
    test_mse = np.mean((np.sin(2 * np.pi * X_test.ravel()) - y_pred)**2)
    
    # Plot
    ax.scatter(X_train, y_train, color='black', s=50, label='Training data', zorder=5)
    ax.plot(X_test, np.sin(2 * np.pi * X_test), 'g--', alpha=0.5, label='True function')
    ax.plot(X_test, y_pred, color=color, linewidth=2, label='Model')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title(f'{title}\nTrain MSE: {train_mse:.3f}, Test MSE: {test_mse:.3f}')
    ax.legend(loc='upper right', fontsize=8)
    ax.set_ylim(-2, 2)
    ax.grid(True, alpha=0.3)

plt.suptitle('The Bias-Variance Tradeoff', fontsize=14)
plt.tight_layout()
plt.show()

### 3.1 L2 Regularization (Weight Decay)

**Intuition:** Penalize large weights to keep the model simple.

$$L_{total} = L_{data} + \lambda \sum_i w_i^2$$

| Component | Meaning | F1 Analogy |
|-----------|--------|------------|
| $L_{data}$ | Original loss (e.g., cross-entropy) | Lap time on the current track |
| $\lambda$ | Regularization strength (weight_decay) | How much you care about all-track performance |
| $\sum w_i^2$ | Sum of squared weights | How "extreme" your setup is from baseline |

**What this means:** Large weights are costly, so the model prefers smaller weights. This leads to:
- Smoother decision boundaries
- Less sensitivity to individual features
- Better generalization

**F1 analogy:** L2 regularization is like penalizing extreme setup deviations. A car with wing angle at max, springs at minimum, and differential locked tight might be fast at one track, but it is "overfit" to those conditions. L2 says "there is a cost to being extreme." The model (car) is incentivized to find a balanced setup that performs well broadly, rather than an extreme setup that only works in one specific condition.
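
To see what the penalty does during training, note that $\lambda \sum_i w_i^2$ adds $2\lambda w_i$ to each weight's gradient. The sketch below applies plain SGD with that extra term while setting the data gradient to zero, an assumption made to isolate the regularizer. The helper names are hypothetical.

```python
def l2_penalty(weights, lam):
    """lambda * sum(w_i ** 2), the term added to the data loss."""
    return lam * sum(w * w for w in weights)

def sgd_step_with_l2(weights, data_grads, lr, lam):
    """One SGD step on L_data + lambda * sum(w^2).

    The penalty contributes 2 * lambda * w to each gradient, so every step
    also shrinks each weight by the factor (1 - 2 * lr * lambda).
    """
    return [w - lr * (g + 2 * lam * w) for w, g in zip(weights, data_grads)]

weights = [3.0, -0.5, 0.01]
zero_data_grads = [0.0, 0.0, 0.0]  # isolate the effect of the penalty
lr, lam = 0.1, 0.05

print("penalty before:", l2_penalty(weights, lam))
for _ in range(10):
    weights = sgd_step_with_l2(weights, zero_data_grads, lr, lam)
print("weights after 10 steps:", weights)
print("penalty after: ", l2_penalty(weights, lam))
```

Every weight shrinks by the same factor of $1 - 2\eta\lambda = 0.99$ per step, which is why L2 regularization under SGD is also called weight decay. The weights get smaller but none of them reaches exactly zero.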

### 3.2 L1 Regularization (Sparsity)

**Intuition:** Encourage weights to be exactly zero.

$$L_{total} = L_{data} + \lambda \sum_i |w_i|$$

**What this means:** Unlike L2 which makes weights small, L1 pushes weights all the way to zero. This creates **sparse** models where many weights are exactly 0.

| Regularization | Effect on Weights | Use Case |
|----------------|-------------------|----------|
| L2 (Ridge) | Small but non-zero | General regularization |
| L1 (Lasso) | Many exactly zero | Feature selection |
| L1 + L2 (Elastic Net) | Some zero, others small | Best of both |
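
One common way to see where L1's exact zeros come from is the soft-thresholding update used by proximal methods. The notebook does not say how the L1 penalty is optimized, so treat this as an assumed illustration rather than the method used here. Plain (sub)gradient descent on $|w|$ tends to hover near zero without landing on it. The helper names and the example weights are hypothetical.

```python
def soft_threshold(w, threshold):
    """Move w toward zero by threshold; clamp to exactly 0 if it would cross zero."""
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0

def l2_shrink(w, factor):
    """Multiplicative shrinkage, as an L2 penalty produces under SGD."""
    return w * factor

weights = [0.8, -0.3, 0.05, -0.02]
l1_weights = [soft_threshold(w, 0.1) for w in weights]
l2_weights = [l2_shrink(w, 0.5) for w in weights]

print("original:", weights)
print("L1-style:", l1_weights)   # small weights are set to exactly 0.0
print("L2-style:", l2_weights)   # every weight is scaled down, none becomes 0
```

The L1-style update subtracts a fixed amount regardless of the weight's size, so small weights are eliminated entirely. The L2-style update scales by a constant factor, so small weights only get smaller.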

In [None]:
# Visualize L1 vs L2 regularization effect on weights
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Simulate weight distributions after regularization
np.random.seed(42)
n_weights = 1000

# No regularization - weights can be large
no_reg = np.random.randn(n_weights) * 1.5

# L2 regularization - weights are small but non-zero
l2_reg = np.random.randn(n_weights) * 0.3

# L1 regularization - many weights exactly zero (Laplace distribution approximation)
l1_reg = np.random.laplace(0, 0.2, n_weights)
l1_reg[np.abs(l1_reg) < 0.1] = 0  # More weights become exactly 0

ax = axes[0]
ax.hist(no_reg, bins=50, alpha=0.7, color='gray', edgecolor='black')
ax.set_xlabel('Weight Value')
ax.set_ylabel('Count')
ax.set_title('No Regularization')
ax.set_xlim(-4, 4)

ax = axes[1]
ax.hist(l2_reg, bins=50, alpha=0.7, color='blue', edgecolor='black')
ax.set_xlabel('Weight Value')
ax.set_ylabel('Count')
ax.set_title('L2 Regularization\n(small but non-zero)')
ax.set_xlim(-4, 4)

ax = axes[2]
ax.hist(l1_reg, bins=50, alpha=0.7, color='green', edgecolor='black')
ax.set_xlabel('Weight Value')
ax.set_ylabel('Count')
ax.set_title(f'L1 Regularization\n({np.sum(l1_reg == 0)} weights exactly 0)')
ax.set_xlim(-4, 4)

plt.tight_layout()
plt.show()

### 3.3 Dropout

**Intuition: Training an ensemble of networks**

During training, dropout randomly "turns off" neurons with probability $p$. This is like training many different smaller networks simultaneously.

**Why it works:**
1. **Prevents co-adaptation**: Neurons can't rely on specific other neurons always being there
2. **Ensemble effect**: Like training many different networks and averaging them
3. **Forces redundancy**: The network must learn robust features

**Key insight:** Dropout forces neurons to learn features that are useful on their own, not just in combination with specific other neurons.

**F1 analogy:** Dropout is like randomly disabling sensors during testing to build robustness. Imagine running practice sessions where you randomly turn off the tire temperature sensor, or the brake temperature sensor, or the wind speed anemometer. The telemetry system cannot rely on any single sensor always being available -- it must learn to make good predictions even with missing data. When race day comes and all sensors are active, the system is more robust because it never became dependent on any one input. This is exactly what dropout does to neural network layers.
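
The mechanism itself is only a few lines. Below is a minimal from-scratch sketch of "inverted" dropout, the variant that rescales surviving activations during training so nothing has to change at evaluation time. The function name `dropout`, the explicit `rng` argument, and the seed are assumptions for illustration.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: during training, zero each value with probability p and
    scale the survivors by 1 / (1 - p); otherwise pass values through unchanged."""
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)        # fixed seed, chosen arbitrarily
activations = [1.0] * 10_000

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print("distinct training outputs:", sorted(set(train_out)))
print("mean in training mode:", sum(train_out) / len(train_out))
print("mean in eval mode:    ", sum(eval_out) / len(eval_out))
```

Each training output is either dropped or scaled up, so the average activation stays close to its original value. That is why the layer can simply be switched off at evaluation time without rescaling anything.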

In [None]:
# Visualize dropout as creating sub-networks
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

def draw_network(ax, title, dropout_rate=0):
    """Draw a simple neural network with dropout visualization."""
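    # Draw from the caller's NumPy random state. Re-seeding here would give every
    # call with the same dropout_rate an identical mask, so the two "training step"
    # panels would show the same sub-network.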
    
    layers = [4, 6, 6, 3]
    positions = []
    
    for layer_idx, n_neurons in enumerate(layers):
        layer_pos = []
        for i in range(n_neurons):
            y = (i - (n_neurons - 1) / 2) * 0.15
            x = layer_idx * 0.3
            
            # Determine if neuron is dropped (not for input/output layers)
            is_dropped = False
            if dropout_rate > 0 and layer_idx in [1, 2]:
                is_dropped = np.random.random() < dropout_rate
            
            layer_pos.append((x, y, is_dropped))
            
            # Draw neuron
            color = 'lightgray' if is_dropped else 'steelblue'
            edge = 'gray' if is_dropped else 'darkblue'
            ax.scatter(x, y, s=300, c=color, edgecolors=edge, linewidth=2, zorder=3)
        
        positions.append(layer_pos)
    
    # Draw connections
    for l in range(len(positions) - 1):
        for x1, y1, d1 in positions[l]:
            for x2, y2, d2 in positions[l + 1]:
                if not d1 and not d2:
                    ax.plot([x1, x2], [y1, y2], 'gray', alpha=0.3, linewidth=0.5, zorder=1)
    
    ax.set_xlim(-0.1, 1.0)
    ax.set_ylim(-0.6, 0.6)
    ax.set_title(title)
    ax.axis('off')

draw_network(axes[0], 'No Dropout', 0)
draw_network(axes[1], 'Dropout p=0.3\n(training step 1)', 0.3)
np.random.seed(99)
draw_network(axes[2], 'Dropout p=0.3\n(training step 2)', 0.3)

plt.suptitle('Dropout Creates Different Sub-Networks Each Step', fontsize=14)
plt.tight_layout()
plt.show()

print("Gray neurons are 'dropped' - they don't participate in this forward/backward pass.")
print("Each training step uses a different random sub-network!")

In [None]:
# Dropout in PyTorch: Training vs Eval mode
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

# Training mode (dropout active)
dropout.train()
print("Training mode (dropout ACTIVE):")
for i in range(3):
    out = dropout(x)
    print(f"  Sample {i+1}: {out.numpy().round(1)}")

# Evaluation mode (dropout disabled)
dropout.eval()
print("\nEvaluation mode (dropout DISABLED):")
for i in range(3):
    out = dropout(x)
    print(f"  Sample {i+1}: {out.numpy().round(1)}")

print("\nNote: In training mode, surviving values are scaled by 1/(1-p)=2 to maintain expected value.")

### 3.4 Early Stopping

**Intuition:** Stop training when validation performance starts to degrade.

Early stopping is a simple but effective form of regularization:
1. Monitor validation loss during training
2. Save the model when validation loss improves
3. Stop if validation loss hasn't improved for N epochs (patience)
4. Restore the best saved model
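
Steps 2 and 4 (saving and restoring the best model) can be sketched as follows. This is a toy illustration under stated assumptions: `fake_val_losses` is a made-up list standing in for a real validation pass, and the in-place weight nudge stands in for an epoch of training. The patience logic of step 3 is deliberately left out.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)

# Hypothetical per-epoch validation losses (stand-ins for a real validation pass)
fake_val_losses = [0.9, 0.6, 0.5, 0.55, 0.7]

best_loss = float("inf")
best_state = None
best_epoch = -1

for epoch, val_loss in enumerate(fake_val_losses):
    # Stand-in for one epoch of training: nudge the weights so each epoch differs
    with torch.no_grad():
        model.weight.add_(0.1)

    if val_loss < best_loss:
        best_loss = val_loss
        best_epoch = epoch
        # deepcopy matters: state_dict() returns references to the live tensors
        best_state = copy.deepcopy(model.state_dict())

weights_at_end = model.weight.detach().clone()
model.load_state_dict(best_state)  # step 4: restore the best checkpoint

print(f"Best epoch: {best_epoch}, best val loss: {best_loss}")
print(f"Restore changed the weights: {not torch.equal(weights_at_end, model.weight)}")
```

Without the `deepcopy`, the "saved" state would keep tracking the model as training continues, and restoring it would do nothing.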

**F1 analogy:** Early stopping is knowing when to stop chasing setup changes. There is a point in every race weekend where additional setup tweaks start making the car worse, not better -- you have passed the optimum and are now overfitting to noise in the data (track temperature variations, wind gusts, tire batch differences). The experienced engineer knows when to say "the car is as good as it is going to get, lock it in." The patience parameter is like giving yourself 3 more sessions to beat the current best before accepting it.

In [None]:
# Simulate training with/without early stopping
np.random.seed(42)

epochs = 100
train_losses = []
val_losses = []

for epoch in range(epochs):
    # Training loss keeps decreasing
    train_loss = 1.0 * np.exp(-epoch/30) + 0.05 + 0.01 * np.random.randn()
    
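    # Validation loss tracks training until epoch 40, then rises (overfitting).
    # The second branch starts from the epoch-40 value so the curve has no jump.
    if epoch < 40:
        val_loss = 1.0 * np.exp(-epoch/30) + 0.1 + 0.02 * np.random.randn()
    else:
        val_loss = 1.0 * np.exp(-40/30) + 0.1 + 0.004 * (epoch - 40) + 0.02 * np.random.randn()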
    
    train_losses.append(max(0.05, train_loss))
    val_losses.append(max(0.1, val_loss))

best_epoch = np.argmin(val_losses)

fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(train_losses, 'b-', linewidth=2, label='Training Loss')
ax.plot(val_losses, 'r-', linewidth=2, label='Validation Loss')
ax.axvline(x=best_epoch, color='green', linestyle='--', linewidth=2, 
           label=f'Best model (epoch {best_epoch})')
ax.axvspan(best_epoch, epochs, alpha=0.1, color='red', label='Overfitting region')

ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('Early Stopping Prevents Overfitting', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Best epoch: {best_epoch} with validation loss: {val_losses[best_epoch]:.4f}")
print(f"Final epoch: {epochs-1} with validation loss: {val_losses[-1]:.4f}")
print(f"Early stopping saves {val_losses[-1] - val_losses[best_epoch]:.4f} in validation loss!")

### Regularization Comparison Table

| Technique | How It Works | Hyperparameter | When to Use | F1 Parallel |
|-----------|--------------|----------------|-------------|-------------|
| **L2 (Weight Decay)** | Penalizes large weights | weight_decay (0.01-0.1) | Always (default choice) | Penalizing extreme setup deviations |
| **L1** | Encourages sparse weights | lambda | Feature selection needed | Identifying which sensors actually matter |
| **Dropout** | Randomly drops neurons | p (0.1-0.5) | Deep networks, overfitting | Randomly disabling sensors to build robustness |
| **Early Stopping** | Stop when val loss increases | patience (5-20 epochs) | Always monitor | Knowing when to lock in the setup |
| **Data Augmentation** | Artificially expand dataset | Aug. parameters | Images, audio, text | Simulating varied track conditions |
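
The first two rows of the table map onto code in different places: L2-style shrinkage is usually passed to the optimizer as `weight_decay`, while an L1 penalty is typically added to the loss by hand. A minimal sketch, with `l1_lambda` and the toy data chosen purely for illustration:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)

l1_lambda = 1e-3  # illustrative value, not a recommendation
criterion = nn.MSELoss()
# L2-style shrinkage is handled by the optimizer through weight_decay
optimizer = optim.AdamW(model.parameters(), lr=0.01, weight_decay=0.01)

optimizer.zero_grad()
data_loss = criterion(model(x), y)
# L1 penalty: sum of absolute weight values (the bias is left unpenalized here)
l1_penalty = model.weight.abs().sum()
loss = data_loss + l1_lambda * l1_penalty
loss.backward()
optimizer.step()

print(f"data loss={data_loss.item():.4f}, "
      f"L1 penalty={l1_penalty.item():.4f}, total={loss.item():.4f}")
```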

### Why This Matters in Machine Learning

| Scenario | Recommended Regularization | F1 Parallel |
|----------|---------------------------|-------------|
| Small dataset | Dropout + weight decay + data augmentation | Limited testing: maximize learning from few sessions |
| Large dataset | Lighter regularization, early stopping | Full test schedule: data speaks for itself |
| Very deep network | Dropout (0.2-0.5) between dense layers | Complex telemetry pipeline: prevent over-specialization |
| Transformers | Dropout in attention + AdamW weight decay | Modern F1 analytics with many interacting systems |

---

## 4. Batch Normalization

### Intuitive Explanation: The Problem

As data flows through a deep network, the distribution of activations can shift dramatically between layers. This is called **internal covariate shift**.

**Analogy:** Imagine trying to learn to catch balls, but the thrower keeps changing how they throw - sometimes fast, sometimes slow, sometimes high, sometimes low. It would be much easier if every throw was similar.

**Batch Normalization** normalizes the activations within each mini-batch, making training faster and more stable.

**F1 analogy:** Batch normalization is normalizing telemetry data across different track conditions. Tire temperature readings at Bahrain (50C ambient) look completely different from Finland testing (-5C ambient). If the downstream analysis system expects a certain range of inputs, these wild variations cause instability. BatchNorm standardizes the inputs at each processing stage so that the downstream systems always see data in a consistent range, regardless of whether it was collected at Bahrain or in Finland. The learnable scale ($\gamma$) and shift ($\beta$) parameters let each layer find its own optimal operating range.

### What BatchNorm Does

For each mini-batch, BatchNorm:

1. **Normalize**: Subtract mean, divide by standard deviation
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

2. **Scale and Shift**: Apply learnable parameters $\gamma$ and $\beta$
$$y = \gamma \hat{x} + \beta$$

| Component | Meaning | F1 Analogy |
|-----------|--------|------------|
| $\mu_B$ | Mean of the batch | Average sensor reading across current conditions |
| $\sigma_B^2$ | Variance of the batch | How spread out the readings are |
| $\gamma$ | Learnable scale | Optimal sensitivity range for this processing stage |
| $\beta$ | Learnable shift | Optimal baseline for this processing stage |

**Why the learnable parameters?** They let the network undo the normalization if needed. The network learns the optimal distribution for each layer.
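
The two steps above can be checked against `nn.BatchNorm1d` directly. This is a sketch for a freshly constructed layer in training mode (so $\gamma = 1$ and $\beta = 0$); the input shape and values are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 4) * 5 + 10

bn = nn.BatchNorm1d(4)   # gamma initialised to 1, beta to 0
bn.train()
y_ref = bn(x)

# Same computation by hand, per feature (column)
mu_B = x.mean(dim=0)
var_B = x.var(dim=0, unbiased=False)      # biased variance, as in the formula
x_hat = (x - mu_B) / torch.sqrt(var_B + bn.eps)
y_manual = bn.weight * x_hat + bn.bias    # gamma * x_hat + beta

print(f"Max abs difference: {(y_ref - y_manual).abs().max().item():.2e}")
```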

In [None]:
# Visualize what BatchNorm does
np.random.seed(42)

# Simulate activations before BatchNorm (shifted and scaled)
batch_size = 1000
before_bn = np.random.exponential(2, batch_size) + np.random.randn(batch_size) * 0.5 + 5

# Apply BatchNorm manually (same form as the formula above)
mu = before_bn.mean()
var = before_bn.var()
normalized = (before_bn - mu) / np.sqrt(var + 1e-5)

# Learnable scale and shift
gamma, beta = 1.5, 0.5
after_bn = gamma * normalized + beta

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

ax = axes[0]
ax.hist(before_bn, bins=50, alpha=0.7, color='red', edgecolor='black')
ax.axvline(mu, color='black', linestyle='--', linewidth=2, label=f'Mean: {mu:.1f}')
ax.set_xlabel('Activation Value')
ax.set_ylabel('Count')
ax.set_title('Before BatchNorm\n(shifted, varying scale)')
ax.legend()

ax = axes[1]
ax.hist(normalized, bins=50, alpha=0.7, color='yellow', edgecolor='black')
ax.axvline(0, color='black', linestyle='--', linewidth=2, label='Mean: 0, Std: 1')
ax.set_xlabel('Activation Value')
ax.set_ylabel('Count')
ax.set_title('After Normalization\n(zero mean, unit variance)')
ax.legend()

ax = axes[2]
ax.hist(after_bn, bins=50, alpha=0.7, color='green', edgecolor='black')
ax.axvline(beta, color='black', linestyle='--', linewidth=2, label=f'Mean: {after_bn.mean():.2f}')
ax.set_xlabel('Activation Value')
ax.set_ylabel('Count')
ax.set_title(f'After Scale/Shift\n(gamma={gamma}, beta={beta})')
ax.legend()

plt.tight_layout()
plt.show()

### Training vs Eval Mode

BatchNorm behaves differently during training and evaluation:

| Mode | Mean/Variance | Why |
|------|---------------|-----|
| **Training** | Computed from current batch | Different each batch, adds noise |
| **Evaluation** | Running averages from training | Consistent, deterministic predictions |

**Critical:** Always call `model.train()` before training and `model.eval()` before inference!

In [None]:
# BatchNorm: Training vs Eval mode
bn = nn.BatchNorm1d(num_features=4)

# Training mode - uses batch statistics
bn.train()
x_batch = torch.randn(32, 4) * 5 + 10  # batch of 32

print("Training mode:")
print(f"  Input mean: {x_batch.mean(dim=0).numpy().round(2)}")
print(f"  Input std:  {x_batch.std(dim=0).numpy().round(2)}")

y_batch = bn(x_batch)
print(f"  Output mean: {y_batch.mean(dim=0).detach().numpy().round(2)} (approx 0)")
print(f"  Output std:  {y_batch.std(dim=0).detach().numpy().round(2)} (approx 1)")

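# One training-mode forward pass has already moved the running statistics
# from their initial values (mean 0, var 1) a small step toward the batch statistics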
print(f"\n  Running mean: {bn.running_mean.numpy().round(2)}")
print(f"  Running var:  {bn.running_var.numpy().round(2)}")

# Evaluation mode - uses running statistics
bn.eval()
x_single = torch.randn(1, 4) * 5 + 10  # Single sample

print("\nEvaluation mode (single sample):")
print(f"  Input: {x_single.numpy().round(2)}")
y_single = bn(x_single)
print(f"  Output: {y_single.detach().numpy().round(2)}")
print("  (Uses running statistics, not batch statistics)")
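
How the running averages get updated can be verified with a single forward pass. The sketch below assumes the default `momentum=0.1` and relies on PyTorch's documented behaviour of storing the unbiased batch variance in `running_var`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)            # default momentum=0.1
bn.train()
x = torch.randn(16, 3) * 2 + 5

old_mean = bn.running_mean.clone()   # starts at 0
old_var = bn.running_var.clone()     # starts at 1
bn(x)                                # one training-mode forward pass

m = bn.momentum
expected_mean = (1 - m) * old_mean + m * x.mean(dim=0)
# running_var is updated with the unbiased batch variance
expected_var = (1 - m) * old_var + m * x.var(dim=0, unbiased=True)

print("running_mean matches:", torch.allclose(bn.running_mean, expected_mean, atol=1e-5))
print("running_var matches: ", torch.allclose(bn.running_var, expected_var, atol=1e-5))
```

With a momentum of 0.1, the running statistics need many batches to converge, which is why evaluating a model after only a handful of training steps can give surprisingly poor results.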

### Deep Dive: Why BatchNorm Helps

BatchNorm provides several benefits:

1. **Faster training**: Allows higher learning rates without diverging
2. **Regularization effect**: Batch statistics add noise, like dropout
3. **Reduces initialization sensitivity**: Normalization prevents extreme activations
4. **Smoother loss landscape**: Makes optimization easier

#### Key Insight

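BatchNorm doesn't just normalize - it gives each layer a "fresh start" at each training step: however the earlier layers have shifted or rescaled their outputs, the next layer receives inputs with a controlled mean and variance.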

**F1 analogy:** BatchNorm is like recalibrating every sensor at the start of each session. Without recalibration, the tire temperature sensor that read 90C in Bahrain and 40C in Barcelona gives the downstream systems wildly different inputs. With recalibration (normalization), the system always sees "this tire is 1.5 standard deviations above the session mean" -- a consistent, comparable signal regardless of absolute conditions.

#### Common Misconceptions

| Misconception | Reality |
|---------------|--------|
| Always put BatchNorm after activation | Before activation is more common and often better |
| BatchNorm eliminates need for good init | Still helps to initialize properly |
| BatchNorm always helps | Can hurt with very small batches |

### LayerNorm: For Transformers

**Layer Normalization** normalizes across features instead of across the batch.

| Normalization | Normalizes Across | Use Case | F1 Analogy |
|---------------|-------------------|----------|------------|
| **BatchNorm** | Batch dimension | CNNs, large batches | Normalize each sensor across all laps in a session |
| **LayerNorm** | Feature dimension | Transformers, RNNs | Normalize all sensors within a single lap |
| **InstanceNorm** | Spatial dimensions | Style transfer | Normalize within a single corner trace |
| **GroupNorm** | Groups of channels | Small batches, detection | Normalize groups of related sensors together |
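
The "Normalizes Across" column becomes concrete on an image-shaped tensor `(N, C, H, W)`. In this sketch the sizes, the choice of 2 groups, and `normalized_shape=[C, H, W]` for LayerNorm are illustrative assumptions; in Transformers, LayerNorm is normally applied over the last (feature) dimension only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, C, H, W = 8, 6, 5, 5
x = torch.randn(N, C, H, W) * 3 + 2

batch_norm = nn.BatchNorm2d(C).train()
layer_norm = nn.LayerNorm([C, H, W])
instance_norm = nn.InstanceNorm2d(C)
group_norm = nn.GroupNorm(num_groups=2, num_channels=C)

# Mean over exactly the slice each layer standardises; all should be ~0
bn_means = batch_norm(x).mean(dim=(0, 2, 3))             # one per channel
ln_means = layer_norm(x).mean(dim=(1, 2, 3))             # one per sample
in_means = instance_norm(x).mean(dim=(2, 3))             # one per (sample, channel)
gn_means = group_norm(x).reshape(N, 2, -1).mean(dim=2)   # one per (sample, group)

for name, means in [("BatchNorm2d", bn_means), ("LayerNorm", ln_means),
                    ("InstanceNorm2d", in_means), ("GroupNorm", gn_means)]:
    print(f"{name:<15} {tuple(means.shape)} slices, max |mean| = {means.abs().max().item():.1e}")
```

Only BatchNorm mixes information across samples, which is why it is the one that depends on batch size.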

In [None]:
# Visualize BatchNorm vs LayerNorm
batch_size, features = 4, 8
x = torch.randn(batch_size, features) * 3 + 2

# BatchNorm normalizes each column (feature) across the batch
batch_norm = nn.BatchNorm1d(features)
batch_norm.train()
bn_out = batch_norm(x)

# LayerNorm normalizes each row (sample) across features
layer_norm = nn.LayerNorm(features)
ln_out = layer_norm(x)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

im0 = axes[0].imshow(x.detach().numpy(), cmap='RdBu', aspect='auto', vmin=-6, vmax=6)
axes[0].set_xlabel('Features')
axes[0].set_ylabel('Batch samples')
axes[0].set_title('Original Input')
plt.colorbar(im0, ax=axes[0])

im1 = axes[1].imshow(bn_out.detach().numpy(), cmap='RdBu', aspect='auto', vmin=-3, vmax=3)
axes[1].set_xlabel('Features')
axes[1].set_ylabel('Batch samples')
axes[1].set_title('BatchNorm\n(normalizes columns)')
plt.colorbar(im1, ax=axes[1])

im2 = axes[2].imshow(ln_out.detach().numpy(), cmap='RdBu', aspect='auto', vmin=-3, vmax=3)
axes[2].set_xlabel('Features')
axes[2].set_ylabel('Batch samples')
axes[2].set_title('LayerNorm\n(normalizes rows)')
plt.colorbar(im2, ax=axes[2])

plt.tight_layout()
plt.show()

print("BatchNorm: Each COLUMN has mean~0, std~1")
print("LayerNorm: Each ROW has mean~0, std~1")

---

## 5. Weight Initialization

### Intuitive Explanation

How we initialize weights dramatically affects whether training succeeds. Poor initialization can cause:
- **Vanishing activations**: All outputs become near-zero, gradients vanish
- **Exploding activations**: Outputs grow without bound, gradients explode
- **Dead neurons**: Some neurons never activate (especially with ReLU)

The goal is to keep activations and gradients at reasonable scales throughout the network.

**F1 analogy:** Weight initialization is choosing your baseline setup before you start tuning. If you start with a completely random setup -- wing at max, springs at minimum, ride height at maximum -- the car might not even be drivable (exploding activations) or it might be so slow it provides no useful data (vanishing activations). A good starting point, like using last year's setup at a similar track, gives you a drivable car from which you can tune effectively. Xavier and Kaiming initialization are the engineering equivalent of "start from a known-good baseline."

### Visualization: Why Initialization Matters

In [None]:
def forward_activations(layers, x, activation=torch.relu):
    """Track activation magnitudes through layers."""
    magnitudes = [x.abs().mean().item()]
    
    for layer in layers:
        x = activation(layer(x))
        magnitudes.append(x.abs().mean().item())
    
    return magnitudes

def create_layers(n_layers, hidden_dim, init_scale):
    """Create linear layers with specified initialization scale."""
    layers = []
    for _ in range(n_layers):
        layer = nn.Linear(hidden_dim, hidden_dim, bias=False)
        nn.init.normal_(layer.weight, mean=0, std=init_scale)
        layers.append(layer)
    return layers

# Test different initialization scales
torch.manual_seed(42)
n_layers = 20
hidden_dim = 256
x = torch.randn(100, hidden_dim)

fig, ax = plt.subplots(figsize=(10, 6))

scales = [0.01, 0.1, 1.0]
colors = ['blue', 'green', 'red']
labels = ['Too Small (std=0.01)', 'Better (std=0.1)', 'Too Large (std=1.0)']

for scale, color, label in zip(scales, colors, labels):
    layers = create_layers(n_layers, hidden_dim, scale)
    mags = forward_activations(layers, x.clone())
    ax.plot(mags, '-o', color=color, label=label, markersize=4, linewidth=2)

ax.set_xlabel('Layer', fontsize=12)
ax.set_ylabel('Mean Activation Magnitude', fontsize=12)
ax.set_title('Effect of Weight Initialization Scale', fontsize=14)
ax.set_yscale('log')
ax.legend()
ax.grid(True, alpha=0.3)
ax.axhline(y=1, color='gray', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print("Too small: Activations vanish (shrink to 0)")
print("Too large: Activations explode (grow unbounded)")
print("Just right: Activations stay roughly constant")
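
The three curves can be anticipated with a back-of-the-envelope calculation. Assuming zero-mean weights and roughly symmetric pre-activations, each ReLU layer multiplies the RMS activation by about `std * sqrt(fan_in / 2)`; this is an approximation, not an exact prediction of the plotted values.

```python
import math

hidden_dim = 256
n_layers = 20

for std in (0.01, 0.1, 1.0):
    # Approximate per-layer multiplier on the RMS activation for a ReLU layer
    gain = std * math.sqrt(hidden_dim / 2)
    print(f"std={std:<5} per-layer gain ~ {gain:6.3f}, "
          f"after {n_layers} layers ~ {gain ** n_layers:.2e}")

# The std that gives a gain of exactly 1 under this approximation
neutral_std = math.sqrt(2 / hidden_dim)
print(f"gain-neutral std = {neutral_std:.4f}")
```

A gain slightly above 1 (std=0.1) drifts slowly, while gains far from 1 compound into vanishing or exploding activations within a few layers.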

### 5.1 Xavier/Glorot Initialization

**For tanh and sigmoid activations.**

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$$

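**Intuition:** Balance the variance of inputs and outputs to maintain signal magnitude through layers. Works well when activations are symmetric around zero. In these formulas the second argument of $\mathcal{N}$ is the variance, so the standard deviation is its square root, here $\sqrt{2/(n_{in} + n_{out})}$.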

### 5.2 Kaiming/He Initialization

**For ReLU activations.**

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$$

**Intuition:** ReLU zeros out half the activations (negative values become 0), so we need larger initial weights to compensate. The factor of 2 accounts for this.
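
A short derivation sketch of where the factor of 2 comes from, under simplifying assumptions that go beyond the text above: weights are i.i.d. with zero mean and independent of the inputs, biases are zero, and the pre-activations are symmetric around zero. The notation $y_l = W_l x_l$ and $x_l = \mathrm{ReLU}(y_{l-1})$ is introduced only for this sketch.

The variance of one pre-activation is a sum over $n_{in}$ independent terms:

$$\mathrm{Var}(y_l) = n_{in} \, \mathrm{Var}(W_l) \, \mathbb{E}[x_l^2]$$

Because ReLU keeps only the positive half of a symmetric distribution, half of the second moment survives:

$$\mathbb{E}[x_l^2] = \frac{1}{2} \mathrm{Var}(y_{l-1})$$

Combining the two:

$$\mathrm{Var}(y_l) = \frac{1}{2} \, n_{in} \, \mathrm{Var}(W_l) \, \mathrm{Var}(y_{l-1})$$

Choosing $\mathrm{Var}(W_l) = \frac{2}{n_{in}}$ makes the multiplier equal to 1, so the variance neither shrinks nor grows from layer to layer.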

In [None]:
# Compare Xavier vs Kaiming with ReLU
torch.manual_seed(42)

n_layers = 30
hidden_dim = 256
x = torch.randn(100, hidden_dim)

def create_initialized_layers(n_layers, hidden_dim, init_type):
    """Create layers with proper initialization."""
    layers = []
    for _ in range(n_layers):
        layer = nn.Linear(hidden_dim, hidden_dim, bias=False)
        if init_type == 'xavier':
            nn.init.xavier_normal_(layer.weight)
        elif init_type == 'kaiming':
            nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
        elif init_type == 'random':
            nn.init.normal_(layer.weight, mean=0, std=0.1)
        layers.append(layer)
    return layers

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# With ReLU activation
ax = axes[0]
for init_type, color in [('random', 'red'), ('xavier', 'blue'), ('kaiming', 'green')]:
    layers = create_initialized_layers(n_layers, hidden_dim, init_type)
    mags = forward_activations(layers, x.clone(), torch.relu)
    ax.plot(mags, '-o', color=color, label=init_type.capitalize(), markersize=3, linewidth=2)

ax.set_xlabel('Layer')
ax.set_ylabel('Mean Activation Magnitude')
ax.set_title('ReLU Activation\n(Kaiming is designed for ReLU)')
ax.set_yscale('log')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(1e-6, 1e2)

# With Tanh activation
ax = axes[1]
for init_type, color in [('random', 'red'), ('xavier', 'blue'), ('kaiming', 'orange')]:
    layers = create_initialized_layers(n_layers, hidden_dim, init_type)
    mags = forward_activations(layers, x.clone(), torch.tanh)
    ax.plot(mags, '-o', color=color, label=init_type.capitalize(), markersize=3, linewidth=2)

ax.set_xlabel('Layer')
ax.set_ylabel('Mean Activation Magnitude')
ax.set_title('Tanh Activation\n(Xavier is designed for symmetric activations)')
ax.set_yscale('log')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(1e-6, 1e2)

plt.tight_layout()
plt.show()

### Which Initialization to Use?

| Activation | Recommended Init | PyTorch Function | F1 Analogy |
|------------|------------------|------------------|------------|
| ReLU, LeakyReLU | Kaiming (He) | `nn.init.kaiming_normal_` | Baseline for threshold-based systems |
| Tanh, Sigmoid | Xavier (Glorot) | `nn.init.xavier_normal_` | Baseline for smooth-response systems |
| GELU, SiLU | Kaiming | `nn.init.kaiming_normal_` | Baseline for modern nonlinear responses |
| Linear (no activation) | Xavier | `nn.init.xavier_normal_` | Baseline for linear processing stages |

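**Good news:** PyTorch's default initialization (a Kaiming-uniform variant for `nn.Linear` and the convolution layers) works well for most cases, so explicit initialization matters mostly for very deep or custom architectures.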

In [None]:
# PyTorch initialization examples
layer = nn.Linear(512, 256)

# Xavier initialization (for tanh/sigmoid)
nn.init.xavier_normal_(layer.weight)
print(f"Xavier: mean={layer.weight.mean():.6f}, std={layer.weight.std():.4f}")

# Kaiming initialization (for ReLU)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
print(f"Kaiming: mean={layer.weight.mean():.6f}, std={layer.weight.std():.4f}")

# For biases, usually zero
nn.init.zeros_(layer.bias)
print(f"Bias: all zeros = {torch.all(layer.bias == 0).item()}")

---

## 6. Putting It All Together

Now let's build a complete training pipeline that incorporates all the techniques:
- AdamW optimizer with cosine learning rate scheduling
- BatchNorm for stable training and Dropout for regularization
- Proper Kaiming initialization
- Training with validation monitoring

In [None]:
# Create synthetic classification dataset
def make_moons_data(n_samples=1000, noise=0.2, seed=42):
    """Generate two interleaving half circles (moons)."""
    # Seed is a parameter so that different splits get independent noise
    np.random.seed(seed)
    n = n_samples // 2
    
    # First moon
    theta1 = np.linspace(0, np.pi, n)
    x1 = np.cos(theta1)
    y1 = np.sin(theta1)
    
    # Second moon (shifted and flipped)
    theta2 = np.linspace(0, np.pi, n)
    x2 = 1 - np.cos(theta2)
    y2 = 0.5 - np.sin(theta2)
    
    X = np.vstack([
        np.column_stack([x1, y1]),
        np.column_stack([x2, y2])
    ])
    X += np.random.randn(*X.shape) * noise
    y = np.hstack([np.zeros(n), np.ones(n)])
    
    # Shuffle (use len(X) so an odd n_samples cannot index out of bounds)
    idx = np.random.permutation(len(X))
    return X[idx].astype(np.float32), y[idx].astype(np.float32)

# Create train and validation data (different seeds, so the validation noise
# is not a copy of the training noise)
X_train, y_train = make_moons_data(n_samples=800, noise=0.2, seed=42)
X_val, y_val = make_moons_data(n_samples=200, noise=0.2, seed=43)

# Convert to PyTorch
train_dataset = TensorDataset(
    torch.from_numpy(X_train), 
    torch.from_numpy(y_train).unsqueeze(1)
)
val_dataset = TensorDataset(
    torch.from_numpy(X_val), 
    torch.from_numpy(y_val).unsqueeze(1)
)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], c='blue', label='Class 0', alpha=0.6)
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], c='red', label='Class 1', alpha=0.6)
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Training Data (Two Moons)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
class BestPracticesNetwork(nn.Module):
    """
    Network with all best practices:
    - BatchNorm after linear layers
    - Dropout for regularization
    - ReLU activation
    - Kaiming initialization
    """
    def __init__(self, input_dim=2, hidden_dims=(64, 32), output_dim=1, dropout_rate=0.3):
        super().__init__()
        
        layers = []
        prev_dim = input_dim
        
        for hidden_dim in hidden_dims:
            # Linear layer with Kaiming init
            linear = nn.Linear(prev_dim, hidden_dim)
            nn.init.kaiming_normal_(linear.weight, mode='fan_in', nonlinearity='relu')
            nn.init.zeros_(linear.bias)
            layers.append(linear)
            
            # BatchNorm
            layers.append(nn.BatchNorm1d(hidden_dim))
            
            # Activation
            layers.append(nn.ReLU())
            
            # Dropout
            if dropout_rate > 0:
                layers.append(nn.Dropout(dropout_rate))
            
            prev_dim = hidden_dim
        
        # Output layer (no BatchNorm, no Dropout)
        output_layer = nn.Linear(prev_dim, output_dim)
        nn.init.xavier_normal_(output_layer.weight)
        nn.init.zeros_(output_layer.bias)
        layers.append(output_layer)
        
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)

model = BestPracticesNetwork(input_dim=2, hidden_dims=[64, 32], output_dim=1, dropout_rate=0.3)
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters())}")

In [None]:
def train_model(model, train_loader, val_loader, epochs=100, lr=0.01, weight_decay=0.01):
    """
    Complete training pipeline with:
    - AdamW optimizer with weight decay
    - Cosine annealing learning rate schedule
    - Training and validation tracking
    """
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    
    history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': [], 'lr': []}
    
    for epoch in range(epochs):
        # Training
        model.train()
        train_loss, train_correct, train_total = 0, 0, 0
        
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item() * X_batch.size(0)
            predictions = (torch.sigmoid(outputs) > 0.5).float()
            train_correct += (predictions == y_batch).sum().item()
            train_total += X_batch.size(0)
        
        # Validation
        model.eval()
        val_loss, val_correct, val_total = 0, 0, 0
        
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                
                val_loss += loss.item() * X_batch.size(0)
                predictions = (torch.sigmoid(outputs) > 0.5).float()
                val_correct += (predictions == y_batch).sum().item()
                val_total += X_batch.size(0)
        
        # Record and update
        history['train_loss'].append(train_loss / train_total)
        history['val_loss'].append(val_loss / val_total)
        history['train_acc'].append(train_correct / train_total)
        history['val_acc'].append(val_correct / val_total)
        history['lr'].append(optimizer.param_groups[0]['lr'])
        
        scheduler.step()
        
        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1:3d}: train_loss={history['train_loss'][-1]:.4f}, "
                  f"val_loss={history['val_loss'][-1]:.4f}, "
                  f"val_acc={history['val_acc'][-1]:.4f}")
    
    return history

# Train the model
torch.manual_seed(42)
model = BestPracticesNetwork(input_dim=2, hidden_dims=[64, 32], output_dim=1, dropout_rate=0.3)
history = train_model(model, train_loader, val_loader, epochs=100, lr=0.01, weight_decay=0.01)

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Loss
ax = axes[0]
ax.plot(history['train_loss'], 'b-', label='Train', linewidth=2)
ax.plot(history['val_loss'], 'r-', label='Validation', linewidth=2)
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Loss')
ax.legend()
ax.grid(True, alpha=0.3)

# Accuracy
ax = axes[1]
ax.plot(history['train_acc'], 'b-', label='Train', linewidth=2)
ax.plot(history['val_acc'], 'r-', label='Validation', linewidth=2)
ax.set_xlabel('Epoch')
ax.set_ylabel('Accuracy')
ax.set_title('Accuracy')
ax.legend()
ax.grid(True, alpha=0.3)

# Learning Rate
ax = axes[2]
ax.plot(history['lr'], 'g-', linewidth=2)
ax.set_xlabel('Epoch')
ax.set_ylabel('Learning Rate')
ax.set_title('Learning Rate (Cosine Annealing)')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Visualize decision boundary
def plot_decision_boundary(model, X, y):
    """Plot the decision boundary."""
    model.eval()
    
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    grid = np.column_stack([xx.ravel(), yy.ravel()]).astype(np.float32)
    with torch.no_grad():
        Z = torch.sigmoid(model(torch.from_numpy(grid))).numpy()
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, levels=50, cmap='RdBu', alpha=0.7)
    plt.colorbar(label='P(Class 1)')
    plt.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
    plt.scatter(X[y == 0, 0], X[y == 0, 1], c='blue', edgecolors='white', label='Class 0')
    plt.scatter(X[y == 1, 0], X[y == 1, 1], c='red', edgecolors='white', label='Class 1')
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.legend()

plt.figure(figsize=(10, 8))
plot_decision_boundary(model, X_val, y_val)
plt.title('Trained Model Decision Boundary')
plt.show()

# Final accuracy
print(f"\nFinal validation accuracy: {history['val_acc'][-1]:.4f}")

---

## Exercises

### Exercise 1: Optimizer Comparison

Compare the performance of different optimizers on the moons dataset.

In [None]:
# EXERCISE 1: Compare optimizers
def train_with_optimizer(optimizer_name, lr=0.01, epochs=50):
    """
    Train a model with a specific optimizer and return validation accuracy.
    
    Args:
        optimizer_name: 'sgd', 'momentum', 'rmsprop', or 'adam'
        lr: learning rate
        epochs: number of epochs
    
    Returns:
        List of validation accuracies per epoch
    """
    # TODO: Implement this!
    # Hint: Create a fresh model
    # Hint: Choose optimizer based on optimizer_name:
    #   - 'sgd': optim.SGD(params, lr=lr)
    #   - 'momentum': optim.SGD(params, lr=lr, momentum=0.9)
    #   - 'rmsprop': optim.RMSprop(params, lr=lr)
    #   - 'adam': optim.Adam(params, lr=lr)
    # Hint: Train and record validation accuracy each epoch
    
    pass  # Replace with your implementation

# Test your implementation
# val_accs = train_with_optimizer('adam', lr=0.01, epochs=50)
# print(f"Adam final accuracy: {val_accs[-1]:.4f}")

# Example usage: compare all optimizers
# optimizers = ['sgd', 'momentum', 'rmsprop', 'adam']
# for opt in optimizers:
#     accs = train_with_optimizer(opt, lr=0.01)
#     plt.plot(accs, label=opt)
# plt.legend()
# plt.show()

### Exercise 2: Implement Early Stopping

Implement an early stopping class that stops training when validation loss stops improving.

In [None]:
# EXERCISE 2: Implement early stopping
class EarlyStopping:
    """
    Early stopping to prevent overfitting.
    
    Usage:
        early_stop = EarlyStopping(patience=5)
        for epoch in range(epochs):
            # ... training code ...
            if early_stop(val_loss):
                print("Early stopping!")
                break
    """
    def __init__(self, patience=5, min_delta=0):
        """
        Args:
            patience: Number of consecutive epochs without improvement to tolerate before stopping
            min_delta: Minimum decrease in validation loss that counts as an improvement
        """
        self.patience = patience
        self.min_delta = min_delta
        # TODO: Initialize tracking variables
        # Hint: Track best_loss and counter
        pass
        
    def __call__(self, val_loss):
        """
        Check if training should stop.
        
        Args:
            val_loss: Current validation loss
            
        Returns:
            True if training should stop, False otherwise
        """
        # TODO: Implement this!
        # Hint: If val_loss improved by more than min_delta, update best_loss and reset counter
        # Hint: Otherwise, increment counter
        # Hint: Return True if counter >= patience
        
        pass  # Replace with your implementation

# Test your implementation
early_stop = EarlyStopping(patience=3)
losses = [1.0, 0.9, 0.8, 0.85, 0.83, 0.84, 0.86, 0.7]

print("Testing Early Stopping:")
for epoch, loss in enumerate(losses):
    should_stop = early_stop(loss)
    print(f"Epoch {epoch}: loss={loss}, stop={should_stop}")
    if should_stop:
        print(f"Training stopped at epoch {epoch}")
        break

# Expected: Should stop at epoch 5 (epochs 3, 4 and 5 all fail to improve on 0.8, so counter reaches patience=3)

### Exercise 3: Dropout Rate Experiment

Find the optimal dropout rate for the moons classification problem.

In [None]:
# EXERCISE 3: Find optimal dropout rate
def test_dropout_rate(dropout_rate, epochs=50):
    """
    Train a model with a specific dropout rate.
    
    Args:
        dropout_rate: float between 0 and 1
        epochs: number of training epochs
    
    Returns:
        Final validation accuracy
    """
    # TODO: Implement this!
    # Hint: Create BestPracticesNetwork with the given dropout_rate
    # Hint: Train the model
    # Hint: Return the final validation accuracy
    
    pass  # Replace with your implementation

# Test your implementation
# dropout_rates = [0.0, 0.1, 0.2, 0.3, 0.5, 0.7]
# accuracies = [test_dropout_rate(dr) for dr in dropout_rates]
#
# plt.plot(dropout_rates, accuracies, 'bo-')
# plt.xlabel('Dropout Rate')
# plt.ylabel('Validation Accuracy')
# plt.title('Effect of Dropout Rate')
# plt.grid(True, alpha=0.3)
# plt.show()
#
# best_idx = np.argmax(accuracies)
# print(f"Best dropout rate: {dropout_rates[best_idx]} with accuracy {accuracies[best_idx]:.4f}")

---

## Summary

### Key Concepts

| Concept | Definition | F1 Parallel |
|---------|-----------|-------------|
| **SGD** | Simple gradient following | Conservative: one small setup change at a time |
| **Momentum** | Accumulate velocity from past gradients | Trend-following: if it kept working, keep going |
| **Adam** | Momentum + adaptive per-parameter rates | Veteran engineer: adapts strategy to each parameter |
| **Learning rate** | Step size for weight updates | How big each setup adjustment is |
| **LR schedulers** | Reduce LR over time | Bold changes early (FP1), tiny refinements late (qualifying) |
| **L2 regularization** | Penalize large weights | Penalizing extreme setup deviations from baseline |
| **Dropout** | Randomly disable neurons during training | Randomly disabling sensors to build robustness |
| **Early stopping** | Stop when validation degrades | Knowing when to lock in the setup |
| **BatchNorm** | Normalize activations per batch | Normalizing telemetry across different track conditions |
| **LayerNorm** | Normalize activations per sample | Normalizing all sensors within a single lap |
| **Weight init** | Smart starting values for parameters | Starting from a known-good baseline setup |

### Connection to Deep Learning

| Technique | Application | F1 Parallel |
|-----------|------------|-------------|
| Adam/AdamW | Default for most tasks | Veteran engineer's adaptive approach |
| SGD + Momentum | Fine-tuning, achieving SOTA on vision | Patient, iterative refinement |
| Cosine LR Schedule | Modern training recipes | Smooth FP1 -> qualifying transition |
| Dropout | Fully connected layers | Sensor robustness training |
| BatchNorm | CNNs, faster training | Cross-condition telemetry normalization |
| LayerNorm | Transformers, RNNs | Within-sample feature normalization |
| Kaiming Init | Any ReLU network | Engineering-informed baseline setup |

### Checklist

- [ ] I can explain why Adam is often the default optimizer
- [ ] I understand the effect of learning rate on training
- [ ] I can choose appropriate regularization for different scenarios
- [ ] I know when to use BatchNorm vs LayerNorm
- [ ] I can select the right initialization for my activation function
- [ ] I can build a complete training pipeline with best practices

---

## Next Steps

Now that you understand how to train deep networks effectively, you're ready to explore:

1. **Convolutional Neural Networks (CNNs)**: Specialized architectures for image data
2. **Recurrent Neural Networks (RNNs)**: Handling sequential data
3. **Transfer Learning**: Using pretrained models as starting points
4. **Transformers**: The architecture behind modern NLP and beyond

The training techniques you learned here apply to all these architectures. Whether you're training a simple classifier or a billion-parameter language model, you'll use:
- Optimizers (usually Adam or AdamW) -- your setup tuning strategy
- Learning rate schedules (often cosine with warmup) -- bold changes early, refinements late
- Regularization (dropout, weight decay) -- preventing the car from being overfit to one track
- Normalization (BatchNorm for CNNs, LayerNorm for Transformers) -- normalizing telemetry across conditions
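
The "cosine with warmup" schedule in the list above can be written as a small function of the step count. This is a sketch under stated assumptions: the function name `lr_at_step` and its parameters are hypothetical, and linear warmup is one common choice rather than the only one.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine decay from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative settings: 100 training steps, the first 10 spent warming up.
schedule = [
    lr_at_step(s, total_steps=100, warmup_steps=10, base_lr=1e-3)
    for s in range(101)
]
print(f"start={schedule[0]:.2e}  peak={max(schedule):.2e}  end={schedule[-1]:.2e}")
```

Dividing the result by `base_lr` gives a multiplier, which is the form expected by `torch.optim.lr_scheduler.LambdaLR` if you prefer to drive the schedule through PyTorch.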

**Practical next steps:**
- Train a model on MNIST or CIFAR-10
- Experiment with different optimizer/scheduler combinations
- Use TensorBoard or Weights & Biases to visualize training
- Try implementing gradient clipping for very deep or recurrent networks, where gradients can explode
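
As a starting point for the gradient clipping suggestion, here is a minimal sketch of a single training step that uses PyTorch's `torch.nn.utils.clip_grad_norm_`. The tiny model, the random data, and the `max_norm` value are assumptions chosen only to provoke large gradients:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model and data, for illustration only.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(16, 10)
y = torch.randn(16, 1) * 100.0  # large targets to provoke large gradients

max_norm = 1.0  # assumed threshold; tune it for your own model

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale all gradients in place so their combined L2 norm is at most max_norm.
# The return value is the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

# Recompute the combined norm to confirm the clip took effect.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
print(f"grad norm before: {norm_before.item():.2f}, after: {norm_after.item():.2f}")
```

Clipping goes between `loss.backward()` and `optimizer.step()`: the gradients must exist before they can be rescaled, and they must be rescaled before the optimizer uses them.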

You now have the complete engineering toolkit to train deep neural networks. Like an F1 team heading into a race weekend, you know the car (architecture), the tuning process (optimizers), the development discipline (regularization), and the calibration systems (normalization). Time to go racing.