# Concepts: Training Neural Networks

Understanding overfitting, regularization, and hyperparameters.

This notebook covers the **theory** of training. You'll apply these concepts in [04_lab_regularization](04_lab_regularization.ipynb).

---

## The Training Loop

Training a neural network is an iterative process:

```
┌──────────────────────────────────────────────────────────────┐
│                      TRAINING LOOP                           │
│                                                              │
│   for each epoch:                                            │
│       for each batch:                                        │
│           1. Forward pass  → compute predictions             │
│           2. Compute loss  → measure error                   │
│           3. Backward pass → compute gradients               │
│           4. Update weights → gradient descent               │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

**Epoch:** One complete pass through the entire training dataset  
**Batch:** A subset of training samples processed together
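
The four steps in the diagram can be sketched in plain NumPy. This is a minimal illustration, not a production loop: the synthetic data (`y = 3x + 1` plus noise), the one-weight linear model, and the values of `lr`, `batch_size`, and `n_epochs` are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = 3x + 1 plus a little noise (illustrative only)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=200)

w, b = 0.0, 0.0          # model parameters
lr = 0.1                 # learning rate
batch_size = 32
n_epochs = 20

for epoch in range(n_epochs):
    order = rng.permutation(len(X))           # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]

        pred = w * xb + b                     # 1. forward pass
        error = pred - yb
        loss = np.mean(error ** 2)            # 2. compute loss (MSE)
        grad_w = 2 * np.mean(error * xb)      # 3. backward pass (analytic gradients)
        grad_b = 2 * np.mean(error)
        w -= lr * grad_w                      # 4. update weights
        b -= lr * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")
```

After 20 epochs the learned `w` and `b` should land close to the true values of 3 and 1. In a real framework the backward pass is computed automatically rather than by hand.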

---

## Overfitting vs Underfitting

The fundamental challenge in machine learning: **generalization**.

| Condition | Training Loss | Validation Loss | What's happening |
|-----------|--------------|-----------------|------------------|
| **Underfitting** | High | High | Model too simple |
| **Good fit** | Low | Low | Model generalizes well |
| **Overfitting** | Very low | High | Model memorized training data |

In [None]:
import numpy as np
import matplotlib.pyplot as plt

epochs = np.arange(1, 51)

# Good fit
train_loss_good = 1.0 * np.exp(-0.1 * epochs) + 0.1
val_loss_good = 1.0 * np.exp(-0.08 * epochs) + 0.15

# Overfitting
train_loss_overfit = 1.0 * np.exp(-0.15 * epochs) + 0.02
val_loss_overfit = 1.0 * np.exp(-0.08 * epochs) + 0.1 + 0.01 * epochs

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(epochs, train_loss_good, 'b-', label='Train', linewidth=2)
axes[0].plot(epochs, val_loss_good, 'r-', label='Validation', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Good Fit', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(epochs, train_loss_overfit, 'b-', label='Train', linewidth=2)
axes[1].plot(epochs, val_loss_overfit, 'r-', label='Validation', linewidth=2)
axes[1].axvline(x=epochs[np.argmin(val_loss_overfit)], color='green', linestyle='--', label='Stop here!', alpha=0.7)  # epoch with the lowest validation loss
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Loss')
axes[1].set_title('Overfitting', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

**Key insight:** Watch the **validation loss**, not just the training loss!

When validation loss starts rising while training loss keeps falling, the model is overfitting.
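
One way to read this off the curves numerically is to find the epoch where validation loss is lowest and compare the train/validation gap there with the gap at the end. The sketch below reuses the synthetic overfitting curves from the plot; the variable names `best_epoch` and `gap` are only illustrative.

```python
import numpy as np

epochs = np.arange(1, 51)
# Same illustrative curves as the overfitting plot above
train_loss = 1.0 * np.exp(-0.15 * epochs) + 0.02
val_loss = 1.0 * np.exp(-0.08 * epochs) + 0.1 + 0.01 * epochs

best_epoch = int(epochs[np.argmin(val_loss)])   # where validation loss bottoms out
gap = val_loss - train_loss                     # generalization gap per epoch

print(f"Validation loss is lowest at epoch {best_epoch}")
print(f"Gap at best epoch: {gap[best_epoch - 1]:.3f}, gap at epoch 50: {gap[-1]:.3f}")
```

For these synthetic curves the minimum falls at epoch 26, and the gap keeps widening afterwards even though training loss is still improving.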

---

## Regularization Techniques

Regularization **reduces overfitting** by constraining the model.

### 1. L2 Regularization (Weight Decay)

Add a penalty for large weights, scaled by a regularization strength $\lambda$:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \sum_i w_i^2$$

**Effect:** Keeps weights small, prevents the model from relying too heavily on any single feature.
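
A small sketch of how the penalty enters the loss and the gradient. The helper `l2_penalty`, the value of `lam`, and the placeholder data loss and data gradient are assumptions made for illustration; the data gradient is set to zero so that the effect of the penalty alone is visible.

```python
import numpy as np

def l2_penalty(weights, lam):
    """Return lam * sum(w_i^2) and its gradient 2 * lam * w."""
    penalty = lam * np.sum(weights ** 2)
    grad = 2 * lam * weights
    return penalty, grad

w = np.array([0.5, -2.0, 3.0])
lam = 0.01                      # regularization strength (hypothetical value)
data_loss = 0.40                # placeholder for the data loss on some batch
data_grad = np.zeros_like(w)    # placeholder: pretend the data gradient is zero

penalty, penalty_grad = l2_penalty(w, lam)
total_loss = data_loss + penalty

# With a zero data gradient, one gradient descent step just shrinks every weight toward 0
lr = 0.1
w_new = w - lr * (data_grad + penalty_grad)
print(total_loss, w_new)
```

Each step multiplies the weights by `1 - 2 * lr * lam`, which is why this penalty is also called weight decay when used with plain gradient descent.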

### 2. Dropout

Randomly "turn off" neurons during training. At inference time, dropout is disabled and every neuron is active.

```
TRAINING (dropout = 0.5)
─────────────────────────

    ○───○───○        ○───✗───○        ○───○───✗
    │   │   │        │       │        │   │    
    ○───○───○   →    ✗───○───○   →    ○───✗───○
    │   │   │        │   │   │        │       │
    ○───○───○        ○───○───✗        ✗───○───○
    
    Full network     Batch 1          Batch 2
                     (random drop)    (random drop)
```

**Effect:** Forces the network to learn redundant representations. No neuron can be relied upon exclusively.
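
The mechanism can be sketched as a function on an activation array. This example uses the "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at inference; the function name `dropout` and its signature are hypothetical, not a library API.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of activations while training."""
    if not training or rate == 0.0:
        return x                              # inference: all neurons active
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob    # True = keep this neuron
    return x * mask / keep_prob               # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
activations = np.ones((4, 1000))

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
eval_out = dropout(activations, rate=0.5, training=False, rng=rng)

print("fraction dropped:", np.mean(train_out == 0))   # roughly 0.5
print("mean during training:", train_out.mean())      # roughly 1.0
print("unchanged at inference:", np.array_equal(eval_out, activations))
```

A fresh random mask is drawn on every call, which corresponds to the different drop patterns shown for Batch 1 and Batch 2 in the diagram.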

### 3. Early Stopping

Stop training when validation loss stops improving.

**How it works:**
- Monitor validation loss each epoch
- If it doesn't improve for `patience` epochs, stop
- Restore the best weights

This is the simplest regularization technique and often one of the most effective: it only requires tracking validation loss.
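
The patience logic described above can be sketched as a function over a list of per-epoch validation losses. The function name `early_stopping_epoch` and the sample losses are made up for illustration; in a real training loop the same check runs once per epoch and the model weights are saved whenever a new best loss is seen.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch    # here you would also save the weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch           # stop; restore weights from best_epoch
    return len(val_losses), best_epoch             # never triggered: ran all epochs

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)   # 7 4
```

Here the loss last improves at epoch 4, three epochs pass without a new best, and training stops at epoch 7 with the epoch-4 weights restored.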

### 4. Batch Normalization

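Normalize the inputs to each layer using the mean $\mu$ and variance $\sigma^2$ of the current mini-batch ($\epsilon$ is a small constant that avoids division by zero):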

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Then scale and shift with learnable parameters:

$$y = \gamma \hat{x} + \beta$$

**Effects:**
- Faster training (can use higher learning rates)
- Some regularization effect
- Reduces sensitivity to weight initialization
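
The two formulas above translate almost line for line into NumPy. This is a sketch of the training-mode forward pass only: it assumes inputs of shape `(batch, features)` and leaves out the running statistics that frameworks keep for use at inference. The function name `batch_norm_forward` is hypothetical.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for x of shape (batch, features)."""
    mu = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=4.0, size=(64, 3))

gamma = np.ones(3)     # learnable scale, at its usual initial value
beta = np.zeros(3)     # learnable shift, at its usual initial value
y = batch_norm_forward(x, gamma, beta)

print(y.mean(axis=0).round(3))   # approximately 0 per feature
print(y.std(axis=0).round(3))    # approximately 1 per feature
```

Whatever the scale of the incoming activations, the next layer sees inputs with roughly zero mean and unit variance until training moves `gamma` and `beta` away from their initial values.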

### Summary: Regularization Techniques

| Technique | What it does | When to use |
|-----------|--------------|-------------|
| **L2 regularization** | Penalizes large weights | Always a good default |
| **Dropout** | Randomly disables neurons | Dense layers, overfitting |
| **Early stopping** | Stops training at best point | Always use! |
| **Batch normalization** | Normalizes layer inputs | Deep networks |

---

## Hyperparameters

**Hyperparameters** are settings you choose *before* training. They're not learned from data.

### Learning Rate

Controls how big the weight updates are. With learning rate $\eta$, each step moves the weights against the gradient:

$$w_{\text{new}} = w_{\text{old}} - \eta \cdot \nabla \mathcal{L}$$

In [None]:
def simulate_training(lr, n_steps=50):
    w = 5.0  # Start far from optimal (w=0)
    history = [w]
    for _ in range(n_steps):
        gradient = 2 * w  # Gradient of w^2
        w = w - lr * gradient
        history.append(w)
    return history

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
# On this loss each step multiplies w by (1 - 2 * lr), so lr > 0.5 overshoots the optimum
learning_rates = [0.01, 0.1, 0.95]
titles = ['Too Small (slow)', 'Good', 'Too Large (oscillates)']
colors = ['orange', 'green', 'red']

for ax, lr, title, color in zip(axes, learning_rates, titles, colors):
    history = simulate_training(lr)
    ax.plot(history, linewidth=2, color=color)
    ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5, label='Optimal')
    ax.set_xlabel('Step')
    ax.set_ylabel('Weight (w)')
    ax.set_title(f'{title}\nLR = {lr}', fontsize=12)
    ax.legend()  # show the 'Optimal' label of the dashed line
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-6, 6)

plt.tight_layout()
plt.show()

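**Rule of thumb:** Start with `0.001` (a common default for adaptive optimizers such as Adam) and adjust based on training curves; the best value depends on the model, the optimizer, and the data.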
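
For the quadratic loss $\mathcal{L}(w) = w^2$ used in the simulation above, the update rule can be worked out in closed form, which shows exactly when a step size is too large. This derivation applies to that toy loss only; real loss surfaces have no single threshold like this.

$$w_{t+1} = w_t - \eta \cdot 2 w_t = (1 - 2\eta)\, w_t \quad\Longrightarrow\quad w_t = (1 - 2\eta)^t\, w_0$$

The weight converges to the optimum $w = 0$ only when $|1 - 2\eta| < 1$, that is, $0 < \eta < 1$:

- $0 < \eta < 0.5$: the factor is positive, so $w$ shrinks steadily (very slowly when $\eta$ is tiny)
- $\eta = 0.5$: the factor is zero, so a single step lands on the optimum
- $0.5 < \eta < 1$: the factor is negative, so $w$ overshoots and flips sign every step while its magnitude shrinks
- $\eta \geq 1$: the magnitude no longer shrinks, so training never converges, and for $\eta > 1$ it diverges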

### Batch Size

Number of samples processed before updating weights.

| Batch Size | Pros | Cons |
|------------|------|------|
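| **Small** (16-32) | Often generalizes better (gradient noise acts as a mild regularizer), less memory | Slower per epoch, noisier updates |
| **Large** (128-512) | Faster per epoch, smoother gradient estimates | May generalize worse, needs more memory |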

**Common choices:** 32, 64, 128
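
Batch size also determines how many weight updates happen in each epoch. The sketch below, which assumes a hypothetical dataset of 1000 samples and a helper named `iterate_batches`, shuffles the sample indices and cuts them into batches.

```python
import math
import numpy as np

def iterate_batches(n_samples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset once (one epoch)."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]   # last batch may be smaller

n_samples = 1000   # hypothetical dataset size
rng = np.random.default_rng(0)

for batch_size in (32, 64, 128):
    batches = list(iterate_batches(n_samples, batch_size, rng))
    assert len(batches) == math.ceil(n_samples / batch_size)
    print(f"batch_size={batch_size:>3}: {len(batches)} updates per epoch, "
          f"last batch has {len(batches[-1])} samples")
```

With 1000 samples, a batch size of 32 gives 32 updates per epoch while a batch size of 128 gives only 8, which is one reason small batches take longer per epoch but update the weights more often.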

### Number of Epochs

How many times to iterate over the training data.

**Best practice:** Use **early stopping** instead of a fixed number. Set a high max (e.g., 100) and let early stopping decide.

### Network Architecture

| Hyperparameter | Effect of increasing |
|----------------|----------------------|
| **Number of layers** | More complex patterns, risk of overfitting |
| **Neurons per layer** | Higher capacity, more parameters |
| **Dropout rate** | More regularization, may underfit |
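
To make "more parameters" concrete, the number of trainable parameters in a fully connected network can be counted directly from the layer widths. The helper `count_dense_parameters` and the example architectures are hypothetical, and the count assumes every layer has a bias vector.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, e.g. [784, 128, 10].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out    # weight matrix + bias vector
    return total

# Hypothetical architectures for a 784-input, 10-class problem
print(count_dense_parameters([784, 128, 10]))         # 101770
print(count_dense_parameters([784, 256, 10]))         # 203530
print(count_dense_parameters([784, 128, 128, 10]))    # 118282
```

In this example, doubling the width of the hidden layer roughly doubles the parameter count, while adding a second 128-unit layer adds far fewer parameters because the large input layer dominates.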

---

## Train / Validation / Test Split

Always split your data into three sets:

```
┌─────────────────────────────────────────────────────────────┐
│                        ALL DATA                             │
├─────────────────────────────┬───────────────┬───────────────┤
│         TRAINING            │  VALIDATION   │     TEST      │
│           (70%)             │    (15%)      │    (15%)      │
│                             │               │               │
│    Used to update           │  Used to      │  Final        │
│    weights                  │  tune         │  evaluation   │
│                             │  hyperparams  │  (touch once) │
└─────────────────────────────┴───────────────┴───────────────┘
```

**Important:** Never use the test set to make decisions during training!
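
A 70/15/15 split like the one in the diagram can be sketched by shuffling the sample indices once and slicing them. The function name `train_val_test_split`, its default fractions, and the fixed seed are assumptions for illustration; a purely random split like this is only appropriate when the samples are independent of each other.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Return shuffled, non-overlapping index arrays for the three sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_val = int(round(val_frac * n_samples))
    n_test = int(round(test_frac * n_samples))
    n_train = n_samples - n_val - n_test
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))   # 700 150 150
```

Fixing the seed keeps the split reproducible, so the test set stays the same across experiments and is never mixed into training or validation.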

---

## Key Takeaways

1. **Watch validation loss**: it tells you if you're overfitting
2. **Use early stopping**: the simplest and often one of the most effective forms of regularization
3. **Start with a small learning rate**: 0.001 is a good default
4. **Dropout** helps with overfitting in dense layers
5. **Batch normalization** speeds up training and adds regularization
6. **Keep a test set** that you only touch at the very end

**Next:** Apply these techniques in [04_lab_regularization](04_lab_regularization.ipynb)