
# MLF Week 4: Neural Networks Part 2 - Training

This notebook accompanies the **MLF Week 4 Slides** and extends Week 3 by focusing on **how neural networks learn**. We connect **backpropagation**, **autograd**, and a clean **training loop**, then compare **SGD** and **Adam** and run small **hyperparameter** and **architecture** experiments.

**You will:**
- Describe backpropagation at a high level (no derivations).
- Use **PyTorch autograd** to obtain gradients automatically.
- Implement the standard **training loop** (forward → loss → backward → step).
- Compare **SGD vs Adam**, tune **learning rate**, and read **loss curves**.
- Try different **depth/width** settings on the same 2D dataset from Week 3.



## 0. Setup

We’ll use the same helper utilities and data setup as in previous weeks.

In [None]:
from utils import *

import math, random, os, sys, time
from typing import Tuple, List

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

import matplotlib.pyplot as plt

# Fix the random seeds so runs are repeatable
torch.manual_seed(0)
random.seed(0)
np.random.seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device



## 1. Backpropagation (Intuition)

**Big picture.** A neural network is a chain of simple computations. The **forward pass** combines inputs with weights to make a prediction and compute a single **loss** value. To learn, we need to know **how changing each weight changes the loss**.

**Backpropagation** applies the **chain rule** backwards through this chain, starting from the loss. Gradients for the layers nearest the output are computed first and then reused to compute the gradients of earlier layers, so each layer's signal reflects how much it contributed to the error. The result is one gradient value per parameter.

We rarely compute these derivatives by hand. Instead, we rely on **automatic differentiation** (autograd). PyTorch records operations during the forward pass and, when we call `loss.backward()`, it traverses the graph and fills in the gradients for us.
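
To make the chain rule concrete, here is a minimal sketch on a two-weight "network". The values of `x`, `target`, `w1`, and `w2` are arbitrary choices for illustration and are not part of the course material. The gradients are computed once with autograd and once by hand, working from the loss backwards, and the two should agree.

```python
import torch

# Tiny "network": one input, one hidden unit, one output, squared-error loss.
x = torch.tensor(1.5)
target = torch.tensor(2.0)
w1 = torch.tensor(0.8, requires_grad=True)
w2 = torch.tensor(-0.5, requires_grad=True)

# Forward pass
h = torch.tanh(w1 * x)        # hidden activation
y_hat = w2 * h                # prediction
loss = (y_hat - target) ** 2  # single scalar loss

# Backward pass via autograd: fills w1.grad and w2.grad
loss.backward()

# The same gradients via the chain rule by hand, from the output backwards
with torch.no_grad():
    dloss_dyhat = 2 * (y_hat - target)        # d loss / d y_hat
    manual_dw2 = dloss_dyhat * h              # y_hat = w2 * h
    dloss_dh = dloss_dyhat * w2               # signal passed back to the hidden unit
    manual_dw1 = dloss_dh * (1 - h ** 2) * x  # d tanh(z)/dz = 1 - tanh(z)^2

print("autograd:", w1.grad.item(), w2.grad.item())
print("manual:  ", manual_dw1.item(), manual_dw2.item())
```

Note how `dloss_dyhat` is computed once and reused for both weights; that reuse is what makes backpropagation efficient on deep networks.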



### 1.1 Autograd mini demo

This tiny example shows the mechanism on a simple scalar function.

**Task**
1. Create a tensor with `requires_grad=True`.
2. Build a simple expression from it.
3. Call `.backward()` on the result.
4. Inspect `.grad` on the input tensor.

**Why this matters.** If this works for a scalar function, the same idea scales to millions of parameters in a neural network.


In [None]:
# TODO: Create a small scalar example that uses autograd.
# Example steps:
# x = torch.tensor([2.0], requires_grad=True)
# y = (3*x + 2)**2 / 2
# y.backward()
# print("dy/dx:", x.grad.item())



**Common Mistakes**
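- Forgetting `requires_grad=True`: no gradient is tracked for that tensor, so its `.grad` stays `None`.
- Calling `.backward()` more than once on the same graph without `retain_graph=True`: PyTorch frees the graph's saved values after the first call, so the second call typically raises a `RuntimeError`.
- Using `.item()` on a tensor and expecting backprop through the resulting Python float: `.item()` leaves the graph, so no gradient flows through it.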
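
The three mistakes above can be reproduced in a few lines. This is a small sketch with made-up tensors (`a`, `x`, `w` are illustrative names), meant only to show what each failure looks like.

```python
import torch

# 1) Without requires_grad=True, nothing is tracked and .grad stays None.
a = torch.tensor([2.0])
_ = (3 * a + 2) ** 2
print("a.grad:", a.grad)

# 2) A second backward() on the same graph fails unless the graph was retained.
x = torch.tensor([2.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()
try:
    y.backward()
    second_backward_failed = False
except RuntimeError as err:
    second_backward_failed = True
    print("second backward raised:", type(err).__name__)

# 3) .item() returns a plain Python float that is no longer part of the graph.
w = torch.tensor([3.0], requires_grad=True)
value = (w * 2).item()
print("type after .item():", type(value))
```

If a value is only needed for logging or plotting, `.item()` is the right tool; if it must stay differentiable, keep it as a tensor.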

## 2. Data

We’ll use the **two-moons** dataset for binary classification.  
It’s a simple 2D dataset with non-linear boundaries, making it ideal for visualizing how neural networks learn complex patterns.


In [None]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

def make_toy_data(n_samples=1200, noise=0.2, test_size=0.2, seed=0):
    X, y = make_moons(n_samples=n_samples, noise=noise, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    return (torch.tensor(X_train, dtype=torch.float32),
            torch.tensor(y_train, dtype=torch.long),
            torch.tensor(X_val,   dtype=torch.float32),
            torch.tensor(y_val,   dtype=torch.long))

X_train, y_train, X_val, y_val = make_toy_data()
train_ds = TensorDataset(X_train, y_train)
val_ds   = TensorDataset(X_val, y_val)

BATCH_SIZE = 64
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
val_loader   = DataLoader(val_ds,   batch_size=BATCH_SIZE)

len(train_ds), len(val_ds)


In [None]:
# Plotting the data
plt.figure(figsize=(10, 5))
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Two Moons Dataset')
plt.show()


## 3. Model (MLP)

We reuse the Week 3 **MLP**: a stack of `Linear` layers with **ReLU**. This model is expressive enough to learn a curved decision boundary but still simple to read.

> Keeping the model familiar lets us focus on training details and optimizer behavior.
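
As a quick orientation, the sketch below builds the same layer layout as the default `MLP` in this section (2 → 32 → 32 → 2) directly with `nn.Sequential`, then counts parameters and checks the output shape. The variable names `net`, `n_params`, and `logits` are illustrative only.

```python
import torch
import torch.nn as nn

# Same layout as the default MLP: 2 inputs, two hidden layers of 32, 2 output logits
net = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 2),
)

# Each Linear(in, out) holds in*out weights plus out biases:
# (2*32 + 32) + (32*32 + 32) + (32*2 + 2) = 96 + 1056 + 66 = 1218
n_params = sum(p.numel() for p in net.parameters())
print("trainable parameters:", n_params)

# A batch of 5 two-dimensional points yields one row of 2 logits per point
logits = net(torch.randn(5, 2))
print("logits shape:", tuple(logits.shape))
```

Every one of these parameters receives its own gradient on each backward pass.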


In [None]:
class MLP(nn.Module):
    def __init__(self, in_dim=2, hidden_sizes=(32, 32), out_dim=2):
        super().__init__()
        layers = []
        last = in_dim
        for h in hidden_sizes:
            layers += [nn.Linear(last, h), nn.ReLU()]
            last = h
        layers += [nn.Linear(last, out_dim)]
        self.net = nn.Sequential(*layers)
    def forward(self, x):
        return self.net(x)

model = MLP().to(device)
model



## 4. Training loop

We will use **`nn.CrossEntropyLoss`** for classification; it takes raw logits and integer class labels, so the model needs no final softmax layer. For each mini-batch:

1. **Zero** old gradients (`optimizer.zero_grad()`).
2. **Forward**: pass inputs through the model to get `logits`.
3. **Loss**: compare logits against labels.
4. **Backward**: `loss.backward()` computes all gradients.
5. **Step**: `optimizer.step()` updates parameters.

We also compute **validation loss** and **accuracy** per epoch to monitor learning and detect over/underfitting.
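
The five steps can be sketched on a single fake mini-batch. This is a minimal illustration, not the solution to the exercise below: `toy_model`, the random batch `xb`/`yb`, and the learning rate are stand-ins chosen for the example, and the same batch is reused so the loss can be seen falling.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and data (assumptions for this sketch only)
toy_model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

xb = torch.randn(16, 2)           # fake inputs, shape (batch, features)
yb = torch.randint(0, 2, (16,))   # fake integer class labels, dtype int64

losses = []
for _ in range(20):
    optimizer.zero_grad()          # 1. zero old gradients
    logits = toy_model(xb)         # 2. forward
    loss = criterion(logits, yb)   # 3. loss
    loss.backward()                # 4. backward
    optimizer.step()               # 5. step
    losses.append(loss.item())     # .item() is fine here: logging only

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

In a real epoch the same five lines sit inside a loop over `train_loader`, with each batch moved to `device` first.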



### 4.1 Implement helpers

Fill in `accuracy()` and `train_model()`. Keep the code clear and minimal. Return curves so you can plot them later.


In [None]:
# TODO: Implement accuracy() and train_model().

def accuracy(model, loader, device=device):
    # Compute classification accuracy over a DataLoader.
    # Hint: pred = logits.argmax(dim=1); compare to yb; average across dataset.
    pass

def train_model(model, train_loader, val_loader, epochs=50, lr=0.05, optimizer_name='sgd', device=device):
    # Create optimizer (SGD or Adam). For each epoch, loop over batches:
    # zero_grad → forward → loss → backward → step. Track train/val loss and val acc.
    pass



**Debug checklist**
- If the loss is not decreasing at all, try a different LR, usually a lower one first (e.g., 0.1 → 0.05 or 0.01).
- If the loss becomes `nan`, reduce the LR and print a single batch to check labels and shapes.
- If validation accuracy is stuck at ~50% (chance level for two balanced classes), increase capacity a bit or try Adam with a smaller LR.
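
Checking a single batch might look like the sketch below. It uses a stand-in dataset (`demo_ds`, `demo_loader`) with the same layout as this notebook's data so that it runs on its own; in the notebook you would pull the batch from `train_loader` instead.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: float32 features of shape (N, 2) and int64 labels of shape (N,)
demo_ds = TensorDataset(torch.randn(100, 2), torch.randint(0, 2, (100,)))
demo_loader = DataLoader(demo_ds, batch_size=64, shuffle=True)

xb, yb = next(iter(demo_loader))   # grab exactly one batch

print("inputs:", tuple(xb.shape), xb.dtype)    # expect (batch, 2) and float32
print("labels:", tuple(yb.shape), yb.dtype)    # expect (batch,) and int64
print("label values:", yb.unique().tolist())   # should be valid class indices
inputs_finite = torch.isfinite(xb).all().item()
print("all inputs finite:", inputs_finite)     # nan/inf inputs lead to nan loss
```

`nn.CrossEntropyLoss` with class-index targets expects integer labels of shape `(batch,)`, so float labels or an extra trailing dimension are common causes of errors here.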



## 5. Baseline run (SGD)

Start with **SGD** and a reasonable learning rate. Plot **train vs val loss** to check that the model is learning and not overfitting immediately.


In [None]:
# TODO: Train once with SGD and plot loss curves.
EPOCHS = 50
LR = 0.05
OPT = 'sgd'  # 'sgd' or 'adam'

model = MLP().to(device)
# out = train_model(model, train_loader, val_loader, epochs=EPOCHS, lr=LR, optimizer_name=OPT, device=device)

# plt.figure()
# plt.plot(out["train_losses"], label="train")
# plt.plot(out["val_losses"], label="val")
# plt.xlabel("Epoch"); plt.ylabel("Loss"); plt.title(f"Loss ({OPT}, lr={LR})")
# plt.legend(); plt.show()
# print("Validation accuracy:", round(out["val_accs"][-1], 4))



**Reflect**
- Does the **validation loss** follow the training loss downwards?
- If training loss falls but validation loss climbs, you may be **overfitting** (reduce epochs or capacity).
- If both losses are flat, try a different LR or optimizer.
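
One way to act on the overfitting signal is to find where the validation loss stopped improving. The helper below is a hypothetical sketch (`early_stop_epoch` and its `patience` parameter are not part of the notebook's utilities), and the `toy_val` curve is made-up numbers for illustration, not a result from this model.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch).

    stop_epoch is the first epoch at which `patience` epochs have passed
    without a new best validation loss, or None if that never happens.
    """
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return None, best_epoch

# Made-up validation curve: improves until epoch 3, then climbs
toy_val = [0.70, 0.55, 0.45, 0.40, 0.42, 0.44, 0.47, 0.50]
stop_at, best_at = early_stop_epoch(toy_val, patience=3)
print("best epoch:", best_at, "| stop at epoch:", stop_at)
```

On this toy curve the best epoch is 3 and the helper signals a stop at epoch 6. Applied to your own `val_losses` list, the best epoch suggests how many epochs were actually useful.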


In [None]:
# Plot the decision boundary
def plot_decision_boundary(model, X, y):
    # y is unused here; the caller overlays the labelled points afterwards.
    # float() turns 0-dim tensors into plain numbers for np.linspace.
    x_min, x_max = float(X[:, 0].min()) - 0.5, float(X[:, 0].max()) + 0.5
    y_min, y_max = float(X[:, 1].min()) - 0.5, float(X[:, 1].max()) + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    with torch.no_grad():
        # Stack the grid coordinates into an (N, 2) float32 tensor on the model's device
        Xgrid = torch.stack([torch.from_numpy(xx.ravel()), torch.from_numpy(yy.ravel())], dim=1).float()
        Xgrid = Xgrid.to(device)
        logits = model(Xgrid)
        # Softmax over the two logits; column 1 is the probability of class 1
        probs = torch.softmax(logits, dim=1)[:, 1]
        # Back to CPU and NumPy, reshaped to the grid, for matplotlib
        Z = probs.cpu().numpy().reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    plt.colorbar(label='Probability of Class 1')

plot_decision_boundary(model, X_train, y_train)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis')
plt.xlabel('Feature 1'); plt.ylabel('Feature 2'); plt.title('Decision Boundary'); plt.show()



## 6. Optimizers (SGD vs Adam)

- **SGD** uses one global learning rate for all parameters. It often needs careful LR tuning but can generalize well.
- **Adam** adapts the step size per parameter using moving averages of the gradients and of their squares; it often **converges faster** with a smaller LR (e.g., `1e-3`).

We will compare a few settings and plot the curves.
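
Before running full comparisons, the difference between the two update rules can be seen on a single parameter. In this sketch the helper `one_step` is hypothetical, and the gradient is assigned by hand purely for demonstration (in training it would come from `loss.backward()`). Both optimizers use their default settings apart from `lr`.

```python
import torch

def one_step(opt_cls, grad_value, lr):
    """Apply one optimizer step to a parameter that starts at 0 with a fixed gradient."""
    p = torch.zeros(1, requires_grad=True)
    opt = opt_cls([p], lr=lr)
    p.grad = torch.tensor([grad_value])  # set by hand for this demo only
    opt.step()
    return p.item()                      # how far the parameter moved

for g in (0.01, 1.0, 100.0):
    sgd_move = one_step(torch.optim.SGD, g, lr=0.1)
    adam_move = one_step(torch.optim.Adam, g, lr=0.1)
    print(f"grad={g:>6}: SGD moves {sgd_move:+.4f}, Adam moves {adam_move:+.4f}")
```

Plain SGD moves by `lr * grad`, so the step scales with the gradient. Adam divides by a running estimate of the gradient's magnitude, so its first step has size close to `lr` whether the gradient is tiny or huge. This is one reason Adam is usually run with a smaller LR than SGD.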


In [None]:
# TODO: Compare a few settings and plot validation loss for each.
# Example:
# settings = [('sgd', 0.01, 30), ('sgd', 0.05, 30), ('adam', 0.001, 30), ('adam', 0.01, 30)]
# plt.figure()
# for opt, lr, epochs in settings:
#     model = MLP().to(device)
#     out = train_model(model, train_loader, val_loader, epochs=epochs, lr=lr, optimizer_name=opt, device=device)
#     plt.plot(out["val_losses"], label=f"{opt}, lr={lr}")
# plt.xlabel("Epoch"); plt.ylabel("Val Loss"); plt.title("Val Loss across settings"); plt.legend(); plt.show()



**Interpretation tips**
- The **lower** the validation curve and the **faster** it drops, the better the setting.
- Very noisy or exploding curves often indicate an LR that is **too high**.
- If the curve descends slowly, increase epochs or try a slightly **larger** LR.



## 7. Practical tips

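- **Initialization.** PyTorch's default `nn.Linear` initialization is a good starting point for small ReLU MLPs like this one; no custom initialization is needed here.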
- **Batch size.** 32–128 are reasonable; smaller batches add gradient noise that can help generalization.
- **Learning rate.** The most sensitive hyperparameter. Start around `1e-3` for Adam, `1e-2`–`1e-1` for SGD.
- **Overfitting.** Compare **train vs val** loss across epochs; consider early stopping when val loss rises.
- **Sanity checks.** Overfit a tiny subset (e.g., 100 samples) to verify your loop works.
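
The overfit-a-tiny-subset check could be sketched as follows. Everything here is a stand-in chosen for the example: 20 random points with random labels (`X_tiny`, `y_tiny`), a small `tiny_model`, Adam with `lr=1e-2`, and 500 full-batch steps. With random labels there is nothing to generalize, so any drop in loss comes from memorization, which is exactly what this check looks for.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in tiny dataset: 20 random 2D points with random binary labels
X_tiny = torch.randn(20, 2)
y_tiny = torch.randint(0, 2, (20,))

tiny_model = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),
)
criterion = nn.CrossEntropyLoss()
opt = torch.optim.Adam(tiny_model.parameters(), lr=1e-2)

first_loss = None
for step in range(500):
    opt.zero_grad()
    loss = criterion(tiny_model(X_tiny), y_tiny)
    loss.backward()
    opt.step()
    if first_loss is None:
        first_loss = loss.item()   # loss before any update was applied

with torch.no_grad():
    train_acc = (tiny_model(X_tiny).argmax(dim=1) == y_tiny).float().mean().item()

print(f"loss: {first_loss:.4f} -> {loss.item():.4f}, train accuracy: {train_acc:.2f}")
```

If the loss on a handful of points refuses to fall, the problem is more likely in the loop (missing `zero_grad()`, wrong label dtype, detached tensors) than in the hyperparameters.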



## 8. Conclusion

In this notebook, you trained a simple neural network end-to-end using **PyTorch’s autograd** and **optimizers**.  
You saw how a forward pass, loss calculation, backward pass, and optimizer step work together to make the model learn.