# 09 — Losses, Optimizers, and Learning-Rate Schedulers
**Goal:** understand *why* these knobs exist and how to pick sane defaults without heavy math.

**We’ll cover:**
- Loss functions (MSE, CrossEntropy) and when to use them
- Optimizers (SGD, Adam, AdamW): intuition + trade-offs
- Learning-rate schedules (StepLR, CosineAnnealing, OneCycle)
- Practical defaults you can memorize


## 0) Setup

In [None]:
import torch, math
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD, Adam, AdamW
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, OneCycleLR


## 1) Loss function chooser
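- **Regression** (predict a number): use `MSELoss`; predictions and targets must have the same shape.
- **Classification** with integer labels: use `CrossEntropyLoss`. It expects raw logits, so do not apply softmax first.
- **Multi-label** (many independent yes/no): use `BCEWithLogitsLoss`, again on raw logits, with float 0/1 targets.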
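
As a quick sanity check of the classification case, here is a minimal sketch showing that `CrossEntropyLoss` works on raw logits: it should match `log_softmax` followed by `nll_loss`. The shapes, seed, and target values are illustrative assumptions, not part of the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability

logits = torch.randn(4, 3)            # raw scores: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # integer class indices, shape (4,)

ce = nn.CrossEntropyLoss()(logits, targets)

# The same loss computed by hand: log-softmax, then negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print('CrossEntropyLoss      :', ce.item())
print('log_softmax + nll_loss:', manual.item())
```

The two printed values should agree up to floating-point error, which is why adding your own softmax layer before this loss would apply softmax twice.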


In [None]:
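# Classification: raw logits (batch, classes) vs. integer class indices (batch,)
logits = torch.randn(16, 10)
y_cls  = torch.randint(0, 10, (16,))
print('CrossEntropy:', nn.CrossEntropyLoss()(logits, y_cls).item())

# Regression: prediction and target have the same shape (batch, 1)
pred   = torch.randn(16, 1)
y_reg  = torch.randn(16, 1)
print('MSE:', nn.MSELoss()(pred, y_reg).item())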
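
The multi-label case is not exercised in the cell above, so here is a small sketch of it. It assumes a made-up batch of 4 samples with 3 independent yes/no labels, and compares `BCEWithLogitsLoss` against the sigmoid-then-`BCELoss` route.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

logits = torch.randn(4, 3)  # 4 samples, 3 independent yes/no labels
# Float multi-hot targets with the same shape as the logits (illustrative values)
targets = torch.tensor([[1., 0., 1.],
                        [0., 0., 1.],
                        [1., 1., 0.],
                        [0., 1., 0.]])

bce = nn.BCEWithLogitsLoss()(logits, targets)

# Equivalent but less numerically stable route: sigmoid first, then plain BCE
manual = nn.BCELoss()(torch.sigmoid(logits), targets)

print('BCEWithLogitsLoss :', bce.item())
print('sigmoid + BCELoss :', manual.item())
```

Note that the targets are floats, unlike the integer class indices that `CrossEntropyLoss` expects.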


## 2) Optimizers in one paragraph each
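- **SGD**: like walking downhill with short, straight steps, using one global learning rate. It needs more tuning (learning rate, momentum) but is very solid once tuned.
- **Adam**: adapts the step size per parameter; it reaches good results quickly, though it sometimes generalizes worse than well-tuned SGD.
- **AdamW**: Adam with decoupled weight decay, meaning the decay shrinks the weights directly instead of being added to the gradient (preferred default for many tasks).

**Rule of thumb:** start with `AdamW(lr=1e-3, weight_decay=1e-2)` for small models.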
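
To make the Adam vs. AdamW difference concrete, the sketch below takes a single optimizer step on one weight whose loss gradient is zero, so the only thing moving the weight is weight decay. The helper `one_step` and the exaggerated `lr=0.1, weight_decay=0.1` values are assumptions chosen for illustration, not recommended settings.

```python
import torch
from torch.optim import Adam, AdamW

def one_step(opt_cls):
    """Take one optimizer step on a single weight whose loss gradient is zero."""
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = opt_cls([w], lr=0.1, weight_decay=0.1)
    w.grad = torch.zeros_like(w)  # pretend the loss contributes no gradient
    opt.step()
    return w.item()

w_adam = one_step(Adam)    # decay is folded into the gradient, then rescaled by Adam
w_adamw = one_step(AdamW)  # decay shrinks the weight directly: w * (1 - lr * weight_decay)

print('Adam :', w_adam)
print('AdamW:', w_adamw)
```

With these settings Adam moves the weight by roughly a full learning-rate step (to about 0.9), because its per-parameter normalization rescales the decay term, while AdamW only shrinks it to about 0.99.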


## 3) Learning-rate schedules
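- **StepLR**: multiply the LR by `gamma` every `step_size` epochs → stable, simple.
- **CosineAnnealingLR**: smooth cosine-shaped decay over `T_max` steps → often works slightly better.
- **OneCycleLR**: warms up to `max_lr`, then anneals; good for faster convergence. It needs `max_lr` and the total number of steps, and is stepped after every batch.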
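
To see the shape of the first two schedules, the sketch below records the learning rate per epoch on a dummy parameter. The helper `trace` and the small `step_size`, `gamma`, and `T_max` values are illustrative assumptions, picked so the pattern is visible in a handful of epochs.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

def trace(make_sched, epochs):
    """Record the learning rate used in each epoch for a given scheduler."""
    w = torch.nn.Parameter(torch.zeros(1))  # dummy parameter, never trained
    opt = SGD([w], lr=0.1)
    sched = make_sched(opt)
    lrs = []
    for _ in range(epochs):
        lrs.append(sched.get_last_lr()[0])
        opt.step()    # no gradients here, so nothing is updated
        sched.step()  # in this sketch both schedulers are stepped once per epoch
    return lrs

step_lrs = trace(lambda o: StepLR(o, step_size=2, gamma=0.5), epochs=6)
cos_lrs = trace(lambda o: CosineAnnealingLR(o, T_max=6), epochs=7)

print('StepLR:', [round(lr, 4) for lr in step_lrs])
print('Cosine:', [round(lr, 4) for lr in cos_lrs])
```

StepLR should show a staircase that halves every two epochs, while the cosine schedule should fall smoothly from the starting LR toward zero at `T_max`.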


In [None]:
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
# OneCycleLR overrides this lr: it starts at max_lr/25 by default, ramps up to max_lr, then anneals
opt = AdamW(model.parameters(), lr=1e-3)
steps_per_epoch = 100
sched = OneCycleLR(opt, max_lr=3e-3, epochs=5, steps_per_epoch=steps_per_epoch)
for epoch in range(5):
    for step in range(steps_per_epoch):
        # Random stand-in batch: 64 samples, 32 features, 10 classes
        x = torch.randn(64, 32)
        y = torch.randint(0, 10, (64,))
        logits = model(x)
        loss = F.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()  # OneCycleLR is stepped once per batch, not once per epoch
    print('epoch', epoch+1, 'done')
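
The loop above does not show what the learning rate actually does, so here is a small sketch that traces it. It assumes the same `max_lr=3e-3` and 500 total steps as the cell above, uses a dummy parameter instead of a model, and relies on `OneCycleLR`'s default settings.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

w = torch.nn.Parameter(torch.zeros(1))  # dummy parameter, never trained
opt = AdamW([w], lr=1e-3)               # this lr is replaced by the schedule
total_steps = 500                       # 5 epochs x 100 steps, as in the cell above
sched = OneCycleLR(opt, max_lr=3e-3, total_steps=total_steps)

lrs = []
for _ in range(total_steps):
    lrs.append(sched.get_last_lr()[0])
    opt.step()    # no gradients here, so nothing is updated
    sched.step()  # one scheduler step per batch

peak = max(lrs)
print('first lr:', lrs[0])
print('peak lr :', peak, 'at step', lrs.index(peak))
print('last lr :', lrs[-1])
```

With the defaults, the LR should start well below `max_lr`, peak during the first part of training, and end far below where it started, which is why the `lr` passed to the optimizer has no effect here.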


## 4) Practical defaults cheat-sheet
- **Batch size**: start at 64–256 (lower it if you run out of GPU memory)
- **Optimizer**: `AdamW(lr=1e-3, weight_decay=1e-2)`
- **LR schedule**: `StepLR(step_size=20, gamma=0.7)` or `CosineAnnealingLR(T_max=EPOCHS)`
- **AMP** (automatic mixed precision): use it when training on a GPU
- **Gradient clipping**: `clip_grad_norm_(params, max_norm)` with `max_norm` between 1 and 5 if training is unstable
- **Early stopping**: stop once validation loss stops improving, and keep the best validation checkpoint
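
Several of these defaults can be combined in one small loop. The sketch below is one possible arrangement under stated assumptions: the data is random stand-in data, `EPOCHS = 5` and `max_norm=1.0` are illustrative, and AMP is left out so the code also runs on CPU.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

torch.manual_seed(0)
EPOCHS = 5  # tiny, illustrative value

# Synthetic stand-in data: 32 features, 10 classes
x_train, y_train = torch.randn(512, 32), torch.randint(0, 10, (512,))
x_val, y_val = torch.randn(128, 32), torch.randint(0, 10, (128,))

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
sched = CosineAnnealingLR(opt, T_max=EPOCHS)

best_val, best_state = float('inf'), None
for epoch in range(EPOCHS):
    model.train()
    for i in range(0, len(x_train), 64):  # batch size 64
        loss = F.cross_entropy(model(x_train[i:i + 64]), y_train[i:i + 64])
        opt.zero_grad()
        loss.backward()
        # Rescale gradients so their global norm is at most 1.0
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        opt.step()
    sched.step()  # cosine schedule stepped once per epoch

    model.eval()
    with torch.no_grad():
        val_loss = F.cross_entropy(model(x_val), y_val).item()
    if val_loss < best_val:  # keep a copy of the best weights seen so far
        best_val = val_loss
        best_state = copy.deepcopy(model.state_dict())
    print(f'epoch {epoch + 1}: val_loss={val_loss:.4f}')

model.load_state_dict(best_state)  # restore the best validation checkpoint
```

Copying the `state_dict` with `deepcopy` matters here: without it, `best_state` would keep pointing at the live tensors and silently track the latest weights instead of the best ones.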
