# **L6: Optimization Techniques**

We are now moving into the engine room of deep learning. You've built neural networks (L4) and learned how to build them in PyTorch (L5). Now, we need to discuss **how** they actually learn.

A neural network is essentially a massive mathematical function with millions of knobs (parameters). The optimizer is the algorithm that decides how to turn those knobs to reduce the error. Choosing the right optimizer can be the difference between a model that converges in minutes and one that never learns at all.

Here is the roadmap for this session. I have structured this to build from the simplest "blind" descent to modern adaptive methods.

### Phase 1: Topic Breakdown

```text
L6: Optimization Techniques
├── Concept 1: The Baseline (Data & Vanilla SGD)
│   ├── The MNIST Dataset (Implicit Prerequisite)
│   ├── Stochastic Gradient Descent (SGD) Mechanics
│   ├── Intuition: Walking down a hill with small steps
│   ├── Simpler Terms: Learning by trial and error using small batches
│   └── Task: Setup the Data, Model, and a basic Train function using SGD
│
├── Concept 2: Momentum
│   ├── The problem of Local Minima & Saddle Points
│   ├── Velocity vector (accumulating past gradients)
│   ├── Intuition: Rolling a heavy ball down a hill (it gathers speed)
│   ├── Simpler Terms: Using past speed to power through flat areas
│   └── Task: Train the model using SGD with Momentum
│
├── Concept 3: RMSProp (Root Mean Square Propagation)
│   ├── Adaptive Learning Rates
│   ├── Handling different scales for different parameters
│   ├── Intuition: Slowing down on steep slopes, speeding up on flat ones
│   ├── Simpler Terms: Adjusting step size based on how shaky the terrain is
│   └── Task: Train the model using RMSProp
│
├── Concept 4: Adam (Adaptive Moment Estimation)
│   ├── Combining Momentum + RMSProp
│   ├── Bias correction
│   ├── Intuition: The "Gold Standard" general-purpose optimizer
│   ├── Simpler Terms: Smart speed + Smart steering
│   └── Task: Train the model using Adam
│
├── Concept 5: Learning Rate Schedulers (Cosine Annealing)
│   ├── The Learning Rate Decay concept
│   ├── Cosine Annealing mechanics
│   ├── Intuition: Refining the search as we get closer to the target
│   ├── Simpler Terms: Slowing down carefully to park the car perfectly
│   └── Task: Implement a scheduler and visualize the learning rate change
│
└── Mini-Project: Convergence Battle
    └── Train the optimizers above on MNIST and plot the Loss vs. Epochs comparison

```

**Prerequisite Check:**
To compare these properly, we need a consistent environment. I am treating the **Data Loading (MNIST)** and a **Simple MLP Architecture** as part of Concept 1. We will fix the architecture so the only variable changing is the *Optimizer*.


---

## **Concept 1: The Baseline (Data & Vanilla SGD)**

We begin with the fundamental workhorse of deep learning: Stochastic Gradient Descent (SGD).

### Intuition: The "Drunk" Descent

Imagine you are on a mountain at night (foggy, zero visibility), and your goal is to reach the lowest valley (minimum error).

* **Batch Gradient Descent:** You carefully scan the entire mountain (all data) to calculate the exact slope before taking one step. It is precise but incredibly slow.
* **Stochastic Gradient Descent (SGD):** You grab a handful of data points (a mini-batch), quickly estimate the slope based on just those, and take a step. Because the sample is small, your estimate is noisy. You might stagger left or right (hence "drunk"), but on average, you move downhill much faster because you take thousands of steps in the time it takes Batch GD to take one.

### Mechanics

Strictly speaking, "stochastic" gradient descent updates on a single randomly chosen sample at a time. In deep learning practice, "SGD" almost always means **Mini-batch SGD**: each update uses a small random batch of samples.

The update rule for a parameter $\theta$ at step $t$ is:
$$\theta_{t+1} = \theta_t - \eta \cdot \nabla L_{batch}(\theta_t)$$

Where:

* $\eta$ (Eta): Learning Rate (step size).
* $\nabla L_{batch}$: The gradient calculated over a small batch of data (e.g., 32 or 64 images).
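
To make the update rule concrete, here is a minimal sketch that applies it by hand to a single weight and compares the result with one step of `torch.optim.SGD`. The toy problem (fitting `y = 3x`), the names `theta_manual`, `theta_optim`, and `batch_loss`, and the learning rate of `0.1` are assumptions chosen for illustration only.

```python
import torch

torch.manual_seed(0)

# Toy problem (hypothetical): fit y = 3x with a single weight theta.
x = torch.randn(64, 1)   # one mini-batch of 64 samples
y = 3.0 * x

eta = 0.1                # learning rate (step size)
theta_manual = torch.zeros(1, requires_grad=True)
theta_optim = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([theta_optim], lr=eta)


def batch_loss(theta):
    # Mean squared error over the mini-batch
    return ((x * theta - y) ** 2).mean()


# Manual update: theta <- theta - eta * gradient
batch_loss(theta_manual).backward()
with torch.no_grad():
    theta_manual -= eta * theta_manual.grad

# The same single step, done through torch.optim.SGD
optimizer.zero_grad()
batch_loss(theta_optim).backward()
optimizer.step()

print(theta_manual.item(), theta_optim.item())
```

Both weights should end up at the same value after one step, which is the point: with no momentum or weight decay, the optimizer is applying exactly the rule above.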

### Simpler Explanation

Imagine learning to identify cars.

* **Batch GD:** You look at 10,000 photos of cars, process all of them, and then tweak your brain once.
* **SGD:** You look at 64 photos, tweak your brain immediately. Look at the next 64, tweak again. You learn much faster because you are updating your understanding constantly.

### Trade-offs

* **Pros:** Computationally efficient (the full dataset never has to sit in RAM at once), and frequent updates usually give faster progress early in training. The "noise" introduced by random batches can also help the model jump out of shallow local minima.
* **Cons:** The path to the minimum is jagged and oscillates. It can have trouble settling down into the exact bottom of the valley.
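
The noise mentioned above can be measured directly. The sketch below uses a synthetic one-parameter regression problem (an assumption, not MNIST) and compares the full-batch gradient with 200 mini-batch estimates of size 64. The helper `grad` and all constants are hypothetical.

```python
import torch

torch.manual_seed(0)

# Synthetic data (hypothetical): y = 2x + noise, 10,000 samples
N = 10_000
x = torch.randn(N)
y = 2.0 * x + 0.5 * torch.randn(N)
theta = torch.tensor(0.0)  # current parameter value


def grad(xb, yb, theta):
    # Analytic gradient of mean((theta * x - y)^2) with respect to theta
    return (2 * xb * (theta * xb - yb)).mean()


# "Batch GD": one gradient computed from every sample
full_grad = grad(x, y, theta)

# "Mini-batch SGD": 200 independent estimates from 64 random samples each
batch_grads = torch.stack([
    grad(x[idx], y[idx], theta)
    for idx in (torch.randint(0, N, (64,)) for _ in range(200))
])

print(f"full-batch gradient: {full_grad.item():.3f}")
print(f"mini-batch mean:     {batch_grads.mean().item():.3f}")
print(f"mini-batch std:      {batch_grads.std().item():.3f}")
```

The individual mini-batch gradients scatter around the full-batch value (the jagged path), while their average stays close to it (the reason SGD still heads downhill on average).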

---

### Your Task: Set the Stage

Since we will compare multiple optimizers, we need a reusable training setup.

**Objectives:**

1. **Data:** Use `torchvision` to load the MNIST dataset. Transform it to tensors and normalize. Create **DataLoaders** for training and testing with a batch size of 64.
2. **Model:** Define a class `SimpleMLP`.
   * Input: 784 features (28x28 flattened).
   * Hidden: 128 units with ReLU activation.
   * Output: 10 units (classes).
3. **Train Function:** Write a function `train_one_epoch(model, train_loader, optimizer, criterion)`.
   * It should iterate through the loader.
   * Perform the forward pass, loss computation, backward pass, and optimizer step.
   * **Crucial:** Remember to zero out gradients before the backward pass.
   * Return: The average loss for that epoch.

**Implementation Details:**

* Instantiate the model.
* Use `torch.optim.SGD` (no momentum yet, just `lr=0.01`).
* Use `nn.CrossEntropyLoss`.
* Run it for **1 epoch** just to verify the pipeline works and print the loss.
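
If you want to check your attempt against something, here is one possible shape for the pipeline, offered as a sketch rather than the reference solution. To keep it runnable offline it uses random tensors as a stand-in for MNIST (an assumption), so the loss value itself is meaningless. In the real task you would build `train_loader` from `torchvision.datasets.MNIST` instead. The layer layout inside `SimpleMLP` is just one way to meet the spec.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)


class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),          # (B, 1, 28, 28) -> (B, 784)
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Linear(128, 10),    # raw logits, one per class
        )

    def forward(self, x):
        return self.net(x)


def train_one_epoch(model, train_loader, optimizer, criterion):
    model.train()
    total_loss = 0.0
    for images, labels in train_loader:
        optimizer.zero_grad()                    # clear old gradients first
        loss = criterion(model(images), labels)  # forward pass + loss
        loss.backward()                          # backward pass
        optimizer.step()                         # parameter update
        total_loss += loss.item()
    # Mean of the per-batch losses
    return total_loss / len(train_loader)


# Stand-in data (NOT MNIST): random "images" and random labels
fake_images = torch.randn(256, 1, 28, 28)
fake_labels = torch.randint(0, 10, (256,))
train_loader = DataLoader(
    TensorDataset(fake_images, fake_labels), batch_size=64, shuffle=True
)

model = SimpleMLP()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

avg_loss = train_one_epoch(model, train_loader, optimizer, criterion)
print(f"average loss: {avg_loss:.4f}")
```

Note that `nn.CrossEntropyLoss` expects raw logits, so the model has no softmax layer at the end.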
