##  1. **SGD (Stochastic Gradient Descent)**
```python
torch.optim.SGD(params, lr=0.01, momentum=0.9)
```

####  Equation:
$
x_{t+1} = x_t - \eta \cdot \nabla f(x_t)
$

With momentum:

$
v_{t+1} = \mu v_t + \eta \cdot \nabla f(x_t) 
$


$
x_{t+1} = x_t - v_{t+1}
$
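
The momentum update above can be traced by hand on a toy problem. The sketch below is pure Python and minimizes the made-up objective $ f(x) = x^2 $ (gradient $ 2x $); `sgd_momentum`, `grad_fn`, and the chosen defaults are illustrative names, not part of PyTorch. Note that PyTorch's own `SGD` places the learning rate slightly differently ($ v = \mu v + g $, then $ x = x - \eta v $), which yields the same trajectory for a constant learning rate.

```python
def sgd_momentum(grad_fn, x0, lr=0.1, mu=0.9, steps=200):
    """Minimize a 1-D function with the momentum update written above."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = mu * v + lr * grad_fn(x)   # v_{t+1} = mu * v_t + eta * grad f(x_t)
        x = x - v                      # x_{t+1} = x_t - v_{t+1}
    return x


# Toy objective f(x) = x^2, so grad f(x) = 2x; the minimum is at x = 0.
x_final = sgd_momentum(lambda x: 2.0 * x, x0=5.0)
print(x_final)
```

Setting `mu=0.0` recovers the plain update from the first equation.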


####  Properties:
- Simple and widely used
- Supports **momentum** and **weight decay (L2 regularization)**
- Momentum helps the iterate carry through shallow local minima, saddle points, and flat regions

####  When to Use:
- Large datasets
- Classical models like linear/logistic regression
- When you want full control over optimization (e.g., with custom learning rate schedules)

#### Be Careful:
- Sensitive to learning rate
- Requires good manual tuning

---


##  2. **Adam (Adaptive Moment Estimation)**
```python
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))
```


####  Equation:
$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla f(x_t) 
$

$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla f(x_t))^2 
$

$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} 
$

$
x_{t+1} = x_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$
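
As a rough illustration of the four equations above, here is a scalar Adam step in pure Python. The function name `adam_step` and the example gradient are assumptions made for this sketch; the default hyperparameters mirror the `torch.optim.Adam` call shown earlier.

```python
import math


def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


# On the very first step, bias correction makes m_hat / sqrt(v_hat) equal to the
# sign of the gradient, so the parameter moves by roughly lr however large grad is.
x1, m1, v1 = adam_step(x=1.0, grad=250.0, m=0.0, v=0.0, t=1)
print(1.0 - x1)
```

This is why Adam is less sensitive to the raw gradient scale than SGD: the step size is governed mainly by `lr`, not by the magnitude of the gradient.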


####  Properties:
- Combines **momentum** and **adaptive learning rate**
- Keeps track of **moving averages of gradient and squared gradient**
- Usually requires **less tuning** than SGD

####  When to Use:
- Deep learning models (CNNs, RNNs, Transformers)
- Works well out of the box
- Most popular choice for **research and prototyping**

####  Be Careful:
- May generalize worse than SGD in some cases
- Can converge to sharp minima

---

##  3. **AdamW (Adam with decoupled weight decay)**
```python
torch.optim.AdamW(params, lr=0.001, weight_decay=0.01)
```

####  Equation:

$
x_{t+1} = x_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda x_t \right)
$



####  Properties:
- Like Adam but fixes how weight decay is applied
- **Recommended for transformers and NLP models**

####  When to Use:
- Training large models like BERT, Vision Transformers
- When using learning rate schedulers like `cosine annealing`

####  Popular in:
- HuggingFace Transformers
- Vision transformer training

---


<img src="images/step_size_momentum.gif" />

[Full reference](https://distill.pub/2017/momentum/)

[MOMENTUM Gradient Descent](https://www.youtube.com/watch?v=iudXf5n_3ro)

[AdamW - L2 Regularization vs Weight Decay](https://www.youtube.com/watch?v=oWZbcq_figk&t=134s)

[Optimizers - EXPLAINED](https://www.youtube.com/watch?v=mdKjMPmcWjY)

[https://www.youtube.com/watch?v=MD2fYip6QsQ](https://www.youtube.com/watch?v=MD2fYip6QsQ)

[Optimization for Deep Learning (Momentum, RMSprop, AdaGrad, Adam)](https://www.youtube.com/watch?v=NE88eqLngkg)

[Optimization in Deep Learning](https://www.youtube.com/watch?v=M2xkmc2oHUc)

`AdamW` is a variant of the Adam optimizer that **decouples weight decay from the optimization step**, which leads to better regularization and generalization compared to the standard `Adam` with L2 regularization.


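With plain `Adam`, weight decay is usually implemented as L2 regularization: the term $ \lambda \theta $ is added directly to the gradient. That term then passes through Adam's moment estimates and is rescaled by the adaptive denominator, so the effective decay differs from weight to weight.

**AdamW corrects this** by applying weight decay **directly to the weights**, outside the adaptive gradient update: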



**Adam with Incorrect L2 Regularization**

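This is the **incorrect** way (used in classic Adam), shown here with a plain gradient step for readability; in Adam the combined gradient $ \nabla_{\theta} \mathcal{L}(\theta_t) + \lambda \theta_t $ is what feeds $ m_t $ and $ v_t $: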

$
\theta_{t+1} = \theta_t - \eta \left( \nabla_{\theta} \mathcal{L}(\theta_t) + \lambda \theta_t \right)
$

- $ \eta $: learning rate  
- $ \lambda $: weight decay coefficient  
- $ \nabla_{\theta} \mathcal{L}(\theta_t) $: gradient of the loss

---

**AdamW (Correct Decoupled Weight Decay)**

This is the **correct** AdamW approach (written with a plain gradient step; AdamW uses $ \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) $ in place of the raw gradient):

$
\theta_{t+1} = (1 - \eta \lambda) \, \theta_t - \eta \nabla_{\theta} \mathcal{L}(\theta_t)
$

Alternatively, split into two steps for clarity:

1. Weight decay step:
$
\theta_t' = (1 - \eta \lambda) \, \theta_t
$

2. Gradient step:
$
\theta_{t+1} = \theta_t' - \eta \nabla_{\theta} \mathcal{L}(\theta_t)
$
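
To make the difference concrete, the sketch below runs a single step of each variant on one scalar weight in pure Python. It assumes a flat loss (`loss_grad = 0.0`) so that only the decay term moves the weight; `adam_direction` and the chosen `lr` and `lam` values are hypothetical, picked only to make the effect visible.

```python
import math


def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return Adam's bias-corrected direction m_hat / (sqrt(v_hat) + eps) and the new moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


lr, lam = 0.1, 0.01
theta = 1.0
loss_grad = 0.0  # assumption: flat loss, so any movement comes from weight decay alone

# Adam + L2: lam * theta is added to the gradient and then rescaled by Adam.
d, _, _ = adam_direction(loss_grad + lam * theta, m=0.0, v=0.0, t=1)
theta_l2 = theta - lr * d

# AdamW: the moments see only the loss gradient; decay acts on the weight itself.
d, _, _ = adam_direction(loss_grad, m=0.0, v=0.0, t=1)
theta_adamw = theta - lr * (d + lam * theta)

print(theta_l2, theta_adamw)
```

In this toy setting the L2 variant moves the weight by about `lr` regardless of how small `lam` is, because Adam normalizes the decay term away, while the decoupled variant shrinks it by exactly `lr * lam * theta`.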

---


####  What is `weight_decay`?
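- It serves the same purpose as L2 regularization (shrinking weights toward zero) but is applied as a multiplicative decay on the weights. For plain SGD the two coincide; for adaptive methods like Adam they do not.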
- Helps **prevent overfitting** by penalizing large weights.

####  When you set:
```python
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```
You're telling it to:
- Use Adam-style updates
- **Shrink every weight by a factor of `1 - lr * weight_decay`** each step (here `1 - 1e-5`, i.e. 0.001% per step, not 1%)

Tip: `weight_decay=1e-2` or `1e-4` is usually a good starting point. Don't apply it to **biases or normalization layers**: they don't benefit from it, and decaying them can hurt performance.
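
One common way to exclude biases and normalization parameters is to pass two parameter groups to the optimizer. The sketch below uses a small hypothetical model and a simple heuristic (treat every 1-D tensor as a bias or normalization parameter); that heuristic is an assumption for illustration, so adapt the split to your own architecture.

```python
import torch
from torch import nn

# Hypothetical small model, used only to illustrate the grouping.
model = nn.Sequential(nn.Linear(8, 16), nn.LayerNorm(16), nn.ReLU(), nn.Linear(16, 1))

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Heuristic assumption: 1-D tensors are biases or normalization scales/shifts.
    (no_decay if param.ndim <= 1 else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},     # weight matrices
        {"params": no_decay, "weight_decay": 0.0},   # biases and LayerNorm parameters
    ],
    lr=1e-3,
)
print(len(decay), len(no_decay))
```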

---


###  **Learning Rate Schedulers**
Learning rate schedulers are important for deep learning optimization, especially when training deep networks like transformers or CNNs, or anything using `AdamW`. They help your model **converge faster** and reach a better final loss; a large early learning rate also helps the optimizer move past poor local minima and saddle points.

---

##  What is a Learning Rate Scheduler?

The **learning rate (LR)** determines how big a step the optimizer takes when updating weights. A **scheduler** dynamically changes the learning rate during training to:

- **Start with a high LR** to explore faster
- **Gradually lower it** to fine-tune the weights


Schedulers help reduce the learning rate during training, especially when the model hits a plateau.

Some common PyTorch schedulers:

#### 1. **StepLR**
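```python
from torch.optim.lr_scheduler import StepLR
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
```
Every 10 epochs, LR = LR × 0.1 (assuming `scheduler.step()` is called once per epoch).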
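
The resulting schedule has a simple closed form, sketched here in pure Python. `step_lr` is an illustrative helper, not a PyTorch function, and it assumes a 0-based epoch counter.

```python
def step_lr(initial_lr, epoch, step_size=10, gamma=0.1):
    """Learning rate that a StepLR-style decay yields at a given 0-based epoch."""
    return initial_lr * gamma ** (epoch // step_size)


# Epochs 0-9 keep the initial LR, epochs 10-19 use one tenth of it, and so on.
schedule = [step_lr(0.1, epoch) for epoch in (0, 9, 10, 25)]
print(schedule)
```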

#### 2. **ReduceLROnPlateau**
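```python
from torch.optim.lr_scheduler import ReduceLROnPlateau
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=5, factor=0.5)
```
Halves the LR once the monitored metric has failed to improve for more than 5 consecutive epochs. The metric (typically validation loss) must be passed in explicitly: `scheduler.step(val_loss)`.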
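
A minimal sketch of how this scheduler is driven, assuming a placeholder model and a made-up sequence of validation losses (one improvement followed by a plateau). The training step itself is omitted; only the scheduler call pattern is shown.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(4, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=5, factor=0.5)

# Made-up validation losses: one improvement, then six epochs without any.
val_losses = [1.0, 0.8] + [0.8] * 6

lrs = []
for val_loss in val_losses:
    # ... train for one epoch here ...
    scheduler.step(val_loss)                      # pass the monitored metric
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

The LR stays at its initial value through the first five stagnant epochs and is cut only on the sixth.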

#### 3. **CosineAnnealingLR**
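```python
from torch.optim.lr_scheduler import CosineAnnealingLR
scheduler = CosineAnnealingLR(optimizer, T_max=50)
```
Cosine decay from the initial LR down to `eta_min` (default 0) over 50 epochs. Good for smooth convergence.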
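
The shape of the schedule follows the closed-form cosine annealing formula, sketched below in pure Python. `cosine_annealing_lr` is an illustrative helper rather than a PyTorch API, and it assumes the scheduler is stepped once per epoch.

```python
import math


def cosine_annealing_lr(base_lr, epoch, t_max=50, eta_min=0.0):
    """Closed-form cosine schedule: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


# Start, midpoint, and end of a 50-epoch run starting from lr = 1e-3.
lrs = [cosine_annealing_lr(1e-3, e) for e in (0, 25, 50)]
print(lrs)
```

The LR falls slowly at first, fastest around the midpoint (where it is half the initial value), and slowly again near the end.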

#### 4. **OneCycleLR**
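Very effective for training vision transformers and large models. The LR warms up to `max_lr` and then anneals below its starting value within a single cycle, so `scheduler.step()` must be called after every batch, not every epoch.

```python
from torch.optim.lr_scheduler import OneCycleLR
scheduler = OneCycleLR(optimizer, max_lr=1e-3, steps_per_epoch=len(train_loader), epochs=10)
```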
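
The per-batch stepping pattern might look like the sketch below. The model, the random data, and the tiny `steps_per_epoch` and `epochs` values are placeholders chosen so the snippet runs quickly; in real training they would come from your `train_loader` and configuration.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import OneCycleLR

model = nn.Linear(4, 1)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
steps_per_epoch, epochs = 5, 2               # stand-ins for len(train_loader) and the epoch count
scheduler = OneCycleLR(optimizer, max_lr=1e-3, steps_per_epoch=steps_per_epoch, epochs=epochs)

lrs = []
for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        loss = model(torch.randn(8, 4)).pow(2).mean()   # dummy loss on random data
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lrs.append(scheduler.get_last_lr()[0])          # LR used for this batch
        scheduler.step()                                # once per batch, not per epoch

print(lrs)
```

The recorded values rise from a small initial LR to `max_lr` early in training and then decay for the remaining steps.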

[PyTorch LR Scheduler](https://www.youtube.com/watch?v=81NJgoR5RfY)

---

###  Putting It All Together: AdamW + Scheduler + Regularization

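```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and random data so the snippet runs end to end; swap in your own.
model = nn.Linear(10, 1)
train_loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=16)
loss_fn = nn.MSELoss()
epochs = 50

optimizer = AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2  # decoupled weight decay (regularization)
)

scheduler = CosineAnnealingLR(optimizer, T_max=epochs)  # anneal over the whole run

for epoch in range(epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()                   # clear gradients from the previous step
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

    scheduler.step()  # update LR after each epoch
```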

---

### Best Practices
- ✅ Use `AdamW` instead of `Adam + L2`
- ✅ Set `weight_decay` (but not for biases/BatchNorm layers)
- ✅ Use a scheduler like `CosineAnnealingLR` or `OneCycleLR` for smoother training
- ✅ Tune `lr`, `weight_decay`, and scheduler parameters on a validation set


##  4. **RMSprop**
```python
torch.optim.RMSprop(params, lr=0.01, alpha=0.99)
```

#### Equation:

RMSprop (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm. It addresses Adagrad's ever-shrinking step size by replacing the running sum of squared gradients with an exponentially decaying average.

Let:
- $ \theta $ be the parameters (weights) of your model,
- $ g_t = \nabla_\theta J(\theta_t) $ be the gradient of the loss function at time step $ t $,
- $ E[g^2]_t $ be the exponentially weighted moving average of the squared gradients,
- $ \eta $ be the learning rate,
- $ \gamma $ be the decay rate (typically around 0.9; PyTorch calls it `alpha` and defaults to 0.99),
- $ \epsilon $ be a small value to avoid division by zero (e.g., $10^{-8}$).

Then the RMSprop update rule is:

1. **Running average of squared gradients:**
   $
   E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2
   $

2. **Parameter update:**
   $
   \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t
   $

---

**Explanation**

- RMSprop maintains a running average of the squared gradients and divides the learning rate by the root of this average.
- It adapts the learning rate for each parameter individually, allowing faster convergence and better handling of noisy gradients.
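
The per-parameter scaling can be seen in a small pure-Python sketch of the two equations above. `rmsprop_step` is an illustrative helper, and the constant gradients of very different magnitude are made up to show the normalization effect.

```python
import math


def rmsprop_step(theta, grad, avg_sq, lr=0.01, gamma=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter, following the two equations above."""
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2        # E[g^2]_t
    theta = theta - lr / math.sqrt(avg_sq + eps) * grad      # scaled gradient step
    return theta, avg_sq


# Two parameters whose gradients differ by four orders of magnitude.
final = {}
for grad in (100.0, 0.01):
    theta, avg_sq = 0.0, 0.0
    for _ in range(50):
        theta, avg_sq = rmsprop_step(theta, grad, avg_sq)
    final[grad] = theta

print(final)
```

Both parameters travel almost the same distance, because each gradient is divided by its own running root-mean-square.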




####  Properties:
- Adaptive learning rate
- Suitable for non-stationary objectives (e.g., reinforcement learning)

####  When to Use:
- RNNs
- Reinforcement learning
- If Adam performs poorly in your case

---



##  5. **Adagrad**
```python
torch.optim.Adagrad(params, lr=0.01)
```

#### Equations: 

Let:
- $ \theta $ be the parameters of the model,
- $ g_t = \nabla_\theta J(\theta_t) $ be the gradient at time step $ t $,
- $ G_t $ be the **accumulated sum of squared gradients** up to time $ t $,
- $ \eta $ be the initial learning rate,
- $ \epsilon $ be a small constant to prevent division by zero (e.g., $10^{-8}$).

Then:

1. **Accumulated squared gradients:**
   $
   G_t = G_{t-1} + g_t^2
   $

2. **Parameter update:**
   $
   \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t
   $

---

**Key Idea**

- Adagrad adapts the learning rate for each parameter based on how frequently it’s updated.
- Parameters with **larger past gradients** get **smaller updates**, and parameters with **smaller gradients** get **relatively larger updates**.
- It works well for **sparse data** (e.g., NLP or text data).

---

####  Properties:
- Increases stability by adapting learning rate based on past gradients
- Learning rate decreases over time (which can be limiting)

####  Limitation:

- Because $ G_t $ keeps accumulating, it can **grow very large** over time, causing the effective learning rate to **shrink too much**. This can **stop learning prematurely**.
- RMSprop, Adadelta, and Adam address this by replacing the running sum with an exponentially decaying average.
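
The shrinking step size is easy to reproduce in pure Python. The helper below (`adagrad_effective_lr`, a name invented for this sketch) tracks the multiplier $ \eta / \sqrt{G_t + \epsilon} $ for a made-up stream of constant gradients.

```python
import math


def adagrad_effective_lr(grads, lr=0.01, eps=1e-8):
    """Return the per-step multiplier lr / sqrt(G_t + eps) for a gradient sequence."""
    G, rates = 0.0, []
    for g in grads:
        G += g ** 2                              # G_t = G_{t-1} + g_t^2
        rates.append(lr / math.sqrt(G + eps))
    return rates


# With a constant gradient of 1.0 the multiplier decays like lr / sqrt(t).
rates = adagrad_effective_lr([1.0] * 100)
print(rates[0], rates[-1])
```

After 100 steps the multiplier is already ten times smaller than at the start, and it never recovers.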


####  When to Use:
- Sparse data (e.g., NLP tasks with sparse inputs)
- Models with infrequent updates (e.g., embeddings)

---


##  6. **Adadelta**
```python
torch.optim.Adadelta(params)
```

####  Properties:
- Improves Adagrad by limiting the decrease in learning rate
- The original algorithm has no learning rate to tune; PyTorch still exposes `lr` (default `1.0`) as a scale on the computed update



####  When to Use:
- Similar cases as Adagrad, but more robust

---

##  7. **NAdam (Nesterov-accelerated Adam)**
```python
torch.optim.NAdam(params, lr=0.001)
```

####  Properties:
- Combines Adam with Nesterov momentum
- Slightly faster convergence in some settings

####  When to Use:
- Similar tasks as Adam, but try if Adam is not converging fast enough

---


##  8. **LBFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno)**
```python
torch.optim.LBFGS(params)
```

---

### **What is L-BFGS?**

- L-BFGS is an approximation of the **BFGS** optimization algorithm, designed to be **memory-efficient** (hence "limited-memory").
- It approximates the **inverse Hessian matrix** (second derivatives) using gradients and parameter updates from previous steps.
- Unlike SGD, RMSprop, or Adam, L-BFGS is typically **used for smaller datasets** (e.g., in classical ML), not minibatch training.

---

**Core Idea**

Instead of computing and storing the full Hessian $ \nabla^2 f \in \mathbb{R}^{n \times n} $ (or its inverse), L-BFGS maintains a **limited history** (say $ m $ past updates) of:

- $ s_k = \theta_{k+1} - \theta_k $ (parameter differences)
- $ y_k = \nabla f_{k+1} - \nabla f_k $ (gradient differences)

These vectors are used to apply the inverse Hessian approximation $ H_k $ to the gradient via a **recursive formula** (the two-loop recursion), without ever forming $ H_k $ as a matrix.

---

#### Equations: 

At each iteration $ t $:

1. **Compute gradient**: $ g_t = \nabla_\theta J(\theta_t) $

2. **Compute direction** (approximate Newton step):
   $
   p_t = -H_t g_t
   $

3. **Line search** (optional but common): Find optimal step size $ \alpha_t $

4. **Parameter update**:
   $
   \theta_{t+1} = \theta_t + \alpha_t p_t
   $

5. **Update history** with $ s_t = \theta_{t+1} - \theta_t $, $ y_t = g_{t+1} - g_t $
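
Step 2 is usually carried out with the standard two-loop recursion, which computes $ p_t = -H_t g_t $ directly from the stored pairs. The pure-Python sketch below uses plain lists as vectors; `dot` and `two_loop_direction` are illustrative names, and the initial scaling `gamma` is one common choice rather than the only one.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def two_loop_direction(grad, s_hist, y_hist):
    """Return p = -H g from the stored (s_k, y_k) pairs, ordered oldest first."""
    q = list(grad)
    alphas = []
    # First loop: newest pair to oldest.
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append((a, rho))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # Initial inverse-Hessian guess: a scaled identity (identity if no history yet).
    gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1]) if s_hist else 1.0
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for (s, y), (a, rho) in zip(zip(s_hist, y_hist), reversed(alphas)):
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]


# Made-up history for a quadratic whose Hessian is diag(1, 10), so y = A s.
s_hist = [[1.0, 0.0], [0.5, 0.5]]
y_hist = [[1.0, 0.0], [0.5, 5.0]]
p = two_loop_direction(y_hist[-1], s_hist, y_hist)
print(p)
```

A quick sanity check on the recursion is the secant condition $ H y_k = s_k $ for the most recent pair: feeding the newest $ y $ in as the gradient returns the negated newest $ s $. With an empty history the direction reduces to plain steepest descent.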

---

**Highlights**

- **No need to store the full Hessian**, just the past $ m $ pairs of $ (s_k, y_k) $
- Typically works **best for convex, smooth problems**
- Can **converge faster** than gradient descent when the curvature of the loss is important

---

###  Limitations

- Not ideal for **large-scale deep learning**, where stochastic gradient methods (SGD, Adam) are better.
- More **sensitive to noise** and **not naturally mini-batch friendly**.

---


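####  Properties:
- Quasi-Newton method: builds curvature (second-order) information from gradients alone
- Expects full-batch (deterministic) loss and gradient evaluations
- Each step costs more than an SGD or Adam step, but far fewer steps are usually needed on smooth problems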

####  When to Use:
- Small models with a few parameters
- When you want precise convergence (e.g., curve fitting, inverse problems)
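
In PyTorch, `LBFGS` differs from the other optimizers in that `step()` needs a closure that re-evaluates the loss, since the optimizer may evaluate the function several times per step. Below is a small curve-fitting sketch on synthetic data; the data, the model, and the `max_iter` and `line_search_fn` settings are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)

# Synthetic data from y = 2x + 1 (made up for this sketch).
x = torch.linspace(-1, 1, 50).unsqueeze(1)
y = 2.0 * x + 1.0

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.LBFGS(
    model.parameters(), lr=1.0, max_iter=50, line_search_fn="strong_wolfe"
)


def closure():
    """Re-evaluate the full-batch loss; LBFGS may call this several times per step."""
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss


for _ in range(5):
    optimizer.step(closure)

final_loss = closure().item()
print(final_loss, model.weight.item(), model.bias.item())
```

On a small, smooth problem like this one, a handful of L-BFGS steps is typically enough to recover the slope and intercept to high precision.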

---

**Summary: Which Optimizer is Most Common?**

| Use Case                         | Recommended Optimizer |
|----------------------------------|------------------------|
| Most DL models (CNNs, Transformers) | **Adam** / **AdamW**     |
| Fine-tuning Transformers         | **AdamW**              |
| Classical models (Linear, Logistic) | **SGD**                |
| Sparse data (e.g., NLP)          | **Adagrad**            |
| Reinforcement Learning           | **RMSprop**            |
| Small precise models             | **LBFGS**              |

---
