## **1. What is a Learning Rate Scheduler?**

The **learning rate (LR)** controls how big each update step is during optimization:

$$
\theta_{t+1} = \theta_t - \eta_t \, \nabla_\theta \mathcal{L}(\theta_t)
$$

where $ \eta_t $ is the **learning rate at step $t$**.

A **scheduler** automatically adjusts $ \eta_t $ during training to:

* Speed up convergence early on,
* Avoid overshooting minima,
* Fine-tune learning near convergence.
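
As a minimal sketch of the update rule above, the following plain-Python loop runs gradient descent on a one-dimensional quadratic with a step-dependent learning rate. The toy loss $(\theta - 3)^2$, the starting point, and the `eta0 / (1 + 0.1 t)` decay are illustrative assumptions and do not come from any library.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def eta(t, eta0=0.4):
    """Hypothetical schedule: the learning rate shrinks as t grows."""
    return eta0 / (1.0 + 0.1 * t)


def run(steps=50, theta=0.0):
    for t in range(steps):
        # theta_{t+1} = theta_t - eta_t * gradient
        theta = theta - eta(t) * grad(theta)
    return theta


theta_final = run()
print(f"theta after 50 steps: {theta_final:.4f}")
```

The early large steps cover most of the distance to the minimum at 3, and the later small steps refine the estimate.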

---

## **2. Setup: Basic Training Loop (Before Scheduler)**


In [24]:
import torch.nn as nn
import torch
import torch.optim as optim


def func(x):
    return torch.sin(x)+x**2


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(start=0, end=10, step=0.5).reshape(-1, 1)


labels = func(x)

class Model(torch.nn.Module):
    def __init__(self, ):
        super().__init__()
        # self.conv1=torch.nn.Conv2d()
        self.fc1 = nn.Linear(in_features=1, out_features=32)
        self.fc2 = nn.Linear(in_features=32, out_features=10)
        self.fc3 = nn.Linear(in_features=10, out_features=1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        # No activation on the output layer: this is a regression model
        return self.fc3(x)


class NumberDataset(torch.utils.data.Dataset):
    def __init__(self, data, labels):
        super().__init__()
        self.data = data
        self.labels = labels

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

    def __len__(self):

        return len(self.data)


input_dataset = NumberDataset(x, labels)

batch_size = 2
data_loader = torch.utils.data.DataLoader(dataset=input_dataset, batch_size=batch_size, pin_memory=True, num_workers=4)



#### Basic training loop without scheduler

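```python
model = Model().to(device)
criterion = nn.MSELoss()  # regression target, so mean squared error
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

for epoch in range(10):
    for inputs, targets in data_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
```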

---

## **3. Add a Scheduler**

You create a scheduler **after** the optimizer.

#### **3.1 StepLR**

This scheduler **decays the learning rate by a factor of `gamma` every `step_size` epochs**.

```python
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
```

Here, every `5` epochs the $LR$ will be multiplied by `0.1`.

---


In [25]:
model = Model().to(device)
criterion = torch.nn.MSELoss()  # regression task; CrossEntropyLoss expects class targets
lr = 1e-3
opt = torch.optim.AdamW(lr=lr, params=model.parameters())
lr_scheduler_StepLR = torch.optim.lr_scheduler.StepLR(optimizer=opt, step_size=5, gamma=0.1)


epochs = 20
for epoch in range(epochs):
    model.train()
    for x, label in data_loader:
        opt.zero_grad()
        x = x.to(device, non_blocking=True)
        label = label.to(device, non_blocking=True)
        y = model(x)
        loss = criterion(y, label)
        loss.backward()
        opt.step()
        
    # Step the scheduler after each epoch        
    lr_scheduler_StepLR.step()
    lr = lr_scheduler_StepLR.get_last_lr()[0]
    # Print current learning rate
    print(f"Epoch {epoch+1}: LR = {lr:.5f}")


Epoch 1: LR = 0.00100
Epoch 2: LR = 0.00100
Epoch 3: LR = 0.00100
Epoch 4: LR = 0.00100
Epoch 5: LR = 0.00010
Epoch 6: LR = 0.00010
Epoch 7: LR = 0.00010
Epoch 8: LR = 0.00010
Epoch 9: LR = 0.00010
Epoch 10: LR = 0.00001
Epoch 11: LR = 0.00001
Epoch 12: LR = 0.00001
Epoch 13: LR = 0.00001
Epoch 14: LR = 0.00001
Epoch 15: LR = 0.00000
Epoch 16: LR = 0.00000
Epoch 17: LR = 0.00000
Epoch 18: LR = 0.00000
Epoch 19: LR = 0.00000
Epoch 20: LR = 0.00000


The following subsections explain **why StepLR produces this LR schedule** and why the printed value appears to reach **0.0**.

---

#### **3.2. What StepLR does**

The definition:

```python
torch.optim.lr_scheduler.StepLR(
    optimizer=opt, 
    step_size=5, 
    gamma=0.1
)
```

means:

* Every **step_size** epochs
* Multiply the learning rate by **gamma**

Formally:


$$
\text{lr}(t) = \text{lr}_0 \cdot \gamma^{\left\lfloor \frac{t}{\text{step\_size}} \right\rfloor}
$$


where $t$ is the epoch index, that is, the number of `scheduler.step()` calls made so far.
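
The closed form can be checked with a few lines of plain Python. This is a sketch of the formula only, not PyTorch's implementation; the function name `step_lr` is hypothetical and the defaults mirror the values used in this section.

```python
def step_lr(epoch, lr0=1e-3, gamma=0.1, step_size=5):
    """Closed-form StepLR value: lr0 * gamma ** floor(epoch / step_size)."""
    return lr0 * gamma ** (epoch // step_size)


for epoch in (0, 4, 5, 9, 10, 15):
    # Scientific notation keeps small values readable
    print(f"epoch {epoch:2d}: lr = {step_lr(epoch):.1e}")
```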

---

#### **3.3. Your schedule numerically**

You start with:

* initial LR = 0.001
* step_size = 5
* gamma = 0.1

Thus:

**First decay: at epoch 5**

$$
0.001 \times 0.1 = 0.0001
$$

**Second decay: at epoch 10**

$$
0.0001 \times 0.1 = 0.00001
$$

**Third decay: at epoch 15**

$$
0.00001 \times 0.1 = 0.000001
$$

PyTorch keeps the exact value; only the printed representation is rounded. With five decimal places:

* 0.000001 → printed as 0.00000 (rounded)
* The next decay, at epoch 20, gives:
  $$
  0.000001 \times 0.1 = 0.0000001
  $$
* A run longer than 20 epochs would continue with:
  $$
  0.0000001 \times 0.1 = 0.00000001
  $$

Python formatting:

```python
print(f"{lr:.5f}")
```

will show:

```
0.00000
0.00000
0.00000
...
```

But the LR is not literally zero; it is just **very small and rounded**.
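
A quick way to confirm this is to print the same value with two format specifiers. The snippet below is a small illustration in plain Python; the variable names are arbitrary.

```python
lr = 1e-6  # learning rate after the third decay

fixed = f"{lr:.5f}"       # five decimal places: the value rounds to zero
scientific = f"{lr:.2e}"  # scientific notation keeps the magnitude visible

print(fixed)
print(scientific)
```

Printing learning rates with `.2e` (or a similar scientific format) avoids the impression that the LR has dropped to zero.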

---

#### **3.4 The key reason your LR “becomes zero”**

Because:

**StepLR decays too aggressively.**

With gamma=0.1 every 5 epochs, the LR becomes:

| Epoch | LR   |
| ----- | ---- |
| 0–4   | 1e-3 |
| 5–9   | 1e-4 |
| 10–14 | 1e-5 |
| 15–19 | 1e-6 |
| 20–24 | 1e-7 |
| 25–29 | 1e-8 |
| 30–34 | 1e-9 |
| ...   | ...  |

After ~25–30 epochs LR is:

$$
10^{-8}, 10^{-9}, 10^{-10}, \dots
$$

In **formatted printing**, any value < 0.000005 will print as:

```
0.00000
```

So the LR is not mathematically zero, but it is effectively too small to be useful.

---

#### **3.5. Why StepLR is rarely used today**

StepLR is considered **too abrupt**, especially for deep networks.

It behaves like:

```
LR suddenly drops → sudden jump in loss → unstable training
```

Modern schedules avoid hard jumps:

* CosineAnnealingLR
* CosineAnnealingWarmRestarts
* OneCycleLR
* LambdaLR (custom smooth schedule)
* ReduceLROnPlateau (validation-based)

---

## **4. ConstantLR Schedules**

In [None]:
lr = 1e-3
opt = torch.optim.AdamW(lr=lr, params=model.parameters())
lr_scheduler_ConstantLR = torch.optim.lr_scheduler.ConstantLR(optimizer=opt)

print("factor:", lr_scheduler_ConstantLR.factor)
print("total_iters:", lr_scheduler_ConstantLR.total_iters)
print("last_epoch:", lr_scheduler_ConstantLR.last_epoch)


epochs = 20
for epoch in range(epochs):
    model.train()
    for x, label in data_loader:
        opt.zero_grad()
        x = x.to(device, non_blocking=True)
        label = label.to(device, non_blocking=True)
        y = model(x)
        loss = criterion(y, label)
        loss.backward()
        opt.step()

    lr_scheduler_ConstantLR.step()
    lr = lr_scheduler_ConstantLR.get_last_lr()[0]


    print(f"Epoch {epoch+1}: LR = {lr:.5f}")


factor: 0.3333333333333333
total_iters: 5
last_epoch: 0
Epoch 1: LR = 0.00033
Epoch 2: LR = 0.00033
Epoch 3: LR = 0.00033
Epoch 4: LR = 0.00033
Epoch 5: LR = 0.00100
Epoch 6: LR = 0.00100
Epoch 7: LR = 0.00100
Epoch 8: LR = 0.00100


---

#### **4.1. Why your LR starts at 0.00033**

You used:

```python
lr_scheduler_ConstantLR = torch.optim.lr_scheduler.ConstantLR(optimizer=opt)
```

Default values are:

```python
factor = 1/3       # = 0.333333...
total_iters = 5
```

Meaning:

**Until `scheduler.step()` has been called 5 times** (while `last_epoch < total_iters`):


$$
\text{lr}_t = \text{base\_lr} \cdot \text{factor} = 0.001 \times 0.33333 = 0.00033333
$$

On the **5th `step()` call**, the LR snaps back to:

$$
\text{lr}_t = \text{base\_lr} = 0.001
$$

This matches your output:

* Epoch 1–4: approx **0.00033**
* Epoch 5: **0.001** again
* Rest of training: **0.001**

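So the scheduler is doing exactly what it is designed for. Note that the value is printed *after* `step()`, so the reduced LR was actually used for training during epochs 1–5, and the restored LR applies from epoch 6 onward.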
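
The rule can be written as a one-line function. This is a plain-Python sketch of the behaviour described in this section, not PyTorch's code; `constant_lr` is a hypothetical name and `epoch` stands for the number of `step()` calls made so far.

```python
def constant_lr(epoch, base_lr=1e-3, factor=1.0 / 3.0, total_iters=5):
    """Scaled LR while epoch < total_iters, the base LR afterwards."""
    return base_lr * factor if epoch < total_iters else base_lr


for epoch in range(7):
    print(f"after {epoch} step() calls: lr = {constant_lr(epoch):.5f}")
```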

---

#### **4.2. Why ConstantLR is *not* actually constant**

Its name is misleading.

It means:

**Hold the LR at a *constant fraction* of initial LR for a few steps**,
then automatically restore the original LR.

It is usually used for **warmup**:

* Keep LR low for first *k* steps
* Then return to normal LR and continue with a main schedule

---

#### **4.3. Your code applies ConstantLR per epoch, not per batch**

You do:

```python
lr_scheduler_ConstantLR.step()
```

**once per epoch**.

Thus:

* It holds the reduced LR for 5 epochs
* Then uses the normal LR for the remaining 15 of the 20 epochs

If you wanted warmup per **batch**, you should call `.step()` inside the batch loop.

---



## **5. ReduceLROnPlateau Scheduler**

In [None]:
import torch.nn as nn
import torch
import torch.optim as optim


device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)


data_loader = [(torch.randn(4, 10), torch.randn(4, 2))]


# Dummy model
model = nn.Linear(10, 2).to(device)

criterion = torch.nn.MSELoss()  # targets are random real values, not class labels

lr = 1e-3
opt = torch.optim.AdamW(lr=lr, params=model.parameters())

# ReduceLROnPlateau scheduler
lr_scheduler_ReduceLROnPlateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer=opt,
    mode="min",
    factor=0.1,
    patience=5,     # reduce LR after 5 bad epochs
)

print("eps: ", lr_scheduler_ReduceLROnPlateau.eps)

epochs = 30


# ---------------------------------------------
# Manually craft validation loss that plateaus
# ---------------------------------------------
# First improves, then stops improving after epoch 12
val_losses = [
    1.0, 0.9, 0.8, 0.7, 0.6,
    0.55, 0.52, 0.50, 0.49, 0.485,
    0.480, 0.479,       # still improving
    0.479, 0.479, 0.479, 0.480, 0.481,   # plateau begins
    0.482, 0.481, 0.483, 0.484, 0.485,
    0.486, 0.487, 0.488, 0.487,
    0.489, 0.490, 0.491, 0.492
]
# length = 30


# Dummy dataloader (not relevant for LR scheduler demo)
data_loader = [(torch.randn(4, 10), torch.randn(4, 2))]


best = float("inf")

eps = lr_scheduler_ReduceLROnPlateau.eps

print("\nDEBUG INFO\n----------")

for epoch in range(len(val_losses)):

    # fake training step
    model.train()
    for x, label in data_loader:

        x = x.to(device, non_blocking=True)
        label = label.to(device, non_blocking=True)
        opt.zero_grad()
        out = model(x)
        loss = criterion(out, label)
        loss.backward()
        opt.step()

    # --------------------------------
    # Validation loss for this epoch
    # --------------------------------
    current = val_losses[epoch]

    # --------------------------------
    # Condition for improvement
    # (defaults: mode="min", threshold_mode="rel", threshold=1e-4)
    # --------------------------------
    threshold = lr_scheduler_ReduceLROnPlateau.threshold
    improved = current < best * (1 - threshold)

    # --------------------------------
    # Debug printing
    # --------------------------------
    print(
        f"Epoch {epoch+1:02d}: "
        f"current={current:.6f}, best={best:.6f}, "
        f"\ncurrent < best * (1 - threshold) =\n"
        f"{current:.6f} < {best * (1 - threshold):.6f}\n"
        f"condition = {improved}"
    )

    # Update best manually
    if improved:
        best = current
    else:
        pass  # counted as bad epoch (scheduler tracks internally)

    # --------------------------------
    # Call scheduler
    # --------------------------------
    lr_scheduler_ReduceLROnPlateau.step(current)

    lr = opt.param_groups[0]["lr"]

    print(f"          LR = {lr:.6f}\n")
    print(f"Epoch {epoch+1}: LR = {lr:.5f}")
    print("-"*60)


#### **5.1. How ReduceLROnPlateau works**

The scheduler monitors a metric (usually validation loss) and **reduces the learning rate when the metric has stopped improving**.

You call it like:

```python
scheduler.step(val_loss)
```

each epoch.

---

#### **5.2. What triggers the LR reduction?**

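The scheduler tracks the **best value seen so far** and compares each new value against it using the `threshold` parameter (default `1e-4`, in relative mode `threshold_mode="rel"`).

For `mode="min"`:

* Improvement means:
  $$\text{current} < \text{best} \cdot (1 - \text{threshold})$$
* If this happens → **reset patience counter**.

For `mode="max"`:

* Improvement means:
  $$\text{current} > \text{best} \cdot (1 + \text{threshold})$$

Here

* **patience** = number of "bad epochs" tolerated before reducing LR
* **threshold** = minimum relative change that counts as an improvement (with `threshold_mode="abs"` the test becomes $\text{current} < \text{best} - \text{threshold}$)
* **eps** (default `eps=1e-8`) is *not* part of this test. It is the minimal decay applied to the LR: if the old and new LR differ by less than `eps`, the update is skipped.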

---

#### **5.3. What counts as “no improvement”?**

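An epoch without improvement is a **bad epoch**.
A bad epoch happens when (for `mode="min"` with the default relative threshold):

$$\text{current loss} \ge \text{best loss} \cdot (1 - \text{threshold})$$

So an improvement smaller than the threshold still counts as a bad epoch.

Example:

* best loss: 0.250000
* current loss: 0.2499998
* threshold: 1e-4, so the loss must drop below 0.249975

This is **not considered an improvement**, because the change is far smaller than the threshold.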

---

#### **5.4. After *patience* bad epochs → LR decreases**

If you set:

```python
patience=5
factor=0.1
```

Then:

* Once more than 5 consecutive epochs show no improvement (that is, on the 6th bad epoch) →
  new learning rate becomes:

$$\text{lr}_{\text{new}} = \text{lr}_{\text{old}} \times \text{factor}$$

Example:

Initial LR = 1e-3
factor = 0.1

After plateau:

New LR =
$$10^{-3} \times 0.1 = 10^{-4}$$
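
The bookkeeping behind this can be sketched in plain Python. The function below is a simplified illustration, not PyTorch's implementation: it assumes `mode="min"`, the relative `threshold` comparison that PyTorch documents as the default, no cooldown, and no `min_lr`. The name `plateau_schedule` and the loss values are hypothetical.

```python
def plateau_schedule(losses, lr=1e-3, factor=0.1, patience=5, threshold=1e-4):
    """Return the LR in effect after each epoch for a list of validation losses."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best * (1.0 - threshold):
            # Real improvement: remember it and reset the counter
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # More than `patience` bad epochs in a row: reduce the LR
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


demo_losses = [1.0, 0.9] + [0.9] * 6  # two improvements, then a plateau
for epoch, value in enumerate(plateau_schedule(demo_losses), start=1):
    print(f"epoch {epoch}: lr = {value:.1e}")
```

In this sketch the reduction happens on the 6th consecutive bad epoch, one more than `patience`.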

---




## **6. Linear Warmup**


#### 6.1. Warmup

Warmup is a **temporary increase** of the learning rate from a very small value up to the target learning rate.

Why?
Early in training, gradients are unstable. If LR is too large, updates can explode or overshoot.
Warmup prevents this by starting small, then ramping up to the desired value.

---

#### 6.2. General Warmup Concept

During warmup steps $ t \in [0, T_{\text{warmup}}] $:

The LR increases from:

* starting LR: $ \eta_{\text{start}} $
* target LR:   $ \eta_{\text{max}} $

Warmup is usually **linear**, but can be exponential or cosine.

---

#### 6.3. Linear Warmup Equation

For step $ t $ during warmup:

$$ \eta(t) = \eta_{\text{start}} +
 \frac{t}{T_{\text{warmup}}} \left(\eta_{\text{max}} - \eta_{\text{start}}\right)  $$

At end of warmup:

* $ t = T_{\text{warmup}} $
* $ \eta(T_{\text{warmup}}) = \eta_{\text{max}} $

After this point, a different scheduler takes over (cosine, exponential, etc.)
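
The equation above translates directly into code. This is a minimal plain-Python sketch; the function name `linear_warmup` and the numeric values are illustrative assumptions.

```python
def linear_warmup(t, t_warmup, eta_start, eta_max):
    """Linear interpolation from eta_start (t = 0) to eta_max (t = t_warmup)."""
    return eta_start + (t / t_warmup) * (eta_max - eta_start)


for t in (0, 5, 10):
    lr = linear_warmup(t, t_warmup=10, eta_start=1e-5, eta_max=1e-3)
    print(f"step {t:2d}: lr = {lr:.3e}")
```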

---

#### 6.4. Why Warmup Works

During the first batches:

* gradients are noisy
* batch statistics are unstable
* weights are unscaled
* Adam/AdamW moment estimates are still based on only a few gradients

Warmup prevents catastrophic updates.

Transformers and ViTs are almost always trained with warmup.

---

## **7. Cosine Annealing**


#### 7.1. Cosine Annealing (without warmup)

Cosine annealing reduces LR following half a cosine wave.

Idea: **large LR at the beginning, very small LR at the end**, with a smooth transition and no abrupt changes.

Time variable: $ t = 0, 1, 2, \ldots, T_{\text{max}}$

---

#### 7.2. Equation

$$\eta(t) = \eta_{\text{min}}+ \frac{\eta_{\text{max}} - \eta_{\text{min}}}{2}
  \left(1 + \cos\left(\pi \frac{t}{T_{\text{max}}}\right)\right)  $$

This is one half-period of cosine.

---

#### 7.3. Why decreasing?

Cosine goes from:

* $ \cos(0) = 1 $
* $ \cos(\pi) = -1 $

So the LR decreases smoothly:

$$
\eta(0) = \eta_{\text{max}}
$$

$$
\eta(T_{\text{max}}) = \eta_{\text{min}}
$$

No oscillation because only **half** cosine is used (not periodic).
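
The half-cosine formula can be evaluated directly. The following is a plain-Python sketch of the equation in 7.2, not PyTorch's `CosineAnnealingLR` code; the function name and the numeric values are assumptions chosen for illustration.

```python
import math


def cosine_annealing(t, t_max, eta_max, eta_min=0.0):
    """Half-cosine decay from eta_max (t = 0) to eta_min (t = t_max)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / t_max))


for t in (0, 25, 50, 75, 100):
    lr = cosine_annealing(t, t_max=100, eta_max=1e-3, eta_min=1e-6)
    print(f"step {t:3d}: lr = {lr:.3e}")
```

Comparing consecutive values shows that the per-step change is smallest at both ends of the schedule and largest around the midpoint.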

---

#### 7.4. Intuition

Cosine annealing decreases slowly at the start, fastest around the midpoint, and very slowly again near the end.
This helps the model settle gently into a good minimum.

Used in transformers, ViT, GPT, diffusion, Stable Diffusion.

---

## **8. Exponential Decay**

LR decays by a constant ratio every epoch.

---

#### **8.1. Equation**

Let decay rate = $ \gamma $ (e.g., $ \gamma = 0.95 $).

Then:

$$
\eta(t) = \eta_{0} \cdot \gamma^{t}
$$

Example:

* $ \gamma = 0.95 $
* after 20 epochs:

$$
\eta(20) = \eta_0 \cdot 0.95^{20}
$$

Exponential decay drops quickly early, then flattens.

Used in older CNN models, RNNs, classical ML.
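
To make the example concrete, the sketch below evaluates the formula in plain Python. The function name `exponential_lr` and the initial LR of `1e-3` are assumptions; only $\gamma = 0.95$ and the 20 epochs come from the text.

```python
def exponential_lr(t, eta0=1e-3, gamma=0.95):
    """Exponential decay: eta0 * gamma ** t."""
    return eta0 * gamma ** t


ratio = exponential_lr(20) / exponential_lr(0)
print(f"fraction of the initial LR left after 20 epochs: {ratio:.3f}")
```

With $\gamma = 0.95$, roughly a third of the initial learning rate remains after 20 epochs.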

---

#### **8.2. Intuition**

Simple and predictable, but it has no built-in floor: the LR keeps shrinking toward zero for as long as training runs.

---

## **9. Linear Warmup + Cosine Decay**

Most common combined schedule:

1. **Warmup phase**: LR increases linearly
2. **Cosine schedule**: LR decreases smoothly

---

#### **9.1. Equation (piecewise)**

Let warmup last $ T_{\text{warmup}} $ steps.

Warmup:

$$\eta(t) = \eta_{\text{start}} + \frac{t}{T_{\text{warmup}}}(\eta_{\text{max}} - \eta_{\text{start}}),
  \quad t < T_{\text{warmup}}
  $$

Cosine (after warmup):

Let $ s = t - T_{\text{warmup}} $

$$\eta(t) = \eta_{\text{min}} + \frac{\eta_{\text{max}} - \eta_{\text{min}}}{2}
  \left( 1 + \cos\left(\pi \frac{s}{T_{\text{total}} - T_{\text{warmup}}} \right)\right),
  \quad t \ge T_{\text{warmup}}
  $$
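
The piecewise definition can be sketched as a single function. This is a plain-Python illustration of the two equations above, not a library implementation; the function name `warmup_cosine` and the configuration values are assumptions.

```python
import math


def warmup_cosine(t, t_warmup, t_total, eta_start, eta_max, eta_min):
    """Linear warmup for t < t_warmup, half-cosine decay afterwards."""
    if t < t_warmup:
        return eta_start + (t / t_warmup) * (eta_max - eta_start)
    s = t - t_warmup
    progress = s / (t_total - t_warmup)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))


cfg = dict(t_warmup=20, t_total=200, eta_start=1e-5, eta_max=1e-3, eta_min=1e-6)
for t in (0, 10, 20, 110, 200):
    print(f"step {t:3d}: lr = {warmup_cosine(t, **cfg):.3e}")
```

The two branches meet at $t = T_{\text{warmup}}$, where both give $\eta_{\text{max}}$, so the schedule has no jump at the hand-over.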

---

#### **9.2. Intuition**

Warmup stabilizes the start.
Cosine provides a smooth landing.

This is the standard schedule for:

* ViT
* GPT
* BERT
* diffusion models
* Stable Diffusion
* large CNNs
* most modern LLMs

---


## **10. When should you choose which?**

#### Use ReduceLROnPlateau when:

* training small/medium models
* val loss is reliable
* dataset is small
* want fully automatic LR reduction

Typical in:

* medical imaging
* segmentation tasks
* small CNNs
* Kaggle tasks
* classic PyTorch tutorials

---

#### Use Cosine Annealing when:

* training modern architectures
* training transformers, ViT, CLIP, GPT
* using AdamW
* training large models on large datasets
* want smooth LR decay
* want reproducibility and stability

This includes:

* ImageNet training
* ViT fine-tuning
* LLM training
* diffusion models
* many JAX-based setups (for example at Google and DeepMind)

---



In [None]:
import torch
import matplotlib.pyplot as plt
import numpy as np

steps = 200                     # number of epochs or iterations to simulate
base_lr = 1e-3                  # starting LR


def new_opt(lr=base_lr):
    return torch.optim.Adam([torch.zeros(1)], lr=lr)


def record_scheduler(name, scheduler, optimizer, mode="epoch"):
    """
    mode = 'epoch' or 'iter'
    """
    lrs = []

    for t in range(steps):
        # required to avoid warning: optimizer.step() must run before scheduler.step()
        optimizer.step()

        if name == "ReduceLROnPlateau":
            # create fake validation loss curve
            # first improvements then plateau, then slight increase
            val_loss = 1 + np.sin(t / 15) * 0.1 + (t / steps) * 0.3
            scheduler.step(val_loss)
        else:
            scheduler.step()

        lrs.append(optimizer.param_groups[0]["lr"])

    return name, lrs


results = []

# ------------------------------------------------------------
# 1) StepLR
# ------------------------------------------------------------
opt = new_opt()
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=40, gamma=0.5)
results.append(record_scheduler("StepLR", sched, opt))

# ------------------------------------------------------------
# 2) Cosine Annealing
# ------------------------------------------------------------
opt = new_opt()
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps, eta_min=1e-6)
results.append(record_scheduler("CosineAnnealing", sched, opt))

# ------------------------------------------------------------
# 3) Warmup + Cosine (Linear warmup first 20 steps)
# ------------------------------------------------------------
total_warmup = 20
opt = new_opt()

scheduler_warmup = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=0.1, total_iters=total_warmup
)
scheduler_cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=steps - total_warmup, eta_min=1e-6
)

lrs = []
for t in range(steps):
    opt.step()

    if t < total_warmup:
        scheduler_warmup.step()
    else:
        scheduler_cosine.step()

    lrs.append(opt.param_groups[0]["lr"])

results.append(("Warmup+Cosine", lrs))

# ------------------------------------------------------------
# 4) ReduceLROnPlateau
# ------------------------------------------------------------
opt = new_opt()
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, mode="min", factor=0.5, patience=10, threshold=1e-4
)
results.append(record_scheduler("ReduceLROnPlateau", sched, opt))

# ------------------------------------------------------------
# 5) ExponentialLR
# ------------------------------------------------------------
opt = new_opt()
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.97)
results.append(record_scheduler("ExponentialLR", sched, opt))

# ------------------------------------------------------------
# 6) PolynomialLR
# ------------------------------------------------------------
opt = new_opt()
sched = torch.optim.lr_scheduler.PolynomialLR(opt, total_iters=steps, power=2)
results.append(record_scheduler("PolynomialLR", sched, opt))

# ------------------------------------------------------------
# 7) OneCycleLR (per iteration)
# ------------------------------------------------------------
max_lr = 1e-2
opt = new_opt(lr=1e-4)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=max_lr, total_steps=steps
)
results.append(record_scheduler("OneCycleLR", sched, opt, mode="iter"))

# ------------------------------------------------------------
# 8) CyclicLR (per iteration)
# ------------------------------------------------------------
opt = new_opt(lr=1e-4)
sched = torch.optim.lr_scheduler.CyclicLR(
    opt,
    base_lr=1e-4,
    max_lr=1e-2,
    step_size_up=steps // 4,
    cycle_momentum=False,
)
results.append(record_scheduler("CyclicLR", sched, opt, mode="iter"))

# ------------------------------------------------------------
# 9) Linear Warmup only
# ------------------------------------------------------------
opt = new_opt(lr=1e-6)
sched = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1e-3, total_iters=steps
)
results.append(record_scheduler("LinearWarmupOnly", sched, opt))

# ------------------------------------------------------------
# 10) LambdaLR
# ------------------------------------------------------------
opt = new_opt()

def lr_lambda(epoch):
    # simple example: inverse sqrt decay
    return 1.0 / np.sqrt(epoch + 1)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_lambda)
results.append(record_scheduler("LambdaLR", sched, opt))

# ------------------------------------------------------------
# Plot all curves
# ------------------------------------------------------------
plt.figure(figsize=(14, 8))

for name, lrs in results:
    plt.plot(lrs, label=name)

plt.xlabel("Step")
plt.ylabel("Learning Rate")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


## **11. The Intuition of the Most Common LR Schedules**




#### **11.1. StepLR**

Decreases LR with sudden drops.

```
High  ————
           |
           ————
               |
               ———— low
```

Used in **classic CNNs** where step-decay empirically worked well.

Interpretation
Model improves → reach plateau → drop LR → improve again.

---

#### **11.2. Cosine Annealing**

LR follows half of a cosine wave:

$$
LR(t) = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})(1 + \cos(\pi t / T_{\max}))
$$

Intuition
Starts high → decreases smoothly → tiny values at the end.

Used in **ViT / Transformers** and modern architectures.

Why popular?
Because cosine gives a **very smooth**, **stable**, non-jumpy decay.

---

#### **11.3. Warmup + Cosine**

Two phases:

**Phase A: Warmup**

LR increases from very small → target LR.

Why warmup?
During first batches, gradients are unstable.
Warmup prevents “explosions”.

**Phase B: Cosine decay**

Same as above.

Used in **GPT, ViT, diffusion models**.

---

#### **11.4. ReduceLROnPlateau**

Monitors validation loss:

If val_loss stops improving → reduce LR.

Very intuitive:
Model stuck? → make smaller steps.

Well suited when you cannot predict in advance when the loss will plateau, provided the validation metric is reliable.

---

#### **11.5. Exponential decay**

Decays LR every epoch by a factor:

$$LR = LR_0 \cdot \gamma^t$$

Simple, steady shrink.
Used in simple CNNs or classic ML.

---

#### **11.6. Polynomial decay**

Used in segmentation:

$$LR = LR_0 (1 - t/T)^{p}$$

With $p < 1$ the decay steepens toward the end; with $p > 1$ it flattens out instead.
Often used for fine-grained pixel-level tasks.
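
The effect of the exponent is easiest to see numerically. The sketch below evaluates the formula in plain Python; the function name `polynomial_lr` and the exponents 0.9 and 2.0 are illustrative assumptions.

```python
def polynomial_lr(t, t_total, lr0=1e-3, power=2.0):
    """Polynomial decay: lr0 * (1 - t / t_total) ** power, reaching 0 at t_total."""
    return lr0 * (1.0 - t / t_total) ** power


for power in (0.9, 2.0):
    values = [polynomial_lr(t, 100, power=power) for t in (0, 50, 90, 100)]
    print(f"power = {power}: " + ", ".join(f"{v:.2e}" for v in values))
```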

---

#### **11.7. OneCycleLR**

Starts low → goes high → drops very low.

Intuition:
Large LR in the middle helps escape poor minima.
Final very low LR helps convergence.

Used in YOLO & fastai.

---

#### **11.8. CyclicLR**

LR oscillates between low and high:

Low → High → Low → High …

Good when training is noisy and non-stationary.
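
One common shape for this oscillation is a triangular wave. The sketch below follows the triangular policy as it is usually written; it illustrates the shape only and is not PyTorch's `CyclicLR` code. The function name `triangular_lr` and the numeric values are assumptions.

```python
import math


def triangular_lr(t, base_lr, max_lr, step_size):
    """Triangular cycle: base_lr up to max_lr over step_size steps, then back down."""
    cycle = math.floor(1 + t / (2 * step_size))
    x = abs(t / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


for t in (0, 25, 50, 75, 100):
    lr = triangular_lr(t, base_lr=1e-4, max_lr=1e-2, step_size=50)
    print(f"step {t:3d}: lr = {lr:.3e}")
```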

---

#### **11.9. Linear warmup**

Just the warmup phase:

Low → gradually → target LR.

Used for **very large models**, typically followed by a decay schedule such as cosine.

---

#### **11.10. LambdaLR**

Fully custom schedule.

You define a function:

```python
lr = lr0 * f(epoch)
```
---

## **12. Weight Decay**

Weight decay **always pushes weights toward zero**, and it does so in a precise mathematical way.
For plain SGD it is equivalent to L2 regularization; for adaptive optimizers such as Adam the two differ (see 12.5).

Why shrink weights?

Because large weights mean the model **relies too heavily on specific features**, leading to:

* overfitting
* sharp minima
* unstable training
* poor generalization

Weight decay forces smoother, simpler models.

---

**Weight decay ≈ push weights to stay small**

The update rule is (absorbing constant factors into $\lambda$):

$$
w \leftarrow w - \eta (\nabla L + \lambda w)
$$

The term
$$\lambda w$$
is what pulls weights toward zero.

---



#### **12.1. Core idea**

Weight decay adds a penalty term
$$
\lambda \lVert w \rVert^2
$$
to the loss.

The gradient of this penalty is:

$$
\frac{\partial}{\partial w} \lambda \lVert w \rVert^2 = 2\lambda w
$$

This is **always in the same direction as the weight itself**.

Then the optimizer update becomes:

$$
w \leftarrow w - \eta ( \nabla L + 2\lambda w )
$$

Rewrite:

$$
w \leftarrow w - \eta \nabla L - 2\lambda \eta w
$$

Group the weight-dependent term:

$$
w \leftarrow (1 - 2\lambda \eta) \, w - \eta \nabla L
$$

This is the key.

---

#### **12.2. Why this always shrinks the weight**

Look at the factor

$$
1 - 2\lambda \eta
$$

Since

* learning rate $\eta > 0$
* weight decay $\lambda > 0$

you have:

$$
1 - 2\lambda \eta < 1.
$$

For the usual small values ($2\lambda\eta < 1$) this factor is also positive, so every update multiplies the weight by a number slightly less than 1.

Example:
If

* $\eta = 0.001$
* $\lambda = 0.01$

then

$$
1 - 2\lambda\eta = 1 - 0.00002 = 0.99998.
$$

So each step:

$$
w_{\text{new}} = 0.99998 \cdot w - \eta \nabla L.
$$

This means that **even if the gradient is zero**, the weight shrinks:

$$
w_{t+1} = 0.99998 \, w_t.
$$

After many steps, it approaches zero.
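
The shrinking effect can be simulated with the loss gradient set to zero. This plain-Python sketch applies only the factor $(1 - 2\lambda\eta)$; the function name `decay_only` and the deliberately large values of $\eta$ and $\lambda$ are assumptions chosen so the effect is visible within 100 steps.

```python
def decay_only(w, eta, lam, steps):
    """Apply w <- (1 - 2 * lam * eta) * w repeatedly, with zero loss gradient."""
    factor = 1.0 - 2.0 * lam * eta
    for _ in range(steps):
        w = factor * w
    return w


for w0 in (5.0, -5.0):
    print(f"{w0:+.1f} -> {decay_only(w0, eta=0.1, lam=0.05, steps=100):+.4f}")
```

Both the positive and the negative weight move toward zero while keeping their sign.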

---

#### **12.3. The added gradient is always aligned with w**

A natural objection:

> *"How does λw pull weights toward zero? I mean it is adding something to it, it could be plus or minus."*

The resolution is that the sign of the added term always matches the sign of the weight.

The penalty contributes a gradient:

$$
2\lambda w
$$

If $w$ is positive → gradient is positive
If $w$ is negative → gradient is negative

But the optimizer **subtracts** gradients:

Update:

$$
w \leftarrow w - \eta (2\lambda w)
$$

Thus:

* If $w > 0$:
  $$w \leftarrow w - \eta (2\lambda w) = w(1 - 2\lambda\eta)$$
  → weight decreases

* If $w < 0$:
  $$w \leftarrow w - \eta (2\lambda w) = w(1 - 2\lambda\eta)$$
  → weight increases (toward 0)

So no matter the sign, the magnitude $|w|$ shrinks.

**Why it works:**
You are subtracting a term proportional to the weight itself.

This is like “friction” in physics — it slows motion and reduces magnitude.

---

#### **12.4. Pure intuition (best explanation)**

Think of weight decay as applying a **brake** to parameter magnitude.

The bigger a weight is, the stronger the “pull-to-zero” is.

The penalty gradient is of the form:

* positive if weight is positive
* negative if weight is negative

But because we **subtract** the gradient during update, that pushes the weight toward zero in both cases.

---

#### **12.5. Why weight decay is essential**

- **It limits weight growth**

Without decay, weights may drift outward during noisy updates.

- **It favors smoother solutions**

Large weights tend to go with sharp minima and poorer generalization.
Smaller weights tend to go with flatter minima and better generalization.

- **Why AdamW is preferred**

In Adam, an L2 penalty enters through the gradient, so it is rescaled by the adaptive per-parameter step size and no longer acts as a uniform shrink.
AdamW decouples the term, applying:

$$
w \leftarrow w (1 - \eta \lambda)
$$

as a separate operation.

---

#### **12.6. Minimal numeric example**

Let:

* $w = 10$
* $\eta = 0.1$
* $\lambda = 0.01$

Decay factor (using the decoupled form $1 - \eta\lambda$ from 12.5):

$$
1 - \eta \lambda = 0.999
$$

Update:

$$
w_{\text{new}} = 10 \cdot 0.999 = 9.99.
$$

If $w = -10$:

$$
-10 \cdot 0.999 = -9.99.
$$

Both cases → magnitude shrinks.

---

#### **12.7. One-sentence summary**

> **Weight decay works because we subtract a gradient term proportional to the weight itself, which multiplies the weight by a factor slightly less than 1 every step, shrinking it toward zero regardless of its sign.**

---



## **13. The relationship between LR and Weight Decay**

They interact **directly** in the update rule.

High LR makes weight decay *act stronger*.
Low LR makes weight decay *act weaker*.

So:

### Rule of thumb

* When you use **large LR**, weight decay is **very influential**.
* When you use **very small LR**, weight decay barely changes anything.

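This is why AdamW applies weight decay as a separate step instead of folding it into the adaptive gradient update. In PyTorch's AdamW the decay step is still multiplied by the LR, so an LR schedule also changes the effective decay.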
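
A rough way to quantify the interaction is to count how many pure-decay steps are needed to halve a weight. The sketch below assumes the decoupled form $w \leftarrow w(1 - \eta\lambda)$ with zero loss gradient; the function name `steps_to_halve` and the chosen values are hypothetical.

```python
import math


def steps_to_halve(eta, lam):
    """Steps of pure decay w <- w * (1 - eta * lam) needed to halve |w|."""
    return math.ceil(math.log(0.5) / math.log(1.0 - eta * lam))


for eta in (1e-1, 1e-3, 1e-5):
    n = steps_to_halve(eta, lam=0.01)
    print(f"eta = {eta:.0e}: about {n} steps to halve a weight")
```

Each 100-fold reduction of the learning rate makes the same $\lambda$ take roughly 100 times as many steps to have the same effect.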

---

## **14. Best practical settings**

### Vision models (ConvNeXt, ViT)

* Warmup + Cosine
* Weight decay: 0.02–0.1

### Small CNNs

* StepLR
* Weight decay: 1e-4

### Transformers

* Warmup (5–10% of training)
* Cosine decay
* Weight decay: 0.05

### YOLO / Object detection

* OneCycleLR
* Weight decay: 5e-4

---

