In [1]:
import torch as t
from torch.optim.lr_scheduler import *
import numpy as np
import plotly.express as px
import pandas as pd
from functools import partial

# Learning Rate Schedulers
Every time the `step()` method is called on a learning rate scheduler, it can potentially change the learning rate. Some schedulers change the learning rate only after `step()` has been called a certain number of times, not on every call. Typically, `step()` is called once per epoch, so in most cases the learning rate changes every so many epochs.

**In my notes below, I'll use the terms "epoch" and "scheduler step" interchangeably.**

In [2]:
def explore(lrsched, *, n_epochs=50, steps_per_epoch=7, optim_lr=1., change_every_step=False):
    """
    Typically learning rate is changed with each epoch, but sometimes I have seen it change with every step.
    change_every_step controls this behaviour. If set to False, it changes the lr only every epoch.
    """
    model = t.nn.Linear(5, 1)
    optim = t.optim.SGD(model.parameters(), lr=optim_lr)
    # lr = t.optim.lr_scheduler.StepLR(optim, step_size=3, gamma=0.5)
    lr = lrsched(optim)
    lrs = []
    epochs = []

    for epoch in range(n_epochs):
        for step in range(steps_per_epoch):
            optim.zero_grad()
            curr_lr = lr.get_last_lr()[0]
            # curr_lr_ = optim.state_dict()["param_groups"][0]["lr"]
            lrs.append(curr_lr)
            epochs.append(str(epoch))
            optim.step()
            if change_every_step:
                lr.step()
        if not change_every_step:
            lr.step()

    df = pd.DataFrame({
        "epoch": pd.Series(epochs),
        "lr": pd.Series(lrs)
    })  
    px.scatter(df, x=df.index, y="lr", color="epoch").show()
    return df

## ExponentialLR
Exponentially decays the optimizer's configured learning rate by multiplying it with $\gamma$ at every scheduler step (i.e., every epoch).
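
As a sanity check on the plot, the decay can be written in closed form. This is a minimal sketch in plain Python, not the PyTorch implementation; `exponential_lr` is a hypothetical helper name.

```python
def exponential_lr(base_lr: float, gamma: float, epoch: int) -> float:
    """Learning rate after `epoch` scheduler steps of pure exponential decay."""
    return base_lr * gamma ** epoch


# Print the first few epochs for gamma=0.8, the value used in this notebook.
for epoch in range(5):
    print(epoch, exponential_lr(1.0, 0.8, epoch))
```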

In [3]:
lrsched = partial(ExponentialLR, gamma=0.8)
df = explore(lrsched, optim_lr=1.0)

## StepLR
This is a more general form of `ExponentialLR`: instead of decaying the learning rate on every epoch, I can decay it once every `step_size` epochs. With `step_size=1` it behaves like `ExponentialLR`.
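
The staircase shape comes from an integer division of the epoch by `step_size`. A minimal plain-Python sketch of that closed form (`step_lr` is a hypothetical helper, not a PyTorch function):

```python
def step_lr(base_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Decay by `gamma` once for every full block of `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


# With step_size=2 the rate stays flat for two epochs, then drops.
for epoch in range(6):
    print(epoch, step_lr(1.0, 0.8, 2, epoch))
```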

In [4]:
lrsched = partial(StepLR, step_size=2, gamma=0.8)
df = explore(lrsched, optim_lr=1.0)

## MultiStepLR
This is a more flexible version of `StepLR` where instead of decaying the learning rate at a regular interval of every so many epochs, I can explicitly set the scheduler step number at which the learning rate should be changed.
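
One way to picture this: the exponent of $\gamma$ is the number of milestones already reached. A minimal sketch in plain Python, assuming the milestones are given in scheduler steps (`multistep_lr` is a hypothetical helper):

```python
from bisect import bisect_right


def multistep_lr(base_lr: float, gamma: float, milestones: list, epoch: int) -> float:
    """Decay by `gamma` once for every milestone that is <= `epoch`."""
    reached = bisect_right(sorted(milestones), epoch)
    return base_lr * gamma ** reached


# The rate changes only when the epoch hits 5, 20 and 40.
for epoch in (0, 4, 5, 19, 20, 39, 40):
    print(epoch, multistep_lr(1.0, 0.8, [5, 20, 40], epoch))
```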

In [5]:
lrsched = partial(MultiStepLR, milestones=[5, 20, 40], gamma=0.8)
df = explore(lrsched, optim_lr=1.0)

## ConstantLR
Makes a single shift in the learning rate. It starts at `factor` * `optim_lr` and stays at that reduced rate until `total_iters` scheduler steps have passed, after which it shifts to the learning rate that the optimizer was configured with, i.e., `optim_lr`.
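
The whole schedule is a single threshold test. A minimal plain-Python sketch (`constant_lr` is a hypothetical helper, not a PyTorch function):

```python
def constant_lr(base_lr: float, factor: float, total_iters: int, epoch: int) -> float:
    """Scaled-down rate before `total_iters`, the configured rate afterwards."""
    return base_lr * factor if epoch < total_iters else base_lr


# With factor=1/2 and total_iters=10 the jump happens at epoch 10.
for epoch in (0, 9, 10, 11):
    print(epoch, constant_lr(0.8, 0.5, 10, epoch))
```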

In [6]:
lrsched = partial(ConstantLR, factor=1/2, total_iters=10)
df = explore(lrsched, optim_lr=0.8)

## LinearLR
This changes the learning rate linearly over `total_iters` scheduler steps. It starts at `start_factor` times the learning rate that the optimizer is configured with and ends at `end_factor` times that learning rate. `end_factor` defaults to 1.0, so by default the learning rate ramps up to the value the optimizer is configured with, but I can end at any other multiple of it.

The starting learning rate is therefore derived from the optimizer's configured learning rate and `start_factor` (defaults to 1/3); the number of iters only determines how quickly the ending learning rate is reached.
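
The linear interpolation between the two factors can be written in closed form. A minimal sketch in plain Python (`linear_lr` is a hypothetical helper); the last line reproduces the 0.314 seen at epoch 49 in the table further down.

```python
def linear_lr(base_lr: float, start_factor: float, end_factor: float,
              total_iters: int, epoch: int) -> float:
    """Interpolate the multiplicative factor linearly, then hold it at end_factor."""
    progress = min(epoch, total_iters) / total_iters
    return base_lr * (start_factor + (end_factor - start_factor) * progress)


# Warm-up from 0.4 to 0.8 over 15 epochs, as in the first LinearLR cell.
print(linear_lr(0.8, 0.5, 1.0, 15, 0), linear_lr(0.8, 0.5, 1.0, 15, 15))
# Decay from 1.0 towards 0.3 over 50 epochs, evaluated at the last epoch.
print(round(linear_lr(1.0, 1.0, 0.3, 50, 49), 3))
```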

In [7]:
lrsched = partial(LinearLR, start_factor=0.5, total_iters=15, end_factor=1.0, )
df = explore(lrsched, optim_lr=0.8)

In [8]:
n_epochs = 50
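change_every_step = False
total_iters = n_epochs
lrsched = partial(LinearLR, total_iters=total_iters, start_factor=1., end_factor=0.3)
# Pass the settings explicitly so the variables above actually take effect.
df = explore(lrsched, n_epochs=n_epochs, optim_lr=1., change_every_step=change_every_step)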
df

| | epoch | lr |
|---|---|---|
| 0 | 0 | 1.000 |
| 1 | 0 | 1.000 |
| 2 | 0 | 1.000 |
| 3 | 0 | 1.000 |
| 4 | 0 | 1.000 |
| ... | ... | ... |
| 345 | 49 | 0.314 |
| 346 | 49 | 0.314 |
| 347 | 49 | 0.314 |
| 348 | 49 | 0.314 |


## CosineAnnealingLR
This varies the learning rate along a cosine wave, starting from the learning rate that the optimizer is configured with and going down to a minimum set by $\eta_{min}$, which defaults to $0$. The period of the cosine wave is $2T_{max}$.
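
The curve follows $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{t}{T_{max}}\pi\right)\right)$, where $\eta_{max}$ is the optimizer's configured learning rate. A minimal plain-Python sketch of that formula (`cosine_annealing_lr` is a hypothetical helper):

```python
import math


def cosine_annealing_lr(base_lr: float, eta_min: float, t_max: int, epoch: int) -> float:
    """Closed-form cosine annealing between base_lr and eta_min."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


# One full period is 2 * T_max = 20 epochs: high at 0, low at 10, high again at 20.
for epoch in (0, 5, 10, 15, 20):
    print(epoch, cosine_annealing_lr(0.8, 0.0, 10, epoch))
```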

In [9]:
lrsched = partial(CosineAnnealingLR, eta_min=0, T_max=10)
df = explore(lrsched, optim_lr=0.8)

## CosineAnnealingWarmRestarts
This is based on the [SGDR: Stochastic Gradient Descent with Warm Restarts](https://arxiv.org/abs/1608.03983) paper. 

### Warm Restarts
First, let's understand what "warm restarts" are. Normally, we take a learning rate and slowly decrease it over the entire training run spanning multiple epochs. With a warm restart, we "restart" the learning by suddenly increasing the learning rate back to its initial high value, but without resetting the weights. E.g., we take the learning rate from 0.1 to 0.001 over 20 epochs, but then in the 21st epoch we bump it back up to 0.1 without touching the weights. We end up training with this large learning rate on the already-trained weights.

### Cosine Annealing
The learning rate can be decreased using any function, but in the above paper they used cosine annealing to bring the learning rate down. At each restart it is bumped back up, as before, in a single epoch.
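
For the simple case of equal-length cycles (`T_mult=1`), a restart amounts to taking the epoch modulo the cycle length before applying the cosine formula. A minimal plain-Python sketch under that assumption (`cosine_warm_restarts_lr` is a hypothetical helper; it does not cover `T_mult > 1`):

```python
import math


def cosine_warm_restarts_lr(base_lr: float, eta_min: float, t_0: int, epoch: int) -> float:
    """Cosine annealing that restarts every t_0 epochs (assumes T_mult=1)."""
    t_cur = epoch % t_0  # epochs elapsed since the last restart
    cosine = math.cos(math.pi * t_cur / t_0)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


# The rate decays over epochs 0..19 and jumps back to 0.8 at epoch 20.
for epoch in (0, 10, 19, 20, 21):
    print(epoch, cosine_warm_restarts_lr(0.8, 0.0, 20, epoch))
```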

In [10]:
lrsched = partial(CosineAnnealingWarmRestarts, T_0=20, T_mult=1, eta_min=0)
df = explore(lrsched, optim_lr=0.8)

## CyclicLR
Based on [Cyclical Learning Rates for Training Neural Networks](https://arxiv.org/abs/1506.01186)

This scheduler is supposed to change the learning rate on a per-step basis (i.e., training loop step, **not** scheduler step a.k.a. epoch). The idea is quite simple: it varies the learning rate between `base_lr` and `max_lr` while completely disregarding the learning rate that the optimizer was configured with. There are three scaling modes: `triangular`, `triangular2`, and `exp_range`.
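
The triangular shape follows the formula from the paper. A minimal plain-Python sketch, assuming a symmetric cycle (ramp up and ramp down both take `step_size` steps) and covering only the `triangular` and `triangular2` modes; `cyclic_lr` is a hypothetical helper, not the PyTorch implementation.

```python
def cyclic_lr(base_lr: float, max_lr: float, step_size: int, step: int,
              mode: str = "triangular") -> float:
    """Triangular cyclical learning rate for a symmetric cycle of 2 * step_size steps."""
    cycle = 1 + step // (2 * step_size)          # 1-based index of the current cycle
    x = abs(step / step_size - 2 * cycle + 1)    # distance from the peak, in [0, 1]
    height = (max_lr - base_lr) * max(0.0, 1 - x)
    if mode == "triangular2":
        height /= 2 ** (cycle - 1)               # halve the amplitude on every new cycle
    return base_lr + height


# Peak of the first cycle at step 100, halved peak of the second cycle at step 300.
for step in (0, 100, 200, 300):
    print(step, cyclic_lr(0.01, 0.8, 100, step, mode="triangular2"))
```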

In [11]:
lrsched = partial(
    CyclicLR, 
    base_lr=0.01,
    max_lr=0.8,
    step_size_up=100,
    mode='triangular2',
)
df = explore(lrsched, n_epochs=50, steps_per_epoch=7, optim_lr=1., change_every_step=True)

## OneCycleLR
Based on [Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates](https://arxiv.org/abs/1708.07120).

The basic idea is that throughout the entire training run (across all the epochs) the learning rate goes through one cycle. At the very least I need to specify three things:
  * The maximum learning rate the cycle should go through. This is usually the learning rate the optimizer is configured with.
  * The total number of steps over which the cycle should last. This is given either directly or as the number of epochs together with the number of steps per epoch.
  * The number of steps over which the learning rate should increase. This is given as a fraction of the total cycle (`pct_start`). After this many steps, the scheduler starts to ramp the learning rate down.
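
To make the three settings concrete, here is a minimal plain-Python sketch of a two-phase cosine one-cycle schedule. It assumes the defaults as I understand them from the PyTorch docs (`div_factor=25`, `final_div_factor=1e4`, cosine annealing, no third phase); `one_cycle_lr` is a hypothetical helper and may differ from the library in edge cases.

```python
import math


def one_cycle_lr(max_lr: float, total_steps: int, step: int, pct_start: float = 0.3,
                 div_factor: float = 25.0, final_div_factor: float = 1e4) -> float:
    """Two-phase cosine one-cycle schedule: ramp up to max_lr, then anneal far below the start."""
    initial_lr = max_lr / div_factor          # assumed starting rate
    min_lr = initial_lr / final_div_factor    # assumed final rate
    peak_step = pct_start * total_steps - 1   # last step of the ramp-up phase

    def cos_anneal(start: float, end: float, pct: float) -> float:
        # Moves from `start` (pct=0) to `end` (pct=1) along half a cosine wave.
        return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

    if step <= peak_step:
        return cos_anneal(initial_lr, max_lr, step / peak_step)
    pct = (step - peak_step) / (total_steps - 1 - peak_step)
    return cos_anneal(max_lr, min_lr, pct)


total_steps = 50 * 7  # epochs * steps_per_epoch, as in this notebook
for step in (0, 104, 349):
    print(step, one_cycle_lr(0.8, total_steps, step))
```

Note that the schedule is defined over training loop steps, so the scheduler has to be stepped after every batch rather than after every epoch.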


In [12]:
lrsched = partial(
    OneCycleLR, 
    max_lr = 0.8,
    epochs=50,
    steps_per_epoch=7,
    pct_start=0.3
)

df = explore(lrsched, n_epochs=50, steps_per_epoch=7, optim_lr=1., change_every_step=True)

There are three other learning rate schedulers that I have not discussed:
  * ReduceLROnPlateau
  * LambdaLR
  * MultiplicativeLR

`ReduceLROnPlateau` is pretty straightforward: it takes in the metric value with every call to `step()`. If the metric value has not improved over a few calls to `step()`, it reduces the learning rate. Both `LambdaLR` and `MultiplicativeLR` are flexible decay schedulers where the factor is provided by a lambda that is a function of the current epoch. `LambdaLR` multiplies the optimizer's configured learning rate by that factor, while `MultiplicativeLR` multiplies the current learning rate by it.
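
The difference between `LambdaLR` and `MultiplicativeLR` is easiest to see side by side. A minimal plain-Python sketch of my understanding of the two rules (`lambda_lr` and `multiplicative_lr` are hypothetical helpers, not PyTorch functions): the first scales the initial rate by the lambda's value at the current epoch, the second accumulates the factors from epoch 1 onwards.

```python
def lambda_lr(base_lr: float, lr_lambda, epoch: int) -> float:
    """LambdaLR-style rule: initial rate times the factor for this epoch."""
    return base_lr * lr_lambda(epoch)


def multiplicative_lr(base_lr: float, lr_lambda, epoch: int) -> float:
    """MultiplicativeLR-style rule: running product of the per-epoch factors."""
    lr = base_lr
    for k in range(1, epoch + 1):
        lr *= lr_lambda(k)
    return lr


# A constant factor gives a flat rate under the first rule and exponential decay under the second.
for epoch in range(4):
    print(epoch, lambda_lr(1.0, lambda e: 0.9, epoch), multiplicative_lr(1.0, lambda e: 0.9, epoch))
```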

## Scratch Pad

In [13]:
N_EPOCHS = 3
BATCH_SIZE = 256

n_steps = 45_000//(2*BATCH_SIZE)
lrsched = partial(
    OneCycleLR, 
    max_lr = 0.1,
    epochs=N_EPOCHS,
    steps_per_epoch=n_steps,
    pct_start=0.3
)

df = explore(lrsched, n_epochs=N_EPOCHS, steps_per_epoch=n_steps, optim_lr=1., change_every_step=True)

In [14]:
n_steps

87

In [15]:
n_steps * N_EPOCHS

261