<a href="https://colab.research.google.com/github/arkeodev/pytorch-tutorial/blob/main/pytorch_lr_schedulers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning Rate Schedulers

## Definition

A learning rate schedule adjusts the learning rate during training, which controls how large each weight update is. A common pattern is to start with a relatively large rate for fast early progress and lower it later so that the model can settle into a good solution. This can speed up convergence and often improves the final performance of the model.

PyTorch provides several built-in schedulers through its `torch.optim.lr_scheduler` module. Here's a general approach to define and use a learning rate scheduler in PyTorch:

1. Choose a Learning Rate Scheduler: First, decide which scheduler fits your needs. PyTorch offers several options, such as `StepLR`, `MultiStepLR`, `ExponentialLR`, `ReduceLROnPlateau`, and `CosineAnnealingLR`, among others. Each scheduler adjusts the learning rate according to its specific strategy.

2. Define Optimizer: A scheduler needs an optimizer to work on. It adjusts the learning rate of that optimizer's parameter groups.

3. Instantiate the Scheduler: After defining the optimizer, create an instance of the scheduler by passing it the optimizer and the other parameters specific to the scheduler's strategy.

4. Update Learning Rate During Training: In the training loop, call `scheduler.step()` after `optimizer.step()` to update the learning rate according to the scheduler's strategy. How often to call it depends on the scheduler: some are designed to be stepped once per epoch, while others are stepped once per batch.
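The samples in the next section all step the scheduler once per epoch. As a contrast, here is a minimal sketch of the per-batch case. It uses `OneCycleLR`, which is not covered in this tutorial and is picked here only because it is meant to be stepped after every batch. The values of `num_epochs` and `batches_per_epoch` are made up to keep the run short, and `cycle_momentum=False` is set so that the sketch changes nothing but the learning rate.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Hypothetical sizes, chosen only to keep the example short
num_epochs = 3
batches_per_epoch = 4

# OneCycleLR needs to know the total number of steps up front
scheduler = lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    epochs=num_epochs,
    steps_per_epoch=batches_per_epoch,
    cycle_momentum=False,  # leave momentum untouched in this sketch
)

lr_history = []
for epoch in range(num_epochs):
    for batch in range(batches_per_epoch):
        # Dummy training step on random data
        optimizer.zero_grad()
        loss = model(torch.randn(5, 10)).sum()
        loss.backward()
        optimizer.step()

        lr_history.append(scheduler.get_last_lr()[0])
        scheduler.step()  # called once per batch, not once per epoch

print(f"Recorded {len(lr_history)} learning rates, one per batch")
```

The only structural difference from the per-epoch samples is where `scheduler.step()` sits: inside the inner batch loop instead of at the end of each epoch.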

## Samples

### StepLR



The `StepLR` scheduler decays the learning rate of each parameter group by a factor of `gamma` every `step_size` epochs. It's a simple yet effective way to decrease the learning rate over time.

$$
LR_{t} = LR_{0} \cdot \gamma^{\left\lfloor\frac{epoch}{step\_size}\right\rfloor}
$$

- **$(LR_{t})$**: The learning rate at epoch $(t)$.
- **$(LR_{0})$**: The initial learning rate set at the beginning of training.
- **$(\gamma)$**: The factor by which the learning rate is multiplied at each step. It's a value between 0 and 1. A smaller value decays the learning rate more.
- **$(epoch)$**: The current epoch number during the training process.
- **$(step\_size)$**: The frequency, in epochs, with which to multiply the learning rate by $(\gamma)$.
- **$(\left\lfloor\frac{epoch}{step\_size}\right\rfloor)$**: This denotes the floor division between the current epoch and the step size, effectively counting how many times the learning rate should have been updated. The learning rate is updated every time the division result increases.

The formula describes how the learning rate decreases in discrete steps. Every $(step\_size)$ epochs, the learning rate is multiplied by a factor of $(\gamma)$, leading to a piecewise constant decay schedule.
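As a quick check of the formula, the closed form can be evaluated directly in plain Python. The helper name `step_lr` is made up for this sketch and is not part of PyTorch. The values `lr0=0.1`, `gamma=0.1`, and `step_size=10` are the ones used in the PyTorch sample in this section, and `epoch` is assumed to be zero-based.

```python
def step_lr(lr0: float, gamma: float, step_size: int, epoch: int) -> float:
    """Closed-form StepLR value for a zero-based epoch (illustrative helper)."""
    return lr0 * gamma ** (epoch // step_size)


# Evaluate just before and just after each decay point
for epoch in (0, 9, 10, 19, 20, 29):
    print(f"epoch {epoch:2d}: lr = {step_lr(0.1, 0.1, 10, epoch):.2e}")
```

With these values the rate changes at zero-based epochs 10 and 20, which correspond to the lines labelled "Epoch 11" and "Epoch 21" in the sample output, because the sample prints `epoch+1`.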

In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Sample model
model = nn.Linear(10, 2)

# Optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Scheduler
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Training loop
for epoch in range(30):
    # Dummy training step
    optimizer.zero_grad()
    output = model(torch.randn(5, 10))
    loss = output.sum()
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch+1}, Current LR: {scheduler.get_last_lr()[0]:.2e}")

    scheduler.step()

Epoch 1, Current LR: 1.00e-01
Epoch 2, Current LR: 1.00e-01
Epoch 3, Current LR: 1.00e-01
Epoch 4, Current LR: 1.00e-01
Epoch 5, Current LR: 1.00e-01
Epoch 6, Current LR: 1.00e-01
Epoch 7, Current LR: 1.00e-01
Epoch 8, Current LR: 1.00e-01
Epoch 9, Current LR: 1.00e-01
Epoch 10, Current LR: 1.00e-01
Epoch 11, Current LR: 1.00e-02
Epoch 12, Current LR: 1.00e-02
Epoch 13, Current LR: 1.00e-02
Epoch 14, Current LR: 1.00e-02
Epoch 15, Current LR: 1.00e-02
Epoch 16, Current LR: 1.00e-02
Epoch 17, Current LR: 1.00e-02
Epoch 18, Current LR: 1.00e-02
Epoch 19, Current LR: 1.00e-02
Epoch 20, Current LR: 1.00e-02
Epoch 21, Current LR: 1.00e-03
Epoch 22, Current LR: 1.00e-03
Epoch 23, Current LR: 1.00e-03
Epoch 24, Current LR: 1.00e-03
Epoch 25, Current LR: 1.00e-03
Epoch 26, Current LR: 1.00e-03
Epoch 27, Current LR: 1.00e-03
Epoch 28, Current LR: 1.00e-03
Epoch 29, Current LR: 1.00e-03
Epoch 30, Current LR: 1.00e-03


### MultiStepLR

`MultiStepLR` decays the learning rate of each parameter group by `gamma` once the number of epochs reaches one of the milestones. It offers more flexibility than `StepLR` by allowing for non-uniform step sizes.

$$
LR_{t} = LR_{0} \cdot \gamma^{\sum_{m \in milestones} \mathbb{1}_{epoch \geq m}}
$$

- **$(\sum_{m \in milestones} \mathbb{1}_{epoch \geq m})$**: This summation counts the number of milestones that have been reached. The function $(\mathbb{1}_{epoch \geq m})$ is an indicator function that returns 1 if the current (zero-based) epoch is greater than or equal to the milestone $(m)$ and 0 otherwise.
- **$(milestones)$**: A set of epoch numbers at which the learning rate should be decreased.

Similar to `StepLR`, `MultiStepLR` reduces the learning rate by multiplying it with $(\gamma)$, but it does so each time the current epoch reaches one of the predefined milestones, allowing for more flexibility compared to the regular step decay.
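The milestone count can be sketched in plain Python with `bisect_right`, which returns how many milestones are less than or equal to the current zero-based epoch. The helper name `multistep_lr` is hypothetical and not part of PyTorch. The values match the PyTorch sample in this section, where the printed rate first drops on the line labelled "Epoch 16", that is, at zero-based epoch 15.

```python
from bisect import bisect_right


def multistep_lr(lr0, gamma, milestones, epoch):
    """Closed-form MultiStepLR value for a zero-based epoch (illustrative helper)."""
    # Number of milestones m with m <= epoch
    reached = bisect_right(sorted(milestones), epoch)
    return lr0 * gamma ** reached


# Evaluate just before and at each milestone
for epoch in (14, 15, 24, 25):
    print(f"epoch {epoch}: lr = {multistep_lr(0.1, 0.1, [15, 25], epoch):.2e}")
```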

In [11]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Sample model
model = nn.Linear(10, 2)

# Optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Scheduler
scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[15, 25], gamma=0.1)

# Training loop
for epoch in range(30):
    # Dummy training step
    optimizer.zero_grad()
    output = model(torch.randn(5, 10))
    loss = output.sum()
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch+1}, Current LR: {scheduler.get_last_lr()[0]:.2e}")

    scheduler.step()

Epoch 1, Current LR: 1.00e-01
Epoch 2, Current LR: 1.00e-01
Epoch 3, Current LR: 1.00e-01
Epoch 4, Current LR: 1.00e-01
Epoch 5, Current LR: 1.00e-01
Epoch 6, Current LR: 1.00e-01
Epoch 7, Current LR: 1.00e-01
Epoch 8, Current LR: 1.00e-01
Epoch 9, Current LR: 1.00e-01
Epoch 10, Current LR: 1.00e-01
Epoch 11, Current LR: 1.00e-01
Epoch 12, Current LR: 1.00e-01
Epoch 13, Current LR: 1.00e-01
Epoch 14, Current LR: 1.00e-01
Epoch 15, Current LR: 1.00e-01
Epoch 16, Current LR: 1.00e-02
Epoch 17, Current LR: 1.00e-02
Epoch 18, Current LR: 1.00e-02
Epoch 19, Current LR: 1.00e-02
Epoch 20, Current LR: 1.00e-02
Epoch 21, Current LR: 1.00e-02
Epoch 22, Current LR: 1.00e-02
Epoch 23, Current LR: 1.00e-02
Epoch 24, Current LR: 1.00e-02
Epoch 25, Current LR: 1.00e-02
Epoch 26, Current LR: 1.00e-03
Epoch 27, Current LR: 1.00e-03
Epoch 28, Current LR: 1.00e-03
Epoch 29, Current LR: 1.00e-03
Epoch 30, Current LR: 1.00e-03


### ExponentialLR

`ExponentialLR` decays the learning rate of each parameter group by a factor of `gamma` every epoch. It provides a smooth, exponential decrease in the learning rate.

$$
LR_{t} = LR_{0} \cdot \gamma^{epoch}
$$

In this formula:

- **$(\gamma^{epoch})$**: The decay factor raised to the power of the current epoch, so the learning rate is multiplied by $(\gamma)$ once per epoch.

Because the rate shrinks by the same ratio every epoch, `ExponentialLR` decays gradually instead of in occasional large steps, which makes it suitable for fine-grained adjustments over the course of training.
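To get a feel for how fast a given `gamma` decays, the closed form can be evaluated directly, and the number of epochs needed to halve the rate follows from solving $\gamma^{n} \leq 0.5$ for $n$. Both helper names below are made up for this sketch. It assumes `0 < gamma < 1` and a zero-based `epoch`.

```python
import math


def exponential_lr(lr0: float, gamma: float, epoch: int) -> float:
    """Closed-form ExponentialLR value for a zero-based epoch (illustrative helper)."""
    return lr0 * gamma ** epoch


def epochs_to_halve(gamma: float) -> int:
    """Smallest whole number of epochs after which the rate is at most half."""
    return math.ceil(math.log(0.5) / math.log(gamma))


print(f"lr after 29 decays: {exponential_lr(0.1, 0.95, 29):.2e}")
print(f"epochs to halve with gamma=0.95: {epochs_to_halve(0.95)}")
```

With `gamma=0.95`, halving takes 14 epochs, which agrees with the sample output: the rate first falls below `5.00e-02` on the line labelled "Epoch 15" (zero-based epoch 14).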

In [12]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Sample model
model = nn.Linear(10, 2)

# Optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Scheduler
scheduler = lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# Training loop
for epoch in range(30):
    # Dummy training step
    optimizer.zero_grad()
    output = model(torch.randn(5, 10))
    loss = output.sum()
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch+1}, Current LR: {scheduler.get_last_lr()[0]:.2e}")

    scheduler.step()

Epoch 1, Current LR: 1.00e-01
Epoch 2, Current LR: 9.50e-02
Epoch 3, Current LR: 9.02e-02
Epoch 4, Current LR: 8.57e-02
Epoch 5, Current LR: 8.15e-02
Epoch 6, Current LR: 7.74e-02
Epoch 7, Current LR: 7.35e-02
Epoch 8, Current LR: 6.98e-02
Epoch 9, Current LR: 6.63e-02
Epoch 10, Current LR: 6.30e-02
Epoch 11, Current LR: 5.99e-02
Epoch 12, Current LR: 5.69e-02
Epoch 13, Current LR: 5.40e-02
Epoch 14, Current LR: 5.13e-02
Epoch 15, Current LR: 4.88e-02
Epoch 16, Current LR: 4.63e-02
Epoch 17, Current LR: 4.40e-02
Epoch 18, Current LR: 4.18e-02
Epoch 19, Current LR: 3.97e-02
Epoch 20, Current LR: 3.77e-02
Epoch 21, Current LR: 3.58e-02
Epoch 22, Current LR: 3.41e-02
Epoch 23, Current LR: 3.24e-02
Epoch 24, Current LR: 3.07e-02
Epoch 25, Current LR: 2.92e-02
Epoch 26, Current LR: 2.77e-02
Epoch 27, Current LR: 2.64e-02
Epoch 28, Current LR: 2.50e-02
Epoch 29, Current LR: 2.38e-02
Epoch 30, Current LR: 2.26e-02


### ReduceLROnPlateau

`ReduceLROnPlateau` reduces the learning rate when a monitored metric has stopped improving, which makes it a natural fit when a validation metric is tracked during training.

Its principle can be summarized as follows:

- If the metric (e.g., validation loss) does not improve for more than $(patience)$ consecutive epochs, the learning rate is multiplied by $(factor)$.
- Unlike the other schedulers in this tutorial, its `step` method takes the metric value as an argument, e.g. `scheduler.step(val_loss)`.

Lowering the learning rate when progress stalls lets the optimizer make finer adjustments as the model approaches convergence.
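The patience logic can be sketched in a few lines of plain Python. This is a simplified illustration, not PyTorch's implementation: the class name `PlateauSketch` is hypothetical, it handles only the "lower is better" case, and it ignores the `threshold`, `cooldown`, and `min_lr` options of the real scheduler.

```python
class PlateauSketch:
    """Simplified 'reduce on plateau' bookkeeping (illustrative only)."""

    def __init__(self, lr: float, factor: float = 0.1, patience: int = 10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")  # best (lowest) metric seen so far
        self.bad_epochs = 0       # consecutive epochs without improvement

    def step(self, metric: float) -> float:
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1

        # Reduce only once the patience budget is exceeded
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


sketch = PlateauSketch(lr=0.1, factor=0.1, patience=2)
history = [sketch.step(m) for m in [1.0, 0.9, 0.95, 0.95, 0.95, 0.8]]
print([f"{lr:.2e}" for lr in history])
```

In this run the metric stops improving after the second value. The first two stalled epochs are tolerated because `patience=2`, and the rate is cut on the third stalled epoch.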

In [15]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler
import numpy as np

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleModel()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Define ReduceLROnPlateau scheduler
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, 'min', factor=0.1, patience=10)

# Mock training and validation loop
num_epochs = 30
for epoch in range(num_epochs):
    # Simulate a training step. In real training, this is where you would run:
    #     optimizer.zero_grad()
    #     loss.backward()
    #     optimizer.step()
    model.train()

    # Simulated training loss (decreases)
    train_loss = 1.0 / (0.1 * epoch + 1)

    # Simulate a validation step
    model.eval()

    # Simulated validation loss (oscillates sinusoidally around 0.5)
    val_loss = 0.5 + 0.05 * np.sin(epoch)

    # Print current epoch, training loss, validation loss, and current learning rate
    print(f"Epoch {epoch+1}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}, Current LR: {scheduler.optimizer.param_groups[0]['lr']:.2e}")

    # Update the learning rate based on validation loss
    scheduler.step(val_loss)

# Note: This example is entirely fictional and designed to demonstrate how to use the ReduceLROnPlateau scheduler.
# The losses are simulated and do not reflect actual training and validation processes.

Epoch 1, Train Loss: 1.0000, Val Loss: 0.5000, Current LR: 1.00e-01
Epoch 2, Train Loss: 0.9091, Val Loss: 0.5421, Current LR: 1.00e-01
Epoch 3, Train Loss: 0.8333, Val Loss: 0.5455, Current LR: 1.00e-01
Epoch 4, Train Loss: 0.7692, Val Loss: 0.5071, Current LR: 1.00e-01
Epoch 5, Train Loss: 0.7143, Val Loss: 0.4622, Current LR: 1.00e-01
Epoch 6, Train Loss: 0.6667, Val Loss: 0.4521, Current LR: 1.00e-01
Epoch 7, Train Loss: 0.6250, Val Loss: 0.4860, Current LR: 1.00e-01
Epoch 8, Train Loss: 0.5882, Val Loss: 0.5328, Current LR: 1.00e-01
Epoch 9, Train Loss: 0.5556, Val Loss: 0.5495, Current LR: 1.00e-01
Epoch 10, Train Loss: 0.5263, Val Loss: 0.5206, Current LR: 1.00e-01
Epoch 11, Train Loss: 0.5000, Val Loss: 0.4728, Current LR: 1.00e-01
Epoch 12, Train Loss: 0.4762, Val Loss: 0.4500, Current LR: 1.00e-01
Epoch 13, Train Loss: 0.4545, Val Loss: 0.4732, Current LR: 1.00e-01
Epoch 14, Train Loss: 0.4348, Val Loss: 0.5210, Current LR: 1.00e-01
Epoch 15, Train Loss: 0.4167, Val Loss: 0.5

### CosineAnnealingLR

`CosineAnnealingLR` provides a cosine annealing schedule, where the learning rate decreases following a cosine curve between an initial lr set by the optimizer and a minimum lr, over a given number of epochs.

$$
LR_{t} = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{T_{cur}}{T_{max}}\pi))
$$

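- **$(\eta_{min})$** and **$(\eta_{max})$**: The minimum and maximum learning rates. $(\eta_{max})$ is the initial learning rate set in the optimizer, and $(\eta_{min})$ is the scheduler's `eta_min` argument (0 by default).
- **$(T_{cur})$**: The current epoch.
- **$(T_{max})$**: The number of epochs over which the learning rate is annealed from $(\eta_{max})$ to $(\eta_{min})$.

This scheduler decreases the learning rate along half a cosine wave, from $(\eta_{max})$ at epoch 0 to $(\eta_{min})$ at epoch $(T_{max})$. It does not restart on its own. For a schedule that periodically resets to $(\eta_{max})$, PyTorch provides `CosineAnnealingWarmRestarts`.

The cosine shape changes the learning rate slowly at the start and end of the schedule and fastest in the middle, giving larger updates early for exploration and progressively smaller ones as training converges. In the example below, `T_max=50` while training runs for only 30 epochs, so the learning rate does not reach $(\eta_{min})$.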
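The cosine formula can be evaluated directly to see where the rate sits at a given epoch. The helper name `cosine_annealing_lr` is made up for this sketch and is not part of PyTorch. The values `eta_max=0.1`, `eta_min=0.0`, and `t_max=50` mirror the PyTorch sample in this section, and `t_cur` is assumed to be the zero-based epoch.

```python
import math


def cosine_annealing_lr(eta_max: float, eta_min: float, t_max: int, t_cur: int) -> float:
    """Closed-form cosine annealing value (illustrative helper)."""
    cosine = math.cos(math.pi * t_cur / t_max)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + cosine)


# Start, midpoint, and end of one annealing run
for t_cur in (0, 25, 50):
    print(f"T_cur {t_cur:2d}: lr = {cosine_annealing_lr(0.1, 0.0, 50, t_cur):.2e}")
```

At the midpoint `t_cur=25` the rate is exactly halfway between the two bounds, which matches the `5.00e-02` printed on the line labelled "Epoch 26" in the sample output.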

In [16]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Sample model
model = nn.Linear(10, 2)

# Optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Scheduler
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Training loop
for epoch in range(30):
    # Dummy training step
    optimizer.zero_grad()
    output = model(torch.randn(5, 10))
    loss = output.sum()
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch+1}, Current LR: {scheduler.get_last_lr()[0]:.2e}")

    scheduler.step()

Epoch 1, Current LR: 1.00e-01
Epoch 2, Current LR: 9.99e-02
Epoch 3, Current LR: 9.96e-02
Epoch 4, Current LR: 9.91e-02
Epoch 5, Current LR: 9.84e-02
Epoch 6, Current LR: 9.76e-02
Epoch 7, Current LR: 9.65e-02
Epoch 8, Current LR: 9.52e-02
Epoch 9, Current LR: 9.38e-02
Epoch 10, Current LR: 9.22e-02
Epoch 11, Current LR: 9.05e-02
Epoch 12, Current LR: 8.85e-02
Epoch 13, Current LR: 8.64e-02
Epoch 14, Current LR: 8.42e-02
Epoch 15, Current LR: 8.19e-02
Epoch 16, Current LR: 7.94e-02
Epoch 17, Current LR: 7.68e-02
Epoch 18, Current LR: 7.41e-02
Epoch 19, Current LR: 7.13e-02
Epoch 20, Current LR: 6.84e-02
Epoch 21, Current LR: 6.55e-02
Epoch 22, Current LR: 6.24e-02
Epoch 23, Current LR: 5.94e-02
Epoch 24, Current LR: 5.63e-02
Epoch 25, Current LR: 5.31e-02
Epoch 26, Current LR: 5.00e-02
Epoch 27, Current LR: 4.69e-02
Epoch 28, Current LR: 4.37e-02
Epoch 29, Current LR: 4.06e-02
Epoch 30, Current LR: 3.76e-02


## Conclusion

Each scheduler is suited to different scenarios and requirements. Experimentation is key to finding the most effective scheduler and parameters for the specific problem and dataset.