In [None]:
#r "nuget: TorchSharp-cpu"

open TorchSharp
open type TorchSharp.torch
open type TorchSharp.TensorExtensionMethods
open type TorchSharp.torch.distributions

# Training with a Learning Rate Scheduler

In Tutorial 6, we saw that the optimizers take an argument called the 'learning rate,' but we didn't spend much time on it except to say that it can have a great impact on how quickly training converges toward a solution. In fact, you can choose the learning rate (LR) so poorly that training doesn't converge at all.

If the LR is too small, training will go very slowly, wasting compute resources. If it is too large, the updates can overshoot and training can diverge, ending in numeric overflow or NaNs. Either way, you're in trouble.

To further complicate matters, it turns out that the learning rate shouldn't necessarily be constant. Training can go much better if the learning rate starts out relatively large and gets smaller as you get closer to the end.

There's a solution for this, called a Learning Rate Scheduler (LRS). An LRS instance has access to the internal state of the optimizer, and can modify the LR as training goes along. There are several algorithms for scheduling, of which TorchSharp currently implements a significant subset.

Before demonstrating, let's have a model and a baseline training loop.

In [None]:
type Trivial() as this = 
    inherit nn.Module<Tensor,Tensor>("Trivial")

    let lin1 = nn.Linear(1000L, 100L)
    let lin2 = nn.Linear(100L, 10L)

    do
        this.RegisterComponents()

    override _.forward(input) = 
    
        use x = lin1.forward(input)
        use y = nn.functional.relu(x)
        lin2.forward(y)

To demonstrate how to correctly use an LR scheduler, our training data needs to look more like real training data, that is, it needs to be divided into batches.

In [None]:
let learning_rate = 0.01
let model = Trivial()

let data = [for i = 1 to 16 do rand(32,1000)]  // Our pretend input data
let result = [for i = 1 to 16 do rand(32,10)]  // Our pretend ground truth.

let loss x y = nn.functional.mse_loss(x,y)

let optimizer = torch.optim.SGD(model.parameters(), learning_rate)

for epoch = 1 to 300 do

    for idx = 0 to data.Length-1 do
        // Compute the loss
        let pred = model.forward(data.[idx])
        let output = loss pred result.[idx]

        // Clear the gradients before doing the back-propagation
        model.zero_grad()

        // Do back-propagation, which computes all the gradients.
        output.backward()

        optimizer.step() |> ignore

let pred = model.forward(data.[0])
(loss pred result.[0]).item<single>()

When I ran this, the loss was down to 0.051 after 3 seconds. (It took longer the first time around.)

## StepLR

StepLR multiplies the learning rate by a constant factor (0.95 in the code below) once every fixed number of epochs (25 in the code below). The difference it makes to the training loop is that you pass the optimizer to the scheduler when you create it, and then call `step` on the scheduler (once per epoch) as well as on the optimizer (once per batch).
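
As a quick reference, the step-decay rule can be written in closed form. This is a sketch that assumes TorchSharp follows the same rule as PyTorch's `StepLR`; the symbols $\text{lr}_0$ (initial learning rate), $s$ (number of epochs between decays), $\gamma$ (decay factor), and $e$ (number of scheduler steps taken so far) are notation introduced here, not names from the library.

$$
\text{lr}(e) = \text{lr}_0 \cdot \gamma^{\lfloor e / s \rfloor}
$$

With the values used in this tutorial ($\text{lr}_0 = 0.01$, $s = 25$, $\gamma = 0.95$), the rate has been decayed 12 times after 300 scheduler steps:

$$
\text{lr}(300) = 0.01 \cdot 0.95^{12} \approx 0.0054
$$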

In [None]:
let learning_rate = 0.01
let model = Trivial()

let data = [for i = 1 to 16 do rand(32,1000)]  // Our pretend input data
let result = [for i = 1 to 16 do rand(32,10)]  // Our pretend ground truth.

let loss x y = nn.functional.mse_loss(x,y)

let optimizer = torch.optim.SGD(model.parameters(), learning_rate)
let scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 25, 0.95, verbose=true)

for epoch = 1 to 300 do

    for idx = 0 to data.Length-1 do
        // Compute the loss
        let pred = model.forward(data.[idx])
        let output = loss pred result.[idx]

        // Clear the gradients before doing the back-propagation
        model.zero_grad()

        // Do back-propagation, which computes all the gradients.
        output.backward()

        optimizer.step() |> ignore

    scheduler.step() |> ignore

let pred = model.forward(data.[0])
(loss pred result.[0]).item<single>()

Well, that was underwhelming. The loss (in my case) went up a bit, so that's nothing to get excited about. For this trivial model, using a scheduler isn't going to make a huge difference, and it may not make much of a difference even for complex models. It's very hard to know until you try it, but now you know how to try it out. If you run this trivial example over and over, you will see that the results vary quite a bit, since both the data and the initial weights are random. It's simply too simple.

Regardless, you can see from the verbose output that the learning rate is adjusted as the epochs proceed.
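
If you want something to compare the verbose output against, the expected rate can be computed in plain F# without touching the optimizer. This is a minimal sketch, assuming the scheduler multiplies the rate by the decay factor once every 25 scheduler steps, as PyTorch's `StepLR` does. The helper `expectedStepLR` and its parameter names are hypothetical and are not part of TorchSharp.

```fsharp
// Hypothetical helper (not a TorchSharp API): the learning rate expected
// after `epoch` scheduler steps, assuming step decay by `gamma` every
// `stepSize` steps.
let expectedStepLR (baseLr: float) (stepSize: int) (gamma: float) (epoch: int) =
    // Integer division counts how many decays have been applied so far.
    baseLr * (gamma ** float (epoch / stepSize))

// Print the expected rate just before and after a few decay boundaries,
// using the same settings as the training cell (0.01, 25, 0.95).
for epoch in [0; 24; 25; 50; 299; 300] do
    printfn "epoch %3d: lr = %.6f" epoch (expectedStepLR 0.01 25 0.95 epoch)
```

The rate should stay flat within each block of 25 epochs and drop by 5% at each boundary. If the verbose output shows a different pattern, the assumption above does not hold for your TorchSharp version.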