### Finding model parameters using an optimizer
### This notebook is an illustration for chapter 5 of
### Deep Learning with PyTorch by Eli Stevens, Luca Antiga, Thomas Viehmann, Manning 2020

In [None]:
%matplotlib inline
import numpy as np
import torch
torch.set_printoptions(edgeitems=2, linewidth=75)

In [None]:
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0,
                    8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_f = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9,
                    33.9, 21.8, 48.4, 60.4, 68.4])
t_fn = 0.1 * t_f

We keep the same model and loss function:

In [None]:
def model(t_f, w, b):
    return w * t_f + b

In [None]:
def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()

## Provided optimizers
So far, we used vanilla gradient descent for optimization, which worked
fine for our simple case. There are several optimization strategies and
tricks that can assist convergence, especially when models get complicated.
Next, we will introduce the way PyTorch abstracts the optimization strategy
away from user code. This saves us the work of having to update each and
every parameter of our model ourselves. The `torch` module has an `optim`
submodule where we can find classes implementing different optimization
algorithms. Here’s an abridged list:

In [None]:
import torch.optim as optim

dir(optim)

Every optimizer constructor takes a list of parameters (aka PyTorch tensors, typically
with `requires_grad` set to `True`) as the first input. All parameters passed to the optimizer
are retained inside the optimizer object so the optimizer can update their values
and access their `grad` attribute.

Each optimizer exposes two methods: `zero_grad` and `step`. `zero_grad` zeroes the
`grad` attribute of all the parameters passed to the optimizer upon construction. `step`
updates the value of those parameters according to the optimization strategy implemented
by the specific optimizer.
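
As a minimal sketch of the two points above (the names `w` and `opt` are only illustrative), we can check that the optimizer keeps a reference to the very tensor we passed in, and that `zero_grad` clears its gradient. Depending on the PyTorch version, `zero_grad` either sets `grad` to `None` or fills it with zeros, so the check below allows for both.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, 0.0], requires_grad=True)  # illustrative parameter tensor
opt = optim.SGD([w], lr=0.1)

# The optimizer stores the same tensor object, not a copy.
same_object = opt.param_groups[0]['params'][0] is w

# Produce a gradient, then take one step.
loss = (w ** 2).sum()
loss.backward()
grad_before_step = w.grad.clone()   # d(loss)/dw = 2 * w
opt.step()                          # w <- w - lr * grad
w_after_step = w.detach().clone()

# Clear the gradient so it does not accumulate into the next backward call.
opt.zero_grad()
grad_cleared = w.grad is None or bool((w.grad == 0).all())
```

Since `w` itself is modified in place by `step`, any code that holds a reference to `w` sees the updated values without further bookkeeping.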

In [None]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-5
optimizer = optim.SGD([params], lr=learning_rate)

Here SGD stands for stochastic gradient descent. Actually, the optimizer itself is exactly a
vanilla gradient descent (as long as the `momentum` argument is set to 0.0, which is the
default). The term stochastic comes from the fact that the gradient is typically obtained
by averaging over a random subset of all input samples, called a minibatch. However, the
optimizer does not know if the loss was evaluated on all the samples (vanilla) or a random
subset of them (stochastic), so the algorithm is literally the same in the two cases.
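
To make this concrete, here is a small check, assuming the default `momentum=0.0`, that one `optim.SGD` step matches the hand-written update `params - learning_rate * params.grad`. The toy data and the names `x`, `y`, `mse`, `p_opt`, and `p_manual` are made up for illustration.

```python
import torch
import torch.optim as optim

x = torch.tensor([1.0, 2.0, 3.0])   # toy inputs (illustrative)
y = torch.tensor([2.0, 4.0, 6.0])   # toy targets (illustrative)
lr = 1e-2

def mse(p):
    w, b = p
    return ((w * x + b - y) ** 2).mean()

# Update performed by the optimizer.
p_opt = torch.tensor([1.0, 0.0], requires_grad=True)
sgd = optim.SGD([p_opt], lr=lr)
mse(p_opt).backward()
sgd.step()

# The same update written by hand.
p_manual = torch.tensor([1.0, 0.0], requires_grad=True)
mse(p_manual).backward()
with torch.no_grad():
    p_manual -= lr * p_manual.grad

updates_match = torch.allclose(p_opt, p_manual)
```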

In [None]:
t_p = model(t_f, *params)
loss = loss_fn(t_p, t_c)
loss.backward()

optimizer.step()

params

The value of `params` is updated upon calling `step` without us having to touch it ourselves!
What happens is that the optimizer looks into `params.grad` and updates
`params`, subtracting `learning_rate` times `grad` from it, exactly as in our former hand-rolled
code.
When using this code in a training loop, we have to remember to zero out the gradients in every iteration. Otherwise, the gradients
would accumulate in the leaves at every call to `backward`. Below is the loop-ready code, with the extra
`zero_grad` at the correct spot (right before the call to `backward`):

In [None]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)

t_p = model(t_fn, *params)
loss = loss_fn(t_p, t_c)

# The exact placement of this call is somewhat arbitrary. 
# It could be earlier in the loop as well.
optimizer.zero_grad() 
loss.backward()
optimizer.step()

params
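
The accumulation mentioned above is easy to see in isolation. In this sketch (the names `p` and `toy_loss` are illustrative), calling `backward` twice without zeroing doubles the stored gradient, while zeroing in between brings it back to the value of a single pass.

```python
import torch

p = torch.tensor([1.0, 0.0], requires_grad=True)

def toy_loss(p):
    return (p ** 2).sum() + p.sum()   # gradient is 2 * p + 1

toy_loss(p).backward()
single_pass = p.grad.clone()

# Second backward without zeroing: gradients are summed into p.grad.
toy_loss(p).backward()
accumulated = p.grad.clone()

# Zero manually, then backward again: back to the single-pass value.
p.grad.zero_()
toy_loss(p).backward()
after_zeroing = p.grad.clone()
```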

The updated training loop now reads:

In [None]:
def training_loop(n_epochs, optimizer, params, t_f, t_c):
    for epoch in range(1, n_epochs + 1):
        t_p = model(t_f, *params) 
        loss = loss_fn(t_p, t_c)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
            
    return params

In [None]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)  # the optimizer keeps a reference to params

training_loop(
    n_epochs = 5000, 
    optimizer = optimizer,
    params = params,  # must be the same tensor the optimizer was given
    t_f = t_fn,
    t_c = t_c)

## Testing other optimizers
In order to test more optimizers, all we have to do is instantiate a different optimizer,
say `Adam`, instead of `SGD`. The rest of the code stays as it is.
We won’t go into much detail about Adam. It is a more sophisticated
optimizer in which the learning rate is set adaptively. In addition, it is a lot less
sensitive to the scaling of the parameters, so insensitive that we can go back to using the original (non-normalized) input `t_f`, rather than `t_fn`, and even increase the learning rate to 1e-1. Adam will handle it all.

In [None]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-1
optimizer = optim.Adam([params], lr=learning_rate)

training_loop(
    n_epochs = 2000, 
    optimizer = optimizer,
    params = params,
    t_f = t_f, # We’re back to the original t_f as our input.
    t_c = t_c)
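
One way to get a feel for this insensitivity: on its very first update, Adam moves each parameter by an amount close to the learning rate, regardless of how large the gradient is, because the gradient is divided by a running estimate of its own magnitude. The sketch below checks this on two made-up losses whose gradients differ by six orders of magnitude. The helper name `first_step_size` is hypothetical, and the default `betas` and `eps` of `optim.Adam` are assumed.

```python
import torch
import torch.optim as optim

def first_step_size(grad_scale, lr=1e-1):
    """Return how far a single parameter moves on Adam's first step."""
    p = torch.tensor([0.0], requires_grad=True)
    adam = optim.Adam([p], lr=lr)
    loss = (grad_scale * p).sum()    # d(loss)/dp == grad_scale
    loss.backward()
    adam.step()
    return abs(p.item())

step_small_grad = first_step_size(1e-3)
step_large_grad = first_step_size(1e+3)
```

With plain SGD, the same two losses would produce steps that differ by the same factor as the gradients, which is why the normalized input `t_fn` was needed there.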

## Breaking data into training and validation subsets
To sample a smaller validation set from all regions of the original dataset, we usually shuffle the data first:

In [None]:
n_samples = t_f.shape[0]
n_val = int(0.2 * n_samples)

shuffled_indices = torch.randperm(n_samples)

train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

train_indices, val_indices  # random, so the values differ from run to run

In [None]:
train_t_f = t_f[train_indices]
train_t_c = t_c[train_indices]

val_t_f = t_f[val_indices]
val_t_c = t_c[val_indices]

train_t_fn = 0.1 * train_t_f
val_t_fn = 0.1 * val_t_f
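
Because `torch.randperm` draws a new permutation on each call, the split changes from run to run. If a repeatable split is wanted, one option (our assumption, not something the chapter prescribes) is to pass a seeded `torch.Generator`. The helper name `split_indices` and the seed value are illustrative.

```python
import torch

def split_indices(n_samples, val_fraction=0.2, seed=0):
    """Shuffle indices reproducibly and split them into train/validation."""
    n_val = int(val_fraction * n_samples)
    # Note: assumes n_val >= 1; with n_val == 0 the slices below would
    # put every sample in the validation set.
    gen = torch.Generator().manual_seed(seed)
    shuffled = torch.randperm(n_samples, generator=gen)
    return shuffled[:-n_val], shuffled[-n_val:]

train_idx, val_idx = split_indices(11)
train_idx_again, val_idx_again = split_indices(11)
```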

Inside the training loop we also calculate the loss on the validation data. Notice that we do not backpropagate through the validation loss and we do not perform the `step` operation based on it. Training is done only on the training data.

In [None]:
def training_loop(n_epochs, optimizer, params, train_t_f, val_t_f,
                  train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
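        # Same computation for both subsets; only the inputs differ.
        train_t_p = model(train_t_f, *params)
        train_loss = loss_fn(train_t_p, train_t_c)

        val_t_p = model(val_t_f, *params)
        val_loss = loss_fn(val_t_p, val_t_c)

        optimizer.zero_grad()
        # Only train_loss is backpropagated; val_loss is just monitored.
        train_loss.backward()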
        optimizer.step()

        if epoch <= 3 or epoch % 500 == 0:
            print(f"Epoch {epoch}, Training loss {train_loss.item():.4f},"
                  f" Validation loss {val_loss.item():.4f}")
            
    return params

During training we monitor both the training and the validation loss. If the validation loss stops falling, or starts rising, while the training loss keeps decreasing, the model is overfitting the training data.

In [None]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)

training_loop(
    n_epochs = 3000, 
    optimizer = optimizer,
    params = params,
    train_t_f = train_t_fn, # Since we’re using SGD again, we’re
    val_t_f = val_t_fn, # back to using normalized inputs.
    train_t_c = train_t_c,
    val_t_c = val_t_c)

Here we are not being entirely fair to our model. The validation set is really small, so
the validation loss will only be meaningful up to a point. In any case, we note that the
validation loss is higher than our training loss, although not by an order of magnitude.
We expect a model to perform better on the training set, since the model
parameters are being shaped by the training set. Our main goal is to see both the
training loss and the validation loss decreasing. While ideally both losses would be
roughly the same value, as long as the validation loss stays reasonably close to the
training loss, we know that our model is continuing to learn generalized things about
our data.

## Turning autograd off on validation data
Since we’re not ever calling `backward` on `val_loss`, we could in fact
just call `model` and `loss_fn` as plain functions, without tracking the
computation through autograd.
However optimized, building the autograd graph comes with additional costs that we
could totally forgo during the validation pass, especially when the model has millions
of parameters.
In order to address this, PyTorch allows us to switch off autograd when we don’t
need it, using the `torch.no_grad` context manager. We won’t see any meaningful
advantage in terms of speed or memory consumption on our small problem. However,
for larger models, the differences can add up. We can make sure this works by
checking the value of the `requires_grad` attribute on the `val_loss` tensor:

In [None]:
def training_loop(n_epochs, optimizer, params, train_t_f, val_t_f,
                  train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        train_t_p = model(train_t_f, *params)
        train_loss = loss_fn(train_t_p, train_t_c)

        # Run the validation pass without recording the autograd graph.
        with torch.no_grad():
            val_t_p = model(val_t_f, *params)
            val_loss = loss_fn(val_t_p, val_t_c)
            # Check that requires_grad is False for the validation loss.
            assert not val_loss.requires_grad
            
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()
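
In isolation, the effect of the context manager looks like this (a small sketch with illustrative names): the same expression produces a tensor that tracks gradients outside the `with` block and one that does not inside it.

```python
import torch

p = torch.tensor([1.0, 0.0], requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

tracked = (p[0] * x + p[1]).mean()        # recorded in the autograd graph

with torch.no_grad():
    untracked = (p[0] * x + p[1]).mean()  # no graph is recorded here

tracked_flag = tracked.requires_grad
untracked_flag = untracked.requires_grad
untracked_has_grad_fn = untracked.grad_fn is not None
```

The two results hold the same value; only the bookkeeping needed for `backward` is skipped for the second one.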

Using the related `set_grad_enabled` context, we can also condition the code to run
with autograd enabled or disabled, according to a Boolean expression, typically indicating
whether we are running in training or inference mode. We could, for instance,
define a `calc_forward` function that takes data as input and runs `model` and `loss_fn`
with or without autograd according to a Boolean `is_train` argument:

In [None]:
def calc_forward(t_f, t_c, is_train):
    # Note: relies on `params` being defined in the enclosing (notebook) scope.
    with torch.set_grad_enabled(is_train):
        t_p = model(t_f, *params)
        loss = loss_fn(t_p, t_c)
    return loss
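
A possible way to use such a function is sketched here. In this variant `params` is passed explicitly rather than read from the enclosing scope; that signature change and the toy data are our assumptions, not part of the chapter's code.

```python
import torch

def model(t_f, w, b):
    return w * t_f + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

def calc_forward(t_f, t_c, params, is_train):
    # Autograd is recorded only when is_train is True.
    with torch.set_grad_enabled(is_train):
        t_p = model(t_f, *params)
        loss = loss_fn(t_p, t_c)
    return loss

params = torch.tensor([1.0, 0.0], requires_grad=True)
t_f = torch.tensor([3.0, 5.0])   # toy inputs (illustrative)
t_c = torch.tensor([1.0, 2.0])   # toy targets (illustrative)

train_loss = calc_forward(t_f, t_c, params, is_train=True)
val_loss = calc_forward(t_f, t_c, params, is_train=False)

train_loss.backward()            # works: the graph was recorded
```

Calling `val_loss.backward()` here would raise an error, since no graph was recorded for it. That is the intended behavior for a validation pass.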