# PyTorch Autograd

torch.autograd is PyTorch’s automatic differentiation engine that powers neural network training. In this section, you will get a conceptual understanding of how autograd helps a neural network train.

Neural networks (NNs) are a collection of nested functions that are executed on some input data. These functions are defined by parameters (consisting of weights and biases), which in PyTorch are stored in tensors.

Training a NN happens in two steps:

**Forward Propagation**: In forward prop, the NN makes its best guess about the correct output. It runs the input data through each of its functions to make this guess.

**Backward Propagation**: In backprop, the NN adjusts its parameters proportionate to the error in its guess. It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (gradients), and optimizing the parameters using gradient descent. For a more detailed walkthrough of backprop, check out this video from 3Blue1Brown.
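
The two steps above can be sketched on a toy problem. This is a minimal illustration, not part of the original walkthrough: the target line `y = 2x + 1`, the learning rate, and the number of steps are assumptions chosen so the loop converges quickly, and the update is written by hand rather than with an optimizer.

```python
import torch

torch.manual_seed(0)  # make the toy run reproducible

# Toy data for the hypothetical target y = 2x + 1
x = torch.linspace(-1.0, 1.0, 20).unsqueeze(1)
y = 2.0 * x + 1.0

# Parameters (a weight and a bias) stored in tensors
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr = 0.1  # assumed learning rate
losses = []
for step in range(200):
    pred = x * w + b                 # forward propagation: make a guess
    loss = ((pred - y) ** 2).mean()  # error of the guess
    loss.backward()                  # backward propagation: fills w.grad and b.grad
    with torch.no_grad():            # gradient descent update, not tracked by autograd
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()                   # reset gradients before the next iteration
    b.grad.zero_()
    losses.append(loss.item())

print(w.item(), b.item())
```

After the loop, `w` and `b` should sit close to 2 and 1, and `losses` should shrink over the iterations.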

## Neural Networks demonstration in PyTorch

In [21]:
# load pretrained model
import torch
from torchvision.models import resnet18, ResNet18_Weights
# create a model
model = resnet18(weights=ResNet18_Weights.DEFAULT)
# generate a random data tensor to represent a single image with 3 channels and 64x64 pixels
data = torch.randn(1, 3, 64, 64)
# and random values to stand in for its labels (one value per output class)
labels = torch.rand(1,1000)

Next, we run the input data through each layer of the model to make a prediction. This is the forward pass.

In [22]:
prediction = model(data) # forward pass

Next we will do the following:
- compare the predictions and the corresponding labels to calculate error or loss
- backpropagate the error through the network via .backward() method on the error tensor
- torch.autograd calculates and stores the gradients for each model parameter in the .grad attribute of each parameter tensor

In [23]:
loss = (prediction - labels).sum()
loss.backward() # backward pass
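
Once the backward pass has run, each parameter tensor carries its gradient in `.grad`. The sketch below shows one way to inspect them by iterating over `named_parameters()`. To keep the output short it uses a hypothetical two-layer model named `tiny` rather than resnet18; the layer sizes are arbitrary.

```python
import torch
from torch import nn

# Hypothetical small model, used instead of resnet18 to keep the listing short
tiny = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

out = tiny(torch.randn(1, 4))  # forward pass
out.sum().backward()           # backward pass fills .grad on every parameter

names = []
for name, param in tiny.named_parameters():
    names.append(name)
    # each gradient has the same shape as the parameter it belongs to
    print(name, tuple(param.shape), tuple(param.grad.shape))
```

The same loop works on the resnet18 model, where it prints one line per weight and bias tensor.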

In [24]:
print(model.parameters()) # check the model parameters

<generator object Module.parameters at 0x0000014C262AEB30>


Next, we create an optimizer, in this case SGD with a learning rate of 0.01 and a momentum of 0.9, and register all of the model's parameters with it.

In [25]:
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

Finally, we call .step() to initiate gradient descent. The optimizer adjusts each parameter using its gradient stored in the .grad attribute of that parameter tensor.

In [26]:
optim.step() #gradient descent
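
In practice these steps run repeatedly. The following is a minimal sketch of such a loop, assuming a small hypothetical `nn.Linear` model, random data, and a mean squared error loss in place of the resnet18 setup. The call to `zero_grad()` is included because gradients accumulate in `.grad` across backward passes.

```python
import torch
from torch import nn

torch.manual_seed(0)

net = nn.Linear(8, 1)            # hypothetical stand-in for a larger model
inputs = torch.randn(16, 8)      # random inputs, for illustration only
targets = torch.randn(16, 1)     # random targets, for illustration only

optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)
loss_fn = nn.MSELoss()

history = []
for _ in range(100):
    optimizer.zero_grad()                 # clear gradients left over from the previous step
    loss = loss_fn(net(inputs), targets)  # forward pass and loss
    loss.backward()                       # backward pass
    optimizer.step()                      # gradient descent update
    history.append(loss.item())

print(history[0], history[-1])
```

The optimizer is created once, before the loop, so that its momentum state carries over between steps.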

## Differentiation in Autograd

Let’s take a look at how autograd collects gradients. We create two tensors a and b with requires_grad=True. This signals to autograd that every operation on them should be tracked.

In [27]:
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
# also create a new tensor from a and b
Q = 3*a**3 - b**2

Let's assume a and b are the parameters of an NN and Q is the error. In NN training, we want the gradients of the error with respect to each parameter, i.e.:
$$
\frac{\partial Q}{\partial a} = 9a^2
$$
$$
\frac{\partial Q}{\partial b} = -2b
$$

When we call .backward() on Q, autograd computes the gradients of Q with respect to a and b. The gradients are stored in the .grad attribute of each tensor. We can access them as follows:
```python
a.grad, b.grad
```

Because Q is a vector rather than a scalar, we need to pass a gradient argument to Q.backward() explicitly. gradient is a tensor with the same shape as Q, and it represents the gradient of Q with respect to itself, i.e.:

$$
\frac{\partial Q}{\partial Q} = 1
$$

In [28]:
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

In [29]:
# check if collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([True, True])
tensor([True, True])
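
An alternative to passing an explicit gradient is to reduce Q to a scalar first and call `.backward()` on that. The sketch below recreates `a` and `b` so that it runs on its own; summing Q is equivalent to passing a vector of ones as the gradient.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 3*a**3 - b**2

# Summing gives a scalar, so no gradient argument is needed
Q.sum().backward()

print(a.grad)  # matches 9*a**2, i.e. [36., 81.]
print(b.grad)  # matches -2*b, i.e. [-12., -8.]
```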


Mathematically, if a function $\vec{y} = f(\vec{x})$ is a vector function, then the gradient of $\vec{y}$ with respect to $\vec{x}$ is a Jacobian matrix $J$ defined as follows:
$$
J = \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
$$

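Generally speaking, torch.autograd is an engine for computing vector-Jacobian products: given any vector $\vec{v}$, it computes the product $J^{T} \cdot \vec{v}$. If $\vec{v}$ happens to be the gradient of a scalar function $l = g(\vec{y})$ with respect to $\vec{y}$, then by the chain rule $J^{T} \cdot \vec{v}$ is the gradient of $l$ with respect to $\vec{x}$. The external_grad tensor passed to Q.backward() above plays the role of $\vec{v}$.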
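
To make the vector-Jacobian product concrete, here is a small sketch that compares what `.backward()` stores in `.grad` with an explicitly built Jacobian. The function `f` and the vector `v` are hypothetical and chosen only for illustration; the full Jacobian is built with `torch.autograd.functional.jacobian` purely for comparison, since autograd itself never materializes it.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Hypothetical vector function from R^3 to R^2, chosen only for illustration
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.5, -1.0])  # the vector in the vector-Jacobian product

y = f(x)
y.backward(gradient=v)         # autograd computes J^T v without building J

J = jacobian(f, x.detach())    # full 2x3 Jacobian, built here only for comparison
manual = J.T @ v

print(x.grad)
print(manual)
```

Both printed tensors should agree: for this `f`, the rows of `J` are `[2, 1, 0]` and `[0, 1, 6]`, so the product is `[1.0, -0.5, -6.0]`.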

## Computational Graph

Conceptually, autograd keeps a record of data (tensors) & all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

- run the requested operation to compute a resulting tensor, and
- maintain the operation’s gradient function in the DAG.

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

- computes the gradients from each .grad_fn,
- accumulates them in the respective tensor’s .grad attribute, and
- using the chain rule, propagates all the way to the leaf tensors.
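
The graph described above can be inspected from Python. This minimal sketch looks at the `is_leaf` and `grad_fn` attributes and shows that a second backward pass adds to `.grad` rather than overwriting it. `retain_graph=True` is passed on the first call only so that the graph is still available for the second one.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)  # leaf tensor created by the user
Q = (3 * a**3).sum()                            # root of a small graph

print(a.is_leaf, a.grad_fn)  # leaf tensors have no grad_fn
print(Q.is_leaf, Q.grad_fn)  # Q records the function that produced it

Q.backward(retain_graph=True)  # keep the graph so backward can run again
first = a.grad.clone()
Q.backward()                   # second call accumulates into a.grad
second = a.grad.clone()

print(first, second)
```

Here `second` is twice `first`, which is why training loops usually reset gradients between iterations.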


## Exclude gradients from the graph

torch.autograd tracks operations on all tensors that have requires_grad=True. This is what training needs, but sometimes you want to exclude operations from the graph, for example when you are evaluating a model and do not need gradients. Tensors that don't require gradients can be excluded from the gradient computation DAG by setting requires_grad=False.

In [30]:
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does `a` require gradients?: {a.requires_grad}")
# an output tensor requires gradients if at least one of its inputs requires gradients
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")

Does `a` require gradients?: False
Does `b` require gradients?: True
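
Besides the `requires_grad` attribute, two other standard ways to keep a computation out of the graph are the `torch.no_grad()` context manager and `Tensor.detach()`. The sketch below contrasts them with a normally tracked operation; the tensor names are illustrative.

```python
import torch

z = torch.rand((5, 5), requires_grad=True)

with torch.no_grad():
    c = z * 2         # computed without recording the operation

d = (z * 2).detach()  # tracked result, then cut off from the graph
e = z * 2             # tracked as usual

print(c.requires_grad, d.requires_grad, e.requires_grad)  # False False True
```

`torch.no_grad()` is the usual choice when evaluating a whole model, while `detach()` is handy for cutting a single tensor out of an otherwise tracked computation.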


In a NN, parameters that don’t compute gradients are usually called frozen parameters. It is useful to “freeze” part of your model if you know in advance that you won’t need the gradients of those parameters (this offers some performance benefits by reducing autograd computations).

In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels. Let’s walk through a small example to demonstrate this. As before, we load a pretrained resnet18 model, and this time we freeze all of its parameters.

In [31]:
from torch import nn, optim

model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False

Let’s say we want to finetune the model on a new dataset with 10 labels. In resnet, the classifier is the last linear layer model.fc. We can simply replace it with a new linear layer (unfrozen by default) that acts as our classifier.

In [32]:
model.fc = nn.Linear(512, 10)

In [33]:
# show the layers of the model
for name, param in model.named_parameters():
    print(name, param.requires_grad)

conv1.weight False
bn1.weight False
bn1.bias False
layer1.0.conv1.weight False
layer1.0.bn1.weight False
layer1.0.bn1.bias False
layer1.0.conv2.weight False
layer1.0.bn2.weight False
layer1.0.bn2.bias False
layer1.1.conv1.weight False
layer1.1.bn1.weight False
layer1.1.bn1.bias False
layer1.1.conv2.weight False
layer1.1.bn2.weight False
layer1.1.bn2.bias False
layer2.0.conv1.weight False
layer2.0.bn1.weight False
layer2.0.bn1.bias False
layer2.0.conv2.weight False
layer2.0.bn2.weight False
layer2.0.bn2.bias False
layer2.0.downsample.0.weight False
layer2.0.downsample.1.weight False
layer2.0.downsample.1.bias False
layer2.1.conv1.weight False
layer2.1.bn1.weight False
layer2.1.bn1.bias False
layer2.1.conv2.weight False
layer2.1.bn2.weight False
layer2.1.bn2.bias False
layer3.0.conv1.weight False
layer3.0.bn1.weight False
layer3.0.bn1.bias False
layer3.0.conv2.weight False
layer3.0.bn2.weight False
layer3.0.bn2.bias False
layer3.0.downsample.0.weight False
layer3.0.downsample.1.weight Fa

As you can see, the pretrained parameters are all frozen. Only the parameters of the new model.fc layer (fc.weight and fc.bias) remain trainable.
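
To confirm that a training step only touches the unfrozen layer, here is a minimal sketch using a hypothetical two-part model (`backbone` and `head`) in place of resnet18, so that it runs without downloading weights. Passing only the trainable parameters to the optimizer is a design choice made for this sketch; registering all parameters also works, since frozen ones never receive gradients.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained network: a "backbone" plus a classifier head
backbone = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
head = nn.Linear(8, 10)
net = nn.Sequential(backbone, head)

for param in backbone.parameters():
    param.requires_grad = False  # freeze the backbone

trainable = [p for p in net.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=1e-2, momentum=0.9)

frozen_before = backbone[0].weight.clone()
head_bias_before = head.bias.clone()

loss = net(torch.randn(4, 16)).sum()  # forward pass on random data
loss.backward()                       # gradients reach only the head
optimizer.step()

print(torch.equal(backbone[0].weight, frozen_before))  # frozen weights are unchanged
print(torch.equal(head.bias, head_bias_before))        # head parameters have moved
```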