Let's take a look at how ``autograd`` collects gradients. We create two tensors ``a`` and ``b`` with
``requires_grad=True``. This signals to ``autograd`` that every operation on them should be tracked.

In [1]:
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
print(a)
print(b)

tensor([2., 3.], requires_grad=True)
tensor([6., 4.], requires_grad=True)


We create another tensor ``Q`` from ``a`` and ``b``.

$Q = 3a^{3} - b^{2}$

In [2]:
Q = 3*a**3 - b**2
print(Q)

tensor([-12.,  65.], grad_fn=<SubBackward0>)


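Because ``Q`` is a vector, ``.backward()`` needs an explicit ``gradient`` argument: a tensor with the same shape as ``Q`` that holds the gradient of ``Q`` with respect to itself, which here is all ones.

In [3]: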
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

Gradients are now deposited in ``a.grad`` and ``b.grad``.

In [4]:
# check if collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([True, True])
tensor([True, True])
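
As a minimal sketch of what the ``gradient`` argument does (``a`` and ``b`` are re-created so the block is self-contained), passing a tensor of ones should give the same gradients as first reducing ``Q`` to a scalar with ``Q.sum()`` and then calling ``.backward()`` with no argument:

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

# Variant 1: vector output, explicit gradient of ones (dQ/dQ = 1)
Q = 3 * a**3 - b**2
Q.backward(gradient=torch.ones_like(Q))
grad_a_vector, grad_b_vector = a.grad.clone(), b.grad.clone()

# Reset the stored gradients before the second run
a.grad = None
b.grad = None

# Variant 2: reduce to a scalar first, so no argument is needed
Q = 3 * a**3 - b**2
Q.sum().backward()

print(torch.equal(grad_a_vector, a.grad))
print(torch.equal(grad_b_vector, b.grad))
```

Both variants compute $9a^{2}$ for ``a`` and $-2b$ for ``b``, because the derivative of a sum with respect to each of its terms is one.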


Conceptually, ``autograd`` keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in
a directed acyclic graph (DAG) consisting of ``Function`` objects
(<https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function>).
In this DAG, leaves are the input tensors and roots are the output
tensors. By tracing this graph from roots to leaves, you can
automatically compute the gradients using the chain rule.
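
The graph can be inspected directly. The sketch below rebuilds ``Q`` from the earlier cells and walks from the root's ``grad_fn`` towards the leaves through ``next_functions``; the helper ``collect_nodes`` is a hypothetical name introduced only for this illustration:

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 3 * a**3 - b**2

def collect_nodes(fn, depth=0, out=None):
    """Depth-first walk from a grad_fn node towards the leaves."""
    if out is None:
        out = []
    if fn is None:  # inputs that do not require gradients have no node
        return out
    out.append((depth, type(fn).__name__))
    for next_fn, _ in fn.next_functions:
        collect_nodes(next_fn, depth + 1, out)
    return out

nodes = collect_nodes(Q.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)
```

The root is the node shown as ``grad_fn`` when ``Q`` is printed, and each branch ends in a node that accumulates the gradient into a leaf tensor's ``.grad``.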

**In a forward pass, autograd does two things simultaneously:**

- run the requested operation to compute a resulting tensor, and
- maintain the operation’s *gradient function* in the DAG.

**The backward pass kicks off when ``.backward()`` is called on the DAG
root. ``autograd`` then:**

- computes the gradients from each ``.grad_fn``,
- accumulates them in the respective tensor’s ``.grad`` attribute, and
- using the chain rule, propagates all the way to the leaf tensors.
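
The accumulation step has a practical consequence, sketched below with a hypothetical tensor ``w``: a second backward pass adds to the existing ``.grad`` instead of overwriting it, so gradients are usually reset between iterations.

```python
import torch

w = torch.tensor([1., 2.], requires_grad=True)

# First backward pass: d(sum(w^2))/dw = 2w
loss = (w ** 2).sum()
loss.backward()
first = w.grad.clone()

# Second backward pass on a freshly built graph: the new gradient is added
loss = (w ** 2).sum()
loss.backward()
second = w.grad.clone()

print(first)   # 2w
print(second)  # 2w + 2w

# Reset before the next iteration
w.grad.zero_()
```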

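``autograd`` tracks operations only on tensors whose ``requires_grad`` flag is ``True``; all other tensors are excluded from the DAG. The output of an operation requires gradients as soon as a single input does:

In [5]: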
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does `a` require gradients? : {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")

Does `a` require gradients? : False
Does `b` require gradients?: True


In a NN, parameters that don't compute gradients are usually called **frozen parameters**. 
It is useful to "freeze" part of your model if you know in advance that you won't need the gradients of those parameters (this offers some performance benefits by reducing autograd computations).
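
A minimal sketch of freezing, assuming a small hypothetical two-layer model rather than a real network: after the backward pass, the frozen layer has no gradient at all, while the trainable layer does.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

# Freeze the first linear layer
for param in model[0].parameters():
    param.requires_grad = False

loss = model(torch.rand(8, 4)).sum()
loss.backward()

print(model[0].weight.grad)              # None: nothing was computed for it
print(model[2].weight.grad is not None)  # the unfrozen layer has a gradient
```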

Another common use case where exclusion from the DAG is important is finetuning a pretrained network
(<https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html>).

In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels. 

In [6]:
import torchvision
from torch import nn, optim

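# `pretrained=True` is deprecated in torchvision >= 0.13; use the `weights` argument instead
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)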

# Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False



Let's say we want to finetune the model on a new dataset with 10 labels. 

In resnet, **the classifier is the last linear layer** ``model.fc``.
**We can simply replace it with a new linear layer (unfrozen by default) that acts as our classifier.**

In [7]:
model.fc = nn.Linear(512, 10)

Now all parameters in the model, except the parameters of ``model.fc``, are frozen.
The only parameters that compute gradients are the weights and bias of ``model.fc``.
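
One way to confirm this is to list the parameters that still require gradients. The sketch below uses a hypothetical ``TinyNet`` as a stand-in for the pretrained network so that it runs without downloading weights; the same list comprehension should apply to the resnet model above.

```python
from torch import nn

class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained network with an `fc` head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 4)
        self.fc = nn.Linear(4, 3)

    def forward(self, x):
        return self.fc(self.backbone(x).relu())

model = TinyNet()

# Freeze everything, then swap in a new head (requires_grad=True by default)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(4, 10)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```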

In [8]:
# All parameters are registered, but only model.fc receives gradients,
# so only the classifier is updated
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
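
An alternative, sketched here on a small hypothetical model instead of resnet, is to hand the optimizer only the parameters that require gradients. This makes the intent explicit and keeps frozen parameters out of the optimizer's parameter groups; one step then changes the head and leaves the frozen layer untouched.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical frozen "backbone" plus a trainable head
backbone = nn.Linear(8, 4)
for param in backbone.parameters():
    param.requires_grad = False
head = nn.Linear(4, 10)
model = nn.Sequential(backbone, nn.ReLU(), head)

# Register only the parameters that require gradients
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable_params, lr=1e-2, momentum=0.9)

backbone_before = backbone.weight.clone()
head_bias_before = head.bias.clone()

# One training step on random data
loss = model(torch.rand(16, 8)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(torch.equal(backbone_before, backbone.weight))  # frozen layer unchanged
print(torch.equal(head_bias_before, head.bias))       # head was updated
```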

In [9]:
print(optimizer)

SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.01
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)


In [10]:
print(model.fc)

Linear(in_features=512, out_features=10, bias=True)
