
In [7]:
%matplotlib inline

# A gentle introduction to `torch.autograd`

- `torch.autograd` is PyTorch's automatic differentiation engine that powers neural network training. In this section, you will get a conceptual understanding of how autograd helps a neural network train.

## Background
- Neural networks (NNs) are a collection of nested functions that are executed on some input data. These functions are defined by *parameters* (consisting of weights and biases), which in PyTorch are stored in tensors.

- Training an NN happens in two steps:
  - **Forward Propagation**: In forward prop, the NN makes its best guess about the correct output. It runs the input data through each of its functions to make this guess.

  - **Backward Propagation**: In backprop, the NN adjusts its parameters proportionate to the error in its guess. It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions *(gradients)* and optimizing the parameters using gradient descent. For a more detailed walkthrough of backprop, check out [this video from 3Blue1Brown](https://www.youtube.com/watch?v=tIeHLnjs5U8).
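
The two steps above can be sketched on a toy problem. This is a minimal illustration, not part of the original walkthrough: the single parameter `w`, the data point `(x, y_true)`, and the learning rate `lr` are all hypothetical values chosen so the arithmetic is easy to follow.

```python
import torch

# Hypothetical toy problem: fit y = w * x to one data point (x=2, y=6).
x = torch.tensor(2.0)
y_true = torch.tensor(6.0)
w = torch.tensor(1.0, requires_grad=True)  # the single parameter to learn
lr = 0.1  # learning rate, chosen arbitrarily for this sketch

losses = []
for _ in range(20):
    y_pred = w * x                  # forward propagation: make a guess
    loss = (y_pred - y_true) ** 2   # squared error of the guess
    loss.backward()                 # backward propagation: d(loss)/dw lands in w.grad
    with torch.no_grad():
        w -= lr * w.grad            # gradient descent update
    w.grad.zero_()                  # clear the gradient before the next iteration
    losses.append(loss.item())

print(w.item(), losses[0], losses[-1])
```

Each iteration moves `w` toward 3, the value for which `w * x` equals `y_true`, and the recorded loss shrinks accordingly.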


## Usage in PyTorch
- Let's take a look at a single training step. For this example, we load a pretrained ResNet-18 model from `torchvision`. We create a random `data` tensor to represent a single image with 3 channels and a height and width of 64, and a corresponding `label` initialized to random values.

In [8]:
import torch, torchvision 
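# Load ResNet-18 with pretrained weights (newer torchvision releases prefer the `weights=` argument)
model = torchvision.models.resnet18(pretrained=True)
data = torch.rand(1, 3, 64, 64)   # one random 3-channel 64x64 "image"
label = torch.rand(1, 1000)       # random target, one value per output class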

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth



Next, we run the input data through each of the model's layers to make a prediction. This is the **forward pass**.

In [9]:
prediction = model(data) #forward pass 



We use the model's prediction and the corresponding label to calculate the error (`loss`). The next step is to backpropagate this error through the network. Backward propagation is kicked off when we call `.backward()` on the error tensor. Autograd then calculates the gradient for each model parameter and stores it in that parameter's `.grad` attribute.

In [10]:
loss = (prediction - label).sum()
loss.backward() # backward pass
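
To see what the backward pass leaves behind, the following sketch inspects `.grad` before and after calling `.backward()`. It uses a small `nn.Linear` layer as a hypothetical stand-in for the ResNet so that it runs without downloading weights. The names `tiny_model`, `tiny_data`, and `tiny_label` are illustrative only.

```python
import torch
from torch import nn

torch.manual_seed(0)
tiny_model = nn.Linear(4, 2)      # hypothetical stand-in for the ResNet
tiny_data = torch.rand(1, 4)
tiny_label = torch.rand(1, 2)

# Before the backward pass, no gradients have been computed yet.
grads_before = [p.grad for p in tiny_model.parameters()]
print(grads_before)

tiny_loss = (tiny_model(tiny_data) - tiny_label).sum()
tiny_loss.backward()

# After the backward pass, every parameter holds a gradient of its own shape.
for name, p in tiny_model.named_parameters():
    print(name, tuple(p.shape), tuple(p.grad.shape))
```

Because the loss here is a plain sum, the gradient with respect to the bias is a tensor of ones, which makes for an easy sanity check.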

Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and a momentum of 0.9. We register all the parameters of the model in the optimizer.

In [11]:
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

Finally, we call `.step()` to initiate gradient descent. The optimizer updates each parameter using the gradient stored in its `.grad` attribute.

In [12]:
optim.step() #gradient descent
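
A single step is rarely enough in practice. The sketch below repeats the forward pass, backward pass, and optimizer step in a loop. It assumes a small hypothetical regression setup (`net`, `inputs`, `targets`) and a mean squared error loss rather than the ResNet and loss used above. It also calls `zero_grad()` at the start of each iteration, because PyTorch accumulates gradients in `.grad` across successive `.backward()` calls.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Linear(3, 1)               # hypothetical small model
inputs = torch.rand(8, 3)           # hypothetical batch of 8 samples
targets = torch.rand(8, 1)
opt = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)

history = []
for step in range(50):
    opt.zero_grad()                                # reset accumulated gradients
    loss = ((net(inputs) - targets) ** 2).mean()   # forward pass and loss
    loss.backward()                                # backward pass
    opt.step()                                     # gradient descent update
    history.append(loss.item())

print(history[0], history[-1])
```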

At this point, you have everything you need to train your neural network. The sections below detail how autograd works. Feel free to skip them.

## Differentiation in Autograd

- Let us take a look at how `autograd` collects gradients. We create two tensors `a` and `b` with `requires_grad=True`. This signals to `autograd` that every operation on them should be tracked. 
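
As a quick illustration of what "tracked" means, the sketch below mixes one tensor that requires gradients with one that does not. The tensor names are hypothetical. Any result that depends on a tracked tensor is itself tracked and carries a `grad_fn`, while results built only from untracked tensors are not.

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)  # tracked by autograd
y = torch.tensor([3., 4.])                      # not tracked

z = x * y   # depends on a tracked tensor, so it is tracked too
w = y * 2   # depends only on untracked tensors

print(z.requires_grad, z.grad_fn is not None)
print(w.requires_grad, w.grad_fn is None)
```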

In [13]:
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

We create another tensor $Q$ from `a` and `b`. 

$$Q = 3a^3 - b^2$$

In [14]:
Q = 3*a**3 - b**2

- Let us assume `a` and `b` to be parameters of an NN, and `Q` to be the error. In NN training, we want gradients of the error w.r.t. parameters, i.e. 

\begin{align}\frac{\partial Q}{\partial a} = 9a^2\end{align}

\begin{align}\frac{\partial Q}{\partial b} = -2b\end{align}

- When we call `.backward()` on `Q`, autograd calculates these gradients and stores them in the respective tensors' `.grad` attribute.
- We need to explicitly pass a `gradient` argument in `Q.backward()` because `Q` is a vector, not a scalar. `gradient` is a tensor of the same shape as `Q`, and it represents the gradient of Q w.r.t. itself, i.e.

\begin{align}\frac{dQ}{dQ} = 1\end{align}

- Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like `Q.sum().backward()`. 
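
The equivalence of the two approaches can be checked directly. The sketch below rebuilds `a`, `b`, and `Q` inside a hypothetical helper, `grads_via`, so that each backward call starts from a fresh graph, and then compares the resulting gradients.

```python
import torch

def grads_via(explicit):
    """Return (a.grad, b.grad) using either an explicit gradient or Q.sum()."""
    a = torch.tensor([2., 3.], requires_grad=True)
    b = torch.tensor([6., 4.], requires_grad=True)
    Q = 3 * a**3 - b**2
    if explicit:
        Q.backward(gradient=torch.ones_like(Q))  # pass dQ/dQ = 1 explicitly
    else:
        Q.sum().backward()                       # reduce to a scalar first
    return a.grad, b.grad

ga_explicit, gb_explicit = grads_via(True)
ga_summed, gb_summed = grads_via(False)

print(torch.equal(ga_explicit, ga_summed), torch.equal(gb_explicit, gb_summed))
```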

In [15]:
external_grad = torch.tensor([1., 1.])
print(external_grad)
Q.backward(gradient=external_grad)


tensor([1., 1.])


- Gradients are now deposited in `a.grad` and `b.grad`.

In [16]:
# Check if collected gradients are correct 
print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([True, True])
tensor([True, True])
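
Exact `==` comparison works here because the values involved are small integers that floating point represents exactly. For less tidy numbers, a tolerance-based check is usually safer. The following is a minimal sketch using `torch.allclose`. It recreates the tensors so that it can run on its own.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 3 * a**3 - b**2
Q.sum().backward()

# detach() removes the expected values from the autograd graph before comparing
ok_a = torch.allclose(a.grad, (9 * a**2).detach())
ok_b = torch.allclose(b.grad, (-2 * b).detach())
print(ok_a, ok_b)
```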


In [17]:
print(9*a**2)
print(-2*b)

tensor([36., 81.], grad_fn=<MulBackward0>)
tensor([-12.,  -8.], grad_fn=<MulBackward0>)


In [18]:
print(a.grad)

tensor([36., 81.])


In [19]:
print(b.grad)

tensor([-12.,  -8.])


In [20]:
print(Q)

tensor([-12.,  65.], grad_fn=<SubBackward0>)
