# Background 
- Neural Networks (NN) are a collection of nested functions that are executed on some input data. These functions are defined by parameters (consisting of weights and biases), which in PyTorch are stored in tensors
- Training a NN happens in 2 steps:
    + Forward Propagation
        In forward prop, the NN makes its best guess about the correct output. It runs the input data through each of its functions to make this guess
    + Backward Propagation
        In backprop, the NN adjusts its parameters in proportion to the error in its guess. It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (the gradients), and optimizing the parameters using gradient descent
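
The two steps above can be sketched on a toy problem. This is a minimal illustration, not part of the resnet example: the one-weight "network", the data, the learning rate, and the step count are all assumptions chosen so the loop converges quickly.

```python
import torch

# Toy "network": a single linear function y = w * x + b (illustrative names).
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
target = 2.0 * x + 1.0  # data generated from w = 2, b = 1

lr = 0.05  # assumed learning rate
losses = []
for _ in range(1000):
    # Forward propagation: make a guess and measure the error.
    guess = w * x + b
    loss = ((guess - target) ** 2).mean()
    losses.append(loss.item())

    # Backward propagation: compute d(loss)/dw and d(loss)/db.
    loss.backward()

    # Gradient descent: move each parameter against its gradient.
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad

    # Clear the gradients so the next backward() does not add to them.
    w.grad.zero_()
    b.grad.zero_()

print(w.item(), b.item())  # should end up close to 2 and 1
```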

# Usage in PyTorch
- Look at an example in a single training step
- Load a pretrained resnet18 model from `torchvision`
- Create a random data tensor to represent a single image with 3 channels, and height & width of 64, and its corresponding label initialized to some random values.
- The label for the pretrained model has shape (1, 1000)

## Initialize the model

In [61]:
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT)
data = torch.rand(1, 3, 64, 64)
labels = torch.rand(1, 1000)

print(data)

tensor([[[[0.5421, 0.2225, 0.1813,  ..., 0.3353, 0.7534, 0.0658],
          [0.8229, 0.3776, 0.8601,  ..., 0.0821, 0.6289, 0.3139],
          [0.9230, 0.1914, 0.3773,  ..., 0.5510, 0.6882, 0.9290],
          ...,
          [0.2035, 0.5775, 0.3468,  ..., 0.9233, 0.9648, 0.8529],
          [0.2178, 0.2000, 0.3380,  ..., 0.4139, 0.0227, 0.2314],
          [0.0952, 0.9879, 0.1932,  ..., 0.6048, 0.3325, 0.3075]],

         [[0.2756, 0.7602, 0.1380,  ..., 0.1315, 0.5301, 0.7083],
          [0.9754, 0.9447, 0.0938,  ..., 0.0809, 0.3266, 0.1984],
          [0.2307, 0.4279, 0.7401,  ..., 0.7288, 0.8550, 0.7953],
          ...,
          [0.3426, 0.0096, 0.3271,  ..., 0.1005, 0.5864, 0.9310],
          [0.5460, 0.3109, 0.3677,  ..., 0.3936, 0.7002, 0.0259],
          [0.8847, 0.4800, 0.4041,  ..., 0.1071, 0.3233, 0.7646]],

         [[0.5845, 0.0091, 0.6418,  ..., 0.9020, 0.8466, 0.7718],
          [0.2247, 0.2234, 0.1239,  ..., 0.9381, 0.0825, 0.7746],
          [0.0618, 0.4209, 0.6553,  ..., 0

## Forward pass
- Run the input data through each of the model's layers to make a prediction
- This is the forward pass

In [62]:
prediction = model(data)

print(prediction)

tensor([[-7.0983e-01, -2.7322e-01, -8.3016e-01, -1.4580e+00, -8.3909e-01,
         -1.2968e-01, -4.6641e-01,  4.0165e-01,  3.7474e-01, -6.1585e-01,
         -9.5696e-01, -6.7113e-01, -9.8432e-02, -1.0333e+00, -9.6147e-01,
         -5.5149e-01, -6.5734e-01, -2.6695e-01, -5.6188e-01, -5.6120e-01,
         -1.5620e+00, -1.0224e+00, -1.3473e+00, -6.2376e-04, -9.4227e-01,
         -1.0288e+00, -6.2271e-01, -1.1116e+00, -7.0821e-01, -1.8653e-02,
         -7.0968e-01, -8.2254e-01, -3.1741e-01, -6.6930e-01, -4.6288e-01,
         -5.9411e-01,  5.6066e-01, -7.2004e-01, -3.5162e-01,  9.8255e-02,
         -8.8806e-01, -7.0963e-01, -9.5301e-01, -6.9042e-02, -5.6950e-01,
         -1.1889e-01, -1.0205e+00, -5.2649e-01, -1.1128e+00, -9.4764e-01,
         -2.8345e-01,  4.0667e-01, -2.0549e-01, -6.6938e-01, -3.9741e-03,
         -1.3550e+00, -1.3320e-01, -1.5202e+00, -3.3882e-01, -6.5573e-01,
          7.7521e-01,  5.2745e-02,  5.5540e-02,  4.9108e-02, -9.5402e-01,
         -1.7737e-01, -8.8296e-02, -1.

## Backward propagation
- Use the model prediction to calculate the error (`loss`)
- The next step is to backpropagate this error through the network
- Backpropagation is kicked off when we call `.backward()` on the error tensor
- Autograd then calculates and stores the gradients for each model parameter in the parameter's `.grad` attribute

In [63]:
loss = (prediction - labels).sum()
loss.backward()

print(loss)

tensor(-496.0020, grad_fn=<SumBackward0>)
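
To see where the gradients end up, the sketch below repeats the same steps on a small stand-in model. `tiny_model`, `tiny_data`, and `tiny_labels` are hypothetical names, and a single `nn.Linear` layer replaces resnet18 so the example runs without downloading weights.

```python
import torch
from torch import nn

torch.manual_seed(0)
tiny_model = nn.Linear(4, 3)        # hypothetical stand-in for resnet18
tiny_data = torch.rand(1, 4)
tiny_labels = torch.rand(1, 3)

# Before backward(), no gradients have been stored yet.
grads_before = [p.grad for p in tiny_model.parameters()]
print(grads_before)  # [None, None]

tiny_loss = (tiny_model(tiny_data) - tiny_labels).sum()
tiny_loss.backward()

# After backward(), each parameter's .grad has the same shape as the parameter.
for name, p in tiny_model.named_parameters():
    print(name, tuple(p.shape), tuple(p.grad.shape))
```

For this particular loss the result can be checked by hand: each bias gradient is 1, and each row of the weight gradient equals the input.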


## Optimize
- The next step is to load an optimizer, in this case SGD with a learning rate of 0.01 and a `momentum` of 0.9
- Register all the parameters of the model in the optimizer
- SGD = Stochastic Gradient Descent
- SGD with momentum adds a fraction of the previous update to the current one. This damps oscillations and speeds up progress along directions where successive gradients agree, so training often converges faster. It can help the optimizer roll through shallow local minima and flat regions, but it does not guarantee finding the global minimum
- Momentum = an exponentially decaying running sum of past gradients (closely related to an exponentially weighted average)


In [64]:
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

print(optim)

SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.01
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)
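
The momentum update can be written out by hand and compared against the optimizer. This is a sketch under the assumption that `dampening=0` and `nesterov=False` (the defaults printed above), in which case the documented rule is `velocity = momentum * velocity + grad` followed by `param -= lr * velocity`. The toy loss and the names `p_manual`, `p_optim`, and `velocity` are illustrative.

```python
import torch

lr, mu = 1e-2, 0.9

# Two copies of the same parameter: one updated by hand, one by the optimizer.
p_manual = torch.tensor([1.0, -2.0], requires_grad=True)
p_optim = p_manual.detach().clone().requires_grad_(True)
sgd = torch.optim.SGD([p_optim], lr=lr, momentum=mu)

velocity = None
for _ in range(5):
    # Same toy loss for both copies (illustrative only).
    for p in (p_manual, p_optim):
        p.grad = None
        (p ** 2).sum().backward()

    # Hand-written momentum: velocity is a decaying sum of past gradients.
    g = p_manual.grad
    velocity = g.clone() if velocity is None else mu * velocity + g
    with torch.no_grad():
        p_manual -= lr * velocity

    # Library update for the second copy.
    sgd.step()

print(torch.allclose(p_manual, p_optim, atol=1e-6))
```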


## Step
- Finally, call `.step()` to perform one gradient descent update
- The optimizer adjusts each parameter by its gradient stored in `.grad`

In [65]:
# Gradient Descent
optim.step()

print(optim)

SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.01
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)
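
Printing the optimizer shows only its hyperparameters, so the effect of `.step()` is easier to see on the parameters themselves. The sketch below uses a hypothetical one-layer stand-in (`layer`, `opt`) instead of resnet18. The `zero_grad()` call at the end is an addition of mine, included because `.grad` accumulates across `.backward()` calls.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)  # hypothetical stand-in model
opt = torch.optim.SGD(layer.parameters(), lr=1e-2, momentum=0.9)

step_loss = (layer(torch.rand(1, 4)) - torch.rand(1, 2)).sum()
step_loss.backward()

before = layer.weight.detach().clone()
grad = layer.weight.grad.clone()
opt.step()
after = layer.weight.detach().clone()

# On the very first step the momentum buffer equals the gradient,
# so the weights should move by -lr * grad.
step_matches = torch.allclose(after - before, -1e-2 * grad, atol=1e-6)
print(step_matches)

# Clear the stored gradients before the next forward/backward pass.
opt.zero_grad()
```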


# Differentiation in Autograd
- How autograd collects gradients

## Create tensor a and b
- We create 2 tensors `a` and `b` with `requires_grad=True`
- This signals to `autograd` that every operation on them should be tracked

In [66]:
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)


## Create tensor Q from a and b
- We create another tensor `Q` from `a` and `b`
- Q = 3*a**3 - b**2

In [67]:
Q = 3*a**3 - b**2

## Aggregate Q into a scalar
- Let's assume `a` and `b` are parameters of an NN, and `Q` is the error. In NN training, we want the gradients of the error with respect to (w.r.t.) the parameters
    + dQ/da = 9a**2
    + dQ/db = -2b
- When we call `.backward()` on `Q`, autograd calculates these gradients and stores them in the respective tensors' `.grad` attribute
- We need to explicitly pass a `gradient` argument in `Q.backward()` because `Q` is a vector, not a scalar
- `gradient` is a tensor of the same shape as `Q`, and it represents the gradient of Q w.r.t. itself
    + dQ/dQ = 1
- Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like `Q.sum().backward()`

In [68]:
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)
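
The equivalence between passing `gradient` explicitly and summing first can be checked directly. The sketch builds fresh copies of the tensors for each variant (the helper `make_inputs` and the suffixed names are illustrative) so that the two `.backward()` calls do not accumulate into the same `.grad`.

```python
import torch

def make_inputs():
    a = torch.tensor([2., 3.], requires_grad=True)
    b = torch.tensor([6., 4.], requires_grad=True)
    return a, b

# Variant 1: pass dQ/dQ = 1 explicitly for the vector-valued Q.
a1, b1 = make_inputs()
Q1 = 3 * a1**3 - b1**2
Q1.backward(gradient=torch.ones_like(Q1))

# Variant 2: reduce Q to a scalar first, then call backward() with no argument.
a2, b2 = make_inputs()
Q2 = 3 * a2**3 - b2**2
Q2.sum().backward()

print(a1.grad, a2.grad)  # both tensor([36., 81.]), i.e. 9 * a**2 at a = [2, 3]
print(b1.grad, b2.grad)  # both tensor([-12., -8.]), i.e. -2 * b at b = [6, 4]
```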

## Deposited gradients
- Gradients are now deposited in `a.grad` and `b.grad`

In [69]:
print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([True, True])
tensor([True, True])


# Computational graph (DAG)
- Conceptually, autograd keeps a record of data (tensors) & all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects
- In this DAG, leaves are the input tensors, roots are the output tensors
- By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule
- In a `forward pass`, autograd does 2 things simultaneously:
    + Run the requested operation to compute a resulting tensor, and
    + Maintain the operation's gradient function in the DAG
- The `backward pass` kicks off when `.backward()` is called on the DAG root. `autograd` then:
    + Computes the gradients from each `.grad_fn`
    + Accumulates them in the respective tensor's `.grad` attribute, and
    + Using the chain rule, propagates all the way to the leaf tensors
- Below is a visual representation of the DAG in our example. In the graph, arrows are in the direction of the forward pass. The nodes represent the backward functions of each operation in the forward pass. The leaf nodes represent the leaf tensors `a` and `b`

        a
        |
        v
    PowBackward()       b
        |               |
        v               v
    MulBackward()   PowBackward()
        \               /
         \             /
           SubBackward()
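
The same graph can be inspected from Python by starting at `Q.grad_fn` and following `next_functions`. This is a rough sketch: `node_names` is a hypothetical helper, the tensors are recreated under new names so the block is self-contained, and the exact node names (for example the trailing `0`) are what I would expect PyTorch to report, not something the diagram above specifies.

```python
import torch

a_leaf = torch.tensor([2., 3.], requires_grad=True)
b_leaf = torch.tensor([6., 4.], requires_grad=True)
Q_root = 3 * a_leaf**3 - b_leaf**2

def node_names(fn):
    """Collect backward-node names reachable from `fn` (root first)."""
    names = []
    stack = [fn]
    while stack:
        node = stack.pop()
        if node is None:  # inputs that need no gradient, e.g. the constant 3
            continue
        names.append(node.name())
        stack.extend(next_fn for next_fn, _ in node.next_functions)
    return names

names = node_names(Q_root.grad_fn)
print(names)
```

The leaf tensors show up at the end of each chain as `AccumulateGrad` nodes, which are the nodes that write into `.grad`.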

## Note
- DAGs are dynamic in PyTorch
- Important thing to note is that the graph is recreated from scratch
- After each `.backward()` call, autograd starts populating a new graph
- This is exactly what allows you to use control flow statements in your model
- You can change the shape, size, and operations at every iteration if needed
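
A small illustration of what a dynamic graph allows: the number of operations below depends on the input values, and autograd still differentiates whatever was actually executed. The function `double_until_large` and its threshold are hypothetical, chosen only to make the loop count easy to follow.

```python
import torch

def double_until_large(x, threshold=100.0):
    """Hypothetical model whose depth depends on the data."""
    y = x
    steps = 0
    while y.norm() < threshold:  # ordinary Python control flow
        y = y * 2
        steps += 1
    return y.sum(), steps

# Small input: several doublings are recorded in the graph.
x_small = torch.tensor([1., 2.], requires_grad=True)
out_small, steps_small = double_until_large(x_small)
out_small.backward()

# Large input: the loop body never runs, so the graph is different.
x_big = torch.tensor([60., 90.], requires_grad=True)
out_big, steps_big = double_until_large(x_big)
out_big.backward()

# Each doubling multiplies the gradient by 2.
print(steps_small, x_small.grad)
print(steps_big, x_big.grad)
```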

# Exclusion from the DAG
- `torch.autograd` tracks operations on all tensors which have their `requires_grad` flag set to `True`
- For tensors that don't require gradients, setting this attribute to `False` excludes them from the gradient computation DAG

## Gradient Requirements
- The output tensor of an operation will require gradients even if only a single input tensor has `requires_grad=True`

In [70]:
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True) 

a = x + y
print(f"Does `a` require gradients? : {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")

Does `a` require gradients? : False
Does `b` require gradients?: True


## Frozen Parameters
- In a NN, parameters that don't compute gradients are usually called frozen parameters
- It is useful to "freeze" part of your model if you know in advance that you won't need the gradients of those parameters (this offers some performance benefit by reducing autograd computation)

## Fine Tuning
- In fine-tuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels
- Example as below using resnet18 model

In [71]:
from torch import nn, optim

model1 = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze all the parameters in the network
for param in model1.parameters():
    param.requires_grad = False

- Let's say we want to fine-tune the model on a new dataset with 10 labels
- In resnet, the classifier is the last linear layer, `fc` (here `model1.fc`)
- We can replace it with a new linear layer (unfrozen by default) that acts as our classifier

In [72]:
model1.fc = nn.Linear(512, 10)

- Now all the parameters in the model, except the parameters of `model1.fc`, are frozen
- The only parameters that compute gradients are the weights and bias of `model1.fc`
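
One way to confirm which parameters are still trainable is to filter `named_parameters()` by `requires_grad`. The sketch uses `TinyNet`, a hypothetical two-part stand-in with an `fc` head, so it runs without torchvision or downloaded weights. The same check should apply to `model1`.

```python
import torch
from torch import nn

# Hypothetical stand-in for resnet18: a small backbone plus a final `fc` layer.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
        self.fc = nn.Linear(16, 1000)

    def forward(self, x):
        return self.fc(self.backbone(x))

tiny = TinyNet()

# Freeze everything, then swap in a fresh 10-class head (unfrozen by default).
for param in tiny.parameters():
    param.requires_grad = False
tiny.fc = nn.Linear(16, 10)

trainable = [name for name, p in tiny.named_parameters() if p.requires_grad]
frozen = [name for name, p in tiny.named_parameters() if not p.requires_grad]
print(trainable)  # ['fc.weight', 'fc.bias']
print(frozen)     # ['backbone.0.weight', 'backbone.0.bias']
```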

In [73]:
# Optimize only the classifier
optimizer = optim.SGD(model1.parameters(), lr=1e-2, momentum=0.9)

- Notice that although we register all the parameters in the optimizer, the only parameters that compute gradients (and hence are updated by gradient descent) are the weights and bias of the classifier
- The same exclusionary functionality is available as a context manager in `torch.no_grad()`
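
Both points can be checked on a small stand-in network. In this sketch `net` is a hypothetical three-layer `nn.Sequential` whose first layer is frozen. All parameters are registered with the optimizer, and the last lines run a forward pass under `torch.no_grad()`.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 10))
for p in net[0].parameters():  # freeze the first layer only
    p.requires_grad = False

# Register every parameter, frozen or not.
sgd = optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)
frozen_before = net[0].weight.clone()
head_bias_before = net[2].bias.detach().clone()

net(torch.rand(4, 8)).sum().backward()
sgd.step()

# Frozen parameters never receive a gradient, so step() leaves them alone.
frozen_grad_is_none = net[0].weight.grad is None
frozen_unchanged = torch.equal(net[0].weight, frozen_before)
head_changed = not torch.equal(net[2].bias, head_bias_before)
print(frozen_grad_is_none, frozen_unchanged, head_changed)

# Inside no_grad(), nothing is tracked, even for unfrozen parameters.
with torch.no_grad():
    out = net(torch.rand(4, 8))
print(out.requires_grad, out.grad_fn)
```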