# Autograd

Autograd is PyTorch's automatic differentiation package. It computes the gradients that model optimization depends on, so we never have to derive or code them by hand.

In [1]:
# Imports

import torch

In [29]:
x = torch.randn(3, requires_grad=True)  # Random 1D tensor
print(x)  # Remember the requires_grad arg from tensors?
# PyTorch will maintain a computational graph for us.
# randn and rand have the same interface but sample from different distributions:
# randn draws values from a standard normal distribution (mean 0, variance 1)
# rand draws values from a uniform distribution over [0, 1)
# Both return a random tensor with the dimensions passed as arguments

tensor([-0.9632,  0.7163, -0.7456], requires_grad=True)
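To make the difference between `rand` and `randn` concrete, here is a small sketch that samples from both and checks the ranges described in the comments above. The seed and the sample size are arbitrary choices made for illustration.

```python
import torch

torch.manual_seed(0)  # Arbitrary seed so repeated runs give the same tensors

u = torch.rand(10000)   # Uniform on [0, 1)
n = torch.randn(10000)  # Standard normal: mean 0, variance 1

# Every uniform sample lies in [0, 1)
print(u.min().item() >= 0, u.max().item() < 1)

# The sample mean and standard deviation should be close to 0 and 1
print(n.mean().item(), n.std().item())

# Both take the desired shape as positional arguments
print(torch.rand(2, 3).shape, torch.randn(2, 3).shape)
```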


In [30]:
y = x + 2  # The operation is recorded in the graph; no gradient is computed until backward() is called
y  # Notice the grad_fn attribute: it references the function that will compute this operation's gradient during the backward pass

tensor([1.0368, 2.7163, 1.2544], grad_fn=<AddBackward0>)

In [31]:
z = y * y * 2
z

tensor([ 2.1500, 14.7566,  3.1471], grad_fn=<MulBackward0>)

In [32]:
z = z.mean()
z

tensor(6.6846, grad_fn=<MeanBackward0>)

In [33]:
# And if we now want to calculate the gradient:
z.backward()  # Computes dz/dx by backpropagating through the graph
# The result is stored in the grad attribute of x
x.grad  # Shows the final calculated gradient
# All of the above only works if x was created with requires_grad=True

tensor([1.3824, 3.6217, 1.6726])

Notice how the `grad_fn` name usually contains "Backward". Each `grad_fn` is the function that backpropagation will call to compute the gradient of that operation. `AddBackward0` handles `dy/dx` for the operation `x + 2`, and `MulBackward0` handles `dz/dy` for `y * y * 2`. When we call `backward()` on `z`, these functions are applied in reverse order and their results are multiplied together to give the overall gradient `dz/dx`. This is the chain rule. We will discuss backpropagation in more detail in the next notebook. Under the hood, autograd does not build the full Jacobian matrix of partial derivatives. At each step it computes a vector-Jacobian product: the incoming gradient vector multiplied by that operation's Jacobian.
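As a sanity check, the chain rule can be applied by hand to the example above: with `y = x + 2` and `z = mean(2 * y * y)` over three elements, `dz/dx = 4 * y / 3`. The sketch below uses a fixed input (the values printed earlier) instead of a random one so the comparison is reproducible. `manual_grad` is a name introduced here for illustration.

```python
import torch

# Fixed input (the values printed above) instead of a random tensor
x = torch.tensor([-0.9632, 0.7163, -0.7456], requires_grad=True)

y = x + 2
z = (y * y * 2).mean()
z.backward()

# Chain rule by hand: dz/dy = 4 * y / 3 and dy/dx = 1
manual_grad = 4 * y.detach() / 3

print(x.grad)
print(torch.allclose(x.grad, manual_grad))  # True
```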

In the previous example `z` was a scalar, so `backward()` needed no arguments. If we had not called `mean()`, `z` would be a vector, and we would have to pass `backward()` a gradient vector with the same shape as `z`.


In [35]:
x = torch.randn(3, requires_grad=True)
print(x)
y = x + 2
print(y)
z = y * y * 2
print(z)
# z = z.mean()
# We are not calling mean, so z is not a scalar
# We create a vector with the same shape as z
v = torch.tensor([0.1, 1.0, 0.001], dtype=torch.float32)
# Pass the vector to backward()
z.backward(v)  # Raises a RuntimeError without the vector
print(x.grad)
# In the background this is a vector-Jacobian product.
# If the output is not a scalar we MUST pass a gradient vector.

tensor([-0.9073,  1.7299,  0.0849], requires_grad=True)
tensor([1.0927, 3.7299, 2.0849], grad_fn=<AddBackward0>)
tensor([ 2.3881, 27.8247,  8.6933], grad_fn=<MulBackward0>)
tensor([4.3709e-01, 1.4920e+01, 8.3394e-03])
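One way to read `z.backward(v)` is as a weighted sum: it gives the same gradient as reducing `z` to the scalar `(z * v).sum()` and calling `backward()` with no arguments. The sketch below checks this on a fixed input and also shows the error raised when the vector is omitted. The input values and the names `grad_from_vector`, `grad_from_sum` and `raised` are illustrative assumptions.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.001])

# Option 1: pass v to backward() as the gradient argument
y = x + 2
z = y * y * 2
z.backward(v)
grad_from_vector = x.grad.clone()

# Option 2: reduce to a scalar with a weighted sum, then call backward() without arguments
x.grad.zero_()  # Clear the gradient accumulated by option 1
y = x + 2
z = y * y * 2
(z * v).sum().backward()
grad_from_sum = x.grad.clone()

print(grad_from_vector)
print(torch.allclose(grad_from_vector, grad_from_sum))  # True

# Without a vector, calling backward() on a non-scalar output raises a RuntimeError
y = x + 2
z = y * y * 2
try:
    z.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)
```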


We can prevent PyTorch from tracking history and recording a `grad_fn`. For example, the weight update in a training loop should not itself become part of the computational graph. A complete example will be demonstrated in a later notebook.

```py
x.requires_grad_(False)  # Note the trailing underscore
# Recall that a trailing underscore means the tensor is modified in place
y = x.detach()  # Returns a new tensor with the same values that does not require grad
with torch.no_grad():  # Context manager: operations inside are not tracked
    y = x + 2  # y has no grad_fn
```
All three are valid ways to stop gradient tracking.
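The following sketch runs each of the three options and prints the resulting `requires_grad` flags, to show what each one changes. The tensor names `d`, `y` and `w` are introduced here only for illustration.

```python
import torch

x = torch.randn(3, requires_grad=True)

# 1. detach(): returns a new tensor that is cut off from the graph; x is unchanged
d = x.detach()
print(d.requires_grad)  # False

# 2. torch.no_grad(): operations inside the block are not recorded
with torch.no_grad():
    y = x + 2
print(y.requires_grad, y.grad_fn)  # False None

# Outside the block, operations on x are tracked again
w = x + 2
print(w.requires_grad)  # True

# 3. requires_grad_(False): switches tracking off on x itself, in place
x.requires_grad_(False)
print(x.requires_grad)  # False
```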

NOTE: Each call to `backward()` adds the new gradient to the tensor's `grad` attribute instead of overwriting it, so gradients accumulate across calls.

In [38]:
# Dummy training example:
weights = torch.ones(4, requires_grad=True)

# One iteration
model_output = (weights * 3).sum()  # Dummy operation
model_output.backward()
print(weights.grad)
# Two iterations:
model_output = (weights * 3).sum()  # Dummy operation
model_output.backward()
print(weights.grad)
# Three iterations:
model_output = (weights * 3).sum()  # Dummy operation
model_output.backward()
print(weights.grad)

# The gradients accumulate across calls, so the second and third results are wrong: each should be 3

tensor([3., 3., 3., 3.])
tensor([6., 6., 6., 6.])
tensor([9., 9., 9., 9.])


In [39]:
# Before the next iteration, we need one extra step: reset the accumulated gradients to zero
weights = torch.ones(4, requires_grad=True)
for epoch in range(3):
    model_output = (weights * 3).sum()  # Dummy operation
    model_output.backward()
    print(weights.grad)
    weights.grad.zero_()  # IMPORTANT: reset the gradients in place so they do not accumulate


tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
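Combining the two ideas above, a hand-written gradient descent step wraps the weight update in `torch.no_grad()` and then zeroes the gradients. This is a minimal sketch on the same dummy operation. The learning rate `lr` is a hypothetical value chosen only for illustration.

```python
import torch

weights = torch.ones(4, requires_grad=True)
lr = 0.1  # Hypothetical learning rate, chosen only for illustration

for epoch in range(3):
    model_output = (weights * 3).sum()  # Dummy operation
    model_output.backward()

    # The update itself must not be tracked by autograd
    with torch.no_grad():
        weights -= lr * weights.grad

    weights.grad.zero_()  # Reset before the next iteration

print(weights)
```

Each iteration sees a gradient of 3 for every element, so every weight decreases by `lr * 3` per step.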


When working with optimizers during training, we follow a similar process. This will be covered in a later notebook, but here is a short example:

```py
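optimizer = torch.optim.AdamW([weights], lr=0.01)  # params must be an iterable of tensors; lr = learning rate
# There are many optimizers, such as SGD (stochastic gradient descent),
# but Adam and its variants (AdamW here) are widely used
optimizer.step()  # Update the weights using the gradients stored in .grad
optimizer.zero_grad()  # Reset the gradients before the next iteration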
```
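For context, here is a sketch of where these two calls sit in a loop: forward pass, `backward()`, `step()`, then `zero_grad()`. It uses `torch.optim.SGD` instead of AdamW so the arithmetic is easy to follow. The learning rate and the dummy operation are assumptions made for illustration.

```python
import torch

weights = torch.ones(4, requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.1)  # lr is an arbitrary illustrative value

for epoch in range(3):
    model_output = (weights * 3).sum()  # Dummy operation standing in for a loss
    model_output.backward()  # Populate weights.grad
    optimizer.step()         # Update weights from weights.grad
    optimizer.zero_grad()    # Clear gradients for the next iteration

print(weights)
```

With plain SGD, each step subtracts `lr * grad` from every weight, so there is no need to wrap the update in `torch.no_grad()` or to zero `weights.grad` by hand.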