### Gradients with autograd

Gradients are essential in machine learning, and autograd calculates them automatically, even across long computational graphs.

In [1]:
import torch

If we want to calculate the gradient of some function of x with respect to x, we have to pass requires_grad=True when creating x.

In [2]:
x = torch.randn(3, requires_grad=True)
print(x)

tensor([-0.4134, -1.2041,  1.2125], requires_grad=True)


Now let us define a function of x, and then calculate the gradient of that function with respect to x.

In [3]:
y = x + 2 # This is a simple linear function

# y has an attribute called grad_fn that helps in calculating the gradient
print(y)

tensor([1.5866, 0.7959, 3.2125], grad_fn=<AddBackward0>)


In [4]:
z = 2*y**2 # z is a function of y which in turn is a function of x
print(z)

tensor([ 5.0344,  1.2668, 20.6400], grad_fn=<MulBackward0>)


In [5]:
a = z.mean() # a is a function of z which is eventually a function of x
print(a)

tensor(8.9804, grad_fn=<MeanBackward0>)


Each tensor produced by an operation records, in its grad_fn attribute, the operation that created it. Chained together, these records trace the steps taken from the initial requires_grad tensor to this stage in the computational graph. Backpropagation walks these steps in reverse to calculate the gradients.

To calculate the gradient of a with respect to x, we call a.backward(). To check the values of the gradient with respect to x, we call x.grad.

In [6]:
# a.backward() # Calculates da/dx

In [7]:
# print(x.grad)
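
As a minimal sketch of what the two commented-out cells would do, the snippet below rebuilds the same chain of operations on a fresh tensor and compares the result of backward() with the gradient worked out by hand. The names x_demo, a_demo and expected are hypothetical and chosen so that they do not touch the x, z and a used elsewhere in this notebook. Since a = mean(2 * (x + 2)^2) over 3 elements, the hand-derived gradient is 4 * (x + 2) / 3.

```python
import torch

# Fresh tensor (hypothetical name) so the notebook's own x is left untouched
x_demo = torch.randn(3, requires_grad=True)

# Same chain of operations as in the cells above: y = x + 2, z = 2*y**2, a = z.mean()
a_demo = (2 * (x_demo + 2) ** 2).mean()

# a_demo is a scalar, so backward() needs no argument
a_demo.backward()

# Gradient derived by hand: da/dx_i = 4 * (x_i + 2) / 3
expected = 4 * (x_demo.detach() + 2) / 3

print(x_demo.grad)
print(torch.allclose(x_demo.grad, expected))
```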

It is important to note that backward() can be called without arguments only on a scalar (single-element) tensor. This is why calling z.backward() raises an error: z is a 1D tensor with three elements.

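In the background, backward() essentially computes a vector-Jacobian product: the Jacobian (the matrix of partial derivatives of the output with respect to x) is multiplied by a vector with the same shape as the output, which gives the final gradients that we want. For a scalar output, that vector is implicitly 1.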

To make backward() work on a non-scalar tensor, we have to pass a vector as an argument. This vector has to have the same shape as the tensor on which backward() is called (z in this case).

In [8]:
g = torch.randn(z.shape)

In [9]:
# For this to work, we had to do 2 things:
# 1. Pass in a tensor (vector) g of the same shape
# as z to the backward() function
# 2. Comment out the first backward() call on a,
# since a second backward() call through the same
# computational graph fails unless the first one
# was made with retain_graph=True

z.backward(g)
print(x.grad)

tensor([-0.6132,  4.8666, 14.6377])
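
To make the vector-Jacobian product concrete, here is a minimal sketch that computes the full Jacobian explicitly with torch.autograd.functional.jacobian and checks that multiplying it by the vector reproduces what backward() stores in .grad. The function f and the names x_demo, v, J and vjp are hypothetical and used only for illustration. Building the full Jacobian is done here only to verify the result; backward() itself does not need to materialise it.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Same function as in the cells above: z = 2 * (x + 2)**2
    return 2 * (x + 2) ** 2

x_demo = torch.randn(3, requires_grad=True)
v = torch.randn(3)  # plays the role of g: same shape as the output of f

# Path 1: let autograd compute the vector-Jacobian product
z_demo = f(x_demo)
z_demo.backward(v)

# Path 2: build the 3x3 Jacobian explicitly, then multiply by v
J = jacobian(f, x_demo.detach())  # J[i, j] = d z_i / d x_j
vjp = v @ J

print(torch.allclose(x_demo.grad, vjp))
```

Because f acts element-wise, the Jacobian is diagonal and the product reduces to v * 4 * (x + 2).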


It is also important to note that backward() cannot be called twice on the same computational graph unless the first call passes retain_graph=True.
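
The following is a small sketch of that behaviour, using hypothetical names (x_demo, a_demo, first, second, third_call_failed). The first call keeps the graph alive with retain_graph=True, the second call succeeds and frees the graph, and a third call is expected to raise a RuntimeError. It also shows that the second call adds to the gradient already stored in .grad.

```python
import torch

x_demo = torch.randn(3, requires_grad=True)
a_demo = (2 * (x_demo + 2) ** 2).mean()

# First call: keep the graph so that backward() can run again
a_demo.backward(retain_graph=True)
first = x_demo.grad.clone()

# Second call: allowed because the graph was retained;
# the new gradient is added to the one already in .grad
a_demo.backward()
second = x_demo.grad.clone()
print(torch.allclose(second, 2 * first))

# Third call: the graph was freed by the second call
try:
    a_demo.backward()
    third_call_failed = False
except RuntimeError:
    third_call_failed = True
print(third_call_failed)
```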

It should be noted that operations such as weight updates should not be tracked as part of the gradient calculation. Tracking can be switched off in 3 ways:
<ul>
    <li>requires_grad=False</li>
    <li>x.detach() where x is the weight parameter</li>
    <li>with torch.no_grad():</li>
</ul>

In [12]:
a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
c = b.detach()
d = a + c
# This works because at least one operand (a) has requires_grad=True
d.backward(torch.randn(d.shape))

In [16]:
print(a.grad) # Prints gradient values
print(b.grad) # Prints None: only the detached copy c was used, so no gradient flows back to b
print(c.grad) # Prints None: c is detached and does not require gradients

tensor([ 0.1189, -0.5598, -2.7388])
None
None


In [18]:
e = c**2
e.backward() # This does not work: c is detached, so e does not require gradients and has no grad_fn

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

In [19]:
with torch.no_grad():
    y = a*b
    print(y) # y does not have the grad_fn attribute since gradients are not going to be calculated for it

tensor([-0.7367, -0.5727,  1.9858])
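
To tie this back to weight updates, the sketch below performs one manual gradient descent step inside torch.no_grad(), so that the update itself is not recorded in the computational graph. The tensor w, the toy loss and the learning rate lr are hypothetical values chosen for illustration only.

```python
import torch

w = torch.ones(3, requires_grad=True)  # hypothetical weight tensor
lr = 0.1                               # hypothetical learning rate

loss = (w ** 2).sum()  # toy loss
loss.backward()        # w.grad = 2 * w = [2., 2., 2.]

# The in-place update is wrapped in no_grad() so autograd does not track it
with torch.no_grad():
    w -= lr * w.grad   # each element becomes 1 - 0.1 * 2 = 0.8

w.grad.zero_()         # reset the gradient before the next backward pass

print(w)
print(w.grad)
```

After the update, w is still a leaf tensor with requires_grad=True, so it can be used in the next forward and backward pass.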


#### A dummy example

In [20]:
weights = torch.ones(4, requires_grad=True)

for epoch in range(4):
    model_output = (weights*3).sum() # Results in one scalar value
    model_output.backward() # Calculates gradient of model_output with respect to weights
    print(weights.grad)

tensor([3., 3., 3., 3.])
tensor([6., 6., 6., 6.])
tensor([9., 9., 9., 9.])
tensor([12., 12., 12., 12.])


The above output is explained as follows. Every time, the gradient of model_output with respect to each weight is 3. However, backward() adds each new gradient to weights.grad instead of overwriting it, resulting in the 3s, 6s, 9s and 12s.

In [21]:
weights = torch.ones(4, requires_grad=True)

for epoch in range(4):
    model_output = (weights*3).sum() # Results in one scalar value
    model_output.backward() # Calculates gradient of model_output with respect to weights
    print(weights.grad)
    weights.grad.zero_() # Zeroes out the gradient in place

tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])


The same applies when using an optimizer: the gradients have to be zeroed out after every update/optimisation step, before the next backward pass.

In [24]:
optimizer = torch.optim.SGD(weights, lr=0.01)
optimizer.step() # One optimisation step of the stochastic gradient descent algorithm
print(weights.grad) # Should print current gradients
optimizer.zero_grad() # Zeroes out the gradients
print(weights.grad) # Should print None in recent PyTorch versions (older versions print zeros instead)

tensor([1., 1., 1., 1.], requires_grad=True)


TypeError: params argument given to the optimizer should be an iterable of Tensors or dicts, but got torch.FloatTensor

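The above doesn't work because torch.optim.SGD expects an iterable of tensors (or dicts), not a single tensor. Wrapping the tensor in a list, as in torch.optim.SGD([weights], lr=0.01), avoids this error. We will look at optimizers properly in the next few notebooks.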
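
For reference, here is a minimal sketch of a version that runs, assuming the only change needed is to pass the tensor inside a list. The name weights_demo is hypothetical and used to avoid clobbering the weights tensor above; the backward() call is added so that there is a gradient for the optimizer to use.

```python
import torch

weights_demo = torch.ones(4, requires_grad=True)

# SGD expects an iterable of tensors, so the single tensor goes in a list
optimizer = torch.optim.SGD([weights_demo], lr=0.01)

model_output = (weights_demo * 3).sum()
model_output.backward()  # weights_demo.grad = [3., 3., 3., 3.]

optimizer.step()         # each weight becomes 1 - 0.01 * 3 = 0.97
print(weights_demo)

optimizer.zero_grad()    # resets the gradient (None or zeros, depending on the PyTorch version)
print(weights_demo.grad)
```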