Gradients are essential for model optimization. Autograd is PyTorch's built-in package for computing gradients automatically.

In [1]:
import torch as tc

In [2]:
# randn samples from a normal distribution with mean 0 and standard deviation 1
x = tc.randn(3, requires_grad=True)
x

tensor([-1.7552,  0.2298,  0.4503], requires_grad=True)

In [3]:
y = x + 2
y

tensor([0.2448, 2.2298, 2.4503], grad_fn=<AddBackward0>)

![Backpropagation Diagram](Data/Images/diagram_backprop.png "Backpropagation Diagram")

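Because x was created with requires_grad=True, PyTorch records each operation on it and attaches a gradient function (grad_fn, here AddBackward0) to the result. That function is applied automatically during backpropagation.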

In [4]:
z = y * y * 2
z

tensor([ 0.1198,  9.9440, 12.0078], grad_fn=<MulBackward0>)

In [5]:
z = z.mean()
z

tensor(7.3572, grad_fn=<MeanBackward0>)

In [6]:
# Calculate the gradient of z with respect to x: dz/dx
z.backward()

In [7]:
# The computed gradients are stored in x.grad
x.grad

tensor([0.3264, 2.9731, 3.2670])
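The stored values can be checked by hand. A minimal sketch, assuming a fixed input tensor (the values below are arbitrary and not from the cells above): since z is the mean of 2 * (x + 2)^2 over three elements, each component of the gradient should be 4 * (x_i + 2) / 3.

```python
import torch as tc

# Fixed input so the check is reproducible (values chosen arbitrarily).
x = tc.tensor([-1.0, 0.5, 2.0], requires_grad=True)

y = x + 2
z = (y * y * 2).mean()
z.backward()

# Analytic gradient: z = (1/3) * sum(2 * (x_i + 2)^2),
# so dz/dx_i = 4 * (x_i + 2) / 3.
expected = 4 * (x.detach() + 2) / 3

matches = tc.allclose(x.grad, expected)
print(x.grad)
print(matches)
```

The same formula reproduces the output above: 4 * 0.2448 / 3 is approximately 0.3264.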

Behind the scenes, backward() computes a vector-Jacobian product.
![Vector Jacobian Diagram](Data/Images/vectorjacobian.png "Vector Jacobian Diagram")<br>
The Jacobian matrix of partial derivatives is multiplied by a gradient vector to produce the final gradients.<br>
This is an application of the chain rule.

In [9]:
# z.backward() would raise an error here, because z is not a scalar.
# Gradients can be created implicitly only for scalar outputs.
# For a non-scalar output, backward needs a gradient argument: a vector with the same shape as z.
x = tc.randn(3, requires_grad=True)
y = x + 2
z = y * y * 2
v = tc.tensor([0.1,1.0,0.001], dtype=tc.float32)
z.backward(v)
x.grad

tensor([ 5.4162e-01,  6.7663e+00, -1.1053e-03])
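To make the vector-Jacobian product concrete, the sketch below compares two routes to the same result: calling backward with a vector, and building the full Jacobian with `torch.autograd.functional.jacobian` and multiplying it by that vector. The helper `f` and the fixed input values are assumptions for illustration only.

```python
import torch as tc
from torch.autograd.functional import jacobian

def f(x):
    # Same computation as in the cell above, wrapped in a function (illustrative helper).
    y = x + 2
    return y * y * 2

x = tc.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = tc.tensor([0.1, 1.0, 0.001])

# Route 1: autograd computes v^T J directly, without materializing J.
z = f(x)
z.backward(v)

# Route 2: build the full 3x3 Jacobian, then multiply by v.
# Here J is diagonal with entries 4 * (x_i + 2).
J = jacobian(f, x.detach())
vjp = v @ J

same = tc.allclose(x.grad, vjp)
print(x.grad)
print(vjp)
```

Building the full Jacobian is only practical for small inputs; it is used here purely as a cross-check.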

Sometimes during training we want to update the weights, but the update itself should not be part of the gradient computation.<br>
There are three ways to stop tracking gradients:<br>
x.requires_grad_(False)

x.detach()

with tc.no_grad():<br>
    ....

In [11]:
x = tc.randn(3, requires_grad=True)
x.requires_grad_(False)
x

tensor([-1.3348, -0.7026,  0.2218])

In [12]:
x = tc.randn(3, requires_grad=True)
y = x.detach()
y

tensor([-1.6131, -0.2991,  1.2048])

In [16]:
x = tc.randn(3, requires_grad=True)
with tc.no_grad():
    y = x + 2
y

tensor([3.4342, 1.0998, 3.4113])
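As an illustration of why this matters, here is a minimal sketch of a manual weight update done inside no_grad, so the update is not recorded in the computation graph. The learning rate `lr` and the toy loss are assumptions, not part of the cells above.

```python
import torch as tc

w = tc.ones(3, requires_grad=True)
lr = 0.1  # hypothetical learning rate, chosen for illustration

loss = (w * 3).sum()
loss.backward()  # w.grad now holds 3.0 for every element

# The in-place update runs with gradient tracking switched off,
# so it does not become part of the graph.
with tc.no_grad():
    w -= lr * w.grad

# Reset the accumulated gradient before any further backward call.
w.grad.zero_()

print(w)
```

After the update, w still has requires_grad=True and remains a leaf tensor, so it can be used in the next forward pass as usual.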

Whenever we call backward(), the gradient for the tensor is accumulated into its .grad attribute.<br>
The values are summed across calls.

In [19]:
weights = tc.ones(4,requires_grad=True)
for epoch in range(1):
    model_output = (weights * 3).sum()
    
    model_output.backward()
    
    print(weights.grad)

tensor([3., 3., 3., 3.])


In [21]:
weights = tc.ones(4,requires_grad=True)
for epoch in range(2):
    model_output = (weights * 3).sum()
    
    model_output.backward()
    
    print(weights.grad)

tensor([3., 3., 3., 3.])
tensor([6., 6., 6., 6.])


In [22]:
weights = tc.ones(4,requires_grad=True)
for epoch in range(3):
    model_output = (weights * 3).sum()
    
    model_output.backward()
    
    print(weights.grad)

tensor([3., 3., 3., 3.])
tensor([6., 6., 6., 6.])
tensor([9., 9., 9., 9.])


Each backward call adds to the stored gradient, so from the second iteration on the gradients are incorrect. Empty the gradient before the next optimization step by calling<br>
.grad.zero_()

In [24]:
weights = tc.ones(4,requires_grad=True)
for epoch in range(3):
    model_output = (weights * 3).sum()
    
    model_output.backward()
    
    print(weights.grad)
    
    weights.grad.zero_()

tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])


Later, when working with a built-in PyTorch optimizer, the same reset is done with optimizer.zero_grad():<br>

In [26]:
weights = tc.ones(4,requires_grad=True)
optimizer = tc.optim.SGD([weights], lr = 0.01)
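optimizer.step()       # update the weights using the gradients stored in .grad
optimizer.zero_grad()  # clear the gradients before the next iteration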
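The cell above only shows the two optimizer calls in isolation. A minimal sketch of how they might fit into a loop follows, reusing the toy objective from the earlier cells; the three-epoch loop and the name `loss` are assumptions for illustration.

```python
import torch as tc

weights = tc.ones(4, requires_grad=True)
optimizer = tc.optim.SGD([weights], lr=0.01)

for epoch in range(3):
    loss = (weights * 3).sum()  # toy objective, as in the cells above
    loss.backward()             # gradient is 3.0 for every weight
    optimizer.step()            # weights <- weights - lr * grad
    optimizer.zero_grad()       # reset so the next epoch starts clean

print(weights)
```

Each step subtracts 0.01 * 3 from every weight, so after three epochs the weights should be close to 0.91. Without the zero_grad call the gradients would accumulate and the steps would grow larger every epoch.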