In [1]:
# https://towardsdatascience.com/pytorch-autograd-understanding-the-heart-of-pytorchs-magic-2686cd94ec95
import torch

# Creating the graph
x = torch.tensor(1.0, requires_grad = True)
y = torch.tensor(2.0)
z = x * y

# Displaying
for i, name in zip([x, y, z], "xyz"):
    print(f"{name}\ndata: {i.data}\nrequires_grad: {i.requires_grad}\n\
grad: {i.grad}\ngrad_fn: {i.grad_fn}\nis_leaf: {i.is_leaf}\n")

x
data: 1.0
requires_grad: True
grad: None
grad_fn: None
is_leaf: True

y
data: 2.0
requires_grad: False
grad: None
grad_fn: None
is_leaf: True

z
data: 2.0
requires_grad: True
grad: None
grad_fn: <MulBackward0 object at 0x7f795b007e48>
is_leaf: False



## The backward() function

`backward()` is the function that actually computes the gradients. It passes its argument (by default a scalar tensor equal to `1`) through the backward graph, all the way to every leaf node reachable from the tensor it was called on. The computed gradients are then stored in the `.grad` attribute of each leaf node that has `requires_grad=True`. Remember that the backward graph is already built dynamically during the forward pass; `backward()` only uses that existing graph to compute the gradients and store them in the leaf nodes.

In [2]:
import torch
# Creating the graph
x = torch.tensor(1.0, requires_grad = True)
z = x ** 3
z.backward()        # Computes the gradient 
print(x.grad.data)  # Prints '3' which is dz/dx 

tensor(3.)


An important thing to notice is that calling `z.backward()` is the same as calling `z.backward(torch.tensor(1.0))`: the tensor is passed automatically. The `torch.tensor(1.0)` is the external gradient that seeds the chain-rule multiplications. It is passed as input to the `grad_fn` of `z` (a `PowBackward0` node here, since `z = x ** 3`), which then computes the gradient with respect to `x`.
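A minimal sketch of the external gradient at work, reusing the same `x` and `z`. The names `grad_default` and `grad_scaled` are only illustrative. Passing `1.0` explicitly should give the same result as the default call, and passing `2.0` should scale the gradient by two.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
z = x ** 3

# Explicit external gradient of 1.0: same as calling z.backward() with no argument
z.backward(torch.tensor(1.0), retain_graph=True)
grad_default = x.grad.clone()   # dz/dx = 3 * x**2

x.grad = None                   # clear the accumulated gradient before the next pass

# An external gradient of 2.0 multiplies the result by 2
z.backward(torch.tensor(2.0))
grad_scaled = x.grad.clone()

print(grad_default, grad_scaled)
```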

The shape of the tensor passed to `.backward()` must match the shape of the tensor on which `.backward()` is called.
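As a quick check of this rule, the sketch below calls `backward()` on a two-element tensor without an argument, which PyTorch rejects with a `RuntimeError`, and then repeats the call with a tensor of matching shape. The flag `raised` is just an illustrative helper.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x ** 2          # y has two elements, so it is not a scalar

try:
    y.backward()    # no gradient argument supplied
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

# Supplying a tensor with the same shape as y works
y.backward(torch.ones_like(y))
print(x.grad)
```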

In [0]:
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x**2

y.backward(torch.FloatTensor([1.0, 1.0])) # the tensor provided to .backward acts as a weight
x.grad

tensor([2., 4.])

The line `y.backward(torch.FloatTensor([1.0, 1.0]))` computes the gradient using a tensor of ones as its argument. This mirrors what PyTorch does for a scalar output: by default it passes a scalar tensor equal to `1`. We just don't see it, because it happens internally.

For non-scalar tensors, we have to pass the argument explicitly. In this simple case, because we want the derivative of `x^2`, we don't want the output to be affected by any weighting, so we just use a tensor of ones.

In [0]:
# another way to put it would be
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x**2

# the weights are stored in a separate tensor first
w = torch.FloatTensor([1.0, 1.0])
y.backward(w)             # the tensor provided to .backward acts as a weight
x.grad                    # notice that we do not call backward(x)

tensor([2., 4.])

In [0]:
# This will happen if we provide x as argument of .backward()
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x**2

w = torch.FloatTensor([1.0, 1.0])

y.backward(x)             # the tensor provided to .backward acts as the weights
x.grad                    
# out: tensor([2., 8.])   # x as an argument outputs the wrong derivatives

tensor([2., 8.])

In this simple example of the derivative of `x^2`, passing `x` as the argument yields values that are not dy/dx, because `[1.0, 2.0]` is not a tensor of ones: each derivative is multiplied by the corresponding element of `x`.
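The effect of the weights can be checked by hand. In the sketch below, `w` holds the same values as `x`, and `expected` (an illustrative name) is the element-wise product of the weights and the true derivative `2x`.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x ** 2

w = x.detach().clone()          # same values that y.backward(x) would use as weights
y.backward(w)

expected = w * 2 * x.detach()   # element-wise: w_i * dy_i/dx_i
print(x.grad, expected)
```

With a tensor of ones as `w`, the same formula reduces to the plain derivative `2x`.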

## Increase the size of x
We make this more interesting by adding a new variable `z`.

In [0]:
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x**2
z = y**3

w = torch.FloatTensor([1.0, 1.0])

#y.backward(w, retain_graph=True)             # the tensor provided to .backward acts as the weights
#print(x.grad)

z.backward(w, retain_graph=True)
print(y.grad)
print(x.grad)

z.backward(w, retain_graph=True)
print(y.grad)
print(x.grad)

# why is y.grad None?

None
tensor([  6., 192.])
None
tensor([ 12., 384.])


In the previous example there is only one independent variable, `x`. Both `y` and `z` depend on `x`, and `z` depends on `y`. Printing `y.grad` gives `None` because `y` is not a leaf tensor: it is the result of an operation on `x`, and by default autograd only stores gradients in `.grad` for leaf tensors that have `requires_grad=True`. The gradient with respect to `y` is still used internally to compute `x.grad`; it is just not retained. Note also that the second call to `z.backward()` doubles `x.grad`, because gradients are accumulated across calls.
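If the gradient of an intermediate tensor such as `y` is needed, one option is to call `retain_grad()` on it before the backward pass. The sketch below repeats the example with that single extra call.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x ** 2
y.retain_grad()     # ask autograd to keep the gradient of this non-leaf tensor
z = y ** 3

z.backward(torch.ones_like(z))

print(y.is_leaf)    # y is the result of an operation, so it is not a leaf
print(y.grad)       # dz/dy = 3 * y**2
print(x.grad)       # dz/dx = 6 * x**5
```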

Let's see the next example:



In [0]:
# https://www.javatpoint.com/gradient-with-pytorch
import torch

x = torch.tensor([2.0, 2.0], requires_grad=True) # independent variable
z = torch.tensor([4.0, 4.0], requires_grad=True) # independent variable

y = x**2 + z**3  
# Its partial derivatives are:
#    dy/dx = 2x;     
#    dy/dz = 3z^2; 

w = torch.FloatTensor([1.0, 1.0])  # weights

y.backward(w, retain_graph=True)
print(x.grad)
print(z.grad)

y.backward(w, retain_graph=True)
print(x.grad)
print(z.grad)

y.backward(w, retain_graph=True)
print(x.grad)
print(z.grad)

# at every pass, the derivative is accumulated
# there is no second derivative calculated

tensor([4., 4.])
tensor([48., 48.])
tensor([8., 8.])
tensor([96., 96.])
tensor([12., 12.])
tensor([144., 144.])


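Now both printed gradients are populated, because `x` and `z` are both independent (leaf) variables and `y` is the only dependent variable. The first pass gives `dy/dx = 2x = 4` and `dy/dz = 3z^2 = 48`; each further pass adds the same values to `.grad`.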


## What happens if we turn off `retain_graph`
We will get an error on the second and subsequent passes.

     RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

In [0]:
# https://www.javatpoint.com/gradient-with-pytorch
import torch

x = torch.tensor([2.0, 2.0], requires_grad=True) # independent variable
z = torch.tensor([4.0, 4.0], requires_grad=True) # independent variable

y = x**2 + z**3  
# dy/dx = 2x;     
# dy/dz = 3z^2; 


w = torch.FloatTensor([1.0, 1.0])  # weights

y.backward(w)
print(x.grad)
print(z.grad)

try:
    y.backward(w)
    print(x.grad)
    print(z.grad)
except RuntimeError:
    print("Trying backward a second time when buffers are empty")

try:
    y.backward(w)
except RuntimeError:
    print("Trying backward a second time when buffers are empty")
else: 
    print(x.grad)
    print(z.grad)

# without retain_graph=True, only the first backward pass succeeds
# the second and third attempts raise a RuntimeError

tensor([4., 4.])
tensor([48., 48.])
Trying backward a second time when buffers are empty
Trying backward a second time when buffers are empty


## Calculating gradients per element, emptying the grad accumulator
For neural networks, we usually use a **loss** to assess how well the network has learned to classify the input image (or perform another task). The loss is usually a scalar value.

The `gradient` argument of the `backward()` method is used to compute a weighted sum of the derivatives of each element of a Variable w.r.t. the leaf Variable. In a neural network, these weights are just the derivatives of the final loss w.r.t. each element of the intermediate Variable.
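One way to read "weighted sum" is that `z.backward(w)` should give the same gradient as building the scalar `(w * z).sum()` and calling `backward()` on it. The sketch below compares the two. The weights in `w` are arbitrary illustrative values, and plain tensors are used in place of `Variable`.

```python
import torch

w = torch.tensor([[1.0, 0.5, 0.0, 2.0]])   # arbitrary illustrative weights

# Route 1: pass the weights directly to backward()
x1 = torch.tensor([[1.0, 2.0, 3.0, 4.0]], requires_grad=True)
z1 = 2 * x1
z1.backward(w)

# Route 2: form the weighted sum explicitly, then backprop from the scalar
x2 = torch.tensor([[1.0, 2.0, 3.0, 4.0]], requires_grad=True)
z2 = 2 * x2
(w * z2).sum().backward()

print(x1.grad)
print(x2.grad)
```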


In [0]:
# https://stackoverflow.com/a/47026836/5270873
from torch.autograd import Variable
import torch

x = Variable(torch.FloatTensor([[1, 2, 3, 4]]), requires_grad=True)
z = 2*x
loss = z.sum(dim=1) # one-element tensor of shape [1], treated like a scalar


# do backward for first element of z
z.backward(torch.FloatTensor([[1, 0, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_() #remove gradient in x.grad, or it will be accumulated



# do backward for second element of z
z.backward(torch.FloatTensor([[0, 1, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()


# do backward for all elements of z, with weight equal to the derivative of
# loss w.r.t z_1, z_2, z_3 and z_4
z.backward(torch.FloatTensor([[1, 1, 1, 1]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()



# or we can directly backprop using loss
loss.backward() # equivalent to loss.backward(torch.FloatTensor([1.0]))
print(x.grad.data)    

tensor([[2., 0., 0., 0.]])
tensor([[0., 2., 0., 0.]])
tensor([[2., 2., 2., 2.]])
tensor([[2., 2., 2., 2.]])
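The per-element passes above can be collected into a full Jacobian. The following is only a sketch of that idea: it uses `torch.autograd.grad`, which returns the gradient instead of accumulating it in `x.grad`, so no manual zeroing is needed. The names `rows`, `one_hot`, and `jacobian` are illustrative.

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0, 4.0]], requires_grad=True)
z = 2 * x

rows = []
for i in range(z.numel()):
    # One-hot weights select the i-th element of z
    one_hot = torch.zeros_like(z)
    one_hot[0, i] = 1.0
    # grad() returns a tuple with one entry per input; nothing is stored in x.grad
    (g,) = torch.autograd.grad(z, x, grad_outputs=one_hot, retain_graph=True)
    rows.append(g.flatten())

jacobian = torch.stack(rows)    # row i holds dz_i/dx
print(jacobian)
```

For `z = 2 * x`, each row has a single non-zero entry, which matches the one-hot results printed above.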
