# Explaining the backward function in PyTorch

**Sources**: 
* https://stackoverflow.com/questions/57248777/backward-function-in-pytorch

* https://towardsdatascience.com/pytorch-autograd-understanding-the-heart-of-pytorchs-magic-2686cd94ec95

`backward()` is the function that actually computes the gradients. It passes its argument, by default a unit tensor holding the single value `1.0`, through the backward graph all the way up to every leaf node reachable from the tensor it is called on.

An important detail is that calling `z.backward()` on a scalar `z` is equivalent to calling `z.backward(torch.tensor(1.0))`. The `torch.tensor(1.0)` is the **external gradient**: the starting value that the chain rule multiplies back through the graph. This **external gradient** is passed as the input to the `MulBackward` function, which then computes the gradient with respect to `x`. The tensor passed to `.backward()` must have the same shape as the tensor on which `.backward()` is called. For example, suppose the gradient-enabled tensors `x` and `y` are defined as follows:

    x = torch.tensor([0.0, 2.0, 8.0], requires_grad=True)

    y = torch.tensor([5.0, 1.0, 7.0], requires_grad=True)

and 

    z = x * y

then, to compute the gradients of `z` (a 1-D tensor with 3 elements) with respect to `x` or `y`, an **external gradient** must be passed to `z.backward()` as follows:

    z.backward(torch.FloatTensor([1.0, 1.0, 1.0]))

Otherwise, `z.backward()` raises `RuntimeError: grad can be implicitly created only for scalar outputs`.
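
Both cases can be sketched as follows, reusing the `x`, `y`, and `z` defined above. The use of `torch.ones_like` to build the external gradient and the `error_message` variable are choices made for this illustration, not part of the sources.

```python
import torch

x = torch.tensor([0.0, 2.0, 8.0], requires_grad=True)
y = torch.tensor([5.0, 1.0, 7.0], requires_grad=True)
z = x * y  # shape (3,), so z is not a scalar

# Without an argument, autograd refuses to create the external gradient.
try:
    z.backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)
print(error_message)

# With an external gradient of the same shape as z, the call succeeds.
z.backward(torch.ones_like(z))
print(x.grad)  # dz/dx = y
print(y.grad)  # dz/dy = x
```

Since `z = x * y` is elementwise, `x.grad` should hold the values of `y` and `y.grad` the values of `x`.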

## Simplest example: Gradient of a scalar
By default, PyTorch passes a unit tensor (a tensor holding the single value `1`) as the argument to `backward`. This happens only when the output is a scalar, that is, when it holds a single element.

In [0]:
import torch

# A leaf tensor that tracks gradients (Variable is no longer needed)
a = torch.tensor([4.0], requires_grad=True)
out = a * a
out.backward()      # no argument: a unit tensor is used implicitly
print(a.grad)

tensor([8.])


The example below shows the same computation with the unit tensor passed explicitly to the `backward` function.

In [0]:
# equivalent example
import torch

# A leaf tensor that tracks gradients (Variable is no longer needed)
a = torch.tensor([4.0], requires_grad=True)
out = a * a
out.backward(torch.tensor([1.0]))      # passing the unit tensor explicitly
print(a.grad)

tensor([8.])
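
Note that `out` in the two cells above has shape `(1,)` rather than being 0-dimensional. As a small sketch of what "scalar" means in practice, the loop below calls `backward()` without an argument on outputs of three different shapes that each hold exactly one element. The three input shapes and the `grads` list are illustrative choices, not from the sources.

```python
import torch

# A 0-dim tensor, a 1-element vector, and a 1x1 matrix all hold one value.
inputs = [
    torch.tensor(4.0, requires_grad=True),
    torch.tensor([4.0], requires_grad=True),
    torch.tensor([[4.0]], requires_grad=True),
]

grads = []
for a in inputs:
    out = a * a
    out.backward()  # no argument: a gradient of 1 is created implicitly
    grads.append(a.grad)
    print(tuple(out.shape), a.grad)
```

In each case the gradient is `2 * a = 8`, returned with the same shape as the corresponding input.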


## Gradient of non-scalar tensors

In [0]:
import torch

# 2x3 leaf tensor that tracks gradients
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], requires_grad=True)
out = a * a
out.backward(a)     # passing `a` itself as the external gradient
print(a.grad)


tensor([[ 2.,  8., 18.],
        [32., 50., 72.]])


In this simple isolated case, the previous call does not return the derivative of `a^2`. Because `a` itself is passed as the external gradient, each derivative `2a` is multiplied elementwise by the corresponding entry of `a`, so `a.grad` holds `2a^2` instead of `2a`. In other words, non-unit weights rescale the result.
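
The scaling effect can be checked directly. The sketch below repeats the call with a detached copy of `a` as the external gradient (named `v` here purely for illustration) and compares `a.grad` with `v * 2a` computed by hand. The elementwise formula holds because `a * a` is an elementwise operation.

```python
import torch

a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], requires_grad=True)
v = a.detach().clone()  # external gradient with the same values as `a`

out = a * a
out.backward(v)

# For an elementwise op, each entry of a.grad is v * d(out)/d(a) = v * 2a.
expected = v * 2 * a.detach()
print(a.grad)
print(torch.equal(a.grad, expected))
```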

The example below shows the correct use of the external gradient in the `backward` function: a tensor of ones with the same shape as `out`.

In [0]:
import torch

# 2x3 leaf tensor that tracks gradients
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], requires_grad=True)

# add this tensor of ones as the "weights" or external gradient
w = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])

out = a * a
out.backward(w)
print(a.grad)

tensor([[ 2.,  4.,  6.],
        [ 8., 10., 12.]])
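
Passing a tensor of ones gives the same result as summing the output to a scalar and calling `backward()` with no argument. The sketch below compares the two routes on independent copies of the same data; the second tensor `b` is introduced only for this comparison.

```python
import torch

a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], requires_grad=True)
b = a.detach().clone().requires_grad_(True)  # independent copy for comparison

# Route 1: explicit external gradient of ones.
(a * a).backward(torch.ones_like(a))

# Route 2: reduce to a scalar first, then rely on the implicit gradient of 1.
(b * b).sum().backward()

print(a.grad)
print(b.grad)
```

Both routes should print the same `2a` values shown above.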


## Another example with a 5x3 tensor

In [0]:
# this example uses a 5x3 tensor
import numpy as np
import torch

x = np.array([[73, 67, 43],
              [91, 88, 64],
              [87, 134, 58],
              [102, 43, 37],
              [69, 96, 70]], dtype='float32')

x = torch.from_numpy(x)
x.requires_grad = True
print(x)

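# external gradient: a tensor of ones with the same shape as x
# (it does not need requires_grad=True)
w = torch.ones(5, 3)

y = x**2

# retain_graph=True is only needed if backward is called again on this graph
y.backward(w, retain_graph=True)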
print(x.grad)

tensor([[ 73.,  67.,  43.],
        [ 91.,  88.,  64.],
        [ 87., 134.,  58.],
        [102.,  43.,  37.],
        [ 69.,  96.,  70.]], requires_grad=True)
tensor([[146., 134.,  86.],
        [182., 176., 128.],
        [174., 268., 116.],
        [204.,  86.,  74.],
        [138., 192., 140.]])
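
The cell above passes `retain_graph=True`, which matters only when `backward` is called more than once on the same graph. As a rough sketch of that situation, the code below uses the first two rows of `x` for brevity and calls `backward` twice. The variable names `first` and `second` are illustrative.

```python
import torch

x = torch.tensor([[73.0, 67.0, 43.0], [91.0, 88.0, 64.0]], requires_grad=True)
w = torch.ones_like(x)  # a plain tensor is enough for the external gradient

y = x ** 2
y.backward(w, retain_graph=True)  # keep the graph alive for another pass
first = x.grad.clone()

y.backward(w)  # second pass over the same graph
second = x.grad.clone()

print(first)   # 2 * x
print(second)  # 4 * x, because gradients accumulate in x.grad
```

The second result is doubled because `backward` adds to `x.grad` rather than overwriting it.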


## Gradients with multiple variables

Source: https://towardsdatascience.com/getting-started-with-pytorch-part-1-understanding-how-automatic-differentiation-works-5008282073ec

![alt text](https://drive.google.com/uc?id=1adhiJx8-y9ALXBfwKl0UzQOy-DjHUecm)


In [0]:
import torch


# Define the leaf nodes
a = torch.tensor([4.0])

weights = [torch.tensor([float(i)], requires_grad=True) for i in (2, 5, 9, 7)]

# unpack the weights for nicer assignment
w1, w2, w3, w4 = weights

b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = (10 - d)

L.backward()

for index, weight in enumerate(weights, start=1):
    gradient = weight.grad.item()
    print(f"Gradient of L w.r.t. w{index}: {gradient}")

Gradient of L w.r.t. w1: -36.0
Gradient of L w.r.t. w2: -28.0
Gradient of L w.r.t. w3: -8.0
Gradient of L w.r.t. w4: -20.0


![alt text](https://drive.google.com/uc?id=1jUhObd5fpqzgjQnZ_l4pa7g1gfkjcL0U)
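
The four gradients can also be derived by hand with the chain rule and compared against autograd. The sketch below rebuilds the same graph with plain tensors; the names `by_hand` and `from_autograd` are introduced only for this check.

```python
import torch

a = torch.tensor([4.0])
w1, w2, w3, w4 = (torch.tensor([v], requires_grad=True) for v in (2.0, 5.0, 9.0, 7.0))

b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = 10 - d
L.backward()

# Chain rule by hand, starting from dL/dd = -1:
#   dL/dw1 = dL/dd * dd/db * db/dw1 = -1 * w3 * a
#   dL/dw2 = dL/dd * dd/dc * dc/dw2 = -1 * w4 * a
#   dL/dw3 = dL/dd * dd/dw3         = -1 * b
#   dL/dw4 = dL/dd * dd/dw4         = -1 * c
by_hand = [(-w3 * a).item(), (-w4 * a).item(), (-b).item(), (-c).item()]
from_autograd = [w.grad.item() for w in (w1, w2, w3, w4)]

print(by_hand)
print(from_autograd)
```

Both lists should match the values printed by the cell above.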

## Example. Get the size of the independent variable

In [0]:
# From an example on linear regression
import numpy as np
import torch

x = np.arange(1, 16, 1)
x = torch.from_numpy(x)
x = x.float()
print(x.shape)           # torch.Size([15])
x = x.view(-1, 3)        # reshape to 5x3, torch.Size([5, 3])
x.requires_grad = True
print(x)

# w is the external gradient and y the model; a very simple example.
# Using the size of x ensures w has the same shape as x.
w = torch.ones(x.size(0), x.size(1))
y = x**2

y.backward(w)
print(x.grad)

torch.Size([15])
tensor([[ 1.,  2.,  3.],
        [ 4.,  5.,  6.],
        [ 7.,  8.,  9.],
        [10., 11., 12.],
        [13., 14., 15.]], requires_grad=True)
tensor([[ 2.,  4.,  6.],
        [ 8., 10., 12.],
        [14., 16., 18.],
        [20., 22., 24.],
        [26., 28., 30.]])
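
Reading the size of `x` dimension by dimension works, but there are shorter ways to build a matching tensor of ones. The sketch below shows three variants side by side; `torch.ones_like` is offered as an alternative here and is not used in the original example.

```python
import torch

x = torch.arange(1, 16, dtype=torch.float32).view(-1, 3)  # 5x3, values 1..15
x.requires_grad_(True)

# Three ways to build an external gradient whose shape matches x.
w_from_size = torch.ones(x.size(0), x.size(1))
w_from_shape = torch.ones(x.shape)
w_like = torch.ones_like(x)  # also copies dtype and device from x

y = x ** 2
y.backward(w_like)
print(x.grad)
```

The `ones_like` form has the advantage of staying correct if `x` later changes its number of dimensions, dtype, or device.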
