In [267]:
import torch

In [268]:
x = torch.arange(4.0)

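Before computing the gradient of `y` with respect to `x`, we need a place to store it. We avoid allocating new memory every time we take a derivative: deep learning requires computing derivatives with respect to the same parameters a great many times, and allocating afresh on each step could exhaust memory. The gradient is therefore kept in a buffer attached to the tensor itself, `x.grad`.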


In [269]:
# Can also create x = torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
x.grad  # The gradient is None by default

In [270]:
x

tensor([0., 1., 2., 3.], requires_grad=True)

We now differentiate the function y = 2 * (x @ x), that is $y = 2x^\top x$, with respect to the column vector x. x already holds its initial value from the cells above, so we compute y from it.

In [271]:
y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

In [272]:
y.backward() # take the gradient of y with respect to x;  derivative of 2x@x is 4x
x.grad

tensor([ 0.,  4.,  8., 12.])

In [273]:
x.grad == 4 * x

tensor([True, True, True, True])
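
As a sanity check on the comparison above, the same derivative can be worked out by hand. This is only a short sketch, writing $x_i$ for the components of `x`:

$$
y = 2\,x^\top x = 2\sum_i x_i^2
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_i} = 4\,x_i
\quad\Longrightarrow\quad
\nabla_x\, y = 4x
$$

For $x = (0, 1, 2, 3)$ this gives $(0, 4, 8, 12)$, which matches the printed `x.grad`.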

Note that PyTorch does not automatically reset the gradient buffer when we record a new gradient; the new gradient is added to the already-stored gradient. This behavior comes in handy when we want to optimize the sum of multiple objective functions.


In [274]:
x.grad

tensor([ 0.,  4.,  8., 12.])

In [275]:
x.grad.zero_()  # Reset the gradient
x.grad, y

(tensor([0., 0., 0., 0.]), tensor(28., grad_fn=<MulBackward0>))
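
The accumulation behaviour described above can be seen in a small self-contained sketch. The tensor `w` and the names `first`, `accumulated` and `fresh` are made up for this illustration and are not part of the cells above.

```python
import torch

# Fresh leaf tensor, so the example does not depend on earlier cells
w = torch.arange(4.0, requires_grad=True)

# First backward pass: d(2 * w.w)/dw = 4w
(2 * torch.dot(w, w)).backward()
first = w.grad.clone()

# Second backward pass WITHOUT zeroing: d(sum(w))/dw = 1 is added on top
w.sum().backward()
accumulated = w.grad.clone()

# Zero the buffer, then run the second pass on its own
w.grad.zero_()
w.sum().backward()
fresh = w.grad.clone()

print(first)        # tensor([ 0.,  4.,  8., 12.])
print(accumulated)  # tensor([ 1.,  5.,  9., 13.])
print(fresh)        # tensor([1., 1., 1., 1.])
```

In a training loop the same reset is usually done once per step, before the backward pass.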

Suppose you have a tensor $Y$ that has been computed, directly or indirectly, from a tensor $X$.

Conceptually, differentiating $Y$ with respect to $X$ means taking the derivative of each element of $Y$ with respect to each element of $X$. This gives `N_out` (the number of elements in $Y$) gradient tensors, each of shape `X.shape`; together they form the Jacobian.

However, `Y.backward()` requires that the gradient stored in `X.grad` has the same shape as $X$. If `N_out = 1` there is no problem, because there is only one gradient tensor. That is why you usually reduce $Y$ to a single value, for example with `sum()` or `mean()`, before calling `backward()`.
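
The shape restriction is easy to trigger. The following is a minimal sketch, with `x_demo`, `y_vec` and `raised` as illustrative names: calling `backward()` on a non-scalar tensor without a `gradient` argument raises a `RuntimeError`, while reducing to a scalar first works.

```python
import torch

x_demo = torch.arange(4.0, requires_grad=True)
y_vec = x_demo * x_demo          # four outputs, so N_out = 4

try:
    y_vec.backward()             # no gradient argument supplied
    raised = False
except RuntimeError as err:
    raised = True
    print("backward() failed:", err)

# Reducing to a single value first makes the call valid
y_vec.sum().backward()
print(x_demo.grad)               # d(sum(x^2))/dx = 2x
```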

`some_loss_function.sum().backward()` sums the loss values across the batch and backpropagates from that sum. Every element of the batch contributes with weight 1, so the magnitude of the loss and of its gradients grows with the batch size. The effective step size then depends on the batch size, and with large batches the gradients can become large enough to destabilize training.

`some_loss_function.mean().backward()` averages the loss values across the batch and backpropagates from that mean. Every element again contributes equally, now with weight `1/N` for a batch of `N` elements, so the gradients are those of the summed loss divided by `N`. This keeps the gradient scale roughly independent of the batch size, and it is the default reduction (`reduction='mean'`) in most of PyTorch's built-in loss functions.
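
The relationship between the two reductions can be checked numerically. The sketch below uses a made-up squared-error loss on random data; `w`, `inputs`, `targets` and `per_example_loss` are hypothetical names chosen for the illustration. It uses `torch.autograd.grad` so that nothing accumulates in `w.grad`.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)   # hypothetical parameters
inputs = torch.randn(8, 3)               # batch of N = 8 examples
targets = torch.randn(8)

# One loss value per example, shape (8,)
per_example_loss = (inputs @ w - targets) ** 2

# Gradients of the summed and of the averaged loss with respect to w
grad_sum, = torch.autograd.grad(per_example_loss.sum(), w, retain_graph=True)
grad_mean, = torch.autograd.grad(per_example_loss.mean(), w)

# The two differ only by the factor N
print(torch.allclose(grad_sum, grad_mean * len(targets)))  # True
```

With a fixed learning rate, switching from `mean()` to `sum()` therefore behaves like multiplying the learning rate by the batch size.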

If `N_out > 1`, PyTorch computes a weighted sum over the `N_out` per-output gradients, and you must supply the weights for that sum. You do this with the `gradient` argument:
`Y.backward(gradient=weights_shaped_like_Y)`

If you give every element of Y weight 1, you get the same result as with `torch.sum(Y).backward()`.
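
A short check of this equivalence, using two separate leaf tensors `a` and `b` (illustrative names) so that the gradients do not accumulate into each other:

```python
import torch

# Route 1: non-scalar output, every element weighted by 1
a = torch.arange(4.0, requires_grad=True)
(a * a).backward(gradient=torch.ones(4))
grad_from_ones = a.grad.clone()

# Route 2: reduce with torch.sum first, then call backward() without arguments
b = torch.arange(4.0, requires_grad=True)
torch.sum(b * b).backward()
grad_from_sum = b.grad.clone()

print(grad_from_ones)                               # tensor([0., 2., 4., 6.])
print(torch.equal(grad_from_ones, grad_from_sum))   # True
```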


https://zhang-yang.medium.com/the-gradient-argument-in-pytorchs-backward-function-explained-by-examples-68f266950c29

NOTE: We did not pass the ``gradient`` argument to ``backward()`` above. For a scalar output it defaults to 1, so ``.backward()`` without arguments is equivalent to ``.backward(torch.tensor(1.0))``. What PyTorch computes is a vector-Jacobian product: the ``gradient`` argument is the vector, and the result has the shape of the input. When the input is a vector and the output is a scalar, such as y = x1 + x2, the default still applies because the output has a single element. When the output is itself a vector, for example y1 = f(x1), y2 = f(x2), and so on, there is no default and ``gradient`` must be given with the same shape as the output.
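
To make the vector-Jacobian product concrete, the sketch below compares `backward(gradient=v)` against an explicitly built Jacobian. The function `f`, the weight vector `v` and the tensor `x_demo` are hypothetical examples; `torch.autograd.functional.jacobian` is used only to build the full Jacobian for the comparison, not because `backward()` computes it that way.

```python
import torch
from torch.autograd.functional import jacobian

def f(t):
    # Elementwise function, so its Jacobian is diagonal: diag(2t + 3)
    return t ** 2 + 3 * t

x_demo = torch.arange(4.0, requires_grad=True)
v = torch.tensor([1.0, 0.1, 0.01, 0.001])   # one weight per output element

# backward() with an explicit gradient argument stores v^T J in x_demo.grad
f(x_demo).backward(gradient=v)

# Full Jacobian, shape (4, 4), built explicitly for comparison
J = jacobian(f, x_demo.detach())

print(torch.allclose(x_demo.grad, v @ J))   # True
```

Building the full Jacobian costs one backward pass per output element, which is why `backward()` only ever forms the product with a single vector.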

In [276]:
# backward() without arguments works only on a *scalar* output.
# It gives you d(loss)/d(parameter): one gradient value per
# parameter element, so x.grad has the same shape as x.
y = x.sum()
y.backward(), x.grad

(None, tensor([1., 1., 1., 1.]))