# Computing gradients and derivatives in PyTorch
>"Making good use of the `gradient` argument in PyTorch's `backward` function"

- toc: true 
- badges: true
- comments: true
- categories: [mathematics]

---
tags: mathematics pytorch gradients backward automatic differentiation vector-Jacobian product backpropagation

---

# tl;dr
The `backward` function in `PyTorch` can be used to compute the derivatives or gradients of functions.  Because `backward` computes vector-Jacobian products, the appropriate vector must be determined and passed as its `gradient` argument.  When `gradient` is not passed explicitly, `backward` chooses the appropriate value, but only in the simplest cases.

This notebook explains vector-Jacobian products and how to choose the `gradient` argument in the `backward` function in the general case.

# A brief overview
In the case of a function taking a scalar and returning a scalar, the use of the `backward` function is quite straightforward:

In [10]:
# collapse-hide
import torch
x = torch.tensor(1., requires_grad=True)
y = x**2
y.backward()
print(f"Derivative at a single point:")
print(x.grad.data)

Derivative at a single point:
tensor(2.)


However, when  
- the function is **vector-valued** or **matrix-valued**; or  
- one wishes to compute the derivative of a function at **multiple** points,  

then the `gradient` argument in `backward` must be suitably chosen.  For example:

In [13]:
# collapse-hide
import torch
x = torch.linspace(-2, 2, 5, requires_grad=True)
y = x**2
gradient = torch.ones_like(y)
y.backward(gradient)
print("Derivative at multiple points:")
print(x.grad.data)

Derivative at multiple points:
tensor([-4., -2.,  0.,  2.,  4.])
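
To see why the `gradient` argument matters here, the following minimal sketch calls `backward()` without it on the same non-scalar output.  The variable name `error_message` is introduced only for illustration; the point is that `PyTorch` refuses to guess the vector when `y` holds more than one element.

```python
import torch

x = torch.linspace(-2, 2, 5, requires_grad=True)
y = x**2  # y holds 5 elements, so it is not a scalar

try:
    y.backward()  # no `gradient` argument is passed
    error_message = None
except RuntimeError as err:
    # PyTorch can only create the gradient implicitly for single-element outputs
    error_message = str(err)

print(error_message)
```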


More precisely, the `backward` function computes vector-Jacobian products, a fact that is not explicit in the function's doc string:

In [24]:
# collapse-hide
print("First line of `torch.Tensor.backward` doc string:")
print("\""+ torch.Tensor.backward.__doc__.split("\n")[0] + "\"")

First line of `torch.Tensor.backward` doc string:
"Computes the gradient of current tensor w.r.t. graph leaves."


Some explanations are given [in this official tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients), however.  The crucial point is therefore to choose the appropriate vector, which is passed to the `backward` function in its `gradient` argument:

In [45]:
# collapse-hide
import inspect
import torch
print(f"torch.Tensor.backward{inspect.signature(torch.Tensor.backward)}")
print("...")
print("\n".join(torch.Tensor.backward.__doc__.split("\n")[11:18]))
print("...")

torch.Tensor.backward(self, gradient=None, retain_graph=None, create_graph=False)
...
        Arguments:
            gradient (Tensor or None): Gradient w.r.t. the
                tensor. If it is a tensor, it will be automatically converted
                to a Tensor that does not require grad unless ``create_graph`` is True.
                None values can be specified for scalar Tensors or ones that
                don't require grad. If a None value would be acceptable then
                this argument is optional.
...


There is a way around specifying the `gradient` argument.  Revisiting the example above, the derivative at multiple points can be equivalently calculated by adding a `sum()`:

In [44]:
# collapse-hide
import torch
x = torch.linspace(-2, 2, 5, requires_grad=True)
y = (x**2).sum()

y.backward()
print("Derivative at multiple points:")
print(x.grad.data)

Derivative at multiple points:
tensor([-4., -2.,  0.,  2.,  4.])


Here, the `backward` method is invoked on a different `tensor`:  
```
(x**2).backward()
```
if `x` contains a single input,
vs
```
(x**2).sum().backward()
```
if `x` contains multiple inputs.

On the other hand, when the `gradient` argument is passed, the same command computes the derivatives whether `x` contains one or multiple inputs:
```
y = (x**2)
y.backward(torch.ones_like(y))
```

Roughly speaking, the difference between the two methods, namely setting `gradient=torch.ones_like(y)` or adding `sum()`, is in the order of the summation and differentiation.

# Usage examples of the `backward` function
The derivative of the **scalar**, **univariate** function $f(x)=x^2$ at a **single** point $x=1$:

In [1]:
import torch
x = torch.tensor(1., requires_grad=True)
y = x**2
y.backward()
x.grad

tensor(2.)

The derivative of the **scalar**, **univariate** function $f(x)=x^2$ at **multiple** points $x= -2, -1, \dots, 2$:

In [2]:
import torch
x = torch.linspace(-2, 2, 5, requires_grad=True)
y = x**2
v = torch.ones_like(y)
y.backward(v)
x.grad

tensor([-4., -2.,  0.,  2.,  4.])

The gradient of the **scalar**, **multivariate** function $f(x_1, x_2)=3x_1^2 + 5x_2^2$ at a **single** point $(x_1, x_2)=(-1, 2)$:

In [3]:
import torch
x = torch.tensor([-1., 2.], requires_grad=True)
w = torch.tensor([3., 5.])
y = (x*x*w).sum()
y.backward()
x.grad

tensor([-6., 20.])

The gradient of the **scalar**, **multivariate** function $f(x_1, x_2) = -x_1^2 + x_2^2$ at **multiple** points $(x_1, x_2)$:

In [4]:
import torch
x = torch.arange(6, dtype=float).view(3, 2).requires_grad_(True)
w = torch.tensor([-1, 1])
y = (x*x*w).sum(1)
v = torch.ones_like(y)
y.backward(v)
x.grad

tensor([[-0.,  2.],
        [-4.,  6.],
        [-8., 10.]], dtype=torch.float64)

The _derivatives_ of the **vector-valued**, **univariate** function $f(x)= (-x^3, 5x)$ at a **single** point $x=1$, i.e. the derivative of
- its first component function $f_1(x)=-x^3$; and
- its second component function $f_2(x)=5x$.

In [5]:
# collapse-hide
import torch
x = torch.tensor(1., requires_grad=True)
y = torch.stack([-x**3, 5*x])

v1 = torch.tensor([1., 0.])
y.backward(v1, retain_graph=True)

print(f"f_1'({x.data.item()}) = {x.grad.data.item():>4}")

x.grad.zero_()

v2 = torch.tensor([0., 1.])
y.backward(v2)
print(f"f_2'({x.data.item()}) = {x.grad.data.item():>4}")

f_1'(1.0) = -3.0
f_2'(1.0) =  5.0


The _derivatives_ of the **vector-valued**, **univariate** function $f(x)= (-x^3, 5x)$ at **multiple** points, i.e. the derivative of
- its first component function $f_1(x)=-x^3$; and
- its second component function $f_2(x)=5x$.

In [6]:
# collapse-hide
import torch
import itertools
x = torch.arange(3, dtype=float, requires_grad=True)
y = torch.stack([-x**3, 5*x])

ranges = [range(_) for _ in y.shape]

v1 = torch.tensor([1. if i == 0 else 0. for i, j in itertools.product(*ranges)]).view(*y.shape)
y.backward(v1, retain_graph=True)
print(f"Derivative of f_1(x)=-x^3 at the points {tuple(x.data.view(-1).tolist())}:")
print(x.grad)

x.grad.zero_()

v2 = torch.tensor([1. if i == 1 else 0. for i, j in itertools.product(*ranges)]).view(*y.shape)
y.backward(v2)
print(f"\nDerivative of f_2(x)=5x at the points {tuple(x.data.view(-1).tolist())}:")
print(x.grad)

Derivative of f_1(x)=-x^3 at the points (0.0, 1.0, 2.0):
tensor([  0.,  -3., -12.], dtype=torch.float64)

Derivative of f_2(x)=5x at the points (0.0, 1.0, 2.0):
tensor([5., 5., 5.], dtype=torch.float64)


The _gradients_ of the **vector-valued**, **multivariate** function
$$
f(x_1, \dots, x_n) = (x_1 + \dots + x_n\,, x_1^2 + \dots + x_n^2)
$$
at a **single** point $(x_1, \dots, x_n)$, i.e. the gradient of
- its first component function $f_1(x_1, \dots, x_n) = x_1 + \dots + x_n$; and
- its second component function $f_2(x_1, \dots, x_n) = x_1^2 + \dots + x_n^2$.

In [17]:
# collapse-show
import torch
x = torch.arange(4, dtype=float, requires_grad=True)
y = torch.stack([x.sum(), (x**2).sum()])

print(f"x                 : {tuple(x.data.tolist())}")
print(f"y = (y_1, y_2)    : {tuple(y.data.tolist())}")

v1 = torch.tensor([1., 0.])
y.backward(v1, retain_graph=True)
print(f"gradient of y_1   : {tuple(x.grad.data.tolist())}")

x.grad.zero_()

v2 = torch.tensor([0., 1.])
y.backward(v2)
print(f"gradient of y_2   : {tuple(x.grad.data.tolist())}")

x                 : (0.0, 1.0, 2.0, 3.0)
y = (y_1, y_2)    : (6.0, 14.0)
gradient of y_1   : (1.0, 1.0, 1.0, 1.0)
gradient of y_2   : (0.0, 2.0, 4.0, 6.0)


The _gradients_ of the **vector-valued**, **multivariate** function
$$
f(x_1, \dots, x_n) = (x_1 + \dots + x_n\,, x_1^2 + \dots + x_n^2)
$$
at **multiple** points, i.e. the gradient of
- its first component function $f_1(x_1, \dots, x_n) = x_1 + \dots + x_n$; and
- its second component function $f_2(x_1, \dots, x_n) = x_1^2 + \dots + x_n^2$.

In [18]:
# collapse-show
import torch
import itertools
x = torch.arange(4*3, dtype=float).view(-1,4).requires_grad_(True)
y = torch.stack([x.sum(1), (x**2).sum(1)])
print("x:")
print(x.data)
print("y:")
print(y.data)

print()

ranges = [range(_) for _ in y.shape]

v1 = torch.tensor([1. if i == 0 else 0. for i, j in itertools.product(*ranges)]).view(*y.shape)
y.backward(v1, retain_graph=True)
print("Gradients of the f1 at multiple points:")
print(x.grad)

x.grad.zero_()

print()
v2 = torch.tensor([1. if i == 1 else 0. for i, j in itertools.product(*ranges)]).view(*y.shape)
y.backward(v2)
print("Gradients of the f2 at multiple points:")
print(x.grad)



x:
tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]], dtype=torch.float64)
y:
tensor([[  6.,  22.,  38.],
        [ 14., 126., 366.]], dtype=torch.float64)

Gradients of the f1 at multiple points:
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]], dtype=torch.float64)

Gradients of the f2 at multiple points:
tensor([[ 0.,  2.,  4.,  6.],
        [ 8., 10., 12., 14.],
        [16., 18., 20., 22.]], dtype=torch.float64)


# Mathematical preliminaries
## Scalars, vectors, matrices, and tensors

- A **scalar** is a real number.  It is usually denoted with $x$. 
- An **$n$-dimensional vector** is a list $(x_1, \dots, x_n)$ of scalars.
- An **$m$-by-$n$ matrix** is an array with $m$ rows and $n$ columns of scalars:
$$
\begin{bmatrix}w_{1,1}&\dots&w_{1,n}\\\vdots&\ddots&\vdots\\w_{m,1}&\dots&w_{m,n}\end{bmatrix}
$$
- A **column vector** of length $n$ is an $n$-by-$1$ matrix:
$$\begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}$$
Note that it is distinct from its vector counterpart $(x_1, \dots, x_n)$.
- A **row vector** of length $n$ is a $1$-by-$n$ matrix:
$$\begin{bmatrix}x_1&\dots&x_n\end{bmatrix}$$
Note that it is distinct from its vector and column vector counterparts.

>Note:
For convenience, we may denote a vector, a column vector, or a row vector with a single symbol, typically $x$.

In another post we establish the following correspondence between these mathematical entities and their `tensor` counterparts in `PyTorch`:

|mathematical name|mathematical notation|`tensor` shape|`tensor` dimension|
|---|---|---|---|
|scalar|$x$|`()`|`0`|
|vector|$(x_1, \dots, x_n)$|`(n,)`|`1`|
|matrix|$\begin{bmatrix}w_{1,1}&\dots&w_{1,n}\\\vdots&\ddots&\vdots\\w_{m,1}&\dots&w_{m,n}\end{bmatrix}$|`(m,n)`| `2`|
|column vector|$\begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}$|`(n,1)`|`2`|
|row vector|$\begin{bmatrix}x_1&\dots&x_n\end{bmatrix}$|`(1,n)`|`2`|

## Mathematical functions
- We consider functions which are mappings from scalars, vectors, or matrices to scalars, vectors, or matrices.  Such a function is generically denoted $y=f(x)$.
- A **scalar** function $y=f(x)$ is a function returning a scalar, i.e. $y$ is a scalar.  
- A **vector-valued** function $y=f(x)$ is a function returning a vector, i.e. $y$ is a vector.  We often write
$$f(x) = (f_1(x), \dots, f_m(x))$$
if the output is $m$-dimensional, where each of $f_1(x), \dots, f_m(x)$ is a scalar function.
- A **univariate** function $y=f(x)$ is a function depending on a scalar $x$.
- A **multivariate** function $y=f(x)$ is a function depending on a vector $x=(x_1, \dots, x_n)$.  

In summary

|$y=f(x)$|scalar-valued|vector-valued|
|---|---|---|
|**univariate**|$x$ is a scalar<br>$y$ is a scalar|$x$ is a scalar<br>$y$ is a vector|
|**multivariate**|$x$ is a vector<br>$y$ is a scalar|$x$ is a vector<br>$y$ is a vector|

## Differentiation
### Basic definitions
We do not recall the definitions for:
- the **derivative** $f'(x)$ of a scalar, univariate function $y=f(x)$ evaluated at a scalar $x$;
- the **partial derivatives** $\frac{\partial f}{\partial x_i}(x)$, $i=1, \dots, n$, of a scalar, multivariate function $y=f(x)$ with respect to the variables $x_1, \dots, x_n$, and evaluated at $x=(x_1, \dots, x_n)$.

### Derivatives of vector-valued, univariate functions
The **derivative** of a vector-valued, univariate function $y=f(x)$ evaluated at a scalar $x$ is the vertical concatenation of the derivatives of its component functions:
$$f'(x) = \begin{bmatrix}f_1'(x)\\\vdots\\f_m'(x)\end{bmatrix}$$

### Gradients
The **gradient** of a scalar-valued, multivariate function $y=f(x)$ is the *row* vector of its partial derivatives:
$$\nabla f(x) = \begin{bmatrix}\frac{\partial f}{\partial x_1}(x)&\dots&\frac{\partial f}{\partial x_n}(x)\end{bmatrix}$$
with length $n$ if $x$ is $n$-dimensional: $x=(x_1, \dots, x_n)$.

### Jacobians
The **Jacobian** of a vector-valued, multivariate function $y=f(x)$ is the vertical concatenation of the gradients of the component functions $f_1, \dots, f_m$:
$$J_f(x)
\,=\,
\begin{bmatrix}
\nabla f_1(x)\\\vdots\\\nabla f_m(x)
\end{bmatrix}
\,=\,
\begin{bmatrix}
    \frac{\partial f_1}{\partial x_1}(x)&\dots&\frac{\partial f_1}{\partial x_n}(x)\\
    \vdots&\ddots&\vdots\\
    \frac{\partial f_m}{\partial x_1}(x)&\dots&\frac{\partial f_m}{\partial x_n}(x)
\end{bmatrix}
$$
It is thus an $m$-by-$n$ matrix, i.e. with $m$ rows and $n$ columns.

#### Special case: $m=1$
In case $m=1$, the Jacobian agrees with the gradient of a scalar, multivariate function:
$$J_f(x) = \nabla f(x)$$

#### Special case: $n=1$
In case $n=1$, the Jacobian agrees with the derivative of a vector-valued, univariate function.
$$J_f(x) = \begin{bmatrix}f_1'(x)\\\vdots\\f_m'(x)\end{bmatrix}$$
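
The shape convention of the Jacobian can be checked numerically.  The sketch below is an illustration only: it uses `torch.autograd.functional.jacobian`, which the rest of this post does not rely on, and an example function `f` with $n=3$ inputs and $m=2$ outputs chosen here for the purpose.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Illustrative function: f(x) = (x_1 + ... + x_n, x_1^2 + ... + x_n^2)
    return torch.stack([x.sum(), (x**2).sum()])

x = torch.tensor([1., 2., 3.])  # n = 3 inputs
J = jacobian(f, x)              # m = 2 outputs, so J is 2-by-3

print(J.shape)
print(J)
```

The first row is the gradient of $f_1$ and the second row is the gradient of $f_2$, in agreement with the definition above.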

## Vector-Jacobian products
Given a vector-valued, multivariate function $y=f(x)$ and a _column_ vector
$v=\begin{bmatrix}v_1\\\vdots\\v_m\end{bmatrix}$,
the **vector-Jacobian product** is the matrix multiplication
$$v^\top J_f(x) \,=\,
\begin{bmatrix}
v_1&\dots&v_m
\end{bmatrix}
\begin{bmatrix}
    \frac{\partial f_1}{\partial x_1}(x)&\dots&\frac{\partial f_1}{\partial x_n}(x)\\
    \vdots&\ddots&\vdots\\
    \frac{\partial f_m}{\partial x_1}(x)&\dots&\frac{\partial f_m}{\partial x_n}(x)
\end{bmatrix}
$$
which is then a _row_ vector of length $n$.
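
As a sanity check, one can compare what `backward` returns with the explicit matrix product $v^\top J_f(x)$.  This is a minimal sketch under assumptions made here: the function `f`, the vector `v`, and the use of `torch.autograd.functional.jacobian` to build the Jacobian explicitly are all illustrative choices.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Illustrative function with n = 3 inputs and m = 2 outputs
    return torch.stack([x.sum(), (x**2).sum()])

x = torch.tensor([1., 2., 3.], requires_grad=True)
v = torch.tensor([10., 1.])  # arbitrary vector of length m = 2

# Vector-Jacobian product as computed by `backward`
f(x).backward(v)
vjp_backward = x.grad.clone()

# Same product computed from the explicit Jacobian
J = jacobian(f, x.detach())
vjp_explicit = v @ J

print(vjp_backward)
print(vjp_explicit)
```

Both results have length $n=3$, and `backward` never needs to materialize the full Jacobian to obtain them.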

### Special case
If $v^\top$ happens to be the gradient of a scalar-valued function $z=\ell(y)$ evaluated at $f(x)$, i.e. $v^\top = \nabla \ell(y)$ where $y=f(x)$, then
\begin{equation}
v^\top J_f(x) 
\,=\,\nabla (\ell\circ f)(x)
\end{equation}
In other words, $v^\top J_f(x)$ is the gradient of the composition of the function $\ell$ with the function $f$.
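
This special case can be illustrated with a small sketch.  The functions `f` and `ell` below are hypothetical examples chosen here, not taken from elsewhere in this post; the vector `v` is obtained as the gradient of `ell` at $y=f(x)$ and then passed to `backward`.

```python
import torch

def f(x):
    # Illustrative vector-valued function
    return torch.stack([x.sum(), (x**2).sum()])

def ell(y):
    # Illustrative scalar-valued function of y = (y_1, y_2)
    return y[0] * y[1]

x = torch.tensor([1., 2., 3.], requires_grad=True)

# Step 1: v is the gradient of ell evaluated at y = f(x)
y_leaf = f(x).detach().requires_grad_(True)
(v,) = torch.autograd.grad(ell(y_leaf), y_leaf)

# Step 2: vector-Jacobian product with that v
f(x).backward(v)
vjp = x.grad.clone()

# Reference: differentiate the composition directly
x.grad = None
ell(f(x)).backward()
direct = x.grad.clone()

print(vjp)
print(direct)
```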

>Note:
The vector-Jacobian product can be generalized to cases where $x$ and $y$ are (mathematical) tensors of higher dimensions.  This generalization is in fact used in some of the examples of this post.

### Application: Gradients of vector-valued functions
If $y=f(x)=(f_1(x), \dots, f_m(x))$ is a vector-valued, multivariate function, one computes the gradients $\nabla f_1(x), \dots, \nabla f_m(x)$ one at a time, each time with a suitable vector $v$.  Indeed, fix $i$ between $1$ and $m$, and define $\ell_i(y)=y_i$ as the function selecting the $i$-th coordinate of $y=(y_1, \dots, y_m)$, so that
$$f_i(x) = \ell_i(f(x))\,.$$
Noting that
$$\nabla \ell_i(y) = \begin{bmatrix}0&\cdots&0&1&0&\cdots&0\end{bmatrix}$$
where the only non-zero coordinate is in the $i$-th position, then 
$$
\begin{align}
\nabla \ell_i(f(x))J_f(x)
& =
\begin{bmatrix}0&\cdots&0&1&0&\cdots&0\end{bmatrix}
\begin{bmatrix}
    \frac{\partial f_1}{\partial x_1}(x)&\dots&\frac{\partial f_1}{\partial x_n}(x)\\
    \vdots&\ddots&\vdots\\
    \frac{\partial f_m}{\partial x_1}(x)&\dots&\frac{\partial f_m}{\partial x_n}(x)
\end{bmatrix}\\
&=
\begin{bmatrix}\frac{\partial f_i}{\partial x_1}(x)&\dots&\frac{\partial f_i}{\partial x_n}(x)\end{bmatrix}
\end{align}
$$
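
The row-selection argument above can be sketched in code as follows.  As an assumption made for this illustration, the sketch uses `torch.autograd.grad` with its `grad_outputs` argument instead of `backward`, so that no gradient accumulates in `x.grad` between iterations; the one-hot vectors are taken as the rows of an identity matrix.

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.stack([x.sum(), (x**2).sum()])  # m = 2 component functions

rows = []
for v in torch.eye(y.numel()):  # each v is a one-hot vector of length m
    # v selects the i-th row of the Jacobian, i.e. the gradient of f_i
    (row,) = torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True)
    rows.append(row)

J = torch.stack(rows)  # the gradients stacked vertically form the Jacobian
print(J)
```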


### Application: Derivatives at multiple points
To evaluate the derivative of a scalar, univariate function $f(x)$ at multiple sample points $x^{(1)}, \dots, x^{(N)}$, we create a *new*, vector-valued and multivariate function
$$F(x)=\begin{bmatrix}f\left(x^{(1)}\right)\\ \vdots \\ f\left(x^{(N)}\right)\end{bmatrix}
\qquad\textrm{where}\qquad
x\,=\,(x^{(1)}, \dots, x^{(N)})\,.$$
Thus, its Jacobian is
$$J_F(x)=\begin{bmatrix}
f'(x^{(1)})&&&&\\
&\ddots&&&\\
&&f'(x^{(j)})&&\\
&&&\ddots&\\
&&&&f'(x^{(N)})\end{bmatrix}
$$
where all off-diagonal terms are $0$.
Thus, setting $v=\begin{bmatrix}1\\\vdots\\1\end{bmatrix}$, we obtain the derivative of $f$ evaluated at the $N$ sample points $x^{(1)}\,, \dots\,, x^{(N)}$:
$$\begin{bmatrix}f'(x^{(1)})&\dots& f'(x^{(j)})&\cdots& f'(x^{(N)})\end{bmatrix}
=\left[1\,,\dots\,,1\right]
J_F(x)\,.$$
The interpretation here is that the resulting row vector contains the derivative of $f$ at the samples $x^{(1)}$ to $x^{(N)}$.
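
The diagonal structure of $J_F$ can be observed directly.  The sketch below is illustrative only: it takes $f(x)=x^2$ as an example and builds the full Jacobian with `torch.autograd.functional.jacobian`, which is something `backward` itself never does.

```python
import torch
from torch.autograd.functional import jacobian

def F(x):
    # F applies f(x) = x^2 to each sample point independently
    return x**2

x = torch.linspace(-2, 2, 5)  # N = 5 sample points
J = jacobian(F, x)            # N-by-N matrix with f'(x^(j)) on the diagonal

ones = torch.ones(5)
derivatives = ones @ J        # the row vector [1, ..., 1] times J_F

print(J)
print(derivatives)
```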

### The trick with `sum()`
The trick of adding `sum()` before calling `backward` differs from the previous application only in the order of operations performed: the summation is performed before differentiation.

From a scalar, univariate function $y=f(x)$, construct a new scalar, multivariate function 
$$G(x_1, \dots, x_N) = f(x_1) + \dots + f(x_N)$$
Using the rules of vector calculus, the gradient of $G$ at an $N$-dimensional point $(x_1, \dots, x_N)$ is
$$
\begin{align}
\nabla G(x) & = \begin{bmatrix}\frac{\partial G}{\partial x_1}(x)&\cdots&\frac{\partial G}{\partial x_N}(x)\end{bmatrix}\\
& = \begin{bmatrix}f'(x_1)&\cdots&f'(x_N)\end{bmatrix}
\end{align}
$$
The interpretation here is that the resulting row vector contains the gradient of $G$ at the $N$-dimensional point $(x_1, \dots, x_N)$.
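
The two routes can be compared side by side.  In this minimal sketch, $f(x)=x^3$ is an arbitrary example chosen here, and the names `grad_from_ones` and `grad_from_sum` are introduced only for the comparison.

```python
import torch

x = torch.linspace(-2, 2, 5, requires_grad=True)

# Route 1: differentiate first, then combine with a vector of ones
y = x**3
y.backward(torch.ones_like(y))
grad_from_ones = x.grad.clone()

# Route 2: sum first, then differentiate the scalar G
x.grad = None  # reset, since `backward` accumulates into x.grad
G = (x**3).sum()
G.backward()
grad_from_sum = x.grad.clone()

print(grad_from_ones)
print(grad_from_sum)
```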

# Computing gradients with `PyTorch`
A mathematical function is a mapping, which strictly speaking one should denote $f$.  The notation $y=f(x)$ simply suggests that the typical input is denoted $x$ and the corresponding output is denoted $y$.  Strictly speaking, $y=f(x)$ asserts the identity between a value $y$ and the evaluation of the function $f$ at the value $x$.

In `PyTorch`, the primary objects are `tensor`s, which can represent (mathematical) scalars, vectors, and matrices (as well as mathematical tensors).   The way a `PyTorch` function calculates a `tensor`, generically denoted `y` and called the output, from another `tensor`, generically denoted `x` and called the input, reflects the action of a mathematical function $f$ (or $y=f(x)$).

Conversely, a mathematical function $f$ can be evaluated at $x$ using `PyTorch`, and furthermore `PyTorch` allows one to evaluate the derivative or gradient of $f$ at $x$ via the method `backward`.  More specifically, the `backward` function performs vector-Jacobian products, where the vector corresponds to the `gradient` argument.  The key point in using `backward` is thus to understand how to choose the `gradient` argument.

The mathematical preliminaries above show how `gradient` should be chosen.  There are two key points:  
1. `gradient` has the same shape as `y`;  
1. `gradient` is populated with `0.`'s and `1.`'s, with the locations of the `1.`'s corresponding to the inputs and outputs of interest.
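
These two points can be sketched with a small helper.  The function `selector_like` below is hypothetical, introduced here only for illustration: it builds a tensor of zeros with the shape of `y` and writes `1.`'s at the requested index.

```python
import torch

def selector_like(y, index):
    """Hypothetical helper: zeros shaped like `y`, with 1.'s written at `index`."""
    v = torch.zeros_like(y)  # key point 1: same shape as y
    v[index] = 1.            # key point 2: 1.'s mark the outputs of interest
    return v

x = torch.arange(3, dtype=torch.float64, requires_grad=True)
y = torch.stack([-x**3, 5 * x])  # shape (2, 3): two components, three points

v = selector_like(y, 0)  # select the first component at all three points
y.backward(v)

print(v)
print(x.grad)  # derivative of -x^3 at the points 0, 1, 2
```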

# Examples revisited

>Note:
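The variable `v` holds the value of the `gradient` argument in all our examples.  When `y` holds a single element, `backward()` is called without it, since `v` then coincides with the value that `backward` uses implicitly.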

For the derivative of a scalar, univariate function evaluated at a single point, we choose `gradient=torch.tensor(1.)`, which is the value `backward` uses implicitly when `gradient` is omitted:

In [9]:
import torch
x = torch.tensor(1., requires_grad=True)
y = x**2
v = torch.ones_like(y)
y.backward()
print(f"Shape of x         : {tuple(x.shape)}")
print(f"Shape of y         : {tuple(y.shape)}")
print(f"gradient argument  : {v}")

Shape of x         : ()
Shape of y         : ()
gradient argument  : 1.0


Note that if `x` is cast as a `1`-dimensional `tensor`, then (in this particular example) `y` is also a `1`-dimensional `tensor`:

In [10]:
import torch
x = torch.tensor([1.], requires_grad=True)
y = x**2
v = torch.ones_like(y)
y.backward()
print(f"Shape of x         : {tuple(x.shape)}")
print(f"Shape of y         : {tuple(y.shape)}")
print(f"gradient argument  : {v}")

Shape of x         : (1,)
Shape of y         : (1,)
gradient argument  : tensor([1.])


Similarly, if `x` is cast as a `2`-dimensional `tensor`:

In [11]:
import torch
x = torch.tensor([[1.]], requires_grad=True)
y = x**2
v = torch.ones_like(y)
y.backward()
print(f"Shape of x         : {tuple(x.shape)}")
print(f"Shape of y         : {tuple(y.shape)}")
print(f"gradient argument  : {v}")

Shape of x         : (1, 1)
Shape of y         : (1, 1)
gradient argument  : tensor([[1.]])


For the derivative of a scalar, univariate function evaluated at multiple points, `gradient` contains all `1.`'s and has the same shape as `y`:

In [12]:
import torch
x = torch.linspace(-1, 1, 5, requires_grad=True)
y = x**2
v = torch.ones_like(y)
y.backward(v)
print(f"Shape of x         : {tuple(x.shape)}")
print(f"Shape of y         : {tuple(y.shape)}")
print(f"gradient argument  : {v}")

Shape of x         : (5,)
Shape of y         : (5,)
gradient argument  : tensor([1., 1., 1., 1., 1.])


Casting `x` in a different shape changes the shape of `y`, and thus of `gradient`:

In [13]:
import torch
x = torch.linspace(-2, 2, 5).view(-1,1).requires_grad_(True)
y = x**2
v = torch.ones_like(y)
y.backward(v)
print(f"Shape of x         : {tuple(x.shape)}")
print(f"Shape of y         : {tuple(y.shape)}")
print(f"gradient argument  : ")
print(v)

Shape of x         : (5, 1)
Shape of y         : (5, 1)
gradient argument  : 
tensor([[1.],
        [1.],
        [1.],
        [1.],
        [1.]])


For the derivative of a vector-valued, univariate function evaluated at a single point, the derivative of each component function is calculated one at a time, and `gradient` consists of all `0.`'s except for one `1.`, which is located at the position corresponding to the component function.  In the example below, the function is in fact *matrix-valued*; namely, we calculate the derivative of
$$f(x) = \begin{bmatrix}1&x\\x^2&x^3\\x^4&x^5\end{bmatrix}\qquad \textrm{at}\quad x\,=\,1\,.$$

In [19]:
# collapse-show
import torch
import itertools
x = torch.tensor(1., requires_grad=True)
y = torch.stack([x**i for i in range(6)]).view(3,2)
ranges = [range(_) for _ in y.shape]

print("x:")
print(x.data)
print("\ny:")
print(y.data)

derivatives = torch.zeros_like(y)

for i, j in itertools.product(*ranges):
    v = torch.zeros_like(y)
    v[i,j] = 1.
    if x.grad is not None: x.grad.zero_()
        
    y.backward(v, retain_graph=True)
    derivatives[i,j] = x.grad.item()
print("\nDerivatives:")
print(derivatives)    

x:
tensor(1.)

y:
tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])

Derivatives:
tensor([[0., 1.],
        [2., 3.],
        [4., 5.]])


>Note:
The use of `for` loops can be avoided.
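
One possible way to avoid writing the loop by hand is sketched below.  It relies on `torch.autograd.functional.jacobian`, which is an assumption of this sketch rather than the method used in this post; by default that function may still iterate over the outputs internally, so the loop is hidden rather than eliminated.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Same matrix-valued function as above: entries 1, x, x^2, ..., x^5
    return torch.stack([x**i for i in range(6)]).view(3, 2)

x = torch.tensor(1.)
# For a scalar input, the result has the shape of the output, here (3, 2)
derivatives = jacobian(f, x)

print(derivatives)
```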

For the gradient of a scalar, multivariate function evaluated at a single point, `gradient=torch.tensor(1.)`: 

In [15]:
import torch
x = torch.tensor([-1., 2.], requires_grad=True)
w = torch.tensor([3., 5.])
y = (x*x*w).sum()
v = torch.ones_like(y)
y.backward()
print(f"Shape of x         : {tuple(x.shape)}")
print(f"Shape of y         : {tuple(y.shape)}")
print(f"gradient argument  : {v}")

Shape of x         : (2,)
Shape of y         : ()
gradient argument  : 1.0


In the following example, the input `x` is a `(3,2)`-tensor:

In [16]:
x = torch.arange(6, dtype=float).view(3,2).requires_grad_(True)
y = (x**2).sum()
v = torch.ones_like(y)
y.backward(v)

print(f"Shape of x: {tuple(x.shape)}")
print(f"Shape of y: {tuple(y.shape)}")
print(f"gradient argument: {v}")
print("x:")
print(x.data)
print("x.grad:")
print(x.grad.data)

Shape of x: (3, 2)
Shape of y: ()
gradient argument: 1.0
x:
tensor([[0., 1.],
        [2., 3.],
        [4., 5.]], dtype=torch.float64)
x.grad:
tensor([[ 0.,  2.],
        [ 4.,  6.],
        [ 8., 10.]], dtype=torch.float64)
