$f_2(f_{1}(x))$

# Autograd: automatic differentiation

- `torch.Tensor` is the central class of the package.
- If you set its attribute `.requires_grad` to `True`, it starts to track all operations on it.
- When you finish your computation, you can call `.backward()` and have all the gradients computed automatically.
- The gradient for this tensor will be accumulated into its `.grad` attribute.

- To stop a tensor from tracking history, you can call `.detach()`, which detaches it from the computation history and prevents future computation from being tracked.
- To prevent tracking history (and using memory), you can also wrap the code block in `with torch.no_grad():`.
- This can be particularly helpful when evaluating a model because the model may have trainable parameters with `requires_grad=True`, but for which we don’t need the gradients.

- One more class is very important for the autograd implementation: `Function`.

- `Tensor` and `Function` are interconnected and build up an acyclic graph that encodes a complete history of computation.

- Each tensor has a `.grad_fn` attribute that references the `Function` that created the Tensor (except for Tensors created by the user, whose `grad_fn` is `None`).

- If you want to compute the derivatives, you can call `.backward()` on a `Tensor`.

- If the `Tensor` is a scalar (i.e. it holds a single element), you don't need to pass any arguments to `backward()`. If it has more elements, you need to pass a `gradient` argument that is a tensor of matching shape.
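
The accumulation into `.grad` mentioned above is easy to overlook, so here is a minimal sketch of it. The tensor values and the names `first` and `second` are chosen only for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(x^2))/dx = 2x
(x ** 2).sum().backward()
first = x.grad.clone()
print(first)    # tensor([2., 4.])

# Second backward pass on a freshly built graph:
# the new gradient is added to the existing x.grad, not written over it
(x ** 2).sum().backward()
second = x.grad.clone()
print(second)   # tensor([4., 8.])

# Reset the accumulated gradient in place before the next pass
x.grad.zero_()
print(x.grad)   # tensor([0., 0.])
```

This is why training loops usually clear gradients between steps.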

In [1]:
import torch
torch.__version__

'1.6.0'

Create a tensor and set `requires_grad=True` to track computation with it:

In [2]:
x = torch.ones(2, 2, requires_grad=True)
x

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)

Do some operations:

In [3]:
y = x + 2
y

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)

`y` was created as a result of an operation, so it has a `grad_fn`.

In [4]:
y.grad_fn

<AddBackward0 at 0x7fedde3f7d60>
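
By contrast, a tensor created directly by the user has no `grad_fn`. The short sketch below rebuilds `x` and `y` as in the cells above and uses the `is_leaf` attribute to tell the two kinds of tensor apart.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)  # created by the user
y = x + 2                                 # created by an operation

print(x.grad_fn)   # None
print(x.is_leaf)   # True
print(y.is_leaf)   # False
```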

Do more operations on `y`:

In [7]:
z = y * y * 3
z

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>)

In [8]:
out = z.mean()
out

tensor(27., grad_fn=<MeanBackward0>)

`.requires_grad_()` changes an existing Tensor's `requires_grad` flag in-place.

In [9]:
a = torch.randn(2,2)
a = ((a*3)/(a-1))
a

tensor([[-0.3393, -6.3876],
        [ 1.3550,  1.7123]])

In [10]:
a.requires_grad

False

In [11]:
a.requires_grad_(True)
a.requires_grad

True

In [12]:
b = (a*a).sum()
b

tensor(45.6847, grad_fn=<SumBackward0>)

# Calculating gradient

In [14]:
out

tensor(27., grad_fn=<MeanBackward0>)

In [15]:
out.backward()

In [16]:
x.grad

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])

$$
o = \frac{1}{4} \sum_{i} z_{i}
$$

$$
z_{i} = 3(x_{i} + 2)^2
$$

$$
\left. z_{i} \right|_{x_{i}=1} = 27
$$

$$
\frac{\partial o}{\partial x_{i}} = \frac{1}{4} \cdot 6\left( x_{i} + 2 \right) = \frac{3}{2}\left( x_{i} + 2 \right)
$$

$$
\left. \frac{\partial o}{\partial x_{i}} \right|_{x_{i}=1} = \frac{9}{2} = 4.5
$$
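
The derivation can be checked numerically. The sketch below rebuilds the same computation from scratch and compares the autograd result with the closed-form derivative $\frac{3}{2}(x_i + 2)$. The name `analytic` is introduced here only for the comparison.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()   # o = 1/4 * sum_i 3 (x_i + 2)^2
out.backward()

# Closed-form derivative from the equations above
analytic = 1.5 * (x.detach() + 2)

print(x.grad)
print(torch.allclose(x.grad, analytic))   # True
```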

$$
x_{1} = 1
$$

In [18]:
x = torch.tensor([2.2], requires_grad=True)
x

tensor([2.2000], requires_grad=True)

# Vector valued function

Mathematically, if you have a vector valued function $\mathbf{y}=f(\mathbf{x})$,
then the gradient of $\mathbf{y}$ with respect to $\mathbf{x}$
is a Jacobian matrix:
$$
\begin{split} J = 
\left(
\begin{array}{ccc}
\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
\vdots & \ddots & \vdots \\
\frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
\end{array}\right)\end{split}
$$

Generally speaking, `torch.autograd` is an engine for computing vector-Jacobian products. That is, given any vector $\mathbf{v} = (v_1, v_2, \ldots, v_m)^{\mathsf{T}}$, compute the product
$\mathbf{v}^{\mathsf{T}} \cdot \mathbf{J}$.

If $\mathbf{v}$ happens to be the gradient of a scalar function $l=g(\mathbf{y})$, that is,
$\mathbf{v}=\left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}$, then by the chain rule,
the vector-Jacobian product is the gradient of $l$ with respect to $\mathbf{x}$:

$$
\begin{split}
\mathbf{J}^{T}\cdot \mathbf{v} = \left(\begin{array}{ccc}
 \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
 \vdots & \ddots & \vdots\\
 \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
 \end{array}\right)\left(\begin{array}{c}
 \frac{\partial l}{\partial y_{1}}\\
 \vdots\\
 \frac{\partial l}{\partial y_{m}}
 \end{array}\right)=\left(\begin{array}{c}
 \frac{\partial l}{\partial x_{1}}\\
 \vdots\\
 \frac{\partial l}{\partial x_{n}}
 \end{array}\right)\end{split}
$$

Note that $\mathbf{v}^{T} \cdot \mathbf{J}$ gives a row vector which can be treated as a column vector by taking $\mathbf{J}^{T} \cdot \mathbf{v}$.

This characteristic of vector-Jacobian product makes it very convenient to feed external gradients into a model that has non-scalar output.

Now let’s take a look at an example of vector-Jacobian product:

In [None]:
x = torch.randn(3, requires_grad=True)

In [None]:
y = x*2
while y.detach().norm() < 1000:
    y = y * 2
y

Now in this case $y$ is no longer a scalar. `torch.autograd` cannot compute the full Jacobian directly, but if we just want the vector-Jacobian product, we can simply pass the vector to `backward` as an argument:

In [None]:
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)

In [None]:
x.grad
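
To see that `backward(v)` really yields $\mathbf{v}^{T} \cdot \mathbf{J}$, the product can be compared with an explicitly assembled Jacobian on a small deterministic input. This is only a sketch: the function `f` and the input values are hypothetical, and the helper `torch.autograd.functional.jacobian` is assumed to be available in the installed PyTorch version. Building the full Jacobian this way is only practical for tiny inputs.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Hypothetical elementwise function, so J is diagonal with entries 2 * x_i
    return x ** 2

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.0001])

# Vector-Jacobian product through autograd
y = f(x)
y.backward(v)

# Explicit Jacobian (shape 3 x 3) and the same product computed by hand
J = jacobian(f, x.detach())
vjp = v @ J

print(x.grad)
print(torch.allclose(x.grad, vjp))   # True
```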

# Stop autograd from tracking history on Tensors.

In [None]:
x.requires_grad

In [None]:
(x**2).requires_grad

In [None]:
with torch.no_grad():
    print( (x**2).requires_grad )

In [None]:
x.requires_grad
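
The same pattern applies when evaluating a model whose parameters have `requires_grad=True`. A minimal sketch, assuming a hypothetical one-layer model and random inputs:

```python
import torch

model = torch.nn.Linear(3, 1)   # hypothetical tiny model
inputs = torch.randn(4, 3)      # hypothetical batch of 4 samples

with torch.no_grad():
    preds = model(inputs)       # no graph is recorded for this forward pass

print(preds.requires_grad)          # False
print(model.weight.requires_grad)   # True: the parameters themselves are unchanged
```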

Alternatively, use `.detach()` to get a new Tensor with the same content that does not require gradients:

In [None]:
y = x.detach()

In [None]:
y.requires_grad

In [None]:
x.eq(y)

In [None]:
x, y
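
One point worth keeping in mind: the tensor returned by `.detach()` shares its underlying storage with the original, so an in-place change to one is visible in the other. The sketch below uses illustrative values. Adding `.clone()` gives an independent copy (the name `z` is introduced here only for that purpose).

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x.detach()          # same storage, no gradient tracking
z = x.detach().clone()  # independent copy

y[0] = 10.0             # in-place write through the detached tensor

print(x)   # first element is now 10.0
print(z)   # unchanged: tensor([1., 2., 3.])
```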