<a href="https://colab.research.google.com/github/center4ml/Workshops/blob/2023_2/Day_2/10_computational_graph_part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In [1]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import math


# First example: a recap

So let's see how it works in practice in PyTorch. We don't need a neural network to see it.

Let's have $f(x,y)=x^3+y^2$ as an example.
Then you can calculate by hand:

$\frac{\partial f(x,y)}{\partial x} = 3x^2, \frac{\partial f(x,y)}{\partial y} = 2y$

Concretely, as an example with which we will work some more:

$\frac{\partial f(x,y)}{\partial x} \Big|_{x=2} = 3 \cdot 2^2 = 12$

$\frac{\partial f(x,y)}{\partial y} \Big|_{y=4} = 2 \cdot 4 = 8$
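The hand-computed values can be sanity-checked numerically before bringing in any autograd machinery. The snippet below is a minimal sketch in plain Python that approximates both partial derivatives with a central finite difference. The helper name `central_difference` and the step size `h` are assumptions made for this illustration and are not part of the original notebook.

```python
def f(x, y):
    return x ** 3 + y ** 2


def central_difference(func, point, h=1e-5):
    """Approximate d func / d point with a symmetric finite difference (illustrative helper)."""
    return (func(point + h) - func(point - h)) / (2 * h)


# Vary one argument at a time and keep the other one fixed.
df_dx = central_difference(lambda x: f(x, 4.0), 2.0)
df_dy = central_difference(lambda y: f(2.0, y), 4.0)

print(round(df_dx, 4), round(df_dy, 4))  # 12.0 8.0
```

The approximation agrees with $3 \cdot 2^2 = 12$ and $2 \cdot 4 = 8$ up to the finite-difference error.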

PyTorch computes all of this automatically: it builds an expression tree (the computational graph) as the expression is evaluated. Here is an example in Python code.

In [2]:
x = torch.tensor([[2., 3., 2.], [2., 3., 3.]], requires_grad=True)  # we initiate the points with values
y = torch.tensor([[4., 5., 5.], [4., 5., 5.]], requires_grad=True)  # equal to where we want to compute
                                                                    # partial derivatives
pow3 = x ** 3
pow2 = y ** 2
f = pow3 + pow2                              # then we build the function

print('f:', f.grad_fn)
print('pow3:', pow3.grad_fn)
print('pow2:', pow2.grad_fn)

f: <AddBackward0 object at 0x7f3448dd3ee0>
pow3: <PowBackward0 object at 0x7f3500fbd540>
pow2: <PowBackward0 object at 0x7f3448dd3ee0>


The expression tree (which is used for computing gradients) gets built as we go, with leaves being the individual tensors we start from.
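To make the tree structure more tangible, here is a small sketch that inspects it directly. It uses shorter 1-D tensors than the cell above (an assumption made only to keep the output small) and relies on `Tensor.is_leaf`, `Tensor.grad_fn`, and the `next_functions` attribute of graph nodes.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0], requires_grad=True)
f = x ** 3 + y ** 2

# Tensors created directly by the user are leaves and have no grad_fn.
print(x.is_leaf, x.grad_fn)        # True None

# Results of operations are not leaves; they remember the operation that produced them.
root_name = f.grad_fn.name()
print(f.is_leaf, root_name)        # False AddBackward0

# next_functions links a node to the nodes that produced its inputs.
child_names = [fn.name() for fn, _ in f.grad_fn.next_functions]
print(child_names)                 # ['PowBackward0', 'PowBackward0']
```

Walking `next_functions` further down eventually reaches the accumulation nodes attached to the leaves `x` and `y`.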

To calculate gradients numerically at the current tensor values you can call `backward()`.

Let's see if we get

$\frac{\partial f(x,y)}{\partial x} \Big|_{x=2} = 12$

$\frac{\partial f(x,y)}{\partial y} \Big|_{y=4} = 8$

as expected. The gradients of $f(x,y)$ with respect to the variables $x$ and $y$ are written to the `grad` attribute of those variables (`x.grad`, `y.grad`).

In [3]:
print('f ==', f)
# f.backward() would raise the error "grad can be implicitly created only for scalar outputs"

f == tensor([[24., 52., 33.],
        [24., 52., 52.]], grad_fn=<AddBackward0>)


Without arguments, `backward()` can only be called on a scalar tensor, and `f` is not a scalar. The reason is that the typical target of `backward()` is a loss, which is always a scalar. Recall that in backpropagation the weights of a neural network are adjusted according to the gradients of the loss function (whose result is a scalar) with respect to those weights.

So to calculate the gradients we need a scalar on which to call `backward()`. And `f.sum()` is a scalar!

Let's try to call `backward()` on the scalar `f.sum()`.

In [4]:
f.sum().backward()
x.grad

tensor([[12., 27., 12.],
        [12., 27., 27.]])

In [5]:
y.grad

tensor([[ 8., 10., 10.],
        [ 8., 10., 10.]])

OK, the results are as expected (i.e. as calculated by hand earlier), separately for every coordinate.
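Rather than comparing the printed numbers by eye, the agreement with the hand-derived formulas $3x^2$ and $2y$ can be checked programmatically. The sketch below rebuilds the same tensors so that it runs on its own; the names `expected_dx` and `expected_dy` are introduced here for illustration.

```python
import torch

x = torch.tensor([[2., 3., 2.], [2., 3., 3.]], requires_grad=True)
y = torch.tensor([[4., 5., 5.], [4., 5., 5.]], requires_grad=True)
f = x ** 3 + y ** 2
f.sum().backward()

# Analytic partial derivatives, evaluated coordinate-wise on plain (detached) values.
expected_dx = 3 * x.detach() ** 2
expected_dy = 2 * y.detach()

print(torch.allclose(x.grad, expected_dx))  # True
print(torch.allclose(y.grad, expected_dy))  # True
```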

### Explanation of this `sum()` trick


The operations (powers and addition) work coordinate-wise. So for these tensors of order 2 with shape 2 by 3, `f` is in fact a 2 by 3 matrix of functions $f_{ij}$, each defined as $f_{ij} = x_{ij}^3 + y_{ij}^2$ and independent of the **other** coordinates.

Consequently, the partial derivative of $f_{ij}$ with respect to, say, $x_{ij}$ is equal to the partial derivative of `f.sum()` with respect to $x_{ij}$, because every other term $f_{kl}$ with $(k,l) \neq (i,j)$ does not depend on $x_{ij}$ and contributes zero. Mathematically: $\frac{\partial f_{ij}}{\partial x_{ij}} = \frac{\partial \sum_{kl}f_{kl}}{\partial x_{ij}}$

**Note that this works regardless of the order of the tensors in question and of their particular dimensions.**
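As a quick illustration of this remark, the sketch below repeats the experiment on a random tensor of order 3 (the shape `2 x 3 x 4` and the seed are arbitrary choices made for this example). It also shows an equivalent formulation: passing a tensor of ones as the `gradient` argument of `backward()` weights every $f_{kl}$ by 1, which is the same as differentiating the sum.

```python
import torch

torch.manual_seed(0)
x0 = torch.rand(2, 3, 4)  # an order-3 tensor; the shape is arbitrary

# Variant A: reduce to a scalar with sum() and call backward() on it.
xa = x0.clone().requires_grad_(True)
(xa ** 3).sum().backward()

# Variant B: call backward() on the non-scalar tensor with an explicit gradient of ones.
xb = x0.clone().requires_grad_(True)
fb = xb ** 3
fb.backward(gradient=torch.ones_like(fb))

analytic = 3 * x0 ** 2
print(torch.allclose(xa.grad, analytic))  # True
print(torch.allclose(xb.grad, analytic))  # True
```

Both variants only give per-coordinate derivatives because the operations are coordinate-wise; for a function that mixes coordinates, the result would be a sum of contributions instead.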


# Another way to get the gradients

is to call the `torch.autograd.grad` function.
It ***returns*** the gradients instead of storing them in the graph's leaves (as `torch.autograd.backward` would do).

In [6]:
# let us reinitialize the variables
x = torch.tensor([[2., 3., 2.], [2., 3., 3.]], requires_grad=True)
y = torch.tensor([[4., 5., 5.], [4., 5., 5.]], requires_grad=True)
f = x**3 + y**2

In [7]:
# f_x = torch.autograd.grad(f, x, grad_outputs=torch.ones_like(x))  # frees the graph after the computation
# f_x = torch.autograd.grad(f, x, grad_outputs=torch.ones_like(x), retain_graph=True)  # retain_graph: the graph used to compute the gradient is kept, so f can be differentiated again

f_x = torch.autograd.grad(f, x, grad_outputs=torch.ones_like(x), create_graph=True)  # create_graph: the graph of the derivative is constructed as well, so higher-order derivatives can be computed

print(f"f_x: {f_x}")


f_x: (tensor([[12., 27., 12.],
        [12., 27., 27.]], grad_fn=<MulBackward0>),)


Notice that `x.grad` is still empty (`None`), because `torch.autograd.grad` does not write to the leaves.

In [8]:
# f.sum().backward()
x.grad

### Higher order derivatives

As one might expect, we have to differentiate one more time.


In [9]:
f_xx = torch.autograd.grad(f_x, x, grad_outputs=torch.ones_like(x))
# f_xx = torch.autograd.grad(f_x, x, grad_outputs=torch.ones_like(x), retain_graph=True)
# f_xx = torch.autograd.grad(f_x, x, grad_outputs=torch.ones_like(x), create_graph=True)  # allows differentiating f_xx again to compute even higher-order derivatives

f_xx

(tensor([[12., 18., 12.],
         [12., 18., 18.]]),)
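The printed values match the analytic second derivative $\frac{\partial^2 f}{\partial x^2} = 6x$. The sketch below checks this and goes one step further to the third derivative, which should be the constant 6. It is written under two simplifying assumptions: only the $x^3$ term is kept (the $y^2$ term does not depend on $x$), and the one-element tuples returned by `torch.autograd.grad` are unpacked into plain tensors.

```python
import torch

x = torch.tensor([[2., 3., 2.], [2., 3., 3.]], requires_grad=True)
f = x ** 3
ones = torch.ones_like(x)

# create_graph=True keeps each derivative differentiable for the next step.
(f_x,) = torch.autograd.grad(f, x, grad_outputs=ones, create_graph=True)
(f_xx,) = torch.autograd.grad(f_x, x, grad_outputs=ones, create_graph=True)
# The last call does not need create_graph, since we stop differentiating here.
(f_xxx,) = torch.autograd.grad(f_xx, x, grad_outputs=ones)

print(torch.allclose(f_x, 3 * x ** 2))                  # True
print(torch.allclose(f_xx, 6 * x))                      # True
print(torch.allclose(f_xxx, torch.full_like(x, 6.0)))   # True
```

As with the `sum()` trick, passing ones as `grad_outputs` yields the coordinate-wise second derivatives only because every operation here acts on each coordinate independently.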

### Conclusions:

We need to create (and keep) the computational graph to compute higher-order derivatives.

`create_graph (bool, optional)` – If True, graph of the derivative will be constructed, allowing to compute higher order derivative products. Defaults to False.


`retain_graph (bool, optional)` – If False, the graph used to compute the grad will be freed. Note that in nearly all cases setting this option to True is not needed and often can be worked around in a much more efficient way. Defaults to the value of create_graph.
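The effect of both flags can be observed directly. The following sketch (with a small 1-D tensor chosen for brevity, and flag variable names invented for this example) shows that a gradient computed without `create_graph` cannot be differentiated again, and that a graph can be traversed a second time only if the first traversal used `retain_graph=True`.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
ones = torch.ones_like(x)

# 1) create_graph=False (the default): the returned gradient is a plain tensor.
f = x ** 3
(g,) = torch.autograd.grad(f, x, grad_outputs=ones)
print(g.requires_grad)  # False
try:
    torch.autograd.grad(g, x, grad_outputs=ones)
    second_derivative_failed = False
except RuntimeError:
    # g is not connected to x by any graph, so it cannot be differentiated.
    second_derivative_failed = True
print(second_derivative_failed)  # True

# 2) retain_graph=True keeps the graph of f alive for one more traversal.
f = x ** 3
torch.autograd.grad(f, x, grad_outputs=ones, retain_graph=True)
(g_again,) = torch.autograd.grad(f, x, grad_outputs=ones)  # works, then frees the graph
try:
    torch.autograd.grad(f, x, grad_outputs=ones)
    third_call_failed = False
except RuntimeError:
    # The saved intermediate values were freed by the previous call.
    third_call_failed = True
print(third_call_failed)  # True
```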

#### References:

<https://pytorch.org/docs/1.5.0/autograd.html#torch.autograd.grad>

https://stackoverflow.com/questions/69148622/difference-between-autograd-grad-and-autograd-backward

https://discuss.pytorch.org/t/when-do-i-use-create-graph-in-autograd-grad/32853

https://discuss.pytorch.org/t/whats-the-difference-between-torch-autograd-grad-and-backward/94638

https://stackoverflow.com/questions/46774641/what-does-the-parameter-retain-graph-mean-in-the-variables-backward-method

https://pytorch.org/docs/stable/generated/torch.Tensor.is_leaf.html#torch.Tensor.is_leaf