<a href="https://colab.research.google.com/github/fanurs/pytorch-notes/blob/main/notes/tut03_automatic_differentiation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automatic differentiation

There are at least three ways to evaluate the derivative of a function: symbolic differentiation, numerical differentiation, and automatic differentiation.

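Symbolic differentiation is what we usually learn first in calculus. However, for most applications it is difficult to automate efficiently on a computer, because the function must be available as a closed-form expression and the resulting derivative expressions can grow very large.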

Numerical differentiation uses the method of finite differences to approximate a derivative,
$$ \frac{df(x)}{dx} \approx \frac{f(x + \delta x) - f(x)}{\delta x} \ . $$
But this approach suffers from truncation error when $\delta x$ is too large and from round-off error when $\delta x$ is too small.
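To make the step-size problem concrete, here is a small sketch in plain Python (no PyTorch needed). The helper name `forward_difference` and the test function $\sin(x)$ at $x = 1$ are choices made for this illustration only. It compares the finite-difference estimate against the exact derivative $\cos(1)$ for a large, a moderate, and a tiny step.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) with the forward finite difference of step h."""
    return (f(x + h) - f(x)) / h

x0 = 1.0
exact = math.cos(x0)  # exact derivative of sin at x0

# Absolute error for a large, a moderate, and a tiny step size.
errors = {
    h: abs(forward_difference(math.sin, x0, h) - exact)
    for h in (1e-1, 1e-5, 1e-15)
}
for h, err in errors.items():
    print(f"h = {h:.0e}   error = {err:.2e}")
```

The moderate step should give the smallest error: with the large step the approximation itself is poor, and with the tiny step the subtraction $f(x + \delta x) - f(x)$ loses almost all of its significant digits.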

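Automatic differentiation (AD) exploits the fact that any function evaluated by a computer program is a composition of elementary operations (addition, multiplication, `sin`, `exp`, and so on) whose derivatives are known. Applying the chain rule to this composition gives the derivative of the whole function, exact up to floating-point round-off.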

Compared with the other two approaches, AD is usually faster and numerically more stable.
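The idea can be illustrated with a toy implementation. The sketch below is not how `torch.autograd` works internally (PyTorch uses reverse-mode AD, while this is the simpler forward mode), and the names `Dual` and `sin` are hypothetical helpers written for this note. Each number carries its value together with its derivative, and every elementary operation updates both using the chain rule.

```python
import math

class Dual:
    """A value paired with its derivative with respect to the input variable."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__

def sin(u):
    # chain rule: d/dx sin(u) = cos(u) * du/dx
    return Dual(math.sin(u.val), math.cos(u.val) * u.der)

x = Dual(1.5, 1.0)   # seed the input with dx/dx = 1
h = sin(x * x)       # h(x) = sin(x^2)
print(h.val, h.der)  # value and derivative at x = 1.5
```

No step size appears anywhere, so the derivative `h.der` agrees with the analytic result $2x\cos(x^2)$ up to floating-point round-off.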

Most machine learning algorithms boil down to some kind of optimization problem, which is why a reliable way to compute gradients is so useful. PyTorch offers an automatic differentiation package called [`torch.autograd`](https://pytorch.org/docs/stable/autograd.html) to handle this task.

# Differentiating a single-variable function

In [None]:
import matplotlib.pyplot as plt
import torch as th

## Setting up the tensors

Let's start by using PyTorch to solve a simple example. Given $g(x) = x^2$, evaluate
$$ \frac{d}{dx}g(x) $$
over the domain $x\in[-3, 3]$.

In [None]:
x = th.linspace(-3, 3, 301)
y = x**2
plt.plot(x, y)
plt.show()

So far there is no differentiation involved. We are just plotting out the function $g(x) = x^2$ over the interval $x\in[-3, 3]$ with 301-point sampling:
$$ -3.00, -2.98, -2.96, \ldots, 2.98, 3.00 \ . $$

To take the derivative, we first have to set the attribute `torch.Tensor.requires_grad` to `True`. Since `y` was computed before the attribute was set, we must recompute `y` so that the operation is recorded.

In [None]:
x.requires_grad = True
y = x**2
x[0]

Notice that the entry `x[0]` is now displayed with an additional `grad_fn` property, which records the operation that produced it. This is where `autograd` will do its magic. But for now, it means that if we try to plot by typing `plt.plot(x, y)`, an error will be raised, because a tensor that requires gradients cannot be converted directly to a NumPy array. For `plot()` to work properly, we have to strip the gradient tracking from the tensors. PyTorch provides the `detach()` method for that.

In [None]:
plt.plot(x.detach(), y.detach())
plt.show()
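To see what `detach()` changes, the following sketch inspects the relevant attributes directly. It rebuilds `x` and `y` so that it can run on its own; the names `y_plain` and `conversion_failed` are introduced here only for illustration.

```python
import torch as th

x = th.linspace(-3, 3, 301)
x.requires_grad = True
y = x**2

print(x.requires_grad, x.grad_fn)  # x was created by us: tracked, no grad_fn
print(y.requires_grad, y.grad_fn)  # y remembers the operation that produced it

y_plain = y.detach()               # same data, but no gradient tracking
print(y_plain.requires_grad, y_plain.grad_fn)

# The detached tensor converts to NumPy; the tracked one refuses to.
y_plain.numpy()
conversion_failed = False
try:
    y.numpy()
except RuntimeError as err:
    conversion_failed = True
    print("RuntimeError:", err)
```

Note that `detach()` does not copy the data: `y_plain` shares its memory with `y`, so it is cheap to call just for plotting.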

## Taking derivative as gradient

So how can we actually take the derivative of $g(x)$?

In its basic usage, PyTorch's autograd computes the gradient of a scalar-valued function,
$$ \nabla f(x_1, x_2, \ldots, x_n) =
\begin{bmatrix}
\partial_1 f \\
\partial_2 f \\
\vdots \\
\partial_n f \\
\end{bmatrix} \ .
$$

On the other hand, we have $g(x)$ evaluated at 301 discrete points,
$$ x_k = -3 + 0.02 \, (k - 1) $$
for all $k = 1, 2, \ldots, 301$. And we want to know its derivative $g'(x)$ at the same 301 discrete points,
$$ g'(x_1), g'(x_2), \ldots, g'(x_{301}) \ . $$

The trick is to construct the sum
$$ f(x_1, x_2, \ldots, x_{301}) \equiv g(x_1) + g(x_2) + \cdots + g(x_{301}) \ . $$
Then the gradient of this sum would give us
$$ \nabla f(x_1, x_2, \ldots, x_{301}) =
\begin{bmatrix}
\partial_1 f \\
\partial_2 f \\
\vdots \\
\partial_{301} f \\
\end{bmatrix} =
\begin{bmatrix}
g'(x_1) \\
g'(x_2) \\
\vdots \\
g'(x_{301}) \\
\end{bmatrix} \ .
$$

In other words, instead of treating $g(x)$ as a single-variable function that depends on $x$, we may view $x_1, x_2, \ldots, x_{301}$ as independent variables of the sum $f$. This is why we set `requires_grad = True` on the tensor `x`: its entries are the variables with respect to which the gradient is taken.
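The reason the trick works is that $g(x_i)$ depends only on $x_i$, so every cross derivative $\partial g(x_i) / \partial x_j$ with $i \neq j$ vanishes. As a sketch, this can be checked with `torch.autograd.functional.jacobian`, using 5 points instead of 301 to keep the matrix readable. The names `g`, `J`, and `x_tracked` are chosen for this illustration.

```python
import torch as th
from torch.autograd.functional import jacobian

def g(x):
    return x**2

x = th.linspace(-3, 3, 5)   # 5 sample points instead of 301
J = jacobian(g, x)          # J[i, j] = d g(x_i) / d x_j
print(J)                    # only the diagonal is non-zero

# Gradient of the sum, computed with the trick described above.
x_tracked = x.clone().requires_grad_(True)
g(x_tracked).sum().backward()
print(x_tracked.grad)       # matches the diagonal of J
```

Summing the outputs adds up the rows of the Jacobian, and because the off-diagonal entries are zero, the result is exactly the vector of pointwise derivatives.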

Following the trick, we first compute the sum of all function values evaluated at the 301 discrete points:

In [None]:
total = y.sum()
total

Next, we calculate the gradient using `torch.Tensor.backward()`. The name comes from "backpropagation" (backward propagation), the process by which gradients are usually computed in a neural network.

In [None]:
total.backward()

The `backward()` function does not actually return anything. Instead, it traces back through the operations that were used to build up `total` (here the square and the sum), and computes the partial derivatives of `total` with respect to each tensor that we created with `requires_grad = True` (here only `x`). The result is stored in the `grad` attribute of that tensor:

In [None]:
plt.plot(x.detach(), x.grad.detach())
plt.show()

This is the straight line $2x$, which is exactly what we expect for the derivative of $g(x) = x^2$.
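Rather than judging the plot by eye, the result can also be checked numerically. The sketch below repeats the steps in one self-contained cell; the variable names `result`, `first_pass_ok`, and `second_pass_ok` are chosen for this illustration. It also shows one behaviour worth knowing: a second backward pass adds to `x.grad` instead of overwriting it.

```python
import torch as th

x = th.linspace(-3, 3, 301, requires_grad=True)
y = x**2
result = y.sum().backward()

print(result)                # backward() returns None
print(x.is_leaf, y.is_leaf)  # the gradient is stored on the leaf tensor x

first_pass_ok = th.allclose(x.grad, 2 * x.detach())
print(first_pass_ok)

# A second backward pass accumulates into x.grad.
(x**2).sum().backward()
second_pass_ok = th.allclose(x.grad, 4 * x.detach())
print(second_pass_ok)

x.grad = None                # reset before computing an unrelated gradient
```

Because of this accumulation, the gradient should be reset (or the tensor recreated) before differentiating a different function of the same `x`.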

Let us quickly apply all these steps to a more interesting function, $h(x) = \sin(x^2)$:

In [None]:
x = th.linspace(-3, 3, 100, requires_grad=True)
y = th.sin(x**2)
y.sum().backward()
plt.plot(x.detach(), y.detach(), label='function')
plt.plot(x.detach(), x.grad.detach(), label='derivative')
plt.legend()
plt.show()
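As a sanity check, the gradient from autograd can be compared with the derivative obtained by hand from the chain rule, $h'(x) = 2x\cos(x^2)$. This is a minimal sketch; the names `analytic` and `max_abs_diff` are introduced here for illustration.

```python
import torch as th

x = th.linspace(-3, 3, 100, requires_grad=True)
y = th.sin(x**2)
y.sum().backward()

# Derivative worked out by hand; no_grad() keeps it out of the autograd graph.
with th.no_grad():
    analytic = 2 * x * th.cos(x**2)

max_abs_diff = (x.grad - analytic).abs().max().item()
print(max_abs_diff)
```

The difference should be at the level of single-precision round-off, since the tensors here use PyTorch's default `float32` data type.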

## Some remarks

AD is all about tracing back how a quantity is calculated from a chain of elementary operations and applying the chain rule. This is why PyTorch's autograd records the computation history of every tensor computed from a tensor with `requires_grad = True`. For a tensor we created ourselves (a leaf tensor), we can turn off this tracing by setting `requires_grad = False`. For a tensor that is the result of a computation, `detach()` returns a tensor with the same data but without the computation history.
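One caveat when switching off tracing is that the `requires_grad` flag can only be changed on a leaf tensor, that is, one we created ourselves rather than one produced by an operation. The sketch below shows what happens for a computed tensor and how `detach()` is used instead; the names `flag_error` and `y_detached` are chosen for this illustration.

```python
import torch as th

x = th.tensor([1.1, 1.2, 1.3], requires_grad=True)
y = x**2                     # non-leaf tensor with a computation history

flag_error = None
try:
    y.requires_grad = False  # not allowed on a non-leaf tensor
except RuntimeError as err:
    flag_error = err
print("RuntimeError:", flag_error)

# detach() gives a view of the same data without the history.
y_detached = y.detach()
print(y_detached.requires_grad, y_detached.grad_fn)
print(y.requires_grad, y.grad_fn is not None)  # y itself is unchanged
```

Since `detach()` leaves the original tensor untouched, `y` can still be used for `backward()` afterwards.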

To temporarily avoid tracing computation history, we can use `torch.no_grad()` in a context manager:

In [None]:
x = th.tensor([1.1, 1.2, 1.3], requires_grad=True)
a = x**2
with th.no_grad():
    b = 2 * a
c = 2 * a
display(a, b, c)
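The difference between `b` and `c` is easiest to see by printing the tracking attributes explicitly. The following sketch repeats the cell with `print` in place of `display`, so that it also runs outside a notebook.

```python
import torch as th

x = th.tensor([1.1, 1.2, 1.3], requires_grad=True)
a = x**2
with th.no_grad():
    b = 2 * a  # computed inside no_grad(): not tracked
c = 2 * a      # computed outside: tracked as usual

for name, t in [("a", a), ("b", b), ("c", c)]:
    print(name, t.requires_grad, t.grad_fn)
```

The tensors `b` and `c` hold the same values, but only `c` carries a `grad_fn`. The history of `a` is unaffected by the `no_grad()` block, which is why tracing resumes as soon as the block ends.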
