## Autograd in PyTorch

In [1]:
import torch

#### Some things about Automatic Differentiation
Automatic differentiation is a clever yet simple technique for calculating the gradients of a function w.r.t. its inputs.
* It is neither numerical differentiation nor symbolic differentiation.
* There are two modes: forward and backward (reverse).
* Forward mode propagates derivatives bottom-up, from the inputs towards the output: each node's derivative is computed from the values and derivatives of the nodes it depends on.
* Backward mode propagates derivatives top-down, from the output back towards the inputs.
* It can be implemented in different ways. An elegant approach is to use dual numbers: introduce an entity $\epsilon$ with $\epsilon^2 = 0$ and treat a real number $a$ as $a + \epsilon a'$. Thus $$\frac{f(x + \epsilon x') - f(x)}{\epsilon x'} = f'(x) \implies f(x+\epsilon x') = f(x) + \epsilon f'(x)x' = b + \epsilon b' $$ where $b = f(x)$ and $b' = f'(x)x'$.

Basically, a dual number is analogous to a data structure that carries the derivative along with the value it was derived from (called the *primal*).
As a quick exercise for grasping this, one can apply it to $f(g(x))$ to find its derivative and notice that chaining works perfectly well too.
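
The dual-number idea above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how PyTorch implements autograd: the `Dual` class, its `primal`/`tangent` field names, and the `sin` helper are hypothetical names chosen for this example, and only addition and multiplication are supported.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Hypothetical dual number a + eps * a', with eps**2 == 0."""
    primal: float   # the value a
    tangent: float  # the derivative a' carried along with it

    @staticmethod
    def lift(value):
        # Plain numbers are constants, so their derivative is zero.
        return value if isinstance(value, Dual) else Dual(float(value), 0.0)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # (a + eps a')(b + eps b') = ab + eps (a b' + a' b), since eps**2 == 0
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    __rmul__ = __mul__


def sin(d):
    # f(a + eps a') = f(a) + eps f'(a) a'
    return Dual(math.sin(d.primal), math.cos(d.primal) * d.tangent)


# Seed the input with tangent 1 so that the output tangent is dy/dx.
x = Dual(1.0, 1.0)
y = 5 * (x + 2) * (x + 2)        # y = 5 (x + 2)^2
print(y.primal, y.tangent)        # 45.0 30.0

# Chaining: sin(g(x)) with g(x) = x * x has derivative 2x cos(x^2).
z = sin(x * x)
print(z.tangent, 2 * math.cos(1.0))
```

The composition `sin(x * x)` needs no special handling: each operation only applies its own local rule, and the chain rule falls out of passing the tangent along.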

Time to try this.

For instance, let's say we have this function:

$$y = 5(x + 2)^2$$

And its derivative, as we know, is

$$y' = 10(x + 2)$$

In [238]:
x = torch.ones(1, requires_grad=True)
y = 5 * (x + 2) ** 2

We can see that at $x = 1$, $y = 45$

In [240]:
x,y

(tensor([ 1.]), tensor([ 45.]))

Let's calculate the gradients. In the background, $\frac{dy}{dx}$ is computed by autograd's backward mode.

In [241]:
y.backward()

As we can expect, at $x = 1$, $y' = 30$

In [243]:
x.grad

tensor([ 30.])

***

Let's try another function that takes 3 inputs and see how the partial derivatives are computed:

$$ f(x,y,z) = \sin(x^2) + e^y\tan z $$

$$ \frac{\partial f}{\partial x} = 2x\cos(x^2), \quad \frac{\partial f}{\partial y} = e^y\tan z, \quad \frac{\partial f}{\partial z} = e^y \sec^2 z $$

In [4]:
x = torch.tensor([3.0], requires_grad=True)
y = torch.tensor([4.5], requires_grad=True)
z = torch.tensor([6.8], requires_grad=True)

In [5]:
f = torch.sin(x**2) + torch.exp(y)*torch.tan(z)

In [6]:
f

tensor([ 51.5725])

At $x= 3, y = 4.5, z = 6.8, f = 51.57247162$ 

And the partial derivatives are 

$$ \left.\frac{\partial f}{\partial x}\right\vert_{(x,y,z) = (3,4.5,6.8)} = -5.4668 $$
$$ \left.\frac{\partial f}{\partial y}\right\vert_{(x,y,z) = (3,4.5,6.8)} = 51.1603 $$
$$ \left.\frac{\partial f}{\partial z}\right\vert_{(x,y,z) = (3,4.5,6.8)} = 119.0936 $$

In [7]:
# Compute the gradients
f.backward()

In [8]:
x.grad,y.grad,z.grad

(tensor([-5.4668]), tensor([ 51.1604]), tensor([ 119.0936]))
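
As a sanity check that does not depend on PyTorch, the closed-form partial derivatives can be compared against a central-difference approximation. This is a minimal sketch: the helper names `analytic_grad` and `numeric_grad` and the step size `h=1e-6` are assumptions made for this example. It also illustrates why autograd is not numerical differentiation: the finite-difference result is only an approximation whose quality depends on `h`.

```python
import math


def f(x, y, z):
    # Same function as above: sin(x^2) + e^y * tan(z)
    return math.sin(x ** 2) + math.exp(y) * math.tan(z)


def analytic_grad(x, y, z):
    # Closed-form partial derivatives from the formulas above.
    return (
        2 * x * math.cos(x ** 2),
        math.exp(y) * math.tan(z),
        math.exp(y) / math.cos(z) ** 2,   # sec^2 z = 1 / cos^2 z
    )


def numeric_grad(func, point, h=1e-6):
    # Central differences: (f(p + h e_i) - f(p - h e_i)) / (2h) for each input i.
    grads = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        grads.append((func(*up) - func(*down)) / (2 * h))
    return tuple(grads)


point = (3.0, 4.5, 6.8)
exact = analytic_grad(*point)
approx = numeric_grad(f, point)
for name, a, n in zip("xyz", exact, approx):
    print(f"df/d{name}: analytic={a:.4f}  numeric={n:.4f}")
```

Both columns should agree with the `x.grad`, `y.grad`, `z.grad` values above up to the single-precision rounding of the tensors.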

***

Let's see an example where the computation graph branches, i.e. one intermediate result feeds two different outputs.

In [61]:
x = torch.tensor([3.0], requires_grad=True)

Ex: $f(x) = \sin x$, $g = \ln f$, $h = \sqrt{f}$

In [66]:
f = torch.sin(x)
g = torch.log(f)
h = torch.pow(f,0.5)

In [67]:
f,g,h

(tensor([ 0.1411]), tensor([-1.9581]), tensor([ 0.3757]))

And the derivatives are : 

$$ \frac{\partial g}{\partial x} = \frac{\partial g}{\partial f}\frac{\partial f}{\partial x} = \cot x $$
$$ \frac{\partial h}{\partial x} = \frac{\cos x}{2\sqrt{\sin x}} $$

whose values at $x = 3$ are :
$\frac{\partial g}{\partial x}= -7.0153 $ and $\frac{\partial h}{\partial x}= -1.3176 $

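As we see, $g$ and $h$ both depend on $x$ (through $f$) but are independent of each other. So when we call `.backward()` on both $g$ and $h$, the gradients w.r.t. $x$ accumulate in `x.grad`. The first call passes `retain_graph=True` because $g$ and $h$ share the $\sin x$ node; without it, the buffers that node needs would be freed after the first backward pass and the second call would fail.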

In [68]:
g.backward(retain_graph=True)

In [69]:
x.grad

tensor([-7.0153])

In [70]:
h.backward()

In [71]:
x.grad

tensor([-8.3329])

which is $(-7.0153) + (-1.3176) = -8.3329$.
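
If the two gradients are wanted separately rather than summed, one option is to clear `x.grad` between the two backward passes. The sketch below repeats the example with that extra step; the variable names `dg_dx` and `dh_dx` are introduced here only for illustration.

```python
import torch

x = torch.tensor([3.0], requires_grad=True)
f = torch.sin(x)
g = torch.log(f)
h = torch.sqrt(f)

# First branch: keep the graph alive because h shares the sin(x) node.
g.backward(retain_graph=True)
dg_dx = x.grad.clone()          # d(ln sin x)/dx = cot x

# Clear the accumulated value so the next backward() starts from zero.
x.grad.zero_()
h.backward()
dh_dx = x.grad.clone()          # d(sqrt(sin x))/dx = cos x / (2 sqrt(sin x))

print(dg_dx, dh_dx, dg_dx + dh_dx)
```

The `.clone()` calls matter: `x.grad` is updated in place, so without a copy both names would end up pointing at the same, later value.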

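In the case of neural networks, after backprop, `optimizer.step()` updates $x$ (for plain gradient descent) as $x_{i+1} = x_{i} - \lambda \frac{\partial J}{\partial x}$, where $J$ is the objective (loss) function and $\lambda$ is the learning rate. The second term contains the derivative of $J$ w.r.t. $x$, which is exactly what `.backward()` stores in `x.grad`. Because gradients accumulate across calls, they are usually reset (e.g. with `optimizer.zero_grad()`) before each backward pass.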
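
To make the update rule concrete, here is a minimal sketch of a single step with `torch.optim.SGD`, compared against the formula applied by hand. Reusing $5(x+2)^2$ from the first example as the objective $J$ and picking `lr = 0.05` are assumptions made purely for illustration.

```python
import torch

lr = 0.05                                   # learning rate (lambda), chosen arbitrarily
x = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([x], lr=lr)

optimizer.zero_grad()                       # clear any previously accumulated gradient
J = 5 * (x + 2) ** 2                        # objective, reused from the first example
J.backward()                                # fills x.grad with dJ/dx = 10 (x + 2) = 30

expected = x.detach() - lr * x.grad         # x_{i+1} = x_i - lambda * dJ/dx, by hand
optimizer.step()                            # the same update, performed in place on x

print(x, expected)
```

With these values the hand-computed update is $1 - 0.05 \cdot 30 = -0.5$, and `x` holds the same value after `optimizer.step()`.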

In PyTorch, a computation graph is built during the forward pass of the network, and when `.backward()` is called, the gradients are computed in backward mode along that graph only. Hence, for networks with dynamic control flow, the gradients are computed only for the graph that was actually built during the forward pass of each iteration. This lets us freeze part of the network, so that no gradients are computed for it, or use other complex control flow.
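
A small sketch of both points, a frozen parameter and a data-dependent branch, is given below. The tensors `w_frozen`, `w_a`, `w_b` and the `forward` function are hypothetical stand-ins for parts of a network, not anything defined earlier.

```python
import torch

w_frozen = torch.tensor([2.0], requires_grad=False)   # frozen: autograd does not track it
w_a = torch.tensor([1.5], requires_grad=True)
w_b = torch.tensor([-0.5], requires_grad=True)


def forward(x):
    hidden = w_frozen * x
    # Data-dependent branch: only one of w_a / w_b enters the graph per call.
    if hidden.sum() > 0:
        return (w_a * hidden).sum()
    return (w_b * hidden).sum()


out = forward(torch.tensor([3.0]))
out.backward()

# Only the parameter on the branch that actually ran receives a gradient.
print(w_a.grad, w_b.grad, w_frozen.grad)
```

Here `hidden` is positive, so only `w_a` takes part in the graph: `w_a.grad` is populated, while `w_b.grad` and `w_frozen.grad` stay `None`. A different input could route the computation through `w_b` instead.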