# Automatic differentiation
Example:


$f(w,x)=\frac{1}{1+e^{-(w_{0}x_{0}+w_{1}x_{1}+w_{2})}}$


## Computational Graph
<img src='images/auto.svg' />

## Intermediate Functions

$a=w_{0}*x_{0}$

$b=w_{1}*x_{1}$

$c=a+b$

$d=c+w_{2}$

$e=-d$

$f=exp(e)$

$g=1+f$

$h=1/g$



## Input Values
$w_0=2.0$

$x_0=-1.0$

$w_1=-3.0$

$x_1=-2.0$

$w_2=-3.0$

## Forward
$a= -2.0$

$b= 6.0$

$c= 4.0$

$d= 1.0$

$e= -1.0$

$f= 0.3679$

$g= 1.3679$

$h= 0.7311$


## Backward Gradients
1) $\frac{\partial h  }{\partial h }=1$

2) $\frac{\partial h }{\partial g }=\frac{-1}{g^2}=-0.5344$

$\frac{\partial g }{\partial f }=1$

3) $\frac{\partial h }{\partial f }=\frac{\partial h }{\partial g }\frac{\partial g }{\partial f }=-0.5344$

$\frac{\partial f }{\partial e}=exp(e)=0.3679$

4) $\frac{\partial h }{\partial e}=\frac{\partial h }{\partial f}\frac{\partial f }{\partial e}=-0.5344*0.3679=-0.1966$

$\frac{\partial e }{\partial d}=-1$

5) $\frac{\partial h }{\partial d}=\frac{\partial h }{\partial e}\frac{\partial e }{\partial d}=-0.1966*(-1)=0.1966$

$\frac{\partial d }{\partial c}=1$

6) $\frac{\partial h }{\partial c}=\frac{\partial h }{\partial d}\frac{\partial d }{\partial c}=0.1966*1=0.1966$

$\frac{\partial d }{\partial w_2}=1$

7) $\frac{\partial h }{\partial w_2}=\frac{\partial h }{\partial d}\frac{\partial d }{\partial w_2}=0.1966*1=0.1966$

$\frac{\partial c }{\partial a}=1$

8) $\frac{\partial h }{\partial a}=\frac{\partial h }{\partial c}\frac{\partial c }{\partial a}=0.1966*1=0.1966$

$\frac{\partial c }{\partial b}=1$

9) $\frac{\partial h }{\partial b}=\frac{\partial h }{\partial c}\frac{\partial c }{\partial b}=0.1966*1=0.1966$

$\frac{\partial a }{\partial w_0}=x_0$

10) $\frac{\partial h }{\partial w_0}=\frac{\partial h }{\partial a}\frac{\partial a }{\partial w_0}=0.1966*x_0=-0.1966$

$\frac{\partial a }{\partial x_0 }=w_0$

11) $\frac{\partial h }{\partial x_0 }=\frac{\partial h }{\partial a }\frac{\partial a}{\partial x_0 }=0.1966*w_0=0.3932$

$\frac{\partial b }{\partial w_1}=x_1$

12) $\frac{\partial h }{\partial w_1}=\frac{\partial h }{\partial b}\frac{\partial b }{\partial w_1}=0.1966*x_1=-0.3932$

$\frac{\partial b }{\partial x_1}=w_1$

13) $\frac{\partial h }{\partial x_1}=\frac{\partial h }{\partial b}\frac{\partial b }{\partial x_1}=0.1966*w_1=-0.5898$
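The chain-rule steps above can be replayed by hand in plain Python. This is only a sketch using the standard `math` module; the `dh_d...` variable names and the `grads` dictionary are illustrative choices, not part of the original notes.

```python
import math

# Input values from the example
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# Forward pass: evaluate each intermediate function in order
a = w0 * x0
b = w1 * x1
c = a + b
d = c + w2
e = -d
f = math.exp(e)
g = 1.0 + f
h = 1.0 / g

# Backward pass: local derivative times the upstream gradient
dh_dg = -1.0 / g**2
dh_df = dh_dg * 1.0      # g = 1 + f
dh_de = dh_df * f        # f = exp(e), so df/de = exp(e) = f
dh_dd = dh_de * -1.0     # e = -d
dh_dc = dh_dd * 1.0      # d = c + w2
dh_dw2 = dh_dd * 1.0
dh_da = dh_dc * 1.0      # c = a + b
dh_db = dh_dc * 1.0
dh_dw0 = dh_da * x0      # a = w0 * x0
dh_dx0 = dh_da * w0
dh_dw1 = dh_db * x1      # b = w1 * x1
dh_dx1 = dh_db * w1

grads = {"w0": dh_dw0, "x0": dh_dx0, "w1": dh_dw1, "x1": dh_dx1, "w2": dh_dw2}
for name, value in grads.items():
    print(f"dh/d{name} = {value:.4f}")
```

Every gradient is the product of one local derivative and the gradient already computed for the node downstream of it, which is the bookkeeping autograd automates.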


In [2]:
import torch
w0=torch.tensor(2.0,requires_grad=True )
x0=torch.tensor(-1.0,requires_grad=True)

w1=torch.tensor(-3.0,requires_grad=True)
x1=torch.tensor(-2.0,requires_grad=True)

w2=torch.tensor(-3.0,requires_grad=True)


a=w0*x0
b=w1*x1
c=a+b
d=c+w2
e=-d
f=torch.exp(e)
g=1+f
h=1/g

print("a=",a)
print("b=",b)
print("c=",c)
print("d=",d)
print("e=",e)
print("f=",f)
print("g=",g)
print("h=",h)


h.backward()
print("dh/dw0=",w0.grad)
print("dh/dx0=",x0.grad)
print("dh/dw1=",w1.grad)
print("dh/dx1=",x1.grad)
print("dh/dw2=",w2.grad)

a= tensor(-2., grad_fn=<MulBackward0>)
b= tensor(6., grad_fn=<MulBackward0>)
c= tensor(4., grad_fn=<AddBackward0>)
d= tensor(1., grad_fn=<AddBackward0>)
e= tensor(-1., grad_fn=<NegBackward>)
f= tensor(0.3679, grad_fn=<ExpBackward>)
g= tensor(1.3679, grad_fn=<AddBackward0>)
h= tensor(0.7311, grad_fn=<MulBackward0>)
dh/dw0= tensor(-0.1966)
dh/dx0= tensor(0.3932)
dh/dw1= tensor(-0.3932)
dh/dx1= tensor(-0.5898)
dh/dw2= tensor(0.1966)
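Since $h$ is the sigmoid of $d$, the five steps from $h$ back to $d$ collapse into the closed form $\frac{\partial h}{\partial d}=h(1-h)$. The sketch below checks this against autograd; packing the weights into a single tensor `w` and the names `local` and `expected` are assumptions made for brevity.

```python
import torch

w = torch.tensor([2.0, -3.0, -3.0], requires_grad=True)  # w0, w1, w2
x = torch.tensor([-1.0, -2.0])                           # x0, x1

d = w[0] * x[0] + w[1] * x[1] + w[2]
h = torch.sigmoid(d)
h.backward()

# Closed form: dh/dd = h * (1 - h), then chain through d = w0*x0 + w1*x1 + w2
local = (h * (1 - h)).item()
expected = torch.tensor([local * x[0].item(), local * x[1].item(), local])

print(w.grad)
print(torch.allclose(w.grad, expected))
```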


# Autograd
Autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, **leaves** are the input tensors and **roots** are the output tensors.
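The recorded graph can be inspected from any output tensor through its `grad_fn` attribute. The following is a minimal sketch on a made-up two-operation expression; the tensors `x`, `y`, and `z` are illustrative only.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x * 3   # recorded as a multiplication node
z = y + 1   # recorded as an addition node

print(z.grad_fn)                 # backward function that produced z
print(z.grad_fn.next_functions)  # edges to the nodes feeding into z
print(x.grad_fn)                 # None: x was not created by an operation
```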
  
# Leaf
A leaf tensor is a tensor at the beginning of the graph: no operation tracked by the autograd engine created it.
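Whether a tensor is a leaf can be checked with its `is_leaf` attribute. A short sketch, with tensor names chosen only for illustration:

```python
import torch

a = torch.tensor([1.2], requires_grad=True)  # created by the user: leaf
b = torch.tensor([2.2], requires_grad=True)  # created by the user: leaf
s = a + b                                    # created by a tracked operation

print(a.is_leaf, b.is_leaf, s.is_leaf)   # True True False
print(a.grad_fn, s.grad_fn is not None)  # None True
```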


Below is a visual representation of the **DAG** for the expression `f=d*((a+b)*(c))+e`. In the graph, the *arrows* point in the direction of the forward pass, and the *nodes* represent the backward functions of each operation in the forward pass. The *leaf* nodes in blue represent our leaf tensors.

In [11]:
import torch
import torchviz

a=torch.tensor([1.2],requires_grad=True)
b=torch.tensor([2.2],requires_grad=True)
c=torch.tensor([0.2],requires_grad=True)
d=torch.tensor([0.8],requires_grad=True)
e=torch.tensor([7.],requires_grad=True)

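f=d*((a+b)*(c))+e

# Render the graph of f (assumes the torchviz package and Graphviz are installed);
# in a notebook the returned object is displayed inline
torchviz.make_dot(f)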

<img src='images/graph.svg'/>

DAGs are dynamic in PyTorch. An important thing to note is that the graph is recreated from scratch: after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.
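As a small illustration of a graph that changes between iterations, the sketch below uses an ordinary Python loop whose length differs on each pass. The helper `forward` and the `results` list are hypothetical names introduced for this example.

```python
import torch

def forward(x, steps):
    # Ordinary Python control flow: the number of recorded
    # multiplications depends on `steps`
    y = x
    for _ in range(steps):
        y = y * x
    return y

x = torch.tensor(2.0, requires_grad=True)
results = []

for steps in (1, 2):
    x.grad = None           # clear the gradient left by the previous pass
    y = forward(x, steps)   # a fresh graph is recorded on this call
    y.backward()
    results.append((steps, y.item(), x.grad.item()))

print(results)   # [(1, 4.0, 4.0), (2, 8.0, 12.0)]
```

The first pass differentiates $x^2$ and the second $x^3$, even though the same Python function was called both times.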

# Exclusion from the DAG

In a NN, parameters that don’t compute gradients are usually called **frozen parameters**. It is useful to “freeze” part of your model if you know in advance that you won’t need the gradients of those parameters (this offers some performance benefits by reducing autograd computations).

Another common use case where exclusion from the DAG is important is **finetuning** a pretrained network.

Tensors can be excluded from the DAG in three ways:

- requires_grad=False
- no_grad()
- detach
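A minimal sketch of the first option, freezing parameters via `requires_grad`, might look as follows. The two-layer `model` is a hypothetical stand-in for a pretrained network, and its layer sizes are arbitrary.

```python
import torch

# Hypothetical two-layer model standing in for a pretrained network
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)

# Freeze the first layer; only the last layer stays trainable
for param in model[0].parameters():
    param.requires_grad_(False)

loss = model(torch.randn(3, 4)).sum()
loss.backward()

print(model[0].weight.grad)              # None: frozen, no gradient computed
print(model[2].weight.grad is not None)  # True
```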

# no_grad()
*torch.no_grad()* is a context manager: its **__enter__()** method calls *set_grad_enabled(False)*, so any tensor produced by an operation inside the block has *requires_grad* set to False.

When you update weights such as w1 and w2 inside no_grad(), the update itself is not recorded in the graph: only the values of w1 and w2 change. They remain leaf tensors that require gradients, so back propagation can continue in the next iteration.
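The behaviour described above can be observed on a single weight tensor. This is a sketch only; the tensor `w`, the toy loss, and the step size `0.1` are assumptions made for the example.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (w ** 2).sum()
loss.backward()            # w.grad is now 2 * w = [2., 4.]

with torch.no_grad():
    y = w * 2              # result of an operation inside no_grad
    w -= 0.1 * w.grad      # in-place update, not recorded in the graph

print(y.requires_grad)             # False
print(w.requires_grad, w.is_leaf)  # True True
print(w.tolist())                  # values are now close to [0.8, 1.6]
```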


# detach
*detach()*: detaches the output from the computational graph, so no gradient is backpropagated through this variable.

In [12]:
f=d*((a+b)*(c.detach()))+e

<img src='images/graph_detach.svg'/>
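To check the effect of detach() numerically, the following sketch rebuilds the same tensors and inspects the gradients after a backward pass. Calling `f.backward()` here is an addition for illustration; the original cell only builds the expression.

```python
import torch

a = torch.tensor([1.2], requires_grad=True)
b = torch.tensor([2.2], requires_grad=True)
c = torch.tensor([0.2], requires_grad=True)
d = torch.tensor([0.8], requires_grad=True)
e = torch.tensor([7.0], requires_grad=True)

# c enters the expression as a constant because of detach()
f = d * ((a + b) * c.detach()) + e
f.backward()

print(c.grad)   # None: no gradient flowed back to c
print(a.grad)   # d * c, approximately 0.16
print(d.grad)   # (a + b) * c, approximately 0.68
```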

# zero_grad

PyTorch accumulates the gradients on subsequent backward passes. 

We need to call zero_grad() explicitly because loss.backward() only computes the gradients, and a separate call, optimizer.step(), uses them to perform the gradient descent update. Because the two operations are separate and optimizer.step() needs the gradients that were just computed, PyTorch cannot zero them automatically.

In addition, it is sometimes useful to accumulate gradients over several batches; to do that, simply call backward() multiple times and step the optimizer once.
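The accumulation can be seen on a single scalar. This is a minimal sketch with a made-up tensor `w`; the expression `w * 2` has a gradient of 2 with respect to `w` on every pass.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

(w * 2).backward()
print(w.grad)        # tensor(2.)

(w * 2).backward()   # second pass adds to the stored gradient
print(w.grad)        # tensor(4.)

w.grad.zero_()       # reset before the next, independent pass
(w * 2).backward()
print(w.grad)        # tensor(2.)
```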

Refs: [1](https://stackoverflow.com/questions/44732217/why-do-we-need-to-explicitly-call-zero-grad?noredirect=1&lq=1), [2](https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch#)

In [6]:
import torch

N,D_in,H, D_out=64, 1000, 100, 10
learning_rate=1e-6
number_of_iterations=500

x=torch.randn(N,D_in)
y=torch.randn(N,D_out)


model1=torch.nn.Sequential(torch.nn.Linear(D_in,H),
                    torch.nn.ReLU(),
                    torch.nn.Linear(H,D_out))

loss_function=torch.nn.MSELoss(reduction='sum')


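for i in range(number_of_iterations):
    y_predict=model1(x)
    loss = loss_function(y_predict, y)
    #print(i, loss.item())
    loss.backward()
    # Update the weights manually; no_grad() keeps the update out of the graph
    with torch.no_grad():
        for param in model1.parameters():
            param -= learning_rate * param.grad

    # Clear the accumulated gradients before the next backward pass
    model1.zero_grad()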
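The same loop can be written with an optimizer, which makes the zero_grad() / backward() / step() sequence explicit. This is a sketch under assumptions: `torch.optim.SGD` is swapped in for the manual update, and the fixed seed and the `losses` list are additions for reproducibility and inspection.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_function = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)

losses = []
for i in range(500):
    optimizer.zero_grad()   # clear gradients from the previous step
    loss = loss_function(model(x), y)
    loss.backward()         # compute fresh gradients
    optimizer.step()        # apply the update using those gradients
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Plain SGD without momentum performs the same update as the manual `param -= learning_rate * param.grad` step.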