# Automatic differentiation
As a running example, consider a sigmoid neuron with two inputs:


$f(w,x)=\frac{1}{1+e^{-(w_{0}x_{0}+w_{1}x_{1}+w_{2})}}$


## Computational Graph
<img src='images/auto.svg' />

## Intermediate Functions

$a=w_{0}*x_{0}$

$b=w_{1}*x_{1}$

$c=a+b$

$d=c+w_{2}$

$e=-d$

$f=\exp(e)$

$g=1+f$

$h=1/g$



## Input Values
$w_0=2.0$

$x_0=-1.0$

$w_1=-3.0$

$x_1=-2.0$

$w_2=-3.0$

## Forward
$a= -2.0$

$b= 6.0$

$c= 4.0$

$d= 1.0$

$e= -1.0$

$f= 0.3679$

$g= 1.3679$

$h= 0.7311$
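The forward pass above can be reproduced in a few lines of plain Python. This is only a minimal sketch: the variable names follow the intermediate functions, and nothing beyond the standard `math` module is assumed.

```python
import math

# Input values from the example
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# Forward pass: evaluate each intermediate function in order
a = w0 * x0        # -2.0
b = w1 * x1        #  6.0
c = a + b          #  4.0
d = c + w2         #  1.0
e = -d             # -1.0
f = math.exp(e)    # about 0.3679
g = 1 + f          # about 1.3679
h = 1 / g          # about 0.7311

print(round(f, 4), round(g, 4), round(h, 4))
```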


## Backward Gradients
1) $\frac{\partial h  }{\partial h }=1$

2) $\frac{\partial h }{\partial g }=\frac{-1}{g^2}=-0.5344$

$\frac{\partial g }{\partial f }=1$

3) $\frac{\partial h }{\partial f }=\frac{\partial h }{\partial g }\frac{\partial g }{\partial f }=-0.5344$

$\frac{\partial f }{\partial e}=\exp(e)=0.3679$

4) $\frac{\partial h }{\partial e}=\frac{\partial h }{\partial f}\frac{\partial f }{\partial e}=-0.5344*0.3679=-0.1966$

$\frac{\partial e }{\partial d}=-1$

5) $\frac{\partial h }{\partial d}=\frac{\partial h }{\partial e}\frac{\partial e }{\partial d}=-0.1966*(-1)=0.1966$

$\frac{\partial d }{\partial c}=1$

6) $\frac{\partial h }{\partial c}=\frac{\partial h }{\partial d}\frac{\partial d }{\partial c}=0.1966*1=0.1966$

$\frac{\partial d }{\partial w_2}=1$

7) $\frac{\partial h }{\partial w_2}=\frac{\partial h }{\partial d}\frac{\partial d }{\partial w_2}=0.1966*1=0.1966$

$\frac{\partial c }{\partial a}=1$

8) $\frac{\partial h }{\partial a}=\frac{\partial h }{\partial c}\frac{\partial c }{\partial a}=0.1966*1=0.1966$

$\frac{\partial c }{\partial b}=1$

9) $\frac{\partial h }{\partial b}=\frac{\partial h }{\partial c}\frac{\partial c }{\partial b}=0.1966*1=0.1966$

$\frac{\partial a }{\partial w_0}=x_0$

10) $\frac{\partial h }{\partial w_0}=\frac{\partial h }{\partial a}\frac{\partial a }{\partial w_0}=0.1966*x_0=-0.1966$

$\frac{\partial a }{\partial x_0 }=w_0$

11) $\frac{\partial h }{\partial x_0 }=\frac{\partial h }{\partial a }\frac{\partial a}{\partial x_0 }=0.1966*w_0=0.3932$

$\frac{\partial b }{\partial w_1}=x_1$

12) $\frac{\partial h }{\partial w_1}=\frac{\partial h }{\partial b}\frac{\partial b }{\partial w_1}=0.1966*x_1=-0.3932$

$\frac{\partial b }{\partial x_1}=w_1$

13) $\frac{\partial h }{\partial x_1}=\frac{\partial h }{\partial b}\frac{\partial b }{\partial x_1}=0.1966*w_1=-0.5898$
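The chain-rule steps above can also be carried out by hand in plain Python. The following is a sketch of reverse-mode differentiation for this one example, not a general implementation; the `dh_d*` variable names are chosen here for illustration.

```python
import math

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass (the intermediate values are needed by the backward pass)
a = w0 * x0
b = w1 * x1
c = a + b
d = c + w2
e = -d
f = math.exp(e)
g = 1 + f
h = 1 / g

# Backward pass: walk the graph from the output h back to the inputs,
# multiplying the upstream gradient by each local derivative
dh_dg = -1 / g**2            # h = 1 / g
dh_df = dh_dg * 1.0          # g = 1 + f
dh_de = dh_df * math.exp(e)  # f = exp(e)
dh_dd = dh_de * -1.0         # e = -d
dh_dc = dh_dd * 1.0          # d = c + w2
dh_dw2 = dh_dd * 1.0
dh_da = dh_dc * 1.0          # c = a + b
dh_db = dh_dc * 1.0
dh_dw0 = dh_da * x0          # a = w0 * x0
dh_dx0 = dh_da * w0
dh_dw1 = dh_db * x1          # b = w1 * x1
dh_dx1 = dh_db * w1

# Sanity check: for a sigmoid output, dh/dd equals h * (1 - h)
assert math.isclose(dh_dd, h * (1 - h))

for name, value in [("dh/dw0", dh_dw0), ("dh/dx0", dh_dx0),
                    ("dh/dw1", dh_dw1), ("dh/dx1", dh_dx1),
                    ("dh/dw2", dh_dw2)]:
    print(f"{name} = {value:.4f}")
```

Each gradient is obtained from the gradient of the node downstream of it, so every intermediate result is computed once and reused.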


# Autograd
Autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of `Function` objects.


## torch.autograd
`torch.autograd` is PyTorch’s automatic differentiation engine.

In [12]:
import torch

x=torch.tensor([2.0 ,3.0],requires_grad=True)
y=torch.tensor([6.0 ,4.0],requires_grad=True)
z=3*x**3-y**2

When we call `.backward()` on `z`, autograd computes the gradients of `z` with respect to `x` and `y` and stores them in
the respective tensors’ `.grad` attribute.
Because `z` is a vector rather than a scalar, we need to pass a `gradient` argument to `z.backward()` explicitly.
`gradient` is a tensor of the same shape as `z`, and it represents the gradient of `z` w.r.t. itself, i.e. `dz/dz = 1`.

In [11]:
external_grad=torch.ones(2)   # dz/dz = 1 for each element of z
z.backward(external_grad)
print(x.grad)    # dz/dx = 9*x**2
print(x.shape)
print(y.grad)    # dz/dy = -2*y


tensor([36., 81.])
torch.Size([2])
tensor([-12.,  -8.])
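Passing a tensor of ones as the `gradient` argument should be equivalent to reducing the output to a scalar with `sum()` and calling `backward()` without arguments. The sketch below compares the two options on the same `x`, `y` and `z`; the helper names `grad_x_explicit` and `grad_y_explicit` are introduced here only for illustration.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([6.0, 4.0], requires_grad=True)

# Option 1: pass dz/dz = 1 explicitly for every element of z
z = 3 * x**3 - y**2
z.backward(gradient=torch.ones_like(z))
grad_x_explicit = x.grad.clone()
grad_y_explicit = y.grad.clone()

# Reset the accumulated gradients before the second backward pass
x.grad = None
y.grad = None

# Option 2: reduce z to a scalar first, then call backward() without arguments
z = 3 * x**3 - y**2
z.sum().backward()

print(torch.equal(grad_x_explicit, x.grad))
print(torch.equal(grad_y_explicit, y.grad))
```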


DAGs are dynamic in PyTorch. An important thing to note is that **the graph is recreated from scratch**: after each `.backward()` call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model. You can change the shape, size and operations at every iteration if needed.
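As a small illustration of a dynamic graph, the sketch below uses data-dependent control flow, so a different graph is recorded depending on the input. The function name `forward` and the two branches are hypothetical and chosen only to make the gradients easy to check.

```python
import torch

def forward(x):
    # Data-dependent control flow: the recorded graph differs per call
    if x.sum() > 0:
        return (x ** 2).sum()   # gradient: 2 * x
    return (3 * x).sum()        # gradient: 3 for every element

x_pos = torch.tensor([1.0, 2.0], requires_grad=True)
forward(x_pos).backward()
print(x_pos.grad)   # gradient of the first branch

x_neg = torch.tensor([-1.0, -2.0], requires_grad=True)
forward(x_neg).backward()
print(x_neg.grad)   # gradient of the second branch
```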

In this DAG:

1. **arrows**: point in the direction of the forward pass.
2. **nodes**: represent the backward functions of each operation in the forward pass.
3. **leaves**: a leaf tensor sits at the beginning of the graph, meaning that no operation tracked by the autograd engine created it. Nodes in blue represent our leaf tensors.
4. **roots**: the output tensors.
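The distinction between leaves and other nodes can be inspected directly on tensors through `is_leaf` and `grad_fn`. The snippet below is a minimal sketch on a two-tensor example chosen here for brevity.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # leaf: created directly by the user
x = torch.tensor(-1.0, requires_grad=True)  # leaf
a = w * x                                   # created by a tracked operation
h = torch.sigmoid(a)                        # output tensor (root of the graph)

print(w.is_leaf, w.grad_fn)   # leaf tensors have no grad_fn
print(a.is_leaf, a.grad_fn)   # non-leaf tensors point to their backward function
# next_functions links a backward node to the backward nodes of its inputs
print(h.grad_fn.next_functions)
```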


In [6]:
import torch
w0=torch.tensor(2.0,requires_grad=True )
x0=torch.tensor(-1.0,requires_grad=True)

w1=torch.tensor(-3.0,requires_grad=True)
x1=torch.tensor(-2.0,requires_grad=True)

w2=torch.tensor(-3.0,requires_grad=True)


a=w0*x0
b=w1*x1
c=a+b
d=c+w2
e=-d
f=torch.exp(e)
g=1+f
h=1/g

print("a=",a)
print("b=",b)
print("c=",c)
print("d=",d)
print("e=",e)
print("f=",f)
print("g=",g)
print("h=",h)


h.backward()
print("dh/dw0=",w0.grad)
print("dh/dx0=",x0.grad)
print("dh/dw1=",w1.grad)
print("dh/dx1=",x1.grad)
print("dh/dw2=",w2.grad)


import torchviz

# Names shown on the nodes of the rendered graph
h_params={'w0':w0,'x0':x0,'w1':w1,'x1':x1,'w2':w2,
          'a':a ,'b':b, 'c':c, 'd':d, 'e':e, 'f':f, 'g':g, 'h':h }

# Render the backward graph of h to images/graph.svg
dot=torchviz.make_dot(h,params=h_params)
dot.format='svg'
dot.render(filename='graph', directory='images')


a= tensor(-2., grad_fn=<MulBackward0>)
b= tensor(6., grad_fn=<MulBackward0>)
c= tensor(4., grad_fn=<AddBackward0>)
d= tensor(1., grad_fn=<AddBackward0>)
e= tensor(-1., grad_fn=<NegBackward>)
f= tensor(0.3679, grad_fn=<ExpBackward>)
g= tensor(1.3679, grad_fn=<AddBackward0>)
h= tensor(0.7311, grad_fn=<MulBackward0>)
dh/dw0= tensor(-0.1966)
dh/dx0= tensor(0.3932)
dh/dw1= tensor(-0.3932)
dh/dx1= tensor(-0.5898)
dh/dw2= tensor(0.1966)


'images/graph.svg'

# Visualisation of network

<img src='images/graph.svg'/>

# Exclusion from the DAG and turning off computation of gradients

In a NN, parameters that don’t compute gradients are usually called **frozen parameters**. It is useful to “freeze” part of your model if you know in advance that you won’t need the gradients of those parameters (this offers some performance benefits by reducing autograd computations).

There are two main reasons to freeze parameters:
1. Performance: you know in advance that you won’t need the gradients of those parameters.
2. **Finetuning** a pretrained network: for example, keeping the convnet layers fixed and training only the fully connected layer.
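Freezing can be sketched as follows. The two `Linear` layers are hypothetical stand-ins for a pretrained backbone and a new head, and the names `backbone`, `head` and `trainable` are assumptions made for this illustration rather than part of any particular model.

```python
import torch

# Hypothetical stand-in for a pretrained network: a "backbone" and a "head"
backbone = torch.nn.Linear(8, 4)
head = torch.nn.Linear(4, 2)
model = torch.nn.Sequential(backbone, torch.nn.ReLU(), head)

# Freeze the backbone: autograd will not compute gradients for these parameters
for param in backbone.parameters():
    param.requires_grad_(False)

# Give the optimizer only the parameters that are still trainable
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-2)

loss = model(torch.randn(16, 8)).sum()
loss.backward()

print(backbone.weight.grad)          # None: frozen parameters receive no gradient
print(head.weight.grad is not None)  # True: the head is still trained
```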

Also, during evaluation/validation you should turn off gradient computation with `torch.no_grad()`, paired with `model.eval()`, which switches batch normalization layers to their running statistics and disables dropout layers.
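A typical evaluation step might look like the sketch below. The tiny model with a dropout layer is a hypothetical example, used only to show that the outputs become deterministic in eval mode and carry no gradient information.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4, 4),
    torch.nn.Dropout(p=0.5),
)
x = torch.randn(3, 4)

model.eval()                 # dropout becomes a no-op in eval mode
with torch.no_grad():        # no graph is recorded inside this block
    out1 = model(x)
    out2 = model(x)

print(out1.requires_grad)        # False
print(torch.equal(out1, out2))   # True: dropout is inactive in eval mode

model.train()                # switch back before resuming training
```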


## no_grad()
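`torch.no_grad()` is a context manager. Its `__enter__()` method calls `torch.set_grad_enabled(False)`, so every tensor produced by an operation inside the block has `requires_grad=False`, even if its inputs require gradients. Existing tensors such as `x` keep their own `requires_grad` flag.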

In [15]:
import torch

x=torch.randn([2,3], requires_grad=True)
print(x)
y=2*x
print(y)
print(y.requires_grad)

with torch.no_grad():
    y=2*x                    # computed without gradient tracking
    print(y.requires_grad)
# y still refers to the tensor created inside the block
print(y.requires_grad)
    

tensor([[ 0.8074, -1.7852,  0.8007],
        [-0.1867, -0.2302, -0.9471]], requires_grad=True)
tensor([[ 1.6147, -3.5705,  1.6014],
        [-0.3734, -0.4605, -1.8943]], grad_fn=<MulBackward0>)
True
False
False
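To make the scope of `no_grad()` explicit, the following sketch checks the gradient mode inside and outside the block. The variable names `inside`, `outside` and `grad_mode_inside` are chosen here for illustration.

```python
import torch

x = torch.randn(2, 3, requires_grad=True)

with torch.no_grad():
    inside = 2 * x
    grad_mode_inside = torch.is_grad_enabled()

outside = 2 * x

print(x.requires_grad)        # True: the leaf itself is unchanged
print(inside.requires_grad)   # False
print(grad_mode_inside)       # False
print(outside.requires_grad)  # True: tracking resumes after the block
```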


## detach()
The `detach()` method returns a tensor that is detached from the computational graph, so no gradient will be backpropagated through it.

In [8]:
import torchviz
import torch

a=torch.tensor([1.2],requires_grad=True)
b=torch.tensor([2.2],requires_grad=True)
c=torch.tensor([0.2],requires_grad=True)
d=torch.tensor([0.8],requires_grad=True)
e=torch.tensor([7.],requires_grad=True)
f_params={'a':a,'b':b,'c':c,'d':d,'e':e}

# Graph before detach
f=d*((a+b)*(c))+e
f.backward()
dot=torchviz.make_dot(f,params=f_params)
dot.format='svg'
dot.render(filename='graph_before_detach', directory='images')

# Graph after detach
f=d*((a+b)*(c.detach()))+e
dot=torchviz.make_dot(f,params=f_params)
dot.format='svg'
dot.render(filename='graph_after_detach', directory='images')


'images/graph_after_detach.svg'
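The effect of `detach()` on the gradients themselves can be checked without rendering the graph. This is a reduced sketch with only two of the tensors, so the expression differs from the one used for the figures.

```python
import torch

a = torch.tensor([1.2], requires_grad=True)
c = torch.tensor([0.2], requires_grad=True)

# c.detach() shares data with c but is cut off from the graph
f = (a * c.detach()).sum()
f.backward()

print(a.grad)   # tensor([0.2000]): df/da = c
print(c.grad)   # None: no gradient flows back through the detached tensor
```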

### Graph before detach:
<img src='images/graph_before_detach.svg'/>

### Graph after detach:
<img src='images/graph_after_detach.svg'/>

# Setting the gradients to zero

PyTorch **accumulates** the gradients on subsequent backward passes. This is convenient while training RNNs. 

- `optimizer.zero_grad()` clears old gradients from the last step (otherwise you’d just accumulate the gradients from all `loss.backward()` calls).
- `loss.backward()` computes the derivative of the loss w.r.t. the parameters (or anything requiring gradients) using backpropagation.
- `optimizer.step()` causes the optimizer to take a step based on the gradients of the parameters.



We need to call `zero_grad()` explicitly because, after `loss.backward()` computes the gradients, `optimizer.step()` uses them to perform the gradient descent update. More specifically, the gradients are not zeroed automatically because these two operations, `loss.backward()` and `optimizer.step()`, are separate, and `optimizer.step()` requires the just-computed gradients.

In addition, we sometimes need to accumulate gradients over several batches; to do that, we can simply call `backward()` multiple times and call `optimizer.step()` once.
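The accumulation behaviour is easy to observe on a single scalar. The sketch below uses a hypothetical parameter `w` and zeroes its `.grad` directly, which is assumed here as a stand-in for what `optimizer.zero_grad()` does for every parameter.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Two backward passes without zeroing: the gradients add up
(3 * w).backward()
(3 * w).backward()
print(w.grad)   # tensor(6.)

# Resetting the gradient starts the next step from zero
w.grad.zero_()
(3 * w).backward()
print(w.grad)   # tensor(3.)
```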






Refs: [1](https://stackoverflow.com/questions/44732217/why-do-we-need-to-explicitly-call-zero-grad?noredirect=1&lq=1), [2](https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch#)

In [None]:
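import torch

# Use the GPU when available, otherwise fall back to the CPU
if torch.cuda.is_available():
    dev=torch.device("cuda")
else:
    dev=torch.device("cpu")

learning_rate=1e-6
N,D_in,H,D_out=64,1000,100,10
# Random inputs and targets; the targets must match the model output shape (N, D_out)
x=torch.randn(N,D_in,device=dev)
y=torch.randn(N,D_out,device=dev)

model=torch.nn.Sequential()
model.add_module('w0',torch.nn.Linear(D_in,H))
model.add_module('relu',torch.nn.ReLU())
# Each module needs a unique name; reusing 'w0' would replace the first layer
model.add_module('w1',torch.nn.Linear(H,D_out))
model.to(dev)

loss_function=torch.nn.MSELoss(reduction='sum')

optimizer=torch.optim.SGD(model.parameters(),lr=learning_rate)

number_of_iterations=500
for i in range(number_of_iterations):
    y_predict=model(x)
    loss=loss_function(y_predict,y)
    print(i, loss.item())
    loss.backward()         # compute the gradients
    optimizer.step()        # update the parameters using the gradients
    optimizer.zero_grad()   # clear the gradients before the next iteration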