# PyTorch: backpropagation and gradients

1. .grad
2. .backward()
3. retain_grad(), retain_graph
4. torch.optim, .step()

Examples:
1. scalar example
2. vector example
3. function example

## PyTorch introduction

PyTorch is a Python library that provides tools for working with tensors.
Key feature: it can track the gradients of tensors.

Tensor: a multidimensional array.

In [2]:
import torch

# torch.Tensor

my_tensor = torch.tensor([[1.0,2.0],[1.0,2.0]])
my_tensor.size()

torch.Size([2, 2])
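
Beyond the size, a few attributes are worth checking on any tensor before working with gradients. The following is a minimal sketch; the name `tracked` is only illustrative, and the printed dtype assumes the default dtype has not been changed.

```python
import torch

# A 2x2 tensor of floats; requires_grad defaults to False.
my_tensor = torch.tensor([[1.0, 2.0], [1.0, 2.0]])

print(my_tensor.shape)          # same information as my_tensor.size()
print(my_tensor.dtype)          # torch.float32 unless the default dtype was changed
print(my_tensor.requires_grad)  # False: gradients are not tracked

# Gradient tracking can be requested when the tensor is created.
tracked = torch.tensor([[1.0, 2.0], [1.0, 2.0]], requires_grad=True)
print(tracked.requires_grad)    # True
```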

## Simple scalar example
 
Let's define the following: $\\ p \in \mathbb{R} \\ w = 10p \\ l = w^2$.
 
From calculus we know that:
$$  \frac{\partial w}{\partial p} = 10, \
\frac{\partial l}{\partial w} = 2w, \
\frac{\partial l}{\partial p} = \frac{\partial l}{\partial w} \frac{\partial w}{\partial p} = 2w * 10 = 2(10p)*10 = 200p $$

Note that p is "free" (it doesn't depend on anything).

l --> w --> p

Torch builds exactly this kind of graph when these tensors are defined. The graph that torch builds is called the computational graph.
Tensors that do not depend on anything (such as p) are called leaf nodes (...).

In [4]:
def init_variables(scaler: float = 1.0, requires_grad: bool = True):
    # p is the leaf tensor; w and l are computed from it
    p = torch.tensor([scaler], requires_grad=requires_grad)
    w = 10 * p
    l = w ** 2  # same as w.pow(2)
    return p, w, l

p, w, l = init_variables(1.0)
print(p, w, l)


tensor([1.], requires_grad=True) tensor([10.], grad_fn=<MulBackward0>) tensor([100.], grad_fn=<PowBackward0>)
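
The graph described above can be inspected directly. The sketch below rebuilds the same three tensors and prints `is_leaf` and `grad_fn` for each one; it is meant as an illustration of the l --> w --> p chain, not as something required in normal use.

```python
import torch

p = torch.tensor([1.0], requires_grad=True)  # leaf: created directly by the user
w = 10 * p                                   # produced by an operation on p
l = w ** 2                                   # produced by an operation on w

for name, t in [("p", p), ("w", w), ("l", l)]:
    # is_leaf is True only for p; grad_fn records the operation that created the tensor
    print(name, t.is_leaf, t.grad_fn)

# Each grad_fn links back to the nodes it was computed from.
print(l.grad_fn.next_functions)
```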


Each tensor has a requires_grad attribute. This attribute controls gradient tracking: if a tensor has requires_grad=True, it is attached to the computational graph.
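
For contrast, here is a small sketch of what happens when no tensor in the chain requires gradients: no graph is recorded, and calling backward() fails.

```python
import torch

p = torch.tensor([1.0], requires_grad=False)
w = 10 * p
l = w ** 2

# Nothing in the chain requires gradients, so no graph is recorded.
print(w.requires_grad, l.requires_grad)  # False False
print(l.grad_fn)                         # None

# Calling backward() on such a tensor raises a RuntimeError.
try:
    l.backward()
except RuntimeError as err:
    print("backward() failed:", err)
```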

So... how do we compute derivatives?

In torch, any tensor that is part of the computational graph can call .backward(), which computes the derivative of that tensor wrt the leaf nodes.

tensor.backward()

w.backward()

## Graph leaves are always the tensors we compute gradients with respect to when calling backward()

But where is the result stored?
In torch, when you call .backward(), the result of the computation is available in the .grad attribute of the leaf tensors.

In [6]:
p,w,_=init_variables(scaler=2.0)
p.backward()  # derivative of p wrt itself: dp/dp = 1
p.grad

tensor([1.])

In [7]:
p,w,_=init_variables(scaler=2.0)
w.backward()
print(w.grad)  # None: w is not a leaf tensor, so its gradient is not kept by default

None




In [8]:
def init_variables_retain_w(scaler: float=1.0, requires_grad: bool=True):
    p = torch.tensor([scaler],requires_grad=requires_grad)
    w = 10*p 
    if w.requires_grad:
        w.retain_grad()  # keep the gradient of the non-leaf tensor w
    l= w**2 # pow(2)
    return p, w, l
p,w,l=init_variables_retain_w(1.0)
w.backward()
print(w.grad)  # dw/dw = 1

tensor([1.])


In [9]:
p,w,l=init_variables(scaler=1.0)
l.backward()

print(p.grad)

tensor([200.])
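
The value 200 matches the formula dl/dp = 200p at p = 1. As a rough check, the sketch below repeats the computation for a few other values of p (chosen arbitrarily) and compares the autograd result with 200p.

```python
import torch

for value in [0.5, 1.0, 2.0, 3.0]:
    p = torch.tensor([value], requires_grad=True)
    w = 10 * p
    l = w ** 2
    l.backward()
    expected = 200 * value  # analytic result: dl/dp = 200p
    print(value, p.grad.item(), expected)
```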


In [None]:
# dl / dw
p ,w, l = init_variables_retain_w(scaler=1.0)
l.backward()
print(w.grad)

# dl / dp = dl / dw * dw / dp
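
The chain rule in the comment above can be checked numerically. This is a self-contained sketch that calls `retain_grad()` on w directly, so that the gradient of the non-leaf tensor is kept after `backward()`.

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
w = 10 * p
w.retain_grad()  # ask autograd to keep the gradient of the non-leaf tensor w
l = w ** 2
l.backward()

print(w.grad)  # dl/dw = 2w = 20 at p = 1
print(p.grad)  # dl/dp = dl/dw * dw/dp = 20 * 10 = 200
```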

## Be Careful

In [None]:
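p, w, l = init_variables_retain_w(scaler=1.0)
# retain_graph=True keeps the computational graph so that backward() can be called again
w.backward(retain_graph=True)
print(p.grad)
# Gradients accumulate in .grad across backward() calls.
# To reset p.grad before the second call, use one of:
# p.grad = torch.zeros_like(p)
# p.grad.zero_()
w.backward()
print(p.grad)  # 10 + 10 = 20, because p.grad was not reset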

tensor([10.])
tensor([20.])
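
Gradients in .grad accumulate across backward() calls, which is why the second value above is 20 rather than 10. The sketch below shows the accumulation and then an in-place reset; it is a minimal illustration, not the only way to clear gradients.

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
w = 10 * p

w.backward(retain_graph=True)
print(p.grad)   # tensor([10.])

w.backward(retain_graph=True)
print(p.grad)   # tensor([20.]): the second result was added to the first

p.grad.zero_()  # reset in place; setting p.grad = None also works
w.backward()
print(p.grad)   # tensor([10.])
```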


## clone and detach

clone: makes a copy of the tensor it is called on.
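
The difference between the two methods in the heading can be sketched as follows. The names `c`, `d`, and `snapshot` are illustrative only.

```python
import torch

p = torch.tensor([2.0], requires_grad=True)

c = p.clone()   # new memory, still connected to p in the computational graph
d = p.detach()  # shares memory with p, but is cut off from the graph

print(c.requires_grad, c.grad_fn)  # True, with a grad_fn pointing back to p
print(d.requires_grad, d.grad_fn)  # False None

# Gradients flow back through the clone to p.
(c ** 2).sum().backward()
print(p.grad)   # d(p^2)/dp = 2p = 4

# detach().clone() gives an independent copy with no gradient tracking.
snapshot = p.detach().clone()
print(snapshot.requires_grad)  # False
```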

## Example with tensors
 
Let's define the following: $\\ p = (2,2,2,2) \\ w = p^2$ \
and a  function $$ \ell(p,w) = \sum_{i=1}^{4} (p_i - w_i)^2 = \sum_{i=1}^{4} (p_i - p_i^2)^2 = \ell(p). $$
Hence $$ \frac{\partial \ell}{\partial p_i} = 2(p_i-p_i^2)(1-2p_i) = 4p_i^3-6p_i^2+2p_i $$
 
In particular:
$$ \nabla_p \ell = (\frac{\partial \ell}{\partial p_1}, \frac{\partial \ell}{\partial p_2}, \frac{\partial \ell}{\partial p_3}, \frac{\partial \ell}{\partial p_4}) = (4p_1^3-6p_1^2+2p_1, 4p_2^3-6p_2^2+2p_2, 4p_3^3-6p_3^2+2p_3, 4p_4^3-6p_4^2+2p_4)$$
 
Note also that:
$$ \frac{\partial \ell}{\partial w_i} = -2(p_i-w_i) = -2(p_i-p_i^2)$$
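
As a quick check of these formulas at the chosen point, substituting $p_i = 2$ (so $w_i = 4$) gives:

$$ \ell = \sum_{i=1}^{4} (2 - 4)^2 = 16, \quad
\frac{\partial \ell}{\partial w_i} = -2(2 - 4) = 4, \quad
\frac{\partial \ell}{\partial p_i} = 4 \cdot 2^3 - 6 \cdot 2^2 + 2 \cdot 2 = 12 $$

The same value of $\frac{\partial \ell}{\partial p_i}$ follows from adding the direct dependence on $p_i$ to the dependence through $w_i$:

$$ \frac{\partial \ell}{\partial p_i} = 2(p_i - w_i) + \frac{\partial \ell}{\partial w_i} \frac{\partial w_i}{\partial p_i} = 2(2 - 4) + 4 \cdot (2 \cdot 2) = -4 + 16 = 12 $$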

In [2]:
import torch
p = torch.tensor([2.0,2.0,2.0,2.0],requires_grad=True)
w = p**2
print(p,w)

tensor([2., 2., 2., 2.], requires_grad=True) tensor([4., 4., 4., 4.], grad_fn=<PowBackward0>)


In [4]:
def l_fn(p,w) -> torch.Tensor:
    return (p-w).pow(2).sum()

p= torch.tensor([2.0,2.0,2.0,2.0],requires_grad=True)
w = p**2
l = l_fn(p,w)
print(l)
l.backward()  # dl/dp including the path through w: 4p^3 - 6p^2 + 2p = 12 at p = 2

print(p.grad)

tensor(16., grad_fn=<SumBackward0>)
tensor([12., 12., 12., 12.])
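
To check the closed-form gradients at more than one point, the sketch below uses a vector with different entries (values chosen arbitrarily) and compares autograd against the formulas for $\partial \ell / \partial p_i$ and $\partial \ell / \partial w_i$. It calls `retain_grad()` on w so that `w.grad` is available.

```python
import torch

p = torch.tensor([0.5, 1.0, 2.0, 3.0], requires_grad=True)
w = p ** 2
w.retain_grad()  # w is not a leaf, so its gradient must be retained explicitly
l = (p - w).pow(2).sum()
l.backward()

# Analytic formulas, evaluated without recording a graph.
with torch.no_grad():
    expected_dp = 4 * p**3 - 6 * p**2 + 2 * p
    expected_dw = -2 * (p - w)

print(p.grad, expected_dp)
print(w.grad, expected_dw)
```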


## So Far...

1. we can compute derivatives of a scalar (l) wrt leaf nodes (p) using l.backward()
2. the result is stored in p.grad

Next step: how to change p according to p.grad?

