# PyTorch: backpropagation and gradients

1. .grad
2. .backward()
3. retain_grad(), retain_graph
4. torch.optim, .step()

Examples:
1. scalar example
2. vector example
3. function example

## PyTorch introduction

PyTorch is a Python library that provides tools for working with tensors.
Key feature: it can track the gradients of tensors.

Tensor: a multidimensional array.

In [2]:
import torch

# torch.Tensor

my_tensor = torch.tensor([[1.0,2.0],[1.0,2.0]])
my_tensor.size()

torch.Size([2, 2])

## Simple scalar example
 
Let's define the following: $\\ p \in \mathbb{R} \\ w = 10p \\ l = w^2$.
 
From calculus we know that:
$$  \frac{\partial w}{\partial p} = 10, \
\frac{\partial l}{\partial w} = 2w, \
\frac{\partial l}{\partial p} = \frac{\partial l}{\partial w} \frac{\partial w}{\partial p} = 2w * 10 = 2(10p)*10 = 200p $$

Note that p is "free" (it doesn't depend on anything).

l --> w --> p

PyTorch builds exactly this kind of graph when these tensors are defined. The graph that torch builds is called the computational graph.
Tensors that do not depend on anything (here, p) are called leaf nodes (...).
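
As a minimal sketch of how this graph can be inspected, assuming the same scalar setup as in this section (p, w = 10p, l = w^2): every tensor exposes `is_leaf` and `grad_fn`, which tell whether it is a leaf and which operation produced it.

```python
import torch

p = torch.tensor([1.0], requires_grad=True)  # leaf: created directly by the user
w = 10 * p                                   # non-leaf: result of an operation on p
l = w ** 2                                   # non-leaf: result of an operation on w

for name, t in [("p", p), ("w", w), ("l", l)]:
    # grad_fn is None for leaves and references the backward function otherwise
    print(name, "is_leaf:", t.is_leaf, "grad_fn:", t.grad_fn)
```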

In [4]:
def init_variables(scaler: float=1.0, requires_grad: bool=True):
    p = torch.tensor([scaler],requires_grad=requires_grad)
    w = 10*p 
    l= w**2 # pow(2)
    return p, w, l

p,w,l=init_variables(1.0)
print(p,w,l)


tensor([1.], requires_grad=True) tensor([10.], grad_fn=<MulBackward0>) tensor([100.], grad_fn=<PowBackward0>)


Each tensor has a requires_grad attribute. This attribute enables gradient tracking for the tensor: if a tensor has requires_grad=True, it is attached to the computational graph.
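
A small sketch of how this flag propagates through operations; the tensors `a` and `b` are hypothetical names used only for illustration.

```python
import torch

a = torch.tensor([1.0])                      # requires_grad defaults to False
b = torch.tensor([1.0], requires_grad=True)  # tracked tensor

print((10 * a).requires_grad)  # no tracked input, so nothing is recorded
print((10 * b).requires_grad)  # the result depends on a tracked tensor
print((a + b).requires_grad)   # one tracked input is enough
```

Only results that depend on at least one tracked tensor become part of the computational graph.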

So... how do we compute derivatives?

In torch, each tensor with requires_grad=True has a method called .backward(), which computes the derivative of that tensor with respect to the leaf nodes.

tensor.backward()

w.backward()

## Graph leaves are always the tensors that gradients are computed with respect to when calling backward()

But where is the result stored?
In torch, when you call .backward(), the result of the computation is available in the .grad attribute of the leaf tensors.

In [6]:
p,w,_=init_variables(scaler=2.0)
p.backward()
p.grad

tensor([1.])

In [7]:
p,w,_=init_variables(scaler=2.0)
w.backward()
print(w.grad)

None




In [8]:
def init_variables_retain_w(scaler: float=1.0, requires_grad: bool=True):
    p = torch.tensor([scaler],requires_grad=requires_grad)
    w = 10*p 
    if requires_grad:
        w.retain_grad() # keep the gradient of the non-leaf tensor w
    l= w**2 # pow(2)
    return p, w, l
p,w,l=init_variables_retain_w(1.0)
w.backward()
print(w.grad) # dw/dw = 1

tensor([1.])

In [9]:
p,w,l=init_variables(scaler=1.0)
l.backward()

print(p.grad)

tensor([200.])


In [None]:
# dl / dw
p ,w, l = init_variables_retain_w(scaler=1.0)
l.backward()
print(w.grad)

# dl / dp =  dl / dw * dw / dp
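
The chain rule above can be checked numerically. This is a sketch under the same assumptions as the scalar example (p = 1, w = 10p, l = w^2); `dw_dp` is written by hand from the definition of w rather than computed by autograd.

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
w = 10 * p
w.retain_grad()      # keep dl/dw on the non-leaf tensor w after backward()
l = w ** 2
l.backward()

dl_dw = w.grad                   # 2 * w = 20
dw_dp = torch.tensor([10.0])     # known analytically from w = 10 * p
print(dl_dw, p.grad)
print(torch.allclose(p.grad, dl_dw * dw_dp))  # chain rule: 200 = 20 * 10
```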

## Be Careful

In [None]:
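p ,w, l = init_variables_retain_w(scaler=1.0)
w.backward(retain_graph=True) # keep the graph so that backward() can be called again
print(p.grad)
# Gradients accumulate in p.grad across backward() calls, so reset it first.
# Alternatives:
#p.grad = torch.zeros_like(p)
#p.grad = torch.zeros([1])
p.grad.zero_()
w.backward() # after this call the computational graph is freed
print(p.grad) # without the reset above this would print tensor([20.])

tensor([10.])
tensor([10.])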
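
To make the pitfall explicit, here is a minimal sketch of gradient accumulation, assuming the same w = 10p setup. The names `first`, `second` and `third` are only there to snapshot `p.grad` at each stage.

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
w = 10 * p

w.backward(retain_graph=True)  # keep the graph so backward() can be called again
first = p.grad.clone()         # dw/dp = 10

w.backward(retain_graph=True)  # no zeroing in between: the new gradient is added
second = p.grad.clone()        # 10 + 10 = 20

p.grad.zero_()                 # reset in place before the next backward pass
w.backward()
third = p.grad.clone()         # back to 10

print(first, second, third)
```

This is why training loops reset the gradients before every backward pass.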


## clone and detach

clone: makes a copy of the tensor it is called on.
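
A short sketch contrasting the two methods in the heading. It relies on standard PyTorch behaviour: `clone()` copies the data and stays in the computational graph, while `detach()` returns a tensor that shares the same data but is cut off from the graph. The variable names are hypothetical.

```python
import torch

p = torch.tensor([2.0], requires_grad=True)

c = p.clone()    # new memory, still part of the graph (it has a grad_fn)
d = p.detach()   # same memory as p, but no gradient tracking

print(c.requires_grad, d.requires_grad)

(c ** 2).sum().backward()   # gradients flow through the clone back to p
print(p.grad)               # d(p^2)/dp = 2 * p = 4

safe_copy = p.detach().clone()  # independent copy with no gradient tracking
```

Combining both, as in `safe_copy`, is a common way to get a plain copy that neither shares memory with the original nor affects its gradients.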

## Example with tensors
 
Let's define the following: $\\ p = (2,2,2,2) \\ w = p^2$ \
and a  function $$ \ell(p,w) = \sum_{i=1}^{4} (p_i - w_i)^2 = \sum_{i=1}^{4} (p_i - p_i^2)^2 = \ell(p). $$
Hence $$ \frac{\partial \ell}{\partial p_i} = 2(p_i-p_i^2)(1-2p_i) = 4p_i^3-6p_i^2+2p_i $$
 
In particular:
$$ \nabla_p \ell = (\frac{\partial \ell}{\partial p_1}, \frac{\partial \ell}{\partial p_2}, \frac{\partial \ell}{\partial p_3}, \frac{\partial \ell}{\partial p_4}) = (4p_1^3-6p_1^2+2p_1, 4p_2^3-6p_2^2+2p_2, 4p_3^3-6p_3^2+2p_3, 4p_4^3-6p_4^2+2p_4)$$
 
Note also that:
$$ \frac{\partial \ell}{\partial w_i} = -2(p_i-w_i) = -2(p_i-p_i^2)$$
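
Both formulas can be checked in one backward pass. This is a sketch assuming p = (2,2,2,2) as above; since w is not a leaf, `retain_grad()` is needed to keep its gradient.

```python
import torch

p = torch.tensor([2.0, 2.0, 2.0, 2.0], requires_grad=True)
w = p ** 2
w.retain_grad()                 # w is not a leaf, so ask PyTorch to keep its gradient
l = (p - w).pow(2).sum()
l.backward()

print(w.grad)   # -2 * (p - w) = -2 * (2 - 4) = 4 for every component
print(p.grad)   # 4p^3 - 6p^2 + 2p = 32 - 24 + 4 = 12 for every component
```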

In [2]:
import torch
p = torch.tensor([2.0,2.0,2.0,2.0],requires_grad=True)
w = p**2
print(p,w)

tensor([2., 2., 2., 2.], requires_grad=True) tensor([4., 4., 4., 4.], grad_fn=<PowBackward0>)


In [4]:
def l_fn(p,w) -> torch.Tensor:
    return (p-w).pow(2).sum()

p= torch.tensor([2.0,2.0,2.0,2.0],requires_grad=True)
w = p**2
l = l_fn(p,w)
print(l)
l.backward()  # dl/dp_i = 2*(p_i-w_i)*(1-2*p_i) = 2*(2-4)*(1-4) = 12

print(p.grad)

tensor(16., grad_fn=<SumBackward0>)
tensor([12., 12., 12., 12.])


## So Far...

1. we can compute derivatives of a scalar (l) wrt leaf nodes (p) (l.backward())
2. the result will be in p.grad

Next step: how to change p according to p.grad?



Let's go back to the previous example, changing the numbers a little.
 
Let's define the following: $\\ p = (1,3,2,2) \\ w = p^2$
$$ \ell(p,w) = \sum_{i=1}^{4} (p_i - w_i)^2 = \sum_{i=1}^{4} (p_i - p_i^2)^2 = \ell(p). $$
We already know that $$ \frac{\partial \ell}{\partial p_i} = 2(p_i-p_i^2)(1-2p_i) = 4p_i^3-6p_i^2+2p_i $$
and we know that the content of `p.grad` is simply:
$$ \nabla_p \ell = (\frac{\partial \ell}{\partial p_1}, \frac{\partial \ell}{\partial p_2}, \frac{\partial \ell}{\partial p_3}, \frac{\partial \ell}{\partial p_4}) = (4p_1^3-6p_1^2+2p_1, 4p_2^3-6p_2^2+2p_2, 4p_3^3-6p_3^2+2p_3, 4p_4^3-6p_4^2+2p_4)$$
which in our case is:
$$p.grad = \nabla_p \ell = (4-6+2, 108-54+6, 32-24+4, 32-24+4) = (0, 60, 12, 12)$$
 
When we call `optimizer.step()` with learning rate $\eta$ at time $t$, we get:
$$p_{i, next} = p_{i,before} - \eta * \frac{\partial \ell}{\partial p_i} = p_{i,before} - \eta * (4p_i^3-6p_i^2+2p_i)$$
 
Putting numbers together, we get:
$$p_{1, next} = 1 - \eta * 0 = 1, \\
p_{2, next} = 3 - \eta * 60 = 3 - 60\eta, \\
p_{3, next} = 2 - \eta * 12 = 2 - 12\eta, \\
p_{4, next} = 2 - \eta * 12 = 2 - 12\eta.$$
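
The update above can be reproduced with plain SGD. This is a sketch in which the learning rate `eta = 0.01` is an assumed value chosen only for illustration; the text leaves $\eta$ symbolic.

```python
import torch

p = torch.tensor([1.0, 3.0, 2.0, 2.0], requires_grad=True)
eta = 0.01                                 # assumed learning rate, for illustration
opt = torch.optim.SGD([p], lr=eta)

l = (p - p ** 2).pow(2).sum()
opt.zero_grad()
l.backward()
print(p.grad)                              # expected (0, 60, 12, 12)

expected = p.detach() - eta * p.grad       # p_next from the formula above
opt.step()
print(p)                                   # expected (1, 2.4, 1.88, 1.88)
print(torch.allclose(p.detach(), expected))
```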

In [3]:
def l_fn(p,w) -> torch.Tensor:
    return (p-w).pow(2).sum()

In [5]:
import torch
p = torch.tensor([2.0,2.0,2.0,2.0],requires_grad=True)
w = p**2
print(f'p before{p}')

opt = torch.optim.SGD([p],lr=1.0)

l = l_fn(p,w)

opt.zero_grad()
l.backward()
opt.step()
print(f'p after {p}')


p beforetensor([2., 2., 2., 2.], requires_grad=True)
p after tensor([-10., -10., -10., -10.], requires_grad=True)


1. we can compute gradients dl/dp
2. we can update p according to dl/dp
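
For plain SGD (no momentum or weight decay), `opt.step()` performs the same update as subtracting `lr * p.grad` by hand. The sketch below compares the two on the example above; `p_opt` and `p_man` are hypothetical names for the two copies of p.

```python
import torch

def l_fn(p, w):
    return (p - w).pow(2).sum()

lr = 1.0

# Update with the optimizer
p_opt = torch.tensor([2.0, 2.0, 2.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([p_opt], lr=lr)
opt.zero_grad()
l_fn(p_opt, p_opt ** 2).backward()
opt.step()

# Same update written by hand
p_man = torch.tensor([2.0, 2.0, 2.0, 2.0], requires_grad=True)
l_fn(p_man, p_man ** 2).backward()
with torch.no_grad():          # the update itself must not be recorded in the graph
    p_man -= lr * p_man.grad

print(p_opt, p_man)
```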

## Tensor NN

f(input,p)--> output

In [21]:
def a_funcion (input_tensor: torch.Tensor , p: torch.Tensor) -> torch.Tensor:
    return input_tensor*p
dim = 2
torch.manual_seed(0)
p = torch.rand([2],requires_grad=True)
print(p)

tensor([0.4963, 0.7682], requires_grad=True)


In [22]:
dataset = torch.rand([100,2])
dataset.size()

torch.Size([100, 2])

**CRUCIAL POINT**
1. Define a task: learn how to multiply the input_tensor by 2
2. Define the loss function

In [23]:
def loss(input: torch.Tensor, output: torch.Tensor) -> torch.Tensor:
    """
    input : a datapoint from the dataset
    output: a_funcion(input, p)
    """
    # squared error against the target 2*input, averaged over the last dimension
    tmp =  (output-2.*input).pow(2).mean(-1)
    print(tmp.shape)
    return tmp.mean()
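
For a single datapoint $x$ of dimension $n$, this loss equals $\frac{1}{n}\sum_i x_i^2 (p_i - 2)^2$, so $\frac{\partial loss}{\partial p_i} = \frac{2}{n} x_i^2 (p_i - 2)$, which vanishes at $p = (2, 2)$. The sketch below checks this; the loss is redefined locally without the debug print, and `x`, `analytic` and `p_star` are hypothetical names.

```python
import torch

def a_funcion(input_tensor, p):
    return input_tensor * p

def loss(input, output):
    return (output - 2.0 * input).pow(2).mean()

torch.manual_seed(0)
x = torch.rand([2])                       # one hypothetical datapoint
p = torch.rand([2], requires_grad=True)

loss(x, a_funcion(x, p)).backward()
analytic = 2.0 / x.numel() * x ** 2 * (p.detach() - 2.0)
print(p.grad, analytic)

# At p = (2, 2) the loss and its gradient are both zero
p_star = torch.tensor([2.0, 2.0], requires_grad=True)
l_star = loss(x, a_funcion(x, p_star))
l_star.backward()
print(l_star.item(), p_star.grad)
```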


In [None]:
torch.manual_seed(0)
p = torch.rand([dim],requires_grad=True)
print(f'initial p: {p}')
dataset = torch.rand([100,dim])

opt = torch.optim.SGD([p],lr=0.01)
updates = 0
for data in dataset:
    opt.zero_grad() # reset gradients accumulated in the previous iteration
    output = a_funcion(data,p)
    print(output)

    l = loss(data,output)
    l.backward() # dloss / dp
    opt.step()
    updates +=1
print(f'final p: {p}')
print(updates)

initial p: tensor([0.4963, 0.7682], requires_grad=True)
tensor([0.2586, 0.6317], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2607, 0.6386], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2628, 0.6454], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2649, 0.6521], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2670, 0.6588], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2691, 0.6655], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2712, 0.6721], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2733, 0.6787], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2754, 0.6852], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2775, 0.6917], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2796, 0.6982], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2816, 0.7046], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2837, 0.7109], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2858, 0.7172], grad_fn=<MulBackward0>)
torch.Size([])
tensor([0.2878, 0.7235], grad_fn=<MulBackward0>)
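
With more passes over the data, p should move toward (2, 2), the value that solves the task. The following is a sketch under assumed settings: a learning rate of 0.1 and 10 epochs are illustrative choices, not values from the run above, and the model and loss are inlined without the debug prints.

```python
import torch

torch.manual_seed(0)
dim = 2
p = torch.rand([dim], requires_grad=True)
dataset = torch.rand([100, dim])
opt = torch.optim.SGD([p], lr=0.1)     # assumed learning rate

for epoch in range(10):                # assumed number of passes over the data
    for data in dataset:
        opt.zero_grad()                # reset gradients from the previous step
        output = data * p              # same model as a_funcion
        l = (output - 2.0 * data).pow(2).mean()
        l.backward()
        opt.step()
    print(epoch, p.detach())
```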

In [26]:
torch.manual_seed(0)

p = torch.rand([dim],requires_grad=True)
print(f'initial p: {p}')
dataset = torch.rand([100,dim])
lr = 0.1
#opt = torch.optim.SGD([p],lr=0.01)

for data in dataset:
    output = a_funcion(data,p)


    l = loss(data,output)
    p.grad = torch.zeros_like(p)

    l.backward() # dloss / dp
    with torch.no_grad():
        p -= lr*p.grad
    

    #p_grad = p.grad
    #p = p.detach()
    #p = p - lr*p_grad
    #p.requires_grad = True


    print (f'initial p: {p}')

    



initial p: tensor([0.4963, 0.7682], requires_grad=True)
torch.Size([])
initial p: tensor([0.4974, 0.7704], requires_grad=True)
torch.Size([])
initial p: tensor([0.5116, 0.8198], requires_grad=True)
torch.Size([])
initial p: tensor([0.5474, 0.9146], requires_grad=True)
torch.Size([])
initial p: tensor([0.5775, 0.9580], requires_grad=True)
torch.Size([])
initial p: tensor([0.5949, 0.9749], requires_grad=True)
torch.Size([])
initial p: tensor([0.5949, 0.9778], requires_grad=True)
torch.Size([])
initial p: tensor([0.6071, 1.0053], requires_grad=True)
torch.Size([])
initial p: tensor([0.6749, 1.0689], requires_grad=True)
torch.Size([])
initial p: tensor([0.6783, 1.0763], requires_grad=True)
torch.Size([])
initial p: tensor([0.7397, 1.1537], requires_grad=True)
torch.Size([])
initial p: tensor([0.7596, 1.2184], requires_grad=True)
torch.Size([])
initial p: tensor([0.7814, 1.2423], requires_grad=True)
torch.Size([])
initial p: tensor([0.8920, 1.2424], requires_grad=True)
torch.Size([])
initia