# Gradients and backpropagation in `PyTorch` (by Michele Alessi)

## Takeaways😇:
1. Understand `.grad` attribute
2. Understand `.backward()` method
3. Deal with advanced features like `retain_graph` and `retain_grad()`
4. 1 + 2 *should* introduce you smoothly to `optimizers` and `.step()` method.

We will go through the following examples:
1. a scalar (i.e. dealing with numbers) example to understand how gradients are calculated and backpropagated.
2. a vector (i.e. dealing with tensors) example.
3. an example involving functions.

## Pytorch introduction
`PyTorch` is a library that basically provides tools to work with tensors. Tensors are nothing but multi-dimensional arrays.

We have many operations that we can do with tensors, like addition, multiplication, etc. Many of them work similarly to numpy.
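A minimal sketch of a few such operations, assuming only `torch` itself; the tensor names `a` and `b` are illustrative and not part of the original notebook.

```python
import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[10.0, 20.0], [30.0, 40.0]])

total = a + b            # element-wise addition, as in numpy
product = a * b          # element-wise multiplication, as in numpy
matmul = a @ b           # matrix multiplication
col_sum = a.sum(dim=0)   # `dim` plays the role of numpy's `axis`

print(total)
print(product)
print(matmul)
print(col_sum)
```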

Other operations are specific to `torch`, but we will see them as we go on (I think it is better if we analyze an operation in the proper context rather than just list them).

To start, the basic operation we need is how to define a tensor. We can do it with the `torch.tensor()` function.

In [1]:
import torch
from typing import Tuple

my_tensor = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(my_tensor)
print(my_tensor.size())

tensor([[1, 2, 3],
        [4, 5, 6]])
torch.Size([2, 3])


The big thing about Pytorch is that it provides automatic differentiation. This means that we can calculate the gradients of a function with respect to its parameters without having to do it manually.


## Simple scalar example

Let's define the following: $\\ p \in \mathbb{R} \\ w = 10p \\ l = w^2$. 

From calculus we know that:
$$  \frac{\partial w}{\partial p} = 10, \
\frac{\partial l}{\partial w} = 2w, \
\frac{\partial l}{\partial p} = \frac{\partial l}{\partial w} \frac{\partial w}{\partial p} = 2w * 10 = 2(10p)*10 = 200p $$



Note that p is 'free', it does not depend on anything. 
w depends on p, and l depends on w, which in turn depends on p.

l --> w --> p

We can see this as a graph. Torch actually works with this kind of graph, which is called the computational graph.

The tensors that do not depend on anything are called leaf tensors. 

In our example p is the only leaf tensor.

In [2]:
import torch 

def init_variables(scalar: float = 1.0, requires_grad: bool = True) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    p = torch.tensor([scalar], requires_grad=requires_grad)
    w = 10*p
    l = w**2
    return p, w, l

p, w, l = init_variables(scalar=1.0)
print('p:', p)
print('w = 10p:', w)
print('l = w^2:', l)

p: tensor([1.], requires_grad=True)
w = 10p: tensor([10.], grad_fn=<MulBackward0>)
l = w^2: tensor([100.], grad_fn=<PowBackward0>)


Each `torch` tensor has a `requires_grad` attribute that controls whether operations on that tensor are tracked (in other words, if `requires_grad=True`, the tensor will be attached to the computational graph). \
If `requires_grad=True`, then we are able to compute the gradients with respect to that tensor.

To be more specific about the definition of a leaf node: a leaf node is a tensor that does not depend on any other (i.e. it is not the result of any operation). By convention PyTorch also reports tensors with `requires_grad=False` as leaves, but only leaves with `requires_grad=True` get their `.grad` populated, so those are the leaves we care about here.
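PyTorch exposes these notions through the `is_leaf` and `grad_fn` attributes. A small sketch (the tensor `c` is added here purely for illustration): note that, by PyTorch's convention, a tensor with `requires_grad=False` also reports `is_leaf=True`, although no gradient is ever stored for it.

```python
import torch

p = torch.tensor([1.0], requires_grad=True)  # created directly by the user
w = 10 * p                                   # result of an operation on p
c = torch.tensor([1.0])                      # not tracked by autograd at all

print(p.is_leaf, p.grad_fn)        # leaf: no grad_fn, since no operation created it
print(w.is_leaf, w.grad_fn)        # non-leaf: grad_fn records the multiplication
print(c.is_leaf, c.requires_grad)  # leaf by convention, but requires_grad is False
```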

In [3]:
p, _, _ = init_variables(scalar=1.0, requires_grad=False)
print('p:', p)
print('p.requires_grad: ', p.requires_grad)
p.requires_grad = True
print('p: ', p)
print('p.requires_grad: ', p.requires_grad)

p: tensor([1.])
p.requires_grad:  False
p:  tensor([1.], requires_grad=True)
p.requires_grad:  True


So... how to compute gradients with pytorch? Using the `.backward()` method!

`backward()` is a method of each tensor that calculates the gradient of the tensor it is called on wrt the graph leaves.

w.backward() ---> dw/dp 

**Graph leaves are ALWAYS the tensors that we will compute the gradients with respect to when calling .backward().**
 
Ok but then... where do we see the gradients???

The gradients are stored in the `.grad` attribute of the tensor we computed the gradient with respect to (so we will always want to access the `.grad` attribute of the leaf tensor).

w.backward() computes dw/dp and we see the result in p.grad

So to go on: Let's say we want to compute the gradient of `w` with respect to `p`. 

We would call `w.backward()` and then the gradient would be stored in `p.grad`.

In [4]:
p, w, _ = init_variables(scalar=1.0, requires_grad=True)

print('p: ',p)
print('w: ',w)
print('p.requires_grad: ',p.requires_grad)
print('w.requires_grad: ',w.requires_grad)

w.backward()
print('dw/dp = p.grad: ',p.grad)


p:  tensor([1.], requires_grad=True)
w:  tensor([10.], grad_fn=<MulBackward0>)
p.requires_grad:  True
w.requires_grad:  True
dw/dp = p.grad:  tensor([10.])


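**IMPORTANT NOTE FOR LATER**: .backward() has to be called on a scalar tensor (more precisely, on a tensor with a single element, unless an explicit `gradient` argument is passed).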

In this case all is scalar, but later we will see that we have to be careful with this.

What happens if I try to compute the gradient of `p` with respect to `p`?

In [5]:
p, *_ = init_variables(scalar=1.0, requires_grad=True)
p.backward()
print('dp/dp = p.grad: ',p.grad)

dp/dp = p.grad:  tensor([1.])


What happens if I compute `dw/dp` (like before) and try to access `w.grad`? (Note that w IS NOT a leaf tensor).

We know that calling w.backward() tells pytorch to differentiate w wrt the leaf node, i.e. it computes dw/dp, and that the result is stored in p.grad.


We might expect that accessing w.grad would show the result of dw/dw (which we are not directly asking for, but which is computed internally for the chain rule).

In [6]:
p, w, _ = init_variables(scalar=1.0, requires_grad=True)
w.backward()
print('dw/dp = p.grad: ',p.grad)
print('dw/dw = w.grad: ',w.grad)

dw/dp = p.grad:  tensor([10.])
dw/dw = w.grad:  None



In practice, the `.grad` attribute is "natively" accessible only for leaf tensors (i.e. that attribute is normally populated only for leaf tensors).

If we want to access the gradient of `w`, we need to call the `retain_grad()` method on `w` before calling any `backward()`. This way, when calling backward, the `.grad` attribute is populated also for the non-leaf tensor w.
Let's modify the `init_variables()` function.

In [7]:
def init_variables_retain_w(scalar: float = 1.0, requires_grad: bool = True) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    p = torch.tensor([scalar], requires_grad=requires_grad)
    w = 10*p
    w.retain_grad() # keep the gradient for w, which is not a leaf tensor since it depends on p
    l = w**2
    return p, w, l

In [8]:
p, w, _ = init_variables_retain_w(scalar=1.0, requires_grad=True)
w.backward()
print('dw/dp = p.grad: ',p.grad)
print('dw/dw = w.grad: ',w.grad)

dw/dp = p.grad:  tensor([10.])
dw/dw = w.grad:  tensor([1.])


Now let's work also with l.

In [9]:
p, w, l = init_variables(scalar=1.0, requires_grad=True)
print('p: ',p, 'w: ',w, 'l: ',l)


p:  tensor([1.], requires_grad=True) w:  tensor([10.], grad_fn=<MulBackward0>) l:  tensor([100.], grad_fn=<PowBackward0>)


If you call backward on l, this is equivalent to which operation?

In [10]:
l.backward()
print('dl/dp = p.grad: ',p.grad)

dl/dp = p.grad:  tensor([200.])


### Be careful with accumulation!!!
So far so good... but... \
**Note**: The `backward()` method ALWAYS accumulates the gradients in the `.grad` attribute of the leaves. \
This means that if we call `backward()` multiple times, the gradients will be accumulated in the `.grad` attribute. \
Let's try to call `backward()` twice, first on `l` and then on `w`, and see what happens.

In [11]:
p, w, l = init_variables(scalar=1.0, requires_grad=True)
l.backward()
w.backward()
# w --> p
# note: the error does not depend on the order of the two calls

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

Pytorch frees computational graphs after a backward pass by default!! 
This means that once the first call to .backward() returns, the computational graph no longer exists!!!

If we want to call .backward() multiple times, we must keep the computational graph by passing retain_graph=True.

This tells PyTorch: "Don't free the graph after the first backward pass."
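For instance, the failing pair of calls from the previous cell goes through once the first call keeps the graph. This is only a sketch that rebuilds `p`, `w` and `l` inline instead of using `init_variables`; it also shows the two results being added up in `p.grad`.

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
w = 10 * p
l = w ** 2

l.backward(retain_graph=True)   # keep the graph alive for the next call
grad_after_l = p.grad.clone()   # dl/dp = 200p = 200
w.backward()                    # adds dw/dp = 10 on top of the stored value
grad_after_w = p.grad.clone()

print(grad_after_l)  # tensor([200.])
print(grad_after_w)  # tensor([210.])
```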

In [12]:
p, w, l = init_variables(scalar=1.0, requires_grad=True)
w.backward(retain_graph=True)
print('dw/dp = p.grad: ',p.grad) # add 10 to p.grad
w.backward()
print('dw/dp = p.grad: ',p.grad) # add 10 again to p.grad: results=20

dw/dp = p.grad:  tensor([10.])
dw/dp = p.grad:  tensor([20.])


But now we get a wrong result! This occurs because the gradients of the two calls are accumulated in `p.grad`.

We need to zero the gradient to get the correct result. \
To zero out the gradient, we can call the in-place `zero_()` method on `p.grad` (or reset it with `p.grad = None`).

In [13]:
p, w, _ = init_variables(scalar=1.0, requires_grad=True)
w.backward(retain_graph=True)
print('dw/dp = p.grad: ',p.grad)
p.grad.zero_()
#p.grad = None
#p.grad = torch.tensor([0.0])
w.backward()
print('dw/dp = p.grad: ',p.grad)

dw/dp = p.grad:  tensor([10.])
dw/dp = p.grad:  tensor([10.])


Note that in general you will NEVER be in a situation like this.

What typically happens is that:
1. you have some parameters (the leaves, our p)
2. you do some intermediate operations (but you "don't see" these operations when computing gradients!) like our w
3. you end up with one number (our l, which will be the loss), and you backpropagate on the loss (you see this? l.backward() is exactly dl/dp, the derivative of the loss wrt the parameters 😇)

Mini exercise: extract dl/dw.

In [14]:
p, w, l = init_variables_retain_w(scalar=3.2)
l.backward()
w.grad

tensor([64.])

### `.detach()` and `.clone()`


Now we see 2 important operations.

`detach()`, which is used to detach a tensor from the computational graph. It returns a new tensor that shares the same data but does not require gradients anymore. The returned tensor also has an empty `.grad` attribute, even if the original had one.

In [15]:
p, w, _ = init_variables(scalar=1.0, requires_grad=True)
print('w: ',w)
w = w.detach()
print('w: ',w)
w.requires_grad = True
print('p: ',p)
print('w: ',w)
print('p.requires_grad: ',p.requires_grad)
print('w.requires_grad: ',w.requires_grad)


w:  tensor([10.], grad_fn=<MulBackward0>)
w:  tensor([10.])
p:  tensor([1.], requires_grad=True)
w:  tensor([10.], requires_grad=True)
p.requires_grad:  True
w.requires_grad:  True


In [16]:
p, w, _ = init_variables(scalar=1.0, requires_grad=True)
w.backward()
print(p.grad)
p = p.detach()
print(p.grad)

tensor([10.])
None
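Note that `detach()` leaves the original tensor untouched: in the cell above `p.grad` prints `None` only because the name `p` was rebound to the detached tensor. A small sketch that keeps both tensors around (the name `p_detached` is introduced here for illustration):

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
w = 10 * p
w.backward()

p_detached = p.detach()

print(p.grad)                                 # the original still holds dw/dp
print(p_detached.grad)                        # the detached tensor has no gradient
print(p_detached.requires_grad)               # False
print(p_detached.data_ptr() == p.data_ptr())  # True: both share the same memory
```

Because the memory is shared, an in-place change to `p_detached` would also change the values seen through `p`.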


`clone()` is used to clone a tensor. The new tensor has same data but is stored in a **different** memory location.

If the original tensor has `requires_grad=True`, the clone will also require gradients and be tracked by autograd.

Changes to the cloned tensor do not modify the original tensor and vice versa.

The clone doesn't inherit the `.grad` attribute of the original tensor.

In [17]:
p, w, _ = init_variables(scalar=1.0, requires_grad=True)
w.backward()
p_clone = p.clone()
print(p.grad)
print(p_clone.grad)
print(p.requires_grad)
print(p_clone.requires_grad)

tensor([10.])
None
True
True
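Being "tracked by autograd" means the clone is a non-leaf tensor whose gradients flow back to the original. A minimal sketch, where the factor 3 is an arbitrary choice for illustration:

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
p_clone = p.clone()    # differentiable copy: non-leaf, with its own grad_fn

out = 3 * p_clone
out.backward()         # the gradient passes through the clone back to p

print(p_clone.is_leaf)                     # False
print(p.grad)                              # tensor([3.])
print(p_clone.data_ptr() == p.data_ptr())  # False: separate memory
```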



### So far should be ok:
0. `.requires_grad`: attach a tensor to the computational graph (i.e. allow for automatic computation of derivatives)
1. `tensor.backward()`: computes d tensor/ d leaf node
2. `leaf.grad`: stores the results of calling backward(), i.e. stores d tensor/ d leaf node
3. `retain_graph` and `tensor.retain_grad()`: advanced methods to do stuff that probably you will never have to do
4. How to zero the gradient

## Example with tensors

Let's define the following: $\\ p = (2,2,2,2) \\ w = p^2$ \
and a  function $$ \ell(p,w) = \sum_{i=1}^{4} (p_i - w_i)^2 = \sum_{i=1}^{4} (p_i - p_i^2)^2 = \ell(p). $$
Hence $$ \frac{\partial \ell}{\partial p_i} = 2(p_i-p_i^2)(1-2p_i) = 4p_i^3-6p_i^2+2p_i $$

In particular:
$$ \nabla_p \ell = (\frac{\partial \ell}{\partial p_1}, \frac{\partial \ell}{\partial p_2}, \frac{\partial \ell}{\partial p_3}, \frac{\partial \ell}{\partial p_4}) = (4p_1^3-6p_1^2+2p_1, 4p_2^3-6p_2^2+2p_2, 4p_3^3-6p_3^2+2p_3, 4p_4^3-6p_4^2+2p_4)$$

Note also that:
$$ \frac{\partial \ell}{\partial w_i} = -2(p_i-w_i) = -2(p_i-p_i^2)$$

In [18]:
p = torch.tensor([2.0, 2.0,2.0, 2.0], requires_grad=True)
w = p**2

# we define p and w just like before, in a tensor fashion now.
print('p:', p, end='\n')
print('w = p^2:', w, end='\n\n')

p: tensor([2., 2., 2., 2.], requires_grad=True)
w = p^2: tensor([4., 4., 4., 4.], grad_fn=<PowBackward0>)



If we try to call `w.backward()`, we get an error since w is not a scalar and backward has to be called on scalar tensor!

In [19]:
w.backward()

RuntimeError: grad can be implicitly created only for scalar outputs
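A scalar is only required when `backward()` has to create the starting gradient implicitly. As a sketch of two ways around the error (neither is used in the rest of this notebook): pass an explicit `gradient` tensor with the same shape as `w`, or reduce `w` to a scalar first. With a tensor of ones, both compute the gradient of `w.sum()`.

```python
import torch

p = torch.tensor([2.0, 2.0, 2.0, 2.0], requires_grad=True)
w = p ** 2

# Option 1: explicit starting gradient, same shape as w.
w.backward(gradient=torch.ones_like(w), retain_graph=True)
grad_explicit = p.grad.clone()

# Option 2: reduce to a scalar first (reset p.grad to avoid accumulation).
p.grad = None
w.sum().backward()
grad_sum = p.grad.clone()

print(grad_explicit)  # d(sum_i p_i^2)/dp_i = 2 p_i
print(grad_sum)
```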

In [20]:
def loss_fn(p: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    tmp = (p-w)**2
    return tmp.sum()

In [21]:
p = torch.tensor([2.0, 2.0,2.0, 2.0], requires_grad=True)
w = p**2
l = loss_fn(p,w)
# now we can call backward on l, meaning we are computing dl/dp
print('loss:', l, end='\n\n')

print('grad of p before calling l.backward() --> p.grad:', p.grad, end='\n\n')

# backward
l.backward()
print('grad of p after calling l.backward() on loss --> p.grad:', p.grad, end='\n\n')

loss: tensor(16., grad_fn=<SumBackward0>)

grad of p before calling l.backward() --> p.grad: None

grad of p after calling l.backward() on loss --> p.grad: tensor([12., 12., 12., 12.])



Note that if p_i=2, $4*2^3 - 6*2^2 +2*2 = 32 - 24 + 4 = 12$
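The same check can be done numerically for other values of $p$. A small sketch, where the test values `[0.5, 1.0, 2.0, 3.0]` are chosen here for illustration, comparing autograd with the closed-form expression $4p_i^3-6p_i^2+2p_i$:

```python
import torch

def loss_fn(p: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return ((p - w) ** 2).sum()

p = torch.tensor([0.5, 1.0, 2.0, 3.0], requires_grad=True)
l = loss_fn(p, p ** 2)
l.backward()

# closed-form gradient from the derivation, evaluated without tracking
with torch.no_grad():
    analytic = 4 * p**3 - 6 * p**2 + 2 * p

print(p.grad)
print(analytic)
print(torch.allclose(p.grad, analytic))
```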

Again, the grad of w after calling l.backward() is not directly accessible, since w is not a leaf

In [22]:
print(w.grad)

None



So what about dl/dw?

In [23]:
#Note: what about dl/dw?
p = torch.tensor([2.0, 2.0,2.0, 2.0], requires_grad=True)
w = p**2
w.retain_grad()
l = loss_fn(p,w)
l.backward()
print(p.grad)
print(w.grad) # -2 (p_i - p_i^2)

tensor([12., 12., 12., 12.])
tensor([4., 4., 4., 4.])


### So far, 
1. take some tensor p
2. do operations on it
3. compute a scalar number (the loss)
4. compute dl/dp.


Now it's time to change p according to dl/dp!!!

## SGD optimizer

To do so, in pytorch we use the optimizer. In particular, we'll see two fundamental methods.

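`.zero_grad()` --> resets the `.grad` attribute of all the tensors handed to the optimizer, i.e. our leaves (recent PyTorch versions reset it to `None` by default rather than to a tensor of zeros) \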
`.step()` --> updates the leaf `p` by `p_new = p - lr * p.grad` where `lr` is the learning rate.




Let's go back to the previous example, changing a little bit numbers.

Let's define the following: $\\ p = (1,3,2,2) \\ w = p^2$ 
$$ \ell(p,w) = \sum_{i=1}^{4} (p_i - w_i)^2 = \sum_{i=1}^{4} (p_i - p_i^2)^2 = \ell(p). $$
We already know that $$ \frac{\partial \ell}{\partial p_i} = 2(p_i-p_i^2)(1-2p_i) = 4p_i^3-6p_i^2+2p_i $$
and we know that the content of `p.grad` is simply:
$$ \nabla_p \ell = (\frac{\partial \ell}{\partial p_1}, \frac{\partial \ell}{\partial p_2}, \frac{\partial \ell}{\partial p_3}, \frac{\partial \ell}{\partial p_4}) = (4p_1^3-6p_1^2+2p_1, 4p_2^3-6p_2^2+2p_2, 4p_3^3-6p_3^2+2p_3, 4p_4^3-6p_4^2+2p_4)$$
which in our case is:
$$p.grad = \nabla_p \ell = (4-6+2, 108-54+6, 32-24+4, 32-24+4) = (0, 60, 12, 12)$$

When we call `optimizer.step()` with learning rate $\eta$, we get:
$$p_{i, next} = p_{i,before} - \eta * \frac{\partial \ell}{\partial p_i} = p_{i,before} - \eta * (4p_i^3-6p_i^2+2p_i)$$

Putting numbers together, we get:
$$p_{1, next} = 1 - \eta * 0 = 1, \\
p_{2, next} = 3 - \eta * 60 = 3 - 60\eta, \\
p_{3, next} = 2 - \eta * 12 = 2 - 12\eta, \\
p_{4, next} = 2 - \eta * 12 = 2 - 12\eta.$$


In [24]:

p = torch.tensor([1.0, 3.0, 2.0, 2.0], requires_grad=True)
w = p**2

print('p:', p)
print('w = p^2:', w)

l = loss_fn(p,w)

print('loss:', l)

print('grad of p before calling backward() --> p.grad:', p.grad)
print('define lr=1')
opt = torch.optim.SGD([p], lr=1)
opt.zero_grad()
# backward
l.backward()
print('grad of p after calling backward() on loss --> p.grad:', p.grad)


opt.step()
print('p after calling opt.step() --> p:', p)

p: tensor([1., 3., 2., 2.], requires_grad=True)
w = p^2: tensor([1., 9., 4., 4.], grad_fn=<PowBackward0>)
loss: tensor(44., grad_fn=<SumBackward0>)
grad of p before calling backward() --> p.grad: None
define lr=1
grad of p after calling backward() on loss --> p.grad: tensor([ 0., 60., 12., 12.])
p after calling opt.step() --> p: tensor([  1., -57., -10., -10.], requires_grad=True)
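To double-check that `opt.step()` really applies `p_new = p - lr * p.grad`, here is a sketch that computes the update by hand before stepping and compares. It assumes plain `torch.optim.SGD` with its default settings (no momentum, no weight decay); the name `expected` is introduced here for illustration.

```python
import torch

lr = 1.0
p = torch.tensor([1.0, 3.0, 2.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([p], lr=lr)

l = ((p - p ** 2) ** 2).sum()
opt.zero_grad()
l.backward()

# what the update rule predicts; this builds a new tensor and leaves p untouched
expected = p.detach() - lr * p.grad

opt.step()
print(expected)
print(torch.allclose(p.detach(), expected))
```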


## Towards NN

Now we define a function that takes an input and also depends on some parameters.
Then, it combines the input with the parameters and returns an output.

Note that this is exactly what we do with NN: we have some parameters (the weights and biases) and we have some input (the data) and we get an output.

In [25]:
dim = 2
def a_function(input_tensor: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    return input_tensor*p

torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
print('p:', p)

p: tensor([ 1.5410, -0.2934], requires_grad=True)


Now we define a dataset of 100 random points in 2d.
This will be our dataset.

In [26]:
torch.manual_seed(0)
dataset = torch.randn([100, 2])
dataset[0]

tensor([-1.1258, -1.1524])

**CRUCIAL point** of deep learning:

we will define the task we want to solve using our "function", and based on this we define a proper loss function to achieve our task.

The task is simple: the function has to learn how to multiply numbers by 2.

So the loss will be defined according to this task.


In [27]:
def loss(input: torch.Tensor, output: torch.Tensor):
    """
    input: will be a datapoint / a batch of datapoints
    output: will be the output of the function, ie output = a_function(input, p)
    """
    pass

In [28]:
def loss(input: torch.Tensor, output: torch.Tensor) -> torch.Tensor:
    square_diff = (output - 2*input)**2
    return square_diff.mean(-1).mean()

Note that if we learn the task perfectly, then the loss will be 0.
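As a quick sanity check, here is a sketch that evaluates the loss with the "perfect" parameters $p = (2, 2)$ and with an arbitrary wrong choice. The names `p_perfect` and `p_off` and the 5 random points are assumptions made here for illustration.

```python
import torch

def a_function(input_tensor: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    return input_tensor * p

def loss(input: torch.Tensor, output: torch.Tensor) -> torch.Tensor:
    square_diff = (output - 2 * input) ** 2
    return square_diff.mean(-1).mean()

torch.manual_seed(0)
data = torch.randn([5, 2])

p_perfect = torch.tensor([2.0, 2.0])  # multiplies every coordinate by 2
p_off = torch.tensor([1.0, 1.0])      # leaves the input unchanged

loss_perfect = loss(data, a_function(data, p_perfect))
loss_off = loss(data, a_function(data, p_off))

print(loss_perfect)  # zero: the task is solved exactly
print(loss_off)      # strictly positive
```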

Let's use what we know about SGD optimizer to solve this task.

In [29]:
torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
dataset = torch.randn([100, 2])
opt = torch.optim.SGD([p], lr=0.01)

#TODO:
# 1. play with the lr and the dimension of the dataset (here it is 100)
# 2. print things so that you familiarize yourself with the code

for data in dataset:
    # one update per each datapoint!!!
    #print(data)
    #break
    output = a_function(input_tensor=data, p=p)
    #print(output.size())
    #break
    l = loss(data, output)
    opt.zero_grad()
    l.backward()
    opt.step()

print('final parameters:', p)


final parameters: tensor([1.8175, 1.1722], requires_grad=True)


We can also do the calculations using all dataset at once! Let's see!

In [30]:
torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
dataset = torch.randn([100, 2])
opt = torch.optim.SGD([p], lr=0.01)
print('initial parameters:', p)

output = a_function(input_tensor=dataset, p=p)
print('output:', output.size())
l = loss(dataset, output)
print(l.size())
opt.zero_grad()
l.backward() # only one update!!!!
opt.step()

print('final parameters:', p)


initial parameters: tensor([ 1.5410, -0.2934], requires_grad=True)
output: torch.Size([100, 2])
torch.Size([])
final parameters: tensor([ 1.5452, -0.2704], requires_grad=True)


So we can either do one update (`l.backward()` + `opt.step()`) for each datapoint (100 updates), or we can take the whole dataset at once and do one update (1 update).

Each time we pass through **all** the dataset, we call it an **epoch**.

Now we see both cases with epochs.

Note!!! In the first case (one update per datapoint), in the end we will do 100 x num_epochs updates.

In the second case (one update per epoch), in the end we will do num_epochs updates.

In [31]:
torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
dataset = torch.randn([100, 2])
opt = torch.optim.SGD([p], lr=0.01)
n_updates = 0
print('initial parameters:', p)

for epoch in range(10):
    for data in dataset:
        # always updating p for each datapoint
        output = a_function(input_tensor=data, p=p)
        l = loss(data, output)
        opt.zero_grad()
        l.backward()
        opt.step()
        n_updates += 1

    print(f'epoch {epoch}: ', p)

print('final parameters:', p)
print('n_updates:', n_updates)


initial parameters: tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 0:  tensor([1.8175, 1.1722], requires_grad=True)
epoch 1:  tensor([1.9274, 1.7012], requires_grad=True)
epoch 2:  tensor([1.9711, 1.8922], requires_grad=True)
epoch 3:  tensor([1.9885, 1.9611], requires_grad=True)
epoch 4:  tensor([1.9954, 1.9860], requires_grad=True)
epoch 5:  tensor([1.9982, 1.9949], requires_grad=True)
epoch 6:  tensor([1.9993, 1.9982], requires_grad=True)
epoch 7:  tensor([1.9997, 1.9993], requires_grad=True)
epoch 8:  tensor([1.9999, 1.9998], requires_grad=True)
epoch 9:  tensor([2.0000, 1.9999], requires_grad=True)
final parameters: tensor([2.0000, 1.9999], requires_grad=True)
n_updates: 1000


In [32]:
torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
dataset = torch.randn([100, 2])
opt = torch.optim.SGD([p], lr=0.1)
print('initial parameters:', p)
n_updates = 0
for epoch in range(100):
    # one update only!!!
    output = a_function(input_tensor=dataset, p=p)
    l = loss(dataset, output)
    opt.zero_grad()
    l.backward()
    opt.step()
    n_updates += 1
    print(f'epoch {epoch}: ', p)

print('final parameters:', p)
print('n_updates:', n_updates)

initial parameters: tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 0:  tensor([ 1.5828, -0.0630], requires_grad=True)
epoch 1:  tensor([1.6209, 0.1442], requires_grad=True)
epoch 2:  tensor([1.6554, 0.3307], requires_grad=True)
epoch 3:  tensor([1.6868, 0.4984], requires_grad=True)
epoch 4:  tensor([1.7154, 0.6492], requires_grad=True)
epoch 5:  tensor([1.7413, 0.7849], requires_grad=True)
epoch 6:  tensor([1.7649, 0.9070], requires_grad=True)
epoch 7:  tensor([1.7863, 1.0168], requires_grad=True)
epoch 8:  tensor([1.8058, 1.1156], requires_grad=True)
epoch 9:  tensor([1.8235, 1.2044], requires_grad=True)
epoch 10:  tensor([1.8396, 1.2843], requires_grad=True)
epoch 11:  tensor([1.8542, 1.3562], requires_grad=True)
epoch 12:  tensor([1.8675, 1.4209], requires_grad=True)
epoch 13:  tensor([1.8796, 1.4791], requires_grad=True)
epoch 14:  tensor([1.8905, 1.5314], requires_grad=True)
epoch 15:  tensor([1.9005, 1.5785], requires_grad=True)
epoch 16:  tensor([1.9096, 1.6208], requires_grad=True)

#### Almost mini-batches :)
So we see that we can do one update per each datapoint, or one update using all the dataset at once: then, we can iterate this process for some epochs.

Actually, there is an intermediate way: we can use a batch of datapoints to do one update. This is the most common way to do it in practice.

We'll see this in the next lesson.
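As a preview, a rough sketch of what such a loop might look like with the same `a_function` and `loss`. The batch size of 10, the learning rate and the plain slicing (no shuffling) are assumptions made here for illustration, not necessarily what the next lesson uses.

```python
import torch

def a_function(input_tensor: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    return input_tensor * p

def loss(input: torch.Tensor, output: torch.Tensor) -> torch.Tensor:
    return ((output - 2 * input) ** 2).mean(-1).mean()

torch.manual_seed(0)
p = torch.randn([2], requires_grad=True)
dataset = torch.randn([100, 2])
opt = torch.optim.SGD([p], lr=0.1)
batch_size = 10  # hypothetical choice

n_updates = 0
for epoch in range(10):
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]  # shape (batch_size, 2)
        l = loss(batch, a_function(batch, p))
        opt.zero_grad()
        l.backward()
        opt.step()                                 # one update per batch
        n_updates += 1

print('final parameters:', p)
print('n_updates:', n_updates)  # 10 batches per epoch x 10 epochs
```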

#### Digression on broadcasting in Pytorch

Sorry for being messy during this part today 😅

I'll try to explain again why both the `a_function` and `loss` functions work with both single points and the whole dataset.

```
def a_function(input_tensor, p):
    return input_tensor*p

def loss(input, output):
    return (output - 2*input).pow(2).mean(-1).mean()
```

1. `p` will **always** (in this example) be a tensor of shape `(2,)` (e.g. $[1.0, 1.0]$).

2. If we process a single point, `input_tensor` will be a tensor of shape `(2,)` (e.g. $[0.5, 2.0]$), \
and `output` will be a tensor of shape `(2,)` (e.g. $[1.0*0.5, 1.0*2.0] = [0.5, 2.0]$).\
When we compute the loss, `output - 2*input` will be a tensor of shape `(2,)` (e.g. $[0.5 - 2*0.5, 2.0 - 2*2.0] = [0.5 - 1.0, 2.0 - 4.0] = [-0.5, -2.0]$).\
Then, the  `.pow(2)` is not a problem, we simply get $[0.25, 4.0]$.\
Then, the `.mean(-1)` will take the mean over the last dimension, which **IN THIS CASE** is the only dimension, so we get the single number $2.125$ (a tensor with no dimensions left).\
Then, the `.mean()` will take the mean of that single element, so we still get 2.125. \
**!!!! In this case the last `.mean()` is not necessary, but it is there to make the function work with both single points and the whole dataset.**

3. Now, if we process the whole dataset, `input_tensor` will have a shape `(N, 2)`, where N is the number of data points.\
Since `p` is a tensor of shape `(2,)`, the broadcasting will make the multiplication element-wise: so it's like multiplying each row of `input_tensor` (which is now the entire dataset!) by `p`.\
Hence the output will be a tensor of shape `(N, 2)` (where each original point is multiplied element-wise by `p`).\
So in this case, `input=input_tensor=dataset` and `output` are tensors of shape `(N, 2)`.\
The subtraction `output - 2*input` is also broadcasted element-wise, preserving the shape `(N, 2)`.\
The `.pow(2)` operation is also applied element-wise, still keeping the shape `(N, 2)`.\
The `.mean(-1)` computes the mean along the last dimension (dim=-1), reducing the shape to (N,) (i.e., **we now have a single loss value for each data point! So we are computing the right loss for each datapoint, all at once!!! Think about it, this is the logic behind batch-based training.**).\
The final `.mean()` takes the mean over all N points, producing a single scalar loss value.
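The shapes described above can be checked directly. A small sketch using the same numbers as the single-point walkthrough; the intermediate names `sq_point` and `sq_data` are introduced here for illustration.

```python
import torch

def a_function(input_tensor: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    return input_tensor * p

p = torch.tensor([1.0, 1.0])

# single point: shape (2,)
point = torch.tensor([0.5, 2.0])
sq_point = (a_function(point, p) - 2 * point).pow(2)
single_loss = sq_point.mean(-1).mean()
print(sq_point.shape, sq_point.mean(-1).shape, single_loss)

# whole dataset: shape (N, 2), p is broadcast across the rows
dataset = torch.randn([100, 2])
sq_data = (a_function(dataset, p) - 2 * dataset).pow(2)
per_point = sq_data.mean(-1)      # one loss value per datapoint
total_loss = per_point.mean()     # single scalar
print(sq_data.shape, per_point.shape, total_loss.shape)
```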



#### A small analysis on accumulating gradients

In [33]:
torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
dataset = torch.randn([100, 2])
opt = torch.optim.SGD([p], lr=0.01)

output = a_function(input_tensor=dataset, p=p)
l = loss(dataset, output)
opt.zero_grad()
l.backward()
print(p.grad)


tensor([-0.4184, -2.3040])


In [34]:
torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
dataset = torch.randn([100, 2])
opt = torch.optim.SGD([p], lr=0.01)

opt.zero_grad()
for data in dataset:
    output = a_function(input_tensor=data, p=p)
    l = loss(data, output)
    l.backward() # this populates the grad attribute of p, accumulating the gradients

print(p.grad)
print(p.grad/len(dataset))


tensor([ -41.8369, -230.4015])
tensor([-0.4184, -2.3040])


In [35]:
torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
dataset = torch.randn([100, 2])
opt = torch.optim.SGD([p], lr=0.01)

opt.zero_grad()
loss_total = 0
for data in dataset:
    output = a_function(input_tensor=data, p=p)
    l = loss(data, output)
    loss_total += l

loss_total = loss_total/len(dataset)
loss_total.backward()
print(p.grad)


tensor([-0.4184, -2.3040])


In [36]:
torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
dataset = torch.randn([100, 2])
opt = torch.optim.SGD([p], lr=0.01)

opt.zero_grad()
loss_total = 0
for data in dataset:
    output = a_function(input_tensor=data, p=p)
    l = loss(data, output)
    loss_total += l

loss_total.backward()
print(p.grad)
print(p.grad/len(dataset))


tensor([ -41.8369, -230.4015])
tensor([-0.4184, -2.3040])
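The factor of 100 between the outputs above follows from the linearity of the gradient. As a short derivation, writing $\ell_i$ for the loss of the $i$-th datapoint and $N$ for the dataset size (notation introduced here for illustration), the full-batch loss is the mean of the $\ell_i$, hence:

$$ \nabla_p \left( \frac{1}{N} \sum_{i=1}^{N} \ell_i \right) = \frac{1}{N} \sum_{i=1}^{N} \nabla_p \ell_i $$

Calling `backward()` on each $\ell_i$ in turn accumulates $\sum_{i} \nabla_p \ell_i$ in `p.grad`, and so does calling it once on the summed loss. Both are $N$ times the full-batch gradient, which is why dividing by `len(dataset)` recovers it.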


#### Exercise: write your own SGD optimizer

In [37]:
torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
lr = 0.001

for epoch in range(100):
    for data in dataset:
        output = a_function(input_tensor=data, p=p)
        l = loss(data, output)

        p.grad = torch.zeros_like(p)  # reset gradients
        l.backward()
        
        with torch.no_grad():  # prevents torch from tracking the update
            p -= lr * p.grad  

    print(f'epoch {epoch}: ', p)


epoch 0:  tensor([ 1.5810, -0.0739], requires_grad=True)
epoch 1:  tensor([1.6176, 0.1246], requires_grad=True)
epoch 2:  tensor([1.6509, 0.3041], requires_grad=True)
epoch 3:  tensor([1.6814, 0.4664], requires_grad=True)
epoch 4:  tensor([1.7092, 0.6131], requires_grad=True)
epoch 5:  tensor([1.7345, 0.7459], requires_grad=True)
epoch 6:  tensor([1.7577, 0.8659], requires_grad=True)
epoch 7:  tensor([1.7788, 0.9745], requires_grad=True)
epoch 8:  tensor([1.7981, 1.0726], requires_grad=True)
epoch 9:  tensor([1.8157, 1.1614], requires_grad=True)
epoch 10:  tensor([1.8318, 1.2416], requires_grad=True)
epoch 11:  tensor([1.8464, 1.3142], requires_grad=True)
epoch 12:  tensor([1.8598, 1.3798], requires_grad=True)
epoch 13:  tensor([1.8721, 1.4392], requires_grad=True)
epoch 14:  tensor([1.8832, 1.4929], requires_grad=True)
epoch 15:  tensor([1.8934, 1.5414], requires_grad=True)
epoch 16:  tensor([1.9027, 1.5853], requires_grad=True)
epoch 17:  tensor([1.9112, 1.6250], requires_grad=True)
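Why the `with torch.no_grad():` block in the solution above? A sketch of what happens without it, on a toy tensor chosen here for illustration: autograd refuses an in-place update of a leaf that requires gradients, while the same update inside `no_grad` goes through and keeps `p` a leaf.

```python
import torch

p = torch.tensor([1.0, 2.0], requires_grad=True)
(p ** 2).sum().backward()   # p.grad = 2p = [2., 4.]
lr = 0.1

raised = False
try:
    p -= lr * p.grad        # in-place update that autograd would have to track
except RuntimeError as err:
    raised = True
    print('RuntimeError:', err)

with torch.no_grad():
    p -= lr * p.grad        # same update, hidden from autograd

print(raised)
print(p)                    # updated in place, still a leaf with requires_grad=True
```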


In [38]:
# this is the solution given by ChatGPT when I asked to solve the problem... a little bit too messy
torch.manual_seed(0)
dim = 2  
lr = 0.001  
p = torch.randn([dim], requires_grad=True)

for epoch in range(100):
    for data in dataset:
        output = a_function(input_tensor=data, p=p)
        l = loss(data, output)
        l.backward()

        p_new = (p - lr * p.grad).clone().detach().requires_grad_(True)
        p.grad.zero_()  
        p = p_new  

    print(f'epoch {epoch}: ', p)


epoch 0:  tensor([ 1.5810, -0.0739], requires_grad=True)
epoch 1:  tensor([1.6176, 0.1246], requires_grad=True)
epoch 2:  tensor([1.6509, 0.3041], requires_grad=True)
epoch 3:  tensor([1.6814, 0.4664], requires_grad=True)
epoch 4:  tensor([1.7092, 0.6131], requires_grad=True)
epoch 5:  tensor([1.7345, 0.7459], requires_grad=True)
epoch 6:  tensor([1.7577, 0.8659], requires_grad=True)
epoch 7:  tensor([1.7788, 0.9745], requires_grad=True)
epoch 8:  tensor([1.7981, 1.0726], requires_grad=True)
epoch 9:  tensor([1.8157, 1.1614], requires_grad=True)
epoch 10:  tensor([1.8318, 1.2416], requires_grad=True)
epoch 11:  tensor([1.8464, 1.3142], requires_grad=True)
epoch 12:  tensor([1.8598, 1.3798], requires_grad=True)
epoch 13:  tensor([1.8721, 1.4392], requires_grad=True)
epoch 14:  tensor([1.8832, 1.4929], requires_grad=True)
epoch 15:  tensor([1.8934, 1.5414], requires_grad=True)
epoch 16:  tensor([1.9027, 1.5853], requires_grad=True)
epoch 17:  tensor([1.9112, 1.6250], requires_grad=True)


In [39]:
# this is my solution... asking ChatGPT isn't always the simplest solution :)
torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
lr = 0.001
for epoch in range(100):
    for data in dataset:
        output = a_function(input_tensor=data, p=p)
        l = loss(data, output)

        p.grad = torch.zeros_like(p)
        l.backward()
        
        p_grad = p.grad
        p = p.detach()
        p = p - lr*p_grad
        p.requires_grad = True
        

    print(f'epoch {epoch}: ', p)

epoch 0:  tensor([ 1.5810, -0.0739], requires_grad=True)
epoch 1:  tensor([1.6176, 0.1246], requires_grad=True)
epoch 2:  tensor([1.6509, 0.3041], requires_grad=True)
epoch 3:  tensor([1.6814, 0.4664], requires_grad=True)
epoch 4:  tensor([1.7092, 0.6131], requires_grad=True)
epoch 5:  tensor([1.7345, 0.7459], requires_grad=True)
epoch 6:  tensor([1.7577, 0.8659], requires_grad=True)
epoch 7:  tensor([1.7788, 0.9745], requires_grad=True)
epoch 8:  tensor([1.7981, 1.0726], requires_grad=True)
epoch 9:  tensor([1.8157, 1.1614], requires_grad=True)
epoch 10:  tensor([1.8318, 1.2416], requires_grad=True)
epoch 11:  tensor([1.8464, 1.3142], requires_grad=True)
epoch 12:  tensor([1.8598, 1.3798], requires_grad=True)
epoch 13:  tensor([1.8721, 1.4392], requires_grad=True)
epoch 14:  tensor([1.8832, 1.4929], requires_grad=True)
epoch 15:  tensor([1.8934, 1.5414], requires_grad=True)
epoch 16:  tensor([1.9027, 1.5853], requires_grad=True)
epoch 17:  tensor([1.9112, 1.6250], requires_grad=True)


In [40]:
# a possible WRONG way to do it
torch.manual_seed(0)
p = torch.randn([dim], requires_grad=True)
lr = 0.001
for epoch in range(100):
    for data in dataset:
        p_new = p.clone() # p.clone() is a non-leaf copy that is still connected to the graph, so l.backward() does populate p.grad
        output = a_function(input_tensor=data, p=p_new)
        l = loss(data, output)

        p.grad = torch.zeros([2])
        l.backward()
        p_new = p - lr*p.grad # the update lands in p_new, which is overwritten at the next iteration: p itself is never reassigned
        

    print(f'epoch {epoch}: ', p)

epoch 0:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 1:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 2:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 3:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 4:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 5:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 6:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 7:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 8:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 9:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 10:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 11:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 12:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 13:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 14:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 15:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 16:  tensor([ 1.5410, -0.2934], requires_grad=True)
epoch 17:  tensor([ 1.54