While working on challenge 15 I realized I'm confused about which tensors need to be available and allocated for back prop. I'm going to do back prop by hand to build my intuition.

Let me start with an example that uses scalars only and only one piece of data.

```
x = 10
y = 20

and my model is a function f parameterized by a single weight w: f(w) = w*x

To keep in mind that x and y are fixed and we're interested in the derivative of the loss with respect to w (so we know how to adjust it), I'm going to write x and y as 10 and 20.

f(w) = 10 * w             this is our model
error(w) = 20 - f(w)          
loss(w) = square(error(w)) 

Suppose w is 5

I'll first calculate the derivative of loss with respect to w the way I learned in HS:

loss(w) = (20 - 10w)^2
loss(w) = 100w^2 - 400w + 400
loss'(w) = 200w - 400

loss'(5) = 600           this tells us increasing w will increase the loss

Now I want to calculate the way I think back propagation works.

Forward pass:

f(w) = 50
error(w) = 20 - f(w) = -30
loss(w) = square(error(w)) = square(-30) = 900

Backward pass:

loss'(w) = square'(error(w)) * error'(w) = 2(error(w)) * error'(w) = -60 *
           error'(w) = -f'(w) =  -1 *
           f'(w) = 10
           = 600

or maybe it's clearer written like this:

loss'(w) = square'(error(w)) * error'(w)
         = 2(error(w)) * (-f'(w))
         = 2(-30) * (-10)
         = -60 * -10
         = 600

calc rules that come into play here:
d/dx f(g(x)) = f'(g(x)) * g'(x)
d/dx x^2 = 2x
d/dx (c + f(x)) = f'(x), and likewise d/dx (c - f(x)) = -f'(x)
d/dx cx = c
```
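As a sanity check on the hand derivative, here is a minimal sketch that estimates the slope of the loss at w = 5 numerically and compares it with 200w - 400. The helper name `central_difference` and the step size `h` are my own choices for illustration, not anything from torch.

```python
def loss(w, x=10.0, y=20.0):
    """Squared error of the one-weight model f(w) = w * x."""
    return (y - w * x) ** 2


def central_difference(fn, w, h=1e-3):
    """Approximate fn'(w) by the slope between w - h and w + h."""
    return (fn(w + h) - fn(w - h)) / (2 * h)


numeric_grad = central_difference(loss, 5.0)
analytic_grad = 200 * 5.0 - 400  # loss'(w) = 200w - 400 from the hand derivation

print(numeric_grad, analytic_grad)
```

Both values should agree to within floating point noise, which is a cheap way to catch a sign error in the chain rule.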

Now do it with torch

In [1]:
import torch

In [2]:
x = torch.tensor(10, dtype=torch.float32)
y = torch.tensor(20, dtype=torch.float32)
w = torch.tensor(5, dtype=torch.float32, requires_grad=True)
output = x * w
error = y - output
loss = error ** 2

In [3]:
x, y, w, output, error, loss

(tensor(10.),
 tensor(20.),
 tensor(5., requires_grad=True),
 tensor(50., grad_fn=<MulBackward0>),
 tensor(-30., grad_fn=<SubBackward0>),
 tensor(900., grad_fn=<PowBackward0>))

In [4]:
loss.backward()
w.grad

tensor(600.)

In [5]:
output.grad

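That gives nothing back: output is not a leaf tensor, so by default torch doesn't keep its .grad after the backward pass (it comes back as None, with a warning).

To see the gradient computed at each intermediate tensor, we can tell torch to retain the grad: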

In [6]:
x = torch.tensor(10, dtype=torch.float32)
y = torch.tensor(20, dtype=torch.float32)
w = torch.tensor(5, dtype=torch.float32, requires_grad=True)
output = x * w
error = y - output
loss = error ** 2
output.retain_grad()
error.retain_grad()
loss.retain_grad()
loss.backward()
loss.grad, error.grad, output.grad, w.grad

(tensor(1.), tensor(-60.), tensor(60.), tensor(600.))

So, back to the question from challenge 15 about how tensors get allocated during back prop: torch probably creates a gradient for each intermediate tensor during the backward pass, but it can free each one as soon as it's no longer needed (unless we ask it to retain the grad).
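One way to reason about what has to stay alive is to redo the scalar example in plain Python and note, for each forward step, which value its backward step reads. This is only a sketch of the arithmetic with variable names I made up (`grad_error`, `grad_output`, ...); it is not a description of how torch manages memory internally.

```python
# Forward pass: compute each value and note what its backward step will need.
x, y, w = 10.0, 20.0, 5.0

output = x * w      # backward needs x, since d(output)/dw = x
error = y - output  # backward needs nothing saved, since d(error)/d(output) = -1
loss = error ** 2   # backward needs error, since d(loss)/d(error) = 2 * error

# Backward pass: walk the steps in reverse, multiplying by each local derivative.
grad_loss = 1.0                     # d(loss)/d(loss)
grad_error = grad_loss * 2 * error  # uses the saved error
grad_output = grad_error * -1.0     # no saved value needed
grad_w = grad_output * x            # uses the saved x

print(grad_loss, grad_error, grad_output, grad_w)
```

The four numbers match the retained grads from torch for this example. In this sketch, `grad_error` is only read once (to produce `grad_output`), so nothing after that step needs it, which fits the guess that intermediate gradients can be released early.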

Now let me try back prop by hand where we have 2 weights and so partials will come into play.

```
x = [10, 15]
y = 20

model is a function f parameterized by 2 weights W = [w1, w2]: f(W) = w1*x1 + w2*x2

As above, write x and y as numbers to make it less confusing.

f(W) = 10*w1 + 15*w2
error(W) = 20 - f(W)          
loss(W) = square(error(W)) 

Suppose our starting value for W is [5,6]

I'll first calculate the derivative of loss with respect to W the HS way, plus knowing that to get the partial derivative with respect to one variable you treat the others like constants.

loss(W) = (20 - (10w1 + 15w2))^2
loss(W) = 100w1^2 + 300w1w2 - 400w1 + 225w2^2 - 600w2 + 400
loss'(w1) = 200w1 + 300w2 - 400
loss'(w2) = 300w1 + 450w2 - 600
loss'([5,6]) = [2400, 3600]

Now I want to calculate the way I think back propagation works.

Forward pass:

f(W) = 10*5 + 15*6 = 140
error(W) = 20 - f(W) = -120
loss(W) = square(error(W)) = square(-120) = 14400

Backward pass:

loss'(W) = square'(error(W)) * error'(W) = 2(error(W)) * error'(W) = -240 *
           error'(W) = -f'(W) = -1 *
           f'(W) = [10, 15]
           = [2400, 3600]

Why is f'(W) = [10,15]? Because f'(w1) = 10 and f'(w2) = 15
```
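The "treat the others like constants" rule can be checked numerically: nudge one weight at a time while holding the other fixed. A minimal sketch, assuming a step size `h` that I picked arbitrarily:

```python
def loss(w1, w2):
    """Squared error for x = [10, 15], y = 20."""
    return (20.0 - (10.0 * w1 + 15.0 * w2)) ** 2


h = 1e-3

# Partial with respect to w1: move w1 only, keep w2 fixed at 6.
d_w1 = (loss(5.0 + h, 6.0) - loss(5.0 - h, 6.0)) / (2 * h)

# Partial with respect to w2: move w2 only, keep w1 fixed at 5.
d_w2 = (loss(5.0, 6.0 + h) - loss(5.0, 6.0 - h)) / (2 * h)

print(d_w1, d_w2)
```

Up to floating point noise this should land on the same [2400, 3600] as the hand calculation.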

Now with torch:

In [7]:
x = torch.tensor([10, 15], dtype=torch.float32)
y = torch.tensor(20, dtype=torch.float32)
w = torch.tensor([5, 6], dtype=torch.float32, requires_grad=True)
output = x @ w
error = y - output
loss = error ** 2

In [8]:
x, y, w, output, error, loss

(tensor([10., 15.]),
 tensor(20.),
 tensor([5., 6.], requires_grad=True),
 tensor(140., grad_fn=<DotBackward0>),
 tensor(-120., grad_fn=<SubBackward0>),
 tensor(14400., grad_fn=<PowBackward0>))

In [9]:
loss.backward()
w.grad

tensor([2400., 3600.])

And now using retain_grad():

In [10]:
x = torch.tensor([10, 15], dtype=torch.float32)
y = torch.tensor(20, dtype=torch.float32)
w = torch.tensor([5, 6], dtype=torch.float32, requires_grad=True)
output = x @ w
error = y - output
loss = error ** 2
output.retain_grad()
error.retain_grad()
loss.retain_grad()
loss.backward()
loss.grad, error.grad, output.grad, w.grad

(tensor(1.), tensor(-240.), tensor(240.), tensor([2400., 3600.]))

Now try back prop by hand where we have 2 pieces of data and 2 weights. 

```
x = [[10, 15], [30, 40]]
y = [20, 25]

f(W) = [10w1 + 15w2, 30w1 + 40w2]
error(W) = y - f(W) = [20 - (10w1 + 15w2), 25 - (30w1 + 40w2)]
loss(W) = mean(square(error(W)))

suppose W = [5,6]

Forward pass:

f(W) = [140, 390]
error(W) = [-120, -365]
square(error(W)) = [14400, 133225]
mean(square(error(W))) = 73812.5

Backward pass:

loss'(W) = mean'(square(error(W)))
         = 0.5 * (   square'(error1(W))           +     square'(error2(W))          )       †
         = 0.5 * (   2(error1(W)) * error1'(W)    +     2(error2(W)) * error2'(W)   )       †
         = 0.5 * (   2(-120) * -f1'(W)            +     2(-365) * -f2'(W)           )       †
         = 0.5 * (   -240 * (-1) * [10, 15]       +     -730 * (-1) * [30, 40]      )       †
         = 0.5 * (   [2400, 3600]                 +     [21900, 29200]              )       †
         = 0.5 * [24300, 32800]
         = [12150, 16400]

† - the notation is still loose, but each column is like the prior example: column 1 is for
    the first piece of data and column 2 for the second (error1, f1 and error2, f2 are the
    error and model output for each piece of data)
```
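The two-column calculation above can be sketched in plain Python as "compute the single-example gradient for each piece of data, then average". The names `per_sample_grads` and `grad_w` are just mine for illustration.

```python
xs = [[10.0, 15.0], [30.0, 40.0]]
ys = [20.0, 25.0]
w = [5.0, 6.0]

per_sample_grads = []
for x, y in zip(xs, ys):
    output = x[0] * w[0] + x[1] * w[1]
    error = y - output
    # d(error^2)/dw = 2 * error * d(error)/dw = 2 * error * (-x)
    per_sample_grads.append([-2 * error * xi for xi in x])

# The mean over n examples contributes the factor 1/n (0.5 here).
n = len(xs)
grad_w = [sum(g[j] for g in per_sample_grads) / n for j in range(len(w))]

print(per_sample_grads)
print(grad_w)
```

The first print shows the two columns, [2400, 3600] and [21900, 29200], and the second shows their average.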

In [14]:
x = torch.tensor([[10, 15], [30, 40]], dtype=torch.float32)
y = torch.tensor([20, 25], dtype=torch.float32)
w = torch.tensor([5, 6], dtype=torch.float32, requires_grad=True)
output = x @ w
error = y - output
error_squared = error ** 2
loss = torch.mean(error_squared)
output.retain_grad()
error.retain_grad()
error_squared.retain_grad()
loss.retain_grad()
loss.backward()
loss.grad, error_squared.grad, error.grad, output.grad, w.grad

(tensor(1.),
 tensor([0.5000, 0.5000]),
 tensor([-120., -365.]),
 tensor([120., 365.]),
 tensor([12150., 16400.]))

In [15]:
output, error, error_squared, loss

(tensor([140., 390.], grad_fn=<MvBackward0>),
 tensor([-120., -365.], grad_fn=<SubBackward0>),
 tensor([ 14400., 133225.], grad_fn=<PowBackward0>),
 tensor(73812.5000, grad_fn=<MeanBackward0>))
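The retained grads printed above can also be reproduced one step at a time, starting from the 1.0 at the loss and working back to w. This is a plain-Python sketch of the chain rule using the forward values from the torch run. The `grad_*` names are mine, and I'm assuming the last step (for output = x @ w) amounts to multiplying by the transpose of x, which is what the numbers suggest.

```python
X = [[10.0, 15.0], [30.0, 40.0]]
error = [-120.0, -365.0]  # forward value, needed by the backward of error ** 2

# mean: each of the n inputs gets 1/n of the upstream gradient (which is 1.0)
grad_error_squared = [1.0 / len(error)] * len(error)

# error ** 2: local derivative is 2 * error
grad_error = [g * 2 * e for g, e in zip(grad_error_squared, error)]

# error = y - output: local derivative with respect to output is -1
grad_output = [-g for g in grad_error]

# output = X @ w: grad_w[j] = sum over examples i of X[i][j] * grad_output[i]
grad_w = [
    sum(X[i][j] * grad_output[i] for i in range(len(X)))
    for j in range(len(X[0]))
]

print(grad_error_squared, grad_error, grad_output, grad_w)
```

The only forward values this backward pass reads are error and X, which lines up with the earlier question about what has to be kept around for back prop.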