While working on challenge 15 I realized I'm confused about which tensors need to be available and allocated for back prop. So I'm going to do back prop by hand to build my intuition.

Let's first try an example that uses scalars only and doesn't even involve data or a loss function.

```
We're going to build a "machine" called "f" with 4 "knobs." We're only allowed to turn the first knob which we call "a" but we can read the value off any of the knobs.

[a knob] -> [b knob] -> [c knob] -> [d knob]

the value of the "b" knob will be the square of its input
the value of the "c" knob will be the cube of its input
the value of the "d" knob will be the square of its input

We set "a" to 2. We want to know how "d" will change if we turn "a" a little bit.

a(x) = x
b(x) = x^2
c(x) = x^3
d(x) = x^2
f(x) = d(c(b(a(x))))

This means that f(x) = x^12
so f'(x) = 12x^11
so f'(2) = 24,576

If we turn "a" a tiny bit, say +1/100,000, we expect the output "d" to increase by around 24,576/100,000 =~ .25. Let's see:

(2 + 1/100,000)^12 - 2^12  =~ 0.246

Since we're raising to the 12th power, it's no surprise that a tiny, tiny turn of "a" causes a much bigger increase in "d".

Now let's calculate this again but with back propagation through the machine. In other words, instead of first simplifying the "machine" to x^12 and using the power rule in calculus once, we use the chain rule.

A note about notation because I was a little sloppy at first and got myself confused:

d/dx f(x) means the derivative of f with respect to x
f'(x) means the derivative of f with respect to the input of f
so d/dx f(x) = f'(x)
but d/dx f(g(x)) ≠ f'(g(x)) in general, because by the chain rule d/dx f(g(x)) = f'(g(x)) * g'(x)

First we calculate forward:

a = 2
b(a) = a^2 = 2^2 = 4
c(b(a)) = 4^3 = 64
d(c(b(a))) = 64^2 = 4096

Let's show these on our knobs:

[a knob] -> [b knob] -> [c knob] -> [d knob]
    2          4           64         4096

Now backward:

f'(a) = d/da d(c(b(a)))
      = d'(c(b(a))) * d/da c(b(a))
      = 2 * c(b(a)) * d/da c(b(a))        # because d'(x) = 2x
      = 2 * 64 * d/da c(b(a))
      = 128 * d/da c(b(a))                <--- 128 is "c" knob grad
      = 128 * c'(b(a)) * d/da b(a)
      = 128 * 3 * b(a)^2 * d/da b(a)      # because c'(x) = 3x^2
      = 128 * 3 * 4^2 * d/da b(a)
      = 128 * 3 * 16 * d/da b(a)
      = 128 * 48 * d/da b(a)
      = 6144 * d/da b(a)                  <--- 6144 is "b" knob grad
      = 6144 * 2 * a                      # because b'(x) = 2x
      = 6144 * 2 * 2
      = 6144 * 4
      = 24,576                            <--- 24,576 is "a" knob grad


                    b(x)            c(x)            d(x)
          [a knob]   ->   [b knob]   ->   [c knob]   ->   [d knob]
forward:      2              4               64             4096
backward:  24,576           6144             128              1
                                             

How to think about the "c" knob grad of 128? We know in this "machine" we aren't allowed to adjust the "c" knob by hand. But let's say we did. The 128 tells us that in the current state (where "a" is set to 2 and "c" is therefore set to 64), if we increase "c" slightly, we'll see an increase of 128 times that amount in "d":

(64 + 1/1000)^2 - 4096 =~ 0.128

How about the "b" knob grad of 6144? Same idea:

(4 + 1/10,000)^6 - 4096 =~ 0.614

Now going back to our original motivation for exploring this stuff from challenge 15, let's think about what information we need to do the backward pass.

grad_of_d = 1
grad_of_c = d_backwards(grad_of_d, value_of_c) = grad_of_d * d'(value_of_c) = 1 * 2(64) = 128
grad_of_b = c_backwards(grad_of_c, value_of_b) = grad_of_c * c'(value_of_b) = 128 * 3(4^2) = 6144
grad_of_a = b_backwards(grad_of_b, value_of_a) = grad_of_b * b'(value_of_a) = 6144 * 2(2) = 24576

In other words, for each knob, take the derivative of its function with respect to its input, plug in the input it received on the forward pass, multiply by the knob's own gradient, and that's the gradient of the knob before it.

Interesting: it looks like we never actually need the value of "d" to determine the grad of "a". The backward pass only used the input each function received (the values of "a", "b" and "c").

```
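
The backward recipe above can be sketched in plain Python (no torch). The function names mirror the `d_backwards` / `c_backwards` / `b_backwards` names used in the notes; the code itself is only an illustration, not how any library implements it. The thing to notice is which values each backward function reads.

```python
# Forward pass: keep each knob's value around, because the backward pass
# needs the *input* each function received.
a = 2.0
b = a ** 2   # 4.0
c = b ** 3   # 64.0
d = c ** 2   # 4096.0 (computed here, but never read by the backward pass)

# Each *_backwards takes the gradient arriving from the right and the input
# its function saw on the forward pass.
def d_backwards(grad_of_d, value_of_c):
    return grad_of_d * 2 * value_of_c        # d'(x) = 2x

def c_backwards(grad_of_c, value_of_b):
    return grad_of_c * 3 * value_of_b ** 2   # c'(x) = 3x^2

def b_backwards(grad_of_b, value_of_a):
    return grad_of_b * 2 * value_of_a        # b'(x) = 2x

grad_of_d = 1.0
grad_of_c = d_backwards(grad_of_d, c)
grad_of_b = c_backwards(grad_of_c, b)
grad_of_a = b_backwards(grad_of_b, a)
print(grad_of_c, grad_of_b, grad_of_a)   # 128.0 6144.0 24576.0
```

This only shows that "d" is unnecessary for this particular machine, where every derivative is written in terms of the function's input. Other functions may be easier to differentiate from their output, in which case the output would be the thing to keep.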

Now let's do it in torch:

In [1]:
import torch
a = torch.tensor(2, dtype=torch.float32, requires_grad=True)
b = a ** 2
c = b ** 3
d = c ** 2
b.retain_grad()
c.retain_grad()
d.retain_grad()
d.backward()
print(f"Forward: a:{a.item()} -> b:{b.item()} -> c:{c.item()} -> d:{d.item()}")
print(f"Backward: d:{d.grad.item()} -> c:{c.grad.item()} -> b:{b.grad.item()} -> a:{a.grad.item()}")

Forward: a:2.0 -> b:4.0 -> c:64.0 -> d:4096.0
Backward: d:1.0 -> c:128.0 -> b:6144.0 -> a:24576.0


^ The numbers match
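
A third check that needs neither calculus nor torch is to nudge each knob and measure how much "d" moves, like the (64 + 1/1000)^2 check earlier. The sketch below uses a central difference; the helper names (`slope`, `d_from_a`, ...) are made up for this example.

```python
def slope(fn, value, h=1e-6):
    """Central difference: change in fn's output per unit of nudge."""
    return (fn(value + h) - fn(value - h)) / (2 * h)

# The part of the machine that sits downstream of each knob.
def d_from_a(a):
    return ((a ** 2) ** 3) ** 2

def d_from_b(b):
    return (b ** 3) ** 2

def d_from_c(c):
    return c ** 2

slope_a = slope(d_from_a, 2.0)    # should land near 24576
slope_b = slope(d_from_b, 4.0)    # should land near 6144
slope_c = slope(d_from_c, 64.0)   # should land near 128
print(slope_a, slope_b, slope_c)
```

The results are approximate (floating point plus a finite nudge), so expect them to agree with the grads above to a few decimal places rather than exactly.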

To get a bit more practice, let's try a case where the input to the machine has two values (a vector of size 2), the output is a single number, and we're interested in the gradient of the input vector. This brings us a little closer to the situation where we're interested in the gradient of many weights and our final output is a single number (the loss).

```
[a1 knob] -> [b1 knob] -> [c1 knob] -> [d1 knob] \
                                                   -> [e knob]
[a2 knob] -> [b2 knob] -> [c2 knob] -> [d2 knob] /

This will be just like the example above except the "e" knob is set by the mean of its inputs.

X is the vector [x1, x2]

b(X) = X^2
c(X) = X^3
d(X) = X^2
e(X) = 0.5(x1) + 0.5(x2)

Let's set our A knobs to [2,3]

                    b(X)            c(X)            d(X)                  e(X)
          [a knob]   ->   [b knob]   ->   [c knob]   ->   [d knob]         ->    [e knob]

forward:   [2,3]           [4,9]          [64,729]        [4096, 531441]         267768.5


To calculate backward we need to know e'(X):
e(X) = 0.5(x1) + 0.5(x2)
∂/∂x1 e(X) = 0.5     (I believe another notation for this is e_x1)
∂/∂x2 e(X) = 0.5
e'(X) = [0.5, 0.5]

e_backwards(grad_of_e, value_of_d) = grad_of_e * e'(value_of_d) = 1 * [0.5, 0.5] = [0.5, 0.5]


So now we do things like before:

                       b(X)                c(X)                  d(X)                  e(X)
          [a knob]      ->    [b knob]      ->      [c knob]      ->   [d knob]         ->     [e knob]

forward:   [2, 3]              [4, 9]                [64, 729]          [4096, 531441]          267768.5
backward:  [12288, 1062882]    [3072, 177147]        [64, 729]          [0.5, 0.5]              1

And as a sanity check, if we "force" the d2 knob up by 1, it makes sense that our output will increase by 0.5 because our output is the mean of d1 and d2. (Mean of 2 and 4 is 3, mean of 2 and 5 is 3.5.)

```
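
The same table can be reproduced with plain Python lists, treating every step except the mean as element-wise. This is a sketch of the hand calculation above, with variable names (`grad_D`, `grad_C`, ...) chosen just for this example.

```python
# Forward pass, element by element.
A = [2.0, 3.0]
B = [x ** 2 for x in A]          # [4.0, 9.0]
C = [x ** 3 for x in B]          # [64.0, 729.0]
D = [x ** 2 for x in C]          # [4096.0, 531441.0]
e = sum(D) / len(D)              # 267768.5

# Backward pass: each step multiplies the incoming grad by the local
# derivative evaluated at the input saved from the forward pass.
grad_e = 1.0
grad_D = [grad_e / len(D) for _ in D]                   # e'(X) = [0.5, 0.5]
grad_C = [g * 2 * x for g, x in zip(grad_D, C)]         # d'(x) = 2x
grad_B = [g * 3 * x ** 2 for g, x in zip(grad_C, B)]    # c'(x) = 3x^2
grad_A = [g * 2 * x for g, x in zip(grad_B, A)]         # b'(x) = 2x

print(grad_D, grad_C, grad_B, grad_A)
```

Because b, c and d act on each element separately, the two rows of the machine never mix until the mean, which is why each grad list can be computed with a simple `zip`.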

Now with torch:

In [2]:
pp = lambda t: '[' + ', '.join([f"{x:.1f}" for x in t]) + ']'
a = torch.tensor([2,3], dtype=torch.float32, requires_grad=True)
b = a ** 2
c = b ** 3
d = c ** 2
e = torch.mean(d)
b.retain_grad()
c.retain_grad()
d.retain_grad()
e.retain_grad()
e.backward()
print(f"Forward: a:{pp(a)} -> b:{pp(b)} -> c:{pp(c)} -> d:{pp(d)} -> e:{e.item()}")
print(f"Backward: e:{e.grad.item()} -> d:{pp(d.grad)} -> c:{pp(c.grad)} -> b:{pp(b.grad)} -> a:{pp(a.grad)}")

Forward: a:[2.0, 3.0] -> b:[4.0, 9.0] -> c:[64.0, 729.0] -> d:[4096.0, 531441.0] -> e:267768.5
Backward: e:1.0 -> d:[0.5, 0.5] -> c:[64.0, 729.0] -> b:[3072.0, 177147.0] -> a:[12288.0, 1062882.0]


Now try an example with data and a weight to be adjusted but only with scalars. (This was actually what I did first and then realized I wanted to start even simpler with the ones above.)

```
data:

x = 10
y = 20

model:

our prediction is parameterized by a single weight w: prediction(x,w) = w*x

To keep in mind that x and y are fixed and we're interested in the derivative of the loss with respect to w (so we know how to adjust it), I'm going to write x and y as 10 and 20.

prediction(w) = 10 * w                  this gives the prediction of our model for x = 10
error(w) = 20 - prediction(w)           this gives the error for x = 10, y = 20 
loss(w) = square(error(w))              this gives the squared error for x = 10, y = 20

Suppose w is 5

I'll first calculate the derivative of loss with respect to w the way I learned in high school:

loss(w) = (20 - 10w)^2
loss(w) = 100w^2 - 400w + 400
loss'(w) = 200w - 400

loss'(5) = 600           this tells us increasing w will increase the loss

Now I'll use the back propagation approach.

First forward:

           [w]   ->    [prediction]       ->      [error]       ->      [loss]
Forward:    5             50                        -30                  900

Before doing backward, calculate the derivatives we'll need (using z to be very clear this has nothing to do with x above):

loss(z) = z^2 so loss'(z) = 2z
error(z) = 20 - z so error'(z) = -1
prediction(z) = 10z so prediction'(z) = 10

           [w]   ->    [prediction]    ->    [error]       ->      [loss]
Forward:    5             50                   -30                  900

Backward:   60 * 10     -60 * -1           1 * 2(-30)                1
            600            60                 -60                    1   


Or we can write it out this way:

loss'(w) = d/dw square(error(prediction(w)))
         = square'(error(prediction(w))) * d/dw error(prediction(w))
         = square'(-30) * d/dw error(prediction(w))
         = -60 * d/dw error(prediction(w))
         = -60 * error'(prediction(w)) * d/dw prediction(w)
         = -60 * error'(50) * d/dw prediction(w)
         = -60 * -1 * d/dw prediction(w)
         = 60 * d/dw prediction(w)
         = 60 * 10
         = 600

Looking at the prediction "knob": its grad is positive (60), so if we force the prediction up our loss will increase. That makes sense because the prediction (50) is already above the "actual" value of 20.

```
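
Here is the same backward pass as a plain-Python sketch, followed by a small check of what the sign of the w grad means. The `forward` helper and the `step` size of 0.001 are assumptions made up for this example; they are not part of the notes above.

```python
x, y = 10.0, 20.0

def forward(w):
    prediction = x * w
    error = y - prediction
    loss = error ** 2
    return prediction, error, loss

w = 5.0
prediction, error, loss = forward(w)      # 50.0, -30.0, 900.0

# Backward: multiply the incoming grad by each local derivative.
grad_loss = 1.0
grad_error = grad_loss * 2 * error        # loss'(z) = 2z
grad_prediction = grad_error * -1         # error'(z) = -1
grad_w = grad_prediction * x              # prediction'(z) = 10
print(grad_error, grad_prediction, grad_w)   # -60.0 60.0 600.0

# grad_w is positive, so turning w *down* a little should lower the loss.
step = 0.001                              # hypothetical step size
_, _, new_loss = forward(w - step * grad_w)
print(loss, new_loss)                     # new_loss comes out around 576
```

With this step size w moves from 5 to about 4.4, so the prediction drops from 50 to about 44, closer to the target of 20.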

Now do it with torch

In [3]:
x = torch.tensor(10, dtype=torch.float32)
y = torch.tensor(20, dtype=torch.float32)
w = torch.tensor(5, dtype=torch.float32, requires_grad=True)
prediction = x * w
error = y - prediction
loss = error ** 2
prediction.retain_grad()
error.retain_grad()
loss.retain_grad()
loss.backward()
print(f"Forward:  w:{w.item()} -> prediction:{prediction.item()} -> error:{error.item()} -> loss:{loss.item()}")
print(f"Backward: loss:{loss.grad.item()} -> error:{error.grad.item()} -> prediction:{prediction.grad.item()} -> w:{w.grad.item()}")

Forward:  w:5.0 -> prediction:50.0 -> error:-30.0 -> loss:900.0
Backward: loss:1.0 -> error:-60.0 -> prediction:60.0 -> w:600.0


What happens if we try to access the grad of, say, `loss` without calling `.retain_grad()`?

In [4]:
x = torch.tensor(10, dtype=torch.float32)
y = torch.tensor(20, dtype=torch.float32)
w = torch.tensor(5, dtype=torch.float32, requires_grad=True)
prediction = x * w
error = y - prediction
loss = error ** 2
prediction.retain_grad()
error.retain_grad()
# loss.retain_grad()
loss.backward()
print(f"Forward:  w:{w.item()} -> prediction:{prediction.item()} -> error:{error.item()} -> loss:{loss.item()}")
print(f"Backward: loss:{loss.grad.item()} -> error:{error.grad.item()} -> prediction:{prediction.grad.item()} -> w:{w.grad.item()}")

Forward:  w:5.0 -> prediction:50.0 -> error:-30.0 -> loss:900.0


  print(f"Backward: loss:{loss.grad.item()} -> error:{error.grad.item()} -> prediction:{prediction.grad.item()} -> w:{w.grad.item()}")


AttributeError: 'NoneType' object has no attribute 'item'

That's because `loss` is a non-leaf tensor: without `loss.retain_grad()`, PyTorch doesn't keep its gradient after `backward()`, so `loss.grad` is `None` and calling `.item()` on `None` raises the error above.

Now let's try 3 pieces of data and 2 weights.

```
Data:

X = [[10, 15], [3, 4], [7, 11]]
Y = [20, 8, 14]

prediction(W) = X @ W
error(W) = Y - prediction(W)
square_error(W) = square(error(W))
loss(W) = mean(square(error(W)))

Suppose W is [5,6]


     [W]             ->       [prediction]    ->    [error]            ->  [square_error]    -> [loss]

Fwd: [5,6]                    [140, 39, 101]        [-120, -31, -87]       [14400, 961, 7569]   7643.3335

Bwd: X.T @ [80, 20.6667, 58]  -1*[-80,-20.67,-58]   2/3*[-120, -31, -87]   1*[1/3,1/3,1/3]      1

     [1268, 1920.67]          [80, 20.6667, 58]     [-80, -20.6667, -58]   [1/3, 1/3, 1/3]      1


Everything is like above except for the last backwards step: X.T @ [80, 20.67, 58]. I don't understand that. Will explore below after doing this in torch.

```
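
The table above can be reproduced with plain Python lists, writing the `X @ W` and `X.T @ grad` products as explicit sums. This is only a sketch of the hand calculation, with variable names invented for the example.

```python
X = [[10.0, 15.0], [3.0, 4.0], [7.0, 11.0]]
Y = [20.0, 8.0, 14.0]
W = [5.0, 6.0]
n = len(Y)

# Forward pass.
prediction = [sum(x_ij * w_j for x_ij, w_j in zip(row, W)) for row in X]  # X @ W
error = [y - p for y, p in zip(Y, prediction)]
square_error = [e ** 2 for e in error]
loss = sum(square_error) / n

# Backward pass, right to left.
grad_square_error = [1.0 / n] * n                       # derivative of the mean
grad_error = [g * 2 * e for g, e in zip(grad_square_error, error)]
grad_prediction = [-g for g in grad_error]              # error'(z) = -1
# X.T @ grad_prediction: column j of X dotted with grad_prediction.
grad_W = [sum(X[i][j] * grad_prediction[i] for i in range(n))
          for j in range(len(W))]

print(prediction)        # [140.0, 39.0, 101.0]
print(grad_prediction)   # roughly [80, 20.67, 58]
print(grad_W)            # roughly [1268, 1920.67]
```

Written this way, each entry of `grad_W` is visibly a sum over the 3 pieces of data: one term per prediction that the weight feeds into.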

In [5]:
pp = lambda t: '[' + ', '.join([format(x, '.0f' if x % 1 == 0 else '.1f') for x in t]) + ']'
x = torch.tensor([[10, 15], [3, 4], [7, 11]], dtype=torch.float32)
y = torch.tensor([20, 8, 14], dtype=torch.float32)
w = torch.tensor([5, 6], dtype=torch.float32, requires_grad=True)
prediction = x @ w
error = y - prediction
error_squared = error ** 2
loss = torch.mean(error_squared)
prediction.retain_grad()
error.retain_grad()
error_squared.retain_grad()
loss.retain_grad()
loss.backward()
print(f"Fwd: w:{pp(w)} -> pred:{pp(prediction)} -> error:{pp(error)} -> error_sq:{pp(error_squared)} -> loss:{loss.item():.1f}")
print(f"Bwd: loss:{loss.grad.item()} -> error_sq:{pp(error_squared.grad)} -> error:{pp(error.grad)} -> pred:{pp(prediction.grad)} -> w:{pp(w.grad)}")

Fwd: w:[5, 6] -> pred:[140, 39, 101] -> error:[-120, -31, -87] -> error_sq:[14400, 961, 7569] -> loss:7643.3
Bwd: loss:1.0 -> error_sq:[0.3, 0.3, 0.3] -> error:[-80, -20.7, -58] -> pred:[80, 20.7, 58] -> w:[1268, 1920.7]


```
Why in the last backward step do we do X.T @ [80, 20.67, 58]?

We know the general idea is something like:

W.grad = prediction.grad * prediction'(W)

We also know that prediction'(W) will need to tell us how the prediction for each of our 3 pieces of data changes if we increase w1 a little bit or increase w2 a little bit. Since this is all linear, we can see that increasing w1 by 1 and not touching w2 will increase the first prediction by 10, the second by 3, and the third by 7. Similarly, if we leave w1 alone and increase w2 by 1, we will increase our first prediction by 15, our second by 4, and our third by 11. So it makes sense that the function prediction'(W) has X in it, and I believe there's no way it could contain something with less information than X because then it wouldn't be possible to say how much each of our 3 predictions increases.

Is the derivative of the prediction with respect to w1 [10, 3, 7]? If so that too makes sense. But why is it valid to compute w1.grad as 10 * 80 + 3 * 20.6667 + 7 * 58 = 1268? Why are we allowed to add those numbers together?

Let me compute W.grad the non-back-prop, non-matrix way. Maybe that will give some insight.

loss(W) = 1/3(20 - (10w1 + 15w2))^2 + 1/3(8 - (3w1 + 4w2))^2 + 1/3(14 - (7w1 + 11w2))^2
        = (158w1^2)/3 + (478w1w2)/3 - (644w1)/3 + (362w2^2)/3 - 324w2 + 220

∂loss/∂w1 = 2/3(158)w1 + 1/3(478)w2 - 1/3(644) = 1268

∂loss/∂w2 = 1/3(478)w1 + 2/3(362)w2 - 324 = 1920.67

```

That's good but I still don't get it. Giving up and asking ChatGPT. Let me at least try to ask the question clearly.

me: Given matrices X, W and Y and X @ W = Y, when doing back propagation to determine W.grad, why is W.grad calculated by X.T @ Y.grad?

ChatGPT said a lot. Not sure I appreciate it telling me "Below is the clean, no-mystery answer, wrapped in imagery so it goes down easier."! Not going to try to digest it now.

Also found these [Stanford CS class slides](https://cs231n.stanford.edu/slides/2018/cs231n_2018_ds02.pdf) which seem helpful but want to move on. Perhaps revisit.
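
For the revisit, here is one possible way to poke at the "why are we allowed to add those numbers?" question numerically, in plain Python. The idea (the multivariable chain rule, stated here as background rather than taken from the notes above) is that w1 reaches the loss through three separate paths, one per prediction, and for a small nudge the effects along those paths add up. All names below are invented for this sketch.

```python
X = [[10.0, 15.0], [3.0, 4.0], [7.0, 11.0]]
Y = [20.0, 8.0, 14.0]
W = [5.0, 6.0]
n = len(Y)

def loss_from_predictions(P):
    return sum((y - p) ** 2 for y, p in zip(Y, P)) / n

P = [row[0] * W[0] + row[1] * W[1] for row in X]    # [140, 39, 101]
grad_P = [-2 * (y - p) / n for y, p in zip(Y, P)]   # about [80, 20.67, 58]

h = 1e-4   # size of the nudge to w1 (an arbitrary small number)
base = loss_from_predictions(P)

# Nudging w1 by h moves prediction i by X[i][0] * h.
# Path i: move only prediction i and see how much the loss changes per unit h.
per_path = []
for i in range(n):
    nudged = list(P)
    nudged[i] += X[i][0] * h
    per_path.append((loss_from_predictions(nudged) - base) / h)

# All paths at once: move every prediction, as a real nudge to w1 would.
all_nudged = [p + X[i][0] * h for i, p in enumerate(P)]
total = (loss_from_predictions(all_nudged) - base) / h

# What back prop computes: first row of X.T @ grad_P.
chain_rule = sum(X[i][0] * grad_P[i] for i in range(n))

print(per_path)        # roughly [800, 62, 406]
print(sum(per_path))   # roughly 1268
print(total)           # roughly 1268
print(chain_rule)      # roughly 1268
```

The three per-path numbers are 10 * 80, 3 * 20.67 and 7 * 58, which are exactly the terms being added in w1.grad. This is a numerical illustration, not a proof.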