# Backprop

## Introduction

### Problem
Given $f(x)$, compute $\nabla f(x)$.

### Gradients
- $\frac{df}{dx}=\lim_{h\to 0}\frac{f(x+h)-f(x)}{h}$

- Rearranging for small $h$, we have: $f(x+h)\approx f(x)+h\frac{df}{dx}$

- Approximation: nudging $x$ by a small $h$ changes $f$ by roughly $h\frac{df}{dx}$, i.e. in proportion to the gradient
- $f$ is basically a straight line when $h$ is really small, which is why that approximation works
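
A minimal sketch of the approximation above, assuming a toy function $f(x)=x^3$ (chosen only for illustration; `numerical_derivative` is a hypothetical helper name). It estimates the slope with the difference quotient and shows the linear approximation getting better as the step shrinks:

```python
def f(x):
    # Toy function for illustration; its analytic derivative is 3 * x**2
    return x ** 3

def numerical_derivative(f, x, h=1e-5):
    # Difference quotient from the definition, with a small but finite h
    return (f(x + h) - f(x)) / h

x = 2.0
slope = numerical_derivative(f, x)  # close to 3 * 2**2 = 12

# First-order approximation: f(x + step) ≈ f(x) + step * df/dx
errors = []
for step in (1.0, 0.1, 0.001):
    exact = f(x + step)
    approx = f(x) + step * slope
    errors.append(abs(exact - approx))
    print(f"step={step}: exact={exact:.6f}, approx={approx:.6f}")
```

The error shrinks much faster than the step itself, which is the "straight line when $h$ is small" behaviour.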

## Backpropagation

Each node computes:
1. Output
2. Local gradient of output with respect to inputs.

The correct way to combine local and upstream gradients is multiplication (this is the chain rule).
- Intuitively: gradient tells you which direction to nudge a variable in order to increase final output. Sign and magnitude of gradient tell you which direction and how strongly to update the variable.
- Backpropagation communicates this information to each input node so they can compute *their* effect on the final output value.
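
A small worked sketch of this, assuming the toy expression $f=(x+y)\cdot z$ with made-up input values. Each backward line multiplies a node's local gradient by the upstream gradient it receives:

```python
# Inputs (arbitrary illustrative values)
x, y, z = -2.0, 5.0, -4.0

# Forward pass: each node computes its output
q = x + y   # 3.0
f = q * z   # -12.0

# Backward pass: local gradient * upstream gradient at every node
df = 1.0            # gradient of f with respect to itself
dq = z * df         # local grad of f w.r.t. q is z
dz = q * df         # local grad of f w.r.t. z is q
dx = 1.0 * dq       # local grad of q w.r.t. x is 1
dy = 1.0 * dq       # local grad of q w.r.t. y is 1

print(dx, dy, dz)   # -4.0 -4.0 3.0
```

The negative `dx` says that increasing `x` slightly would decrease `f`, and by about four times the size of the nudge.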

### Notes

1. Cache forward pass variables (the backward pass reuses them to compute local gradients)
2. Gradients add at forks (multivariate chain rule)
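
Both notes can be seen in a short sketch, assuming the toy expression $f=(x\cdot y)\cdot(x+y)$, in which `x` and `y` each feed two branches. The forward values `a` and `b` are kept around for the backward pass, and the gradients from the two branches are accumulated with `+=`:

```python
x, y = 3.0, -2.0

# Forward pass: cache intermediate values for reuse in the backward pass
a = x * y      # branch 1
b = x + y      # branch 2
f = a * b

# Backward pass
da = b * 1.0   # uses cached b
db = a * 1.0   # uses cached a

dx = 0.0
dy = 0.0
# Contribution through branch 1 (a = x * y)
dx += y * da
dy += x * da
# Contribution through branch 2 (b = x + y)
dx += 1.0 * db
dy += 1.0 * db

# Analytic check: f = x^2 y + x y^2
dx_exact = 2 * x * y + y ** 2
dy_exact = x ** 2 + 2 * x * y
print(dx, dx_exact, dy, dy_exact)
```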

### Patterns

1. Add gate: distributes grad equally to each input
    - This is because local grad is 1, so it just relays upstream grad
2. Max gate: distributes grad to only one input (the max)
    - This is because local grad is 1 w.r.t max input and 0 w.r.t all other inputs
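3. Multiply gate: each input receives the upstream grad scaled by the *other* input's value (the inputs are "swapped")
    - This is because local grad w.r.t one input is the value of the other input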
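
The three patterns can be sketched as backward functions for two-input gates. The function names are hypothetical, and routing ties in the max gate to the first input is an assumption made for this sketch:

```python
def add_backward(upstream):
    # Local grad is 1 for both inputs, so the upstream grad is relayed as-is
    return upstream, upstream

def max_backward(a, b, upstream):
    # Only the larger input receives the gradient (ties go to a: an assumption)
    if a >= b:
        return upstream, 0.0
    return 0.0, upstream

def mul_backward(a, b, upstream):
    # Each input gets the upstream grad times the *other* input's value
    return b * upstream, a * upstream

a, b, upstream = 3.0, -4.0, 2.0
add_grads = add_backward(upstream)
max_grads = max_backward(a, b, upstream)
mul_grads = mul_backward(a, b, upstream)
print(add_grads, max_grads, mul_grads)
```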

## Vectorization

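Use dimensional analysis: the gradient of a variable must have the same shape as the variable itself, and there is usually only one way to arrange the multiplication so that the shapes work out.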

Example:
```
import numpy as np

# fwd pass
# W: (5,10)
# x: (10,3)
# D: (5,3)
W = np.random.randn(5, 10)
x = np.random.randn(10, 3)
D = W @ x

# bkwd pass
# dD: (5,3), a random stand-in for the upstream gradient dL/dD
dD = np.random.randn(*D.shape)

# we know dL/dW = dL/dD * dD/dW = dL/dD * x, but what are the exact dimensions involved?
# well, dL/dD has a (5,3) shape, and x has (10,3) shape, so dL/dW must have (5,3)x(3,10)=(5,10) shape
# this means we need to transpose x before the multiplication
dW = dD @ x.T

# similarly, dL/dX = dL/dD * dD/dX = dL/dD * W, so (10,5)x(5,3)=(10,3)
# we must transpose W
dX = W.T @ dD
```
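
One way to sanity-check the shape-based result is a numerical gradient check. The sketch below assumes a made-up scalar loss $L=\sum (Wx)\odot G$, chosen so that $dL/dD$ equals a fixed matrix `G` (playing the role of `dD`); `numerical_grad` is a hypothetical helper using central differences:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 10))
x = rng.standard_normal((10, 3))
G = rng.standard_normal((5, 3))  # fixed upstream gradient, i.e. dL/dD

def loss():
    # Scalar loss whose gradient with respect to D = W @ x is exactly G
    return np.sum((W @ x) * G)

# Analytic gradients from the dimensional-analysis argument
dW = G @ x.T   # (5,3) @ (3,10) -> (5,10)
dX = W.T @ G   # (10,5) @ (5,3) -> (10,3)

def numerical_grad(fn, A, h=1e-6):
    # Central differences, perturbing one entry of A at a time (in place)
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + h
        plus = fn()
        A[idx] = old - h
        minus = fn()
        A[idx] = old  # restore the original entry
        grad[idx] = (plus - minus) / (2 * h)
    return grad

dW_num = numerical_grad(loss, W)
dX_num = numerical_grad(loss, x)
print(np.allclose(dW, dW_num, atol=1e-6), np.allclose(dX, dX_num, atol=1e-6))
```

If a transpose were misplaced, either the shapes would fail to multiply or the comparison against the numerical gradient would fail.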

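Scenarios:
1. Scalar -> Scalar: gradient is a scalar
2. Vector -> Scalar: gradient is a vector (same shape as the input)
3. Vector -> Vector: gradient is a 2-D matrix (the Jacobian, one row per output and one column per input)
4. Matrix -> Matrix: gradient is a higher-dimensional array (a 4-D tensor, one entry per output/input pair)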
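
To make the vector -> vector case concrete, here is a sketch that builds a Jacobian numerically. The function `f` (mapping 3 inputs to 2 outputs) and the helper `numerical_jacobian` are assumptions made for illustration:

```python
import numpy as np

def f(x):
    # Toy map from R^3 to R^2
    return np.array([x[0] * x[1], x[1] + x[2]])

def numerical_jacobian(f, x, h=1e-6):
    # One column per input, filled in with central differences
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (f(x + step) - f(x - step)) / (2 * h)
    return J

x = np.array([2.0, 3.0, 4.0])
J = numerical_jacobian(f, x)

# Analytic Jacobian of the toy map, for comparison
J_exact = np.array([[x[1], x[0], 0.0],
                    [0.0,  1.0,  1.0]])
print(J.shape)  # (2, 3): outputs x inputs
```

By the same counting, the full derivative of the `(5,3)` output `D` with respect to the `(5,10)` matrix `W` would have shape `(5,3,5,10)`, which is why the vectorized backward pass avoids forming it explicitly.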