# Autodiff

These are my notes on an appendix of *Hands-On Machine Learning*, 3rd Edition.

---

# Automatic Differentiation
We have the function:

$$f(x, y) = x^2 y + y + 2$$

We need the partial derivatives $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$.
These are usually needed for gradient descent (or another optimization algo).

We can either:
- Work out the derivatives by hand, or with a symbolic tool like sympy // manual differentiation
- Approximate them numerically // finite difference approximation
- Use a library like autograd to calculate the derivatives // automatic differentiation

There are two main ways to do autodiff:
- Forward mode (one pass per input: computes the derivative of every node w.r.t. that input)
- Reverse mode (one pass per output: computes the derivative of that output w.r.t. every node)

## Manual Differentiation

Pick up a piece of paper and use calculus to derive the appropriate equations.
This gets incredibly tedious for complex functions, and it's easy to make mistakes.
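
As a small worked example, assuming $f(x, y) = x^2 y + y + 2$ (the form the code in these notes uses), treating the other variable as a constant gives:

$$\frac{\partial f}{\partial x} = 2xy \qquad \frac{\partial f}{\partial y} = x^2 + 1$$

At $(x, y) = (3, 4)$ that works out to $24$ and $10$.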

## Finite Difference Approximation

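Approximate each partial derivative by nudging one input by a tiny $\varepsilon$ and measuring how much the output changes:

$$\frac{\partial f}{\partial x} \approx \frac{f(x + \varepsilon, y) - f(x, y)}{\varepsilon}$$

Unfortunately, this is imprecise, and it is slow: it needs at least one extra call to $f$ per input.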

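```python
def f(x, y):
    return x**2 * y + y + 2


def derivative(f, x, y, x_eps, y_eps):
    # Forward difference: nudge only one input at a time (the other eps must be 0)
    return (f(x + x_eps, y + y_eps) - f(x, y)) / (x_eps + y_eps)


df_dx = derivative(f, 3, 4, 0.00001, 0)  # exact value: 2xy = 24
df_dy = derivative(f, 3, 4, 0, 0.00001)  # exact value: x^2 + 1 = 10

print(df_dx, df_dy)
```

```
24.000039999805264 10.000000000331966
```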
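
To get a feel for the imprecision, here is a rough sketch that compares the forward-difference estimate of $\frac{\partial f}{\partial x}$ at $(3, 4)$ against the exact value $2xy = 24$ for a few step sizes. The helper name `forward_diff_dx` and the chosen step sizes are my own, purely for illustration.

```python
def f(x, y):
    return x**2 * y + y + 2


def forward_diff_dx(f, x, y, eps):
    # One-sided (forward) difference in x only
    return (f(x + eps, y) - f(x, y)) / eps


exact = 2 * 3 * 4  # analytic df/dx = 2xy at (3, 4)

# Absolute error of the estimate for each step size
errors = {eps: abs(forward_diff_dx(f, 3, 4, eps) - exact)
          for eps in (1e-1, 1e-3, 1e-5)}

for eps, err in errors.items():
    print(f"eps={eps:g}  abs error={err:.2e}")
```

The error shrinks roughly in proportion to the step size here, but it never reaches zero, and pushing the step much smaller eventually runs into floating-point rounding instead.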


## Forward-Mode Autodiff

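The algo goes through the computation graph from inputs to outputs (hence 'forward').
It starts by getting the partial derivatives of the leaf nodes (1 for the variable we're differentiating w.r.t., 0 for everything else).
Then it uses the chain rule to calculate the derivatives of the other nodes.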
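
For instance, assuming $f(x, y) = x^2 y + y + 2$, a forward pass for $\frac{\partial f}{\partial x}$ could go roughly like this, working up from the leaves and applying the product and sum rules one node at a time:

$$
\begin{aligned}
\frac{\partial (x^2)}{\partial x} &= x \cdot 1 + 1 \cdot x = 2x \\
\frac{\partial (x^2 y)}{\partial x} &= x^2 \cdot 0 + 2x \cdot y = 2xy \\
\frac{\partial f}{\partial x} &= 2xy + 0 + 0 = 2xy
\end{aligned}
$$

Getting $\frac{\partial f}{\partial y}$ takes a second pass with the leaf derivatives swapped.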

Forward-mode takes one computation graph and produces another.
This is called symbolic differentiation.
A nice byproduct is that we can reuse the output graph to calculate the derivatives of the given function for any values of $x$ and $y$.
And we can run the algo again on the output graph to get second-order derivatives (and so on).

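But we can also do forward-mode autodiff without creating a graph (so numerically, not symbolically) by computing intermediate results on the fly. One way to do this is with dual numbers: numbers of the form $a + b\varepsilon$, where $\varepsilon \neq 0$ but $\varepsilon^2 = 0$. Evaluating $f(a + \varepsilon)$ gives $f(a) + f'(a)\varepsilon$, so a single evaluation yields both the value and the derivative.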
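
A minimal sketch of the dual-number idea, assuming a hypothetical `Dual` class (my own naming) that supports only `+` and `*`, which is all that $f(x, y) = x^2 y + y + 2$ needs. The `deriv` field carries the coefficient of $\varepsilon$ through every operation.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    value: float  # the ordinary (real) part
    deriv: float  # the coefficient of epsilon

    def __add__(self, other):
        other = _lift(other)
        # Sum rule: (a + b)' = a' + b'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (a * b)' = a * b' + a' * b
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__


def _lift(x):
    # Treat plain numbers as constants (derivative 0)
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def f(x, y):
    return x * x * y + y + 2


# Seed deriv=1 on the input we differentiate w.r.t., 0 on the other
df_dx = f(Dual(3.0, 1.0), Dual(4.0, 0.0)).deriv
df_dy = f(Dual(3.0, 0.0), Dual(4.0, 1.0)).deriv

print(df_dx, df_dy)  # 24.0 10.0
```

Note that each partial derivative needs its own call to `f`, one per input.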

The major flaw of forward-mode autodiff is that it needs one pass through the graph per input, so it's not efficient for functions with many inputs. Not great for deep learning, where there are so many parameters.
This is where reverse-mode autodiff comes in.

## Reverse-Mode Autodiff

It goes through the graph in the forward direction to compute the value of each node, and then does a second, reverse pass to compute all the partial derivatives.

We gradually go through the graph in reverse, computing the partial derivative of the function w.r.t. each consecutive node, until we reach the inputs.
This uses the chain rule, and is called reverse accumulation.
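
The two passes can be sketched in plain Python. This is a toy illustration, not how any particular library does it: `Var` and `backward` are hypothetical names, and only `+` and `*` are supported. The forward pass happens while the expression is evaluated (each node records its value and its local derivatives), and `backward` then walks the nodes in reverse.

```python
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        # Each entry is (parent_node, d(this node)/d(parent_node))
        self.parents = parents
        self.grad = 0.0  # will hold d(output)/d(this node)

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    # Build a topological order so each node is handled after all its consumers
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            # Chain rule: accumulate d(output)/d(parent)
            parent.grad += node.grad * local


x, y = Var(3.0), Var(4.0)
out = x * x * y + y + 2  # forward pass: computes values, records the graph
backward(out)            # reverse pass: fills in .grad on every node

print(out.value, x.grad, y.grad)  # 42.0 24.0 10.0
```

A single call to `backward` yields the derivatives w.r.t. both inputs at once.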

Reverse-mode autodiff is efficient for functions with many inputs, but not so much for many outputs.
It requires only one forward pass plus one reverse pass per output to compute all the partial derivatives of all outputs w.r.t. all the inputs.
When we train neural nets, there's generally only one output (the loss) but many inputs (the parameters), so two passes through the graph are enough.

It can also handle functions that aren't differentiable everywhere, as long as you only ask it for partial derivatives at points where the function is differentiable.
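
As a rough illustration, take ReLU, which has a kink at 0. The hypothetical helper below (my own naming) returns the value together with the local derivative an autodiff system would use at that point. Raising an error at 0 is just a choice made for this sketch; real frameworks typically pick a conventional value there instead.

```python
def relu_and_grad(x):
    """Return (relu(x), d relu / dx) for a scalar x != 0."""
    if x == 0:
        raise ValueError("relu has a kink at 0; the derivative is undefined there")
    # Away from 0 the function is piecewise linear, so the derivative is 1 or 0
    return (x, 1.0) if x > 0 else (0.0, 0.0)


print(relu_and_grad(2.5))   # (2.5, 1.0)
print(relu_and_grad(-1.0))  # (0.0, 0.0)
```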