# Automatic Differentiation (AD)

We all know how to take derivatives.
```julia
f(x) = 5*x^2 + 3

df(x) = 10*x

ddf(x) = 10
```

The promise of AD is that the computer does this for us:

```julia
df(x) = derivative(f, x)

ddf(x) = derivative(df, x)
```

### What AD is not

(See Baydin et al., [Automatic Differentiation in Machine Learning: a Survey](https://www.jmlr.org/papers/volume18/17-468/17-468.pdf), JMLR 18, 2018.)

**Symbolic differentiation:** (at least not exactly)
$$ \frac{d}{dx}x^n = n x^{n-1}. $$

**Numerical differentiation:**
$$ \frac{df}{dx} \approx \frac{f(x+h) - f(x)}{h} $$
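
To illustrate why numerical differentiation is only an approximation, here is a minimal sketch that applies the forward difference above to the introductory function for several step sizes. The names `g`, `dg`, and `forward_difference` are made up for this example. For large `h` the truncation error of the formula dominates, for very small `h` floating-point round-off takes over, so no single step size is exact.

```julia
g(x) = 5 * x^2 + 3      # same function as `f` at the top
dg(x) = 10 * x          # exact derivative, for comparison

# forward difference quotient with step size `h` (hypothetical helper)
forward_difference(f, x, h) = (f(x + h) - f(x)) / h

x0 = 2.0
for h in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15)
    approx = forward_difference(g, x0, h)
    println("h = ", h, "   error = ", abs(approx - dg(x0)))
end
```

AD, in contrast, involves no step size at all.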

## Forward mode AD

The key to AD is the repeated application of the chain rule:
$$\dfrac{d}{dx} f(g(x)) = \dfrac{df}{dg} \dfrac{dg}{dx}$$

Consider the function $f(a,b) = \ln(ab + \sin(a))$.

In [None]:
f(a,b) = log(a*b + sin(a))

In [None]:
f_derivative(a,b) = 1/(a*b + sin(a)) * (b + cos(a))

In [None]:
a = 3.1
b = 2.4
f_derivative(a,b)

Broken down into elementary steps, the function corresponds to the following "*computational graph*":

<img src="imgs/comp_graph.svg" width=300px>

In [None]:
function f_graph(a,b)
    c1 = a*b
    c2 = sin(a)
    c3 = c1 + c2
    c4 = log(c3)
end

In [None]:
f(a,b) == f_graph(a,b)

To calculate $\frac{\partial f}{\partial a}$ we have to apply the chain rule multiple times.

$\dfrac{\partial f}{\partial a} = \dfrac{\partial f}{\partial c_4} \dfrac{\partial c_4}{\partial a} = \dfrac{\partial f}{\partial c_4} \left( \dfrac{\partial c_4}{\partial c_3} \dfrac{\partial c_3}{\partial a}  \right) = \dfrac{\partial f}{\partial c_4} \left( \dfrac{\partial c_4}{\partial c_3} \left( \dfrac{\partial c_3}{\partial c_2} \dfrac{\partial c_2}{\partial a} + \dfrac{\partial c_3}{\partial c_1} \dfrac{\partial c_1}{\partial a}\right)  \right)$

In [None]:
function f_graph_derivative(a,b)
    c1 = a*b
    c1_ϵ = b
    
    c2 = sin(a)
    c2_ϵ = cos(a)
    
    c3 = c1 + c2
    c3_ϵ = c1_ϵ + c2_ϵ
    
    c4 = log(c3)
    c4_ϵ = 1/c3 * c3_ϵ
    
    c4, c4_ϵ
end

In [None]:
f_graph_derivative(a,b)[2] == f_derivative(a,b)

**How can we automate this?**

In [None]:
# D for "dual number", invented by Clifford in 1873.
struct D <: Number
    x::Float64 # value
    ϵ::Float64 # derivative
end

In [None]:
import Base: +, *, /, -, sin, log, convert, promote_rule

a::D + b::D = D(a.x + b.x, a.ϵ + b.ϵ) # sum rule
a::D - b::D = D(a.x - b.x, a.ϵ - b.ϵ)
a::D * b::D = D(a.x * b.x, a.x * b.ϵ + a.ϵ * b.x) # product rule
a::D / b::D = D(a.x / b.x, (b.x * a.ϵ - a.x * b.ϵ)/b.x^2) # quotient rule
sin(a::D) = D(sin(a.x), cos(a.x) * a.ϵ)
log(a::D) = D(log(a.x), 1/a.x * a.ϵ)

Base.convert(::Type{D}, x::Real) = D(x, zero(x))
Base.promote_rule(::Type{D}, ::Type{<:Number}) = D

In [None]:
f(D(a,1), b)

Boom! That was easy!

In [None]:
f(D(a,1), b).ϵ ≈ f_derivative(a,b)

**How does this work?!**

The trick of forward mode AD is to make the computer do the rewrite `f -> f_graph_derivative` for you (and then optimize the resulting code structure).
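
One way to see why the rules defined for `D` are the right ones (a standard argument that the cells above do not spell out) is to read a dual number as $x + x'\epsilon$ with the formal rule $\epsilon^2 = 0$. Products then reproduce the product rule, and the Taylor expansion of a function stops after the linear term:

$$ (a + a'\epsilon)(b + b'\epsilon) = ab + (a b' + a' b)\,\epsilon, \qquad f(x + x'\epsilon) = f(x) + f'(x)\,x'\,\epsilon. $$

Seeding $x' = 1$, as in `D(a,1)`, therefore leaves $f'(x)$ in the `ϵ` field of the result.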

In [None]:
@code_typed f(D(a,1), b)

This output is somewhat hard to parse, but substituting the operations into one another by hand shows that the code is equivalent to

```julia
D.x = log(a.x*b + sin(a.x))
D.ϵ = 1/(a.x*b + sin(a.x)) * (a.x*0 + (a.ϵ*b) + cos(a.x)*a.ϵ)
```

which, if we drop `a.x*0`, set `a.ϵ = 1`, and rename `a.x` $\rightarrow$ `a`, reads

```julia
D.x = log(a*b + sin(a))
D.ϵ = 1/(a*b + sin(a)) * (b + cos(a))
```

This precisely matches our definitions from above:

```julia
f(a,b) = log(a*b + sin(a))

f_derivative(a,b) = 1/(a*b + sin(a)) * (b + cos(a))
```

Importantly, the compiler sees the entire "rewritten" code and can therefore optimize it. In this simple example, the code produced by our simple forward mode AD is essentially identical to the explicit implementation.

In [None]:
@code_llvm debuginfo=:none f_graph_derivative(a,b)

In [None]:
@code_llvm debuginfo=:none f(D(a,1), b)

The approach is general:

In [None]:
# utility function for our small forward AD
derivative(f::Function, x::Number) = f(D(x, one(x))).ϵ

In [None]:
derivative(x->f(x,b), a)

In [None]:
derivative(x->3*x^2+4x+5, 2)

In [None]:
derivative(x->sin(x)*log(x), 3)

Or as a function:

In [None]:
df(x) = derivative(a->f(a,b),x) # partial derivative wrt a

In [None]:
df(1.23)
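
The introduction also promised a second derivative, `ddf(x) = derivative(df, x)`. The type `D` above stores `Float64` fields, so a `D` cannot hold another `D`, and nesting fails. A minimal sketch of one way around this, assuming a parametric dual type, is shown below. The names `Dual`, `nested_derivative`, `g`, `dg`, and `ddg` are hypothetical. Only `+` and `*` are defined, and the sketch ignores subtleties of nested differentiation that real packages handle.

```julia
# Hypothetical parametric variant of `D`: the fields may themselves be duals.
struct Dual{T<:Number} <: Number
    x::T  # value
    ϵ::T  # derivative
end

Base.:+(a::Dual, b::Dual) = Dual(a.x + b.x, a.ϵ + b.ϵ)               # sum rule
Base.:*(a::Dual, b::Dual) = Dual(a.x * b.x, a.x * b.ϵ + a.ϵ * b.x)   # product rule

# Plain real numbers become duals with a zero derivative part.
Base.convert(::Type{Dual{T}}, x::Real) where {T} = Dual(convert(T, x), zero(T))
Base.promote_rule(::Type{Dual{T}}, ::Type{S}) where {T,S<:Real} = Dual{promote_type(T, S)}

# Same idea as `derivative` above, under a different name to avoid a clash.
nested_derivative(f, x) = f(Dual(x, one(x))).ϵ

g(x) = 5 * x * x + 3                 # the introductory example, 5x^2 + 3
dg(x) = nested_derivative(g, x)      # first derivative, 10x
ddg(x) = nested_derivative(dg, x)    # second derivative, via a dual of duals

dg(2.0), ddg(2.0)
```

At `x = 2.0` this should give 20 and 10, matching the hand-written `df(x) = 10*x` and `ddf(x) = 10` from the top of the notebook.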

## Taking derivatives of code: Babylonian sqrt

> Repeat $t \leftarrow (t + x/t)/2$ until $t$ converges to $\sqrt{x}$.

In [None]:
@inline function Babylonian(x; N = 10)
    t = (1+x)/2
    for i = 2:N
        t = (t + x/t)/2
    end
    t
end

In [None]:
Babylonian(2), √2

In [None]:
using Plots

xs = 0:0.01:49

p = plot(title = "Those Babylonians really knew what they were doing")
for i in 1:5
    plot!(p, xs, [Babylonian(x; N=i) for x in xs], label="Iteration $i")
end

plot!(p, xs, sqrt.(xs), label="sqrt", color=:black)

## ... and now the derivative, automagically

The same Babylonian algorithm, without any rewrite, also computes the derivative correctly, as the following check shows.

In [None]:
Babylonian(D(5, 1))

In [None]:
√5, 0.5 / √5

It just works, and the generated code is efficient.

In [None]:
@code_native debuginfo=:none Babylonian(D(5, 1))
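
For comparison, here is a sketch of what the dual-number version effectively computes, written out by hand: each update of `t` is accompanied by the update of its derivative `dt`, obtained from the sum and quotient rules. The function name `babylonian_with_derivative` and the variable `dt` are made up for this illustration.

```julia
function babylonian_with_derivative(x; N = 10)
    t  = (1 + x) / 2
    dt = 1 / 2                       # d/dx of (1 + x)/2
    for i = 2:N
        # Quotient rule: d/dx (x/t) = (t - x*dt) / t^2.
        # The tuple assignment evaluates both right-hand sides with the old t and dt.
        t, dt = (t + x / t) / 2, (dt + (t - x * dt) / t^2) / 2
    end
    t, dt
end

babylonian_with_derivative(5.0), (sqrt(5.0), 0.5 / sqrt(5.0))
```

The two tuples should agree to floating-point accuracy, just like `Babylonian(D(5, 1))` and `√5, 0.5 / √5` above.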

Recursion? Works as well...

In [None]:
pow(x, n) = n <= 0 ? 1 : x*pow(x, n-1)

In [None]:
derivative(x -> pow(x,3), 2)

Differentiating our Vandermonde matrix from yesterday? Sure:

In [None]:
function vander_generic(x::AbstractVector{T}) where T
    m = length(x)
    V = Matrix{T}(undef, m, m)
    for j = 1:m
        V[j,1] = one(x[j])
    end
    for i = 2:m
        for j = 1:m
            V[j,i] = x[j] * V[j,i-1]
        end
    end
    return V
end

In [None]:
a, b, c, d = 2, 3, 4, 5
V = vander_generic([D(a,1), D(b,1), D(c,1), D(d,1)])

In [None]:
(x->getfield(x, :ϵ)).(V)

## Symbolically (because we can)

The following is mathematically equivalent, **though not exactly what the computation is doing**.

In [None]:
using SymPy

In [None]:
x = symbols("x")

# display("Iterations as a function of x")
# for k = 1:5
#     display(simplify(Babylonian(x; N=k)))
# end

display("Derivatives as a function of x")
for k = 1:5
    display(simplify(diff(simplify(Babylonian(x; N=k)), x)))
end

In [None]:
@code_native debuginfo=:none Babylonian(D(5, 1); N=5)

## ForwardDiff.jl

Now that we understand how forward mode AD works, we can use the more feature-complete package [ForwardDiff.jl](https://github.com/JuliaDiff/ForwardDiff.jl).

In [None]:
using ForwardDiff

In [None]:
ForwardDiff.derivative(Babylonian, 2)

In [None]:
@edit ForwardDiff.derivative(Babylonian, 2)

(Note: the derivative rules for elementary functions used by ForwardDiff.jl are collected in [DiffRules.jl](https://github.com/JuliaDiff/DiffRules.jl).)

## Reverse mode AD

Forward mode (derivatives with respect to the input are propagated from the input towards the output):
$\dfrac{\partial f}{\partial a} = \dfrac{\partial f}{\partial c_4} \dfrac{\partial c_4}{\partial a} = \dfrac{\partial f}{\partial c_4} \left( \dfrac{\partial c_4}{\partial c_3} \dfrac{\partial c_3}{\partial a}  \right) = \dfrac{\partial f}{\partial c_4} \left( \dfrac{\partial c_4}{\partial c_3} \left( \dfrac{\partial c_3}{\partial c_2} \dfrac{\partial c_2}{\partial a} + \dfrac{\partial c_3}{\partial c_1} \dfrac{\partial c_1}{\partial a}\right)  \right)$

Reverse mode (derivatives of the output are propagated from the output back towards the input):
$\dfrac{\partial f}{\partial a} = \dfrac{\partial f}{\partial c_1} \dfrac{\partial c_1}{\partial a} + \dfrac{\partial f}{\partial c_2} \dfrac{\partial c_2}{\partial a} = \left( \dfrac{\partial f}{\partial c_3} \dfrac{\partial c_3}{\partial c_1} \right) \dfrac{\partial c_1}{\partial a} + \left( \dfrac{\partial f}{\partial c_3} \dfrac{\partial c_3}{\partial c_2} \right) \dfrac{\partial c_2}{\partial a}, \qquad \dfrac{\partial f}{\partial c_3} = \dfrac{\partial f}{\partial c_4} \dfrac{\partial c_4}{\partial c_3}$

Forward mode AD requires $n$ passes to compute an $n$-dimensional gradient.

Reverse mode AD computes the complete gradient in a single run, but that run consists of two passes through the graph: a forward pass that computes and stores the necessary intermediate values, and a backward pass that computes the gradient.
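
As a hand-written illustration of these two passes, here is a minimal sketch for the graph of $f(a,b) = \ln(ab + \sin(a))$ used earlier. The function name `f_graph_reverse` and the `_bar` variables (each holding the derivative of the output with respect to that node) are hypothetical. It shows the idea only, not how a reverse mode AD package is implemented.

```julia
function f_graph_reverse(a, b)
    # forward pass: compute and keep the intermediate values
    c1 = a * b
    c2 = sin(a)
    c3 = c1 + c2
    c4 = log(c3)

    # backward pass: propagate derivatives of the output back to the inputs
    c4_bar = 1.0                          # ∂f/∂c4, since f = c4
    c3_bar = c4_bar / c3                  # c4 = log(c3)
    c1_bar = c3_bar                       # c3 = c1 + c2
    c2_bar = c3_bar
    a_bar  = c1_bar * b + c2_bar * cos(a) # a enters through c1 and c2
    b_bar  = c1_bar * a                   # b enters through c1 only

    c4, (a_bar, b_bar)
end

f_graph_reverse(3.1, 2.4)
```

A single call returns the function value together with both partial derivatives, whereas the forward mode approach needs one seeded pass per input.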

*Rule of thumb:*

Forward mode is efficient for few inputs and many outputs, e.g. $\mathbb{R} \rightarrow \mathbb{R}^n$, while reverse mode is efficient for many inputs and few outputs, e.g. $\mathbb{R}^n \rightarrow \mathbb{R}$.
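
This rule of thumb can be phrased in terms of the Jacobian $J \in \mathbb{R}^{m \times n}$ of a function $\mathbb{R}^n \rightarrow \mathbb{R}^m$ (a standard formulation; the symbols $J$, $v$, $w$, and $m$ are introduced here and not used elsewhere in this notebook). One forward pass yields a Jacobian-vector product, one reverse pass a vector-Jacobian product:

$$ \text{forward: } v \mapsto J v, \qquad \text{reverse: } w \mapsto w^T J. $$

Choosing $v$ as the $i$-th unit vector gives column $i$ of $J$ (all outputs for one input), while choosing $w$ as the $j$-th unit vector gives row $j$ (all inputs for one output). A gradient is a single row of $J$, hence the single reverse run.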

An efficient source-to-source reverse mode AD is implemented in [Zygote.jl](https://github.com/FluxML/Zygote.jl), the AD underlying [Flux.jl](https://fluxml.ai/) (since version 0.10).

In [None]:
using Zygote

In [None]:
f(x) = 5*x + 3

In [None]:
gradient(f, 5)

In [None]:
@code_llvm debuginfo=:none gradient(f,5)

## Some nice reads

Lectures:

* https://mitmath.github.io/18337/lecture8/automatic_differentiation.html

Blog posts:

* ML in Julia: https://julialang.org/blog/2018/12/ml-language-compiler

* Nice example: https://fluxml.ai/2019/03/05/dp-vs-rl.html

* Nice interactive examples: https://fluxml.ai/experiments/

* Why Julia for ML? https://julialang.org/blog/2017/12/ml&pl

* Neural networks with differential equation layers: https://julialang.org/blog/2019/01/fluxdiffeq

* Implement Your Own Automatic Differentiation with Julia in ONE day: http://blog.rogerluo.me/2018/10/23/write-an-ad-in-one-day/

* Implement Your Own Source To Source AD in ONE day!: http://blog.rogerluo.me/2019/07/27/yassad/

Repositories:

* AD flavors, like forward and reverse mode AD: https://github.com/MikeInnes/diff-zoo (Mike is one of the smartest Julia ML heads)

Talks:

* AD is a compiler problem: https://juliacomputing.com/assets/pdf/CGO_C4ML_talk.pdf