# Automatic Differentiation
Automatic differentiation (AD) is an extremely powerful tool. In Julia, you can differentiate almost any valid Julia program to obtain derivatives, gradients, Jacobians, and Hessians automatically and with high performance. AD makes up the bulk of most deep-learning (DL) libraries, but unlike in most of those libraries, you do not need to write your code in a subset of the language or in a DL-specific language; you can just write regular Julia code.

There are several kinds of AD. In the following, we will refer to a function
$$ f : \mathbb{R}^n \to \mathbb{R}^m $$

## Forward-mode AD
Using dual numbers, forward-mode AD (FAD) propagates derivatives alongside values in a single forward pass of your program, so the function value and its derivatives with respect to the seeded inputs are calculated in one go. FAD is algorithmically favorable when $f$ is "few to many", that is, $n < m$. It also typically has the least overhead, so it is competitive when both $n$ and $m$ are small.
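
The dual-number idea can be illustrated with a minimal sketch. The type `MyDual` and the helper `derivative` are hypothetical names used for illustration only; the `Dual` type in ForwardDiff is more general (it carries several partial derivatives at once and supports many more functions).

```julia
# Hypothetical minimal dual number: `val` is the value, `der` the derivative.
struct MyDual
    val::Float64
    der::Float64
end

# Overload a few Base functions with the usual differentiation rules.
Base.:+(a::MyDual, b::MyDual) = MyDual(a.val + b.val, a.der + b.der)
Base.:*(a::MyDual, b::MyDual) = MyDual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.sin(a::MyDual) = MyDual(sin(a.val), cos(a.val) * a.der)

# Seed the input with derivative 1 and read the derivative off the result.
derivative(f, x::Real) = f(MyDual(x, 1.0)).der

h(x) = x * sin(x) + x      # regular Julia code, nothing AD-specific
derivative(h, 2.0)         # analytically: sin(2) + 2cos(2) + 1
```

Since `h` only uses `+`, `*` and `sin`, the three overloads are enough here; a function that calls anything else would need further rules.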

## Reverse-mode AD
This is the mode used in DL libraries. Reverse-mode AD (RAD) works by constructing a computation graph, either before execution (as in old TensorFlow) or during execution (most common today).
RAD is algorithmically favorable when $f$ is "many to few", that is, $n > m$. This is the case in most DL, where the cost function is a scalar-valued function of very many parameters, the NN weights. For functions with many outputs, forward mode is typically the better choice.
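
The role of $n$ and $m$ can be made concrete with the Jacobian $J \in \mathbb{R}^{m \times n}$ of $f$. As a rough sketch that ignores constant factors and implementation details such as processing several directions per pass: one forward-mode pass yields a Jacobian-vector product, and one reverse-mode pass yields a vector-Jacobian product,
$$ \text{forward: } v \mapsto J v, \quad v \in \mathbb{R}^n, \qquad \text{reverse: } w \mapsto w^\top J, \quad w \in \mathbb{R}^m $$
Building the full Jacobian therefore takes about $n$ forward passes (one per unit vector $v = e_i$) or about $m$ reverse passes (one per unit vector $w = e_j$). For a scalar cost function, $m = 1$, a single reverse pass with $w = 1$ gives the whole gradient
$$ \nabla f(x)^\top = 1 \cdot J $$
whereas forward mode needs on the order of $n$ passes.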

In [23]:
using Pkg; Pkg.activate(".")
using ForwardDiff, BenchmarkTools

f(x) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);

x = rand(5) # small size 
g = x -> ForwardDiff.gradient(f, x); # g = ∇f
g(x)

  Activating project at `~/repos/juliacourse2022/lecture6_optimization_learning`


5-element Vector{Float64}:
 1.1232106411559188
 0.9836039492002165
 0.8391388924682177
 1.0341481020949057
 0.9385659002702967

In [24]:
@btime g($x);

  389.748 ns (4 allocations: 848 bytes)


In [25]:
ForwardDiff.hessian(f, x)

5×5 Matrix{Float64}:
 -0.0151824   0.421806   0.302021   0.555813   0.359894
  0.421806   -0.307795   0.121435   0.223572   0.14471
  0.302021    0.121435  -0.535026   0.160064   0.103596
  0.555813    0.223572   0.160064  -0.19853    0.19074
  0.359894    0.14471    0.103596   0.19074   -0.392262

In [26]:
using Zygote

f'(x) ≈ g(x) # f' is the gradient of f computed by Zygote (reverse mode)

true

In [27]:
@btime f'($x);

  1.672 μs (30 allocations: 1.45 KiB)


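If we increase the size of the input vector, the relative timings change: with 5000 inputs and a single output, reverse mode (Zygote) is much faster than forward mode (ForwardDiff).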

In [28]:
x = rand(5000)
@btime g($x);

  50.424 ms (6 allocations: 549.25 KiB)


In [29]:
@btime f'($x);

  106.030 μs (39 allocations: 469.64 KiB)


Most AD packages in Julia (Zygote and Yota are exceptions) work by overloading `Base` functions on custom types. This means that you cannot use AD if you restrict the input types of your functions too much! In the following example, the input is restricted to `Vector{Float64}`.

In [30]:
x = abs.(randn(3))
f2(x::Vector{Float64}) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);
f2(x)

104.18089589896489

In [31]:
ForwardDiff.gradient(f2, x);

LoadError: MethodError: no method matching f2(::Vector{ForwardDiff.Dual{ForwardDiff.Tag{typeof(f2), Float64}, Float64, 3}})
Closest candidates are:
  f2(::Vector{Float64}) at In[30]:2

This did not work, since ForwardDiff calls the function with an argument of type `Vector{<:Dual}`, which does not match the `Vector{Float64}` signature.

In [None]:
f2'(x)

This works since Zygote does not use dispatch on custom types.

In [None]:
f3(x::Vector) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);
ForwardDiff.gradient(f3, x)
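
The `Vector` annotation in `f3` accepts any element type. If you want to keep some restriction on the input while staying compatible with operator-overloading AD, one option is to constrain against abstract types instead. The following is a minimal sketch, where `f4` is a hypothetical name that does not appear in the notebook; it relies on `ForwardDiff.Dual` being a subtype of `Real`.

```julia
using ForwardDiff

# Hypothetical variant of f2/f3: any vector whose elements are real numbers.
# Dual numbers are a subtype of Real, so ForwardDiff can still call this method.
f4(x::AbstractVector{<:Real}) = sum(sin, x) + prod(tan, x) * sum(sqrt, x)

x = abs.(randn(3))
ForwardDiff.gradient(f4, x)
```

Non-numeric input such as a `Vector{String}` is still rejected with a `MethodError`, so the signature keeps documenting the intent without blocking AD.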