<h1 style="color:rgb(0,120,170)">Neural Networks and Deep Learning</h1>
<h2 style="color:rgb(0,120,170)">Flux Introduction</h2>

For the uninitiated, Flux.jl is currently (September 2021) the most starred package in the
Julia ecosystem, and it is the go-to package for deep neural networks in Julia.


Autograd: Automatic Differentiation
===================================

Automatic differentiation (the topic of this notebook) is
Flux's core feature. It takes a Julia function `f` and a set of arguments, and returns the
gradient of `f` with respect to each argument.


*Ref: This tutorial is based on
[this example tutorial from Flux's GitHub page](https://github.com/FluxML/model-zoo/blob/master/tutorials/60-minute-blitz/60-minute-blitz.jl).*

In [13]:
# First, let's import Flux as a package. This comes with the function `gradient` from Zygote.jl
using Flux

f(x) = 3x^2 + 2x + 1

# Returns the gradient at 0.0.
gradient(f,0.0)

(2.0,)

The `gradient` function uses automatic differentiation to calculate
the derivative, and it is not limited to polynomials.

Non-polynomial functions work as well: for example, `gradient(exp, 0.0)` returns `(1.0,)`.
What `gradient` does require is that the function returns a scalar.

Below, we write another example for a function of three variables.

In [14]:
h(x,y,z) = y^3 + x^3 + x*y + z

gradient(h,1,0,0)

(3.0, 1.0, 1)
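
The output above can be cross-checked without any automatic differentiation. The sketch below approximates each partial derivative of `h` with central finite differences, using only Base Julia. The names `central_diff` and `stepsize` are hypothetical helpers introduced here for illustration; they are not part of Flux or Zygote.

```julia
# Same function as in the cell above.
h(x, y, z) = y^3 + x^3 + x*y + z

# Hypothetical helper: central finite-difference approximation of the partial
# derivative of `f` with respect to argument number `i`, evaluated at `args`.
function central_diff(f, args, i; stepsize = 1e-5)
    up   = [j == i ? a + stepsize : float(a) for (j, a) in enumerate(args)]
    down = [j == i ? a - stepsize : float(a) for (j, a) in enumerate(args)]
    return (f(up...) - f(down...)) / (2 * stepsize)
end

point  = (1, 0, 0)
approx = [central_diff(h, point, i) for i in 1:3]

# Hand-derived partials: ∂h/∂x = 3x^2 + y, ∂h/∂y = 3y^2 + x, ∂h/∂z = 1.
exact = [3 * 1^2 + 0, 3 * 0^2 + 1, 1]

@show approx
@show isapprox(approx, exact; atol = 1e-6)
```

Finite differences are only approximate, so the comparison uses a tolerance; automatic differentiation gives the derivative up to floating-point rounding.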

This is where things get interesting: we can take gradients
with respect to arrays.

Take for example
$Ax + b$, where $A$ is a 2 by 2 matrix, and $x$ and $b$ are two-dimensional vectors.
This function returns another vector, so it has no gradient,
only a Jacobian. For this situation, we use the `jacobian` function from Zygote,
which Flux does not re-export.

In [15]:
using Zygote: jacobian

f(A,x,b) = A*x .+ b

A = [1 2
     3 3]
b = [1,1]
x = [0,0]

jacobian(f,A,x,b)

([0 0 0 0; 0 0 0 0], [1 2; 3 3], [1 0; 0 1])
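
The shape of each block in this output follows from writing the function in components. A short derivation, assuming Zygote's convention of flattening each argument so that a Jacobian has one row per output entry and one column per input entry:

$$
f_i = \sum_{j} A_{ij} x_j + b_i, \qquad
\frac{\partial f_i}{\partial x_j} = A_{ij}, \qquad
\frac{\partial f_i}{\partial b_j} = \delta_{ij}, \qquad
\frac{\partial f_i}{\partial A_{kl}} = \delta_{ik}\, x_l
$$

So the Jacobian with respect to $x$ is $A$ itself, the one with respect to $b$ is the identity, and the one with respect to $A$ is a 2 by 4 matrix that vanishes here only because $x = (0, 0)$.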

The reason `jacobian` is not re-exported by Flux is that neural networks
usually only need the gradient in order to perform backpropagation. Hence,
instead of $Ax + b$, we work with scalar-valued functions such as
$\sum_{i=1}^n \left((Ax)_i + b_i\right)$, which do have a gradient. See the example below.

In [38]:
f(A,x,b) = sum(A*x .+ b)

A = [1 2
     3 3]
b = [1,1]
x = [0,0]

gradient(f,A,x,b)

([0 0; 0 0], [4, 5], Fill(1, 2))
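
Each entry of this result can be reproduced by hand. The sketch below uses only Base Julia and the derivatives of $\sum_i \left((Ax)_i + b_i\right)$ worked out on paper; the names `grad_A`, `grad_x` and `grad_b` are introduced here for illustration.

```julia
A = [1 2
     3 3]
b = [1, 1]
x = [0, 0]

# For f(A, x, b) = sum(A*x .+ b), differentiating by hand gives:
grad_A = ones(Int, 2) * x'     # ∂f/∂A[i,j] = x[j]; all zeros here because x = [0, 0]
grad_x = A' * ones(Int, 2)     # ∂f/∂x[j] = sum of column j of A
grad_b = ones(Int, 2)          # ∂f/∂b[i] = 1

@show grad_A
@show grad_x
@show grad_b
```

The gradient with respect to `A` is zero only because of the chosen `x`; a non-zero `x` would place `x[j]` in every row of column `j`.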

<div class="alert alert-info"><p>
<strong>Note: </strong>the last element of the output of the `gradient` function is `Fill(1, 2)`.
This is just a compact way to represent a vector of length 2 whose elements are all equal to 1.
If you want to understand more about it, copy the code below and run it in a cell to see the output.
</p>

```julia
using FillArrays
@show collect(Fill(1,2))
@show collect(Fill(3.5,2,2))
```

</div>

Even more impressively, `gradient` can differentiate functions defined programmatically,
for example with control flow. Take a look at the example below.

In [27]:
function mycrazyfunction(x)
    if x ≥ 1
        return sin(x)
    elseif -1 < x < 1
        return exp(x)
    else
        return x^2
    end
end

@show gradient(mycrazyfunction,0)[1] == exp(0)
@show gradient(mycrazyfunction,1)[1] == cos(1)
@show gradient(mycrazyfunction,-10)[1] == 2*-10;

(gradient(mycrazyfunction, 0))[1] == exp(0) = true
(gradient(mycrazyfunction, 1))[1] == cos(1) = true
(gradient(mycrazyfunction, -10))[1] == 2 * -10 = true
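
Control flow is not limited to `if`/`else` branches. As a minimal sketch, assuming the `gradient` function that Flux re-exports from Zygote, the hypothetical helper `power_loop` below computes $x^n$ with a `for` loop, and its derivative is compared with the closed form $n x^{n-1}$.

```julia
using Zygote: gradient

# Hypothetical helper: computes x^n by repeated multiplication in a loop.
function power_loop(x, n)
    result = 1.0
    for _ in 1:n
        result *= x
    end
    return result
end

# Differentiate with respect to x only; n is captured by the closure.
g = gradient(x -> power_loop(x, 3), 2.0)[1]

@show g
@show g ≈ 3 * 2.0^2
```

Comparing with `≈` rather than `==` avoids depending on exact floating-point rounding.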


In [25]:
myloss(W, b, x) = sum(W * x .+ b)

W = randn(3, 5)
b = zeros(3)
x = rand(5)

gradient(myloss, W, b, x)[3]


5-element Vector{Float64}:
 -3.2740154025668464
  1.7271538165976312
  1.0671549391866826
  1.1165473529223182
  0.94526273534884


In [5]:
# Automatic Differentiation
# -------------------------

# You probably learned to take derivatives in school. We start with a simple
# mathematical function like

f(x) = 3x^2 + 2x + 1

f(5)

# In simple cases it's pretty easy to work out the gradient by hand – here it's
# `6x+2`. But it's much easier to make Flux do the work for us!

df(x) = gradient(f, x)[1]

df(5)

# You can try this with a few different inputs to make sure it's really the same
# as `6x+2`. We can even do this multiple times (but the second derivative is a
# fairly boring `6`).

ddf(x) = gradient(df, x)[1]

ddf(5)

# Flux's AD can handle most Julia code you throw at it, including loops,
# recursion and custom layers, so long as the mathematical functions you call
# are differentiable (code that mutates arrays is a notable exception). For
# example, we can differentiate a Taylor approximation to the `sin` function.

mysin(x) = sum((-1)^k*x^(1+2k)/factorial(1+2k) for k in 0:5)

x = 0.5

mysin(x), gradient(mysin, x)
#-
sin(x), cos(x)

# You can see that the derivative we calculated is very close to `cos(x)`, as we
# expect.

# This gets more interesting when we consider functions that take *arrays* as
# inputs, rather than just a single number. For example, here's a function that
# takes a matrix and two vectors (the definition itself is arbitrary)

myloss(W, b, x) = sum(W * x .+ b)

W = randn(3, 5)
b = zeros(3)
x = rand(5)

gradient(myloss, W, b, x)

# Now we get gradients for each of the inputs `W`, `b` and `x`, which will come
# in handy when we want to train models.

# Because ML models can contain hundreds of parameters, Flux provides a slightly
# different way of writing `gradient`. We instead collect the arrays with `params` to
# indicate that we want their derivatives. `W` and `b` represent the weight and
# bias respectively.

([0.5885336103564676 0.08707929087479926 … 0.2158671035677726 0.05615764632970266; 0.5885336103564676 0.08707929087479926 … 0.2158671035677726 0.05615764632970266; 0.5885336103564676 0.08707929087479926 … 0.2158671035677726 0.05615764632970266], Fill(1.0, 3), [-0.2824871946941474, 1.215211566576241, 2.385026797366619, 2.2394062953720333, -1.6592430716463225])