## Flux.jl

In this notebook, we go through the basic usage of Flux, a Julia library that provides the building blocks for neural networks.

In [1]:
using Flux

### Gradients

Flux offers automatic differentiation through the `gradient` function, which applies the chain rule for you. Derivatives of most standard functions are already defined.

In [5]:
f(x) = 3x^2 + 2x + 1;
df(x) = 6x + 2;

In [10]:
df.([-1,1,3])

3-element Array{Int64,1}:
 -4
  8
 20

In [8]:
gradient(f,1)

(8,)

In [14]:
df(x) = gradient(f,x)[1]
df.([-1,1,3])

3-element Array{Int64,1}:
 -4
  8
 20

The second-order derivative is just as easy: differentiate `df` again.

In [17]:
d2f(x) = gradient(df, x)[1]
d2f.([-1,1,3])

3-element Array{Int64,1}:
 6
 6
 6
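As a sanity check, the derivatives above can be compared against finite differences. The sketch below is plain Julia and does not use Flux; `central_diff` and `second_diff` are hypothetical helper names introduced here for illustration, and the step sizes are arbitrary choices.

```julia
f(x) = 3x^2 + 2x + 1

# Central finite difference for the first derivative (hypothetical helper)
central_diff(f, x; h = 1e-5) = (f(x + h) - f(x - h)) / (2h)

# Three-point finite difference for the second derivative (hypothetical helper)
second_diff(f, x; h = 1e-3) = (f(x + h) - 2f(x) + f(x - h)) / h^2

xs = [-1, 1, 3]
approx_df  = central_diff.(f, xs)   # should be close to 6x + 2
approx_d2f = second_diff.(f, xs)    # should be close to 6
```

The approximations agree with the values returned by `gradient` up to floating-point error.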

Gradients of composite functions work naturally.

In [20]:
g(x) = sin(f(x)) + cos(f(x)) 

g (generic function with 1 method)

In [23]:
gradient(g, 0)

(-0.6023373578795135,)
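The same number can be obtained by applying the chain rule by hand: g'(x) = (cos(f(x)) - sin(f(x))) · f'(x). The sketch below is plain Julia; `df_manual` and `dg_manual` are names introduced here for illustration.

```julia
f(x) = 3x^2 + 2x + 1
df_manual(x) = 6x + 2

# Chain rule: d/dx [sin(f) + cos(f)] = (cos(f) - sin(f)) * f'
dg_manual(x) = (cos(f(x)) - sin(f(x))) * df_manual(x)

dg_manual(0)
```

At `x = 0` we have `f(0) = 1` and `f'(0) = 2`, so the result is `2 * (cos(1) - sin(1))`, which matches the value returned by `gradient(g, 0)`.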

We can also differentiate multivariate functions.

In [30]:
h(x, y) = 2 .*x .*y .+ x.^2 

h (generic function with 1 method)

In [31]:
gradient(h, 1, 2)

(6, 2)

But the output has to be a scalar; a vector-valued output raises an error.

In [33]:
a, b = [1, 0], [2, 1]
h(a,b)

2-element Array{Int64,1}:
 5
 0

In [34]:
gradient(h, a, b)

ErrorException: Output should be scalar; gradients are not defined for output [5, 0]
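One common way around this is to reduce the output to a scalar, for example with `sum`, before differentiating. The sketch below uses only `gradient`, as elsewhere in this notebook; the floating-point inputs are an assumption made here so that the result types are unambiguous.

```julia
using Flux  # provides `gradient`, as used throughout this notebook

h(x, y) = 2 .* x .* y .+ x .^ 2
a, b = [1.0, 0.0], [2.0, 1.0]

# Reduce the vector output to a scalar, then differentiate w.r.t. both inputs
ga, gb = gradient((x, y) -> sum(h(x, y)), a, b)
```

Element-wise, the partial derivatives are `2y + 2x` with respect to `x` and `2x` with respect to `y`, so `ga` should equal `2 .* b .+ 2 .* a` and `gb` should equal `2 .* a`.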

The `params` function collects the trainable parameters of its arguments. By default, every array is trainable.

In [72]:
params([1,3,5])

Params([[1, 3, 5]])

In [73]:
params(zeros(3,5))

Params([[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]])

In [74]:
params(1)

Params([])

In [66]:
x = [2, 1];
y = [2, 0];
f(x,y) = sum((x .- y).^2)

f (generic function with 2 methods)

In [42]:
params(x)

Params([[2, 1]])

In [44]:
params(x,y)

Params([[2, 1], [2, 0]])

In [43]:
params(f)

Params([])

The primary use of `params` is to provide an iterable over all trainable parameters of its arguments (e.g. a NN layer). We can then collect the gradients with respect to all of them at once.

In [68]:
# do-block syntax, as used with map(): the block is passed as the first argument of gradient
gs = gradient(params(x, y)) do
    f(x,y)
end

Grads(...)

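This is similar to the following, which returns the gradient values directly. Note that inside the anonymous function `x` refers to the current element of `params(x,y)`, so the gradient is evaluated at the scalars `x[1]` and `y[2]`.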

In [48]:
map(x->gradient(f,x[1],y[2]), params(x,y))

2-element Array{Tuple{Int64,Int64},1}:
 (4, -4)
 (4, -4)

Now `gs` contains the gradients of `f` with respect to `x` and `y`, evaluated at the current values of `x` and `y`.

In [69]:
gs[x]

2-element Array{Int64,1}:
 0
 2

In [51]:
gs[y]

2-element Array{Int64,1}:
  0
 -2

The gradients are keyed by the array object itself, not by its value, so an equal but distinct array is not found. Honza will talk about this more.

In [62]:
z = [2,1]
gs[z]

KeyError: KeyError: key [2, 1] not found

In [63]:
z = x
gs[z]

2-element Array{Int64,1}:
 0
 2
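The lookup behaviour above matches that of Julia's `IdDict`, which compares keys by object identity (`===`) rather than by value (`==`). The plain-Julia sketch below assumes that `gs` behaves like an `IdDict` keyed by the parameter arrays; the names `u`, `v`, `alias` and the dictionary `d` with its string value are placeholders introduced here.

```julia
u = [2, 1]
d = IdDict{Any,Any}(u => "gradient for u")

v = [2, 1]          # same value, but a different array object
same_value  = (v == u)
same_object = (v === u)
found_copy  = haskey(d, v)

alias = u           # a second name for the very same object
found_alias = haskey(d, alias)
```

Here `v == u` holds but `v === u` does not, so the lookup with `v` fails, while the lookup with `alias` succeeds.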

Next, a simple linear regression model.

In [95]:
W = rand(2, 5) # weight matrix

2×5 Array{Float64,2}:
 0.423072  0.61746   0.741955   0.564364  0.677725
 0.952557  0.483217  0.0400681  0.517426  0.00170474

In [96]:
b = rand(2) # bias vector

2-element Array{Float64,1}:
 0.9660418636262085
 0.31926599874752615

In [97]:
# setup the prediction and loss functions
predict(x) = W*x .+ b

function loss(x, y)
  ŷ = predict(x)
  sum((y .- ŷ).^2)
end

x, y = rand(5), rand(2) # Dummy data
loss(x, y) # ~ 3

4.3844902613055385

This gives us the gradients of `loss` with respect to `W` and `b`, evaluated at `x` and `y`.

In [98]:
gs = gradient(() -> loss(x, y), params(W, b))

Grads(...)

In [99]:
gs[b]

2-element Array{Float64,1}:
 4.158329854178062
 0.4962397294389196

In [102]:
gs[W]

2×5 Array{Float64,2}:
 0.971839  2.86115   3.59914   1.85862   2.56374
 0.115976  0.341439  0.429508  0.221801  0.305947

Let's do one step of gradient descent on `W`.

In [110]:
ΔW = gs[W]
W .-= 0.1 .* ΔW # equivalent to W .= W .- 0.1 .* ΔW

2×5 Array{Float64,2}:
 0.0343368  -0.526999  -0.697701  -0.179084  -0.347771
 0.906167    0.346641  -0.131735   0.428705  -0.120674

In [111]:
loss(x,y)

1.0340910128981746
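The single step above can be repeated in a loop. The sketch below is plain Julia and uses the analytic gradients of the squared-error loss instead of `gradient`: with residual `r = y - ŷ`, the gradient with respect to `W` is `-2 r xᵀ` and with respect to `b` it is `-2 r`. The fixed data values are made up here so that the run is reproducible; unlike the cell above, the sketch also updates `b`.

```julia
# Arbitrary fixed toy data (an assumption of this sketch, not the notebook's random values)
W = [0.4 0.6 0.7 0.5 0.6; 0.9 0.4 0.0 0.5 0.0]
b = [0.9, 0.3]
x = [0.1, 0.5, 0.7, 0.3, 0.4]
y = [0.2, 0.8]

predict(x) = W * x .+ b
loss(x, y) = sum((y .- predict(x)) .^ 2)

η = 0.1                 # learning rate
losses = Float64[]
for step in 1:5
    r = y .- predict(x)             # residual
    W .-= η .* (-2 .* r * x')       # dloss/dW = -2 r xᵀ
    b .-= η .* (-2 .* r)            # dloss/db = -2 r
    push!(losses, loss(x, y))
end
losses
```

With this learning rate the recorded losses shrink at every step.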

### Building your own layer

This is the general way to construct your own layers. The `Affine` layer below computes the same function as the generic `Dense` layer with the default (identity) activation.

In [112]:
struct Affine
  W
  b
end

# constructor
Affine(in::Integer, out::Integer) =
  Affine(randn(out, in), randn(out))

# Overload call, so the object can be used as a function
(m::Affine)(x) = m.W * x .+ m.b

In [115]:
a = Affine(10, 5)
x = rand(10)
a(x)

5-element Array{Float64,1}:
  2.3171259850207164
  0.29932674159907513
  3.414888241112594
  1.8948799753517813
 -1.5603099414126003

This is how you would train it.

In [123]:
y = randn(5)
loss(x) = Flux.mse(a(x), y)

loss (generic function with 2 methods)

In [119]:
?Flux.mse

```
mse(ŷ, y)
```

Return the mean squared error between ŷ and y; calculated as `sum((ŷ .- y).^2) / length(y)`.

# Examples

```jldoctest
julia> Flux.mse([0, 2], [1, 1])
1//1
```


In [124]:
gradient(loss, x)

([1.2645227146100988, -1.1568070613619492, -0.7253953299846515, 2.428544523863895, 2.3968036081983337, -0.7109940253003317, -2.8376733229419826, 2.240541381593441, 6.365583385100586, -0.28770714590399205],)

In [134]:
gs = gradient(() -> loss(x), params(a.W, a.b))

Grads(...)

In [136]:
gs[a.W]

5×10 Array{Float64,2}:
  0.731564     0.79207      0.602092    …   0.783689     0.217362
 -0.00500343  -0.00541725  -0.00411792     -0.00535993  -0.00148662
  1.23973      1.34227      1.02033         1.32807      0.368349
  0.712987     0.771957     0.586803        0.763788     0.211842
 -0.393937    -0.426518    -0.324218       -0.422005    -0.117046

But this is a bit awkward. Instead, we would like to be able to call `params(a)`.

In [137]:
params(a)

Params([])

For that, we need to call the `@functor` macro on the structure we have created.

In [139]:
Flux.@functor Affine

This does a lot of work in the background: it enables collection of parameters and also GPU operations.

In [140]:
params(a)

Params([[0.9059637991871812 -1.204948729220692 … 1.6104494996658223 -1.5921129291004092; 0.4464281307087328 -0.30004611126937125 … -0.9451035515937852 0.2793767943057875; … ; -0.06837077662362061 0.4395277607620131 … 2.4157883658205264 -1.2245245948404366; -1.3992451498455167 0.9638225402250484 … -1.5880690601500653 -1.108842454225542], [0.5973366619951, 0.4343777167145093, 0.20764423746494565, 0.5725076729533816, -0.3634501916507353]])

In [145]:
gs = gradient(() -> loss(x), params(a))
gs[a.W]

Grads(...)

We can also define which parameters are trainable.

In [141]:
Flux.trainable(a::Affine) = (a.W,)

In [144]:
params(a)

Params([[0.9059637991871812 -1.204948729220692 … 1.6104494996658223 -1.5921129291004092; 0.4464281307087328 -0.30004611126937125 … -0.9451035515937852 0.2793767943057875; … ; -0.06837077662362061 0.4395277607620131 … 2.4157883658205264 -1.2245245948404366; -1.3992451498455167 0.9638225402250484 … -1.5880690601500653 -1.108842454225542]])

This can also be achieved with `@functor`.

In [143]:
Flux.@functor Affine (W,)

### Layer chaining

Layer chaining is useful for constructing larger networks.

In [58]:
m = Chain(
    Dense(5,10,relu),
    Dense(10,1, sigmoid)
)

Chain(Dense(5, 10, relu), Dense(10, 1, σ))

In [59]:
x = randn(5,10) # a batch of 10 five-dimensional samples
y = rand([0,1], 1, 10) # labels
x[1,vec(y).==1] .= -3 # make a connection between the labels and the data

5-element view(::Array{Float64,2}, 1, [3, 4, 5, 6, 9]) with eltype Float64:
 -3.0
 -3.0
 -3.0
 -3.0
 -3.0

In [60]:
m(x)

1×10 Array{Float32,2}:
 0.392138  0.438458  0.133978  0.167633  …  0.441877  0.129883  0.581017

### Optimisers

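Optimisers determine how gradients are turned into parameter updates.

The loss is the mean binary cross-entropy, which is well suited to binary classification. The logit version is used for better numerical stability. Note that `logitbinarycrossentropy` expects raw logits, while `m` already ends with `sigmoid`, so the sigmoid is effectively applied twice here; dropping the final `sigmoid` from `m` (or using `binarycrossentropy`) would match the formula below.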

\begin{equation*}
    l(y,\hat{y}) = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]
\end{equation*}

In [61]:
loss(x,y) = sum(Flux.logitbinarycrossentropy.(m(x), y))/size(y,2)

loss (generic function with 1 method)

In [62]:
loss(x, y)

0.7894722f0
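To make the relation between the formula above and the logit version concrete, here is a plain-Julia sketch that does not use Flux. The names `sigm`, `bce` and `logit_bce` are hypothetical helpers, and the inputs are made up. With `ŷ = sigm(z)`, the per-sample cross-entropy simplifies to `(1 - y) * z + log(1 + exp(-z))`, which avoids taking the logarithm of a value close to 0 or 1.

```julia
sigm(z) = 1 / (1 + exp(-z))

# Per-sample binary cross-entropy on probabilities, as in the formula above
bce(ŷ, y) = -(y * log(ŷ) + (1 - y) * log(1 - ŷ))

# The same quantity computed directly from the logit z, where ŷ = sigm(z)
logit_bce(z, y) = (1 - y) * z + log1p(exp(-z))

z = [-2.0, 0.5, 3.0]     # made-up logits
y = [0, 1, 1]            # made-up labels

mean_bce       = sum(bce.(sigm.(z), y)) / length(y)
mean_logit_bce = sum(logit_bce.(z, y)) / length(y)
```

Both means agree up to floating-point error, which is why a logit loss is normally paired with a model whose last layer has no sigmoid.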

First we collect the gradients with respect to parameters of `m`.

In [63]:
θ = params(m)
grads = gradient(() -> loss(x,y), θ)

Grads(...)

Then we update the parameters.

In [64]:
using Flux.Optimise: update!

η = 0.1 # Learning Rate
for p in θ
  update!(p, -η .* grads[p])
end

In [65]:
loss(x, y)

0.7920066f0

Note that the loss went up slightly instead of down. In this version of Flux, `update!(p, Δ)` computes `p .-= Δ`, so passing `-η .* grads[p]` takes a step up the gradient; `update!(p, η .* grads[p])` would descend.

The ADAM optimiser. Remember not to re-create it every time you optimise with it, because it keeps internal state (moment estimates) between updates.

In [66]:
opt = ADAM(0.01)

ADAM(0.01, (0.9, 0.999), IdDict{Any,Any}())

In [67]:
opt.state

IdDict{Any,Any} with 0 entries

In [68]:
for p in θ
  update!(opt, p, grads[p])
end

In [69]:
loss(x,y)

0.78266543f0

In [70]:
opt.state

IdDict{Any,Any} with 4 entries:
  Float32[-0.252127 0.2386… => (Float32[0.00307778 0.0042404 … -0.00307911 0.00…
  Float32[-0.00531625]      => (Float32[0.00468375], Float32[2.19375f-6], (0.81…
  Float32[0.00891607, -0.0… => (Float32[-0.00108392, 0.0010806, 0.00114829, 0.0…
  Float32[0.475649 0.23809… => (Float32[-0.000741009 -0.000553543 … -0.00084561…
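To illustrate why this state matters, here is a minimal plain-Julia sketch of an Adam-style update that keeps per-parameter moment estimates in an `IdDict`. It is not Flux's implementation: `ToyAdam` and `step!` are hypothetical names, and the details (bias correction via a step counter, `ϵ = 1e-8`) follow the standard description of Adam.

```julia
struct ToyAdam
    η::Float64
    β1::Float64
    β2::Float64
    state::IdDict{Any,Any}   # parameter array => (m, v, step count)
end
ToyAdam(η) = ToyAdam(η, 0.9, 0.999, IdDict{Any,Any}())

function step!(opt::ToyAdam, p, g; ϵ = 1e-8)
    m, v, t = get!(opt.state, p, (zero(p), zero(p), 0))
    t += 1
    m = opt.β1 .* m .+ (1 - opt.β1) .* g        # first-moment estimate
    v = opt.β2 .* v .+ (1 - opt.β2) .* g .^ 2   # second-moment estimate
    opt.state[p] = (m, v, t)
    m̂ = m ./ (1 - opt.β1^t)                     # bias correction
    v̂ = v ./ (1 - opt.β2^t)
    p .-= opt.η .* m̂ ./ (sqrt.(v̂) .+ ϵ)         # in-place update, p stays the same object
    return p
end

p = [1.0, -2.0]
opt = ToyAdam(0.01)
for _ in 1:3
    step!(opt, p, 2 .* p)    # gradient of sum(p .^ 2)
end

steps_seen = opt.state[p][3]
fresh = ToyAdam(0.01)        # a re-created optimiser starts with empty state
```

After three steps the state holds one entry for `p` with a step count of 3, whereas `fresh` has forgotten everything.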

There is a lot more to optimisers: you can choose from a wide range of them, and there are also different decay strategies that can be composed with optimisers.

### Training

There are functions that enable easy training via batches.

In [19]:
# do not do this (it also reuses the same stale `grads` in every iteration); use train! instead
for i in 1:100
    for p in θ
      update!(opt, p, grads[p])
    end
end

First we generate some more data and split it into batches.

In [71]:
X = randn(5,1000)
Y = rand([0,1], 1, 1000) # labels
X[1,vec(Y).==1] .= -3 # make a connection between the labels and the data
data = Flux.Data.DataLoader(X, Y, batchsize=128) 

Flux.Data.DataLoader(([-0.20319554973784926 0.7664197124768922 … -3.0 -3.0; 0.18976232034546922 0.9908676053541495 … 0.17432628446833415 0.4567179964371301; … ; 0.9652189483130672 0.48193225077941515 … -1.7587692155972963 -1.0977898713415803; 0.583656907469228 0.2669864497924328 … 0.8833551098769331 0.750108352225107], [0 0 … 1 1]), 128, 1000, true, 1000, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  991, 992, 993, 994, 995, 996, 997, 998, 999, 1000], false)
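Conceptually, the loader slices the data along the last dimension into chunks of at most `batchsize` samples. The plain-Julia sketch below mimics that with `Iterators.partition`; it only illustrates the idea and is not how `DataLoader` is implemented.

```julia
X = randn(5, 1000)
Y = rand([0, 1], 1, 1000)

# Split the column indices into consecutive chunks of at most 128
batches = [(X[:, idx], Y[:, idx]) for idx in Iterators.partition(1:size(X, 2), 128)]

n_batches  = length(batches)
first_size = size(batches[1][1])
last_size  = size(batches[end][1])
```

With 1000 samples and a batch size of 128 this gives 8 batches, the last one holding the remaining 104 samples.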

We can also define a callback function.

In [72]:
cb = () -> println(loss(x,y))

#14 (generic function with 1 method)

We use the model and optimiser from the code above. This computes gradients and updates the parameters on each batch in `data`, for two epochs.

In [73]:
θ = params(m)
Flux.@epochs 2 Flux.train!(loss, θ, data, opt; cb = cb)

┌ Info: Epoch 1
└ @ Main /home/vit/.julia/packages/Flux/Fj3bt/src/optimise/train.jl:121


0.77472967
0.7671571
0.7598773
0.752929
0.7461692
0.7392982
0.73244995
0.72546804
0.7185718
0.7115805
0.7046176
0.6978354
0.6910521
0.6840501
0.6770234
0.6698472


┌ Info: Epoch 2
└ @ Main /home/vit/.julia/packages/Flux/Fj3bt/src/optimise/train.jl:121


Remember that `train!` calls `loss(d...)` on each element `d` of `data`. Therefore, if your loss function takes only one argument, each element has to be a tuple `(x,)`, not just `x`.
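The effect of the splat can be seen in plain Julia, without calling `train!` at all. In the sketch below, `toy_loss` and the data are made up for illustration.

```julia
toy_loss(x) = sum(abs2, x)

# Each element is a 1-tuple, so `d...` expands to exactly one argument
data_ok = [([1.0, 2.0],), ([3.0],)]
losses = [toy_loss(d...) for d in data_ok]

# A bare array is splatted into its elements, i.e. toy_loss(1.0, 2.0),
# for which no method exists
err = try
    toy_loss([1.0, 2.0]...)
catch e
    e
end
```

The tuple-wrapped data evaluates fine, while splatting the bare array ends in a `MethodError`.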

In [74]:
m(x)

1×10 Array{Float32,2}:
 0.0999795  0.257127  0.436714  0.432262  …  0.20172  0.438776  0.446962

In [75]:
y

1×10 Array{Int64,2}:
 0  0  1  1  1  1  0  0  1  0

### Parameter freezing

You can also freeze some parameters, i.e. exclude them from `params` and therefore from optimization.

In [27]:
length(θ)

4

In [28]:
θ = params(m[1])
length(θ)

2

Or you can delete a specific one.

In [30]:
θ = params(m)
delete!(θ, m[2].b)
length(θ)

3

### Model loading and saving

In [31]:
using BSON

In [76]:
BSON.@save "model.bson" m

In [77]:
BSON.@load "model.bson" m

You can also save and load the individual weights; see the documentation. I recommend saving the optimiser as well.

### GPU

In [85]:
m = Chain(
    Dense(5,10,relu),
    Dense(10,1, sigmoid)
)

Chain(Dense(5, 10, relu), Dense(10, 1, σ))

In [87]:
m[1].W

10×5 Array{Float32,2}:
  0.278172   -0.393357   -0.0831804  -0.127154  -0.338539
 -0.516893   -0.155446   -0.170235   -0.395183  -0.37047
  0.392659   -0.302037    0.412356    0.122162   0.573642
 -0.612919   -0.591922   -0.590492   -0.275072  -0.318672
  0.0477564   0.0205548  -0.27069     0.393052   0.570552
  0.599294   -0.165937    0.563544   -0.291239   0.197528
 -0.443511    0.588622    0.127816    0.449827  -0.0208685
 -0.175823    0.24797     0.470389    0.566752   0.226565
 -0.435666   -0.170884   -0.597009    0.174136   0.165709
 -0.549613    0.0434508   0.0475399   0.129501   0.163205

In [88]:
using CuArrays
m = m |> gpu

Chain(Dense(5, 10, relu), Dense(10, 1, σ))

In [89]:
m[1].W

10×5 CuArray{Float32,2,Nothing}:
  0.278172   -0.393357   -0.0831804  -0.127154  -0.338539
 -0.516893   -0.155446   -0.170235   -0.395183  -0.37047
  0.392659   -0.302037    0.412356    0.122162   0.573642
 -0.612919   -0.591922   -0.590492   -0.275072  -0.318672
  0.0477564   0.0205548  -0.27069     0.393052   0.570552
  0.599294   -0.165937    0.563544   -0.291239   0.197528
 -0.443511    0.588622    0.127816    0.449827  -0.0208685
 -0.175823    0.24797     0.470389    0.566752   0.226565
 -0.435666   -0.170884   -0.597009    0.174136   0.165709
 -0.549613    0.0434508   0.0475399   0.129501   0.163205

In [90]:
x = randn(5,10) |> gpu
m(x)

1×10 CuArray{Float32,2,Nothing}:
 0.585089  0.705926  0.423339  0.595285  …  0.583601  0.57859  0.448476

Before saving your model, move it to CPU!

In [92]:
m = m |> cpu

Chain(Dense(5, 10, relu), Dense(10, 1, σ))

### Conv layers

In [94]:
l = Conv((3,3), 1=>4, relu)

Conv((3, 3), 1=>4, relu)

In [93]:
?Conv

search: Conv conv conv! convert ConvDims ConvTranspose ∇conv_data ∇conv_data!



```
Conv(size, in => out, σ = identity; init = glorot_uniform,
     stride = 1, pad = 0, dilation = 1)
```

Standard convolutional layer. `size` should be a tuple like `(2, 2)`. `in` and `out` specify the number of input and output channels respectively.

Data should be stored in WHCN order (width, height, # channels, batch size). In other words, a 100×100 RGB image would be a `100×100×3×1` array, and a batch of 50 would be a `100×100×3×50` array.

# Examples

Apply a `Conv` layer to a 1-channel input using a 2×2 window size, giving us a 16-channel output. Output is activated with ReLU.

```julia
size = (2,2)
in = 1
out = 16
Conv(size, in => out, relu)
```


In [96]:
x = randn(32, 32, 1, 2);

In [97]:
y = l(x);
size(y)

(30, 30, 4, 2)

`outdims` is very useful: it computes the spatial output size of a layer for a given input size.

In [102]:
Flux.outdims(l, size(x))

(30, 30)
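The result follows the usual convolution size formula, `floor((n + 2*pad - dilation*(k - 1) - 1) / stride) + 1` per spatial dimension. The plain-Julia sketch below implements that formula; `conv_out` is a hypothetical helper, not a Flux function.

```julia
# Output length along one spatial dimension of a convolution (hypothetical helper)
conv_out(n, k; stride = 1, pad = 0, dilation = 1) =
    fld(n + 2pad - dilation * (k - 1) - 1, stride) + 1

out_default = conv_out.((32, 32), (3, 3))                      # 3×3 kernel, defaults
out_strided = conv_out.((32, 32), (3, 3); stride = 2, pad = 1) # stride 2, padding 1
```

For the 32×32 input and 3×3 kernel used above, this gives `(30, 30)`, matching `Flux.outdims`.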