# Introduction to deep learning in Flux.jl

## Introduction

- [Flux](http://fluxml.ai/) is a Julia library for building machine learning models.
- It is written entirely in Julia, which makes it easy to modify and adapt to your needs.
- Julia syntax, functions and macros can be used directly inside model code.
- Creating complex models is intuitive and fast; it usually takes only a few lines of code.

## Example

In this lecture we are going to design an MLP (multi-layer perceptron) for classifying handwritten digits from the MNIST dataset.

### Implementation

In [None]:
using Flux
using LinearAlgebra
using Statistics
using Flux: onehotbatch, onecold, crossentropy, throttle, Tracker, @epochs
using Base.Iterators: repeated
using Distributed
using PyPlot

In [None]:
# Classify MNIST digits with a simple multi-layer-perceptron

@everywhere const TRAIN = (class=UInt8[], image=Matrix{Int32}[])
@everywhere const TEST = (class=UInt8[], image=Matrix{Int32}[])

@everywhere for (filename, data) in [("mnist_train.int", TRAIN),
                                     ("mnist_test.int", TEST)]
    open(filename) do f
        while !eof(f)
            c = read(f, UInt8)                     # one byte: the class label
            v = read(f, 28^2)                      # next 784 bytes: the pixel values
            push!(data.class, c)
            push!(data.image, reshape(v, 28, 28))  # stored as a 28×28 Matrix{Int32}
        end
    end
end

imshow(TRAIN.image[1])
println(TRAIN.class[1])

In [None]:
# Stack images into one large batch
X = hcat(float.(reshape.(TRAIN.image, :))...) |> gpu;
# One-hot-encode the labels
Y = onehotbatch(TRAIN.class, 0:9) |> gpu;

In [None]:
# Stack the test images into one large batch
tX = hcat(float.(reshape.(TEST.image, :))...) |> gpu;
# One-hot-encode the labels
tY = onehotbatch(TEST.class, 0:9) |> gpu;
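To make the encoding concrete, here is a minimal plain-Julia sketch of what one-hot encoding produces. The helper <tt>onehot_manual</tt> and the sample labels are hypothetical and only illustrate the idea; this is not how Flux implements <tt>onehotbatch</tt>.

```julia
# Hypothetical helper (not part of Flux): one column per label,
# with `true` in the row of the matching class.
onehot_manual(labels, classes) = [l == c for c in classes, l in labels]

labels = [3, 0, 9]
E = onehot_manual(labels, 0:9)

size(E)          # (10, 3): ten classes, three samples
E[4, 1]          # true, because class 3 sits in row 4 of 0:9
sum(E, dims = 1) # every column contains exactly one `true`
```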

#### Model definition

Once the data is ready, we can start designing the deep learning model.

Let's start by manually defining a single perceptron layer with a sigmoid activation function.

In [None]:
W = rand(4, 8)
b = rand(4)

In [None]:
layer₁(x) = 1.0 ./ (1.0 .+ exp.(-W * x .- b))

In [None]:
x = rand(8)
layer₁(x)
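In mathematical notation, the layer above computes the following (a restatement of the code, with $\sigma$ denoting the logistic sigmoid applied element-wise):

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \mathrm{layer}_1(x) = \sigma(Wx + b), \quad W \in \mathbb{R}^{4 \times 8},\; b \in \mathbb{R}^{4},\; x \in \mathbb{R}^{8}
$$

Each of the four outputs therefore lies strictly between 0 and 1.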

To train this model in Flux, we need to tell Flux that the parameters <tt>W</tt> and <tt>b</tt> are trainable:

In [None]:
using Flux.Tracker

W = param(W)
b = param(b)

All of the above is already available in Flux, which provides the most popular [activation functions](http://fluxml.ai/Flux.jl/stable/models/layers.html#Activation-Functions-1) ready to use in a model:

In [None]:
layer₂(x) = σ.(W * x .+ b)

In [None]:
layer₂(x)

The same applies to defining [model layers](http://fluxml.ai/Flux.jl/stable/models/layers.html#Basic-Layers-1):

In [None]:
layer₃ = Dense(8,4,σ)

In [None]:
layer₃(x)

Of course, if we cannot find a suitable definition in Flux, we can declare our own layer:

In [None]:
struct Affine
  W
  b
end

Affine(in::Integer, out::Integer) =
  Affine(param(randn(out, in)), param(randn(out)))

# Overload call, so the object can be used as a function
(m::Affine)(x) = m.W * x .+ m.b

a = Affine(10, 5)

a(rand(10)) # => 5-element vector

To make full use of Flux's functionality with this layer (for example, collecting its parameters with <tt>params</tt>), the <tt>@treelike</tt> macro must be called on it:

In [None]:
Flux.@treelike Affine

A model can consist of more than one layer:

In [None]:
Layer₁ = Dense(28^2, 32, relu)
Layer₂ = Dense(32, 10)
Layer₃ = softmax

In [None]:
m₁ = Chain(Layer₁ , Layer₂, Layer₃) |> gpu

<tt>Chain</tt> joins arbitrary functions into an execution chain, applying them from left to right:

In [None]:
chain = Chain(x -> x^2, x-> -x);

In [None]:
chain(5)
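The behaviour of such a chain can be sketched in plain Julia as a left fold over the functions. The helper <tt>apply_chain</tt> is hypothetical and only mimics the idea; it is not Flux's implementation of <tt>Chain</tt>.

```julia
# Hypothetical helper (not Flux's implementation): feed `x` through
# each function in `fs`, from left to right.
apply_chain(fs, x) = foldl((acc, f) -> f(acc), fs; init = x)

fs = (x -> x^2, x -> -x)
apply_chain(fs, 5)  # -25: first 5^2 = 25, then negated
```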

The same model as <tt>m₁</tt> can also be written as nested function calls:

In [None]:
m₂(x) = Layer₃(Layer₂(Layer₁(x)))

or as a function composition (<tt>∘</tt> applies the right-most function first, so the layers are listed in reverse):

In [None]:
m₃ = Layer₃ ∘ Layer₂ ∘ Layer₁

or with the pipe operator:

In [None]:
m₄(x) = Layer₁(x) |> Layer₂ |> Layer₃
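The reading direction of <tt>∘</tt> and <tt>|></tt> is easy to check on two small functions. The functions <tt>inc</tt> and <tt>dbl</tt> below are made up for illustration and need nothing beyond base Julia.

```julia
inc(x) = x + 1
dbl(x) = 2x

(inc ∘ dbl)(3)   # 7: dbl runs first, then inc
(dbl ∘ inc)(3)   # 8: inc runs first, then dbl
3 |> dbl |> inc  # 7: pipes read left to right
```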

With the model defined, we can move on to defining its loss function and regularization.

#### Loss function, regularization

As before, we can declare the loss function manually:

In [None]:
model = Dense(5,2)

x, y = rand(5), rand(2);

In [None]:
loss(ŷ, y) = sum((ŷ .- y).^2) / length(y)

In [None]:
loss(model(x), y) 

We can also use [one of the loss functions already defined in Flux](https://github.com/FluxML/Flux.jl/blob/8f73dc6e148eedd11463571a0a8215fd87e7e05b/src/layers/stateless.jl):

In [None]:
Flux.mse(model(x),y)
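A tiny worked example makes the mean squared error formula concrete. It uses plain Julia only; the name <tt>mse_manual</tt> and the numbers are made up for illustration.

```julia
pred   = [1.0, 2.0]   # predictions
target = [0.0, 4.0]   # targets

# Squared errors are [1.0, 4.0]; their mean is 2.5.
mse_manual(ŷ, y) = sum((ŷ .- y) .^ 2) / length(y)
mse_manual(pred, target)  # 2.5
```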

Adding [regularization](http://fluxml.ai/Flux.jl/stable/models/regularisation.html) is also [intuitive](http://fluxml.ai/Flux.jl/stable/models/layers.html#Normalisation-and-Regularisation-1):

In [None]:
penalty() = norm(model.W) + norm(model.b)
loss(ŷ, y) = Flux.mse(ŷ, y) + penalty()

In [None]:
loss(model(x),y)

or even simpler:

In [None]:
loss(ŷ, y) = Flux.mse(ŷ, y) + sum(norm, params(model))

In [None]:
loss(model(x),y)
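Written out, the regularized loss above is the mean squared error plus the sum of the norms of all parameter arrays. Here $n$ is the length of $y$, $\theta$ ranges over the arrays returned by <tt>params(model)</tt>, and the norm is assumed to be the Euclidean norm of the array's entries (the Frobenius norm for matrices), which is what <tt>norm</tt> from <tt>LinearAlgebra</tt> computes in Julia 1.x:

$$
L(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 + \sum_{\theta} \lVert \theta \rVert_2
$$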

Other regularization techniques can be implemented [as layers](http://fluxml.ai/Flux.jl/stable/models/layers.html#Normalisation-and-Regularisation-1):

In [None]:
model = Chain(
    Dense(28^2, 32, relu),
    Dropout(0.1),
    Dense(32, 10),
    BatchNorm(10),  # size must match the 10 outputs of the previous layer
    softmax)

For the model <tt>m₁</tt> designed above, the loss function can look like this:

In [None]:
loss(x, y) = crossentropy(m₁(x), y)
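To see what this loss measures, here is a minimal single-sample sketch of softmax followed by cross-entropy in plain Julia. The helpers <tt>softmax_manual</tt> and <tt>crossentropy_manual</tt> are hypothetical illustrations of the standard formulas, not Flux's implementations, and the scores are made up.

```julia
# Hypothetical helpers for a single sample (not Flux's implementations).
# Subtracting the maximum keeps `exp` numerically stable.
softmax_manual(z) = exp.(z .- maximum(z)) ./ sum(exp.(z .- maximum(z)))
crossentropy_manual(ŷ, y) = -sum(y .* log.(ŷ))

scores = [0.0, 0.0]               # two equal raw scores
probs  = softmax_manual(scores)   # [0.5, 0.5]
target = [1.0, 0.0]               # one-hot target: the first class is correct

crossentropy_manual(probs, target)  # log(2) ≈ 0.693
```

The loss is the negative log of the probability assigned to the correct class, so it shrinks towards 0 as that probability approaches 1.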

#### Model learning

After defining the model we can move on to training it.

The key element of training is an algorithm for computing gradients. In Flux it works as shown below:

In [None]:
f(x) = 3x^2 + 2x + 1

# df/dx = 6x + 2
df(x) = Tracker.gradient(f, x)[1]

df(2) # 14.0 (tracked)
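The result can be cross-checked without Flux by approximating the derivative numerically. The sketch below uses a central finite difference; the names <tt>poly</tt> and <tt>numeric_df</tt> and the step size <tt>h</tt> are arbitrary choices made for illustration.

```julia
poly(x) = 3x^2 + 2x + 1

# Central finite difference; `h` is an arbitrary small step.
numeric_df(x; h = 1e-5) = (poly(x + h) - poly(x - h)) / (2h)

numeric_df(2.0)  # ≈ 14.0, matching 6x + 2 at x = 2
```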

When a function has many parameters, they can be kept in a collection and passed to <tt>gradient</tt> together:

In [None]:
W = param(2) # 2.0 (tracked)
b = param(3) # 3.0 (tracked)

f(x) = W * x + b

par = Flux.Params([W, b])
grads = Tracker.gradient(() -> f(4), par)

grads[W] # 4.0
grads[b] # 1.0
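These values follow directly from differentiating $f(x) = Wx + b$ with respect to each parameter and evaluating at $x = 4$:

$$
\frac{\partial f}{\partial W} = x = 4, \qquad \frac{\partial f}{\partial b} = 1
$$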

Flux can control the whole training process, so we don't need to implement it manually. The <tt>train!</tt> function is designed for that:

In [None]:
# Schematic call: `objective`, `data` and `opt` are placeholders
Flux.train!(objective, data, opt)

The drawback is that <tt>train!</tt> makes only a single pass (one epoch) over the data it is given. To train the model iteratively, we can either repeat the dataset the desired number of times:

In [None]:
dataset = repeated((X, Y), 200)

or use the <tt>@epochs</tt> macro:

In [None]:
Flux.@epochs 2 println("hello")
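Conceptually, one epoch is one full pass over the dataset, and training repeats such passes. The following plain-Julia sketch shows this nested loop structure on a toy problem, using hand-written gradients instead of Flux. The function <tt>fit_line</tt>, the learning rate <tt>η</tt> and the data are all made up for illustration.

```julia
# Plain-Julia sketch of an epoch loop (no Flux involved):
# fit y ≈ w*x + b by stochastic gradient descent on the squared error.
function fit_line(xs, ys; epochs = 200, η = 0.1)
    w, b = 0.0, 0.0
    for epoch in 1:epochs               # outer loop: one iteration per epoch
        for (x, y) in zip(xs, ys)       # inner loop: one pass over the data
            err = (w * x + b) - y       # prediction error for this sample
            w -= η * 2 * err * x        # gradient of err^2 with respect to w
            b -= η * 2 * err            # gradient of err^2 with respect to b
        end
    end
    return w, b
end

# Toy data generated from y = 2x + 1
w, b = fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

After training, <tt>w</tt> and <tt>b</tt> should be close to 2 and 1, the values used to generate the toy data.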

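<tt>train!</tt> also accepts a callback (the <tt>cb</tt> keyword argument), a function that is called during the training process, for example to report the loss on the test set: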

In [None]:
evalcb = () -> @show(loss(tX, tY))

Now we can put it all together:

In [None]:
# Classify MNIST digits with a simple multi-layer-perceptron

# Stack images into one large batch
X = hcat(float.(reshape.(TRAIN.image, :))...) |> gpu;
# One-hot-encode the labels
Y = onehotbatch(TRAIN.class, 0:9) |> gpu;

# Stack the test images into one large batch
tX = hcat(float.(reshape.(TEST.image, :))...) |> gpu;
# One-hot-encode the test labels
tY = onehotbatch(TEST.class, 0:9) |> gpu;

m = Chain(
  Dense(28^2, 320, relu),
  Dense(320, 10),
  softmax) |> gpu

loss(x, y) = crossentropy(m(x), y)

accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))

dataset = (X, Y)
evalcb = () -> @show(accuracy(tX, tY))
opt = ADAM(params(m))

@epochs 200 Flux.train!(loss, [dataset], opt, cb = throttle(evalcb, 10))
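The accuracy computation above compares, for every sample, the index of the largest model output with the index of the true class. A plain-Julia sketch of the same idea follows. The helper <tt>onecold_manual</tt> and the numbers are hypothetical; this is not Flux's <tt>onecold</tt>.

```julia
using Statistics  # for `mean`

# Hypothetical helper (not Flux's `onecold`): row index of the largest
# entry in each column.
onecold_manual(M) = [argmax(M[:, j]) for j in 1:size(M, 2)]

scores  = [0.1 0.8 0.3;
           0.9 0.2 0.7]   # model outputs: 2 classes × 3 samples
targets = [0 1 1;
           1 0 0]         # one-hot labels for the same samples

# Predicted rows are [2, 1, 2], true rows are [2, 1, 1]: two of three match.
acc = mean(onecold_manual(scores) .== onecold_manual(targets))
```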

Without a GPU this training process can take quite a long time. Flux supports moving the computation to the [GPU](http://fluxml.ai/Flux.jl/stable/gpu.html), as done above with the <tt>gpu</tt> function.

## Alternatives

Flux is not the only machine/deep learning framework in Julia. Other frameworks that can be used include:

- [Knet.jl](https://github.com/denizyuret/Knet.jl)
- [MXNet.jl](https://github.com/dmlc/MXNet.jl)
- [TensorFlow.jl](https://github.com/malmaud/TensorFlow.jl)