# Welcome to the Flux Breakdown

In this demonstration, we take the most basic but important building block of deep learning, the perceptron, and build it up until it can classify images. Most of the code blocks have a brief explanation of their intent, but if something is confusing, feel free to let me know!

In [None]:
using Flux

## Part One: A Single Perceptron:
Here we will build a single node and demonstrate its properties. Start with an x and y that we will use as the input and expected output of our system. Notice that the dimensionality of y doesn't match the output of a single perceptron; we will make use of this y later.

In [None]:
x = [1]
y = [0.6, 0.4]

Using the Dense layer constructor, we will create an instance of a single perceptron with one input and one output. It is initialized with a sigmoid nonlinearity (σ).

In [None]:
# Other cool Activations in Flux!
# Sigmoid
# Tanh
# Relu
# Elu
# Swish

single_perceptron_model = Dense(1,1,σ)

We can inspect the properties of this model, such as its weight and bias terms. Below, we print the arrays holding these values.

In [None]:
single_perceptron_model.W

In [None]:
single_perceptron_model.b

Say we want to see what the model predicts for our input. Calling the model as a function with x as its argument returns the prediction.

In [None]:
single_perceptron_model(x)

Notice that this is no different than applying the weight and bias terms as a linear equation and passing the result through the sigmoid function. We verify this by calculating it directly below.

In [None]:
σ.(single_perceptron_model.W*[1] + single_perceptron_model.b)
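
To make the equivalence concrete without Flux, here is a minimal plain-Julia sketch of the same computation. The names `sigmoid` and `perceptron` and the values of `w` and `b` are hypothetical stand-ins chosen for illustration; Flux initializes its own parameters randomly.

```julia
# Plain-Julia sketch of a single sigmoid perceptron (no Flux needed).
sigmoid(z) = 1 / (1 + exp(-z))

# Hypothetical parameters; a Flux Dense layer would initialize these randomly.
w = 0.5
b = 0.1

# Linear equation followed by the nonlinearity.
perceptron(x) = sigmoid(w * x + b)

out = perceptron(1.0)
println(out)
```

The output always lies strictly between 0 and 1, which is the squashing behaviour the sigmoid contributes.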

## Part Two: Multi-Dimensionality:
Here we will give the layer more than one output, and observe how the weight and bias terms change shape to accommodate this.

In [None]:
# Other cool layers in Flux!
# Chain
# Conv, DepthwiseConv, ConvTranspose
# AdaptiveMaxPool, MaxPool, GlobalMaxPool, GlobalMeanPool
# CrossCor, SamePad
# ConvFilter, DepthWiseFilter
# RNN
# LSTM, GRU, Recur
# Maxout, SkipConnection

single_dense_model = Dense(1,2,σ)

The weight and bias arrays no longer hold a single value each. We can now see that perceptrons are built on matrix-vector multiplication, and operations like these are what hardware accelerators such as GPUs and TPUs are optimized for.

In [None]:
single_dense_model.W

In [None]:
single_dense_model.b

Just like before, we can pass the input to the model. The result should match multiplying the weight matrix by the input vector by hand: each output element is the inner product of one row of the weights with the input, plus the matching bias, passed through the sigmoid.

In [None]:
single_dense_model(x)

In [None]:
σ.(single_dense_model.W*x + single_dense_model.b)
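
The row-by-row reading of that product can be sketched in plain Julia. The weight matrix `W` and bias `b` below are hypothetical values with the same shapes as a one-input, two-output layer; they are not the values Flux would generate.

```julia
using LinearAlgebra: dot

sigmoid(z) = 1 / (1 + exp(-z))

# Hypothetical 2×1 weight matrix and length-2 bias, mirroring a 1-in, 2-out layer.
W = reshape([0.3, -0.7], 2, 1)
b = [0.1, 0.2]
x = [1.0]

# Matrix form: what the layer computes in one shot.
matrix_form = sigmoid.(W * x .+ b)

# Row-by-row form: one inner product per output node.
row_form = [sigmoid(dot(W[i, :], x) + b[i]) for i in 1:size(W, 1)]

println(matrix_form ≈ row_form)
```

Both forms give the same two outputs; the matrix form is simply the one that maps well onto accelerated linear algebra routines.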

## Part Three: Chaining Operations Together:
What happens when we want to stack consecutive layers to produce a model? This is where the Chain function comes in, which is the basis for building larger systems in Flux. One way to feed the output of the first perceptron into the second layer is to nest one model call inside the other.

In [None]:
single_dense_model(single_perceptron_model([1]))

But nesting calls like this quickly becomes unwieldy. With Chain we get the same result in fewer lines, and the combined model is stored as a single object. Chain will also be helpful as we begin to introduce new deep learning elements to the party.

In [None]:
small_mlp = Chain(single_perceptron_model,single_dense_model)

In [None]:
small_mlp(x)
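
Conceptually, chaining is just function composition applied left to right. The sketch below illustrates that idea in plain Julia; `chain`, `layer1`, `layer2` and all parameter values are hypothetical, and this is not how Flux implements Chain internally.

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# Two hypothetical layers with fixed parameters (1 -> 1, then 1 -> 2).
W1, b1 = fill(0.5, 1, 1), [0.1]
W2, b2 = reshape([0.3, -0.7], 2, 1), [0.1, 0.2]
layer1(x) = sigmoid.(W1 * x .+ b1)
layer2(x) = sigmoid.(W2 * x .+ b2)

# Apply the layers left to right, feeding each output into the next layer.
chain(layers...) = x -> foldl((acc, layer) -> layer(acc), layers; init = x)

mlp = chain(layer1, layer2)
nested = layer2(layer1([1.0]))   # the manual nested call
chained = mlp([1.0])             # the composed model
println(chained == nested)
```

Storing the composition as one callable object is what makes it convenient to pass a whole model to a loss function or a training loop.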

## Part Four: Getting The Right Results:
Our outputs aren't quite what we wanted. How do we quantify this, and how can we adjust the perceptrons we have to get closer to our desired output? To do this, we need to measure how far off we are, and which way to move.

In [None]:
using Flux: mse

Let's start by defining the loss as the mean squared error, which averages the squared differences between the expected and actual values. The gradient of this loss tells us which direction to move in during training.

In [None]:
loss(x,y) = mse(small_mlp(x),y)

In [None]:
original_loss = loss(x,y)
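
For reference, the mean squared error can be written out by hand. This is a minimal sketch in plain Julia; `my_mse` is a hypothetical name, and the prediction `ŷ` is a made-up value rather than the output of the model above.

```julia
# Mean squared error: the average of the squared element-wise differences.
my_mse(ŷ, y) = sum(abs2, ŷ .- y) / length(y)

ŷ = [0.5, 0.5]      # hypothetical model output
y = [0.6, 0.4]      # target from Part One
err = my_mse(ŷ, y)
println(err)
```

Each element here is off by 0.1, so the average squared error is about 0.01.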

Let's define the optimization algorithm we would like to use. We will use gradient descent with a learning rate of 0.01. Of course, we can use other optimizers, listed in the comments below, with more details in the docs found here:

https://fluxml.ai/Flux.jl/v0.8/training/optimisers/#Optimiser-Reference-1

In [None]:
# Other Cool Optimizers in Flux
# Update!
# Momentum
# Nesterov
# RADAM
# AdaMax
# ADAGrad
# ADADelta
# AMSGrad
# NADAM
# ADAMW

opt = Descent(0.01)

We can use the train! function to shift our perceptron parameters in the right direction. Notice that after one training step we are a little closer to the goal, but we aren't quite there. Maybe if we do this enough times we can get there, but how will we do that?

In [None]:
Flux.train!(loss, params(small_mlp), [(x, y)], opt)

In [None]:
small_mlp(x)

In [None]:
new_loss = loss(x,y)

In [None]:
Δ = original_loss - new_loss
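
What a single descent step does can be sketched by hand for one sigmoid perceptron with a squared-error loss. This is an illustration under stated assumptions: the gradient is derived analytically with the chain rule rather than by Flux's automatic differentiation, and `predict`, `sq_loss`, `grad`, `descent_step` and the starting parameters are all hypothetical.

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# One sigmoid perceptron with squared-error loss on a single sample.
predict(w, b, x) = sigmoid(w * x + b)
sq_loss(w, b, x, y) = (predict(w, b, x) - y)^2

# Analytic gradient via the chain rule: dL/dz = 2(ŷ - y) * ŷ(1 - ŷ).
function grad(w, b, x, y)
    ŷ = predict(w, b, x)
    dz = 2 * (ŷ - y) * ŷ * (1 - ŷ)
    return dz * x, dz          # (dL/dw, dL/db)
end

# A single plain gradient-descent update with learning rate η.
function descent_step(w, b, x, y; η = 0.01)
    gw, gb = grad(w, b, x, y)
    return w - η * gw, b - η * gb
end

w, b = 0.5, 0.1               # hypothetical starting parameters
x, y = 1.0, 0.6
before = sq_loss(w, b, x, y)
w, b = descent_step(w, b, x, y)
after = sq_loss(w, b, x, y)
println(before - after)       # positive when the step reduced the loss
```

With a learning rate this small the improvement per step is tiny, which is why a single update leaves the model only slightly closer to the target.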

Let's introduce epochs. An epoch is one full pass of training over the data, so the number of epochs is the number of times we repeat training. As an example, if I want to train ten times, I would train for ten epochs. We test the @epochs macro below.

In [None]:
using Flux: @epochs

In [None]:
epochs = 10

In [None]:
@epochs epochs Flux.train!(loss, params(small_mlp), [(x, y)], opt)

In [None]:
epoch_loss = loss(x,y)

In [None]:
Δ = new_loss - epoch_loss
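
Repeating training for several epochs amounts to wrapping the update in a loop. The sketch below shows this in plain Julia for one sigmoid perceptron, recording the loss after each pass; `train_epochs` and the starting parameters are hypothetical, and the gradient is hand-derived rather than computed by Flux.

```julia
sigmoid(z) = 1 / (1 + exp(-z))
sq_loss(w, b, x, y) = (sigmoid(w * x + b) - y)^2

# Repeat a gradient-descent step once per epoch and record the loss after each pass.
function train_epochs(w, b, x, y; epochs = 10, η = 0.01)
    history = Float64[]
    for _ in 1:epochs
        ŷ = sigmoid(w * x + b)
        dz = 2 * (ŷ - y) * ŷ * (1 - ŷ)   # chain rule through the sigmoid
        w -= η * dz * x
        b -= η * dz
        push!(history, sq_loss(w, b, x, y))
    end
    return w, b, history
end

_, _, history = train_epochs(0.5, 0.1, 1.0, 0.6)
println(history)
```

The recorded losses shrink from one epoch to the next, mirroring the positive Δ computed above.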

## Part Five: Go Big Mode
We have the building blocks and an understanding of the Flux perceptron, and we know how to chain layers together. But fitting a dataset of one sample isn't something you would ever use a NN model for. In this next section we explore the DataLoader and train a model on a larger dataset.

In [None]:
using Random
using Plots

Here, we will build our own larger dataset, to show how we can use data that hasn't been processed or prepped beforehand. You might notice the transpose further down: Flux layers treat the first dimension as features and the last dimension as the batch, so 800 one-feature samples need to be a 1×800 row matrix. A length-800 column vector would instead be read as a single sample with 800 features. This fits Julia being column-major, like Matlab: each observation is stored as one column.

In [None]:
x = shuffle!(collect(1:1000))

In [None]:
y = [i+rand(1:100) for i in x]

We generate a quick plot to show our distribution. You might notice that the points fall along a straight line; if we could learn the right weight and bias, we could probably predict this line!

In [None]:
plot(x,y,seriestype = :scatter)

In [None]:
x_train = transpose(x[1:800])
y_train = transpose(y[1:800])
x_test = transpose(x[801:1000])
y_test = transpose(y[801:1000])

Let's introduce the DataLoader. It lets us take these large collections of data and train and test the model without manually slicing out batches.

In [None]:
using Flux.Data: DataLoader

In [None]:
# Pass the inputs and targets together as a tuple, matching the DataLoader calls in Part Seven
train_data = DataLoader((x_train, y_train), batchsize=50, shuffle=true)
test_data = DataLoader((x_test, y_test), batchsize=50)
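
The batching that a data loader performs can be approximated in a few lines of plain Julia. This is a rough sketch, not Flux's implementation: `minibatches` is a hypothetical helper, and `X` and `Y` are synthetic stand-ins with one observation per column.

```julia
using Random: randperm

# Split the columns of X and Y (one observation per column) into mini-batches.
function minibatches(X, Y; batchsize = 50, shuffle = false)
    n = size(X, 2)
    idx = shuffle ? randperm(n) : collect(1:n)
    return [(X[:, chunk], Y[:, chunk]) for chunk in Iterators.partition(idx, batchsize)]
end

X = reshape(collect(1.0:800.0), 1, :)    # 1×800: one feature, 800 observations
Y = X .+ 50.0                             # hypothetical targets
batches = minibatches(X, Y; batchsize = 50, shuffle = true)
println(length(batches))
println(size(first(batches)[1]))
```

Shuffling permutes the column indices once, so every input stays paired with its own target inside each batch.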

Here, we stack several layers to make our multi-layer perceptron.

In [None]:
large_mlp = Chain(Dense(1,15),Dense(15,20),Dense(20,1,relu))

In [None]:
large_mlp([x_test[1]])

In [None]:
loss(x,y) = mse(large_mlp(x),y)
loss([x_test[1]],[y_test[1]])

We would also like a way to measure the loss across the entire dataset. We average the loss over all batches to get an all-encompassing loss.

In [None]:
function loss_all(dataloader, model)
    l = 0
    for (x,y) in dataloader
        l += mse(model(x),y)
    end
    l/length(dataloader)
end

In [None]:
opt = Descent(1e-3)

In [None]:
epochs = 10

Notice we pass an additional keyword argument, cb, which refers to a callback. One use of this, as seen here, is calling a function that shows our progress as training runs.

In [None]:
@epochs epochs Flux.train!(loss, params(large_mlp), train_data, opt, cb = () -> @show(loss_all(train_data, large_mlp)))

## Part Six: Putting it All Together - Working with Images
In this final perceptron section, we take the MNIST dataset of handwritten digit images, and use the building blocks of multilayer perceptrons to identify which of the 10 digits (0 through 9) appears in each image. Credit to the Flux Model Zoo GitHub for the original demonstration.

https://github.com/FluxML/model-zoo

In [None]:
using PyPlot
using CUDA
using MLDatasets
using Flux: onehotbatch, onecold

We can start by downloading the dataset and visualizing one example. Let's observe how the prediction for it changes over time.

In [None]:
xtrain, ytrain = MLDatasets.MNIST.traindata(Float32)
xtest, ytest = MLDatasets.MNIST.testdata(Float32)

In [None]:
ytrain[1]

In [None]:
matshow(transpose(xtrain[:,:,1]))

As great as predicting the exact number directly would be, we need a one-hot encoding of the labels to use as the target output of our system. Using onehotbatch from Flux, we can quickly one-hot encode the labels for our perceptrons.

In [None]:
ytrain, ytest = onehotbatch(ytrain, 0:9), onehotbatch(ytest, 0:9)

In [None]:
ytrain[:,1]
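
One-hot encoding itself is simple enough to sketch by hand. The helper `onehot_matrix` below is a hypothetical plain-Julia stand-in that returns an ordinary Bool matrix, not Flux's own one-hot array type, and the three labels are made up.

```julia
# One-hot encode each label as a column: row i is true when the label equals labels[i].
onehot_matrix(ys, labels) = [y == l for l in labels, y in ys]

encoded = onehot_matrix([5, 0, 4], 0:9)
println(size(encoded))
println(findall(encoded[:, 1]))   # the digit 5 sits at row 6, since the labels start at 0
```

Every column contains exactly one true entry, which gives the network ten separate targets to aim at.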

We also want our input flattened from its matrix form. To do this quickly, we make use of the flatten function.

In [None]:
xtrain = Flux.flatten(xtrain)
xtest = Flux.flatten(xtest)
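
The effect of flattening can be reproduced with a plain reshape. In this sketch `images` is a hypothetical random stand-in for a batch of three 28×28 images, and the reshape only approximates what the flatten call above does by collapsing everything except the batch dimension.

```julia
# Hypothetical stand-in for a batch of three 28×28 images.
images = rand(Float32, 28, 28, 3)

# Collapse every dimension except the last (the batch dimension) into one.
flat = reshape(images, :, size(images, ndims(images)))
println(size(flat))
```

Each column of `flat` now holds the 784 pixels of one image, which is the input shape a Dense(28*28, ...) layer expects.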

In [None]:
# Pass the inputs and targets together as a tuple, matching the DataLoader calls in Part Seven
train_data = DataLoader((xtrain, ytrain), batchsize=1024, shuffle=true)
test_data = DataLoader((xtest, ytest), batchsize=1024)

In [None]:
img_model = Chain(Dense(28*28, 32, relu), Dense(32, 10))

In [None]:
img_model(xtrain[:,1])

We can work out what the model currently predicts here by finding the largest of the ten outputs. Its index identifies the predicted class; since the labels are 0:9, index i corresponds to digit i - 1.

In [None]:
findmax(img_model(xtrain[:,1]))
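
Turning the findmax result into a digit takes one more indexing step. The `scores` vector below is a made-up example of ten model outputs, not real output from `img_model`.

```julia
# Hypothetical ten raw model outputs, one per digit class 0 through 9.
scores = [0.1, 0.0, 0.3, 0.2, 0.05, 0.9, 0.1, 0.0, 0.2, 0.1]

best_score, best_index = findmax(scores)
predicted_digit = (0:9)[best_index]     # index 6 corresponds to the digit 5
println(predicted_digit)
```

The off-by-one between index and digit is easy to miss because Julia arrays start at 1 while the labels start at 0.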

In [None]:
loss(x,y) = mse(img_model(x),y)
loss(xtrain[:,1],ytrain[:,1])

In [None]:
epochs = 10

In [None]:
opt = ADAM(3e-4)

In [None]:
@epochs epochs Flux.train!(loss, params(img_model), train_data, opt,cb = () -> @show(loss_all(train_data, img_model)))

## Part Seven: An Example of a Truly Put-Together Model
In this final section, we briefly explore a convolutional model for the MNIST data shown above, both to show how we can construct an even larger, well-formatted model and to explore using GPUs to speed up the computation. The Model Zoo was used for this demo, link below, thanks!

https://github.com/FluxML/model-zoo

In [None]:
using Flux
using Flux.Data: DataLoader
using Flux.Optimise: Optimiser, WeightDecay
using Flux: onehotbatch, onecold
using Flux.Losses: logitcrossentropy
using Statistics, Random
using Logging: with_logger
using TensorBoardLogger: TBLogger, tb_overwrite, set_step!, set_step_increment!
using ProgressMeter: @showprogress
import MLDatasets
using CUDA

In [None]:
# LeNet5 "constructor". 
# The model can be adapted to any image size
# and any number of output classes.
function LeNet5(; imgsize=(28,28,1), nclasses=10) 
    out_conv_size = (imgsize[1]÷4 - 3, imgsize[2]÷4 - 3, 16)
    
    return Chain(
            Conv((5, 5), imgsize[end]=>6, relu),
            MaxPool((2, 2)),
            Conv((5, 5), 6=>16, relu),
            MaxPool((2, 2)),
            flatten,
            Dense(prod(out_conv_size), 120, relu), 
            Dense(120, 84, relu), 
            Dense(84, nclasses)
          )
end
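
The `out_conv_size` expression is easier to trust once the spatial size is traced layer by layer. The sketch below assumes 5×5 convolutions with no padding and stride 1, and 2×2 non-overlapping pooling; `conv_out`, `pool_out` and `lenet_side` are hypothetical helper names.

```julia
# Spatial size after a convolution with no padding and stride 1.
conv_out(n, k) = n - k + 1
# Spatial size after non-overlapping pooling with window p.
pool_out(n, p) = n ÷ p

# Trace one spatial dimension through Conv(5) -> MaxPool(2) -> Conv(5) -> MaxPool(2).
lenet_side(n) = pool_out(conv_out(pool_out(conv_out(n, 5), 2), 5), 2)

side = lenet_side(28)            # 28 -> 24 -> 12 -> 8 -> 4
println(side)
println(side == 28 ÷ 4 - 3)      # agrees with the shortcut used in out_conv_size
```

For 28×28 inputs this gives a 4×4 map over 16 channels, so the first Dense layer receives 4 * 4 * 16 = 256 features.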

In [None]:
function get_data(args)
    xtrain, ytrain = MLDatasets.MNIST.traindata(Float32)
    xtest, ytest = MLDatasets.MNIST.testdata(Float32)

    xtrain = reshape(xtrain, 28, 28, 1, :)
    xtest = reshape(xtest, 28, 28, 1, :)

    ytrain, ytest = onehotbatch(ytrain, 0:9), onehotbatch(ytest, 0:9)

    train_loader = DataLoader((xtrain, ytrain), batchsize=args.batchsize, shuffle=true)
    test_loader = DataLoader((xtest, ytest),  batchsize=args.batchsize)
    
    return train_loader, test_loader
end

In [None]:
loss(ŷ, y) = logitcrossentropy(ŷ, y)
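
As a rough guide to what this loss measures, cross-entropy on raw scores can be written out by hand for a single sample. This is a sketch under stated assumptions, not Flux's implementation: `logsoftmax_vec` and `logit_xent` are hypothetical names, and the logits and target are made up.

```julia
# Numerically stable log-softmax over a vector of raw scores (logits).
function logsoftmax_vec(z)
    m = maximum(z)
    return z .- m .- log(sum(exp.(z .- m)))
end

# Cross-entropy between a one-hot target and the softmax of the logits.
logit_xent(z, y) = -sum(y .* logsoftmax_vec(z))

logits = [2.0, 1.0, 0.1]        # hypothetical raw outputs for three classes
target = [1.0, 0.0, 0.0]        # one-hot target for the first class
ce = logit_xent(logits, target)
println(ce)
```

Working on logits directly is why the last layer of LeNet5 has no activation: the softmax is folded into the loss.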

In [None]:
Base.@kwdef mutable struct Args
    η = 3e-4             # learning rate
    λ = 0                # L2 regularizer param, implemented as weight decay
    batchsize = 128      # batch size
    epochs = 10          # number of epochs
    seed = 0             # set seed > 0 for reproducibility
    use_cuda = true      # if true use cuda (if available)
    infotime = 1         # report every `infotime` epochs
    checktime = 5        # Save the model every `checktime` epochs. Set to 0 for no checkpoints.
    tblogger = true      # log training with tensorboard
    savepath = "runs/"    # results path
end

In [None]:
function train(; kws...)
    args = Args(; kws...)
    args.seed > 0 && Random.seed!(args.seed)
    use_cuda = args.use_cuda && CUDA.functional()
    
    if use_cuda
        device = gpu
        @info "Training on GPU"
    else
        device = cpu
        @info "Training on CPU"
    end

    ## DATA
    train_loader, test_loader = get_data(args)
    @info "Dataset MNIST: $(train_loader.nobs) train and $(test_loader.nobs) test examples"

    ## MODEL AND OPTIMIZER
    model = LeNet5() |> device
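    # `num_params` is never defined in this notebook, so the parameters are counted inline
    @info "LeNet5 model: $(sum(length, Flux.params(model))) trainable params"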
    
    ps = Flux.params(model)  

    opt = ADAM(args.η) 
    if args.λ > 0 # add weight decay, equivalent to L2 regularization
        opt = Optimiser(opt, WeightDecay(args.λ))
    end
    
    
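    ## TRAINING
    # `report` is not defined anywhere in this notebook, so a minimal stand-in is
    # defined here: it logs the mean loss over all test batches.
    function report(epoch)
        test_loss = mean(loss(model(x |> device), y |> device) for (x, y) in test_loader)
        @info "Epoch $epoch" test_loss
    end

    @info "Start Training"
    report(0)
    for epoch in 1:args.epochs
        @showprogress for (x, y) in train_loader
            x, y = x |> device, y |> device
            # Differentiate the loss with respect to the tracked parameters `ps`
            gs = Flux.gradient(ps) do
                    ŷ = model(x)
                    loss(ŷ, y)
                end

            Flux.Optimise.update!(opt, ps, gs)
        end
        # Report every `infotime` epochs
        epoch % args.infotime == 0 && report(epoch)
    end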
end
