# Linear models, loss functions, gradients, SGD
(c) Deniz Yuret, 2019
* Objectives: Define, train and visualize a simple model; understand gradients and SGD; learn to use the GPU.
* Prerequisites: [Callable objects](https://docs.julialang.org/en/v1/manual/methods/#Function-like-objects-1), [Generator expressions](https://docs.julialang.org/en/v1/manual/arrays/#Generator-Expressions-1), [MNIST](20.mnist.ipynb), [Iterators](25.iterators.ipynb)
* New functions: 
[mnistdata](https://github.com/denizyuret/Knet.jl/blob/master/data/mnist.jl),
[accuracy](http://denizyuret.github.io/Knet.jl/latest/reference/#Knet.accuracy), 
[zeroone](http://denizyuret.github.io/Knet.jl/latest/reference/#Knet.zeroone), 
[nll](http://denizyuret.github.io/Knet.jl/latest/reference/#Knet.nll), 
[Param, @diff, value, params, grad](http://denizyuret.github.io/Knet.jl/latest/reference/#AutoGrad),
[sgd](http://denizyuret.github.io/Knet.jl/latest/reference/#Knet.sgd),
[progress, progress!](http://denizyuret.github.io/Knet.jl/latest/reference/#Knet.progress), 
[gpu](http://denizyuret.github.io/Knet.jl/latest/reference/#Knet.gpu), 
[KnetArray](http://denizyuret.github.io/Knet.jl/latest/reference/#Knet.KnetArray), 
[load](http://denizyuret.github.io/Knet.jl/latest/reference/#Knet.load), 
[save](http://denizyuret.github.io/Knet.jl/latest/reference/#Knet.save)


<img src="https://www.oreilly.com/library/view/tensorflow-for-deep/9781491980446/assets/tfdl_0401.png" alt="A linear model" width=300/> ([image source](https://www.oreilly.com/library/view/tensorflow-for-deep/9781491980446/ch04.html))

In Knet, a machine learning model is defined using plain Julia code. A typical model consists of a **prediction** function and a **loss** function. The prediction function takes some input and returns the model's prediction for that input. The loss function measures how bad the prediction is with respect to some desired output. We train a model by adjusting its parameters to reduce the loss.

In this section we will implement a simple linear model to classify MNIST digits. The prediction function will return 10 scores, one for each of the possible labels 0..9, each computed as a linear combination of the pixel values. The loss function will convert these scores to normalized probabilities and return the average -log probability of the correct answers. Minimizing this loss should increase the probabilities the model assigns to correct answers. We will make use of the loss gradient with respect to each parameter, which tells us the direction of the greatest loss increase. We will improve the model by moving the parameters in the opposite direction (using a GPU if available). We will visualize the model weights and performance over time. The final accuracy of about 92% is close to the limit of what we can achieve with this type of model. To improve further we must look beyond linear models.
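
The setup just described can be written compactly as follows. This is only a sketch: the symbols $W$, $b$, $s$, $p$, $N$ and $\eta$ are notation chosen here and do not appear in the code, with $x_i$ the pixel vector and $y_i$ the label of instance $i$ in a batch of $N$ instances.

$$
s_i = W x_i + b, \qquad
p_{k,i} = \frac{\exp(s_{k,i})}{\sum_{j=1}^{10} \exp(s_{j,i})}, \qquad
J(W,b) = -\frac{1}{N} \sum_{i=1}^{N} \log p_{y_i,i}
$$

Training then repeats the update $W \leftarrow W - \eta\,\nabla_W J$ and $b \leftarrow b - \eta\,\nabla_b J$, where $\eta$ is the learning rate.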

In [None]:
# Set display width, load packages, import symbols
ENV["COLUMNS"]=72
using Statistics: mean
using Base.Iterators: flatten
using IterTools: ncycle, takenth
import Random # seed!
using MLDatasets: MNIST
import CUDA # functional
using Knet: Knet, AutoGrad, dir, Data, minibatch, Param, @diff, value, params, grad, progress, progress!, gpu, KnetArray, load, save
# The following are defined for instruction even though they are provided in Knet
# using Knet: accuracy, zeroone, nll, sgd

In [None]:
# Load MNIST data
xtrn,ytrn = MNIST.traindata(Float32); ytrn[ytrn.==0] .= 10
xtst,ytst = MNIST.testdata(Float32);  ytst[ytst.==0] .= 10
dtrn = minibatch(xtrn, ytrn, 100; xsize = (784,:))
dtst = minibatch(xtst, ytst, 100; xsize = (784,:))
println.(summary.((dtrn,dtst)));

## Model definition

In [None]:
# In Julia we define a new datatype using `struct`:
struct Linear; w; b; end

# The new struct comes with a default constructor:
model = Linear(0.01 * randn(10,784), zeros(10))

# We can define other constructors with different inputs:
Linear(i::Int,o::Int,scale=0.01) = Linear(scale * randn(o,i), zeros(o))

# This one allows instances to be defined using input and output sizes:
model = Linear(784,10)

## Prediction

In [None]:
# We turn Linear instances into callable objects for prediction:
(m::Linear)(x) = m.w * x .+ m.b

In [None]:
x,y = first(dtst) # The first minibatch from the test set
summary.((x,y))

In [None]:
Int.(y)' # correct answers are given as an array of integers (remember we use 10 for 0)

In [None]:
ypred = model(x)  # Predictions on the first minibatch: a 10x100 score matrix

In [None]:
# We can calculate the accuracy of our model for the first minibatch
accuracy(model,x,y) = mean(y' .== map(i->i[1], findmax(Array(model(x)),dims=1)[2]))
accuracy(model,x,y)

In [None]:
# We can calculate the accuracy of our model for the whole test set
accuracy(model,data) = mean(accuracy(model,x,y) for (x,y) in data)
accuracy(model,dtst)

In [None]:
# ZeroOne loss (or error) is defined as 1 - accuracy
zeroone(x...) = 1 - accuracy(x...)
zeroone(model,dtst)
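
To see what the accuracy computation does without loading MNIST, here is a minimal sketch on made-up numbers. The names `scores`, `labels`, `predicted`, `toy_accuracy` and `toy_zeroone` are hypothetical and chosen so they do not clash with the notebook's variables. It uses `argmax` per column instead of `findmax(...; dims=1)`, which is why no transpose of the labels is needed here.

```julia
using Statistics: mean

# Hypothetical toy data: 3 classes (rows) x 4 instances (columns)
scores = [0.1 2.0 0.3 0.0;
          1.5 0.2 0.1 0.9;
          0.2 0.1 3.0 0.4]
labels = [2, 1, 3, 3]                      # correct class for each column

# Predicted class = row index of the largest score in each column
predicted = [argmax(col) for col in eachcol(scores)]

toy_accuracy = mean(predicted .== labels)  # fraction of columns predicted correctly
toy_zeroone  = 1 - toy_accuracy            # error rate
```

Only the last column is misclassified (the largest score is in row 2, the label is 3), so three of the four predictions are correct.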

## Loss function

In [None]:
# For classification we use negative log likelihood loss (aka cross entropy, softmax loss, NLL)
# This is the average -log probability assigned to correct answers by the model
function nll(scores, y)
    expscores = exp.(scores)
    probabilities = expscores ./ sum(expscores, dims=1)
    answerprobs = (probabilities[y[i],i] for i in 1:length(y))
    mean(-log.(answerprobs))
end
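
The direct `exp.(scores)` above is easy to read, but `exp` overflows for very large scores. A common remedy, sketched below under the assumption that plain `Array` inputs are used, is to subtract the column maximum before exponentiating; this leaves the probabilities unchanged. The name `stable_nll` and the toy inputs are hypothetical and are not part of Knet.

```julia
using Statistics: mean

# Hypothetical helper: same loss as the cell above, computed via log-softmax
function stable_nll(scores, y)
    shifted  = scores .- maximum(scores, dims=1)             # largest entry per column becomes 0
    logprobs = shifted .- log.(sum(exp.(shifted), dims=1))   # log of normalized probabilities
    -mean(logprobs[y[i], i] for i in 1:length(y))            # average -log prob of correct answers
end

small  = [1.0 2.0; 3.0 0.5; 0.2 0.1]   # 3 classes x 2 instances
big    = small .+ 1000                 # exp.(big) overflows to Inf in Float64
labels = [2, 1]

loss_small = stable_nll(small, labels)
loss_big   = stable_nll(big, labels)   # same value: adding a constant to a column changes nothing
```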

In [None]:
# model(x) gives predictions, let model(x,y) give the loss
(m::Linear)(x, y) = nll(m(x), y)
model(x,y)

In [None]:
# We can also use the Knet nll implementation for efficiency
(m::Linear)(x, y) = Knet.nll(m(x), y)
model(x,y)

In [None]:
# If the input is a dataset compute average loss:
(m::Linear)(data::Data) = mean(m(x,y) for (x,y) in data)

In [None]:
# Here is the per-instance average negative log likelihood for the whole test set
model(dtst)

**Bonus question:** What is special about the loss value 2.3?

## Calculating the gradient using AutoGrad

In [None]:
@doc AutoGrad

In [None]:
# Redefine the constructor to use Params so we can compute gradients
Linear(i::Int,o::Int,scale=0.01) = 
    Linear(Param(scale * randn(o,i)), Param(zeros(o)))

In [None]:
# Set random seed for replicability
Random.seed!(9);

In [None]:
# Use a larger scale to get a large initial loss
model = Linear(784,10,1.0)

In [None]:
# We can still do predictions and calculate loss:
model(x,y)

In [None]:
# The same loss calculation under @diff also records what is needed to compute gradients:
J = @diff model(x,y)

In [None]:
# To get the actual loss value from J:
value(J)

In [None]:
# params(J) returns an iterator of Params J depends on (i.e. model.b, model.w):
params(J) |> collect

In [None]:
# To get the gradient of a parameter from J:
∇w = grad(J,model.w)

In [None]:
# Note that each gradient has the same size and shape as the corresponding parameter:
@show ∇b = grad(J,model.b);

## Checking the gradient using numerical approximation

What does ∇b represent?

∇b[10] = 0.79 means that if we increase b[10] by a small amount ϵ, the loss will increase by approximately 0.79ϵ.

In [None]:
# Loss for the first minibatch with the original parameters
@show value(model.b)
model(x,y)

In [None]:
# To numerically check the gradient let's increase the last entry of b by 0.1 (it starts at 0, so we set it to 0.1).
model.b[10] = 0.1

In [None]:
# We see that the loss moves by ≈ +0.79*0.1 as expected.
@show value(model.b)
model(x,y)

In [None]:
# Reset the change.
model.b[10] = 0
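
The check above perturbs one entry in one direction with a fairly large step. A more systematic variant is a central difference over every entry of the bias, compared against a hand-derived gradient. The sketch below is self-contained and does not use Knet or the MNIST data; all names (`wtoy`, `btoy`, `xtoy`, `ytoy`, `toyloss`, `toygrad_b`, `numgrad_b`) and the toy sizes are hypothetical.

```julia
using Statistics: mean
using Random

Random.seed!(1)
# Hypothetical toy problem: 3 classes, 5 features, 8 instances
wtoy, btoy = randn(3, 5), zeros(3)
xtoy, ytoy = randn(5, 8), rand(1:3, 8)

# Average -log probability of the correct answers for a linear model
function toyloss(w, b, x, y)
    s = w * x .+ b
    logp = s .- log.(sum(exp.(s), dims=1))
    -mean(logp[y[i], i] for i in 1:length(y))
end

# Analytic gradient with respect to b: (probabilities - onehot), averaged over instances
function toygrad_b(w, b, x, y)
    s = w * x .+ b
    p = exp.(s) ./ sum(exp.(s), dims=1)
    for i in 1:length(y); p[y[i], i] -= 1; end
    vec(sum(p, dims=2)) / length(y)
end

# Central difference for entry k: (J(b + ϵ e_k) - J(b - ϵ e_k)) / (2ϵ)
function numgrad_b(w, b, x, y, k; ϵ=1e-5)
    bplus, bminus = copy(b), copy(b)
    bplus[k]  += ϵ
    bminus[k] -= ϵ
    (toyloss(w, bplus, x, y) - toyloss(w, bminus, x, y)) / (2ϵ)
end

analytic = toygrad_b(wtoy, btoy, xtoy, ytoy)
numeric  = [numgrad_b(wtoy, btoy, xtoy, ytoy, k) for k in 1:length(btoy)]
maxdiff  = maximum(abs.(analytic .- numeric))   # should be very small
```

The perturbed copies leave `btoy` untouched, so there is nothing to reset afterwards.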

## Checking the gradient using manual implementation

In [None]:
# Without AutoGrad we would have to define the gradients manually:
function nllgrad(model,x,y)
    scores = model(x)
    expscores = exp.(scores)
    probabilities = expscores ./ sum(expscores, dims=1)
    for i in 1:length(y); probabilities[y[i],i] -= 1; end
    dJds = probabilities / length(y)
    dJdw = dJds * x'
    dJdb = vec(sum(dJds,dims=2))
    dJdw,dJdb
end;
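
For reference, the lines of `nllgrad` correspond to the following formulas. The notation is chosen here and is not in the code: $s$ is the score matrix, $p$ the matrix of normalized probabilities, $X$ the input matrix with one instance per column, $y_i$ the correct label of instance $i$, and $N$ the number of instances.

$$
\frac{\partial J}{\partial s_{k,i}} = \frac{p_{k,i} - \mathbb{1}[k = y_i]}{N}, \qquad
\frac{\partial J}{\partial W} = \frac{\partial J}{\partial s}\, X^\top, \qquad
\frac{\partial J}{\partial b_k} = \sum_{i=1}^{N} \frac{\partial J}{\partial s_{k,i}}
$$

The first expression is `dJds` (subtract 1 at the correct label, then divide by the batch size), the second is `dJdw`, and the third is `dJdb`.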

In [None]:
∇w2,∇b2 = nllgrad(model,x,y)

In [None]:
∇w2 ≈ ∇w

In [None]:
∇b2 ≈ ∇b

## Training with Stochastic Gradient Descent (SGD)

In [None]:
# Here is a single SGD update:
function sgdupdate!(func, args; lr=0.1)
    fval = @diff func(args...)
    for param in params(fval)
        ∇param = grad(fval, param)
        param .-= lr * ∇param
    end
    return value(fval)
end

In [None]:
# We define SGD for a dataset as an iterator so that:
# 1. We can monitor and report the training loss
# 2. We can take snapshots of the model during training
# 3. We can pause/terminate training when necessary
sgd(func, data; lr=0.1) = 
    (sgdupdate!(func, args; lr=lr) for args in data)
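
Because `sgd` returns a generator, constructing it performs no updates; the work happens only when something iterates over it. The sketch below illustrates this on a made-up two-class problem, using a hand-derived gradient instead of AutoGrad so that it runs without Knet. All names (`xs`, `ys`, `batches`, `wsgd`, `bsgd`, `toyupdate!`, `toysgd`) are hypothetical.

```julia
using Statistics: mean
using Random

Random.seed!(0)
# Hypothetical toy problem: 2 classes decided by the sign of the first feature
xs = randn(2, 200)
ys = [xs[1, i] > 0 ? 1 : 2 for i in 1:200]
batches = [(xs[:, i:i+19], ys[i:i+19]) for i in 1:20:200]   # 10 minibatches of 20

wsgd, bsgd = zeros(2, 2), zeros(2)

# One SGD update with the softmax gradient; returns the loss before the update
function toyupdate!(w, b, x, y; lr=0.1)
    s = w * x .+ b
    p = exp.(s) ./ sum(exp.(s), dims=1)
    loss = -mean(log(p[y[i], i]) for i in 1:length(y))
    for i in 1:length(y); p[y[i], i] -= 1; end
    g = p / length(y)
    w .-= lr * (g * x')
    b .-= lr * vec(sum(g, dims=2))
    return loss
end

# A generator over endlessly repeated minibatches: building it runs no updates
toysgd = (toyupdate!(wsgd, bsgd, x, y) for (x, y) in Iterators.cycle(batches))
untouched = all(iszero, wsgd)                   # weights are still at their initial values

losses = collect(Iterators.take(toysgd, 100))   # iterating performs 100 updates
```

The first recorded loss is `log(2)`, since zero weights give both classes probability 0.5, and the later losses are lower.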

In [None]:
# Let's train a model for 10 epochs to compare training speed on cpu vs gpu.
# progress!(itr) displays a progress bar when wrapped around an iterator like this:
# 2.94e-01  100.00%┣████████████████████┫ 6000/6000 [00:10/00:10, 592.96/s] 2.31->0.28
model = Linear(784,10)
@show model(dtst)
progress!(sgd(model, ncycle(dtrn,10)))
@show model(dtst);

## Using the GPU

In [None]:
# The training would go a lot faster on a GPU:
# 2.94e-01  100.00%┣███████████████████┫ 6000/6000 [00:02/00:02, 2653.45/s]  2.31->0.28
# To work on a GPU, all we have to do is convert Arrays to KnetArrays:
if CUDA.functional()  # returns true if there is a GPU
    atype = KnetArray{Float32}  # KnetArrays are stored and processed on the GPU
    dtrn = minibatch(xtrn, ytrn, 100; xsize = (784,:), xtype=atype)
    dtst = minibatch(xtst, ytst, 100; xsize = (784,:), xtype=atype)
    Linear(i::Int,o::Int,scale=0.01) = 
        Linear(Param(atype(scale * randn(o,i))), 
               Param(atype(zeros(o))))

    model = Linear(784,10)
    @show model(dtst)
    progress!(sgd(model,ncycle(dtrn,10)))
    @show model(dtst)
end;


## Recording progress

In [None]:
function trainresults(file, model)
    if (print("Train from scratch? (~77s) "); startswith(readline(), "y"))  # empty input counts as "no"
        # We will train 100 epochs (the following returns an iterator, does not start training)
        training = sgd(model, ncycle(dtrn,100))
        # We will snapshot model and train/test loss and errors
        snapshot() = (deepcopy(model),model(dtrn),model(dtst),zeroone(model,dtrn),zeroone(model,dtst))
        # Snapshot results once every epoch (still an iterator)
        snapshots = (snapshot() for x in takenth(progress(training),length(dtrn)))
        # Run the snapshot/training iterator, reshape and save results as a 5x100 array
        lin = reshape(collect(flatten(snapshots)),(5,:))
        # Knet.save and Knet.load can be used to store models in files
        Knet.save(file,"results",lin)
    else
        isfile(file) || download("http://people.csail.mit.edu/deniz/models/tutorial/$file", file)
        lin = Knet.load(file,"results")    
    end
    return lin
end
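
The snapshot logic in `trainresults` relies on lazy iterators: each snapshot is taken at the moment its element passes through, so it sees the model state at that point in training. The sketch below mimics this with a counter in place of a model. It uses only Base; the filter over `enumerate` is a stand-in for `takenth(training, 4)`, and the names `counter`, `step!`, `training`, `every4th` and `snapshots` are hypothetical.

```julia
# Stand-in for training: each step increments a counter and returns its new value
counter = Ref(0)
step!() = (counter[] += 1; counter[])
training = (step!() for _ in 1:12)     # lazy: no step has run yet

# Keep every 4th element (a Base-only stand-in for IterTools.takenth)
every4th = (v for (i, v) in enumerate(training) if i % 4 == 0)
before = counter[]                     # still 0, nothing has been iterated

# Each "snapshot" pairs the element with the state observed when it arrives
snapshots = [(v, counter[]) for v in every4th]
```

After the comprehension finishes, all 12 steps have run and three snapshots have been taken, one after every 4th step.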

In [None]:
# 2.43e-01  100.00%┣████████████████▉┫ 60000/60000 [00:44/00:44, 1349.13/s]
lin = trainresults("lin113.jld2",Linear(784,10));

## Linear model shows underfitting

In [None]:
using Plots; default(fmt = :png)

In [None]:
# Demonstrates underfitting: training loss not close to 0
# Also slight overfitting: test loss higher than train
trnloss,tstloss = Array{Float32}(lin[2,:]), Array{Float32}(lin[3,:]) 
plot([trnloss,tstloss],ylim=(.0,.4),labels=[:trnloss :tstloss],xlabel="Epochs",ylabel="Loss")

In [None]:
# This is the error plot: we get to about 7.5% test error, i.e. 92.5% accuracy
trnerr,tsterr = Array{Float32}(lin[4,:]), Array{Float32}(lin[5,:]) 
plot([trnerr,tsterr],ylim=(.0,.12),labels=[:trnerr :tsterr],xlabel="Epochs",ylabel="Error")

## Visualizing the learned weights

In [None]:
# Let us visualize the evolution of the weight matrix as images below
# Each row is turned into a 28x28 image with positive weights light and negative weights dark gray
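using Images, ImageMagick
# mnistview is not defined earlier in this notebook; the definition below is an assumption
# modeled on the helper of the same name in Knet's data/mnist.jl
mnistview(x,i) = colorview(Gray, permutedims(x[:,:,1,i], (2,1)))
for t in 10 .^ range(0,stop=log10(size(lin,2)),length=20) #logspace(0,2,20)
    i = ceil(Int,t)
    f = lin[1,i]
    w1 = reshape(Array(value(f.w))', (28,28,1,10))
    w2 = clamp.(w1.+0.5,0,1)
    IJulia.clear_output(true)
    display(hcat([mnistview(w2,k) for k=1:10]...))  # k indexes the 10 classes; i is the snapshot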
    display("Epoch $(i-1)")
    sleep(1) # (0.96^i)
end