# Linear models, loss functions, gradients, SGD
(c) Deniz Yuret, 2019

* Objectives: Define, train and visualize a simple model; understand gradients and SGD; learn to use the GPU.
* Prerequisites: [Callable objects](https://docs.julialang.org/en/v1/manual/methods/#Function-like-objects-1), MNIST data (02.mnist.ipynb)
* AutoGrad: Param, @diff, grad, value (used and explained)
* Knet: accuracy, zeroone, nll, sgd! (defined and explained)
* Knet: gpu, KnetArray (used and explained)
* Knet: dir, minibatch (used by mnist.jl)
* Knet: load, save (used by the experiment)

In [2]:
# Load packages, set display
using Pkg; for p in ("Knet","AutoGrad","Plots","Images","ImageMagick","ProgressMeter"); haskey(Pkg.installed(),p) || Pkg.add(p); end
ENV["COLUMNS"]=72

72

In [8]:
# Load data (see 02.mnist.ipynb)
import Knet
include(Knet.dir("data","mnist.jl"))
dtrn,dtst = mnistdata(xsize=(784,:),xtype=Array{Float32},ytype=Array{Int});

## Define linear model

In [19]:
# In Julia we define a new datatype using `struct`:
struct Linear; w; b; end

# The new struct comes with a default constructor:
model = Linear(0.01 * randn(10,784), zeros(10))

# We can define other constructors with different inputs:
Linear(i::Int,o::Int,scale=0.01) = Linear(scale * randn(o,i), zeros(o))

# This allows instances to be defined using input and output sizes:
model = Linear(784,10)

# We turn Linear instances into callable objects with the following:
(m::Linear)(x) = m.w * x .+ m.b
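
To make the shapes concrete, here is a minimal sketch of the same callable-object pattern on a tiny hand-written example. `TinyLinear` and the `toy_*` names are hypothetical and chosen only to avoid clashing with `Linear`, `x` and `y` in this notebook; instances are stored as columns, so the output has one column of scores per instance.

```julia
# Hypothetical toy type mirroring the Linear pattern above
struct TinyLinear; w; b; end
(m::TinyLinear)(x) = m.w * x .+ m.b   # b is broadcast across the columns

tiny = TinyLinear([1.0 2.0; 3.0 4.0; 5.0 6.0], [0.1, 0.2, 0.3])  # 3 outputs, 2 inputs
toy_x = [1.0 0.0; 0.0 1.0]            # two instances, one per column
toy_scores = tiny(toy_x)              # 3×2 matrix: rows are classes, columns are instances
```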

## Prediction and accuracy

In [10]:
# Let's take the first minibatch from the test set
x,y = first(dtst)
summary.((x,y))

("784×100 Array{Float32,2}", "100-element Array{Int64,1}")

In [11]:
# Display the model's predictions for the first minibatch: a 10xN score matrix
ypred = model(x)

10×100 Array{Float64,2}:
 -0.103425   -0.0194934  -0.030135     …   0.0476652  -0.105175 
 -0.161313   -0.0216474   0.000149238     -0.0930928  -0.0215103
  0.143538    0.135251    0.0740459        0.100858    0.112494 
 -0.0047826  -0.0353925   0.106043         0.0272591   0.053774 
 -0.143899    0.0749842   0.0335083        0.0589286   0.115073 
  0.0310643  -0.0243017  -0.083892     …  -0.0419657   0.150236 
 -0.109966    0.0818592   0.00318108       0.0115989  -0.109778 
 -0.107214   -0.195953   -0.00195027      -0.0485685  -0.0282934
  0.0217651  -0.0657203  -0.0760563       -0.142605    0.0686347
  0.0187545   0.254221    0.0724638        0.0530679   0.0336366

In [13]:
# correct answers are given as an array of integers
# (remember we use 10 for 0)
y'

1×100 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 7  2  1  10  4  1  4  9  5  9  …  1  3  6  9  3  1  4  1  7  6  9

In [14]:
# We can calculate the accuracy of our model for the first minibatch
using Statistics
accuracy(model,x,y) = mean(y' .== map(i->i[1], findmax(Array(model(x)),dims=1)[2]))
accuracy(model,x,y)

0.16

In [15]:
# We can calculate the accuracy of our model for the whole test set
accuracy(model,data) = mean(accuracy(model,x,y) for (x,y) in data)
accuracy(model,dtst)

0.13919999999999993

In [16]:
# ZeroOne loss (or error) is defined as 1 - accuracy
zeroone(x...) = 1 - accuracy(x...)
zeroone(model,dtst)

0.8608
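
The accuracy computation above can be traced step by step on a small score matrix. This is only an illustrative sketch with made-up numbers; the `toy_*` names are hypothetical. The predicted class of each instance is the row index of the largest score in its column.

```julia
using Statistics

toy_scores = [0.1 0.7 0.2;
              0.8 0.1 0.3;
              0.1 0.2 0.5]            # 3 classes × 3 instances
toy_labels = [2, 1, 1]                # correct class for each column

# findmax(...; dims=1)[2] holds the CartesianIndex of each column maximum;
# its first component is the row, i.e. the predicted class
toy_pred = vec(map(i -> i[1], findmax(toy_scores, dims=1)[2]))

toy_accuracy = mean(toy_pred .== toy_labels)   # fraction of correct predictions
toy_error = 1 - toy_accuracy                   # zero-one loss
```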

## Negative log likelihood

In [20]:
# With two inputs, let the model compute a loss. For classification we use
# Negative log likelihood (aka cross entropy, softmax loss, NLL)
function (m::Linear)(x, y)
    scores = m(x)
    expscores = exp.(scores)
    probabilities = expscores ./ sum(expscores, dims=1)
    answerprobs = (probabilities[y[i],i] for i in 1:length(y))
    mean(-log.(answerprobs))
end

In [21]:
# Calculate loss of our model for the first minibatch
model(x,y)

2.3176554373897527

In [22]:
# If the input is a dataset compute average loss:
# per-instance average negative log likelihood for the whole test set
(m::Linear)(data::Knet.Data) = mean(m(x,y) for (x,y) in data)
model(dtst)

2.310214740879614
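
As a sanity check on these numbers: a model that assigns equal probability to all 10 classes has a negative log likelihood of log(10) ≈ 2.3026, which is close to the loss of the randomly initialized model above. The sketch below uses plain arrays and a hypothetical helper `toy_nll`. Unlike the definition above, it subtracts the column maximum before exponentiating, a common trick (an addition here, not part of the original code) that leaves the result unchanged but avoids overflow for large scores.

```julia
using Statistics

# Hypothetical helper: average NLL from a score matrix (classes × instances)
function toy_nll(scores, y)
    shifted = scores .- maximum(scores, dims=1)              # does not change the softmax
    logprobs = shifted .- log.(sum(exp.(shifted), dims=1))   # log of softmax probabilities
    -mean(logprobs[y[i], i] for i in 1:length(y))
end

uniform_loss = toy_nll(zeros(10, 5), [1, 2, 3, 4, 5])        # all classes equally likely

# Compare with the direct formula used in the cell above
toy_s = [2.0 0.0; 0.0 1.0; 1.0 3.0]
toy_y = [1, 3]
toy_p = exp.(toy_s) ./ sum(exp.(toy_s), dims=1)
naive_loss = -mean(log(toy_p[toy_y[i], i]) for i in 1:length(toy_y))
stable_loss = toy_nll(toy_s, toy_y)
```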

## Calculating the gradient using AutoGrad

In [23]:
using AutoGrad
@doc AutoGrad

Usage:

```
x = Param([1,2,3])          # user declares parameters
x => P([1,2,3])             # they are wrapped in a struct
value(x) => [1,2,3]         # we can get the original value
sum(abs2,x) => 14           # they act like regular values outside of differentiation
y = @diff sum(abs2,x)       # if you want the gradients
y => T(14)                  # you get another struct
value(y) => 14              # which represents the same value
grad(y,x) => [2,4,6]        # but also contains gradients for all Params
```

`Param(x)` returns a struct that acts like `x` but marks it as a parameter you want to compute gradients with respect to.

`@diff expr` evaluates an expression and returns a struct that contains its value (which should be a scalar) and gradient information.

`grad(y, x)` returns the gradient of `y` (output by @diff) with respect to any parameter `x::Param`, or  `nothing` if the gradient is 0.

`value(x)` returns the value associated with `x` if `x` is a `Param` or the output of `@diff`, otherwise returns `x`.

`params(x)` returns an array of Params found by a recursive search of object `x`.

Alternative usage:

```
x = [1 2 3]
f(x) = sum(abs2, x)
f(x) => 14
grad(f)(x) => [2 4 6]
gradloss(f)(x) => ([2 4 6], 14)
```

Given a scalar valued function `f`, `grad(f,argnum=1)` returns another function `g` which takes the same inputs as `f` and returns the gradient of the output with respect to the argnum'th argument. `gradloss` is similar except the resulting function also returns f's output.


In [24]:
# Redefine the constructor to use Params so we can compute gradients
Linear(i::Int,o::Int,scale=0.01) = Linear(Param(scale * randn(o,i)), Param(zeros(o)))

Linear

In [25]:
# Set random seed for replicability
using Random; Random.seed!(9);

In [26]:
# Use a larger scale to get a large initial loss
model = Linear(784,10,1.0)

Linear(P(Array{Float64,2}(10,784)), P(Array{Float64,1}(10)))

In [27]:
# We can still do predictions with the model and calculate loss:
model(x,y)

19.10423456298375

In [28]:
# We can also run the same loss calculation while recording gradient information:
J = @diff model(x,y)

T(19.104234562983752)

In [29]:
# To get the actual loss value from J:
value(J)

19.104234562983752

In [30]:
# To get the gradient of a parameter from J:
∇w = grad(J,model.w)

10×784 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0

In [34]:
# Note that each gradient has the same size and shape as the corresponding parameter:
@show ∇b = grad(J,model.b);

∇b = grad(J, model.b) = [-0.139954, -0.064541, -0.109522, -0.1275, -0.059184, -0.0980703, -0.102617, 0.0133898, -0.104578, 0.792576]


## Checking the gradient using numerical approximation

What does ∇b represent?

∇b[10] = 0.79 means that if I increase b[10] by a small ϵ, the loss will increase by approximately 0.79ϵ

In [36]:
# Loss for the first minibatch with the original parameters
@show value(model.b)
model(x,y)

value(model.b) = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


19.10423456298375

In [None]:
# To numerically check the gradient let's increase the last entry of b by +0.1.
model.b[10] = 0.1

In [37]:
# We see that the loss moves by ≈ +0.79*0.1 as expected.
@show value(model.b)
model(x,y)

value(model.b) = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1]


19.183620170313954

In [38]:
# Reset the change.
model.b[10] = 0

0
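
The same kind of check can be automated for every entry with finite differences. The sketch below does not use AutoGrad or the model; it uses a toy function whose gradient is known (`sum(abs2, v)` has gradient `2v`, as in the AutoGrad usage example). The names `sumsq` and `numgrad` are hypothetical, and the central difference is one common choice of approximation.

```julia
sumsq(v) = sum(abs2, v)               # toy function; exact gradient is 2v

# Hypothetical helper: central-difference estimate of ∂f/∂v[k]
function numgrad(f, v, k; ϵ=1e-6)
    vplus = copy(v);  vplus[k]  += ϵ
    vminus = copy(v); vminus[k] -= ϵ
    (f(vplus) - f(vminus)) / (2ϵ)
end

toy_v = [1.0, 2.0, 3.0]
toy_numgrad = [numgrad(sumsq, toy_v, k) for k in 1:length(toy_v)]  # close to [2, 4, 6]
```

Perturbing one entry at a time needs two function evaluations per parameter, which is why this is used for checking gradients rather than for training.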

## Checking the gradient using manual implementation

In [39]:
# Without AutoGrad we would have to define the gradients manually:
function nllgrad(model,x,y)
    scores = model(x)
    expscores = exp.(scores)
    probabilities = expscores ./ sum(expscores, dims=1)
    for i in 1:length(y); probabilities[y[i],i] -= 1; end
    dJds = probabilities / length(y)
    dJdw = dJds * x'
    dJdb = vec(sum(dJds,dims=2))
    dJdw,dJdb
end;

In [40]:
∇w2,∇b2 = nllgrad(model,x,y)

([0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [-0.139954, -0.064541, -0.109522, -0.1275, -0.059184, -0.0980703, -0.102617, 0.0133898, -0.104578, 0.792576])

In [41]:
∇w2 ≈ ∇w

true

In [42]:
∇b2 ≈ ∇b

true
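
For reference, the steps in `nllgrad` correspond to the standard softmax gradient. With scores $s = Wx + b$, probabilities $p$, labels $y_i$ and $N$ instances in the minibatch (the symbols $s$, $p$, $N$ are notation introduced here, matching `scores`, `probabilities` and `length(y)` in the code):

$$
J = -\frac{1}{N}\sum_{i=1}^{N} \log p_{y_i,i},
\qquad
p_{k,i} = \frac{\exp(s_{k,i})}{\sum_{j}\exp(s_{j,i})}
$$

$$
\frac{\partial J}{\partial s_{k,i}} = \frac{1}{N}\left(p_{k,i} - [k = y_i]\right),
\qquad
\frac{\partial J}{\partial W} = \frac{\partial J}{\partial s}\, x^{\top},
\qquad
\frac{\partial J}{\partial b_k} = \sum_{i=1}^{N} \frac{\partial J}{\partial s_{k,i}}
$$

Subtracting 1 from `probabilities[y[i],i]` implements the indicator $[k = y_i]$, and the division by `length(y)` is the $1/N$ factor.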

## Training with Stochastic Gradient Descent (SGD)

In [52]:
function sgd!(model, data; lr=0.1)
    for (x,y) in data
        loss = @diff model(x,y)
        for param in (model.w, model.b)
            ∇param = grad(loss, param)
            param .= param - lr * ∇param
        end
    end
end

sgd! (generic function with 1 method)
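
To see the update rule in isolation, here is a minimal sketch of minibatch SGD on a toy least-squares problem with a hand-written gradient instead of AutoGrad. Everything in it (`toy_sgd!`, `w_true`, the data, the batch size and epoch count) is made up for illustration; only the update `w .-= lr .* ∇w` mirrors the loop above.

```julia
using Random
Random.seed!(1)

# Hypothetical toy problem: recover w_true from noiseless linear data
w_true = [2.0, -3.0]
xs = randn(2, 200)                    # 200 instances as columns
ys = vec(w_true' * xs)

function toy_sgd!(w, xs, ys; lr=0.1, batchsize=20, epochs=50)
    n = size(xs, 2)
    for epoch in 1:epochs, start in 1:batchsize:n
        idx = start:min(start + batchsize - 1, n)
        x, y = xs[:, idx], ys[idx]
        residual = vec(w' * x) .- y
        ∇w = 2 .* (x * residual) ./ length(idx)   # gradient of the mean squared error
        w .-= lr .* ∇w                            # step against the gradient
    end
    w
end

w_fit = toy_sgd!(zeros(2), xs, ys)    # should end up close to w_true
```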

In [57]:
# We will use the more efficient Knet.nll implementation for loss calculation
using Knet: nll
(m::Linear)(x, y) = nll(m(x), y)

In [58]:
# Let's try a randomly initialized model for 10 epochs
model = Linear(784,10)
@show model(dtst)
@time sgd!(model,repeat(dtrn,10)) # 17s
@show model(dtst)

model(dtst) = 2.314792544641632
 11.183495 seconds (1.57 M allocations: 3.266 GiB, 2.43% gc time)
model(dtst) = 0.28067116715923435


0.28067116715923435

## Using the GPU

In [63]:
# To work on the GPU, all we have to do is convert our Arrays to KnetArrays:
using Knet: KnetArray   # KnetArrays are allocated on and operated on by the GPU
if Knet.gpu() >= 0      # Knet.gpu() returns a device id >= 0 if there is a GPU, -1 otherwise
    dtrn.xtype = dtst.xtype = KnetArray{Float32}
    Linear(i::Int,o::Int,scale=0.01) = 
        Linear(Param(KnetArray{Float32}(scale * randn(o,i))), 
               Param(KnetArray{Float32}(zeros(o))))

    model = Linear(784,10)
    @show model(dtst)
    @time sgd!(model,repeat(dtrn,10)) # 7.8s
    @show model(dtst)
end

model(dtst) = 2.3179018f0
  4.321921 seconds (3.74 M allocations: 1.889 GiB, 8.15% gc time)
model(dtst) = 0.28062376f0


0.28062376f0

In [64]:
# If a GPU was found, the model parameters are now KnetArrays:
model

Linear(P(KnetArray{Float32,2}(10,784)), P(KnetArray{Float32,1}(10)))

In [None]:
# Let's collect some data to draw training curves and visualize weights:
using ProgressMeter: @showprogress

function trainresults(file, epochs)
    results = []
    pa(x) = Knet.gpu() >= 0 ? Param(KnetArray{Float32}(x)) : Param(Array{Float32}(x))
    model = Linear(pa(randn(10,784)*0.01), pa(zeros(10)))
    @showprogress for epoch in 1:epochs  # 100ep 77s (0.2668, 0.0744)
        push!(results, deepcopy(model), Knet.nll(model,dtrn), Knet.nll(model,dtst), zeroone(model,dtrn), zeroone(model,dtst))
        sgd!(model,dtrn)  # one epoch of SGD using the sgd! defined above
    end
    results = reshape(results, (5, :))
    Knet.save(file,"results",results)
end    
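
The `reshape` at the end of `trainresults` relies on Julia's column-major layout: five values are pushed per epoch, so reshaping the flat vector to `(5, :)` gives one column per epoch and one row per quantity. A small sketch with placeholder values (strings instead of models, invented numbers, hypothetical `toy_*` names) shows the resulting layout:

```julia
toy_results = []
for epoch in 1:3
    # same order as above: model, train loss, test loss, train error, test error
    push!(toy_results, "model$epoch", 0.5/epoch, 0.6/epoch, 0.2/epoch, 0.3/epoch)
end

toy_table = reshape(toy_results, (5, :))   # 5×3: one column per epoch
toy_tstloss = toy_table[3, :]              # row 3 holds the test loss of every epoch
```

This is why the later cells read test loss from `lin[3,:]` and test error from `lin[5,:]`.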

In [None]:
# Use Knet.load and Knet.save to store models, results, etc.
if (print("Train from scratch? (~77s) "); startswith(readline(), "y"))
    trainresults("lin.jld2",100)  # (0.2668679f0, 0.0745)
end
isfile("lin.jld2") || download("http://people.csail.mit.edu/deniz/models/tutorial/lin.jld2","lin.jld2")
lin = Knet.load("lin.jld2","results")
minimum(lin[3,:]), minimum(lin[5,:])

## Linear model shows underfitting

In [None]:
using Plots; default(fmt = :png)

In [None]:
# Demonstrates underfitting: training loss not close to 0
# Also slight overfitting: test loss higher than train
plot([lin[2,:], lin[3,:]],ylim=(.0,.4),labels=[:trnloss :tstloss],xlabel="Epochs",ylabel="Loss")

In [None]:
# Error plot: we reach about 7.5% test error, i.e. 92.5% accuracy
plot([lin[4,:], lin[5,:]],ylim=(.0,.12),labels=[:trnerr :tsterr],xlabel="Epochs",ylabel="Error")

## Visualizing the learned weights

In [None]:
# Let us visualize the evolution of the weight matrix as images below
# Each row is turned into a 28x28 image with positive weights light and negative weights dark gray
using Images, ImageMagick
for t in 10 .^ range(0,stop=log10(size(lin,2)),length=10) # log-spaced epochs
    epoch = floor(Int,t)
    f = lin[1,epoch]
    w1 = reshape(Array(value(f.w))', (28,28,1,10))
    w2 = clamp.(w1.+0.5,0,1)   # shift so that a zero weight maps to mid-gray
    IJulia.clear_output(true)
    display(hcat([mnistview(w2,k) for k=1:10]...))
    display("Epoch $epoch")
    sleep(1) # pause so each frame stays visible
end
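
Two small pieces of the loop above can be tried on their own without the saved results: the log-spaced choice of epochs and the mapping from weights to gray levels. The sketch assumes 100 saved epochs and uses made-up weight values; `toy_frames`, `toy_w` and `toy_gray` are hypothetical names.

```julia
# Log-spaced epoch indices between 1 and 100: dense early, sparse late
toy_frames = floor.(Int, 10 .^ range(0, stop=log10(100), length=10))

# Weights are shifted by 0.5 and clamped to [0,1] before display:
# negative weights become darker than mid-gray, positive weights lighter
toy_w = [-1.0, -0.25, 0.0, 0.25, 1.0]
toy_gray = clamp.(toy_w .+ 0.5, 0, 1)
```

Early epochs are sampled more densely because that is where the weights change fastest.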