In [1]:
using Knet
include("optimizers.jl")

Main.Optimizers

In [2]:
@doc Optimizers

This example demonstrates the usage of stochastic gradient descent (SGD) based optimization methods. We train the LeNet model on the MNIST dataset, similar to `lenet.jl`.

You can run the demo using `julia optimizers.jl`. Use `julia optimizers.jl --help` for a list of options. By default the [LeNet](http://yann.lecun.com/exdb/lenet) convolutional neural network model will be trained using SGD for 10 epochs. At the end of training, the accuracy on the training and test sets for each epoch will be printed and the optimized parameters will be returned.


In [3]:
Optimizers.main("--help")

usage: <PROGRAM> [--seed SEED] [--batchsize BATCHSIZE] [--lr LR]
                 [--eps EPS] [--gamma GAMMA] [--rho RHO]
                 [--beta1 BETA1] [--beta2 BETA2] [--epochs EPOCHS]
                 [--iters ITERS] [--optim OPTIM] [--atype ATYPE]

optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration
of different sgd based optimization methods using LeNet.

optional arguments:
  --seed SEED           random number seed: use a nonnegative int for
                        repeatable results (type: Int64, default: -1)
  --batchsize BATCHSIZE
                        minibatch size (type: Int64, default: 100)
  --lr LR               learning rate (type: Float64, default: 0.1)
  --eps EPS             epsilon parameter used in adam, adagrad,
                        adadelta (type: Float64, default: 1.0e-6)
  --gamma GAMMA         gamma parameter used in momentum and nesterov
                        (type: Float64, default: 0.95)
  --rho RHO             rho parameter used

In [4]:
@doc SGD

```
SGD(;lr=0.001,gclip=0)
update!(w,g,p::SGD)
update!(w,g;lr=0.001)
```

Container for parameters of the Stochastic gradient descent (SGD) optimization algorithm used by [`update!`](@ref).

SGD is an optimization technique to minimize an objective function by updating its weights in the opposite direction of their gradient. The learning rate (lr) determines the size of the step.  SGD updates the weights with the following formula:

```
w = w - lr * g
```

where `w` is a weight array, `g` is the gradient of the loss function w.r.t `w` and `lr` is the learning rate.
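The update rule can be tried out without Knet. The following is a minimal sketch in plain Julia; `sgd_step!` and `minimize_quadratic` are hypothetical helpers written for illustration (not part of Knet), and the toy objective `sum(w.^2) / 2` is an assumption chosen because its gradient is simply `w`.

```julia
# Plain SGD step: move each weight against its gradient, scaled by lr.
# `sgd_step!` is a hypothetical helper, not Knet's `update!`.
function sgd_step!(w, g; lr=0.001)
    w .-= lr .* g          # in-place, elementwise: w = w - lr * g
    return w
end

# Toy objective f(w) = sum(w.^2) / 2, whose gradient at w is w itself.
function minimize_quadratic(w0; lr=0.1, steps=100)
    w = copy(w0)
    for _ in 1:steps
        g = copy(w)        # gradient of the toy objective at w
        sgd_step!(w, g; lr=lr)
    end
    return w
end

w_final = minimize_quadratic([1.0, -2.0, 3.0])
```

Each step multiplies the weights by `1 - lr`, so on this objective they shrink geometrically towards the minimum at zero.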

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.
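The clipping rule, which is shared by all optimizers in this notebook, can be sketched as follows. `clip_gradient!` is a hypothetical name used only for this illustration; it is not claimed to be how Knet implements `gclip` internally.

```julia
using LinearAlgebra: norm

# Hypothetical helper illustrating the clipping rule described above.
function clip_gradient!(g, gclip)
    gclip == 0 && return g          # gclip == 0 disables clipping
    gnorm = norm(g)
    if gnorm > gclip
        g .*= gclip / gnorm         # rescale so that norm(g) equals gclip
    end
    return g
end

g = clip_gradient!([3.0, 4.0], 1.0)   # norm 5 is scaled down to norm 1
```

Only the length of the gradient changes; its direction is preserved, and gradients whose norm is already below `gclip` are left untouched.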

SGD is used by default if no algorithm is specified in the two-argument version of [`update!`](@ref).


In [6]:
Optimizers.main(""); # Tries SGD by default

optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.
opts=(:atype, "KnetArray{Float32}")(:gamma, 0.95)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, "SGD")(:lr, 0.1)(:seed, -1)
(:epoch, 0, :trn, 0.1017, :tst, 0.1019)
(:epoch, 1, :trn, 0.96335, :tst, 0.9657)
(:epoch, 2, :trn, 0.97825, :tst, 0.9778)
(:epoch, 3, :trn, 0.9844, :tst, 0.9821)
(:epoch, 4, :trn, 0.988, :tst, 0.9853)
(:epoch, 5, :trn, 0.9904166666666666, :tst, 0.9862)
(:epoch, 6, :trn, 0.9917666666666667, :tst, 0.9873)
(:epoch, 7, :trn, 0.9926333333333334, :tst, 0.9873)
(:epoch, 8, :trn, 0.9932666666666666, :tst, 0.9879)
(:epoch, 9, :trn, 0.9938, :tst, 0.9884)
(:epoch, 10, :trn, 0.9946, :tst, 0.9883)
 30.585886 seconds (11.26 M allocations: 4.346 GiB, 1.08% gc time)


In [7]:
@doc Momentum

```
Momentum(;lr=0.001, gclip=0, gamma=0.9)
update!(w,g,p::Momentum)
```

Container for parameters of the Momentum optimization algorithm used by [`update!`](@ref).

The Momentum method tries to accelerate SGD by adding a velocity term to the update.  This also decreases the oscillation between successive steps. It updates the weights with the following formulas:

```
velocity = gamma * velocity + lr * g
w = w - velocity
```

where `w` is a weight array, `g` is the gradient of the objective function w.r.t. `w`, `lr` is the learning rate, `gamma` is the momentum parameter, and `velocity` is an array with the same size and type as `w` that holds the accelerated gradients.
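As a minimal sketch of these two formulas in plain Julia (`momentum_step!` is a hypothetical helper, not Knet's `update!`, and the velocity is assumed to start at zero):

```julia
# One momentum step: accumulate the scaled gradient into the velocity,
# then move the weights by the velocity.
function momentum_step!(w, velocity, g; lr=0.001, gamma=0.9)
    velocity .= gamma .* velocity .+ lr .* g
    w .-= velocity
    return w
end

w = [1.0]
velocity = zero(w)                       # assumed initial velocity
momentum_step!(w, velocity, [1.0]; lr=0.1, gamma=0.9)   # velocity ≈ 0.1,  w ≈ 0.9
momentum_step!(w, velocity, [1.0]; lr=0.1, gamma=0.9)   # velocity ≈ 0.19, w ≈ 0.71
```

With a constant gradient the second step is almost twice as long as the first, which is the acceleration effect described above.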

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.

Reference: [Qian, N. (1999)](http://doi.org/10.1016/S0893-6080(98)00116-6). On the momentum term in gradient descent learning algorithms.  Neural Networks : The Official Journal of the International Neural Network Society, 12(1), 145–151.


In [8]:
Optimizers.main("--optim Momentum --lr 0.005 --gamma 0.99");

optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.
opts=(:atype, "KnetArray{Float32}")(:gamma, 0.99)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, "Momentum")(:lr, 0.005)(:seed, -1)
(:epoch, 0, :trn, 0.10626666666666666, :tst, 0.1099)
(:epoch, 1, :trn, 0.9788666666666667, :tst, 0.9787)
(:epoch, 2, :trn, 0.9856, :tst, 0.9845)
(:epoch, 3, :trn, 0.9864166666666667, :tst, 0.9833)
(:epoch, 4, :trn, 0.9870833333333333, :tst, 0.9842)
(:epoch, 5, :trn, 0.9899166666666667, :tst, 0.9834)
(:epoch, 6, :trn, 0.9940333333333333, :tst, 0.9869)
(:epoch, 7, :trn, 0.99255, :tst, 0.986)
(:epoch, 8, :trn, 0.9959333333333333, :tst, 0.9879)
(:epoch, 9, :trn, 0.9962166666666666, :tst, 0.9881)
(:epoch, 10, :trn, 0.9966666666666667, :tst, 0.9906)
 31.861608 seconds (11.83 M allocations: 4.375 GiB, 1.22% gc time)


In [9]:
@doc Nesterov

```
Nesterov(; lr=0.001, gclip=0, gamma=0.9)
update!(w,g,p::Nesterov)
```

Container for parameters of Nesterov's momentum optimization algorithm used by [`update!`](@ref).

It is similar to standard [`Momentum`](@ref) but with a slightly different update rule:

```
velocity = gamma * velocity_old - lr * g
w = w_old - gamma * velocity_old + (1+gamma) * velocity
```

where `w` is a weight array, `g` is the gradient of the objective function w.r.t. `w`, `lr` is the learning rate, `gamma` is the momentum parameter, and `velocity` is an array with the same size and type as `w` that holds the accelerated gradients.
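The sketch below follows the reformulation derived in the referenced paper, in which the previous velocity is multiplied by `gamma` before being subtracted, and compares it with the classic look-ahead form of Nesterov momentum. All function names are hypothetical and the toy objective `sum(x.^2) / 2` is an assumption; this is not Knet's implementation.

```julia
# Reformulated Nesterov step: `w` plays the role of the look-ahead point
# theta + gamma * velocity, so the gradient is taken at `w` directly.
function nesterov_step!(w, velocity, g; lr=0.001, gamma=0.9)
    velocity_old = copy(velocity)
    velocity .= gamma .* velocity_old .- lr .* g
    w .= w .- gamma .* velocity_old .+ (1 + gamma) .* velocity
    return w
end

# Classic look-ahead form: evaluate the gradient at theta + gamma * velocity,
# update the velocity, then move theta by the new velocity.
function lookahead_step!(theta, velocity, grad; lr=0.001, gamma=0.9)
    g = grad(theta .+ gamma .* velocity)
    velocity .= gamma .* velocity .- lr .* g
    theta .+= velocity
    return theta
end

grad(x) = 1.0 .* x      # gradient of the toy objective sum(x.^2) / 2

function compare_forms(x0; lr=0.1, gamma=0.9, steps=20)
    w, vw = copy(x0), zero(x0)
    theta, vt = copy(x0), zero(x0)
    for _ in 1:steps
        nesterov_step!(w, vw, grad(w); lr=lr, gamma=gamma)
        lookahead_step!(theta, vt, grad; lr=lr, gamma=gamma)
    end
    return w, theta .+ gamma .* vt
end

w, lookahead_point = compare_forms([1.0, -2.0])
```

Under these assumptions the two returned arrays agree up to rounding, which shows that the reformulated rule tracks the look-ahead point of the classic method.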

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip == 0` no scaling takes place.

Reference implementation: [Yoshua Bengio, Nicolas Boulanger-Lewandowski and Razvan Pascanu](https://arxiv.org/pdf/1212.0901.pdf)


In [10]:
Optimizers.main("--optim Nesterov --lr 0.005 --gamma 0.99");

optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.
opts=(:atype, "KnetArray{Float32}")(:gamma, 0.99)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, "Nesterov")(:lr, 0.005)(:seed, -1)
(:epoch, 0, :trn, 0.06255, :tst, 0.0574)
(:epoch, 1, :trn, 0.9762833333333333, :tst, 0.9777)
(:epoch, 2, :trn, 0.9874666666666667, :tst, 0.9855)
(:epoch, 3, :trn, 0.9898166666666667, :tst, 0.9851)
(:epoch, 4, :trn, 0.99075, :tst, 0.9868)
(:epoch, 5, :trn, 0.9943166666666666, :tst, 0.9893)
(:epoch, 6, :trn, 0.9946666666666667, :tst, 0.9901)
(:epoch, 7, :trn, 0.9958333333333333, :tst, 0.9892)
(:epoch, 8, :trn, 0.9976833333333334, :tst, 0.9909)
(:epoch, 9, :trn, 0.9967666666666667, :tst, 0.9894)
(:epoch, 10, :trn, 0.9969666666666667, :tst, 0.99)
 32.278787 seconds (11.84 M allocations: 4.371 GiB, 1.26% gc time)


In [11]:
@doc Adagrad

```
Adagrad(;lr=0.1, gclip=0, eps=1e-6)
update!(w,g,p::Adagrad)
```

Container for parameters of the Adagrad optimization algorithm used by [`update!`](@ref).

Adagrad is one of the methods that adapt the learning rate to each of the weights. It stores the sum of the squares of the gradients and uses it to scale the learning rate: each weight's step is the current gradient divided by the square root of its accumulated squared gradients. Hence, the effective learning rate is larger for parameters whose accumulated gradients are small and smaller for parameters whose accumulated gradients are large. It updates the weights with the following formulas:

```
G = G + g .^ 2
w = w - g .* lr ./ sqrt(G + eps)
```

where `w` is the weight, `g` is the gradient of the objective function w.r.t. `w`, `lr` is the learning rate, and `G` is an array with the same size and type as `w` that holds the sum of the squares of the gradients. `eps` is a small constant to prevent a zero value in the denominator.
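A minimal sketch of the Adagrad formulas in plain Julia, with explicit broadcasting dots (`adagrad_step!` is a hypothetical helper, not Knet's `update!`, and `G` is assumed to start at zero):

```julia
# One Adagrad step: accumulate squared gradients, then scale each
# weight's step by the inverse square root of its accumulator.
function adagrad_step!(w, G, g; lr=0.1, eps=1e-6)
    G .+= g .^ 2
    w .-= lr .* g ./ sqrt.(G .+ eps)
    return w
end

w = [1.0, 1.0]
G = zero(w)                           # assumed initial accumulator
adagrad_step!(w, G, [10.0, 0.1])      # gradients differ by a factor of 100
```

On the first step `G` equals `g .^ 2`, so both weights move by roughly `lr` even though their gradients differ by two orders of magnitude. That is the per-weight scaling described above.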

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.

Reference: [Duchi, J., Hazan, E., & Singer, Y. (2011)](http://jmlr.org/papers/v12/duchi11a.html). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.


In [12]:
Optimizers.main("--optim Adagrad --lr 0.01 --eps 1e-6");

optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.
opts=(:atype, "KnetArray{Float32}")(:gamma, 0.95)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, "Adagrad")(:lr, 0.01)(:seed, -1)
(:epoch, 0, :trn, 0.10341666666666667, :tst, 0.1003)
(:epoch, 1, :trn, 0.9788166666666667, :tst, 0.9795)
(:epoch, 2, :trn, 0.9867666666666667, :tst, 0.9853)
(:epoch, 3, :trn, 0.9912666666666666, :tst, 0.9891)
(:epoch, 4, :trn, 0.9936, :tst, 0.99)
(:epoch, 5, :trn, 0.995, :tst, 0.9906)
(:epoch, 6, :trn, 0.9957833333333334, :tst, 0.9909)
(:epoch, 7, :trn, 0.9966333333333334, :tst, 0.9911)
(:epoch, 8, :trn, 0.997, :tst, 0.9913)
(:epoch, 9, :trn, 0.9974333333333333, :tst, 0.9912)
(:epoch, 10, :trn, 0.9978, :tst, 0.9908)
 33.680254 seconds (13.07 M allocations: 4.410 GiB, 1.20% gc time)


In [13]:
@doc Adadelta

```
Adadelta(;lr=0.01, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Adadelta)
```

Container for parameters of the Adadelta optimization algorithm used by [`update!`](@ref).

Adadelta is an extension of Adagrad that tries to prevent the decrease of the learning rates to zero as training progresses. It scales the learning rate based on the accumulated gradients like Adagrad and holds the acceleration term like Momentum. It updates the weights with the following formulas:

```
G = (1-rho) * g .^ 2 + rho * G
update = g .* sqrt(delta + eps) ./ sqrt(G + eps)
w = w - lr * update
delta = rho * delta + (1-rho) * update .^ 2
```

where `w` is the weight, `g` is the gradient of the objective function w.r.t. `w`, `lr` is the learning rate, and `G` is an array with the same size and type as `w` that holds an exponentially decaying average of the squared gradients. `eps` is a small constant to prevent a zero value in the denominator. `rho` is the decay rate of the running averages, and `delta` is an array with the same size and type as `w` that holds an exponentially decaying average of the squared updates.
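The four formulas translate almost line by line into plain Julia. This is a sketch under the assumption that `G` and `delta` both start at zero; `adadelta_step!` is a hypothetical helper, not Knet's `update!`.

```julia
# One Adadelta step, mirroring the four formulas above.
function adadelta_step!(w, G, delta, g; lr=0.01, rho=0.9, eps=1e-6)
    G .= (1 - rho) .* g .^ 2 .+ rho .* G
    update = g .* sqrt.(delta .+ eps) ./ sqrt.(G .+ eps)
    w .-= lr .* update
    delta .= rho .* delta .+ (1 - rho) .* update .^ 2
    return w
end

w = [1.0]
G = zero(w)                  # assumed initial state
delta = zero(w)              # assumed initial state
adadelta_step!(w, G, delta, [1.0]; lr=1.0)
```

Because `delta` starts at zero, the very first update is scaled by `sqrt(eps)` and is therefore tiny; the step size grows as `delta` accumulates past updates.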

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.

Reference: [Zeiler, M. D. (2012)](http://arxiv.org/abs/1212.5701). ADADELTA: An Adaptive Learning Rate Method.


In [14]:
Optimizers.main("--optim Adadelta --lr 0.35 --rho 0.9 --eps 1e-6");

optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.
opts=(:atype, "KnetArray{Float32}")(:gamma, 0.95)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, "Adadelta")(:lr, 0.35)(:seed, -1)
(:epoch, 0, :trn, 0.07245, :tst, 0.0725)
(:epoch, 1, :trn, 0.9737333333333333, :tst, 0.9754)
(:epoch, 2, :trn, 0.9836166666666667, :tst, 0.9834)
(:epoch, 3, :trn, 0.9884833333333334, :tst, 0.9859)
(:epoch, 4, :trn, 0.9905166666666667, :tst, 0.9876)
(:epoch, 5, :trn, 0.99355, :tst, 0.9898)
(:epoch, 6, :trn, 0.9953333333333333, :tst, 0.9902)
(:epoch, 7, :trn, 0.9960666666666667, :tst, 0.9919)
(:epoch, 8, :trn, 0.9963833333333333, :tst, 0.991)
(:epoch, 9, :trn, 0.9965333333333334, :tst, 0.9907)
(:epoch, 10, :trn, 0.9973333333333333, :tst, 0.9914)
 36.130899 seconds (13.88 M allocations: 4.422 GiB, 1.03% gc time)


In [15]:
@doc Rmsprop

```
Rmsprop(;lr=0.001, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Rmsprop)
```

Container for parameters of the Rmsprop optimization algorithm used by [`update!`](@ref).

Rmsprop scales the learning rates by dividing by the root mean square of the gradients. It updates the weights with the following formulas:

```
G = (1-rho) * g .^ 2 + rho * G
w = w - lr * g ./ sqrt(G + eps)
```

where `w` is the weight, `g` is the gradient of the objective function w.r.t. `w`, `lr` is the learning rate, and `G` is an array with the same size and type as `w` that holds an exponentially decaying average of the squared gradients. `eps` is a small constant to prevent a zero value in the denominator. `rho` is the decay rate of the running average.
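A minimal sketch of the Rmsprop formulas in plain Julia (`rmsprop_step!` is a hypothetical helper, not Knet's `update!`, and `G` is assumed to start at zero):

```julia
# One Rmsprop step: update the running average of squared gradients,
# then divide the gradient by its square root.
function rmsprop_step!(w, G, g; lr=0.001, rho=0.9, eps=1e-6)
    G .= (1 - rho) .* g .^ 2 .+ rho .* G
    w .-= lr .* g ./ sqrt.(G .+ eps)
    return w
end

w = [1.0]
G = zero(w)                  # assumed initial running average
rmsprop_step!(w, G, [2.0])
```

Unlike the Adagrad accumulator, `G` here forgets old gradients at rate `rho`, so the effective learning rate does not keep shrinking over the course of training.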

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.

Reference: [Tijmen Tieleman and Geoffrey Hinton (2012)](https://dirtysalt.github.io/images/nn-class-lec6.pdf). "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude."  COURSERA: Neural Networks for Machine Learning 4.2.


In [16]:
Optimizers.main("--optim Rmsprop --lr 0.001 --rho 0.9 --eps 1e-6");

optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.
opts=(:atype, "KnetArray{Float32}")(:gamma, 0.95)(:rho, 0.9)(:eps, 1.0e-6)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, "Rmsprop")(:lr, 0.001)(:seed, -1)
(:epoch, 0, :trn, 0.09513333333333333, :tst, 0.0929)
(:epoch, 1, :trn, 0.9816666666666667, :tst, 0.9804)
(:epoch, 2, :trn, 0.9891, :tst, 0.988)
(:epoch, 3, :trn, 0.9922, :tst, 0.9892)
(:epoch, 4, :trn, 0.9942666666666666, :tst, 0.9911)
(:epoch, 5, :trn, 0.9955, :tst, 0.9914)
(:epoch, 6, :trn, 0.9968833333333333, :tst, 0.9923)
(:epoch, 7, :trn, 0.9974333333333333, :tst, 0.9923)
(:epoch, 8, :trn, 0.9968666666666667, :tst, 0.9918)
(:epoch, 9, :trn, 0.9977166666666667, :tst, 0.9919)
(:epoch, 10, :trn, 0.99445, :tst, 0.9884)
 33.505476 seconds (12.59 M allocations: 4.384 GiB, 1.26% gc time)


In [17]:
@doc Adam

```
Adam(;lr=0.001, gclip=0, beta1=0.9, beta2=0.999, eps=1e-8)
update!(w,g,p::Adam)
```

Container for parameters of the Adam optimization algorithm used by [`update!`](@ref).

Adam is one of the methods that compute adaptive learning rates. It stores exponentially decaying averages of the gradients (first moment) and of the squared gradients (second moment), and rescales both moments as a function of the update count. Here are the update formulas:

```
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g .* g
mhat = m ./ (1 - beta1 ^ t)
vhat = v ./ (1 - beta2 ^ t)
w = w - lr * mhat ./ (sqrt(vhat) + eps)
```

where `w` is the weight, `g` is the gradient of the objective function w.r.t. `w`, `lr` is the learning rate, and `m` is an array with the same size and type as `w` that holds an exponentially decaying average of the gradients. `v` is an array with the same size and type as `w` that holds an exponentially decaying average of the squared gradients. `eps` is a small constant to prevent a zero denominator. `beta1` and `beta2` are the decay rates used to compute the bias-corrected first and second moments. `t` is the update count.
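The formulas can be sketched in plain Julia as follows. `AdamState` and `adam_step!` are hypothetical names introduced for this illustration (not Knet's `Adam` type or `update!`), and both moments are assumed to start at zero with `t = 0`.

```julia
# Hypothetical container for the Adam state: first moment, second moment
# and update count.
mutable struct AdamState
    m::Vector{Float64}
    v::Vector{Float64}
    t::Int
end

AdamState(w::Vector{Float64}) = AdamState(zero(w), zero(w), 0)

# One Adam step, mirroring the formulas above with explicit broadcasting.
function adam_step!(w, s::AdamState, g; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8)
    s.t += 1
    t = s.t
    s.m .= beta1 .* s.m .+ (1 - beta1) .* g
    s.v .= beta2 .* s.v .+ (1 - beta2) .* g .* g
    mhat = s.m ./ (1 - beta1^t)          # bias-corrected first moment
    vhat = s.v ./ (1 - beta2^t)          # bias-corrected second moment
    w .-= lr .* mhat ./ (sqrt.(vhat) .+ eps)
    return w
end

w = [1.0, 1.0]
state = AdamState(w)
adam_step!(w, state, [5.0, -0.01])
```

After bias correction the first step has `mhat` equal to `g` and `vhat` equal to `g .^ 2`, so each weight moves by roughly `lr` in the direction opposite to the sign of its gradient, regardless of the gradient's magnitude.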

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`.  If `gclip==0` no scaling takes place.

Reference: [Kingma, D. P., & Ba, J. L. (2015)](https://arxiv.org/abs/1412.6980). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.


In [18]:
Optimizers.main("--optim Adam --lr 0.001 --beta1 0.9 --beta2 0.95 --eps 1e-8");

optimizers.jl (c) Ozan Arkan Can and Deniz Yuret, 2016. Demonstration of different sgd based optimization methods using LeNet.
opts=(:atype, "KnetArray{Float32}")(:gamma, 0.95)(:rho, 0.9)(:eps, 1.0e-8)(:batchsize, 100)(:beta1, 0.9)(:iters, 6000)(:beta2, 0.95)(:epochs, 10)(:optim, "Adam")(:lr, 0.001)(:seed, -1)
(:epoch, 0, :trn, 0.13488333333333333, :tst, 0.1404)
(:epoch, 1, :trn, 0.9800166666666666, :tst, 0.9809)
(:epoch, 2, :trn, 0.9897333333333334, :tst, 0.9875)
(:epoch, 3, :trn, 0.99125, :tst, 0.9887)
(:epoch, 4, :trn, 0.9933166666666666, :tst, 0.9894)
(:epoch, 5, :trn, 0.99345, :tst, 0.9873)
(:epoch, 6, :trn, 0.9950333333333333, :tst, 0.9876)
(:epoch, 7, :trn, 0.9936833333333334, :tst, 0.9864)
(:epoch, 8, :trn, 0.9949333333333333, :tst, 0.9865)
(:epoch, 9, :trn, 0.9966666666666667, :tst, 0.9896)
(:epoch, 10, :trn, 0.9970333333333333, :tst, 0.9889)
 37.025818 seconds (13.83 M allocations: 4.422 GiB, 1.42% gc time)
