Add AdamW optimizer #4050

Merged
merged 7 commits into chainer:master on Dec 21, 2017

Conversation

@tkerola (Contributor) commented Dec 6, 2017

This PR implements AdamW, which was proposed in the following paper: https://openreview.net/forum?id=rk6qdGgCZ

As shown in the paper, the way weight decay is currently implemented in Chainer does not work properly with Adam. AdamW is a modified version of Adam that decouples weight decay from the gradient-based update, and it was shown to improve results.

While the original paper calls the algorithm AdamW, I call it AdamWeightDecay in this implementation, since I thought that it better spells out the purpose of the algorithm. Please let me know if you think the name should be changed to AdamW.

Note that this modification of Adam is theoretically applicable to AMSGrad as well (#4032).
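
For readers skimming the thread, here is a minimal NumPy sketch (not the code in this PR) contrasting L2-regularized Adam with the decoupled weight decay that AdamW proposes; the names are simplified and the paper's schedule multiplier eta_t is omitted:

```python
import numpy as np

def adam_step(w, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam update; decoupled=True gives the AdamW-style weight decay."""
    if not decoupled:
        # Classic approach: an L2 term is folded into the gradient, so it is
        # also rescaled by the adaptive learning rate below.
        g = g + weight_decay * w
    m = beta1 * m + (1 - beta1) * g          # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: decay the weights directly, outside the adaptive scaling
        # (the paper additionally multiplies this term by a schedule eta_t).
        w = w - weight_decay * w
    return w, m, v
```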

@hvy (Member) commented Dec 6, 2017

Thank you for this PR! It looks very interesting. I have only skimmed the paper, but do you think it would be possible to reproduce the experiments in the paper for CIFAR-10 (i.e., make plots)?

@hvy added the cat:feature label (Implementation that introduces new interfaces) on Dec 6, 2017
@tkerola (Contributor, Author) commented Dec 6, 2017

Sure, I will try and post the results here!

@tkerola (Contributor, Author) commented Dec 7, 2017

I modified the Chainer CIFAR10 example to compare SGDM, Adam and AdamW with VGG16 (so the experiment is different from the paper).
https://gist.github.com/tkerola/5f643a20c4dc3f2a2b9831bebff3af61

Loss
[plot: cifar10_sgdm_adam_adamw_loss]

Accuracy
[plot: cifar10_sgdm_adam_adamw_accuracy]

As the paper suggests, AdamW seems to beat Adam in terms of accuracy in the latter half of training (although its validation loss is higher, which is strange) and becomes competitive with SGDM.
I think SGDM beats the adaptive methods because its hyperparameters are probably fine-tuned for this example, but as the paper argues, AdamW ends up closer to SGDM while using only default hyperparameters.
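
For reference, the optimizer setup in the comparison script is roughly like the sketch below (simplified from the gist above, not the exact code; the AdamW branch assumes the weight_decay_rate hyperparameter that Adam gains later in this PR):

```python
import chainer
from chainer import optimizers

def make_optimizer(name, model, decay=5e-4):
    """Builds one of the compared optimizers for the CIFAR-10 VGG16 model."""
    if name == 'sgdm':
        opt = optimizers.MomentumSGD(lr=0.1, momentum=0.9)
        opt.setup(model)
        opt.add_hook(chainer.optimizer.WeightDecay(decay))  # coupled L2 decay
    elif name == 'adam':
        opt = optimizers.Adam()
        opt.setup(model)
        opt.add_hook(chainer.optimizer.WeightDecay(decay))  # coupled L2 decay
    elif name == 'adamw':
        opt = optimizers.Adam(weight_decay_rate=decay)      # decoupled decay
        opt.setup(model)
    return opt
```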

@niboshi (Member) commented Dec 7, 2017

Thank you for the PR!
We discussed the implementation internally and concluded that it would be simpler to add additional hyperparameters (e.g. eta and weight_decay) to Adam.
Do you think it is OK to implement it that way?

@tkerola (Contributor, Author) commented Dec 7, 2017

Sure, that is a good idea and much more maintainable! I will change the implementation accordingly.

@tkerola (Contributor, Author) commented Dec 7, 2017

I merged AdamW into Adam. I set _default_hyperparam.weight_decay_rate = 0 to keep it backwards-compatible.
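
For clarity, usage of the merged interface looks roughly like this (weight_decay_rate defaults to 0, so existing code behaves exactly as before):

```python
from chainer import optimizers

plain_adam = optimizers.Adam()                        # weight_decay_rate=0: original Adam
adamw = optimizers.Adam(weight_decay_rate=5e-4)       # decoupled weight decay, i.e. AdamW
```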

@niboshi (Member) commented Dec 8, 2017

Thank you for the fix!
LGTM about the implementation.

As for the documentation: users unfamiliar with AdamW might think Chainer's implementation of Adam differs from the original Adam. I think it's better to explicitly state that this is an additional feature, mentioning eta and weight_decay_rate.

We should also mention the name AdamW, so that users who know about it will instantly understand what it means.

@tkerola (Contributor, Author) commented Dec 8, 2017

I updated the documentation. Did you have something like this in mind?

@kashif (Contributor) commented Dec 12, 2017

@tkerola can you kindly also test my branch to make the nice graphs? I am not on my Linux box for a while and I cannot get it running on my Mac's GPU for some reason...

@tkerola (Contributor, Author) commented Dec 13, 2017

Sure, I will test it with the same training script.

@tkerola (Contributor, Author) commented Dec 13, 2017

As requested by @kashif, I redid the experiment above with AMSGrad added, so all 4 methods are compared.

Loss:
[plot: compare4_loss]

Accuracy:
[plot: compare4_accuracy]

@kashif (Contributor) commented Dec 15, 2017

@tkerola I also wanted to ask: would it make sense to fix the weight decay in Chainer using this method, rather than implementing new optimizers? I am thinking along similar lines for the AMSGrad PR... what do you think?

@tkerola (Contributor, Author) commented Dec 17, 2017

Hmm, do you have any idea how to implement that efficiently? WeightDecay would need to be modified to update param.data after param.update() is called (instead of before, as is done now), using a new variable such as param.prev_data instead of param.grad. This would not be backwards compatible, and it would also require twice the memory, since we would need to store the previous weights for all layers.
https://github.com/chainer/chainer/blob/master/chainer/optimizer.py#L693
Maybe adding a "post-param-update" type of optimizer extension, so that call_hooks() is called after param.update(), would allow an implementation that only requires a constant amount of extra memory: https://github.com/chainer/chainer/blob/master/chainer/optimizer.py#L594
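
Roughly, the idea would be something like the sketch below; note that the hook class and the post-update call site are hypothetical, not an existing Chainer API:

```python
class DecoupledWeightDecay(object):
    """Hypothetical hook intended to run *after* param.update()."""
    name = 'DecoupledWeightDecay'
    call_for_each_param = True

    def __init__(self, rate):
        self.rate = rate

    def __call__(self, rule, param):
        # Decays the already-updated weights directly, so no copy of the
        # previous parameters is needed (constant extra memory).
        if param.data is not None:
            param.data -= self.rate * param.data
```

The remaining change would be that call_hooks() for this kind of hook is invoked right after param.update() rather than before it, as discussed above.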

@niboshi (Member) commented Dec 21, 2017

jenkins, test this please

@niboshi (Member) commented Dec 21, 2017

jenkins, test this please

@niboshi (Member) commented Dec 21, 2017

LGTM!

@niboshi niboshi merged commit 0a8059e into chainer:master Dec 21, 2017
@niboshi niboshi added this to the v4.0.0b3 milestone Dec 21, 2017