
Inconsistent weight decay logic in multiple optimizers #9881

Closed
eric-haibin-lin opened this issue Feb 25, 2018 · 9 comments

@eric-haibin-lin
Member

eric-haibin-lin commented Feb 25, 2018

Issue

The default behavior of many optimizers is neither optimal nor consistent for optimization. The implementation proposed below should help convergence.

Gradient Clipping

In TensorFlow/PyTorch, weight decay is usually applied before gradient clipping. Leaving the weight decay term unclipped can produce an update from the wd regularization that is much larger than the one from the derivative of the loss. In the following optimizers the weight decay term is applied after gradient clipping, which should be corrected (see the sketch after the list):

  • SGD
  • Signum
  • LBSGD
  • DCASGD
  • NAG
  • SGLD
  • Adam
  • AdaDelta
  • AdaGrad
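
A minimal sketch of the two orderings for a plain SGD step (illustrative NumPy code with made-up helper names, not MXNet's actual kernels):

```python
import numpy as np

def clip(g, threshold):
    # Element-wise clipping to [-threshold, threshold], as clip_gradient does.
    return np.clip(g, -threshold, threshold)

def sgd_step_wd_before_clip(w, grad, lr, wd, threshold):
    # TF/PyTorch-style ordering: fold weight decay into the gradient first,
    # then clip, so the wd contribution is bounded along with the gradient.
    g = clip(grad + wd * w, threshold)
    return w - lr * g

def sgd_step_wd_after_clip(w, grad, lr, wd, threshold):
    # Current ordering in the optimizers listed above: wd is added after
    # clipping, so for large weights it can dwarf the clipped gradient.
    g = clip(grad, threshold) + wd * w
    return w - lr * g
```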

Weight Decay Not Used to Update Optimizer State

The following optimizers apply wd to the weight directly, so the weight decay term never enters the optimizer state. This can make training slow when a small learning rate is used, and cause divergence when a large learning rate is used (see the sketch after the list):

  • AdaDelta
  • AdaGrad
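
An AdaGrad-style sketch of the difference (illustrative NumPy code, not MXNet's implementation):

```python
import numpy as np

def adagrad_step_wd_on_weight(w, grad, history, lr, wd, eps=1e-7):
    # Current behaviour described above: the wd term bypasses the accumulated
    # state and is subtracted from the weight directly, unscaled.
    history = history + grad * grad
    w = w - lr * grad / (np.sqrt(history) + eps) - lr * wd * w
    return w, history

def adagrad_step_wd_in_grad(w, grad, history, lr, wd, eps=1e-7):
    # Alternative: fold wd into the gradient so it also updates the state
    # and gets rescaled like the rest of the gradient.
    g = grad + wd * w
    history = history + g * g
    w = w - lr * g / (np.sqrt(history) + eps)
    return w, history
```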

Other Optimizers

FTRL is a proximal optimizer: it does not fold weight decay into the gradient for clipping or for updating state, which is fine.
The following optimizers apply wd before clipping the gradient, which is also fine:

  • RMSProp
  • Adamax
  • Nadam
  • FTML
@piiswrong
Contributor

Could you clarify which ones multiply wd by lr and which ones don't?

@sxjscience
Member

Just curious, is clip_gradient used anywhere?

@eric-haibin-lin
Member Author

@sxjscience supposedly it's used in all optimizers.
AdaDelta doesn't multiply wd by lr.

@szhengac
Contributor

Unless explicitly specified otherwise, most optimizers as implemented in packages such as TF and Torch merge wd into the gradient before gradient clipping. When a proximal operator is used, the wd term is not merged.

@eric-haibin-lin
Member Author

@szhengac thanks for your input.

My concern is that if we change the wd behavior now, the change will be incompatible with previous versions of MXNet, and users will have to retune their hyperparameters. I don't think that provides a good user experience.

@szha @piiswrong @szhengac What about explicitly documenting the update rules for the existing optimizers, as in https://mxnet.incubator.apache.org/versions/master/api/python/optimization/optimization.html#mxnet.optimizer.SGD? For new optimizers, committers could check during code review whether the implementation matches the PyTorch/TF ones.

@sxjscience
Member

Looks good. We can write math formulas instead.

@eric-haibin-lin
Member Author

@sxjscience I could do that, but math formulas are not very convenient when clipping is involved.
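
For what it's worth, the current SGD behaviour described above (wd added after clipping, no momentum) could be written roughly as follows; this is just one possible notation, not the wording of the official docs:

$$
g_t = \operatorname{clip}\bigl(\nabla_w L(w_{t-1}),\, c\bigr) + \lambda\, w_{t-1},
\qquad
w_t = w_{t-1} - \eta\, g_t,
$$

where $\operatorname{clip}(x, c)$ truncates each element of $x$ to $[-c, c]$, $c$ is clip_gradient, $\eta$ is lr, and $\lambda$ is wd.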

@astonzhang
Member

astonzhang commented May 18, 2018

Thanks to Haibin for raising these issues.

Besides, weight decay should only apply to weights (not bias). [1][2] Thus, users usually do

trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': learning_rate, 'wd': weight_decay})

by assuming that weight decay only applies to weights. However, our current implementation applies weight decay to all model parameters including bias.
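
As a workaround today, wd can be disabled per parameter via wd_mult. A sketch, assuming the default Gluon naming where bias parameters end in "bias":

```python
from mxnet import gluon
from mxnet.gluon import nn

net = nn.Dense(10)
net.initialize()

# Zero out the per-parameter wd multiplier for biases so the trainer-level
# 'wd' only affects the weights.
net.collect_params('.*bias').setattr('wd_mult', 0.0)

trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'wd': 1e-4})
```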

References:

[1] Franklin, J. (2005). The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83-85.

[2] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning (Vol. 1). Cambridge: MIT press.

@safrooze
Contributor

There is also a paper arguing that weight decay should be applied independently of the optimizer's gradient-based update: https://arxiv.org/abs/1711.05101

This was recently implemented in TF: tensorflow/tensorflow#17438
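
For reference, a rough sketch of the decoupled (AdamW-style) update from that paper, in plain NumPy rather than MXNet/TF code:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr, wd,
               beta1=0.9, beta2=0.999, eps=1e-8):
    # The Adam state (m, v) only ever sees the raw gradient; the weight
    # decay is applied to the weight separately, outside the adaptive step.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```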
