This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Lamb optimizer update #16715

Merged
merged 3 commits on Nov 24, 2019

Conversation

access2rohit
Contributor

@access2rohit access2rohit commented Nov 4, 2019

Description

Adding new operators for the LAMB optimizer.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Code is well-documented:
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • lamb_update, tests, (and when applicable, API doc)
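
For reference, below is a minimal NumPy sketch of the update rule from the LAMB paper (You et al., 2019, "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes"); the function name and the exact handling of the trust ratio are illustrative and may differ from the lamb_update operator's actual signature.

import numpy as np

def lamb_step(weight, grad, mean, var, t, lr=1e-3, beta1=0.9, beta2=0.999,
              epsilon=1e-6, wd=0.01, bias_correction=True):
    # Adam-style first and second moment estimates.
    mean[:] = beta1 * mean + (1.0 - beta1) * grad
    var[:] = beta2 * var + (1.0 - beta2) * grad * grad
    m_hat, v_hat = mean, var
    if bias_correction:
        m_hat = mean / (1.0 - beta1 ** t)
        v_hat = var / (1.0 - beta2 ** t)
    # Update direction with decoupled weight decay.
    r = m_hat / (np.sqrt(v_hat) + epsilon) + wd * weight
    # Layer-wise trust ratio ||w|| / ||r||, falling back to 1 when either norm is 0.
    w_norm, r_norm = np.linalg.norm(weight), np.linalg.norm(r)
    trust = w_norm / r_norm if w_norm > 0 and r_norm > 0 else 1.0
    weight[:] -= lr * trust * r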

Testing

[DEBUG] 1000 of 1000: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2052238159 to reproduce.
ok

----------------------------------------------------------------------
Ran 1 test in 5085.147s

OK

@access2rohit access2rohit changed the title Lamb optimizer update [WIP]Lamb optimizer update Nov 4, 2019
@access2rohit access2rohit changed the title [WIP]Lamb optimizer update Lamb optimizer update Nov 13, 2019
@access2rohit
Contributor Author

access2rohit commented Nov 13, 2019

@mxnet-label-bot add [pr-awaiting-review]

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Nov 13, 2019

@register
class LAMB(Optimizer):
"""LAMB Optimizer.
Member


pls add doc

Contributor Author


working on it now

Contributor


The name is clashing with the Gluon one, can we give it a different name?

Additional review comment threads (resolved) on: python/mxnet/optimizer/optimizer.py, src/operator/optimizer_op.cc, src/operator/optimizer_op-inl.h, tests/python/unittest/test_optimizer.py
@larroy
Contributor

larroy commented Nov 14, 2019

Please add description and reference to paper in the PR.

@larroy
Contributor

larroy commented Nov 14, 2019

I see a crash in the way gluon trainer is calling the optimizer...

+ python3 run_pretraining.py '--data=/home/piotr/mxnet-data/bert-pretraining/datasets/book-corpus/book-corpus-large-split/*.train,/home/piotr/mxnet-data/bert-pretraining/datasets/enwiki/enwiki-feb-doc-split/*.train' '--data_eval=/home/piotr/mxnet-data/bert-pretraining/datasets/book-corpus/book-corpus-large-split/*.test,/home/piotr/mxnet-data/bert-pretraining/datasets/enwiki/enwiki-feb-doc-split/*.test' --optimizer lamb3 --warmup_ratio 0.2 --num_steps 200 --ckpt_interval 300000000 --dtype float16 --ckpt_dir ./test-ckpt --lr 0.0001 --total_batch_size 32 --total_batch_size_eval 32 --accumulate 1 --model bert_24_1024_16 --max_seq_length 128 --max_predictions_per_seq 20 --num_data_workers 1 --eval_interval 100000000 --verbose --no_compute_acc --raw --comm_backend horovod --log_interval 10 --verbose --synthetic_data --raw --eval_use_npz
[22:42:38] ../src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
Traceback (most recent call last):
  File "run_pretraining.py", line 574, in <module>
    train(data_train, data_eval, model)
  File "run_pretraining.py", line 457, in train
    num_ctxs=len(ctxs) * num_workers)
  File "/home/piotr/gluon-nlp/scripts/bert/fp16_utils.py", line 433, in step
    self.fp32_trainer.update(step_size)
  File "/home/piotr/mxnet_lamb/python/mxnet/gluon/trainer.py", line 397, in update
    self._update(ignore_stale_grad)
  File "/home/piotr/mxnet_lamb/python/mxnet/gluon/trainer.py", line 434, in _update
    updater(i, w, g)
  File "/home/piotr/mxnet_lamb/python/mxnet/optimizer/optimizer.py", line 1777, in __call__
    self.optimizer.update_multi_precision(i, w, g, self.states[i])
  File "/home/piotr/mxnet_lamb/python/mxnet/optimizer/optimizer.py", line 291, in update_multi_precision
    self.update(index, weight_master_copy, grad32, original_state)
  File "/home/piotr/mxnet_lamb/python/mxnet/optimizer/optimizer.py", line 1012, in update
    g = lamb_update(weight, grad, mean, var, wd=wd, **kwargs)
  File "<string>", line 88, in lamb_update
  File "/home/piotr/mxnet_lamb/python/mxnet/_ctypes/ndarray.py", line 107, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/home/piotr/mxnet_lamb/python/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Some trailing characters could not be parsed: '[0.0015625]
<NDArray 1 @gpu(0)>', in operator lamb_update(name="", rescale_grad="
[0.0015625]
<NDArray 1 @gpu(0)>", wd="0.01", bias_correction="True", t="1", epsilon="1e-06", beta2="0.999", beta1="0.9")

@access2rohit
Contributor Author

Please add description and reference to paper in the PR.

Done

@larroy
Contributor

larroy commented Nov 20, 2019

@access2rohit could you answer Sam's comments?

@access2rohit
Contributor Author

@access2rohit could you answer Sam's comments?

Done

@eric-haibin-lin eric-haibin-lin merged commit 85d3ef3 into apache:master Nov 24, 2019
float beta1;
float beta2;
float epsilon;
float t;
Member


@eric-haibin-lin @access2rohit I found this issue when reading the code. Here, t should be the number of updates and should not be stored as a float, which will lose precision. I think we need to store it as index_t.
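
One way to see the concern (assuming the counter is held in float32): float32 has a 24-bit significand, so consecutive integers above 2**24 become indistinguishable, and an incremented step counter would eventually stop advancing.

import numpy as np

# 2**24 and 2**24 + 1 collapse to the same float32 value,
# while an integer type keeps them distinct.
assert np.float32(2**24) == np.float32(2**24 + 1)
assert np.int64(2**24) != np.int64(2**24 + 1)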

Contributor Author


We are using float here for an integer data type. @sxjscience can you explain how we will lose precision for the operation beta^t?


if (bias_correction) {
DType mean_hat = mean_data[i] / (1. - power::Map(beta1, t));
DType var_hat = var_data[i] / (1 - power::Map(beta2, t));
Member


Actually, in apex, it uses float32 to calculate the power and then switches to float16:
https://github.com/NVIDIA/apex/blob/325f5a0bec542701edba1628ad34f3b2ea47c556/csrc/multi_tensor_lamb.cu#L231-L249
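
A rough sketch of that approach, with illustrative names: compute the bias-correction factors in float32 and only cast the corrected moments back to the working dtype (e.g. float16).

import numpy as np

def bias_correct(mean, var, beta1, beta2, t, out_dtype=np.float16):
    # Powers and divisions done in float32, then cast down.
    corr1 = np.float32(1.0 - beta1 ** t)
    corr2 = np.float32(1.0 - beta2 ** t)
    m_hat = (mean.astype(np.float32) / corr1).astype(out_dtype)
    v_hat = (var.astype(np.float32) / corr2).astype(out_dtype)
    return m_hat, v_hat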

@leezu leezu removed the pr-awaiting-review PR is waiting for code review label Nov 27, 2019
ptrendx pushed a commit to ptrendx/mxnet that referenced this pull request Dec 10, 2019
* initial commit lamb optimizer

* fixing base lamb optimizer

* adding API doc for Lamb Phase 1 and 2
eric-haibin-lin pushed a commit that referenced this pull request Dec 10, 2019
* initial commit lamb optimizer

* fixing base lamb optimizer

* adding API doc for Lamb Phase 1 and 2
eric-haibin-lin pushed a commit to eric-haibin-lin/mxnet that referenced this pull request Dec 14, 2019
* initial commit lamb optimizer

* fixing base lamb optimizer

* adding API doc for Lamb Phase 1 and 2