
API changes of bps.mxnet.trainer after https://github.com/bytedance/byteps/pull/225 #292

ZiyueHuang opened this issue Sep 1, 2020 · 10 comments

@ZiyueHuang
Contributor

After this PR, trainer.allreduce_grads will compute the average instead of the sum. Also, I think this line (https://github.com/bytedance/byteps/blob/master/byteps/mxnet/__init__.py#L322) is wrong, as it ignores the self._scale = self._optimizer.rescale_grad setting from the constructor. Moreover, self._scale is designed to be multiplied onto the gradients (as in gluon.trainer and the previous version of bps.mxnet.trainer), not used as a divisor as in this line (https://github.com/bytedance/byteps/blob/master/byteps/mxnet/__init__.py#L337).
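
For illustration, a minimal numeric sketch in plain Python (not the BytePS source) of the two conventions:

# The original semantics multiply the gradient by the scale;
# the current master line referenced above divides by self._scale instead.
g = 8.0                   # raw gradient value on one worker
rescale_grad = 0.5        # user-supplied via optimizer_params
print(g * rescale_grad)   # 4.0  (multiply, as in gluon.trainer and the previous bps.mxnet.trainer)
print(g / rescale_grad)   # 16.0 (divide, as described above)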

Then the BERT example here (https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/run_pretraining.py#L384) is wrong when using the current byteps master, as the gradients will be divided by num_workers twice.
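
To make the double division concrete, an illustrative sketch (assumed numbers; it supposes the pretraining script already applies the 1/num_workers scaling once, as implied above):

num_workers = 8
g = 1.0
once = g / num_workers        # scaling already applied by the pretraining script (assumption)
twice = once / num_workers    # averaged again by the new allreduce_grads
print(once, twice)            # 0.125 0.015625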

Is there any particular reason to make these changes solely for gradient compression? I didn't read this PR in detail and may have missed something. @eric-haibin-lin @szhengac

@ZiyueHuang
Contributor Author

ZiyueHuang commented Sep 1, 2020

@ymjiang Hi, did you test the accuracy of the BERT model trained by https://github.com/byteps/examples/blob/master/mxnet/bert-large ? It seems that in this script the NSP loss is normalized (by batch_size) on each worker, but the MLM loss is not normalized (by num_masks), see https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/pretraining_utils.py#L333-L343. Then the MLM loss will likely overwhelm the NSP loss (see https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/run_pretraining.py#L177), yet the NSP loss plays an important role in training BERT, see section 5.1 in https://arxiv.org/pdf/1810.04805.pdf. cc @eric-haibin-lin
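
A rough numeric sketch of the imbalance (all values below are made up for illustration, not taken from the repo):

batch_size = 32
nsp_loss_sum = 22.0       # hypothetical summed NSP cross-entropy over the batch
mlm_loss_sum = 1800.0     # hypothetical summed MLM cross-entropy over all masked positions
nsp_loss = nsp_loss_sum / batch_size   # ~0.69, normalized as in the script
mlm_loss = mlm_loss_sum                # not divided by num_masks
print(nsp_loss + mlm_loss)             # the NSP term is negligible in the combined loss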

@ymjiang
Member

ymjiang commented Sep 1, 2020

Seems like there are two questions.

The first one is about API changes after the gradient compression PR. @vycezhong Can you please take a look?

The second question is about MLM loss. The example repo is using an old version of gluon-nlp (v0.8) so it is possible that some metrics are not handled properly. If it is fixed in the latest gluon-nlp, are you interested in creating a PR for this? Thank you.

@jasperzhong
Contributor

@ZiyueHuang Yes, it will ignore self._scale = self._optimizer.rescale_grad. We did it on purpose. This change makes no difference for full-precision training, but for compression, gradients need to be averaged before being compressed.

@ZiyueHuang
Contributor Author

ZiyueHuang commented Sep 1, 2020

Why not multiply the gradients by num_workers at the end of _allreduce_grads, so that this API computes the sum? That would be consistent with the previous API (gluon.trainer and hvd.trainer also compute the sum in allreduce_grads).

Why ignore self._optimizer.rescale_grad? It is a parameter given by the user, e.g., trainer = bps.Trainer(optimizer_params={'rescale_grad': 2}). Why not self._scale *= batch_size instead of self._scale = batch_size? In the current implementation, this also leads to different behavior between trainer.step and trainer.update. For example,

trainer = bps.Trainer(optimizer_params={'rescale_grad': 2})
trainer.step(1)

will be different from

trainer = bps.Trainer(optimizer_params={'rescale_grad': 2})
trainer.allreduce_grads()
trainer.update(1)

@ZiyueHuang
Contributor Author

Let me summarize the API behavior changes before and after this PR; feel free to correct me if I make any mistakes :)

  • allreduce_grads computes the average instead of the sum. Note that this will break some packages that depend on BytePS, e.g., the gluon-nlp BERT script in your recently released example. @ymjiang
  • If the user calls step, the user-supplied rescale_grad now has no effect; this is not the case if the user calls allreduce_grads then update. In the previous version (and in gluon.trainer and hvd.trainer), the user-supplied rescale_grad always takes effect.
  • If the user calls allreduce_grads then update, the gradients are now divided by the user-supplied rescale_grad, instead of multiplied by it as in the previous version (and in gluon.trainer and hvd.trainer). See the numeric sketch right below.
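
A numeric sketch of these three points (illustrative values; the per-path scaling follows the claims in this thread and is not re-verified against the source):

num_workers = 4
rescale_grad = 2.0
local_grad = 8.0                            # assume every worker holds the same gradient
global_sum = local_grad * num_workers       # 32.0
old_allreduce = global_sum                  # previous allreduce_grads: sum
new_allreduce = global_sum / num_workers    # current allreduce_grads: average
old_update = old_allreduce * rescale_grad   # 64.0, rescale_grad multiplied (previous behavior)
new_update = new_allreduce / rescale_grad   # 4.0,  rescale_grad divided (current behavior)
print(old_update, new_update)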

Are all these changes intended, @eric-haibin-lin @szhengac? If so, why not describe them in the PR description? It seems confusing to me. Or am I missing something?

@jasperzhong
Contributor

jasperzhong commented Sep 1, 2020

@ZiyueHuang

I remember it also computed the average previously. https://github.com/bytedance/byteps/blob/b8948f0927/byteps/mxnet/__init__.py#L201

And I agree with your suggestion of using self._scale *= batch_size. It looks right.

But is calling allreduce_grads then update a supported MXNet use case? I never thought about it.

@ZiyueHuang
Contributor Author

That line (https://github.com/bytedance/byteps/blob/b8948f0927/byteps/mxnet/__init__.py#L201) only takes effect in step or update; bps.trainer.allreduce_grads will compute the sum.

allreduce_grads followed by update is allowed by the MXNet API and is heavily used in gluon-nlp.
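
For reference, a minimal single-process sketch of that pattern with plain gluon.Trainer (no BytePS involved; the model, shapes, and hyperparameters are arbitrary):

import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(1)
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'rescale_grad': 1.0})

with autograd.record():
    loss = net(mx.nd.ones((2, 3))).sum()
loss.backward()

trainer.allreduce_grads()   # reduce gradients first (e.g. after several accumulation steps)
trainer.update(1)           # then apply the optimizer update separately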

@jasperzhong
Contributor

I never thought of this use case. Let me see how to fix it.

@ZiyueHuang
Contributor Author

ZiyueHuang commented Sep 1, 2020

This use case is documented in the gluon.Trainer documentation, and it is also used by the gluon-nlp BERT example in byteps/examples.

I think the most appropriate fix is to multiply the gradients by num_workers at the end of _allreduce_grads, so that this API computes the sum. I discussed this with @eric-haibin-lin yesterday; he currently thinks that having allreduce_grads compute the average is more semantically appropriate and that breaking the API in a 0.x version is acceptable. I hold the opposite opinion, but we both think this major API change should be discussed in the community.

@szhengac

szhengac commented Sep 1, 2020

Another reason to compute the average instead of the sum is that it prevents overflow when training with FP16 and a large number of workers.
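
A quick numeric illustration of the overflow concern (assumed values):

import numpy as np

num_workers = 64
g = 1200.0                                            # assume this FP16 gradient value on every worker
summed = np.float16(g * num_workers)                  # 76800 > 65504 (FP16 max) -> inf
averaged = np.float16(g * num_workers / num_workers)  # 1200.0, stays within FP16 range
print(summed, averaged)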
