API changes of bps.mxnet.trainer after https://github.com/bytedance/byteps/pull/225 #292
@ymjiang Hi, did you test the accuracy of the BERT model trained by https://github.com/byteps/examples/blob/master/mxnet/bert-large? In this script the NSP loss is normalized (by batch_size) on each worker, but the MLM loss is not normalized (by num_masks); see https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/pretraining_utils.py#L333-L343. The MLM loss will then likely overwhelm the NSP loss (see https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/run_pretraining.py#L177), yet the NSP loss plays an important role in training BERT; see section 5.1 of https://arxiv.org/pdf/1810.04805.pdf. cc @eric-haibin-lin
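For illustration, a minimal sketch of the normalization being suggested; the names (nsp_loss, mlm_loss, batch_size, num_masks) are placeholders, not the exact identifiers in pretraining_utils.py:

```python
import mxnet as mx

def combined_pretraining_loss(nsp_loss, mlm_loss, batch_size, num_masks):
    """Normalize each loss by its own element count before summing.

    Without the division by num_masks, the MLM term is a sum over all
    masked tokens and dwarfs the per-sample NSP term.
    """
    nsp = nsp_loss.sum() / batch_size   # per-sentence-pair NSP loss
    mlm = mlm_loss.sum() / num_masks    # per-masked-token MLM loss
    return nsp + mlm

# Example with dummy values:
nsp_loss = mx.nd.array([0.7, 0.3])            # one entry per sentence pair
mlm_loss = mx.nd.array([2.1, 1.9, 2.0, 2.2])  # one entry per masked token
print(combined_pretraining_loss(nsp_loss, mlm_loss, batch_size=2, num_masks=4))
```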
Seems like there are two questions. The first is about API changes after the gradient compression PR. @vycezhong Can you please take a look? The second is about the MLM loss. The example repo uses an old version of gluon-nlp (v0.8), so it is possible that some metrics are not handled properly. If this is fixed in the latest gluon-nlp, would you be interested in creating a PR for it? Thank you.
@ZiyueHuang Yes, it will ignore rescale_grad.
Why not multiply rescale_grad? Why ignore it? The resulting behavior will be different from gluon.Trainer.
Let me summarize the API behavior changes before and after this PR; feel free to correct me if I make any mistakes :)
Are all these changes intended, @eric-haibin-lin @szhengac? If so, why not describe them in the PR description? It seems confusing to me. Or am I missing something?
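A hedged sketch of the before/after semantics as described elsewhere in this thread (based only on the statements here, not on the PR diff; the names are illustrative stand-ins, not byteps internals):

```python
def effective_grad_before(worker_grads, rescale_grad):
    # Old behavior (matching gluon.Trainer): allreduce_grads returns the SUM
    # across workers, and optimizer.rescale_grad is later MULTIPLIED onto it
    # during step().
    return sum(worker_grads) * rescale_grad

def effective_grad_after(worker_grads, num_workers, scale):
    # Reported new behavior: allreduce_grads returns the AVERAGE, and the
    # trainer DIVIDES by its own self._scale, ignoring the rescale_grad that
    # was set on the optimizer in the constructor.
    return sum(worker_grads) / num_workers / scale
```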
I remember previously it also computed the average: https://github.com/bytedance/byteps/blob/b8948f0927/byteps/mxnet/__init__.py#L201. And I agree with your suggestion of multiplying rescale_grad. But calling allreduce_grads...
https://github.com/bytedance/byteps/blob/b8948f0927/byteps/mxnet/__init__.py#L201 only takes effect in step(), not when allreduce_grads and update are called separately.
I never thought of this use case... Let me see how to fix it.
This use case is documented in the docs of gluon.Trainer, and it is also used by byteps/example/gluon-nlp-BERT. I think the most appropriate way is to multiply rescale_grad.
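A minimal sketch of what the multiplication could look like, assuming a hypothetical post-allreduce helper inside the trainer (param.list_grad() is the standard Gluon Parameter API; the other names are illustrative):

```python
def rescale_and_average(params, rescale_grad, num_workers):
    """Hypothetical scaling step: multiply rescale_grad onto the gradients
    (as gluon.Trainer does) and fold the 1/num_workers averaging into the
    same factor, rather than dividing by a scale that ignores rescale_grad."""
    factor = rescale_grad / num_workers
    for param in params:
        for grad in param.list_grad():
            grad *= factor  # in-place multiply on each device's gradient
```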
Another reason to compute the average instead of the sum is that it prevents overflow when training with FP16 and a large number of workers.
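A small NumPy illustration of the overflow argument, with made-up magnitudes (float16 overflows above roughly 65504):

```python
import numpy as np

num_workers = 64
per_worker_grad = np.float16(2048.0)

# Summing raw float16 gradients across workers overflows partway through.
total = np.float16(0.0)
for _ in range(num_workers):
    total += per_worker_grad          # partial sums reach 65536 > 65504
print(total)                          # inf

# Pre-dividing each contribution (i.e. computing the average) stays in range.
avg = np.float16(0.0)
for _ in range(num_workers):
    avg += per_worker_grad / np.float16(num_workers)  # adds 32 each step
print(avg)                            # 2048.0
```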
After this PR, trainer.allreduce_grads will compute the average instead of the sum. Also, I think this line (https://github.com/bytedance/byteps/blob/master/byteps/mxnet/__init__.py#L322) is wrong, as it will ignore the self._scale = self._optimizer.rescale_grad setting from the constructor. Moreover, self._scale is designed to be multiplied onto the gradients (in gluon.Trainer and the previous version of bps.mxnet.trainer), not divided as in this line (https://github.com/bytedance/byteps/blob/master/byteps/mxnet/__init__.py#L337). The BERT example here (https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/run_pretraining.py#L384) is then wrong when using the current byteps master, as the gradients will be divided twice by num_workers. Is there any particular reason for these changes solely for gradient compression? I didn't read this PR in detail and may have missed something. @eric-haibin-lin @szhengac