BT enablement on fairseq - fairseq change #4480
Conversation
This pull request was exported from Phabricator. Differential Revision: D37082681
Force-pushed 6569bf9 to 16cffd9
Force-pushed 16cffd9 to 3e13d39
Force-pushed 89b287b to 354e459
Force-pushed c44190f to 6b27c28
# Fill masked positions with a large negative value: -1e8 in fp32,
# -1e4 in fp16, so the constant stays representable in half precision.
attn_mask.to(torch.bool), -1e8 if x.dtype == torch.float32 else -1e4

if self.training:
    self.ever_training = True
Why have a variable that is the same as self.training?
That is for the scenario of loading a checkpoint and then continuing training. We'd like to prevent that case from taking the fast path.
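A minimal sketch of that gating, assuming a sticky flag on the module (the class shape and the `_bt_fast_path` helper are illustrative, not the actual fairseq code):

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Toy layer showing how a sticky ever_training flag can gate a fast path."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.ever_training = False  # stays True once the module has ever trained

    def _bt_fast_path(self, x, attn_mask):
        # Stand-in for the BetterTransformer kernel call.
        out, _ = self.self_attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        return out

    def forward(self, x, attn_mask=None):
        if self.training:
            # Resuming training from a checkpoint flips this and keeps it
            # flipped, so later eval() calls fall back to the regular path.
            self.ever_training = True
        if not self.training and not self.ever_training:
            return self._bt_fast_path(x, attn_mask)
        out, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        return out
```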
@@ -113,7 +140,9 @@ def _test_save_and_load(self, scripted_module):
JIT_MSG = "Targeting OSS scriptability for the 1.6 release"

@unittest.skipIf(torch.__version__ < "1.6.0", JIT_MSG)
@unittest.skipIf(
You can add the necessary PyTorch version to the CI testing to make sure this test works and passes.
I can add a nightly version like 1.13.0.dev20220613 as one more test.
That sounds good to me
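A sketch of how the extra version gate might stack on the existing one (the second version floor and its message are assumptions; the actual added condition is truncated in the diff above):

```python
import unittest
import torch

JIT_MSG = "Targeting OSS scriptability for the 1.6 release"
# Hypothetical floor for the BT fast path; per the thread, CI would instead
# pin a nightly such as 1.13.0.dev20220613.
BT_MSG = "BetterTransformer fast path needs a recent PyTorch"

@unittest.skipIf(torch.__version__ < "1.6.0", JIT_MSG)
@unittest.skipIf(torch.__version__ < "1.12.0", BT_MSG)
class TestSaveAndLoad(unittest.TestCase):
    def test_roundtrip(self):
        # Placeholder body; the real test scripts a module, saves, and reloads it.
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()
```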
@dianaml0 The flow creation could be separate from this PR, so I landed it first. Could you help create a flow with a new YAML file? It should be 99% the same as your current one.
This pull request has been reverted by 956fcf4.
Summary:
As titled; depends on D36057338.
Fork the inference path inside the forward function: if a checkpoint has been loaded and we are running inference, BT (BetterTransformer) is deployed; otherwise the regular fairseq path runs.
In summary:
Accuracy: there is a small accuracy loss due to fp16; the maximum diff is around 0.009. If we set it to fp32, there is no accuracy loss.
Perf: the current fairseq runs at a similar speed to the vanilla version. After the enablement, the speedup is similar to the standalone BT test.
With batch size = 64:
For V100, the speedup reaches 1.23x.
For A100, the speedup reaches 1.38x.
After enabling nested tensors, for V100 the speedup reaches 2.46x.
Reviewed By: mikekgfb
Differential Revision: D37082681
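A rough illustration of the accuracy claim (the PR's real check compares the BT fast path against the vanilla fairseq path on a loaded checkpoint; this toy uses a stock encoder layer and only contrasts an fp32 run with an fp16 run):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-in for a transformer encoder layer in eval (inference) mode.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8).eval().to(device)
x = torch.randn(32, 64, 512, device=device)  # (seq_len, batch=64, embed_dim)

with torch.no_grad():
    ref = layer(x)  # fp32 reference
    if device == "cuda":  # half-precision kernels are a GPU feature here
        half = layer.half()(x.half()).float()  # fp16 run, upcast to compare
        # The summary reports a maximum elementwise diff of roughly 0.009
        # under fp16 and an exact match under fp32; a toy layer's drift
        # will differ in detail.
        print("max abs diff:", (ref - half).abs().max().item())
```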