
BT enablement on fairseq - fairseq change #4480

Closed
wants to merge 8 commits

Conversation

frank-wei

Summary:
As titled; depends on D36057338.
This change forks the inference path inside the forward function: if a checkpoint has been loaded and we are running inference, the BetterTransformer (BT) fast path is used; otherwise the standard fairseq path takes over.

In summary:
Accuracy: there is a small accuracy loss due to fp16; the maximum diff is around 0.009. With fp32 there is no accuracy loss.
Perf: the current fairseq runs at a similar speed to the vanilla version. After the enablement, the speedup matches the standalone BT test.
With batch size = 64:
V100: 1.23x speedup
A100: 1.38x speedup

After enabling nested tensors:
V100: 2.46x speedup

Reviewed By: mikekgfb

Differential Revision: D37082681
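
For illustration, a minimal sketch of the kind of fork described above. The class, flag, and helper names (EncoderLayerSketch, cfg_checkpoint_loaded, _bt_fast_path, _fairseq_path) are hypothetical, and the fast-path branch is left as a placeholder rather than the real fused BT kernel:

import torch
import torch.nn as nn


class EncoderLayerSketch(nn.Module):
    """Illustrative only: forward() forks between a BT-style inference fast
    path and the regular fairseq-style path."""

    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        # Hypothetical flag: set once the module's weights come from a checkpoint.
        self.cfg_checkpoint_loaded = False

    def forward(self, x, key_padding_mask=None):
        can_use_fast_path = (
            not self.training                    # inference only
            and self.cfg_checkpoint_loaded       # weights loaded from a checkpoint
            and not torch.jit.is_scripting()     # fast path only in eager mode here
        )
        if can_use_fast_path:
            return self._bt_fast_path(x, key_padding_mask)
        return self._fairseq_path(x, key_padding_mask)

    def _bt_fast_path(self, x, key_padding_mask):
        # Placeholder: the real change dispatches to the fused BetterTransformer
        # kernel here; this sketch just reuses the regular computation.
        return self._fairseq_path(x, key_padding_mask)

    def _fairseq_path(self, x, key_padding_mask):
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        return self.norm(x + attn_out)


layer = EncoderLayerSketch().eval()
layer.cfg_checkpoint_loaded = True
out = layer(torch.randn(2, 5, 512))  # shape (2, 5, 512); takes the placeholder fast path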

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D37082681

Summary: Pull Request resolved: facebookresearch#4480 (identical to the pull request description above)
Reviewed By: mikekgfb
Differential Revision: D37082681
fbshipit-source-id: 507c22e7fb7ca0d86962adfb85d83fd8e0c6b71a

Summary: Pull Request resolved: facebookresearch#4480 (identical to the pull request description above)
Reviewed By: mikekgfb
Differential Revision: D37082681
fbshipit-source-id: e7adf3ed6f07ad1e72e0b86cf8f6cedfe2731305

attn_mask.to(torch.bool), -1e8 if x.dtype == torch.float32 else -1e4

if self.training:
    self.ever_training = True
Contributor

Why have a variable that is the same as self.training?

Author

That is for the scenario of loading a checkpoint and continuing training. We want to prevent that case from taking the fast path.
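
A small runnable sketch of the distinction (hypothetical class and attribute names): self.training only reflects the current mode, while ever_training is sticky, so a run that loads a checkpoint and continues training never drops into the fast path, even when temporarily switched to eval() for validation.

import torch.nn as nn


class GatedModule(nn.Module):
    """Illustrative only: a sticky ever_training flag gating a fast path."""

    def __init__(self):
        super().__init__()
        self.ever_training = False  # sticky, unlike self.training

    def forward(self, x):
        if self.training:
            self.ever_training = True
        # Fast path only for pure-inference runs that have never trained.
        use_fast_path = not self.training and not self.ever_training
        return x, use_fast_path


m = GatedModule().eval()
print(m(0))        # (0, True): inference-only run may use the fast path
m.train(); m(0)    # a fine-tuning step flips the sticky flag
m.eval()
print(m(0))        # (0, False): eval during fine-tuning stays on the regular path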

@@ -113,7 +140,9 @@ def _test_save_and_load(self, scripted_module):
JIT_MSG = "Targeting OSS scriptability for the 1.6 release"


@unittest.skipIf(torch.__version__ < "1.6.0", JIT_MSG)
@unittest.skipIf(
Contributor

You can add the necessary PyTorch version to the CI testing to make sure this test runs and passes.

Author

I can add a nightly version such as 1.13.0.dev20220613 as one more test.

Contributor

That sounds good to me
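
For reference, a hedged sketch of such a version gate (the threshold, message, and helper below are assumptions, not the code in this PR); naive string comparison of torch.__version__ can misorder versions, so packaging.version is used here:

import unittest

import torch
from packaging import version

BT_MSG = "BetterTransformer fast path requires a recent PyTorch nightly"


def torch_at_least(min_version: str) -> bool:
    # Drop any local build suffix such as "+cu113" before comparing.
    return version.parse(torch.__version__.split("+")[0]) >= version.parse(min_version)


@unittest.skipIf(not torch_at_least("1.13.0.dev20220613"), BT_MSG)
class TestBTFastPath(unittest.TestCase):
    def test_version_gate(self):
        # Only runs on PyTorch builds new enough for the BT fast path.
        self.assertTrue(torch_at_least("1.13.0.dev20220613"))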

@frank-wei
Author

frank-wei commented Jun 16, 2022

@dianaml0 The workflow creation could be separate from this PR, so I landed it first.
I am not allowed to create any workflow or YAML file; I hit this error:
new_flow -> new_flow (refusing to allow a Personal Access Token to create or update workflow .github/workflows/build_1_13.yml without workflow scope)

Could you help create a workflow with a new YAML file? It should be 99% the same as your current build.yml, but with line 31 changed to:

 run: pip3 install --pre torch==1.13.0.dev20220613 -f https://download.pytorch.org/whl/nightly/torch_nightly.html 

@facebook-github-bot
Contributor

This pull request has been reverted by 956fcf4.

lzzk pushed a commit to lzzk/fairseq that referenced this pull request Jul 24, 2022
Summary: Pull Request resolved: facebookresearch#4480 (identical to the pull request description above)
Reviewed By: mikekgfb
Differential Revision: D37082681
fbshipit-source-id: 984266f850fc30603e48be56e41ac2c67da080f5