
[feat][minor] 2/3 Make it explicit whether an attention mechanism supports a mask #266

Merged 1 commit into conda_ci on Apr 21, 2022

Conversation

blefaudeux (Contributor) commented Apr 9, 2022

What does this PR do?

Preamble to the Triton 2 PR (Triton 2 will change the blocksparse attention and no longer support attention masks).

The unit test that now fails on CI should be unrelated to this PR; it is fixed by the next PR in line (Triton 2).

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Apr 9, 2022
@blefaudeux marked this pull request as draft April 9, 2022 17:15
@@ -150,7 +150,7 @@ run_unittests: &run_unittests
   - run:
       name: Run Unit Tests
       command: |
-        pytest --junitxml=test-results/junit.xml --verbose --timeout 600 tests
+        CUDA_LAUNCH_BLOCKING=1 pytest --junitxml=test-results/junit.xml --verbose --timeout 600 tests
blefaudeux (Contributor, Author) commented on this diff:
@fmassa the CI crash happens with or without this change; I added it to make sure that it was not an earlier test silently failing (CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the crash is attributed to the right test). The crash is in a sputnik kernel, which makes little sense to me given the contents of this PR.

@@ -39,7 +39,7 @@
     "num_heads": 4,
     "residual_dropout": 0,
     "attention": {
-        "name": "linformer",
+        "name": "scaled_dot_product",
blefaudeux (Contributor, Author) commented on this diff:

We were actually passing an attention mask in this test, which makes little sense with Linformer. Now that we assert in this case, it had to be fixed :)
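
For reference, a minimal sketch of the kind of guard this refers to; the helper name is made up and only the supports_* flags come from this PR, so treat it as illustrative rather than the actual xformers dispatch code:

```python
# Illustrative sketch only, not the actual xformers multi-head dispatch code.
def _check_mask_support(attention, att_mask=None, key_padding_mask=None):
    # Fail loudly instead of silently dropping a mask the mechanism cannot honor.
    if att_mask is not None:
        assert getattr(attention, "supports_attention_mask", False), (
            f"{type(attention).__name__} does not support an attention mask"
        )
    if key_padding_mask is not None:
        assert getattr(attention, "supports_key_padding_mask", False), (
            f"{type(attention).__name__} does not support a key padding mask"
        )
```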

@blefaudeux force-pushed the label_attention_properties branch 6 times, most recently from db42817 to be72b26, April 13, 2022 04:11
@blefaudeux changed the base branch from main to conda_ci April 13, 2022 05:02
@blefaudeux changed the title from "[feat][minor] Make it explicit whether an attention mechanism supports a mask" to "[feat][minor] 2/3 Make it explicit whether an attention mechanism supports a mask" Apr 18, 2022
@blefaudeux marked this pull request as ready for review April 19, 2022 21:25
blefaudeux (Contributor, Author):

@dianaml0 @fmassa this goes with the Triton 2 PR again

fmassa (Contributor) left a comment:

LGTM, thanks!

I have only one comment (somewhat unrelated to this PR), which I'd love to get your opinion on.

@@ -53,6 +53,10 @@ def __init__(self, dropout: Optional[float] = None, *args, **kwargs):
         # so that the MHA wrapper should skip it
         self.requires_skip_multi_head = False

+        # Whether this attention mechanism supports attention masks
+        self.supports_attention_mask = True
+        self.supports_key_padding_mask = False
fmassa (Contributor) commented on this diff:
Ideally, I think I would prefer if we only supported attention_mask, and then supported the "optimized" case of key_padding_mask internally by checking the strides of the attention_mask tensor.
The key_padding_mask case just assumes that the stride of dimension -2 of the attention_mask is 0, so converting a key_padding_mask to an attention_mask can be done efficiently via

attn_mask = key_padding_mask[:, None, None, :].expand(batch, heads, query_len, key_len)

Indeed, are we sure that all our implementations support both attention_mask and key_padding_mask at the same time?
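
A small runnable illustration of that stride observation (the shapes here are made up for the example):

```python
import torch

batch, heads, query_len, key_len = 2, 4, 8, 8
key_padding_mask = torch.ones(batch, key_len, dtype=torch.bool)  # True = keep

# Broadcasting the key padding mask into a full attention mask is essentially free:
# expand() only creates a view, no memory is copied.
attn_mask = key_padding_mask[:, None, None, :].expand(batch, heads, query_len, key_len)

# The expanded view has stride 0 along the query dimension (-2), which is what an
# implementation could check to detect the key-padding case internally.
assert attn_mask.stride(-2) == 0
assert attn_mask.shape == (batch, heads, query_len, key_len)
```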

blefaudeux (Contributor, Author):

Agreed on the simpler interface; it's a bit confusing right now. One catch is that for some attentions (Nystrom, for instance) there's no attention mask, but there is a possible key padding mask (which can indeed generate a mask of sorts, but in that case the dimensions will differ). I think we could do a second pass on that indeed. Thoughts, @dianaml0?

dianaml0 (Contributor):

I agree that ideally we could have this simpler approach, but I ended up adding a key padding mask argument separately for Nystrom since Nystrom can't apply attention masks. I couldn't come up with a way around that, but maybe there's a better one.

blefaudeux (Contributor, Author):

Hmm, so ideally we would just support key padding for Nystrom and not make it visible everywhere? Right now there's a global flag because this happens in the MHA, but maybe we could move this one level down and not expose it for the other mechanisms?

blefaudeux (Contributor, Author):

Otherwise it feels like an abstraction leak on our end, which is not perfect... could be worth some more thinking.
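
As a hedged sketch of the "move it one level down" idea (the class and its forward body are hypothetical stand-ins, not the actual xformers Nystrom implementation):

```python
import torch

class NystromLikeAttention(torch.nn.Module):
    """Hypothetical mechanism that cannot honor a dense attention mask but
    handles key padding internally, so nothing leaks into the MHA wrapper."""

    def __init__(self):
        super().__init__()
        # Flags as introduced in this PR
        self.supports_attention_mask = False
        self.supports_key_padding_mask = True

    def forward(self, q, k, v, key_padding_mask=None):
        # q, k, v: (batch, seq, dim); key_padding_mask: (batch, seq), True = keep
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        if key_padding_mask is not None:
            # Handle padding inside the mechanism: the real Nystrom code would fold
            # this into its landmark computation; plain masked SDP stands in here.
            scores = scores.masked_fill(~key_padding_mask[:, None, :], float("-inf"))
        att = torch.softmax(scores, dim=-1)
        return att @ v
```

With the flags set this way, an MHA-level guard like the one sketched earlier in the thread would reject an attention mask for this mechanism but let a key padding mask through.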

@blefaudeux merged commit e3b57de into conda_ci Apr 21, 2022
blefaudeux added a commit that referenced this pull request Apr 21, 2022
…h combo (#271)

* testing using conda to get the pytorch nightlies and matching cuda

* [fix] Making it explicit whether the attention mechanism supports an attention mask or not (#266)

check the assert

* [backend] 3/3 Triton 2 update (#272)

* parent be72b26
author Kashif Rasul <kashif.rasul@gmail.com> 1648069860 +0100
committer Benjamin Lefaudeux <benjamin.lefaudeux@pm.me> 1650256563 -0700

Move to Triton 2

Author:    Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@pm.me>

Tentatively fixing layernorm

- faster all around
- bugfix

better take on sparse tensors, put layout on the correct device
update the pip packages, minor cleanup

* catering for triton blocksparse being probably more reliable in fp16

* faster layernorm

* Minor blocksparse refactoring, update block size restrictions, relax power of two constraint (#277)

* Relax device size restrictions

* Refactor device creation and run all tests

* linting

Co-authored-by: Cole Hawkins <colehawk@amazon.com>

* code review, thanks @fmassa !

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: colepshawkins <31542048+colehawkins@users.noreply.github.com>
Co-authored-by: Cole Hawkins <colehawk@amazon.com>

@blefaudeux deleted the label_attention_properties branch April 25, 2022 03:37