[feat] Compositional attention #178
Conversation
Please feel free to reach out. I'd be happy to help :)
Force-pushed from 88e0226 to 6c2ae94
@sarthmit actually, I have a question: I've seen quite a few `.contiguous().do_something()` calls (for instance), and it's not completely obvious to me that they are beneficial, since the tensor structure is changed right after, and you don't seem to be using specific kernels which would require a contiguous tensor in the first place (there are actually a bunch of kernels in xformers which would love that, but `.contiguous()` would come last in that case, not first).
@blefaudeux I think you are right, there is probably no use for those `.contiguous()` calls. I just used the fairseq multi-head codebase as the starting point and they seem to have them as well, so I never bothered to remove them. See: multi-head
Force-pushed from 6c2ae94 to 47d5a51
Codecov Report
@@            Coverage Diff             @@
##             main     #178      +/-   ##
==========================================
+ Coverage   90.62%   91.02%   +0.39%
==========================================
  Files          58       59       +1
  Lines        2892     3019     +127
==========================================
+ Hits         2621     2748     +127
  Misses        271      271
Force-pushed from 47d5a51 to 8d4fa5d
Force-pushed from 386d54e to 0df1adc
Force-pushed from 0df1adc to b48293e
Force-pushed from 71a75c8 to b59980d
Force-pushed from b59980d to 7583dd4
@sarthmit if you have cycles for a review, that would be great! I've removed a few options (not that many actually) and aligned the terms with the other attentions here, plus reused existing building blocks that we have (input projections for instance; there's an optimization to be used for self-attention).
@@ -11,7 +11,7 @@ Please note that:
 - These numbers are dependent of hyperparameters (dimensions chosen for Linformer, sparsity of the pattern), they are mostly an illustration
 - The sparse attention patterns tested here are just presets, as explained in the linked notebook generating any new sparse attention pattern should be relatively easy, while keeping the benefits of optimized computations.

-Some examples, generated with `python3 xformers/benchmarks/benchmark_encoder.py --activations gelu --plot -emb 256 -bs 32 -heads 16`
+Some examples, generated with `python3 xformers/benchmarks/benchmark_encoder.py --activations gelu --plot -emb 256 -bs 8 -heads 4`
reducing the memory load, generating a new graph for everyone
Force-pushed from 7583dd4 to f17a70c
@@ -5,6 +5,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

 ## TBD
 ### Added
 - Compositional Attention [#41]
hotfix, this was needed actually, not directly tied to this PR
@@ -43,10 +43,12 @@ def _get_multihead(
     "dropout": attn_dropout,
     "causal": causal,
     "seq_len": SEQ,
-    "window_size": SEQ // 8 + 1,
+    "window_size": SEQ // 8 + 1,  # local attention
I tried to align the different field names and reformulate wherever possible, but there are still some specificities. It's more visible here since it's a specific attention unit test, but in practice the new fields were already needed for the MHA and they don't need to be duplicated.
assert ATTENTION_REGISTRY.keys(), "Attention layers should have been registered"


@pytest.mark.parametrize("attn_dropout", [0.0, 0.3])
code coverage, this attention exposes quite a few knobs which are mostly orthogonal to the other attentions, so I figured it was best to cover them in a dedicated unit test
@@ -80,7 +80,7 @@
     "eval_frequency": 50,
     "num_train_steps": 10000,
     "num_eval_steps": 62,
-    "gradient_accumulation": 1
+    "gradient_accumulation": 2
compositional takes more memory, so this task does not pass on an 8GB GPU (desktop 3080) without bumping up the accumulation
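(For context, a generic PyTorch sketch of why bumping the accumulation helps, not the LRA runner's actual code: each forward/backward pass then only sees half of the effective batch, which roughly halves peak activation memory, while the optimizer still steps on the same total batch.)

```python
import torch

model = torch.nn.Linear(16, 16)
optim = torch.optim.SGD(model.parameters(), lr=0.1)
full_batch = torch.randn(8, 16)
accumulation = 2  # mirrors the config change above

optim.zero_grad()
for micro_batch in full_batch.chunk(accumulation):
    # scale the loss so that the accumulated gradients match a single full-batch step
    loss = model(micro_batch).pow(2).mean() / accumulation
    loss.backward()
optim.step()
```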
@@ -48,6 +48,11 @@ def build_multi_head_attention(
         "num_heads"
     ]

+    if "dim_model" not in multi_head_config["attention"]:
convenience: removes the need for field duplication between the attention and MHA configs
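(Roughly, the convenience default looks like the sketch below; field names are illustrative rather than a verbatim copy of `build_multi_head_attention`. The MHA config already knows `dim_model`, so the nested attention config can inherit it instead of repeating it.)

```python
# Hypothetical config; the point is only the "propagate dim_model if missing" idea.
multi_head_config = {
    "dim_model": 384,
    "num_heads": 6,
    "attention": {"name": "compositional"},  # no duplicated dim_model here
}

if "dim_model" not in multi_head_config["attention"]:
    multi_head_config["attention"]["dim_model"] = multi_head_config["dim_model"]

assert multi_head_config["attention"]["dim_model"] == 384
```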
@@ -49,7 +49,7 @@ def from_bool(cls: Type[Self], x: torch.Tensor) -> Self:
        """
        assert x.dtype == torch.bool

-       additive_mask = torch.empty_like(x, dtype=torch.float)
+       additive_mask = torch.empty_like(x, dtype=torch.float, device=x.device)
it's already doing that by default I believe, just making it more explicit
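(For reference, a simplified sketch of the bool-to-additive conversion this diff touches, not the exact `AttentionMask.from_bool` code: True positions become 0, False positions become -inf, and `empty_like` already places the result on `x`'s device, so the `device` argument is indeed just explicitness.)

```python
import torch

def additive_from_bool(x: torch.Tensor) -> torch.Tensor:
    assert x.dtype == torch.bool
    # empty_like inherits x's device; device=x.device only spells it out
    additive_mask = torch.empty_like(x, dtype=torch.float, device=x.device)
    additive_mask.masked_fill_(x, 0.0)             # True  -> keep (add 0)
    additive_mask.masked_fill_(~x, float("-inf"))  # False -> masked out
    return additive_mask

print(additive_from_bool(torch.tensor([[True, False], [True, True]])))
```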
dim_key = _either_or(dim_key, dim_model)
dim_value = _either_or(dim_value, dim_model)

self.in_proj_container = (
difference with the original implementation: reuse the xformers input projection, because the task is the same and there are some optimizations to be had here (for self-attention, project once to get q, k, v instead of making three calls, for instance)
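(A minimal sketch of that self-attention shortcut; hypothetical module, not the actual xformers `in_proj_container`: when q, k and v all come from the same input, one fused projection replaces three separate matmuls.)

```python
import torch
import torch.nn as nn

class FusedSelfAttentionProjection(nn.Module):
    """One (3 * dim_model, dim_model) weight instead of three separate projections."""

    def __init__(self, dim_model: int):
        super().__init__()
        self.qkv = nn.Linear(dim_model, 3 * dim_model)

    def forward(self, x: torch.Tensor):
        # single matmul, then split into the three streams
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return q, k, v

proj = FusedSelfAttentionProjection(dim_model=256)
q, k, v = proj(torch.randn(2, 16, 256))  # (batch, seq, dim_model)
```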
Interesting approach!
if att_mask_additive is not None:
    attn_weights += att_mask_additive.values

if _is_triton_available:
Could we just use the softmax here?
good catch, fixing that!
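(For the record, the plain-softmax path being suggested is just the following, simplified with assumed shapes rather than the exact PR code: add the additive mask, then call `torch.softmax` over the key dimension.)

```python
import torch

attn_weights = torch.randn(2, 4, 16, 16)   # (batch, heads, queries, keys), assumed shapes
att_mask_additive = torch.zeros(16, 16)
att_mask_additive[:, 8:] = float("-inf")   # toy mask: hide the second half of the keys

attn_weights = attn_weights + att_mask_additive
attn_probs = torch.softmax(attn_weights, dim=-1)
```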
@blefaudeux What do you mean by cycles for a review? By the way, there is also an easy way to test/debug the code: if you set the Value Scores in Equation 13 to the Identity matrix (here), then you should recover the standard Multi-Head Attention performance.
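(A standalone toy sketch of that sanity check, not the xformers API: the einsum spells out the "every search head reads every value head" retrieval step, and an identity value score makes search head i keep only value head i, which is exactly standard MHA.)

```python
import torch

B, S, H, Dh = 2, 16, 4, 8  # batch, sequence, heads, head dim (toy values)
q, k, v = (torch.randn(B, H, S, Dh) for _ in range(3))

# Standard multi-head attention: each search head reads its own value head
attn = torch.softmax(q @ k.transpose(-2, -1) / Dh**0.5, dim=-1)   # (B, H, S, S)
mha_out = attn @ v                                                # (B, H, S, Dh)

# Compositional-style retrieval: search head i can read every value head j
retrieved = torch.einsum("bhqk,bjkd->bhjqd", attn, v)             # (B, H, H, S, Dh)

# Value scores fixed to identity -> search head i keeps only value head i
value_score = torch.eye(H).view(1, H, H, 1, 1)
comp_out = (value_score * retrieved).sum(dim=2)                   # (B, H, S, Dh)

assert torch.allclose(mha_out, comp_out, atol=1e-6)
```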
I meant "brain cycles" :) (aka free time). I can have a look later today, but the logic should be maintained |
it's a good idea I think, testing that right now; I'm still seeing a difference in perf, investigating. I can see that there are other differences with the Vaswani paper, for instance the normalization is not the same (`/sqrt(dim_head)` vs `/sqrt(dim_model)`), but I don't think that's enough to explain the difference.
ok, confirmed @sarthmit @dianaml0: when forcing the value scores to be identity, the perf is in line with MHA. The difference in the above is due to... the rotary embeddings! These are not part of this PR but they were used in the comparison above for the classical mechanism, and they explain the gap (without them the two mechanisms are aligned if the value score is identity, as expected). All good for landing I think, and it means that supporting rotary embeddings in compositional attention could be a follow-up.
@blefaudeux Awesome! I can try to spare some time if needed, is there anything in particular you want me to look at? I would also be curious to see if you get performance gains on any of the tasks (without the identity matrix).
I've not checked LRA, but it could be interesting to get the scores, even if I think that LRA could do with more and better tasks (see this paper from @xwhan for instance: https://arxiv.org/pdf/2112.07210.pdf). Rotary embeddings also seem to make a big difference for NLP; it would be nice to support them with compositional, and it should not be a big change. There's a definite perf impact though (at least with our implementations, even when pulling in some semi-optimized parts from xformers); it could be another axis to look at, because I think a lot of folks will "just" bump up the number of heads with the scaled dot product instead if compositional has too much of a speed/memory impact.
Thanks, I will check the paper out. How big is the perf impact? In favour of MHA or in favour of compositional?
By perf I meant compute-time impact, sorry that wasn't clear. Compositional (this PR at least) is significantly slower than MHA (from this repo again) for the same number of heads, even if I understand that it captures way more relations. Could be worth a second pair of eyes to check whether there are low-hanging fruits?
What does this PR do?
Implements Compositional Attention (based on the reference implementation), as mentioned in #41
Paper
TODOs
cc @sarthmit if interested
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.