Question: is it possible to implement flash attention with keops #286
Comments
Hi @jaak-s, thanks for your interest in our library! I actually did this back in April 2021 in the "attention" branch, with a plug-in replacement for the MultiheadAttention layer. Please note, however, that KeOps is not competitive for implementing standard attention layers with attention heads of dimension > 16.
In this context, I think that KeOps may be of interest to people who want to experiment with "original" attention layers (as we sometimes do in geometric deep learning), but it is not really a competitive option for Natural Language Processing. If you would like to ask anything else, please let me know. Best regards,
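As a reference point, here is a minimal sketch (an illustration only, not the code of the "attention" branch) of a single attention head written with pykeops LazyTensors; it relies on the sumsoftmaxweight reduction, which fuses the softmax normalization with the value aggregation so that the (N, N) score matrix is never materialized:

```python
import torch
from pykeops.torch import LazyTensor

N, D = 4096, 16                       # sequence length, head dimension
q, k, v = torch.randn(3, N, D)        # toy queries, keys, values

q_i = LazyTensor(q[:, None, :])       # (N, 1, D) symbolic queries
k_j = LazyTensor(k[None, :, :])       # (1, N, D) symbolic keys
v_j = LazyTensor(v[None, :, :])       # (1, N, D) symbolic values

s_ij = (q_i | k_j) / D ** 0.5         # scaled dot-product scores, kept symbolic

# Streaming reduction: softmax over j, then weighted sum of the values.
out = s_ij.sumsoftmaxweight(v_j, axis=1)   # dense (N, D) result
```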
Thanks a lot for the detailed answer!
You're very welcome - that's an important question in today's context :-)
Agreed, it is a hot topic. Even though the current implementation of flash-attention is well optimized for NLP, there are applications outside NLP that need slight modifications, like relative position encodings or distance-based biases (ALiBi), which are not yet supported (Dao-AILab/flash-attention#17). With a KeOps-based implementation these changes feel like one-liner modifications and would make any such customization quite straightforward :-).
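As an illustration of that point (a hypothetical sketch, not the reference ALiBi code), a distance-based bias in the LazyTensor formulation above amounts to one extra term on the symbolic scores, built from the position indices:

```python
import torch
from pykeops.torch import LazyTensor

N, D = 4096, 16
slope = 0.0625                        # per-head ALiBi slope (hyperparameter)

q, k, v = torch.randn(3, N, D)
pos = torch.arange(N, dtype=torch.float32).view(N, 1)

q_i, k_j, v_j = LazyTensor(q[:, None, :]), LazyTensor(k[None, :, :]), LazyTensor(v[None, :, :])
p_i, p_j = LazyTensor(pos[:, None, :]), LazyTensor(pos[None, :, :])

# The "one-liner" modification: subtract a linear penalty on the distance |i - j|.
s_ij = (q_i | k_j) / D ** 0.5 - slope * (p_i - p_j).abs()

out = s_ij.sumsoftmaxweight(v_j, axis=1)   # (N, D), still without an (N, N) buffer
```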
I see, thanks for the pointers :-) We are not close enough to the Transformer literature to implement competitive layers ourselves (I already have my hands full applying KeOps to anatomical data and drug consumption records!), but I'm more than happy to provide performance tips and/or add useful features to KeOps if this could help the "attention" community. Our priorities for 2023 lie closer to transparent usage on generic hardware (a 100% NumPy-compatible interface, CPU support...) than to bleeding-edge performance on Nvidia GPUs (with automatic mixed precision, etc.), but these are certainly interesting research directions. Best regards,
Hi, I'm new to pykeops and was wondering if it would be possible to implement flash attention, which is used to remove the quadratic memory requirement in the sequence length:
https://github.com/HazyResearch/flash-attention
The basic idea is that one does not need N^2 memory, because each row of the attention matrix can be computed independently and immediately multiplied by V (so the whole NxN matrix never needs to be stored).
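For illustration (a naive chunked loop in plain PyTorch, not the fused flash-attention kernel, which also tiles over the keys with an online softmax), the row-blockwise idea looks like this:

```python
import torch

def chunked_attention(q, k, v, chunk=256):
    """Attention computed block-of-rows by block-of-rows.

    Only a (chunk, N) slice of the score matrix exists at any time,
    never the full (N, N) matrix.
    """
    scale = q.shape[-1] ** 0.5
    out = []
    for start in range(0, q.shape[0], chunk):
        s = q[start:start + chunk] @ k.T / scale        # (chunk, N) scores
        out.append(torch.softmax(s, dim=-1) @ v)        # (chunk, D) outputs
    return torch.cat(out, dim=0)

q, k, v = torch.randn(3, 4096, 64)
y = chunked_attention(q, k, v)                          # (4096, 64)
```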
Thanks!