[feat] Adding a simple blockwise attention #192
Conversation
They are not used anymore as we have our own custom implementations
…acebookresearch#68) * Add broadcasting support for matmul_with_mask with sparse matrices * Fix lint
…ebookresearch#69) * Add functions for generating basic sparsity patterns They can be combined together to yield complex patterns * Use attention_patterns in modules * Fix lint
* Add current plots for reference * Remove unnecessary code * Had to do extra casting due to limitations of coalesce on bool tensors * linting
* Add naive CPU implementations for CSR format * Remove CUDA-only headers * Remove trailing cuda device in test
* Enable sputnik kernels in xformers blocks * Revert to previous value for density threshold So that plots can be comparable, as the global attention depends on density_threshold for benchmarks * Default device is now CPU
…ion - fixes flake8 warning
Attention outputs are close. Similar to the Global Attention test.
* Bugfix: gradients were wrong due to wrong tests * Add more checks for sddmm * Add checks for spmm * Add checks for sparse_softmax * Add more checks in python * Lint
Add Initial Nystrom Attention Mechanism
Moved it to util.py but had forgotten to delete it from core.py in previous PR (facebookresearch#59)
hey @xwhan thanks for the PR! FYI sparse attention (just pass a block mask) would also work (it's CPU compatible, although not very optimized as of now, cc @fmassa); curious to see a benchmark between the two. There are pattern presets here if that helps: https://github.com/facebookresearch/xformers/blob/main/xformers/components/attention/attention_patterns.py
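For readers following along, a blockwise boolean mask of the kind suggested above can be built with plain PyTorch; the helper below is purely illustrative and not part of xformers:

```python
import torch

def block_diagonal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    # Boolean [seq_len, seq_len] mask that is True when query i and key j
    # fall in the same block, i.e. each token only attends within its block.
    block_ids = torch.arange(seq_len) // block_size
    return block_ids[:, None] == block_ids[None, :]

mask = block_diagonal_mask(seq_len=1024, block_size=128)  # torch.bool, [1024, 1024]
```

Such a mask could then be handed to the sparse `scaled_dot_product_attention` path discussed later in the thread; the exact argument name it is passed under depends on the xformers version, so treat that detail as an assumption.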
on another front I've just added some more explanations on pre-commit here, in case that helps? Sorry about all these steps, we're trying to keep the repo coherent and easy to read
Thanks, I did not know those were already CPU compatible. I'll try it out with block masks and see if it's faster and stable in my experiments.
@blefaudeux I spent some time trying out passing a blockwise mask to scaled_dot_product_attention.
The fp16 support is the main pain point. It uses 1/4 more memory and is 2x slower than the PR with fp16. When comparing them in fp32 settings, the sputnik one saves a little memory but is still much slower. IMO, this fp16 bottleneck of sputnik might block most NLP use cases, especially large-scale pretraining.
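For context, numbers like these are usually collected with a small probe along the following lines (a generic sketch, not the author's actual benchmark script):

```python
import time
import torch

def profile(fn, *args, warmup=3, iters=10):
    # Rough latency / peak-memory probe for comparing attention variants on GPU.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    latency_ms = (time.time() - start) * 1000 / iters
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return latency_ms, peak_mb
```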
fair enough, thanks for checking it out! fp16 support is a known issue for sparse (see #15); we don't have that many cycles or manpower to work on that, unfortunately. Blocksparse handles it, though in turn it's not CPU compatible (give it a try, it should be pretty fast! Some example here, this can also be a helper), so IMO your PR has value in this in-between space. Ideally we should have stronger primitives with sparse and blocksparse though, and @fmassa is working towards a single interface for all of these
As @blefaudeux mentioned, I'm working on refactoring the sparse abstractions so that they are unified. I will look into it today |
Hi @xwhan Let me know if you would like some help getting the new |
Thanks @fmassa for the update. That would be a great feature. I am closing the PR for now |
What does this PR do?
This attention is used in fairseq to build a long-doc summarization model.
More efficient implementations should be possible with Triton or blocksparse. However, it needs to stay compatible with CPU inference for the sake of deployment and demos.
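For readers who have not opened the diff, the core idea is roughly the following dense sketch (an illustration of blockwise attention in general, with illustrative names, not the exact code in this PR):

```python
import torch
import torch.nn.functional as F

def blockwise_attention(q, k, v, block_size: int):
    # q, k, v: [batch, seq_len, dim]; seq_len is assumed divisible by block_size.
    b, s, d = q.shape
    n = s // block_size
    # Fold the sequence into independent blocks and attend within each block only.
    q, k, v = (t.reshape(b, n, block_size, d) for t in (q, k, v))
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5
    probs = F.softmax(scores, dim=-1)
    out = torch.matmul(probs, v)  # [batch, n_blocks, block_size, dim]
    return out.reshape(b, s, d)
```

Because every block is a plain dense matmul, a formulation like this runs on CPU and in fp16 without custom kernels, which is the trade-off the description above is pointing at.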
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
@dianaml0