
[feat] Adding a simple blockwise attention #192

Closed
wants to merge 322 commits

Conversation

@xwhan (Contributor) commented Jan 24, 2022

What does this PR do?

This attention is used in fairseq to build a long-document summarization model.

More efficient implementations should be possible with Triton or blocksparse; however, this one needs to stay compatible with CPU inference for the sake of deployment and demos.
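
For reference, here is a minimal sketch of the blockwise idea in plain PyTorch (not the PR's implementation; it assumes the sequence length is divisible by the block size and leaves out masking, dropout, and multi-head handling):

    import torch
    import torch.nn.functional as F

    def blockwise_attention(q, k, v, block_size):
        # q, k, v: (batch, seq_len, dim); seq_len must be a multiple of block_size
        bsz, seq_len, dim = q.shape
        n_blocks = seq_len // block_size

        # Fold the sequence into independent, non-overlapping blocks
        q = q.view(bsz, n_blocks, block_size, dim)
        k = k.view(bsz, n_blocks, block_size, dim)
        v = v.view(bsz, n_blocks, block_size, dim)

        # Standard scaled dot-product attention, restricted to each block
        scores = torch.matmul(q, k.transpose(-2, -1)) / dim ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, v)

        # Unfold back to (batch, seq_len, dim)
        return out.reshape(bsz, seq_len, dim)

Because each token only attends within its own block, the score tensor is (n_blocks, block_size, block_size) per batch rather than (seq_len, seq_len), which is what keeps long documents tractable on CPU.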

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
@dianaml0

fmassa and others added 30 commits April 28, 2021 10:31
@facebook-github-bot added the CLA Signed label Jan 24, 2022
@xwhan changed the title from "Adding a simple blockwise attention" to "[feat] Adding a simple blockwise attention" Jan 24, 2022
@xwhan marked this pull request as draft Jan 24, 2022
@dianaml0 self-requested a review Jan 24, 2022
@blefaudeux (Contributor) commented Jan 25, 2022

Hey @xwhan, thanks for the PR! FYI, sparse attention (just pass a block mask) would also work; it's CPU compatible, although not very optimized as of now (cc @fmassa). Curious to see a benchmark between the two. There are pattern presets here if that helps: https://github.com/facebookresearch/xformers/blob/main/xformers/components/attention/attention_patterns.py

@blefaudeux (Contributor) commented

On another front, I've just added some more explanations on pre-commit here, in case that helps. Sorry about all these steps; we're trying to keep the repo coherent and easy to read.

@xwhan (Contributor, Author) commented Jan 25, 2022

> Hey @xwhan, thanks for the PR! FYI, sparse attention (just pass a block mask) would also work; it's CPU compatible, although not very optimized as of now (cc @fmassa). Curious to see a benchmark between the two. There are pattern presets here if that helps: https://github.com/facebookresearch/xformers/blob/main/xformers/components/attention/attention_patterns.py

Thanks, I did not know those were already CPU compatible. I'll try it out with block masks and see if it's faster and stable in my experiments.

@xwhan (Contributor, Author) commented Jan 26, 2022

@blefaudeux I spent some time trying out passing a blockwise mask to scaled_dot_product_attention:

        # Build a block-diagonal boolean mask: True within each block, False elsewhere
        repeat = seq_len // self.block_size
        block_mask = torch.ones(self.block_size, self.block_size)
        mask = torch.kron(torch.eye(repeat), block_mask).bool()

The fp16 support is the main pain point: the masked version uses about 25% more memory and is 2x slower than this PR with fp16. When comparing them in fp32, the sputnik one saves a little memory but is still much slower. IMO, this fp16 bottleneck in sputnik might block most NLP use cases, especially large-scale pretraining.
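
(For context, the snippet below is a plain dense-PyTorch illustration of what such a block-diagonal mask does to the attention scores; it is not the sputnik-backed path benchmarked above, and the toy shapes are chosen purely for the example.)

    import torch

    seq_len, block_size, dim = 8, 4, 16   # toy sizes, for illustration only
    q, k, v = (torch.randn(1, seq_len, dim) for _ in range(3))

    # Same block-diagonal boolean mask as in the snippet above
    repeat = seq_len // block_size
    mask = torch.kron(torch.eye(repeat), torch.ones(block_size, block_size)).bool()

    # Dense attention with masked-out positions set to -inf before the softmax
    scores = torch.matmul(q, k.transpose(-2, -1)) / dim ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    out = torch.softmax(scores, dim=-1) @ v   # (1, seq_len, dim)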

@blefaudeux (Contributor) commented

Fair enough, thanks for checking it out! fp16 support is a known issue for sparse (see #15); we don't have that many cycles or that much manpower to work on it, unfortunately.

Blocksparse handles it, but in turn it's not CPU compatible (give it a try, it should be pretty fast! There is an example here, and this can also be a helper), so IMO your PR has value in this in-between space. Ideally we should have stronger primitives for sparse and blocksparse though, and @fmassa is working towards a single interface for all of these.
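
For what it's worth, a rough sketch of the blocksparse route might look like the following; the constructor arguments follow the xformers blocksparse tutorial of that period and may have changed since, so treat the exact names as an assumption rather than a reference:

    import torch
    from xformers.components.attention import BlockSparseAttention  # Triton-backed, CUDA-only

    HEADS, SEQ, BLOCK_SIZE = 8, 2048, 128
    n_blocks = SEQ // BLOCK_SIZE

    # Block-diagonal layout: one flag per (head, block_row, block_col)
    layout = torch.eye(n_blocks, dtype=torch.long).unsqueeze(0).repeat(HEADS, 1, 1)

    # fp16-friendly blocksparse attention restricted to that layout
    attention = BlockSparseAttention(
        layout=layout, block_size=BLOCK_SIZE, dropout=0.0, num_heads=HEADS
    )

In practice this would be wrapped in the library's multi-head machinery rather than called directly on raw tensors.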

@fmassa (Contributor) commented Feb 2, 2022

As @blefaudeux mentioned, I'm working on refactoring the sparse abstractions so that they are unified.
I can add a naive CPU implementation that will make it possible to run things on both CPU and GPU (through Triton). A WIP branch is at https://github.com/facebookresearch/xformers/tree/blocksparse_refactoring_v2

I will look into it today.

@fmassa (Contributor) commented Feb 6, 2022

Hi @xwhan

With #202 merged, we now have native blocksparse tensors with both CPU and CUDA support.

@fmassa (Contributor) commented Feb 8, 2022

Hi @xwhan

Let me know if you would like some help getting the new BlockSparse Tensor integrated into your pipeline. I believe it might be preferable to implementing multiple (similar) variants on top of the same base code, as all the changes needed would just be in defining the sparse matrix.

@xwhan (Contributor, Author) commented Feb 8, 2022

Thanks @fmassa for the update. That would be a great feature. I am closing the PR for now.

@xwhan closed this Feb 8, 2022