
[feat] Adding a simple blockwise attention #192

Closed
wants to merge 322 commits

Conversation

@xwhan (Contributor) commented Jan 24, 2022

What does this PR do?

This attention is used in fairseq to build a long-document summarization model.

More efficient implementations should be possible with Triton or blocksparse; however, this one needs to stay compatible with CPU inference for the sake of deployment and demos.
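
For reference, here is a minimal sketch of the blockwise idea in plain PyTorch (not the PR's implementation; it assumes the sequence length is divisible by the block size and leaves out masking, dropout, and multi-head handling):

    import torch
    import torch.nn.functional as F

    def blockwise_attention(q, k, v, block_size):
        # q, k, v: (batch, seq_len, dim); seq_len must be a multiple of block_size
        bsz, seq_len, dim = q.shape
        n_blocks = seq_len // block_size

        # Fold the sequence into independent, non-overlapping blocks
        q = q.view(bsz, n_blocks, block_size, dim)
        k = k.view(bsz, n_blocks, block_size, dim)
        v = v.view(bsz, n_blocks, block_size, dim)

        # Standard scaled dot-product attention, restricted to each block
        scores = torch.matmul(q, k.transpose(-2, -1)) / dim ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, v)

        # Unfold back to (batch, seq_len, dim)
        return out.reshape(bsz, seq_len, dim)

Because each token only attends within its own block, the score tensor is (n_blocks, block_size, block_size) per batch rather than (seq_len, seq_len), which is what keeps long documents tractable on CPU.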

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
@dianaml0

fmassa and others added 30 commits April 28, 2021 10:31
@facebook-github-bot added the CLA Signed label Jan 24, 2022
@xwhan changed the title from "Adding a simple blockwise attention" to "[feat] Adding a simple blockwise attention" Jan 24, 2022
@xwhan marked this pull request as draft Jan 24, 2022
@dianaml0 self-requested a review Jan 24, 2022
@blefaudeux (Contributor) commented Jan 25, 2022

Hey @xwhan, thanks for the PR! FYI, sparse attention (just pass a block mask) would also work; it's CPU compatible, although not very optimized as of now (cc @fmassa). Curious to see a benchmark between the two. There are pattern presets here if that helps: https://github.com/facebookresearch/xformers/blob/main/xformers/components/attention/attention_patterns.py

@blefaudeux (Contributor) commented

On another front, I've just added some more explanations on pre-commit here, in case that helps. Sorry about all these steps; we're trying to keep the repo coherent and easy to read.

@xwhan (Contributor, Author) commented Jan 25, 2022

> Hey @xwhan, thanks for the PR! FYI, sparse attention (just pass a block mask) would also work; it's CPU compatible, although not very optimized as of now (cc @fmassa). Curious to see a benchmark between the two. There are pattern presets here if that helps: https://github.com/facebookresearch/xformers/blob/main/xformers/components/attention/attention_patterns.py

Thanks, I did not know those were already CPU compatible. I'll try it out with block masks and see if it's faster and stable in my experiments.

@xwhan (Contributor, Author) commented Jan 26, 2022

@blefaudeux I spent some time trying out passing a blockwise mask to scaled_dot_product_attention:

        # Build a block-diagonal boolean mask: True within each block, False elsewhere
        repeat = seq_len // self.block_size
        block_mask = torch.ones(self.block_size, self.block_size)
        mask = torch.kron(torch.eye(repeat), block_mask).bool()

The fp16 support is the main pain point: the masked version uses about 25% more memory and is 2x slower than this PR with fp16. When comparing them in fp32, the sputnik one saves a little memory but is still much slower. IMO, this fp16 bottleneck in sputnik might block most NLP use cases, especially large-scale pretraining.
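
(For context, the snippet below is a plain dense-PyTorch illustration of what such a block-diagonal mask does to the attention scores; it is not the sputnik-backed path benchmarked above, and the toy shapes are chosen purely for the example.)

    import torch

    seq_len, block_size, dim = 8, 4, 16   # toy sizes, for illustration only
    q, k, v = (torch.randn(1, seq_len, dim) for _ in range(3))

    # Same block-diagonal boolean mask as in the snippet above
    repeat = seq_len // block_size
    mask = torch.kron(torch.eye(repeat), torch.ones(block_size, block_size)).bool()

    # Dense attention with masked-out positions set to -inf before the softmax
    scores = torch.matmul(q, k.transpose(-2, -1)) / dim ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    out = torch.softmax(scores, dim=-1) @ v   # (1, seq_len, dim)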

@blefaudeux (Contributor) commented

Fair enough, thanks for checking it out! fp16 support is a known issue for sparse (see #15); we don't have that many cycles or that much manpower to work on it, unfortunately.

Blocksparse handles it, but in turn it's not CPU compatible (give it a try, it should be pretty fast! There is an example here, and this can also be a helper), so IMO your PR has value in this in-between space. Ideally we should have stronger primitives for sparse and blocksparse though, and @fmassa is working towards a single interface for all of these.
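
For what it's worth, a rough sketch of the blocksparse route might look like the following; the constructor arguments follow the xformers blocksparse tutorial of that period and may have changed since, so treat the exact names as an assumption rather than a reference:

    import torch
    from xformers.components.attention import BlockSparseAttention  # Triton-backed, CUDA-only

    HEADS, SEQ, BLOCK_SIZE = 8, 2048, 128
    n_blocks = SEQ // BLOCK_SIZE

    # Block-diagonal layout: one flag per (head, block_row, block_col)
    layout = torch.eye(n_blocks, dtype=torch.long).unsqueeze(0).repeat(HEADS, 1, 1)

    # fp16-friendly blocksparse attention restricted to that layout
    attention = BlockSparseAttention(
        layout=layout, block_size=BLOCK_SIZE, dropout=0.0, num_heads=HEADS
    )

In practice this would be wrapped in the library's multi-head machinery rather than called directly on raw tensors.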

@fmassa (Contributor) commented Feb 2, 2022

As @blefaudeux mentioned, I'm working on refactoring the sparse abstractions so that they are unified.
I can add a naive CPU implementation that will make it possible to run things on both CPU and GPU (through Triton). A WIP branch is at https://github.com/facebookresearch/xformers/tree/blocksparse_refactoring_v2

I will look into it today.

@fmassa (Contributor) commented Feb 6, 2022

Hi @xwhan

With #202 merged, we now have native blocksparse tensors with both CPU and CUDA support.

@fmassa (Contributor) commented Feb 8, 2022

Hi @xwhan

Let me know if you would like some help getting the new BlockSparse Tensor integrated into your pipeline. I believe it might be preferable to implementing multiple (similar) variants on top of the same base code, as all the changes needed would just be in defining the sparse matrix.

@xwhan (Contributor, Author) commented Feb 8, 2022

Thanks @fmassa for the update. That would be a great feature. I am closing the PR for now.

@xwhan closed this Feb 8, 2022