
blocksparse gives RuntimeError: CUDA: Error- illegal address when increasing the block size #206

Closed
xwhan opened this issue Feb 5, 2022 · 4 comments · Fixed by #207
xwhan (Contributor) commented Feb 5, 2022

🐛 Bug

I'm not sure whether this is a bug or simply a restriction of Triton, but I followed your doc here: https://github.com/facebookresearch/xformers/blob/main/HOWTO.md#blocksparseattention

The sample works fine, but it fails when I increase the block size.

To Reproduce

Steps to reproduce the behavior:

Simply replace the hyperparameters as follows:
BATCH = 1
HEADS = 16
SEQ = 8192
EMB = 64 * HEADS
BLOCK_SIZE = 512
DROPOUT = 0.1
This should reproduce the error "RuntimeError: CUDA: Error- illegal address".
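For completeness, here is a self-contained sketch of the repro, following the BlockSparseAttention snippet from the linked HOWTO with the hyperparameters above swapped in. The import paths, constructor arguments, and `MultiHeadDispatch` parameters mirror the HOWTO as of early 2022 and may have drifted since:

```python
import torch

from xformers.components import MultiHeadDispatch
from xformers.components.attention import BlockSparseAttention

BATCH = 1
HEADS = 16
SEQ = 8192
EMB = 64 * HEADS
BLOCK_SIZE = 512  # triggers "RuntimeError: CUDA: Error- illegal address"
DROPOUT = 0.1

# The layout is expressed in units of blocks:
# [HEADS, SEQ // BLOCK_SIZE, SEQ // BLOCK_SIZE]
blocks = SEQ // BLOCK_SIZE
causal_layout = torch.tril(torch.ones([HEADS, blocks, blocks], dtype=torch.long))

attention = BlockSparseAttention(
    layout=causal_layout, block_size=BLOCK_SIZE, dropout=DROPOUT, num_heads=HEADS
)

multi_head = (
    MultiHeadDispatch(
        seq_len=SEQ,
        dim_model=EMB,
        residual_dropout=DROPOUT,
        num_heads=HEADS,
        attention=attention,
    )
    .cuda()
    .half()
)

inputs = torch.rand((BATCH, SEQ, EMB), device="cuda").half()
outputs = multi_head(query=inputs, key=inputs, value=inputs)
```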

  • PyTorch Version (e.g., 1.0): 1.10.2
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.8
  • CUDA/cuDNN version: 11.6
  • GPU models and configuration: A100

blefaudeux (Contributor) commented

Ah, 512 is not an option; off the top of my head it's 16/32/64. I need to guard against that! Thanks a lot for the report @xwhan, will fix ASAP.

blefaudeux (Contributor) commented

I was wrong about 64: the block size needs to be a multiple of 16 (because of tensor cores), but there is no obvious upper bound, except that you don't really gain anything above a given size (you can reproduce a 512 block with multiple 64 blocks, of course, as sketched below). I'll add an assert to make sure that users stay within reasonable bounds.
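To make the "reproduce a 512 block with multiple 64 blocks" point concrete, here is a small sketch, assuming the layout semantics from the HOWTO example (one layout cell per BLOCK_SIZE × BLOCK_SIZE tile):

```python
import torch

SEQ = 8192
SMALL_BLOCK = 64    # a supported block size (multiple of 16)
COARSE_BLOCK = 512  # the size the issue tried to use directly

blocks = SEQ // SMALL_BLOCK          # 128 layout cells per side
ratio = COARSE_BLOCK // SMALL_BLOCK  # 8 small blocks per coarse block

# Each dense 512x512 region on the diagonal becomes an 8x8 patch of ones
# in the 64-block layout; the set of computed attention coefficients is
# identical to what a single 512 block would cover.
layout = torch.zeros(blocks, blocks, dtype=torch.long)
for i in range(0, blocks, ratio):
    layout[i : i + ratio, i : i + ratio] = 1
```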

xwhan (Contributor, Author) commented Feb 7, 2022

Thanks @blefaudeux, I'm not sure I understand "reproduce a 512 block with multiple 64 blocks" correctly --- won't the softmax still be calculated within each 64-sized block?

blefaudeux (Contributor) commented

Ah no; "blocksparse" typically just means that the sparsity pattern is blocky, but the softmax is computed over all the coefficients that are computed, not per tile, unless I misunderstood your question?

If you want the normalization to be restricted to a neighborhood only, then that's different; typically you could get that by summing blocksparse results computed with non-overlapping patterns, as sketched below.
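A hedged sketch of that suggestion, assuming the `BlockSparseAttention` constructor from the HOWTO and a `(q, k, v)` forward signature (the actual call convention and expected tensor shapes should be checked against the xformers version in use). Because each attention call normalizes its softmax over its own pattern only, summing the outputs of attentions built from disjoint layouts keeps each term's normalization local to its own neighborhood:

```python
import torch
from xformers.components.attention import BlockSparseAttention

def locally_normalized_attention(q, k, v, layouts, block_size, num_heads):
    """Sum blocksparse attention outputs over disjoint (non-overlapping) layouts.

    Each BlockSparseAttention instance computes its softmax over its own
    pattern only, so every term in the sum is normalized within the
    neighborhood that its layout defines.
    """
    out = None
    for layout in layouts:
        attn = BlockSparseAttention(
            layout=layout, block_size=block_size, num_heads=num_heads
        ).to(q.device)
        y = attn(q=q, k=k, v=v)
        out = y if out is None else out + y
    return out
```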
