
Flash attention unavailable after 0.0.21 on Windows system #863

Open
KohakuBlueleaf opened this issue Sep 22, 2023 · 6 comments

Comments

@KohakuBlueleaf

🐛 Bug

Command

python -m xformers.info

To Reproduce

Steps to reproduce the behavior:

Install xformers 0.0.21, or build from source at the latest commit, on Windows; memory_efficient_attention.flshattF/B are both reported as unavailable.
(Also, build.env.TORCH_CUDA_ARCH_LIST in the pre-built wheel doesn't include 8.6 and 8.9.)

Expected behavior

Both the pre-built wheel and a build from source should give us flash attention support.
(If this is because Windows lacks some feature needed by FlashAttention-2, please at least keep FlashAttention-1 support on Windows.)

I also wondered whether this is a bug in xformers.info, but since xformers 0.0.21 actually gives me slower results than 0.0.20, I think flash attention is really gone.
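
As a runtime cross-check (independent of xformers.info), here is a minimal sketch of forcing the flash op explicitly. It assumes the documented xformers.ops API and a CUDA build of PyTorch; the exact exception type raised for an unavailable op may differ between versions:

```python
import torch
import xformers.ops as xops

# fp16 CUDA tensors in the (batch, seqlen, heads, head_dim) layout xformers expects
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k, v = q.clone(), q.clone()

try:
    # Force the Flash-Attention-backed op pair instead of letting the dispatcher
    # choose; this fails if flshattF/B were not built into the wheel.
    xops.memory_efficient_attention(
        q, k, v, op=xops.MemoryEfficientAttentionFlashAttentionOp
    )
    print("flash attention op is usable")
except Exception as e:
    print("flash attention op unavailable:", e)
```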

Environment

  • Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)] (64-bit runtime)
  • Python platform: Windows-10-10.0.22621-SP0
  • Is CUDA available: True
  • CUDA runtime version: 12.1.105
  • CUDA_MODULE_LOADING set to: LAZY
  • GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4060 Ti
  • Nvidia driver version: 537.34
  • cuDNN version: Could not collect
  • HIP runtime version: N/A
  • MIOpen runtime version: N/A
  • Is XNNPACK available: True

Additional context

Here is the output of xformers.info on 0.0.21:

xFormers 0.0.21
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.decoderF:               available
memory_efficient_attention.flshattF@0.0.0:         unavailable
memory_efficient_attention.flshattB@0.0.0:         unavailable
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               False
is_functorch_available:                            False
pytorch.version:                                   2.0.1+cu118
pytorch.cuda:                                      available
gpu.compute_capability:                            8.9
gpu.name:                                          NVIDIA GeForce RTX 4060 Ti
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.11.4
build.torch_version:                               2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0+PTX 9.0
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.21
build.nvcc_version:                                11.8.89
source.privacy:                                    open source

Here is the output of 0.0.20:

xFormers 0.0.20
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.flshattF:               available
memory_efficient_attention.flshattB:               available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               False
is_functorch_available:                            False
pytorch.version:                                   2.0.1+cu118
pytorch.cuda:                                      available
gpu.compute_capability:                            8.9
gpu.name:                                          NVIDIA GeForce RTX 4060 Ti
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.11.3
build.torch_version:                               2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.20
build.nvcc_version:                                11.8.89
source.privacy:                                    open source
@rltgjqmcpgjadyd

Commit af6b866 has the same problem:

xFormers 0.0.22+af6b866.d20230926
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.decoderF:               available
memory_efficient_attention.flshattF@0.0.0:         unavailable
memory_efficient_attention.flshattB@0.0.0:         unavailable
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
memory_efficient_attention.triton_splitKF:         unavailable
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               False
pytorch.version:                                   2.1.0.dev20230821+cu121
pytorch.cuda:                                      available
gpu.compute_capability:                            8.9
gpu.name:                                          NVIDIA GeForce RTX 4090
build.info:                                        available
build.cuda_version:                                1201
build.python_version:                              3.11.5
build.torch_version:                               2.1.0.dev20230821+cu121
build.env.TORCH_CUDA_ARCH_LIST:                    8.9
build.env.XFORMERS_BUILD_TYPE:                     None
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              "-allow-unsupported-compiler"
build.env.XFORMERS_PACKAGE_FROM:                   None
build.nvcc_version:                                12.1.66
source.privacy:                                    open source

@danthe3rd
Contributor

Hi,
Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance Dao-AILab/flash-attention#565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2).
Once Flash-Attention v2 has support for Windows, we will add it back.
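
For anyone who wants to pin xformers' own implementation explicitly (the CUTLASS-based op that stays available on Windows), a minimal sketch assuming the documented xformers.ops op constants:

```python
import torch
import xformers.ops as xops

q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k, v = q.clone(), q.clone()

# Explicitly select the CUTLASS forward/backward pair rather than relying on
# automatic dispatch; this is the implementation that remains available on Windows.
out = xops.memory_efficient_attention(
    q, k, v, op=xops.MemoryEfficientAttentionCutlassOp
)
print(out.shape)  # torch.Size([1, 1024, 8, 64])
```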

@rbertus2000

> Hi, Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance Dao-AILab/flash-attention#565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2). Once Flash-Attention v2 has support for Windows, we will add it back.

It seems like flash-attention 2.3.2 supports Windows now. Dao-AILab/flash-attention#595 (comment)

@KohakuBlueleaf
Author

> Hi, Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance Dao-AILab/flash-attention#565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2). Once Flash-Attention v2 has support for Windows, we will add it back.

> It seems like flash-attention 2.3.2 supports Windows now. Dao-AILab/flash-attention#595 (comment)

I will try to build flash-attn with torch 2.1.0 and CUDA 12.1 to see if it works.

@Panchovix

> Hi, Flash-Attention does not support Windows at the moment, so we don't build it on Windows (see for instance Dao-AILab/flash-attention#565). We can still run our own implementation, which should be a bit faster than Flash v1 (but slower than Flash v2). Once Flash-Attention v2 has support for Windows, we will add it back.

> It seems like flash-attention 2.3.2 supports Windows now. Dao-AILab/flash-attention#595 (comment)

> I will try to build flash-attn with torch 2.1.0 and CUDA 12.1 to see if it works.

Does xformers automatically use FA2 if it is installed in the venv, or do you have to build xformers with FA2 installed instead?
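
(Not an answer to the dispatch question, but a quick way to confirm whether a standalone flash-attn build is even importable in the venv; a minimal sketch, and the __version__ attribute is an assumption about that package:)

```python
import importlib.util

# Check whether a separately installed flash-attn package can be imported at all,
# independent of which ops xformers itself was compiled with.
spec = importlib.util.find_spec("flash_attn")
print("flash_attn installed:", spec is not None)
if spec is not None:
    import flash_attn
    print("flash_attn version:", getattr(flash_attn, "__version__", "unknown"))
```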

@KohakuBlueleaf
Author

@danthe3rd Flash attention can be compiled/installed on Windows as of 2.3.2.
Will xformers be updated for it?
