xFormers-ViT performance degradation in A100 GPU #14

Open

bnabis93 opened this issue Jul 11, 2023 · 5 comments
Performance degradation on an A100 GPU (forward-only median latency)

  • Vanilla Attention: 3.87 ms
  • Sparse Attention: 9.33 ms
  • Memory Efficient Attention: 6.34 ms
  • Sparse Attention is 2.4x slower than Vanilla Attention
  • Memory Efficient Attention is 1.6x slower than Vanilla Attention
python reproduce.py

ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fe7d59c9780>
profile
  Median: 3.87 ms
  IQR:    0.26 ms (3.76 to 4.02)
  514 measurements, 1 runs per measurement, 1 thread
Memory used: 28.87548828125 MB
Sparse ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fe803ea64a0>
profile
  Median: 9.33 ms
  IQR:    0.21 ms (9.31 to 9.51)
  22 measurements, 10 runs per measurement, 1 thread
Memory used: 32.49267578125 MB
Mem efficient ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fe803de7c10>
profile
  Median: 6.34 ms
  IQR:    0.08 ms (6.31 to 6.39)
  315 measurements, 1 runs per measurement, 1 thread
Memory used: 266.49072265625 MB
ViT average inference time : 3.186824321746826ms
ViT average inference time : 8.32324504852295ms
ViT average inference time : 5.165774822235107ms
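For reference, a minimal sketch of how such a forward-only timing can be collected with torch.utils.benchmark. This is not the actual reproduce.py; the model object, input shape, and min_run_time are placeholders.

```python
# Minimal sketch (not reproduce.py): forward-only timing with
# torch.utils.benchmark. `model` and `x` are hypothetical placeholders.
import torch
import torch.utils.benchmark as benchmark

device = "cuda"
x = torch.randn(32, 3, 224, 224, device=device)  # assumed input shape / batch size


def profile_forward(model, inp, label):
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        timer = benchmark.Timer(
            stmt="model(inp)",
            globals={"model": model, "inp": inp},
            label=label,
        )
        measurement = timer.blocked_autorange(min_run_time=2.0)
    torch.cuda.synchronize()
    print(measurement)
    print(f"Memory used: {torch.cuda.max_memory_allocated(device) / 2**20} MB")


# e.g. profile_forward(vit, x, "ViT Forward only") for each attention variant
```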
@bnabis93 bnabis93 changed the title xFormers performance degradation in A100 GPU xFormers-ViT performance degradation in A100 GPU Jul 11, 2023
@bnabis93

#13

@bnabis93 commented Jul 12, 2023

Answer 01.

Oh yes - good catch! We have kernels for f32 but they are not really efficient. You should use f16 or bf16 if possible to get the best speed. In fact, it's very likely that xFormers induces a slow-down when training in f32.
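To illustrate the suggestion, a minimal sketch of calling xFormers' memory_efficient_attention on f16 inputs instead of f32. The shapes below are assumptions (roughly ViT-Base-like), not taken from this repo.

```python
# Minimal sketch: memory-efficient attention on f16 inputs instead of f32.
# Shapes are illustrative (roughly ViT-Base-like), not from reproduce.py.
import torch
import xformers.ops as xops

B, M, H, K = 32, 197, 12, 64  # batch, tokens, heads, head dim

# f16 (or torch.bfloat16) inputs let xFormers dispatch to its fast kernels;
# with f32 inputs it falls back to the slower f32 paths mentioned above.
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)
```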

Answer 02.

@bnabis93

I converted the model to fp16 using autocast.
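A minimal sketch of the kind of autocast wrapper used here; the timm model name and input shape are assumptions, not taken from reproduce.py.

```python
# Minimal sketch: run the ViT forward pass under autocast so matmuls and
# attention execute in fp16 on the A100. Model choice and input shape are
# assumed, not taken from reproduce.py.
import timm  # assumed: any ViT implementation would do
import torch

device = "cuda"
vit = timm.create_model("vit_base_patch16_224", pretrained=False).to(device).eval()
x = torch.randn(32, 3, 224, 224, device=device)

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = vit(x)
```

One caveat: autocast keeps parameters in f32 and only casts at op boundaries, so depending on where the custom xFormers ops sit in the graph they may still receive f32 tensors. Fully converting the model and inputs with .half() is another way to make sure the f16 kernels are actually hit.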

Experiment results

python reproduce.py

ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa99dd81db0>
profile
  Median: 5.06 ms
  IQR:    0.19 ms (4.94 to 5.13)
  395 measurements, 1 runs per measurement, 1 thread
Memory used: 32.6630859375 MB
Sparse ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa99ddc7b20>
profile
  Median: 9.42 ms
  IQR:    0.26 ms (9.38 to 9.64)
  21 measurements, 10 runs per measurement, 1 thread
Memory used: 33.6650390625 MB
Mem efficient ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fab3620a860>
profile
  Median: 6.62 ms
  IQR:    0.28 ms (6.41 to 6.69)
  301 measurements, 1 runs per measurement, 1 thread
Memory used: 269.17626953125 MB
ViT average inference time : 4.58606481552124ms
ViT average inference time : 9.033639430999756ms
ViT average inference time : 5.58741569519043ms

@bnabis93

python -m xformers.info

xFormers 0.0.20
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.flshattF:               available
memory_efficient_attention.flshattB:               available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        available
memory_efficient_attention.tritonflashattB:        available
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               True
is_functorch_available:                            False
pytorch.version:                                   2.0.1+cu117
pytorch.cuda:                                      available
gpu.compute_capability:                            8.0
gpu.name:                                          A100-SXM4-80GB
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.10.11
build.torch_version:                               2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.20
build.nvcc_version:                                11.8.89
source.privacy:                                    open source
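The info dump shows the FlashAttention kernels as available; one way to double-check whether this workload is actually eligible for them is to pin the backend explicitly. A sketch, assuming the fmha op classes exported by this 0.0.20 build; shapes are illustrative.

```python
# Sketch: pin memory_efficient_attention to the FlashAttention kernel to see
# whether the inputs (dtype, head dim, ...) are actually eligible for it.
# Assumes the fmha op classes shipped with this xFormers 0.0.20 build.
import torch
import xformers.ops as xops
from xformers.ops import fmha

q = torch.randn(32, 197, 12, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Raises if the requested op cannot handle these inputs (e.g. f32 tensors).
out = xops.memory_efficient_attention(q, k, v, op=(fmha.flash.FwOp, fmha.flash.BwOp))
```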

@bnabis93

#15
