xFormers-ViT performance degradation in A100 GPU #14

Open

bnabis93 opened this issue Jul 11, 2023 · 5 comments
Performance degradation on an A100 GPU (forward-only median latency)

  • Vanilla Attention: 3.87 ms
  • Sparse Attention: 9.33 ms
  • Memory Efficient Attention: 6.34 ms
  • Sparse Attention is 2.4x slower than Vanilla Attention
  • Memory Efficient Attention is 1.6x slower than Vanilla Attention
python reproduce.py

ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fe7d59c9780>
profile
  Median: 3.87 ms
  IQR:    0.26 ms (3.76 to 4.02)
  514 measurements, 1 runs per measurement, 1 thread
Memory used: 28.87548828125 MB
Sparse ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fe803ea64a0>
profile
  Median: 9.33 ms
  IQR:    0.21 ms (9.31 to 9.51)
  22 measurements, 10 runs per measurement, 1 thread
Memory used: 32.49267578125 MB
Mem efficient ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fe803de7c10>
profile
  Median: 6.34 ms
  IQR:    0.08 ms (6.31 to 6.39)
  315 measurements, 1 runs per measurement, 1 thread
Memory used: 266.49072265625 MB
ViT average inference time : 3.186824321746826ms
ViT average inference time : 8.32324504852295ms
ViT average inference time : 5.165774822235107ms
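For reference, a minimal sketch of how such a forward-only timing can be collected with torch.utils.benchmark. This is not the actual reproduce.py; the model object, input shape, and min_run_time are placeholders.

```python
# Minimal sketch (not reproduce.py): forward-only timing with
# torch.utils.benchmark. `model` and `x` are hypothetical placeholders.
import torch
import torch.utils.benchmark as benchmark

device = "cuda"
x = torch.randn(32, 3, 224, 224, device=device)  # assumed input shape / batch size


def profile_forward(model, inp, label):
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        timer = benchmark.Timer(
            stmt="model(inp)",
            globals={"model": model, "inp": inp},
            label=label,
        )
        measurement = timer.blocked_autorange(min_run_time=2.0)
    torch.cuda.synchronize()
    print(measurement)
    print(f"Memory used: {torch.cuda.max_memory_allocated(device) / 2**20} MB")


# e.g. profile_forward(vit, x, "ViT Forward only") for each attention variant
```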
@bnabis93 bnabis93 changed the title xFormers performance degradation in A100 GPU xFormers-ViT performance degradation in A100 GPU Jul 11, 2023
@bnabis93

#13

@bnabis93 commented Jul 12, 2023

Answer 01.

Oh yes - good catch! We have kernels for f32 but they are not really efficient. You should use f16 or bf16 if possible to get the best speed. In fact, it's very likely that xFormers induces a slow-down when training in f32.
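To illustrate the suggestion, a minimal sketch of calling xFormers' memory_efficient_attention on f16 inputs instead of f32. The shapes below are assumptions (roughly ViT-Base-like), not taken from this repo.

```python
# Minimal sketch: memory-efficient attention on f16 inputs instead of f32.
# Shapes are illustrative (roughly ViT-Base-like), not from reproduce.py.
import torch
import xformers.ops as xops

B, M, H, K = 32, 197, 12, 64  # batch, tokens, heads, head dim

# f16 (or torch.bfloat16) inputs let xFormers dispatch to its fast kernels;
# with f32 inputs it falls back to the slower f32 paths mentioned above.
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)
```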

Answer 02.

@bnabis93

I converted the model to fp16 using autocast.
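A minimal sketch of the kind of autocast wrapper used here; the timm model name and input shape are assumptions, not taken from reproduce.py.

```python
# Minimal sketch: run the ViT forward pass under autocast so matmuls and
# attention execute in fp16 on the A100. Model choice and input shape are
# assumed, not taken from reproduce.py.
import timm  # assumed: any ViT implementation would do
import torch

device = "cuda"
vit = timm.create_model("vit_base_patch16_224", pretrained=False).to(device).eval()
x = torch.randn(32, 3, 224, 224, device=device)

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = vit(x)
```

One caveat: autocast keeps parameters in f32 and only casts at op boundaries, so depending on where the custom xFormers ops sit in the graph they may still receive f32 tensors. Fully converting the model and inputs with .half() is another way to make sure the f16 kernels are actually hit.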

Experiment results

python reproduce.py

ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa99dd81db0>
profile
  Median: 5.06 ms
  IQR:    0.19 ms (4.94 to 5.13)
  395 measurements, 1 runs per measurement, 1 thread
Memory used: 32.6630859375 MB
Sparse ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa99ddc7b20>
profile
  Median: 9.42 ms
  IQR:    0.26 ms (9.38 to 9.64)
  21 measurements, 10 runs per measurement, 1 thread
Memory used: 33.6650390625 MB
Mem efficient ViT Forward only
<torch.utils.benchmark.utils.common.Measurement object at 0x7fab3620a860>
profile
  Median: 6.62 ms
  IQR:    0.28 ms (6.41 to 6.69)
  301 measurements, 1 runs per measurement, 1 thread
Memory used: 269.17626953125 MB
ViT average inference time : 4.58606481552124ms
ViT average inference time : 9.033639430999756ms
ViT average inference time : 5.58741569519043ms

@bnabis93

python -m xformers.info

xFormers 0.0.20
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.flshattF:               available
memory_efficient_attention.flshattB:               available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        available
memory_efficient_attention.tritonflashattB:        available
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               True
is_functorch_available:                            False
pytorch.version:                                   2.0.1+cu117
pytorch.cuda:                                      available
gpu.compute_capability:                            8.0
gpu.name:                                          A100-SXM4-80GB
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.10.11
build.torch_version:                               2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.20
build.nvcc_version:                                11.8.89
source.privacy:                                    open source
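The info dump shows the FlashAttention kernels as available; one way to double-check whether this workload is actually eligible for them is to pin the backend explicitly. A sketch, assuming the fmha op classes exported by this 0.0.20 build; shapes are illustrative.

```python
# Sketch: pin memory_efficient_attention to the FlashAttention kernel to see
# whether the inputs (dtype, head dim, ...) are actually eligible for it.
# Assumes the fmha op classes shipped with this xFormers 0.0.20 build.
import torch
import xformers.ops as xops
from xformers.ops import fmha

q = torch.randn(32, 197, 12, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Raises if the requested op cannot handle these inputs (e.g. f32 tensors).
out = xops.memory_efficient_attention(q, k, v, op=(fmha.flash.FwOp, fmha.flash.BwOp))
```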

@bnabis93

#15
