
Performance gap between Vulkan Backend and CUDA Backend on NVIDIA A100 #17273

@SS-JIA

Description


Name and Version

$ build-vk/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA PG509-210 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
version: 7052 (389ac78b2)
built with cc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-11) for x86_64-redhat-linux

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-bench

Command line

#!/bin/bash

mkdir -p llm_logs

# Remove logs from previous runs (-f so this does not fail when none exist yet)
rm -f llm_logs/*.logx

for MODEL_GGUF in ~/.cache/llama.cpp/*.gguf; do
    MODEL_NAME=$(basename "$MODEL_GGUF" .gguf)
    echo "Benchmarking: $MODEL_GGUF"
    echo "Model name: $MODEL_NAME"

    # Three runs per model so min/avg/max can be computed across runs
    for i in {1..3}; do
        echo "Run $i/3"

        ./build-cuda/bin/llama-bench -m "$MODEL_GGUF" 2>&1 | tee llm_logs/run__${MODEL_NAME}__cuda__${i}__$(date +%Y%m%d_%H%M%S).logx

        ./build-vk/bin/llama-bench -m "$MODEL_GGUF" 2>&1 | tee llm_logs/run__${MODEL_NAME}__vk__${i}__$(date +%Y%m%d_%H%M%S).logx
    done
done
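
For reference, the build-cuda and build-vk binaries used above were built with the standard backend toggles, roughly along these lines (the exact CMake options are an assumption and may have differed slightly):

# Configure and build the CUDA backend and the Vulkan backend into separate
# build trees so the two llama-bench binaries can be compared side by side.
cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-cuda --config Release -j

cmake -B build-vk -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-vk --config Release -j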

Problem description & steps to reproduce

I was comparing the performance of the Vulkan backend against the CUDA backend on an NVIDIA A100 across a variety of models, and found that the CUDA backend outperformed the Vulkan backend by roughly 20-30% on average across the board.

I'm submitting this issue to understand whether this performance differential is expected, and whether there is anything obvious I might be missing that causes the Vulkan backend to perform worse than the CUDA backend.

Thanks in advance!

cc: @jeffbolznv and @0cc4m perhaps

| model | test | cuda avg t/s | vk avg t/s | cuda/vk avg | cuda max t/s | vk max t/s | cuda/vk max | cuda min t/s | vk min t/s | cuda/vk min |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q2_K - Medium | pp512 | 5134.46 | 5427.64 | 0.95 | 5193.05 | 5581.7 | 0.93 | 5021.02 | 5245.92 | 0.96 |
| gemma3 4B Q2_K - Medium | tg128 | 162.14 | 124.29 | 1.3 | 162.42 | 124.81 | 1.3 | 161.87 | 123.73 | 1.31 |
| gpt-oss 20B Q4_K - Medium | pp512 | 3325.09 | 2764.18 | 1.2 | 3332.9 | 2768.36 | 1.2 | 3318.8 | 2761.7 | 1.2 |
| gpt-oss 20B Q4_K - Medium | tg128 | 189.14 | 145.78 | 1.3 | 189.73 | 146.58 | 1.29 | 188.6 | 144.86 | 1.3 |
| llama 1B Q4_K - Medium | pp512 | 17428.96 | 15508.11 | 1.12 | 18070.47 | 16289 | 1.11 | 16420.88 | 14255.95 | 1.15 |
| llama 1B Q4_K - Medium | tg128 | 487.52 | 403.36 | 1.21 | 491.66 | 406.92 | 1.21 | 483.22 | 400.77 | 1.21 |
| llama 3B Q4_K - Medium | pp512 | 8568.79 | 6218.25 | 1.38 | 8599.35 | 6438.53 | 1.34 | 8508.04 | 6018.65 | 1.41 |
| llama 3B Q4_K - Medium | tg128 | 244.75 | 200.71 | 1.22 | 245.51 | 201.82 | 1.22 | 243.81 | 199.52 | 1.22 |
| llama 8B Q4_K - Medium | pp512 | 4462.02 | 2972.28 | 1.5 | 4468.54 | 2991.26 | 1.49 | 4454.99 | 2941.28 | 1.51 |
| llama 8B Q4_K - Medium | tg128 | 150.67 | 115.68 | 1.3 | 151.69 | 116.18 | 1.31 | 149.76 | 115.19 | 1.3 |
| qwen3 14B Q4_K - Medium | pp512 | 2522.71 | 1809.1 | 1.39 | 2523.82 | 1811.49 | 1.39 | 2520.6 | 1806.89 | 1.39 |
| qwen3 14B Q4_K - Medium | tg128 | 87.32 | 70.26 | 1.24 | 87.68 | 70.48 | 1.24 | 87.06 | 70.09 | 1.24 |
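
The avg/min/max columns were computed offline from the "t/s" column of the saved .logx files, using something along the lines of the snippet below (a rough sketch: the field handling assumes llama-bench's markdown table output with a trailing "mean ± stddev" column, and the model name is a placeholder):

# Hypothetical helper: aggregate the tg128 throughput across one model's CUDA runs.
MODEL_NAME="llama-1b"   # placeholder; use whatever basename the benchmark script produced
grep -h "tg128" llm_logs/run__${MODEL_NAME}__cuda__*.logx \
  | awk -F'|' '{ sub(/ *±.*/, "", $(NF-1)); print $(NF-1) }' \
  | awk '{ s += $1; if (min == "" || $1 < min) min = $1; if ($1 > max) max = $1 }
         END { printf "min %.2f  avg %.2f  max %.2f\n", min, s/NR, max }'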

First Bad Commit

Relevant log output

Output of nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA PG509-210               On  |   00000000:04:00.0 Off |                    0 |
| N/A   34C    P0             50W /  330W |       7MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Using Vulkan SDK 1.4.321.1
