
Performance gap between Vulkan Backend and CUDA Backend on NVIDIA A100 #17273

@SS-JIA

Description


Name and Version

$ build-vk/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA PG509-210 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
version: 7052 (389ac78b2)
built with cc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-11) for x86_64-redhat-linux

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-bench

Command line

#!/bin/bash

mkdir -p llm_logs

# Remove logs from previous runs (-f so this does not fail when none exist yet)
rm -f llm_logs/*.logx

for MODEL_GGUF in ~/.cache/llama.cpp/*.gguf; do
    MODEL_NAME=$(basename "$MODEL_GGUF" .gguf)
    echo "Benchmarking: $MODEL_GGUF"
    echo "Model name: $MODEL_NAME"

    # Three runs per model so min/avg/max can be computed across runs
    for i in {1..3}; do
        echo "Run $i/3"

        ./build-cuda/bin/llama-bench -m "$MODEL_GGUF" 2>&1 | tee llm_logs/run__${MODEL_NAME}__cuda__${i}__$(date +%Y%m%d_%H%M%S).logx

        ./build-vk/bin/llama-bench -m "$MODEL_GGUF" 2>&1 | tee llm_logs/run__${MODEL_NAME}__vk__${i}__$(date +%Y%m%d_%H%M%S).logx
    done
done
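
For reference, the build-cuda and build-vk binaries used above were built with the standard backend toggles, roughly along these lines (the exact CMake options are an assumption and may have differed slightly):

# Configure and build the CUDA backend and the Vulkan backend into separate
# build trees so the two llama-bench binaries can be compared side by side.
cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-cuda --config Release -j

cmake -B build-vk -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-vk --config Release -j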

Problem description & steps to reproduce

I was comparing the performance of the Vulkan backend against the CUDA backend on an NVIDIA A100 across a variety of models, and found that the CUDA backend outperformed the Vulkan backend by roughly 20-30% on average across the board.

I'm submitting this issue to understand whether this performance differential is expected, and whether there is anything obvious I might be missing that causes the Vulkan backend to perform worse than the CUDA backend.

Thanks in advance!

cc: @jeffbolznv and @0cc4m perhaps

| model | test | cuda avg t/s | vk avg t/s | cuda/vk avg | cuda max t/s | vk max t/s | cuda/vk max | cuda min t/s | vk min t/s | cuda/vk min |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q2_K - Medium | pp512 | 5134.46 | 5427.64 | 0.95 | 5193.05 | 5581.7 | 0.93 | 5021.02 | 5245.92 | 0.96 |
| gemma3 4B Q2_K - Medium | tg128 | 162.14 | 124.29 | 1.3 | 162.42 | 124.81 | 1.3 | 161.87 | 123.73 | 1.31 |
| gpt-oss 20B Q4_K - Medium | pp512 | 3325.09 | 2764.18 | 1.2 | 3332.9 | 2768.36 | 1.2 | 3318.8 | 2761.7 | 1.2 |
| gpt-oss 20B Q4_K - Medium | tg128 | 189.14 | 145.78 | 1.3 | 189.73 | 146.58 | 1.29 | 188.6 | 144.86 | 1.3 |
| llama 1B Q4_K - Medium | pp512 | 17428.96 | 15508.11 | 1.12 | 18070.47 | 16289 | 1.11 | 16420.88 | 14255.95 | 1.15 |
| llama 1B Q4_K - Medium | tg128 | 487.52 | 403.36 | 1.21 | 491.66 | 406.92 | 1.21 | 483.22 | 400.77 | 1.21 |
| llama 3B Q4_K - Medium | pp512 | 8568.79 | 6218.25 | 1.38 | 8599.35 | 6438.53 | 1.34 | 8508.04 | 6018.65 | 1.41 |
| llama 3B Q4_K - Medium | tg128 | 244.75 | 200.71 | 1.22 | 245.51 | 201.82 | 1.22 | 243.81 | 199.52 | 1.22 |
| llama 8B Q4_K - Medium | pp512 | 4462.02 | 2972.28 | 1.5 | 4468.54 | 2991.26 | 1.49 | 4454.99 | 2941.28 | 1.51 |
| llama 8B Q4_K - Medium | tg128 | 150.67 | 115.68 | 1.3 | 151.69 | 116.18 | 1.31 | 149.76 | 115.19 | 1.3 |
| qwen3 14B Q4_K - Medium | pp512 | 2522.71 | 1809.1 | 1.39 | 2523.82 | 1811.49 | 1.39 | 2520.6 | 1806.89 | 1.39 |
| qwen3 14B Q4_K - Medium | tg128 | 87.32 | 70.26 | 1.24 | 87.68 | 70.48 | 1.24 | 87.06 | 70.09 | 1.24 |
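
The avg/min/max columns were computed offline from the "t/s" column of the saved .logx files, using something along the lines of the snippet below (a rough sketch: the field handling assumes llama-bench's markdown table output with a trailing "mean ± stddev" column, and the model name is a placeholder):

# Hypothetical helper: aggregate the tg128 throughput across one model's CUDA runs.
MODEL_NAME="llama-1b"   # placeholder; use whatever basename the benchmark script produced
grep -h "tg128" llm_logs/run__${MODEL_NAME}__cuda__*.logx \
  | awk -F'|' '{ sub(/ *±.*/, "", $(NF-1)); print $(NF-1) }' \
  | awk '{ s += $1; if (min == "" || $1 < min) min = $1; if ($1 > max) max = $1 }
         END { printf "min %.2f  avg %.2f  max %.2f\n", min, s/NR, max }'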

First Bad Commit

Relevant log output

Output of nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA PG509-210               On  |   00000000:04:00.0 Off |                    0 |
| N/A   34C    P0             50W /  330W |       7MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Using Vulkan SDK 1.4.321.1
