Name and Version
$ build-vk/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA PG509-210 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
version: 7052 (389ac78b2)
built with cc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-11) for x86_64-redhat-linux
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench
Command line
#!/bin/bash
mkdir -p llm_logs
rm -f llm_logs/*.logx    # -f: don't fail when there are no old logs
for MODEL_GGUF in ~/.cache/llama.cpp/*.gguf; do
    MODEL_NAME=$(basename "$MODEL_GGUF" .gguf)
    echo "Benchmarking: $MODEL_GGUF"
    echo "Model name: $MODEL_NAME"
    for i in {1..3}; do
        echo "Run $i/3"
        # CUDA build, then Vulkan build; the run index goes into each log file name
        ./build-cuda/bin/llama-bench -m "$MODEL_GGUF" 2>&1 | tee "llm_logs/run__${MODEL_NAME}__cuda__${i}__$(date +%Y%m%d_%H%M%S).logx"
        ./build-vk/bin/llama-bench -m "$MODEL_GGUF" 2>&1 | tee "llm_logs/run__${MODEL_NAME}__vk__${i}__$(date +%Y%m%d_%H%M%S).logx"
    done
done
Problem description & steps to reproduce
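I was comparing the performance of the Vulkan backend against the CUDA backend on an NVIDIA A100 across a variety of models. Token generation (tg128) was roughly 20-30% faster with CUDA for every model tested, and prompt processing (pp512) was 12-50% faster with CUDA for every model except gemma3 4B, where Vulkan was slightly faster (CUDA/Vulkan ratio 0.95).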
I'm submitting this issue to understand whether this performance difference is expected, and whether there is anything obvious I might be missing that causes the Vulkan backend to perform worse than the CUDA backend.
Thanks in advance!
cc: @jeffbolznv and @0cc4m perhaps
| model | test | CUDA avg (t/s) | Vulkan avg (t/s) | CUDA/Vulkan avg | CUDA max (t/s) | Vulkan max (t/s) | CUDA/Vulkan max | CUDA min (t/s) | Vulkan min (t/s) | CUDA/Vulkan min |
|---|---|---|---|---|---|---|---|---|---|---|
| gemma3 4B Q2_K - Medium | pp512 | 5134.46 | 5427.64 | 0.95 | 5193.05 | 5581.7 | 0.93 | 5021.02 | 5245.92 | 0.96 |
| gemma3 4B Q2_K - Medium | tg128 | 162.14 | 124.29 | 1.3 | 162.42 | 124.81 | 1.3 | 161.87 | 123.73 | 1.31 |
| gpt-oss 20B Q4_K - Medium | pp512 | 3325.09 | 2764.18 | 1.2 | 3332.9 | 2768.36 | 1.2 | 3318.8 | 2761.7 | 1.2 |
| gpt-oss 20B Q4_K - Medium | tg128 | 189.14 | 145.78 | 1.3 | 189.73 | 146.58 | 1.29 | 188.6 | 144.86 | 1.3 |
| llama 1B Q4_K - Medium | pp512 | 17428.96 | 15508.11 | 1.12 | 18070.47 | 16289 | 1.11 | 16420.88 | 14255.95 | 1.15 |
| llama 1B Q4_K - Medium | tg128 | 487.52 | 403.36 | 1.21 | 491.66 | 406.92 | 1.21 | 483.22 | 400.77 | 1.21 |
| llama 3B Q4_K - Medium | pp512 | 8568.79 | 6218.25 | 1.38 | 8599.35 | 6438.53 | 1.34 | 8508.04 | 6018.65 | 1.41 |
| llama 3B Q4_K - Medium | tg128 | 244.75 | 200.71 | 1.22 | 245.51 | 201.82 | 1.22 | 243.81 | 199.52 | 1.22 |
| llama 8B Q4_K - Medium | pp512 | 4462.02 | 2972.28 | 1.5 | 4468.54 | 2991.26 | 1.49 | 4454.99 | 2941.28 | 1.51 |
| llama 8B Q4_K - Medium | tg128 | 150.67 | 115.68 | 1.3 | 151.69 | 116.18 | 1.31 | 149.76 | 115.19 | 1.3 |
| qwen3 14B Q4_K - Medium | pp512 | 2522.71 | 1809.1 | 1.39 | 2523.82 | 1811.49 | 1.39 | 2520.6 | 1806.89 | 1.39 |
| qwen3 14B Q4_K - Medium | tg128 | 87.32 | 70.26 | 1.24 | 87.68 | 70.48 | 1.24 | 87.06 | 70.09 | 1.24 |
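Each row reports the average, maximum, and minimum throughput (presumably across the three runs of the script) for each backend, plus the CUDA/Vulkan ratio of each statistic, where a value above 1 means CUDA is faster. The issue does not say how the table was produced, so the following is only a minimal bash sketch of that bookkeeping. The helper names `extract_tps`, `stats`, and `ratio` are hypothetical (not part of llama.cpp), and the parsing step assumes llama-bench's default markdown output, where the last two columns are `test` and `t/s` and t/s is printed as `<mean> ± <stddev>`.

```bash
#!/bin/bash
# Hypothetical helpers (not part of llama.cpp) for summarizing the .logx files
# written by the benchmark script above.

# extract_tps TEST FILE...
# Prints the mean t/s of every table row whose "test" cell equals TEST.
# ASSUMPTION: llama-bench's default markdown table ends with "| test | t/s |"
# and t/s looks like "<mean> ± <stddev>". Non-matching rows and non-table
# lines captured via 2>&1 (e.g. ggml_vulkan device info) are ignored.
extract_tps() {
    local test_name=$1
    shift
    awk -F'|' -v t="$test_name" '
        NF >= 4 {
            name = $(NF - 2)
            gsub(/^ +| +$/, "", name)          # trim the padding around the cell
            if (name == t) {
                split($(NF - 1), parts, " ")   # "<mean> ± <stddev>" -> parts[1] = mean
                print parts[1]
            }
        }' "$@"
}

# stats VALUE... -> prints "avg max min", each rounded to two decimals
stats() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '
        {
            v = $1 + 0
            sum += v
            if (NR == 1 || v > max) max = v
            if (NR == 1 || v < min) min = v
        }
        END { printf "%.2f %.2f %.2f\n", sum / NR, max, min }'
}

# ratio CUDA VULKAN -> prints CUDA/VULKAN to two decimals (> 1 means CUDA is faster)
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { if (b + 0 == 0) exit 1; printf "%.2f\n", a / b }'
}
```

For example, `ratio 162.14 124.29` prints `1.30`, matching the tg128 average ratio listed for gemma3 4B, and `stats $(extract_tps tg128 llm_logs/run__"${MODEL_NAME}"__cuda__*.logx)` would summarize one model's CUDA tg128 runs. Using llama-bench's machine-readable output modes instead of scraping the markdown table would avoid the format assumption.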
First Bad Commit
Relevant log output
Output of nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA PG509-210 On | 00000000:04:00.0 Off | 0 |
| N/A 34C P0 50W / 330W | 7MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Using Vulkan SDK 1.4.321.1