Conversation

@jeffbolznv
Collaborator

Allocate pipelines and descriptor sets when requested.

Reallocate the prealloc buffers when needed, and flush any pending work before reallocating.

For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed.
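
As a rough illustration of the pattern (a minimal sketch with hypothetical names such as `VkLikeContext` and `ensure_prealloc`, not the actual ggml-vulkan code): pipelines are created on first use, the shared scratch buffer is grown only after flushing any recorded work, and the byte total measured during one graph execution seeds the allocation for the next, so no separate dryrun pass is needed.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Pipeline { bool compiled = false; };
struct Buffer   { size_t size = 0; };

struct VkLikeContext {
    std::unordered_map<std::string, Pipeline> pipelines; // compiled on demand
    Buffer prealloc;                                      // shared scratch buffer
    size_t last_total_bytes = 0;                          // measured on the previous run
    std::vector<std::string> pending;                     // recorded but not yet submitted work

    Pipeline & get_pipeline(const std::string & name) {
        // allocate the pipeline (and its descriptor sets) only when a node asks for it
        Pipeline & p = pipelines[name];
        if (!p.compiled) {
            p.compiled = true; // stands in for pipeline + descriptor pool creation
        }
        return p;
    }

    void flush() {
        // submit whatever has been recorded so far; pending work may still
        // reference the old scratch buffer, so this must happen before a realloc
        pending.clear();
    }

    void ensure_prealloc(size_t needed) {
        if (needed > prealloc.size) {
            flush();
            prealloc.size = needed; // stands in for freeing and recreating the buffer
        }
    }

    void run_graph(const std::vector<std::pair<std::string, size_t>> & nodes) {
        // seed the scratch buffer from the previous execution instead of doing a dryrun
        ensure_prealloc(last_total_bytes);

        size_t offset = 0;
        for (const auto & node : nodes) {
            get_pipeline(node.first);
            offset += node.second;
            ensure_prealloc(offset); // grow mid-graph if last run's estimate was too small
            pending.push_back(node.first);
        }
        last_total_bytes = offset;   // remember for the next execution
        flush();
    }
};

int main() {
    VkLikeContext ctx;
    std::vector<std::pair<std::string, size_t>> graph = {
        {"mul_mat_q4_k", 1u << 20}, {"rms_norm", 1u << 12},
    };
    ctx.run_graph(graph); // grows the buffer as it goes
    ctx.run_graph(graph); // already sized correctly; no flush-and-reallocate needed
    std::printf("prealloc size: %zu bytes\n", ctx.prealloc.size);
}
```

Growing mid-graph still works as a fallback; it just costs a flush, which is why carrying the previous run's sizes forward handles the common case cheaply.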

The dryrun is a small but consistent overhead where the GPU is idle. I get an average of maybe 1-2% improvement with it removed, though my numbers have been noisy lately.

I didn't totally rip out all the dryrun logic yet; I wanted to keep the diffs smaller to make it clearer how the new code works.

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        237.16 ± 3.72 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        196.74 ± 6.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        128.58 ± 3.47 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       828.66 ± 20.21 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        826.18 ± 8.65 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       380.99 ± 23.22 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       263.97 ± 13.83 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       240.75 ± 14.32 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       292.99 ± 37.92 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       275.93 ± 20.96 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       357.16 ± 16.19 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |       264.03 ± 10.48 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       312.49 ± 20.09 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.71 ± 0.30 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.06 ± 0.16 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        239.51 ± 7.95 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        197.76 ± 8.86 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        128.50 ± 4.10 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       824.48 ± 11.02 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |      635.92 ± 253.03 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       384.39 ± 23.94 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       260.74 ± 22.91 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       240.84 ± 14.85 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       301.72 ± 31.62 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       282.66 ± 21.94 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       368.81 ± 12.07 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        270.71 ± 3.88 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        319.78 ± 3.61 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.07 ± 1.51 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         49.97 ± 0.16 |

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner October 28, 2025 19:21
@github-actions github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 28, 2025
@0cc4m
Collaborator

0cc4m commented Oct 29, 2025

I do have the optimal setup to test this, with a slow server CPU + 3090:

| model                          |       size |     params | ngl | fa |  test | t/s (before)   | t/s (after)    |  diff |
| ------------------------------ | ---------: | ---------: | --: | -: | ----: | -------------: | -------------: | ----: |
| llama 8B Q2_K - Medium         |   2.95 GiB |     8.03 B |  99 |  1 | tg128 |  117.20 ± 0.31 |  119.50 ± 0.24 | +2.0% |
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B |  99 |  1 | tg128 |  122.78 ± 8.91 |  129.57 ± 0.41 | +5.5% |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B |  99 |  1 | tg128 | 142.21 ± 13.40 |  151.45 ± 0.45 | +6.5% |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B |  99 |  1 | tg128 |  139.90 ± 0.39 |  142.20 ± 0.24 | +1.6% |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B |  99 |  1 | tg128 |   93.47 ± 0.31 |   94.05 ± 2.76 | +0.6% |
| qwen3moe 30B.A3B Q2_K - Medium |  10.48 GiB |    30.53 B |  99 |  1 | tg128 | 156.66 ± 13.84 | 159.81 ± 17.27 | +2.0% |
| gpt-oss 20B Q8_0               |  11.27 GiB |    20.91 B |  99 |  1 | tg128 | 156.70 ± 14.05 | 154.48 ± 22.33 | -1.4% |

The test looks good, but it's gonna take me a little while to go through the code.

@0cc4m
Collaborator

0cc4m commented Nov 1, 2025

Can you fix the conflict?

@jeffbolznv
Collaborator Author

Rebased.

Collaborator

@0cc4m 0cc4m left a comment


Okay, I get what this is doing, and the results are positive in my tests. What's your plan now? Clean up the unused code before merge?

@jeffbolznv
Collaborator Author

Sure, I can do that tomorrow.

@jeffbolznv jeffbolznv merged commit ad51c0a into ggml-org:master Nov 4, 2025
66 of 70 checks passed
@Acly
Collaborator

Acly commented Nov 5, 2025

test-thread-safety crashes quite often now for me when I run it with Vulkan and 1 GPU. I'm using the command line that also runs with ctest. It crashes at various points and not every time, but usually because it tries to access some vk_pipeline that has not been initialized or compiled.

The test spawns multiple threads; each creates its own llama_context and independently runs some model. On the Vulkan side this creates one ggml_backend_vk_context per thread, but they can share the same vk_device, which holds a lot of mutable state (like pipelines and queues).

I'm only observing crashes after this PR, but from looking at the code it seems like this should always have been somewhat broken. It almost feels like I must be misunderstanding something...?

One could argue that it doesn't make much sense to run multiple llama_context instances concurrently on the same device; it's probably far better to use some form of batching to improve throughput. But then the test shouldn't test that.
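
For illustration, a minimal stand-alone sketch of the hazard being described, using stand-in types rather than the real vk_device/ggml_backend_vk_context: several per-thread contexts lazily request pipelines from one shared cache, which is only safe if that lazy creation is synchronized. The mutex below marks the kind of guard that would be needed; whether and where ggml-vulkan takes one is exactly the question above.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

struct Pipeline { std::string name; };

// Stand-in for a device object shared by several per-thread backend contexts.
struct SharedDevice {
    std::mutex mtx; // guards the mutable pipeline cache below
    std::unordered_map<std::string, std::shared_ptr<Pipeline>> pipelines;

    std::shared_ptr<Pipeline> get_or_create(const std::string & name) {
        // Without this lock, concurrent lazy creation races on the map insert
        // and a thread can observe a partially constructed entry.
        std::lock_guard<std::mutex> lock(mtx);
        auto & slot = pipelines[name];
        if (!slot) {
            slot = std::make_shared<Pipeline>(Pipeline{name}); // stands in for pipeline compilation
        }
        return slot;
    }
};

int main() {
    SharedDevice device;                // one shared device, like a single vk_device
    std::vector<std::thread> contexts;  // one "context" per thread, like the thread-safety test
    for (int i = 0; i < 4; ++i) {
        contexts.emplace_back([&device] {
            for (int j = 0; j < 10000; ++j) {
                device.get_or_create("mul_mat_q4_k"); // lazy request from every thread
            }
        });
    }
    for (auto & t : contexts) {
        t.join();
    }
    return 0;
}
```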

@jeffbolznv
Collaborator Author

Thanks for the tip. I had seen a test-thread-safety failure in an unrelated CI job and was surprised. I'll try to reproduce it locally.

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Nov 5, 2025
* origin/master: (21 commits)
vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (ggml-org#16919)
examples(gguf): GGUF example outputs (ggml-org#17025)
mtmd: allow QwenVL to process larger image by default (ggml-org#17020)
server : do not default to multiple slots with speculative decoding (ggml-org#17017)
mtmd: improve struct initialization (ggml-org#16981)
docs: Clarify the endpoint that webui uses (ggml-org#17001)
model : add openPangu-Embedded (ggml-org#16941)
ggml webgpu: minor set rows optimization (ggml-org#16810)
sync : ggml
ggml : fix conv2d_dw SVE path (ggml/1380)
CUDA: update ops.md (ggml-org#17005)
opencl: update doc (ggml-org#17011)
refactor: replace sprintf with snprintf for safer string handling in dump functions (ggml-org#16913)
vulkan: remove the need for the dryrun (ggml-org#16826)
server : do context shift only while generating (ggml-org#17000)
readme : update hot topics (ggml-org#17002)
ggml-cpu : bicubic interpolation (ggml-org#16891)
ci : apply model label to models (ggml-org#16994)
chore : fix models indent after refactor (ggml-org#16992)
Fix garbled output with REPACK at high thread counts (ggml-org#16956)
...