Conversation

@jeffbolznv
Collaborator

Allocate pipelines and descriptor sets when requested.

Reallocate the prealloc buffers when needed, and flush any pending work before reallocating.

For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed.
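
As a rough illustration of the pattern (a minimal sketch with hypothetical names such as `VkLikeContext` and `ensure_prealloc`, not the actual ggml-vulkan code): pipelines are created on first use, the shared scratch buffer is grown only after flushing any recorded work, and the byte total measured during one graph execution seeds the allocation for the next, so no separate dryrun pass is needed.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Pipeline { bool compiled = false; };
struct Buffer   { size_t size = 0; };

struct VkLikeContext {
    std::unordered_map<std::string, Pipeline> pipelines; // compiled on demand
    Buffer prealloc;                                      // shared scratch buffer
    size_t last_total_bytes = 0;                          // measured on the previous run
    std::vector<std::string> pending;                     // recorded but not yet submitted work

    Pipeline & get_pipeline(const std::string & name) {
        // allocate the pipeline (and its descriptor sets) only when a node asks for it
        Pipeline & p = pipelines[name];
        if (!p.compiled) {
            p.compiled = true; // stands in for pipeline + descriptor pool creation
        }
        return p;
    }

    void flush() {
        // submit whatever has been recorded so far; pending work may still
        // reference the old scratch buffer, so this must happen before a realloc
        pending.clear();
    }

    void ensure_prealloc(size_t needed) {
        if (needed > prealloc.size) {
            flush();
            prealloc.size = needed; // stands in for freeing and recreating the buffer
        }
    }

    void run_graph(const std::vector<std::pair<std::string, size_t>> & nodes) {
        // seed the scratch buffer from the previous execution instead of doing a dryrun
        ensure_prealloc(last_total_bytes);

        size_t offset = 0;
        for (const auto & node : nodes) {
            get_pipeline(node.first);
            offset += node.second;
            ensure_prealloc(offset); // grow mid-graph if last run's estimate was too small
            pending.push_back(node.first);
        }
        last_total_bytes = offset;   // remember for the next execution
        flush();
    }
};

int main() {
    VkLikeContext ctx;
    std::vector<std::pair<std::string, size_t>> graph = {
        {"mul_mat_q4_k", 1u << 20}, {"rms_norm", 1u << 12},
    };
    ctx.run_graph(graph); // grows the buffer as it goes
    ctx.run_graph(graph); // already sized correctly; no flush-and-reallocate needed
    std::printf("prealloc size: %zu bytes\n", ctx.prealloc.size);
}
```

Growing mid-graph still works as a fallback; it just costs a flush, which is why carrying the previous run's sizes forward handles the common case cheaply.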

The dryrun is a small but consistent overhead where the GPU is idle. I get an average of maybe 1-2% improvement with it removed, though my numbers have been noisy lately.

I didn't totally rip out all the dryrun logic yet; I wanted to keep the diffs smaller to make it clearer how the new code works.

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        237.16 ± 3.72 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        196.74 ± 6.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        128.58 ± 3.47 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       828.66 ± 20.21 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        826.18 ± 8.65 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       380.99 ± 23.22 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       263.97 ± 13.83 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       240.75 ± 14.32 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       292.99 ± 37.92 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       275.93 ± 20.96 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       357.16 ± 16.19 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |       264.03 ± 10.48 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       312.49 ± 20.09 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.71 ± 0.30 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.06 ± 0.16 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        239.51 ± 7.95 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        197.76 ± 8.86 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        128.50 ± 4.10 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       824.48 ± 11.02 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |      635.92 ± 253.03 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       384.39 ± 23.94 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       260.74 ± 22.91 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       240.84 ± 14.85 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       301.72 ± 31.62 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       282.66 ± 21.94 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       368.81 ± 12.07 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        270.71 ± 3.88 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        319.78 ± 3.61 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.07 ± 1.51 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         49.97 ± 0.16 |

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner October 28, 2025 19:21
@github-actions github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 28, 2025
@0cc4m
Collaborator

0cc4m commented Oct 29, 2025

I do have the optimal setup to test this, with a slow server CPU + 3090:

| model                          |       size |     params | ngl | fa |  test | t/s (before)   | t/s (after)    |  diff |
| ------------------------------ | ---------: | ---------: | --: | -: | ----: | -------------: | -------------: | ----: |
| llama 8B Q2_K - Medium         |   2.95 GiB |     8.03 B |  99 |  1 | tg128 |  117.20 ± 0.31 |  119.50 ± 0.24 | +2.0% |
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B |  99 |  1 | tg128 |  122.78 ± 8.91 |  129.57 ± 0.41 | +5.5% |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B |  99 |  1 | tg128 | 142.21 ± 13.40 |  151.45 ± 0.45 | +6.5% |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B |  99 |  1 | tg128 |  139.90 ± 0.39 |  142.20 ± 0.24 | +1.6% |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B |  99 |  1 | tg128 |   93.47 ± 0.31 |   94.05 ± 2.76 | +0.6% |
| qwen3moe 30B.A3B Q2_K - Medium |  10.48 GiB |    30.53 B |  99 |  1 | tg128 | 156.66 ± 13.84 | 159.81 ± 17.27 | +2.0% |
| gpt-oss 20B Q8_0               |  11.27 GiB |    20.91 B |  99 |  1 | tg128 | 156.70 ± 14.05 | 154.48 ± 22.33 | -1.4% |

The test looks good, but it's gonna take me a little while to go through the code.

@0cc4m
Collaborator

0cc4m commented Nov 1, 2025

Can you fix the conflict?

@jeffbolznv
Collaborator Author

Rebased.

Collaborator

@0cc4m 0cc4m left a comment


Okay, I get what this is doing, and the results are positive in my tests. What's your plan now? Clean up the unused code before merge?

@jeffbolznv
Collaborator Author

Sure, I can do that tomorrow.

@jeffbolznv jeffbolznv merged commit ad51c0a into ggml-org:master Nov 4, 2025
66 of 70 checks passed
@Acly
Collaborator

Acly commented Nov 5, 2025

test-thread-safety crashes quite often now for me when I run it with Vulkan and 1 GPU. I'm using the command line that also runs with ctest. It crashes at various points and not every time, but usually because it tries to access some vk_pipeline that has not been initialized or compiled.

The test spawns multiple threads; each creates its own llama_context and independently runs some model. On the Vulkan side this creates one ggml_backend_vk_context per thread, but they can share the same vk_device, which holds a lot of mutable state (like pipelines and queues).

I'm only observing crashes after this PR, but from looking at the code it seems like this should always have been somewhat broken. It almost feels like I must be misunderstanding something...?

One could argue that it doesn't make much sense to run multiple llama_context instances concurrently on the same device; it's probably far better to use some form of batching to improve throughput. But then the test shouldn't test that.
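
For illustration, a minimal stand-alone sketch of the hazard being described, using stand-in types rather than the real vk_device/ggml_backend_vk_context: several per-thread contexts lazily request pipelines from one shared cache, which is only safe if that lazy creation is synchronized. The mutex below marks the kind of guard that would be needed; whether and where ggml-vulkan takes one is exactly the question above.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

struct Pipeline { std::string name; };

// Stand-in for a device object shared by several per-thread backend contexts.
struct SharedDevice {
    std::mutex mtx; // guards the mutable pipeline cache below
    std::unordered_map<std::string, std::shared_ptr<Pipeline>> pipelines;

    std::shared_ptr<Pipeline> get_or_create(const std::string & name) {
        // Without this lock, concurrent lazy creation races on the map insert
        // and a thread can observe a partially constructed entry.
        std::lock_guard<std::mutex> lock(mtx);
        auto & slot = pipelines[name];
        if (!slot) {
            slot = std::make_shared<Pipeline>(Pipeline{name}); // stands in for pipeline compilation
        }
        return slot;
    }
};

int main() {
    SharedDevice device;                // one shared device, like a single vk_device
    std::vector<std::thread> contexts;  // one "context" per thread, like the thread-safety test
    for (int i = 0; i < 4; ++i) {
        contexts.emplace_back([&device] {
            for (int j = 0; j < 10000; ++j) {
                device.get_or_create("mul_mat_q4_k"); // lazy request from every thread
            }
        });
    }
    for (auto & t : contexts) {
        t.join();
    }
    return 0;
}
```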

@jeffbolznv
Collaborator Author

Thanks for the tip. I had seen a test-thread-safety failure in an unrelated CI job and was surprised. I'll try to reproduce it locally.

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Nov 5, 2025
* origin/master: (21 commits)
vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (ggml-org#16919)
examples(gguf): GGUF example outputs (ggml-org#17025)
mtmd: allow QwenVL to process larger image by default (ggml-org#17020)
server : do not default to multiple slots with speculative decoding (ggml-org#17017)
mtmd: improve struct initialization (ggml-org#16981)
docs: Clarify the endpoint that webui uses (ggml-org#17001)
model : add openPangu-Embedded (ggml-org#16941)
ggml webgpu: minor set rows optimization (ggml-org#16810)
sync : ggml
ggml : fix conv2d_dw SVE path (ggml/1380)
CUDA: update ops.md (ggml-org#17005)
opencl: update doc (ggml-org#17011)
refactor: replace sprintf with snprintf for safer string handling in dump functions (ggml-org#16913)
vulkan: remove the need for the dryrun (ggml-org#16826)
server : do context shift only while generating (ggml-org#17000)
readme : update hot topics (ggml-org#17002)
ggml-cpu : bicubic interpolation (ggml-org#16891)
ci : apply model label to models (ggml-org#16994)
chore : fix models indent after refactor (ggml-org#16992)
Fix garbled output with REPACK at high thread counts (ggml-org#16956)
...