vulkan: Optimize SSM_SCAN #16645

jeffbolznv · 2025-10-18T02:55:25Z

I'll leave some inline comments motivating some of the changes.

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 512 -r 30 --prio 1 -m c:\models\granite-4.0-h-tiny-Q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           pp512 |     7110.83 ± 689.95 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           tg128 |        193.79 ± 1.00 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 512 -r 30 --prio 1 -m c:\models\granite-4.0-h-tiny-Q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           pp512 |      3862.00 ± 19.50 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           tg128 |        135.66 ± 0.80 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 512 -r 30 --prio 1 -m c:\models\granite-4.0-h-tiny-Q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           pp512 |     9851.65 ± 131.68 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           tg128 |        193.69 ± 1.00 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 512 -r 30 --prio 1 -m c:\models\granite-4.0-h-tiny-Q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           pp512 |      4488.07 ± 63.72 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |  1 |           tg128 |        136.61 ± 0.88 |

CC @giuseppe
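
To make the before/after tables easier to compare, here is a small sketch that turns the reported throughput into speedup ratios. The numbers are copied from the llama-bench output above; the dictionary layout and the speedup helper are illustrative only and are not part of llama-bench.

```python
# Throughput in tokens/s, copied from the llama-bench tables above
# as (before, after) pairs.
results = {
    ("RTX 5090", "pp512"): (7110.83, 9851.65),
    ("RTX 5090", "tg128"): (193.79, 193.69),
    ("RTX 4070", "pp512"): (3862.00, 4488.07),
    ("RTX 4070", "tg128"): (135.66, 136.61),
}


def speedup(before, after):
    """Ratio of new to old throughput; values above 1.0 mean faster."""
    return after / before


speedups = {key: speedup(b, a) for key, (b, a) in results.items()}

for (gpu, test), ratio in speedups.items():
    print(f"{gpu} {test}: {ratio:.3f}x")
```

Prompt processing (pp512) comes out at roughly 1.39x on the RTX 5090 and 1.16x on the RTX 4070, while token generation (tg128) stays within about 1% on both. The 5090 "before" pp512 run has a large spread (± 689.95), so that ratio is the least precise of the four.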

jeffbolznv · 2025-10-18T02:56:53Z

ggml/src/ggml-vulkan/vulkan-shaders/ssm_scan.comp

-                if (k < SPLIT_H * D_STATE && (k + (w >> 1)) < SPLIT_H * D_STATE) {
-                    stateC[k] += stateC[k + (w >> 1)];
+        [[unroll]]
+        for (uint w = D_STATE / 2; w >= SUBGROUP_SIZE; w >>= 1) {


Shifted the values of w down by a factor of 2, rather than using w>>1 everywhere. The repeated w>>1 seemed to be causing some weird code to be generated.
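
As a rough CPU-side illustration of why this change is behavior-preserving, the sketch below runs both loop shapes over a single row of values in Python. The new loop header is taken from the diff. The old loop header is not shown in the hunk, so the `w = D_STATE; w > SUBGROUP_SIZE` form used here is an assumption, and the function names are hypothetical.

```python
def reduce_old(row, d_state, subgroup_size):
    """Assumed pre-change form: w is the active width, the offset is w >> 1."""
    s = list(row)
    w = d_state
    while w > subgroup_size:
        half = w >> 1
        for k in range(half):
            s[k] += s[k + half]
        w >>= 1
    return s


def reduce_new(row, d_state, subgroup_size):
    """Post-change form from the diff: w itself is the offset."""
    s = list(row)
    w = d_state // 2
    while w >= subgroup_size:
        for k in range(w):
            s[k] += s[k + w]
        w >>= 1
    return s


row = list(range(128))
old = reduce_old(row, 128, 32)
new = reduce_new(row, 128, 32)
```

Both versions visit the same offsets (64, then 32 for these sizes) and leave the partial sums in the first SUBGROUP_SIZE entries, so the only difference is how the offset is expressed to the compiler.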

jeffbolznv · 2025-10-18T02:59:03Z

ggml/src/ggml-vulkan/vulkan-shaders/ssm_scan.comp

            barrier();
        }

-        [[unroll]] for (uint j = 0; j <= SPLIT_H / (D_STATE / SUBGROUP_SIZE); j++) {


This was one too many iterations most of the time, leading to extra subgroup ops or barriers. But when D_STATE / SUBGROUP_SIZE is greater than SPLIT_H, we need at least one iteration.

thanks for fixing this one!
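
The off-by-one can be checked with a few lines of integer arithmetic. The old bound below is the one from the removed line. The replacement bound is not visible in this hunk, so `max(1, ...)` is only a guess at a form that matches the description, and the concrete values (SPLIT_H = 16, D_STATE of 128 or 256) are assumptions chosen for illustration.

```python
def iters_old(split_h, d_state, subgroup_size):
    # Bound from the removed line: j <= SPLIT_H / (D_STATE / SUBGROUP_SIZE)
    return split_h // (d_state // subgroup_size) + 1


def iters_needed(split_h, d_state, subgroup_size):
    # Hypothetical replacement: the exact quotient, clamped to at least one pass.
    return max(1, split_h // (d_state // subgroup_size))


# Typical case: D_STATE / SUBGROUP_SIZE = 4 divides SPLIT_H = 16.
typical = (iters_old(16, 128, 32), iters_needed(16, 128, 32))

# Small subgroups: D_STATE / SUBGROUP_SIZE = 32 is greater than SPLIT_H = 16.
small_subgroup = (iters_old(16, 256, 8), iters_needed(16, 256, 8))
```

In the typical case the old bound runs 5 passes where 4 are enough. In the small-subgroup case the quotient is 0, so without the clamp the loop would not run at all.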

jeffbolznv · 2025-10-18T03:00:05Z

ggml/src/ggml-vulkan/vulkan-shaders/ssm_scan.comp

-                if (idx + offset < SPLIT_H * D_STATE) {
-                    stateC[idx] += stateC[idx + offset];
+            if (idx < SPLIT_H * D_STATE ||
+                max_idx < SPLIT_H * D_STATE) {


This max_idx comparison should fold away and avoid the need for the branch most of the time.

just curious, is this needed to get the subgroup to run the same code?
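
One way to see why the max_idx comparison can fold away is to enumerate the indices on the CPU. The index layout below (one subgroup per row, a workgroup of D_STATE invocations, SPLIT_H = 16) is a hypothetical stand-in, not copied from the shader, and max_idx is found by brute force rather than by the shader's own expression.

```python
def lane_indices(j, d_state, subgroup_size):
    """idx for every invocation in pass j, under the assumed layout."""
    rows_per_pass = d_state // subgroup_size
    return [
        (tid % subgroup_size)
        + d_state * (tid // subgroup_size)
        + j * d_state * rows_per_pass
        for tid in range(d_state)  # assumes D_STATE invocations per workgroup
    ]


def guard_is_constant(j, split_h, d_state, subgroup_size):
    """True when max_idx < SPLIT_H * D_STATE, so the guard passes for every lane."""
    limit = split_h * d_state
    max_idx = max(lane_indices(j, d_state, subgroup_size))
    return max_idx < limit


# D_STATE = 128, SUBGROUP_SIZE = 32: the guard is true in all four passes.
always_true = all(guard_is_constant(j, 16, 128, 32) for j in range(4))

# D_STATE = 256, SUBGROUP_SIZE = 8: some lanes fall outside, so the
# per-invocation idx check still matters.
needs_lane_check = not guard_is_constant(0, 16, 256, 8)
in_range = [i for i in lane_indices(0, 256, 8) if i < 16 * 256]
```

Because max_idx depends only on j and compile-time constants, an unrolled loop lets the compiler evaluate the second comparison ahead of time. When it is true, the whole condition is true and no branch is needed. Only in the small-subgroup case does the per-invocation comparison decide anything.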

giuseppe

LGTM

I had started working on the subgroupAdd optimization, but I got stuck trying to get it to work on the Intel Arc GPU (I’m not yet sure why it behaved differently there than on an NVIDIA GPU). I even suspected there might be an issue with the driver. Your version works well there too, so the issue was definitely in my version. So thanks for beating me to it :-) I will take the chance to compare the two versions and understand what I was doing wrong.

0cc4m · 2025-10-18T14:01:48Z

@giuseppe The usual subgroup issue with Intel is about subgroup size (which can be between 8 and 32 for Intel) and forcing full subgroups. You probably need to do the latter to get the subgroup functions to work, and if the shader expects a specific subgroup size, force that too. For performance on Intel, a subgroup size of 16 has been best in my experience.

0cc4m

LGTM

vulkan: Optimize SSM_SCAN

f6d93d0

jeffbolznv requested a review from 0cc4m as a code owner October 18, 2025 02:55

github-actions bot added the Vulkan and ggml labels Oct 18, 2025

giuseppe approved these changes Oct 18, 2025

0cc4m approved these changes Oct 25, 2025

0cc4m merged commit 8423d01 into ggml-org:master Oct 25, 2025
69 of 70 checks passed

pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 25, 2025

vulkan: Optimize SSM_SCAN (ggml-org#16645)

3ba6d71
