Conversation

@giuseppe giuseppe commented Oct 7, 2025

Implement SSM scan and SSM conv for Vulkan.

Intel Arc:

ggml_vulkan: 0 = Intel(R) Arc(tm) Graphics (MTL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model                 |     size | params | backend | ngl | dev     |  test |         t/s |
| --------------------- | -------: | -----: | ------- | --: | ------- | ----: | ----------: |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | Vulkan0 | pp512 | 5.27 ± 0.00 |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | Vulkan0 | tg128 | 2.33 ± 0.00 |

build: bc07349 (6756)

ggml_vulkan: 0 = Intel(R) Arc(tm) Graphics (MTL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model                 |     size | params | backend | ngl | dev     |  test |           t/s |
| --------------------- | -------: | -----: | ------- | --: | ------- | ----: | ------------: |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | Vulkan0 | pp512 | 213.36 ± 0.00 |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | Vulkan0 | tg128 |  15.46 ± 0.00 |

build: a4d94598e (6769)

NVIDIA:

ggml_vulkan: 0 = NVIDIA L40S (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat

| model                 |     size | params | backend | ngl |  test |            t/s |
| --------------------- | -------: | -----: | ------- | --: | ----: | -------------: |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | pp512 | 2042.60 ± 5.26 |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | tg128 |   39.57 ± 0.05 |

build: 554fd57 (6766)

ggml_vulkan: 0 = NVIDIA L40S (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat

| model                 |     size | params | backend | ngl |  test |             t/s |
| --------------------- | -------: | -----: | ------- | --: | ----: | --------------: |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | pp512 | 5736.47 ± 24.70 |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | tg128 |   140.11 ± 0.32 |

build: a4d94598e (6769)

@github-actions github-actions bot added the Vulkan and ggml labels Oct 7, 2025
@giuseppe giuseppe marked this pull request as ready for review October 7, 2025 14:05
@giuseppe giuseppe requested a review from 0cc4m as a code owner October 7, 2025 14:05
@jeffbolznv jeffbolznv left a comment

Thanks for this contribution!

warp_sdata[warp_offset + lane] = val;
barrier();

if (lane < 16) warp_sdata[warp_offset + lane] += warp_sdata[warp_offset + lane + 16];
Collaborator

This seems like it's assuming a subgroup size of 32 (also at line 37).

Collaborator

Do I understand correctly that this doesn't actually rely on a subgroup size of 32, but it's splitting the workgroup into groups of 32 and just reducing those (and it looks like some reduction across groups of 32 has already happened?).

Contributor Author

Sorry, I missed this one. Yeah, I don't think it would work with a size != 32; I need to think this one through some more.

Do you have any suggestions on what I could do here?

Collaborator

I think this may work because you're not relying on SubgroupInvocationId or SubgroupID, you've just split the workgroup into groups of 32. Maybe we can just test it on AMD (with wave64) and Intel and verify that it works.

Contributor Author

It works on Intel, but I am worried about all these settings that we made configurable. I haven't really tried how it behaves with different values of the constants we defined. Or is the assumption that these values should not be tweaked in vulkan-shaders-gen.cpp without also changing the implementation in the shader?

Collaborator

You could change this to a while loop that will handle any power-of-two value of WARP_SIZE. We do want to allow the spec constants to be changeable but it's fine to have limitations like "must be a power of two".
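
For illustration, a minimal sketch of the kind of loop being suggested, assuming warp_sdata is the shared float array and WARP_SIZE is a power-of-two specialization constant (names taken from the quoted snippet):

```glsl
// Tree reduction over one group of WARP_SIZE threads in shared memory.
// The loop bound is uniform, so the barrier stays in uniform control flow.
uint offset = WARP_SIZE / 2;
while (offset > 0) {
    if (lane < offset) {
        warp_sdata[warp_offset + lane] += warp_sdata[warp_offset + lane + offset];
    }
    barrier();
    offset /= 2;
}
// warp_sdata[warp_offset] now holds the sum of the group's values.
```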

@netrunnereve netrunnereve Oct 8, 2025

Hmm, wave64 AMD and wave8 llvmpipe are failing one test here, possibly due to this. All other tests are passing.

[SSM_SCAN] NMSE = 31335529439335960.000000000 > 0.000000100   SSM_SCAN(type=f32,d_state=16,head_dim=1,n_head=1024,n_group=1,n_seq_tokens=32,n_seqs=4): FAIL

I think Intel also has a subgroup size of 32 so it wouldn't be a good test for this.

return warp_sdata[warp_offset];
}

void main() {
Collaborator

Do all threads always load/store in bounds? In the host code there was some rounding up going on, which suggests maybe some threads don't correspond to in-bounds locations.

Collaborator

Ping on this one. I don't really understand what this shader does and which locations it should be accessing.

Contributor Author

I've tried to follow what the CUDA shader does. I'll spend more time on it and see if there is anything I can improve about memory access and make sure all the assumptions in the code are checked.


giuseppe commented Oct 8, 2025

I've addressed the comments and pushed a new version. The results are even better now:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA L4 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | ---: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           pp512 |      2569.27 ± 19.37 |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           tg128 |         80.77 ± 0.12 |


string_to_spv("ssm_scan_f32_d16", "ssm_scan.comp", {{"A_TYPE", "float"}});
string_to_spv("ssm_scan_f32_d128", "ssm_scan.comp", {{"A_TYPE", "float"}});
string_to_spv("ssm_scan_f32_d256", "ssm_scan.comp", {{"A_TYPE", "float"}});
Collaborator

These three are all identical now, you only need one.


giuseppe commented Oct 9, 2025

I've completely replaced the reduce-sum code with subgroupAdd(), as sum.comp does. It didn't work earlier because I had to use gl_SubgroupInvocationID == 0 instead of % WARP_SIZE == 0.

In the last version I've also renamed WARP_SIZE to SUBGROUP_SIZE.
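
For reference, a minimal sketch of that kind of subgroup reduction, assuming the subgroup extensions are available; the partial_sums buffer name is illustrative, not the shader's actual variable:

```glsl
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable

shared float partial_sums[64];          // illustrative; sized for the number of subgroups

// ... inside main(), y holds this invocation's value ...
float sum = subgroupAdd(y);             // hardware reduction across the subgroup
if (gl_SubgroupInvocationID == 0) {     // one lane per subgroup writes the result
    partial_sums[gl_SubgroupID] = sum;
}
```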


0cc4m commented Oct 9, 2025

Be aware that not all devices support subgroup commands. If there's a performance advantage to using them, you can do that, but it would still need a fallback to using a shared memory reduction. If there isn't a performance advantage, just use the shared memory reduction for compatibility.


giuseppe commented Oct 9, 2025

> Be aware that not all devices support subgroup commands. If there's a performance advantage to using them, you can do that, but it would still need a fallback to using a shared memory reduction. If there isn't a performance advantage, just use the shared memory reduction for compatibility.

with the subgroup code I get:

ggml_vulkan: 0 = NVIDIA L4 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | ---: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           pp512 |      2692.95 ± 49.85 |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           tg128 |         81.13 ± 0.08 |

If I revert to the reduction loop, I get:

ggml_vulkan: 0 = NVIDIA L4 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | ---: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           pp512 |      2562.04 ± 31.72 |
| granitehybrid ?B Q8_0          |   7.73 GiB |     6.94 B | Vulkan     |  99 |       4 |     512 |     4096 |    0 |           tg128 |         81.14 ± 0.09 |


giuseppe commented Oct 9, 2025

I've reverted to the version with a for loop. We can look at the subgroup optimization later.

@giuseppe giuseppe force-pushed the ssm-vulkan branch 4 times, most recently from c467631 to 6e70718 Compare October 9, 2025 12:28
@giuseppe giuseppe marked this pull request as draft October 14, 2025 07:43
@giuseppe giuseppe force-pushed the ssm-vulkan branch 2 times, most recently from c9d79db to b5ed953 Compare October 14, 2025 12:51
@giuseppe giuseppe marked this pull request as ready for review October 14, 2025 13:23
@giuseppe

could it be approved again?


0cc4m commented Oct 14, 2025

I'll do a proper review soon.

@0cc4m 0cc4m left a comment

It works fine now.

};

void main() {
const uint global_thread_id = gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x;
Collaborator

In Vulkan you can shorten this to gl_GlobalInvocationID.x
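
That is, in a compute shader the built-in already combines the three terms:

```glsl
// gl_GlobalInvocationID.x == gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x
const uint global_thread_id = gl_GlobalInvocationID.x;
```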

const int stride_dt = int(src2_nb1) / 4;
const int stride_B = int(src4_nb2) / 4;
const int stride_C = int(src5_nb2) / 4;
const int stride_y = int(n_head * d_head);
Collaborator

Why use int everywhere? It leads to a lot of casting, and the values don't look like they can/should be negative. Indices should be uints.
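
A sketch of the quoted lines with unsigned indices, assuming the nb* push constants are byte strides over float data (hence the division by 4):

```glsl
// element strides, kept unsigned since they can never be negative
const uint stride_dt = src2_nb1 / 4;
const uint stride_B  = src4_nb2 / 4;
const uint stride_C  = src5_nb2 / 4;
const uint stride_y  = n_head * d_head;
```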

state[j] = s0[s0_base_idx + j * D_STATE + tid];
}

if (tid >= D_STATE) {
Collaborator

Isn't D_STATE the workgroup size as well? If that's the case, this can't be true.

Contributor Author

I only added it to make clear, when reading the code, that there can't be any OOB access, but I agree it is superfluous. I'll drop it.

Comment on lines 76 to 77
float dt_soft_plus = dt[dt_base_idx + i * stride_dt];
dt_soft_plus = softplus(dt_soft_plus);
Collaborator

Suggested change:

- float dt_soft_plus = dt[dt_base_idx + i * stride_dt];
- dt_soft_plus = softplus(dt_soft_plus);
+ const float dt_soft_plus = softplus(dt[dt_base_idx + i * stride_dt]);

Not sure how much of a difference this kind of stuff makes, but with the large variety of Vulkan compilers, I prefer to make it as easy as possible for them with const.
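
For context, softplus here is presumably the usual log(1 + exp(x)), passed through unchanged for large inputs to avoid overflow (a sketch of the helper, not necessarily the shader's exact code):

```glsl
float softplus(float x) {
    // log(1 + exp(x)) overflows for large x, where softplus(x) ≈ x anyway
    return (x <= 20.0) ? log(1.0 + exp(x)) : x;
}
```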


int lane = tid % SUBGROUP_SIZE;

warp_sdata[tid] = y;
Collaborator

Just curious, I don't understand or know the algorithm. Is the switch to the second shared buffer here to save on a barrier? It should be possible to continue the reduction with only stateC, or am I missing something?

Of course it also makes replacing the step with subgroup operations easier.

Contributor Author

No particular reason; I tried different ways before I got it to work. Your version is much better, thanks for the suggestion.

@giuseppe

Thanks for the review, I've pushed an updated version.


giuseppe commented Oct 15, 2025

Updated results on Intel Arc and NVIDIA L40S:

Intel Arc:

ggml_vulkan: 0 = Intel(R) Arc(tm) Graphics (MTL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model                 |     size | params | backend | ngl | dev     |  test |         t/s |
| --------------------- | -------: | -----: | ------- | --: | ------- | ----: | ----------: |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | Vulkan0 | pp512 | 5.27 ± 0.00 |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | Vulkan0 | tg128 | 2.33 ± 0.00 |

build: bc07349 (6756)

ggml_vulkan: 0 = Intel(R) Arc(tm) Graphics (MTL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model                 |     size | params | backend | ngl | dev     |  test |           t/s |
| --------------------- | -------: | -----: | ------- | --: | ------- | ----: | ------------: |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | Vulkan0 | pp512 | 213.36 ± 0.00 |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | Vulkan0 | tg128 |  15.46 ± 0.00 |

build: a4d94598e (6769)

NVIDIA:

ggml_vulkan: 0 = NVIDIA L40S (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat

| model                 |     size | params | backend | ngl |  test |            t/s |
| --------------------- | -------: | -----: | ------- | --: | ----: | -------------: |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | pp512 | 2042.60 ± 5.26 |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | tg128 |   39.57 ± 0.05 |

build: 554fd57 (6766)

ggml_vulkan: 0 = NVIDIA L40S (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat

| model                 |     size | params | backend | ngl |  test |             t/s |
| --------------------- | -------: | -----: | ------- | --: | ----: | --------------: |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | pp512 | 5736.47 ± 24.70 |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan  |  99 | tg128 |   140.11 ± 0.32 |

build: a4d94598e (6769)

};
struct vk_op_ssm_scan_push_constants {
uint32_t src0_nb2, src0_nb3, src1_nb2, src1_nb3;
uint32_t src2_nb1, src2_nb2, src3_nb1;
Collaborator

These are usually named like src2_nb1 -> nb21.

const uint32_t nr = src0->ne[1];
const uint32_t n_t = dst->ne[1];
const uint32_t n_s = dst->ne[2];
elements = { CEIL_DIV(nr, 32), n_t, n_s };
Collaborator

Seems like you could use wg_denoms for this one (make wg_denoms = {32, 1, 1} and not explicitly use CEIL_DIV here).

if (ggml_is_quantized(op->src[0]->type) || ggml_is_quantized(op->src[1]->type) || ggml_is_quantized(op->src[2]->type)) {
return false;
}
if (op->src[3] && ggml_is_quantized(op->src[3]->type)) {
Collaborator

You could maybe do these ggml_is_quantized checks in a loop or using any_of.

@giuseppe

@jeffbolznv thanks again, I've fixed the last comments.

@jeffbolznv jeffbolznv left a comment

LGTM. All my comments have been addressed. I haven't had a chance to actually run it yet, I can try to do that tomorrow, but I don't want to block the change.

@giuseppe

@0cc4m would you mind another look?

@jeffbolznv

I ran the backend tests locally and they passed and had no validation errors. I noticed one test was unsupported and I thought this was supported in earlier versions of the change. Was it intentional to remove it?

  SSM_SCAN(type=f32,d_state=16,head_dim=1,n_head=1024,n_group=1,n_seq_tokens=32,n_seqs=4): not supported [Vulkan0]
  SSM_SCAN(type=f32,d_state=128,head_dim=64,n_head=16,n_group=2,n_seq_tokens=32,n_seqs=4): OK
  SSM_SCAN(type=f32,d_state=256,head_dim=64,n_head=8,n_group=2,n_seq_tokens=32,n_seqs=4): OK

@jeffbolznv

Oh, there's a failure in the lavapipe CI:

2025-10-16T04:27:04.7561155Z 31: [SSM_SCAN] NMSE = 0.581159257 > 0.000000100   SSM_SCAN(type=f32,d_state=256,head_dim=64,n_head=8,n_group=2,n_seq_tokens=32,n_seqs=4): FAIL

lavapipe uses an unusual subgroup size, that might be related.

@giuseppe

> I ran the backend tests locally and they passed and had no validation errors. I noticed one test was unsupported and I thought this was supported in earlier versions of the change. Was it intentional to remove it?
>
>   SSM_SCAN(type=f32,d_state=16,head_dim=1,n_head=1024,n_group=1,n_seq_tokens=32,n_seqs=4): not supported [Vulkan0]
>   SSM_SCAN(type=f32,d_state=128,head_dim=64,n_head=16,n_group=2,n_seq_tokens=32,n_seqs=4): OK
>   SSM_SCAN(type=f32,d_state=256,head_dim=64,n_head=8,n_group=2,n_seq_tokens=32,n_seqs=4): OK

Yeah, I've dropped support for Mamba 1 from this PR as I haven't got it to work yet. I can add it in a follow-up PR.

@jeffbolznv

I did a quick experiment and changing spec constant 1 from device->subgroup_size to 32 fixed the lavapipe failure.

@giuseppe

> Oh, there's a failure in the lavapipe CI:
>
> 2025-10-16T04:27:04.7561155Z 31: [SSM_SCAN] NMSE = 0.581159257 > 0.000000100   SSM_SCAN(type=f32,d_state=256,head_dim=64,n_head=8,n_group=2,n_seq_tokens=32,n_seqs=4): FAIL
>
> lavapipe uses an unusual subgroup size, that might be related.

How do I run this locally? In what CI task is it happening?

@jeffbolznv

It's the ubuntu-24-cmake-vulkan CI job. I think you'll need either Linux or WSL to run lavapipe (I run it on Windows with WSL). You'll probably need to set the env var GGML_VK_VISIBLE_DEVICES=0 (assuming this is enumerated as the first device) to enable lavapipe.

@giuseppe

>> Oh, there's a failure in the lavapipe CI:
>>
>> 2025-10-16T04:27:04.7561155Z 31: [SSM_SCAN] NMSE = 0.581159257 > 0.000000100   SSM_SCAN(type=f32,d_state=256,head_dim=64,n_head=8,n_group=2,n_seq_tokens=32,n_seqs=4): FAIL
>>
>> lavapipe uses an unusual subgroup size, that might be related.
>
> How do I run this locally? In what CI task is it happening?

Never mind, I see it now. Taking a look.

@giuseppe

To simplify the review, here is the patch I've applied on top of the previous version:

diff --git a/ggml/src/ggml-vulkan/vulkan-shaders/ssm_scan.comp b/ggml/src/ggml-vulkan/vulkan-shaders/ssm_scan.comp
index 06aa10bfe..daebb5dc0 100644
--- a/ggml/src/ggml-vulkan/vulkan-shaders/ssm_scan.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/ssm_scan.comp
@@ -96,7 +96,7 @@ void main() {
             barrier();
         }

-        [[unroll]] for (uint j = 0; j < SPLIT_H / (D_STATE / SUBGROUP_SIZE); j++) {
+        [[unroll]] for (uint j = 0; j <= SPLIT_H / (D_STATE / SUBGROUP_SIZE); j++) {
             const uint idx = (tid % SUBGROUP_SIZE) +
                             D_STATE * (tid / SUBGROUP_SIZE) +
                             j * D_STATE * (D_STATE / SUBGROUP_SIZE);
@@ -104,13 +104,13 @@ void main() {
             uint lane = tid % SUBGROUP_SIZE;

             [[unroll]] for (uint offset = SUBGROUP_SIZE / 2; offset > 0; offset >>= 1) {
-                if (lane < offset) {
+                if (idx < SPLIT_H * D_STATE) {
                     stateC[idx] += stateC[idx + offset];
                 }
                 barrier();
             }

-            if (tid % SUBGROUP_SIZE == 0) {
+            if (idx < SPLIT_H * D_STATE && tid % SUBGROUP_SIZE == 0) {
                 const uint k = tid / SUBGROUP_SIZE + j * (D_STATE / SUBGROUP_SIZE);
                 d[y_base_idx + i * stride_y + k] = stateC[idx];
             }

The test passes locally now.

@jeffbolznv

Passes for me, too.

@freebsd-nils-level1

freebsd-nils-level1 commented Oct 16, 2025

Wow, that really is a game changer for Vulkan on NVIDIA cards. Here is an example from my 4060 Mobile GPU.
Original:

$ env GGML_VK_VISIBLE_DEVICES=1 .build-vulkan/bin/llama-bench -m ~/work/.models/Huihui-granite-4.0-h-tiny-abliterated.Q8_0.gguf 
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4060 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |           pp512 |      1277.04 ± 51.92 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |           tg128 |         28.11 ± 1.00 |

build: 683fa6ba4 (6781)

$ env GGML_VK_VISIBLE_DEVICES=1 .build-cublas/bin/llama-bench -m ~/work/.models/Huihui-granite-4.0-h-tiny-abliterated.Q8_0.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | CUDA       |  99 |           pp512 |      3358.79 ± 16.26 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | CUDA       |  99 |           tg128 |         85.83 ± 0.61 |

build: 683fa6ba4 (6781)

Modified for SSM support:

$ env GGML_VK_VISIBLE_DEVICES=1 bin/llama-bench -m ~/work/.models/Huihui-granite-4.0-h-tiny-abliterated.Q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4060 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |           pp512 |      2410.88 ± 26.99 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |           tg128 |         93.11 ± 0.68 |

build: c3313203 (6776)

It really gets close to CUDA; it even surpasses its text generation speed. Prompt processing still lags a bit behind.

And it even helps on an AMD 780M GPU, although not that much for prompt processing, unfortunately.

Original:

$ env GGML_VK_VISIBLE_DEVICES=0 .build-vulkan/bin/llama-bench -m ~/work/.models/Huihui-granite-4.0-h-tiny-abliterated.Q8_0.gguf 
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |           pp512 |        472.07 ± 3.40 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |           tg128 |         21.56 ± 0.40 |

build: 683fa6ba4 (6781)

Modified:

$ env GGML_VK_VISIBLE_DEVICES=0 bin/llama-bench -m ~/work/.models/Huihui-granite-4.0-h-tiny-abliterated.Q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |           pp512 |        513.78 ± 6.68 |
| granitehybrid ?B Q8_0          |   6.88 GiB |     6.94 B | Vulkan     |  99 |           tg128 |         33.78 ± 0.21 |

build: c3313203 (6776)

I understand that one needs special SSM/Mamba(?) LLM models in order to get this. Does anyone know how to specifically search for these kinds of models on Hugging Face?

Thanks so much for your work on this, giuseppe, and please get this into upstream...

Regards,
Nils

@0cc4m 0cc4m left a comment

It works as intended, I just have some comments about the supports_op code.

const uint32_t MAX_D_STATE = 256;

size_t stateC_size = SPLIT_H * MAX_D_STATE * sizeof(float);
size_t warp_sdata_size = MAX_D_STATE * sizeof(float);
Collaborator

You forgot to update this calculation.

const uint32_t SPLIT_H = 16;
const uint32_t MAX_D_STATE = 256;

size_t stateC_size = SPLIT_H * MAX_D_STATE * sizeof(float);
Collaborator

Shouldn't this be using d_state instead of MAX_D_STATE, since the smaller shader may fit even if the large one does not?

Add State Space Model scan operation to the Vulkan backend.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Add State Space Model conv operation to the Vulkan backend.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe

@0cc4m fixed the last comments and also added the new operations to docs/ops.md, which I forgot to do earlier.

@github-actions github-actions bot added the documentation label Oct 17, 2025
@0cc4m 0cc4m left a comment

Thank you, looks good now.

@giuseppe

@0cc4m thanks! CI is green now.

@0cc4m 0cc4m merged commit 3d4e86b into ggml-org:master Oct 17, 2025
70 checks passed