OpenCL: OP_GATED_DELTA_NET by ymcki · Pull Request #23312 · ggml-org/llama.cpp

ymcki · 2026-05-19T03:35:13Z

I wrote the backend-agnostic KDA support (#18755), so I am very familiar with
the math and algorithms behind GDN and KDA.

OpenCL does not support this op yet, so I decided to try implementing it.
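For readers who have not met the operator, here is a minimal single-head sketch of the gated delta rule as it is commonly written in the literature. It is not the OpenCL kernel from this PR. The function name, the d_v x d_k state layout, and passing the gate as an already-exponentiated decay factor are all assumptions made for illustration. A scalar gate corresponds to GDN (one decay per head) and a per-key-channel gate to KDA.

```python
def gated_delta_step(state, q, k, v, beta, gate):
    """One token of the gated delta rule for a single head (illustrative only).

    state: d_v x d_k matrix as a list of rows (not modified in place).
    gate:  a float for GDN, or a length-d_k sequence for KDA.
           Assumed to be the multiplicative decay itself, not its logarithm.
    Returns (new_state, output).
    """
    d_k, d_v = len(k), len(v)
    decay = [gate] * d_k if isinstance(gate, (int, float)) else list(gate)

    # 1. Decay the previous state along the key dimension.
    decayed = [[state[i][j] * decay[j] for j in range(d_k)] for i in range(d_v)]

    # 2. Value the decayed state currently associates with key k.
    predicted = [sum(decayed[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]

    # 3. Delta-rule correction, scaled by the write strength beta.
    delta = [beta * (v[i] - predicted[i]) for i in range(d_v)]

    # 4. Rank-1 update of the state, then read out with the query.
    new_state = [[decayed[i][j] + delta[i] * k[j] for j in range(d_k)]
                 for i in range(d_v)]
    out = [sum(new_state[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_state, out


# Writing twice to the same unit-norm key with beta = 1 overwrites the value.
state = [[0.0, 0.0], [0.0, 0.0]]
state, out1 = gated_delta_step(state, [1.0, 0.0], [1.0, 0.0], [2.0, 3.0], 1.0, 1.0)
state, out2 = gated_delta_step(state, [1.0, 0.0], [1.0, 0.0], [5.0, 7.0], 1.0, 1.0)
```

A real kernel would additionally handle batching over heads and sequences, query scaling, and memory layout, none of which this sketch tries to model.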

pp gain is about 9.5% and tg gain is about 22.3% (pp128 and tg128; pp64 and tg64 improve by about 9.8% and 21.3%).

Master

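| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | pp64 | 86.91 ± 0.54 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | pp128 | 98.48 ± 0.34 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | tg64 | 7.46 ± 0.10 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | tg128 | 7.31 ± 0.03 |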

build: 1ec7ba0 (9113)

This PR

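| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | pp64 | 95.43 ± 0.26 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | pp128 | 107.83 ± 0.31 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | tg64 | 9.05 ± 0.13 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | tg128 | 8.94 ± 0.02 |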

build: 1ec7ba0 (9113)
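The quoted gains can be rechecked from the mean t/s values in the two tables above. The short script below is only a convenience sketch: the dictionary and function names are made up for illustration, and it ignores the ± spread reported by llama-bench.

```python
# Mean tokens/second copied from the "Master" and "This PR" tables.
master = {"pp64": 86.91, "pp128": 98.48, "tg64": 7.46, "tg128": 7.31}
this_pr = {"pp64": 95.43, "pp128": 107.83, "tg64": 9.05, "tg128": 8.94}


def gain_percent(before: float, after: float) -> float:
    """Relative throughput change in percent."""
    return (after / before - 1.0) * 100.0


gains = {test: gain_percent(master[test], this_pr[test]) for test in master}

for test, gain in gains.items():
    print(f"{test}: {gain:+.1f}%")
```

The headline figures of about 9.5% and 22.3% correspond to the pp128 and tg128 rows.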

Correctness: LD_LIBRARY_PATH=./lib:/vendor/lib64 ./bin/test-backend-ops test -o GATED_DELTA_NET

Testing 2 devices

Backend 1/2: GPUOpenCL
Device description: QUALCOMM Adreno(TM) 750
Device memory: 11584 MB (10560 MB free)

GATED_DELTA_NET(type=f32,head_count=32,head_size=128,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=1,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=16,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=1,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=64,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=127,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=256,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=65,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=100,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=200,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=127,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=64,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=33,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=100,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
28/28 tests passed
Backend OpenCL: OK
Backend 2/2: CPU
Skipping CPU backend
2/2 backends passed
OK

I have read and agree with the contributing guidelines
AI usage disclosure: YES

AI was first used to evaluate the existing implementations in the Vulkan, Hexagon and CPU backends, and it found that Vulkan and Hexagon have the best implementations. Because Vulkan and OpenCL are close, I used the Vulkan implementation as the reference. Adreno GPUs do not support every OpenCL feature, so a direct translation was not possible; I manually checked what the Adreno GPU supports and modified the code by hand. I also tuned the COLUMNS_PER_LANE_GROUP (cpl) and SUBGROUP_PER_WG (spw) parameters manually.

lhez · 2026-05-21T04:27:28Z

@ymcki Thank you for adding GDN. There are whitespace errors - please fix them. I updated the backend initialization so you will need to resolve; I think you can move the check for qcom subgroup shuffle to ggml_opencl_is_device_supported.

ymcki · 2026-05-21T07:34:19Z

> @ymcki Thank you for adding GDN. There are whitespace errors - please fix them. I updated the backend initialization so you will need to resolve; I think you can move the check for qcom subgroup shuffle to ggml_opencl_is_device_supported.

Fixed trailing spaces.

I moved the check for qcom subgroup shuffle to ggml_cl_init instead of ggml_opencl_is_device_supported, because I think the flag is better kept in struct ggml_backend_opencl_context than in struct ggml_backend_opencl_device_context.

lhez · 2026-05-27T18:07:45Z

@max-krasnyansky Could you take a look and ack when you get a chance?

max-krasnyansky

Nice!

* OP_GATED_DELTA_NET impl
* add back lanes_per_column declaration
* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce
* removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot
* support for K>1 state snapshot
* removed picky indent multiple of 4 fixes
* removed return that won't be executed

ymcki added 3 commits May 19, 2026 08:33

OP_GATED_DELTA_NET impl

6623435

add back lanes_per_column declaration

748c76f

removed has_subgroup_arithmetic and has_subgroup_clustered_reduce

f12e625

ymcki requested a review from a team as a code owner May 19, 2026 03:35

github-actions Bot added the ggml and OpenCL labels May 19, 2026

Merge branch 'ggml-org:master' into OpenCL

cfd56aa

lhez reviewed May 21, 2026

Comment thread ggml/src/ggml-opencl/ggml-opencl.cpp Outdated

Comment thread ggml/src/ggml-opencl/ggml-opencl.cpp Outdated

Comment thread ggml/src/ggml-opencl/ggml-opencl.cpp Outdated

ymcki and others added 3 commits May 21, 2026 14:45

removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot

ac971b6

conflic resolution

a9fdada

Merge branch 'ggml-org:master' into OpenCL

5c1c5ea

ymcki and others added 4 commits May 21, 2026 21:51

support for K>1 state snapshot

623fbf5

Merge branch 'ggml-org:master' into OpenCL

7c4ebbd

Merge branch 'master' of github.com:ymcki/llama.cpp into OpenCL

19ea258

Merge branch 'OpenCL' of github.com:ymcki/llama.cpp into OpenCL

2eccfb9

ymcki mentioned this pull request May 22, 2026

Hexagon: OP_GATED_DELTA_NET K>1 support #23531

Merged

lhez reviewed May 26, 2026

Comment thread ggml/src/ggml-opencl/ggml-opencl.cpp

ymcki and others added 3 commits May 26, 2026 17:08

Merge branch 'ggml-org:master' into OpenCL

a051b7b

removed picky indent multiple of 4 fixes

3256212

Merge branch 'OpenCL' of github.com:ymcki/llama.cpp into OpenCL

7a2edbd

lhez reviewed May 26, 2026

Comment thread ggml/src/ggml-opencl/ggml-opencl.cpp Outdated

ymcki and others added 2 commits May 27, 2026 07:38

Merge branch 'ggml-org:master' into OpenCL

9ebc470

removed return that won't be executed

2176f95

lhez approved these changes May 27, 2026

lhez requested a review from max-krasnyansky May 27, 2026 18:07

max-krasnyansky approved these changes May 28, 2026

lhez merged commit 8ad8aef into ggml-org:master May 28, 2026
33 of 34 checks passed

lhez merged 16 commits into ggml-org:master from ymcki:OpenCL

3 participants