
OpenCL: OP_GATED_DELTA_NET #23312

Merged
lhez merged 16 commits into ggml-org:master from ymcki:OpenCL
May 28, 2026

@ymcki ymcki commented May 19, 2026

I authored the backend-agnostic KDA support (#18755), so I am very familiar with the math and algorithms behind GDN and KDA.

Since the OpenCL backend does not support this op yet, I decided to implement it.

The pp gain is about 9.5% and the tg gain is about 22.3% (pp128 and tg128 in the tables below).

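Master

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_ubatch | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | 0 | pp64 | 86.91 ± 0.54 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | 0 | pp128 | 98.48 ± 0.34 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | 0 | tg64 | 7.46 ± 0.10 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | 0 | tg128 | 7.31 ± 0.03 |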

build: 1ec7ba0 (9113)

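This PR

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_ubatch | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | 0 | pp64 | 95.43 ± 0.26 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | 0 | pp128 | 107.83 ± 0.31 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | 0 | tg64 | 9.05 ± 0.13 |
| qwen35 4B Q4_0 | 2.44 GiB | 4.21 B | OpenCL | 99 | 6 | 0xfc | 1 | 1000 | 1024 | 1 | 0 | tg128 | 8.94 ± 0.02 |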

build: 1ec7ba0 (9113)
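
For anyone who wants to check the quoted gains, here is a small sketch that recomputes them from the mean t/s values in the two benchmark runs above. The dictionary and function names are mine and only illustrative; the ± spreads are ignored.

```python
# Mean t/s values copied from the Master and This PR benchmark runs (build 1ec7ba0).
master = {"pp64": 86.91, "pp128": 98.48, "tg64": 7.46, "tg128": 7.31}
this_pr = {"pp64": 95.43, "pp128": 107.83, "tg64": 9.05, "tg128": 8.94}


def gain_percent(before: float, after: float) -> float:
    """Relative throughput gain of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


# One rounded gain per test, keyed by the llama-bench test name.
gains = {test: round(gain_percent(master[test], this_pr[test]), 1) for test in master}

for test, gain in gains.items():
    print(f"{test}: +{gain}%")
```

The pp128 and tg128 rows give the 9.5% and 22.3% figures; the shorter pp64 and tg64 runs come out at roughly 9.8% and 21.3%.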

Correctness: `LD_LIBRARY_PATH=./lib:/vendor/lib64 ./bin/test-backend-ops test -o GATED_DELTA_NET`

Testing 2 devices

Backend 1/2: GPUOpenCL
Device description: QUALCOMM Adreno(TM) 750
Device memory: 11584 MB (10560 MB free)

GATED_DELTA_NET(type=f32,head_count=32,head_size=128,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=1,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=16,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=1,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=64,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=127,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=256,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=65,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=100,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=200,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=127,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=64,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=33,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=100,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
28/28 tests passed
Backend OpenCL: OK
Backend 2/2: CPU
Skipping CPU backend
2/2 backends passed
OK

I first used AI to evaluate the existing implementations in Vulkan, Hexagon, and CPU. It found that Vulkan and Hexagon have the best implementations. Because Vulkan and OpenCL are closely related, I used the Vulkan implementation as the reference. The Adreno GPU doesn't support all OpenCL features, so a direct translation is not possible. I therefore inspected by hand what the Adreno GPU supports and modified the code accordingly. I also tuned the COLUMNS_PER_LANE_GROUP (cpl) and SUBGROUP_PER_WG (spw) parameters by hand.
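
For readers unfamiliar with the op, the following is a minimal single-head, single-token sketch of the gated delta rule as I understand it from the published Gated DeltaNet formulation. It is not the kernel from this PR. The function name, the log-space gate convention, and the omission of any query scaling are assumptions made for illustration. A scalar gate stands in for the GDN case and a per-key-channel gate for the KDA case.

```python
import math


def gated_delta_step(state, q, k, v, gate, beta):
    """Apply one token to one head's state and return that token's output.

    state: d_v x d_k matrix as a list of rows, updated in place
    gate:  log-decay (assumed convention); a float, or a list of d_k floats
    beta:  scalar write strength for this token
    """
    d_k = len(k)
    log_decay = gate if isinstance(gate, list) else [gate] * d_k
    decay = [math.exp(g) for g in log_decay]
    out = []
    for row, v_i in zip(state, v):
        # 1. Decay what this state row remembers.
        for j in range(d_k):
            row[j] *= decay[j]
        # 2. Read back the value the decayed state predicts for key k.
        pred = sum(row[j] * k[j] for j in range(d_k))
        # 3. Delta rule: write only the scaled prediction error.
        delta = beta * (v_i - pred)
        for j in range(d_k):
            row[j] += delta * k[j]
        # 4. Query the updated state.
        out.append(sum(row[j] * q[j] for j in range(d_k)))
    return out


state = [[0.0, 0.0], [0.0, 0.0]]
first = gated_delta_step(state, [1.0, 0.0], [1.0, 0.0], [3.0, 5.0], 0.0, 1.0)
second = gated_delta_step(state, [1.0, 0.0], [1.0, 0.0], [3.0, 5.0], [0.0, 0.0], 1.0)
```

Each row of the state is updated independently of the others, which is the kind of structure that lets a GPU kernel split the work across lanes. How the actual kernel maps rows and columns to lanes through cpl and spw is not shown here.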

@ymcki ymcki requested a review from a team as a code owner May 19, 2026 03:35
@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (issues specific to the OpenCL backend) labels May 19, 2026

lhez commented May 21, 2026

@ymcki Thank you for adding GDN. There are whitespace errors; please fix them. I updated the backend initialization, so you will need to resolve the conflicts. I think you can move the check for the qcom subgroup shuffle to ggml_opencl_is_device_supported.


ymcki commented May 21, 2026

> @ymcki Thank you for adding GDN. There are whitespace errors - please fix them. I updated the backend initialization so you will need to resolve; I think you can move the check for qcom subgroup shuffle to ggml_opencl_is_device_supported.

Fixed the trailing spaces.

I moved the check for the qcom subgroup shuffle to ggml_cl_init rather than ggml_opencl_is_device_supported, because I think the flag belongs in struct ggml_backend_opencl_context rather than struct ggml_backend_opencl_device_context.
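
To illustrate the design choice being discussed, here is a rough sketch of a capability flag that lives on the per-backend context and is filled in once at init time. Every name below is a hypothetical stand-in, not one of the real ggml-opencl structs or functions. The extension string and the idea that the op is reported as unsupported without it are also assumptions.

```python
from dataclasses import dataclass

# Assumed name of the Qualcomm subgroup shuffle extension; illustrative only.
SHUFFLE_EXT = "cl_qcom_subgroup_shuffle"


@dataclass
class DeviceContext:
    """Hypothetical stand-in for per-device data (what the driver reports)."""
    name: str
    extensions: str  # space-separated extension list


@dataclass
class BackendContext:
    """Hypothetical stand-in for per-backend data (kernels, feature flags)."""
    device: DeviceContext
    has_qcom_subgroup_shuffle: bool = False


def init_backend(device: DeviceContext) -> BackendContext:
    """Probe the device once at init and cache the result on the backend context."""
    ctx = BackendContext(device=device)
    ctx.has_qcom_subgroup_shuffle = SHUFFLE_EXT in device.extensions.split()
    return ctx


def supports_op(ctx: BackendContext, op: str) -> bool:
    """Gate the one op that needs the shuffle; other ops are not modeled here."""
    if op == "GATED_DELTA_NET":
        return ctx.has_qcom_subgroup_shuffle
    return True


with_shuffle = init_backend(DeviceContext("gpu-a", "cl_khr_fp16 " + SHUFFLE_EXT))
without_shuffle = init_backend(DeviceContext("gpu-b", "cl_khr_fp16"))
```

Keeping the flag next to the kernels that depend on it means the op-support check only has to look at the backend context, at the cost of probing after device selection rather than during it.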

@lhez lhez requested a review from max-krasnyansky May 27, 2026 18:07

lhez commented May 27, 2026

@max-krasnyansky Could you take a look and ack when you get a chance?


@max-krasnyansky max-krasnyansky left a comment


Nice!

@lhez lhez merged commit 8ad8aef into ggml-org:master May 28, 2026
33 of 34 checks passed
adrianhoehne pushed a commit to adrianhoehne/llama.cpp that referenced this pull request May 28, 2026
* OP_GATED_DELTA_NET impl

* add back lanes_per_column declaration

* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce

* removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot

* support for K>1 state snapshot

* removed picky indent multiple of 4 fixes

* removed return that won't be executed