opencl: add kernel to handle mat mul in attention to improve encoding speed #17181

shaofeiqi · 2025-11-11T22:00:19Z

This PR adds a new kernel to specifically handle the matrix multiply in attention. This should improve encoding performance for most models.

lhez · 2025-11-14T19:31:06Z

On X Elite (X1-85),

master

model	size	params	backend	ngl	test	t/s
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp512	399.49 ± 1.87
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp1024	304.23 ± 2.80
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp2048	209.09 ± 0.26
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	tg256	33.85 ± 0.08
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp512	217.37 ± 0.95
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp1024	168.51 ± 0.40
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp2048	117.27 ± 0.32
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	tg256	20.86 ± 0.23
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp512	103.78 ± 0.28
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp1024	80.74 ± 0.58
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp2048	56.83 ± 0.09
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	tg256	12.82 ± 0.02

this PR,

model	size	params	backend	ngl	test	t/s
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp512	658.22 ± 4.60
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp1024	613.10 ± 3.58
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp2048	525.78 ± 1.90
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	tg256	33.74 ± 0.03
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp512	342.08 ± 0.66
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp1024	317.64 ± 1.18
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp2048	274.53 ± 0.61
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	tg256	21.08 ± 0.07
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp512	158.72 ± 0.54
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp1024	138.81 ± 0.17
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp2048	114.03 ± 0.10
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	tg256	12.81 ± 0.06
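For a quick read of the two tables above, here is a minimal sketch that computes the ratio of "this PR" to master throughput for each row. The numbers are copied from the tables; the `results` dictionary and the `speedup` helper are names made up for this illustration and are not part of llama.cpp or llama-bench.

```python
# Throughput in t/s copied from the two tables above: (master, this PR).
# The dictionary layout and helper name are illustrative assumptions.
results = {
    ("qwen2 1.5B", "pp512"): (399.49, 658.22),
    ("qwen2 1.5B", "pp1024"): (304.23, 613.10),
    ("qwen2 1.5B", "pp2048"): (209.09, 525.78),
    ("qwen2 1.5B", "tg256"): (33.85, 33.74),
    ("qwen2 3B", "pp512"): (217.37, 342.08),
    ("qwen2 3B", "pp1024"): (168.51, 317.64),
    ("qwen2 3B", "pp2048"): (117.27, 274.53),
    ("qwen2 3B", "tg256"): (20.86, 21.08),
    ("qwen3 8B", "pp512"): (103.78, 158.72),
    ("qwen3 8B", "pp1024"): (80.74, 138.81),
    ("qwen3 8B", "pp2048"): (56.83, 114.03),
    ("qwen3 8B", "tg256"): (12.82, 12.81),
}


def speedup(master: float, pr: float) -> float:
    """Ratio of PR throughput to master throughput (values above 1 are faster)."""
    return pr / master


for (model, test), (master, pr) in results.items():
    print(f"{model:11s} {test:7s} {speedup(master, pr):.2f}x")
```

In these measurements the pp ratio grows with prompt length for each model, while the tg256 ratios stay close to 1.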


lippman1125 · 2025-11-18T14:17:19Z

@shaofeiqi @max-krasnyansky On 8Gen3, this PR decreases decoding performance.

Without this PR:

PP	TG	B	repeat	N_KV	t_tg ms	e2e ms	TTFT ms	TPOT ms	TPS(pp) t/s	TPS(tg) t/s
512	64	1	5	576	2888.96	5222.09	2333.12	45.86	219.45	22.88

With this PR:

PP	TG	B	repeat	N_KV	t_tg ms	e2e ms	TTFT ms	TPOT ms	TPS(pp) t/s	TPS(tg) t/s
512	64	1	5	576	3744.50	5014.40	1269.90	59.44	403.19	17.68

Test command:
LD_LIBRARY_PATH=./lib ./bin/llama-batched-bench -m ../Qwen3_0.6B_Q4_0.gguf -c 2304 -b 2048 -npp 512 -ntg 64 -npl 1 -ngl 99 --flash-attn off

lhez · 2025-11-19T21:49:50Z

@lippman1125 I suppose you are referring to tg. This PR should not affect tg.

On master, using 8Gen3,

ggml_opencl: device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16

without this change (manually comment out lines 6897 - 6902),

model	size	params	backend	ngl	test	t/s
qwen3 0.6B Q4_0	403.42 MiB	751.63 M	OpenCL	99	pp512	349.43 ± 1.44
qwen3 0.6B Q4_0	403.42 MiB	751.63 M	OpenCL	99	tg64	28.65 ± 2.84

build: 7d77f07 (7108)

with this change,

model	size	params	backend	ngl	test	t/s
qwen3 0.6B Q4_0	403.42 MiB	751.63 M	OpenCL	99	pp512	798.45 ± 8.33
qwen3 0.6B Q4_0	403.42 MiB	751.63 M	OpenCL	99	tg64	28.48 ± 1.21

build: 7d77f07 (7108)

tg numbers seem about the same for my setup.
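As a rough sanity check of that reading, the sketch below compares the two tg64 means against the reported ± deviations. This is a crude heuristic rather than a statistical test, and `within_noise` is a hypothetical helper written for this illustration.

```python
# tg64 throughput on 8Gen3 as (mean, deviation) in t/s, copied from the tables above.
without_change = (28.65, 2.84)
with_change = (28.48, 1.21)


def within_noise(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two means differ by no more than the larger reported deviation."""
    return abs(a[0] - b[0]) <= max(a[1], b[1])


print(within_noise(without_change, with_change))
```

The two means differ by 0.17 t/s, which is smaller than either reported deviation.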

lippman1125 · 2025-11-20T11:45:45Z

@lhez Thanks for your reply. I verified it again and there is no problem. Good job!

shaofeiqi added 6 commits November 7, 2025 11:18

Add mul_mm_f16_f32_kq_kqv kernel

9e5c596

Add ggml_cl_mul_mat_kq_kqv_adreno func

24f32df

fix whitespace

dada517

remove unused variable

0fc4b8b

remove redundant

301662b

refactor and clean up

41bf54f

shaofeiqi requested review from lhez and max-krasnyansky as code owners November 11, 2025 22:00

DajanaV mentioned this pull request Nov 11, 2025

UPSTREAM PR #17181: opencl: add kernel to handle mat mul in attention to improve encoding speed auroralabs-loci/llama.cpp#174

Closed

github-actions bot added the ggml and OpenCL labels Nov 11, 2025

remove trailing whitespace

b3ee2ab

max-krasnyansky approved these changes Nov 16, 2025

max-krasnyansky merged commit 4db5641 into ggml-org:master Nov 16, 2025
76 of 81 checks passed

lhez mentioned this pull request Nov 19, 2025

opencl: refine condition for using kqv mm kernel #17392

Merged
