@shaofeiqi
Contributor

This PR adds a new kernel dedicated to the matrix multiplications in attention. It should improve encoding (prompt processing) performance for most models.
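For orientation, the two matrix multiplications in attention can be written out in a few lines. The sketch below is a plain-Python reference of the math only, not the OpenCL kernel from this PR. The function names, the toy shapes, and the omission of masking and multiple heads are all assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Plain row-by-column product of two matrices stored as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Numerically stable softmax applied to each row independently."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Single-head attention without masking (illustrative only).

    q: [n_tokens x head_dim], k and v: [n_kv x head_dim].
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    kq = matmul(q, transpose(k))                    # first matmul: [n_tokens x n_kv]
    kq = [[s * scale for s in row] for row in kq]
    probs = softmax_rows(kq)
    return matmul(probs, v)                         # second matmul: [n_tokens x head_dim]

# Toy inputs: 2 query tokens, 3 cached keys/values, head_dim = 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(q, k, v)
print(out)
```

With many query rows (prompt processing) both products are matrix-matrix multiplies. With a single query row (token generation) they reduce to matrix-vector products, which is a different workload.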

@github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (issues specific to the OpenCL backend) on Nov 11, 2025
@lhez
Collaborator

lhez commented Nov 14, 2025

On X Elite (X1-85),

master

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp512 | 399.49 ± 1.87 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp1024 | 304.23 ± 2.80 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp2048 | 209.09 ± 0.26 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | tg256 | 33.85 ± 0.08 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp512 | 217.37 ± 0.95 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp1024 | 168.51 ± 0.40 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp2048 | 117.27 ± 0.32 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | tg256 | 20.86 ± 0.23 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp512 | 103.78 ± 0.28 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp1024 | 80.74 ± 0.58 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp2048 | 56.83 ± 0.09 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | tg256 | 12.82 ± 0.02 |

this PR,

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp512 | 658.22 ± 4.60 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp1024 | 613.10 ± 3.58 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp2048 | 525.78 ± 1.90 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | tg256 | 33.74 ± 0.03 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp512 | 342.08 ± 0.66 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp1024 | 317.64 ± 1.18 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp2048 | 274.53 ± 0.61 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | tg256 | 21.08 ± 0.07 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp512 | 158.72 ± 0.54 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp1024 | 138.81 ± 0.17 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp2048 | 114.03 ± 0.10 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | tg256 | 12.81 ± 0.06 |
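
The change is easier to read as a ratio of the two runs. The following is a small sketch that divides the mean t/s values from the "this PR" table by those from the "master" table. The dictionary layout and the `speedup` helper are illustrative choices, and the ± spreads are ignored here.

```python
# Mean t/s copied from the two llama-bench tables above (X Elite, X1-85).
master = {
    ("qwen2 1.5B", "pp512"): 399.49, ("qwen2 1.5B", "pp1024"): 304.23,
    ("qwen2 1.5B", "pp2048"): 209.09, ("qwen2 1.5B", "tg256"): 33.85,
    ("qwen2 3B", "pp512"): 217.37, ("qwen2 3B", "pp1024"): 168.51,
    ("qwen2 3B", "pp2048"): 117.27, ("qwen2 3B", "tg256"): 20.86,
    ("qwen3 8B", "pp512"): 103.78, ("qwen3 8B", "pp1024"): 80.74,
    ("qwen3 8B", "pp2048"): 56.83, ("qwen3 8B", "tg256"): 12.82,
}
this_pr = {
    ("qwen2 1.5B", "pp512"): 658.22, ("qwen2 1.5B", "pp1024"): 613.10,
    ("qwen2 1.5B", "pp2048"): 525.78, ("qwen2 1.5B", "tg256"): 33.74,
    ("qwen2 3B", "pp512"): 342.08, ("qwen2 3B", "pp1024"): 317.64,
    ("qwen2 3B", "pp2048"): 274.53, ("qwen2 3B", "tg256"): 21.08,
    ("qwen3 8B", "pp512"): 158.72, ("qwen3 8B", "pp1024"): 138.81,
    ("qwen3 8B", "pp2048"): 114.03, ("qwen3 8B", "tg256"): 12.81,
}

def speedup(before, after):
    """Ratio of throughput after the change to throughput before it."""
    return after / before

ratios = {key: speedup(master[key], this_pr[key]) for key in master}
for (model, test), ratio in ratios.items():
    print(f"{model:10s} {test:7s} {ratio:.2f}x")
```

For all three models the prompt-processing ratio grows with prompt length (roughly 1.6x at pp512 up to about 2.5x at pp2048 for qwen2 1.5B), while the tg256 ratios stay close to 1.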

@max-krasnyansky max-krasnyansky merged commit 4db5641 into ggml-org:master Nov 16, 2025
76 of 81 checks passed
basnijholt pushed a commit to basnijholt/llama.cpp that referenced this pull request Nov 16, 2025
… speed (ggml-org#17181)

* Add mul_mm_f16_f32_kq_kqv kernel

* Add ggml_cl_mul_mat_kq_kqv_adreno func

* fix whitespace

* remove unused variable

* remove redundant

* refactor and clean up

* remove trailing whitespace
@lippman1125

@shaofeiqi @max-krasnyansky On 8GEN3, this PR decreases decoding performance.

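Without this PR:

| PP | TG | B | repeat | N_KV | t_tg ms | e2e ms | TTFT ms | TPOT ms | TPS(pp) t/s | TPS(tg) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 64 | 1 | 5 | 576 | 2888.96 | 5222.09 | 2333.12 | 45.86 | 219.45 | 22.88 |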

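With this PR:

| PP | TG | B | repeat | N_KV | t_tg ms | e2e ms | TTFT ms | TPOT ms | TPS(pp) t/s | TPS(tg) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 64 | 1 | 5 | 576 | 3744.50 | 5014.40 | 1269.90 | 59.44 | 403.19 | 17.68 |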

Test command:
LD_LIBRARY_PATH=./lib ./bin/llama-batched-bench -m ../Qwen3_0.6B_Q4_0.gguf -c 2304 -b 2048 -npp 512 -ntg 64 -npl 1 -ngl 99 --flash-attn off
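
As a sanity check on how the columns in the two rows above relate, the sketch below recomputes two of them from the others. The relationships used (e2e = TTFT + t_tg, and TPS(pp) = PP divided by TTFT in seconds) are inferred from the reported numbers, not taken from the benchmark tool's source, and the field names are made up for this example.

```python
# Values copied from the two benchmark rows above (8GEN3, Qwen3 0.6B Q4_0).
rows = {
    "without PR": {"pp": 512, "tg": 64, "t_tg_ms": 2888.96, "e2e_ms": 5222.09,
                   "ttft_ms": 2333.12, "tps_pp": 219.45, "tps_tg": 22.88},
    "with PR":    {"pp": 512, "tg": 64, "t_tg_ms": 3744.50, "e2e_ms": 5014.40,
                   "ttft_ms": 1269.90, "tps_pp": 403.19, "tps_tg": 17.68},
}

def derived(row):
    """Recompute end-to-end time and prompt throughput from the other columns.

    Assumed relationships (inferred from the data, not from the tool):
      e2e     = TTFT + t_tg
      TPS(pp) = PP / (TTFT in seconds)
    """
    return {
        "e2e_ms": row["ttft_ms"] + row["t_tg_ms"],
        "tps_pp": row["pp"] / (row["ttft_ms"] / 1000.0),
    }

for name, row in rows.items():
    d = derived(row)
    print(f"{name}: e2e {d['e2e_ms']:.2f} ms (reported {row['e2e_ms']}), "
          f"TPS(pp) {d['tps_pp']:.2f} t/s (reported {row['tps_pp']})")
```

Under these assumptions the recomputed values agree with the reported ones to within rounding. In this particular run TTFT dropped from 2333.12 ms to 1269.90 ms while t_tg rose from 2888.96 ms to 3744.50 ms.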

@lhez
Collaborator

lhez commented Nov 19, 2025

@lippman1125 I suppose you are referring to tg. This PR should not affect tg.

On master, using 8Gen3,

ggml_opencl: device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16

without this change (manually comment out lines 6897 - 6902),

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | pp512 | 349.43 ± 1.44 |
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | tg64 | 28.65 ± 2.84 |

build: 7d77f07 (7108)

with this change,

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | pp512 | 798.45 ± 8.33 |
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | tg64 | 28.48 ± 1.21 |

build: 7d77f07 (7108)

tg numbers seem about the same for my setup.
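
One rough way to back up "about the same" is to check whether the reported mean ± spread ranges overlap. The sketch below treats each ± value as a plain range around the mean, which is an assumption and not a formal significance test. The helper names are illustrative.

```python
def interval(mean, spread):
    """Return the (low, high) range implied by a 'mean ± spread' figure."""
    return (mean - spread, mean + spread)

def overlaps(a, b):
    """True when two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# t/s values copied from the two Adreno 750 tables above.
tg_without = interval(28.65, 2.84)
tg_with = interval(28.48, 1.21)
pp_without = interval(349.43, 1.44)
pp_with = interval(798.45, 8.33)

print("tg64 ranges overlap: ", overlaps(tg_without, tg_with))
print("pp512 ranges overlap:", overlaps(pp_without, pp_with))
```

The tg64 ranges overlap, so the small difference in means is within the run-to-run spread, whereas the pp512 ranges are far apart.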

@lippman1125

@lhez Thanks for your reply. I verified it again and there is no problem. Good job!
