
[SYCL] Use native subgroup size for K-quant DMMV kernels on Intel #21700

Open
PMZFX wants to merge 1 commit into ggml-org:master from PMZFX:opt/kquant-dmmv-subgroup16

Conversation

@PMZFX
Contributor

@PMZFX PMZFX commented Apr 9, 2026

Summary

Use WARP_SIZE (16) instead of QK_WARP_SIZE (32) for K-quant DMMV kernels
(Q2_K through Q6_K).

These kernels were migrated from CUDA via DPCT and kept a 32-wide subgroup
size. On Intel targets, native subgroup size is 16, and DPCT itself flagged
all five kernels with register pressure warnings recommending a smaller
sub-group size. The non-K-quant DMMV path already uses WARP_SIZE (16).

Each thread now processes both halves of the QK_K=256 block via a
for (int im = 0; im < 2; ++im) loop. The inner dot-product computation
is unchanged — the diff is mostly re-indentation from wrapping existing
code in that loop.

Results

Update: My original benchmarks had a build configuration mismatch
(GGML_SYCL_F16=ON in the test build vs OFF in the baseline), which
inflated the pp numbers. The corrected results below use identical build
flags. Thanks to @maxious for the independent testing that helped surface
this.

DMMV is the single-token (tg) kernel path — it is only dispatched when
src1->ne[1] == 1, so pp is unaffected by this change.

Arc Pro B70 (Xe2), single GPU, matched builds (GGML_SYCL_F16=OFF)

| Model | Quant | pp512 | tg128 |
| --- | --- | --- | --- |
| Qwen3.5-9B | Q4_K_M | 1042 → 1041 t/s (0%) | 60.1 → 60.7 t/s (+0.8%) |
| Qwen3.5-27B | Q2_K | 292 → 292 t/s (0%) | 13.3 → 15.8 t/s (+18.7%) |
| Qwen3.5-27B | Q5_K_M | 302 → 302 t/s (0%) | 13.3 → 13.5 t/s (+1.8%) |
| Qwen3.5-27B | Q6_K | 301 → 300 t/s (0%) | 15.1 → 15.1 t/s (0%) |

The Q2_K improvement is because at 2 bits/weight the DMMV kernel is
compute-bound rather than memory-bandwidth-bound, so subgroup efficiency
matters. Heavier K-quants are bandwidth-limited during tg, making the
compute path irrelevant.

@maxious — Arc B60, independently tested

| Model | Quant | pp512 | tg128 |
| --- | --- | --- | --- |
| Qwen2.5-1.5B | Q2_K (1 GPU) | +0.4% | +19.5% |
| Llama-2-7B | Q2_K (2 GPU) | +1.3% | +37.1% |
| Qwen3.5-27B | Q2_K_XL (2 GPU) | 0% | 0% |

Consistent pattern: Q2_K tg improvement on smaller models, no pp change.

Validation

  • test-backend-ops passes for q2_K through q6_K individually, in both
    debug and release builds (79/79 tests)
  • Perplexity unchanged within error on Q4_K_M (7.5617 vs 7.5631) and
    Q6_K (6.2160 vs 6.2169), wikitext-2-raw
  • Dual-GPU inference produces correct output (Qwen3.5-27B Q4_K_M,
    both SYCL devices active)
  • Multi-turn conversation tested (5 successive prompts, no degradation)
  • Full MUL_MAT backend-ops suite crashes at a q8_0→next-type transition,
    but this is pre-existing — baseline build crashes at the same point

Scope

This PR does not touch the non-K-quant DMMV path, the `#else` (QK_K != 256) branches (dead code, since QK_K is always 256), or the QK_WARP_SIZE definition in presets.hpp. Those are intentionally left for separate cleanup.

No dependency on my other open PRs (#21638, #21597, #21580). #21580 also
touches dmmv.cpp but in a different section (BF16 dispatch), so they
merge cleanly.


  • I have read the contributing guidelines
  • AI-assisted: Yes — Claude Code (Claude Opus 4.6) was used for code
    generation, analysis, and testing. All benchmarks and validation were
    run on our hardware (Intel Arc Pro B70).

Use WARP_SIZE (16) instead of QK_WARP_SIZE (32) for K-quant DMMV
kernel dispatch (Q2_K through Q6_K) on Intel SYCL targets.

The original kernels were migrated from CUDA via DPCT and retained
a 32-wide subgroup size. Intel Xe2 natively uses 16-lane subgroups,
and the DPCT tool itself flagged these kernels with register pressure
warnings recommending a smaller subgroup size.

Each kernel thread now processes both halves of the QK_K=256 block
via a loop, preserving identical total work and numerical results.

Tested on Intel Arc Pro B70 (Xe2/Battlemage):
- test-backend-ops: all K-quant types pass (debug + release)
- perplexity: unchanged (Q4_K_M and Q6_K, wikitext-2)
- llama-bench: tg improvement on compute-bound K-quants (Q2_K), pp unchanged

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@PMZFX PMZFX requested a review from a team as a code owner April 9, 2026 23:43
@github-actions github-actions bot added labels: ggml (changes relating to the ggml tensor library for machine learning), SYCL (https://en.wikipedia.org/wiki/SYCL, GPU programming language) Apr 9, 2026
@maxious

maxious commented Apr 10, 2026

Oddly I saw a TG improvement but not so much a PP with the B60 🤔
But PR looks good to merge 👍

Llama-2-7B Q2_K (dual GPU)

| Test | Master (d12cc3d) | PR #21700 (839c7e2) | Improvement |
| --- | --- | --- | --- |
| pp512 | 1465.5 ± 23 t/s | 1484.8 ± 30 t/s | +1.3% |
| tg128 | 15.6 t/s | 21.3 t/s | +37.1% |

Qwen3.5-27B Q2_K_XL (dual GPU)

| Test | Master (d12cc3d) | PR #21700 (839c7e2) | Improvement |
| --- | --- | --- | --- |
| pp512 | 430.7 t/s | 430.9 t/s | ~0% |
| tg128 | 8.10 t/s | 8.10 t/s | ~0% |

Qwen2.5-1.5B-Instruct Q2_K (single GPU)

| Test | Master (d12cc3d) | PR #21700 (839c7e2) | Improvement |
| --- | --- | --- | --- |
| pp512 | 6931 t/s | 6960 t/s | +0.4% |
| tg128 | 85.7 t/s | 102.4 t/s | +19.5% |

@NeoZhangJianyu
Contributor

It needs to be verified on more GPUs: iGPU, Arc 7xx, BMG, and Xe iGPU (Meteor Lake or newer).
I will give feedback later.

Thank you!

@PMZFX PMZFX changed the title [SYCL] Use subgroup size 16 for K-quant DMMV kernels on Intel (2.3x–2.7x pp on Arc B70) [SYCL] Use native subgroup size for K-quant DMMV kernels on Intel Apr 10, 2026
@PMZFX
Contributor Author

PMZFX commented Apr 10, 2026

Corrected the benchmark results; my original numbers compared builds with different GGML_SYCL_F16 settings, which significantly inflated the pp numbers.
I re-ran with matched builds and updated the description.

@maxious thanks for testing on the B60. Your clean A/B comparison is what made the mismatch obvious. The real effect is a tg improvement on compute-bound K-quants (primarily Q2_K), not the pp speedup I originally claimed.

Updated title and description to reflect this.

The change is still architecturally correct: these are the only DMMV kernels still using the non-native subgroup size, and the DPCT register pressure warnings confirm that 32 is too wide for Intel (at least on our cards).

@arthw arthw closed this Apr 10, 2026
@arthw
Contributor

arthw commented Apr 10, 2026

We use 32 as warp_size in some kernels because testing showed better performance.

@arthw
Contributor

arthw commented Apr 11, 2026

Sorry, closing this PR was my mistake.
I am reopening it.

@arthw arthw reopened this Apr 11, 2026
@arthw
Contributor

arthw commented Apr 11, 2026

Arc 770, BMG 580, and iGPU (UHD) are not noticeably impacted.
PVC shows 0% on pp and +12.5% on tg for Q2_K_XL.
PVC shows 0% on pp and -4% on tg for Q4_K.

I think it's acceptable.

Thank you!

Contributor

@arthw arthw left a comment

Good job.

Sub-group size 16 is more useful on Intel GPUs.
The legacy code used 32 based on test results at the time.

With the latest driver and compiler, changing them to 16 makes the code clearer and easier to maintain.

It also confirms that 16 is the better value for all existing Intel GPUs.

There is a performance increase on B70/B60/PVC.
There is no impact on most older Intel dGPUs and iGPUs (Arc 770, BMG 580, UHD iGPU).
Except for the -4% tg on Q4_K on PVC, there is no negative performance impact on other Intel GPUs.

Thank you!
