
[SYCL] Use native subgroup size for K-quant DMMV kernels on Intel #21700

Open
PMZFX wants to merge 1 commit into ggml-org:master from PMZFX:opt/kquant-dmmv-subgroup16

Conversation

@PMZFX
Contributor

@PMZFX PMZFX commented Apr 9, 2026

Summary

Use WARP_SIZE (16) instead of QK_WARP_SIZE (32) for K-quant DMMV kernels
(Q2_K through Q6_K).

These kernels were migrated from CUDA via DPCT and kept a 32-wide subgroup
size. On Intel targets, native subgroup size is 16, and DPCT itself flagged
all five kernels with register pressure warnings recommending a smaller
sub-group size. The non-K-quant DMMV path already uses WARP_SIZE (16).

Each thread now processes both halves of the QK_K=256 block via a
for (int im = 0; im < 2; ++im) loop. The inner dot-product computation
is unchanged — the diff is mostly re-indentation from wrapping existing
code in that loop.

Results

Update: My original benchmarks had a build configuration mismatch
(GGML_SYCL_F16=ON in the test build vs OFF in the baseline), which
inflated the pp numbers. The corrected results below use identical build
flags. Thanks to @maxious for the independent testing that helped surface
this.

DMMV is the single-token (tg) kernel path — it is only dispatched when
src1->ne[1] == 1, so pp is unaffected by this change.

Arc Pro B70 (Xe2), single GPU, matched builds (GGML_SYCL_F16=OFF)

| Model | Quant | pp512 | tg128 |
| --- | --- | --- | --- |
| Qwen3.5-9B | Q4_K_M | 1042 → 1041 t/s (0%) | 60.1 → 60.7 t/s (+0.8%) |
| Qwen3.5-27B | Q2_K | 292 → 292 t/s (0%) | 13.3 → 15.8 t/s (+18.7%) |
| Qwen3.5-27B | Q5_K_M | 302 → 302 t/s (0%) | 13.3 → 13.5 t/s (+1.8%) |
| Qwen3.5-27B | Q6_K | 301 → 300 t/s (0%) | 15.1 → 15.1 t/s (0%) |

The Q2_K improvement is because at 2 bits/weight the DMMV kernel is
compute-bound rather than memory-bandwidth-bound, so subgroup efficiency
matters. Heavier K-quants are bandwidth-limited during tg, making the
compute path irrelevant.

@maxious — Arc B60, independently tested

| Model | Quant | pp512 | tg128 |
| --- | --- | --- | --- |
| Qwen2.5-1.5B | Q2_K (1 GPU) | +0.4% | +19.5% |
| Llama-2-7B | Q2_K (2 GPU) | +1.3% | +37.1% |
| Qwen3.5-27B | Q2_K_XL (2 GPU) | 0% | 0% |

Consistent pattern: Q2_K tg improvement on smaller models, no pp change.

Validation

  • test-backend-ops passes for q2_K through q6_K individually, in both
    debug and release builds (79/79 tests)
  • Perplexity unchanged within error on Q4_K_M (7.5617 vs 7.5631) and
    Q6_K (6.2160 vs 6.2169), wikitext-2-raw
  • Dual-GPU inference produces correct output (Qwen3.5-27B Q4_K_M,
    both SYCL devices active)
  • Multi-turn conversation tested (5 successive prompts, no degradation)
  • Full MUL_MAT backend-ops suite crashes at a q8_0→next-type transition,
    but this is pre-existing — baseline build crashes at the same point

Scope

This PR does not touch the non-K-quant DMMV path, the `#else` (QK_K != 256) branches (dead code, since QK_K is always 256), or the QK_WARP_SIZE definition in presets.hpp. Those are intentionally left for separate cleanup.

No dependency on my other open PRs (#21638, #21597, #21580). #21580 also
touches dmmv.cpp but in a different section (BF16 dispatch), so they
merge cleanly.


  • I have read the contributing guidelines
  • AI-assisted: Yes — Claude Code (Claude Opus 4.6) was used for code
    generation, analysis, and testing. All benchmarks and validation were
    run on our hardware (Intel Arc Pro B70).

Use WARP_SIZE (16) instead of QK_WARP_SIZE (32) for K-quant DMMV
kernel dispatch (Q2_K through Q6_K) on Intel SYCL targets.

The original kernels were migrated from CUDA via DPCT and retained
a 32-wide subgroup size. Intel Xe2 natively uses 16-lane subgroups,
and the DPCT tool itself flagged these kernels with register pressure
warnings recommending a smaller subgroup size.

Each kernel thread now processes both halves of the QK_K=256 block
via a loop, preserving identical total work and numerical results.

Tested on Intel Arc Pro B70 (Xe2/Battlemage):
- test-backend-ops: all K-quant types pass (debug + release)
- perplexity: unchanged (Q4_K_M and Q6_K, wikitext-2)
- llama-bench: tg improvement on compute-bound K-quants (Q2_K), pp unchanged

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@PMZFX PMZFX requested a review from a team as a code owner April 9, 2026 23:43
@github-actions github-actions bot added labels: ggml (changes relating to the ggml tensor library for machine learning), SYCL (https://en.wikipedia.org/wiki/SYCL, GPU programming language) Apr 9, 2026
@maxious

maxious commented Apr 10, 2026

Oddly I saw a TG improvement but not so much a PP with the B60 🤔
But PR looks good to merge 👍

Llama-2-7B Q2_K (dual GPU)

| Test | Master (d12cc3d) | PR #21700 (839c7e2) | Improvement |
| --- | --- | --- | --- |
| pp512 | 1465.5 ± 23 t/s | 1484.8 ± 30 t/s | +1.3% |
| tg128 | 15.6 t/s | 21.3 t/s | +37.1% |

Qwen3.5-27B Q2_K_XL (dual GPU)

| Test | Master (d12cc3d) | PR #21700 (839c7e2) | Improvement |
| --- | --- | --- | --- |
| pp512 | 430.7 t/s | 430.9 t/s | ~0% |
| tg128 | 8.10 t/s | 8.10 t/s | ~0% |

Qwen2.5-1.5B-Instruct Q2_K (single GPU)

| Test | Master (d12cc3d) | PR #21700 (839c7e2) | Improvement |
| --- | --- | --- | --- |
| pp512 | 6931 t/s | 6960 t/s | +0.4% |
| tg128 | 85.7 t/s | 102.4 t/s | +19.5% |

@NeoZhangJianyu
Contributor

It needs to be verified on more GPUs: iGPU, Arc 7xx, BMG, and Xe iGPU (Meteor Lake or newer).
I will give feedback later.

Thank you!

@PMZFX PMZFX changed the title [SYCL] Use subgroup size 16 for K-quant DMMV kernels on Intel (2.3x–2.7x pp on Arc B70) [SYCL] Use native subgroup size for K-quant DMMV kernels on Intel Apr 10, 2026
@PMZFX
Contributor Author

PMZFX commented Apr 10, 2026

Corrected the benchmark results; my original numbers compared builds with different GGML_SYCL_F16 settings, which significantly inflated the pp numbers.
I re-ran with matched builds and updated the description.

@maxious thanks for testing on the B60. Your clean A/B comparison is what made the mismatch obvious. The real effect is a tg improvement on compute-bound K-quants (primarily Q2_K), not the pp speedup I originally claimed.

Updated title and description to reflect this.

The change is still architecturally correct: these are the only DMMV kernels still using the non-native subgroup size, and the DPCT register pressure warnings confirm that 32 is too wide for Intel (at least on our cards).

@arthw arthw closed this Apr 10, 2026
@arthw
Contributor

arthw commented Apr 10, 2026

We use 32 as warp_size in some kernels because testing showed better performance.

@arthw
Contributor

arthw commented Apr 11, 2026

Sorry, closing this PR was my mistake.
I am reopening it.

@arthw arthw reopened this Apr 11, 2026
@arthw
Contributor

arthw commented Apr 11, 2026

Arc 770, BMG 580, and iGPU (UHD) are not noticeably impacted.
PVC shows 0% on pp and +12.5% on tg for Q2_K_XL.
PVC shows 0% on pp and -4% on tg for Q4_K.

I think it's acceptable.

Thank you!

Contributor

@arthw arthw left a comment

Good job.

Sub-group size 16 is more useful on Intel GPUs.
The legacy code used 32 based on test results at the time.

With the latest driver and compiler, changing them to 16 makes the code clearer and easier to maintain.

It also confirms that 16 is the better value for all existing Intel GPUs.

There is a performance increase on B70/B60/PVC.
There is no impact on most older Intel dGPUs and iGPUs (Arc 770, BMG 580, UHD iGPU).
Except for the -4% tg on Q4_K on PVC, there is no negative performance impact on other Intel GPUs.

Thank you!
