[SYCL] Use native subgroup size for K-quant DMMV kernels on Intel #21700
PMZFX wants to merge 1 commit into ggml-org:master
Conversation
Use WARP_SIZE (16) instead of QK_WARP_SIZE (32) for K-quant DMMV kernel dispatch (Q2_K through Q6_K) on Intel SYCL targets.

The original kernels were migrated from CUDA via DPCT and retained a 32-wide subgroup size. Intel Xe2 natively uses 16-lane subgroups, and the DPCT tool itself flagged these kernels with register-pressure warnings recommending a smaller subgroup size. Each kernel thread now processes both halves of the QK_K=256 block via a loop, preserving identical total work and numerical results.

Tested on Intel Arc Pro B70 (Xe2/Battlemage):
- test-backend-ops: all K-quant types pass (debug + release)
- perplexity: unchanged (Q4_K_M and Q6_K, wikitext-2)
- llama-bench: 2.3-2.7x prefill improvement, neutral tg

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Oddly I saw a TG improvement but not so much a PP with the B60 🤔 Llama-2-7B Q2_K (dual GPU)
Qwen3.5-27B Q2_K_XL (dual GPU)
Qwen2.5-1.5B-Instruct Q2_K (single GPU)
It needs to be verified on more GPUs: iGPU, Arc 7xx, BMG, and Xe iGPU (Meteor Lake or newer). Thank you!
Corrected the benchmark results; my original numbers compared builds with different GGML_SYCL_F16 settings, which inflated the pp numbers significantly. @maxious thanks for testing on the B60; your clean A/B comparison is what made the mismatch obvious.

The real effect is a tg improvement on compute-bound K-quants (primarily Q2_K), not the pp speedup I originally claimed. Updated the title and description to reflect this. The change is still architecturally correct: these are the only DMMV kernels still using the non-native subgroup size, and the DPCT register-pressure warnings confirm 32 is too wide for Intel (at least for our cards).
We use 32 as the warp size in some kernels because it performed better in testing.
Sorry, it was my mistake to close this PR.
The Arc 770, BMG 580, and iGPU (UHD) are not noticeably impacted. I think it's acceptable. Thank you!
arthw left a comment
Good job.
Sub-group size 16 is more useful on Intel GPUs.
The legacy code used 32 based on test results at the time.
With the latest driver and compiler, changing them to 16 makes the code clearer and easier to maintain.
It also confirms that 16 is the better value for all existing Intel GPUs.
There is a performance increase on B70/B60/PVC.
There is no impact on most of the older Intel dGPUs and iGPUs (Arc 770, BMG 580, iGPU).
Except for a -4% tg regression on Q4_K on PVC, there is no negative performance impact on other Intel GPUs.
Thank you!
Summary
Use WARP_SIZE (16) instead of QK_WARP_SIZE (32) for K-quant DMMV kernels (Q2_K through Q6_K).
These kernels were migrated from CUDA via DPCT and kept a 32-wide subgroup size. On Intel targets the native subgroup size is 16, and DPCT itself flagged all five kernels with register-pressure warnings recommending a smaller sub-group size. The non-K-quant DMMV path already uses WARP_SIZE (16).
Each thread now processes both halves of the QK_K=256 block via a for (int im = 0; im < 2; ++im) loop. The inner dot-product computation is unchanged; the diff is mostly re-indentation from wrapping existing code in that loop.
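To make the restructuring concrete, here is a minimal, hypothetical sketch (plain C++, not the actual dmmv.cpp kernel code) of the index math: with a 16-lane subgroup, each lane loops over both 128-element halves of the block, so the set of elements covered is identical to what a 32-wide launch covered. The function name and the 8-elements-per-(lane, im) granularity are illustrative assumptions.

```cpp
#include <array>

// Hypothetical sketch: which elements of one QK_K = 256 block get touched
// when 16 lanes each process both halves via the new im loop.
constexpr int QK_K = 256;

std::array<bool, QK_K> coverage_16_wide() {
    std::array<bool, QK_K> covered{};            // per-element coverage flags
    const int lanes = 16;                        // native Intel subgroup width
    for (int lane = 0; lane < lanes; ++lane) {
        for (int im = 0; im < 2; ++im) {         // the new loop over both halves
            // each (lane, im) pair handles 8 consecutive elements (illustrative)
            const int base = im * (QK_K / 2) + lane * 8;
            for (int l = 0; l < 8; ++l) {
                covered[base + l] = true;
            }
        }
    }
    return covered;                              // all 256 elements covered
}
```

Since 16 lanes x 2 halves x 8 elements = 256, total work per block is unchanged; only the lane-to-element mapping differs from the 32-wide version.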
Results
Update: My original benchmarks had a build configuration mismatch (GGML_SYCL_F16=ON in the test build vs OFF in the baseline), which inflated the pp numbers. The corrected results below use identical build flags. Thanks to @maxious for the independent testing that helped surface this.
DMMV is the single-token (tg) kernel path; it is only dispatched when src1->ne[1] == 1, so pp is unaffected by this change.
Arc Pro B70 (Xe2), single GPU, matched builds (GGML_SYCL_F16=OFF)
The Q2_K improvement is because at 2 bits/weight the DMMV kernel is
compute-bound rather than memory-bandwidth-bound, so subgroup efficiency
matters. Heavier K-quants are bandwidth-limited during tg, making the
compute path irrelevant.
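A minimal sketch of why pp is unaffected, assuming a simplified stand-in for the backend's dispatch logic (the struct and function here are hypothetical, not the real ggml SYCL code): DMMV is a matrix-times-vector path, so it is only chosen when src1 has a single column, i.e. single-token generation (tg); prompt processing batches many tokens and takes a different mul-mat path.

```cpp
// Hypothetical, simplified dispatch guard. In ggml, ne[] holds tensor
// dimensions; ne[1] is the number of columns of src1, which equals the
// number of tokens being processed at once.
struct tensor {
    long ne[4];                      // ggml-style shape array (illustrative)
};

bool use_dmmv(const tensor &src1) {
    return src1.ne[1] == 1;          // one column -> vector path (tg only)
}
```

Under this model, a tg step (one token) selects DMMV, while a pp batch of, say, 512 tokens never reaches the changed kernels, which matches the "no pp change" observation in the benchmarks.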
@maxious — Arc B60, independently tested
Consistent pattern: Q2_K tg improvement on smaller models, no pp change.
Validation
test-backend-ops passes for q2_K through q6_K individually, in both debug and release builds (79/79 tests)
perplexity unchanged: Q6_K (6.2160 vs 6.2169), wikitext-2-raw
llama-bench run with both SYCL devices active
the full MUL_MAT backend-ops suite crashes at a q8_0 → next-type transition, but this is pre-existing; the baseline build crashes at the same point
Scope
This PR does not touch the non-K-quant DMMV path, the #else QK_K != 256 branches (dead code, QK_K is always 256), or the QK_WARP_SIZE definition in presets.hpp. Those are intentionally left for separate cleanup.
No dependency on my other open PRs (#21638, #21597, #21580). #21580 also touches dmmv.cpp, but in a different section (BF16 dispatch), so they merge cleanly.
generation, analysis, and testing. All benchmarks and validation were
run on our hardware (Intel Arc Pro B70).