
SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) #21580

Open
PMZFX wants to merge 1 commit into ggml-org:master from PMZFX:sycl-bf16-dmmv

Conversation

@PMZFX PMZFX (Contributor) commented Apr 7, 2026

Summary

BF16 models currently have no dedicated token generation (tg) kernel in the SYCL backend. During single-token generation, BF16 falls through to the generic ggml_sycl_op_mul_mat_sycl GEMM path, which dequantizes to FP32 and runs a full matrix multiply — far too heavy for a memory-bound batch=1 operation.

This adds BF16 to the DMMV (dequantize mul-mat-vec) path, following the existing F16 pattern.

Changes

ggml/src/ggml-sycl/dmmv.cpp:

  • convert_bf16() — reads sycl::ext::oneapi::bfloat16, casts to float (mirrors convert_f16)
  • convert_mul_mat_vec_bf16_sycl() — kernel launcher (mirrors F16 version)
  • Added BF16 to the DMMV dispatch switch
  • Added BF16 to the src1_convert_f16 list for half-precision intrinsics when GGML_SYCL_F16 is enabled
  • All BF16 code guarded behind GGML_SYCL_DMMV_HAS_BF16 (compile-time bfloat16 header detection)

ggml/src/ggml-sycl/ggml-sycl.cpp:

  • Added GGML_TYPE_BF16 to ggml_sycl_supports_dmmv()

Benchmark — Qwen2.5-1.5B, Intel Arc Pro B70 (Xe2), single GPU

| Format | Size     | pp512 (before) | pp512 (after) | tg128 (before) | tg128 (after) | tg speedup |
|--------|----------|----------------|---------------|----------------|---------------|------------|
| Q4_K_M | 1.04 GiB | 8777           | 8778          | 202.6          | 202.6         | 1.0x       |
| Q8_0   | 1.76 GiB | 9304           | 9304          | 180.6          | 180.6         | 1.0x       |
| BF16   | 2.88 GiB | 2580           | 4887          | 29.7           | 123.9         | 4.2x       |

Throughput numbers are tokens/s.

BF16 bandwidth utilization goes from ~14% to ~58% of theoretical (608 GB/s).

Testing

  • Builds cleanly with -DGGML_SYCL=ON -DGGML_SYCL_F16=ON
  • Token generation produces correct output (verified text coherence)
  • No regressions on Q4_K_M, Q8_0, or larger 9B models
  • Tested on Qwen2.5-1.5B and Qwen3.5-9B
  • Not yet tested on Intel Arc A-series (Alchemist) — would appreciate community testing

Hardware

  • Intel Arc Pro B70 (BMG-G31, 32 GB GDDR6 ECC, 608 GB/s)
  • Driver: libze-intel-gpu1 26.09.37435.1, IGC 2.30.1
  • oneAPI DPC++ 2025.3.3

Note

This addresses the tg (token generation) path only. BF16 is still not included in the F16-specific special paths for permuted/batched operations (KQ, KQV). Those are separate and would be a broader change.

Fixes #20478

AI Disclosure

AI (Claude) assisted with investigating the dispatch path and drafting the kernel code. All code was human-reviewed, tested, and benchmarked on real hardware.

BF16 models had no dedicated token generation kernel — they fell through
to the generic full-GEMM path, resulting in ~14% memory bandwidth
utilization on Intel Arc GPUs. This adds BF16 support to the DMMV
(dequantize mul-mat-vec) path, matching the existing F16 implementation.

Fixes ggml-org#20478