
SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) #21580

Open
PMZFX wants to merge 1 commit into ggml-org:master from PMZFX:sycl-bf16-dmmv

Conversation

@PMZFX PMZFX (Contributor) commented Apr 7, 2026

Summary

BF16 models currently have no dedicated token generation (tg) kernel in the SYCL backend. During single-token generation, BF16 falls through to the generic ggml_sycl_op_mul_mat_sycl GEMM path, which dequantizes to FP32 and runs a full matrix multiply — far too heavy for a memory-bound batch=1 operation.

This adds BF16 to the DMMV (dequantize mul-mat-vec) path, following the existing F16 pattern.

Changes

ggml/src/ggml-sycl/dmmv.cpp:

  • convert_bf16() — reads sycl::ext::oneapi::bfloat16, casts to float (mirrors convert_f16)
  • convert_mul_mat_vec_bf16_sycl() — kernel launcher (mirrors F16 version)
  • Added BF16 to the DMMV dispatch switch
  • Added BF16 to the src1_convert_f16 list for half-precision intrinsics when GGML_SYCL_F16 is enabled
  • All BF16 code guarded behind GGML_SYCL_DMMV_HAS_BF16 (compile-time bfloat16 header detection)

ggml/src/ggml-sycl/ggml-sycl.cpp:

  • Added GGML_TYPE_BF16 to ggml_sycl_supports_dmmv()

Benchmark — Qwen2.5-1.5B, Intel Arc Pro B70 (Xe2), single GPU

| Format | Size     | pp512 (before) | pp512 (after) | tg128 (before) | tg128 (after) | tg speedup |
|--------|----------|----------------|---------------|----------------|---------------|------------|
| Q4_K_M | 1.04 GiB | 8777           | 8778          | 202.6          | 202.6         | 1.0x       |
| Q8_0   | 1.76 GiB | 9304           | 9304          | 180.6          | 180.6         | 1.0x       |
| BF16   | 2.88 GiB | 2580           | 4887          | 29.7           | 123.9         | 4.2x       |

Throughput numbers are tokens/s.

BF16 bandwidth utilization goes from ~14% to ~58% of theoretical (608 GB/s).

Testing

  • Builds cleanly with -DGGML_SYCL=ON -DGGML_SYCL_F16=ON
  • Token generation produces correct output (verified text coherence)
  • No regressions on Q4_K_M, Q8_0, or larger 9B models
  • Tested on Qwen2.5-1.5B and Qwen3.5-9B
  • Not yet tested on Intel Arc A-series (Alchemist) — would appreciate community testing

Hardware

  • Intel Arc Pro B70 (BMG-G31, 32 GB GDDR6 ECC, 608 GB/s)
  • Driver: libze-intel-gpu1 26.09.37435.1, IGC 2.30.1
  • oneAPI DPC++ 2025.3.3

Note

This addresses the tg (token generation) path only. BF16 is still not included in the F16-specific special paths for permuted/batched operations (KQ, KQV). Those are separate and would be a broader change.

Fixes #20478

AI Disclosure

AI (Claude) assisted with investigating the dispatch path and drafting the kernel code. All code was human-reviewed, tested, and benchmarked on real hardware.

BF16 models had no dedicated token generation kernel — they fell through
to the generic full-GEMM path, resulting in ~14% memory bandwidth
utilization on Intel Arc GPUs. This adds BF16 support to the DMMV
(dequantize mul-mat-vec) path, matching the existing F16 implementation.

Fixes ggml-org#20478