SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) #21580
Open
PMZFX wants to merge 1 commit into ggml-org:master from
Conversation
BF16 models had no dedicated token generation kernel: they fell through to the generic full-GEMM path, resulting in ~14% memory bandwidth utilization on Intel Arc GPUs. This adds BF16 support to the DMMV (dequantize mul-mat-vec) path, matching the existing F16 implementation. Fixes ggml-org#20478
Summary
BF16 models currently have no dedicated token generation (tg) kernel in the SYCL backend. During single-token generation, BF16 falls through to the generic ggml_sycl_op_mul_mat_sycl GEMM path, which dequantizes to FP32 and runs a full matrix multiply, far too heavy for a memory-bound batch=1 operation.

This adds BF16 to the DMMV (dequantize mul-mat-vec) path, following the existing F16 pattern.
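The difference can be sketched in plain (non-SYCL) C++: a DMMV kernel converts each BF16 weight to FP32 inside the dot-product loop, so every weight is streamed exactly once and no FP32 copy of the matrix is materialized. BF16 is simply the top 16 bits of an IEEE FP32 value, which is why the conversion is a shift. The `bf16_to_f32` and `dmmv_bf16` names below are illustrative, not the actual kernel code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// BF16 -> FP32: a bfloat16 value is the upper 16 bits of a float32.
static float bf16_to_f32(uint16_t b) {
    uint32_t bits = uint32_t(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Scalar sketch of a dequantize mul-mat-vec: y = W * x, with the BF16
// weights dequantized on the fly inside the loop instead of being
// expanded to an FP32 matrix first (as the generic GEMM path does).
void dmmv_bf16(const uint16_t *W, const float *x, float *y,
               int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int c = 0; c < cols; ++c) {
            acc += bf16_to_f32(W[r * cols + c]) * x[c];
        }
        y[r] = acc;
    }
}
```

At batch size 1 this is purely memory-bound, so kernel time is dominated by how fast the weights can be read, which is why the dedicated path matters.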
Changes
ggml/src/ggml-sycl/dmmv.cpp:
- convert_bf16(): reads sycl::ext::oneapi::bfloat16 and casts to float (mirrors convert_f16)
- convert_mul_mat_vec_bf16_sycl(): kernel launcher (mirrors the F16 version)
- BF16 added to the src1_convert_f16 list for half-precision intrinsics when GGML_SYCL_F16 is enabled
- GGML_SYCL_DMMV_HAS_BF16: compile-time bfloat16 header detection

ggml/src/ggml-sycl/ggml-sycl.cpp:
- GGML_TYPE_BF16 added to ggml_sycl_supports_dmmv()

Benchmark: Qwen2.5-1.5B, Intel Arc Pro B70 (Xe2), single GPU
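The ggml-sycl.cpp change is just a dispatch predicate: BF16 now takes the same fast DMMV route as F16 at batch size 1. A hypothetical sketch of its shape, using a stand-in enum rather than the real ggml_type from ggml.h (the actual function and case list in ggml-sycl.cpp may differ):

```cpp
#include <cassert>

// Stand-in for ggml_type, for illustration only.
enum ggml_type_demo { DEMO_F32, DEMO_F16, DEMO_BF16, DEMO_Q4_0 };

// Hypothetical shape of the DMMV support check: returns true for types
// that have a dedicated dequantize mul-mat-vec kernel.
static bool supports_dmmv(ggml_type_demo t) {
    switch (t) {
        case DEMO_F16:
        case DEMO_BF16:   // newly accepted by this PR
        case DEMO_Q4_0:   // quantized types already used DMMV
            return true;
        default:
            return false; // e.g. FP32 stays on the GEMM path
    }
}
```

Adding the type to the predicate is what reroutes batch=1 BF16 matmuls away from the full-GEMM fallback.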
BF16 bandwidth utilization goes from ~14% to ~58% of theoretical (608 GB/s).
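The utilization figure follows from a back-of-envelope model: a BF16 model streams roughly 2 bytes per parameter per generated token, so achieved bandwidth is about params × 2 × tokens/s. A sketch of that arithmetic; the throughput value in the test is illustrative, not a measured number from this PR:

```cpp
#include <cassert>

// Fraction of peak memory bandwidth used during token generation,
// assuming every BF16 weight (2 bytes) is read once per token.
double bf16_bw_utilization(double params, double peak_bw_bytes_per_s,
                           double tokens_per_s) {
    double bytes_per_token = params * 2.0;  // BF16: 2 bytes per weight
    return bytes_per_token * tokens_per_s / peak_bw_bytes_per_s;
}
```

For a 1.5B-parameter model against the Arc Pro B70's 608 GB/s theoretical bandwidth, ~58% utilization corresponds to a bit over 100 tokens/s.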
Testing
Built with -DGGML_SYCL=ON -DGGML_SYCL_F16=ON

Hardware
Note
This addresses the tg (token generation) path only. BF16 is still not included in the F16-specific special paths for permuted/batched operations (KQ, KQV). Those are separate and would be a broader change.
Fixes #20478
AI Disclosure
AI (Claude) assisted with investigating the dispatch path and drafting the kernel code. All code was human-reviewed, tested, and benchmarked on real hardware.