
vulkan: add cpy bf16 -> f32 pipelines #22677

Open
ServeurpersoCom wants to merge 3 commits into ggml-org:master from ServeurpersoCom:ggml/vulkan-add-cpy-bf16-f32

Conversation

@ServeurpersoCom
Contributor

@ServeurpersoCom ServeurpersoCom commented May 4, 2026

Overview

Add the missing reverse direction "cpy bf16 -> f32" to the Vulkan backend. Currently only "cpy f32 -> bf16" is supported, which causes runtime aborts when models or LoRAs stored in BF16 need to be transferred back to F32 buffers.

(typical case: merging a BF16-trained LoRA at runtime; yes, I'm merging on the GPU because it's much faster, and the same code works on CUDA)

Downstream issue (Successfully tested by me, awaiting user feedback): ServeurpersoCom/acestep.cpp#69
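
For illustration, here is a minimal graph-API sketch of the conversion this pipeline enables (my own example: shapes, names and setup are illustrative, not taken from the downstream code):

```cpp
// Hedged sketch: cast a BF16 tensor to F32 on the Vulkan backend through the graph API.
// ggml_cast lowers to a CPY node, so without a bf16 -> f32 pipeline the Vulkan backend
// cannot run this graph (the downstream log shows "Missing CPY op for types: bf16 f32").
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-vulkan.h"

int main() {
    ggml_backend_t backend = ggml_backend_vk_init(0); // Vulkan device 0

    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // tensor data lives in backend buffers
    };
    struct ggml_context * ctx = ggml_init(params);

    // BF16 source (think: a LoRA weight) cast back to F32
    struct ggml_tensor * src = ggml_new_tensor_2d(ctx, GGML_TYPE_BF16, 256, 4);
    struct ggml_tensor * dst = ggml_cast(ctx, src, GGML_TYPE_F32); // becomes a CPY op

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, dst);

    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    ggml_backend_graph_compute(backend, gf); // needs the bf16 -> f32 copy pipeline

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}
```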

Testing

With the PR:

root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -b Vulkan0 -o CPY 2>&1 | grep -E "type_src=bf16,type_dst=f32"
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
root@pod:/mnt/workspace/llama.cpp#

Without the PR:

root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -b Vulkan0 -o CPY 2>&1 | grep -E "type_src=bf16,type_dst=f32"
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
root@pod:/mnt/workspace/llama.cpp#

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, with Opus 4.7 and disposable rootless podman containers.

@ServeurpersoCom ServeurpersoCom requested a review from a team as a code owner May 4, 2026 11:34
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels May 4, 2026
@jeffbolznv
Contributor

Please add a test to test-backend-ops that would reproduce the original failure without this change. Otherwise, it looks good to me.

Add explicit cpy test cases for BF16 <-> F32 in both directions.

Address review feedback from @jeffbolznv
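
The added cases presumably follow the existing test_cpy pattern in tests/test-backend-ops.cpp; a rough sketch (the exact constructor arguments are an assumption, inferred from the parameter names the harness prints):

```cpp
// Sketch only, not the actual diff: explicit BF16 <-> F32 copy cases next to the existing
// test_cpy entries in tests/test-backend-ops.cpp (argument order assumed: type_src, type_dst,
// ne, permute_src).
test_cases.emplace_back(new test_cpy(GGML_TYPE_BF16, GGML_TYPE_F32, {256, 2, 3, 4}, {0, 2, 1, 3}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_BF16, {256, 2, 3, 4}, {0, 2, 1, 3}));
```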
@ServeurpersoCom ServeurpersoCom requested a review from ggerganov as a code owner May 4, 2026 15:23
@jeffbolznv
Contributor

It looks like there was some existing test coverage that was correctly reporting "not supported":

  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]

How did you run into the assert? Was it through some other op that expands bf16 to f32? Or were you not running through the graph API?

@ServeurpersoCom
Contributor Author

It looks like there was some existing test coverage that was correctly reporting "not supported":

  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]

How did you run into the assert? Was it through some other op that expands bf16 to f32? Or were you not running through the graph API?

You're right, I'm dropping the redundant test commit. The assert came from a non-CPY path: ggml_vk_get_cpy_pipeline is called directly inside several matmul variants to materialize a contiguous copy before the matmul runs, and none of those call sites consult supports_op because the surface op is MUL_MAT (Victor214's case is a BF16 LoRA hitting the matmul non-contig path with src0->type == BF16).
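
At the graph level the trigger looks roughly like this (a hedged sketch of my own, not code from the PR or the downstream issue):

```cpp
#include "ggml.h"

// Hypothetical construction: a MUL_MAT whose BF16 src0 is a permuted, non-contiguous view.
// supports_op only ever sees MUL_MAT here, but the Vulkan backend internally asks
// ggml_vk_get_cpy_pipeline for a contiguous copy of src0, which is where the missing
// bf16 -> f32 pipeline was hit.
static struct ggml_tensor * build_noncontig_bf16_mul_mat(struct ggml_context * ctx) {
    struct ggml_tensor * w  = ggml_new_tensor_3d(ctx, GGML_TYPE_BF16, 64, 16, 4);
    struct ggml_tensor * x  = ggml_new_tensor_3d(ctx, GGML_TYPE_F32,  64,  8, 16);
    struct ggml_tensor * wp = ggml_permute(ctx, w, 0, 2, 1, 3); // BF16 view, no longer contiguous
    return ggml_mul_mat(ctx, wp, x);                            // surface op is MUL_MAT, not CPY
}
```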

root@pod:/mnt/workspace/llama.cpp# ./build-vulkan/bin/test-backend-ops -b Vulkan0 -o CPY 2>&1 | grep -E "type_src=(bf16,type_dst=f32|f32,type_dst=bf16),ne=\[256,2,3,4\]"
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=f32,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=f32,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=f32,type_dst=bf16,ne=[256,2,3,4],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]

@ServeurpersoCom
Contributor Author

ServeurpersoCom commented May 4, 2026

I'm checking my LoRA/LoKr merge code...

https://github.com/ServeurpersoCom/acestep.cpp/blob/master/src/adapter-merge.h
It is the graph API path; what makes you think otherwise?
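
Roughly, the merge looks like this at the graph level (a simplified sketch with illustrative names, not a copy of adapter-merge.h):

```cpp
#include "ggml.h"

// Hedged illustration of a GPU-side LoRA merge through the graph API. The ggml_cast calls
// lower to CPY nodes, so BF16-trained adapters need the bf16 -> f32 copy pipeline from this PR.
static struct ggml_tensor * merge_lora_sketch(struct ggml_context * ctx,
                                              struct ggml_tensor  * w,       // F32  [n_in, n_out]
                                              struct ggml_tensor  * lora_a,  // BF16 [r, n_in]
                                              struct ggml_tensor  * lora_b,  // BF16 [r, n_out]
                                              float alpha, float rank) {
    struct ggml_tensor * a32   = ggml_cast(ctx, lora_a, GGML_TYPE_F32); // bf16 -> f32 CPY
    struct ggml_tensor * b32   = ggml_cast(ctx, lora_b, GGML_TYPE_F32); // bf16 -> f32 CPY
    struct ggml_tensor * delta = ggml_mul_mat(ctx, a32, b32);           // [n_in, n_out]
    delta = ggml_scale(ctx, delta, alpha / rank);                       // LoRA scaling
    return ggml_add(ctx, w, delta);                                     // W' = W + (alpha/r)·B·A
}
```

In the real code the merged result is then written back into the weight buffer; the point here is only that BF16 adapter tensors enter the graph through casts that lower to CPY nodes.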

@jeffbolznv
Contributor

If we're missing test coverage for matmul with noncontig sources, that would be good to add.

@ServeurpersoCom
Contributor Author

If we're missing test coverage for matmul with noncontig sources, that would be good to add.

I'll check this.

@ServeurpersoCom
Contributor Author

Ah, I understand! Actually, there are 3 relevant states:

  • OK -> correct math (output matches CPU reference within tolerance)
  • FAIL -> incorrect math (NaN, Inf mismatch, or error above threshold)
  • Not supported -> the backend declares it can't handle the op and the test is skipped
    (not a failure) -> this was the case before this PR

With my patch I can confirm that the LoRA loads correctly and the conversion now reports "OK". Before the patch, "not supported" was the correct report.

@ServeurpersoCom
Contributor Author

ServeurpersoCom commented May 4, 2026

This is my final word:

With the PR:

root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -b Vulkan0 -o CPY 2>&1 | grep -E "type_src=bf16,type_dst=f32"
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
root@pod:/mnt/workspace/llama.cpp#

Without the PR:

root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -b Vulkan0 -o CPY 2>&1 | grep -E "type_src=bf16,type_dst=f32"
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
root@pod:/mnt/workspace/llama.cpp#

@jeffbolznv
Contributor

"Not supported" wouldn't have triggered the assertion failure. I think you were on the right track that there's some matmul permutation case that wasn't handled. But if it's hard to find, ultimately I can live without it (we do have coverage for CPY).

@ServeurpersoCom
Contributor Author

ServeurpersoCom commented May 4, 2026

"Not supported" wouldn't have triggered the assertion failure. I think you were on the right track that there's some matmul permutation case that wasn't handled. But if it's hard to find, ultimately I can live without it (we do have coverage for CPY).

I'm still looking into it! But I'll create a follow-up PR to extend coverage.

The actual error in my logs is "Missing CPY op for types: bf16 f32".
I removed the word "assert"; it must have been a hallucination of my LLM (Opus 4.7, which I use for coding and translation) and it might have made you think of a different bug.
I've updated the first message of the PR. It's atomic and clean.

Original log (no assert):

load_backend: loaded Vulkan backend from C:\acestep.cpp\build\Release\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\acestep.cpp\build\Release\ggml-cpu-cascadelake.dll
[Load] DiT backend: Vulkan0 (CPU threads: 8)
[GGUF] .\models\AceStep_v15_XL_SFT_Q8.gguf: 830 tensors, data at offset 69056
[DiT] Self-attn: Q+K+V fused
[DiT] Cross-attn: Q+K+V fused
[DiT] MLP: gate+up fused
[Load] null_condition_emb found (CFG available)
[Safetensors] .\adapters\SideStepAdapter/adapter_model.safetensors: 704 tensors
[Adapter] adapter_config.json: alpha=128
Missing CPY op for types: bf16 f32
D:\A\acestep.cpp\ggml\src\ggml-vulkan\ggml-vulkan.cpp:7451: fatal error

