
vulkan: add cpy bf16 -> f32 pipelines #22677

Open
ServeurpersoCom wants to merge 3 commits into ggml-org:master from ServeurpersoCom:ggml/vulkan-add-cpy-bf16-f32

Conversation

@ServeurpersoCom
Contributor

@ServeurpersoCom ServeurpersoCom commented May 4, 2026

Overview

Add the missing reverse direction "cpy bf16 -> f32" to the Vulkan backend. Currently only "cpy f32 -> bf16" is supported, which causes runtime aborts when models or LoRAs stored in BF16 need to be transferred back to F32 buffers.

(typical case: merging a BF16-trained LoRA at runtime; yes, I'm merging on the GPU because it's much faster, and the same code works on CUDA)

Downstream issue (Successfully tested by me, awaiting user feedback): ServeurpersoCom/acestep.cpp#69
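
For illustration, here is a minimal graph-API sketch of the conversion this pipeline enables (my own example: shapes, names and setup are illustrative, not taken from the downstream code):

```cpp
// Hedged sketch: cast a BF16 tensor to F32 on the Vulkan backend through the graph API.
// ggml_cast lowers to a CPY node, so without a bf16 -> f32 pipeline the Vulkan backend
// cannot run this graph (the downstream log shows "Missing CPY op for types: bf16 f32").
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-vulkan.h"

int main() {
    ggml_backend_t backend = ggml_backend_vk_init(0); // Vulkan device 0

    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // tensor data lives in backend buffers
    };
    struct ggml_context * ctx = ggml_init(params);

    // BF16 source (think: a LoRA weight) cast back to F32
    struct ggml_tensor * src = ggml_new_tensor_2d(ctx, GGML_TYPE_BF16, 256, 4);
    struct ggml_tensor * dst = ggml_cast(ctx, src, GGML_TYPE_F32); // becomes a CPY op

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, dst);

    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    ggml_backend_graph_compute(backend, gf); // needs the bf16 -> f32 copy pipeline

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}
```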

Testing

With the PR:

root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -b Vulkan0 -o CPY 2>&1 | grep -E "type_src=bf16,type_dst=f32"
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
root@pod:/mnt/workspace/llama.cpp#

Without the PR:

root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -b Vulkan0 -o CPY 2>&1 | grep -E "type_src=bf16,type_dst=f32"
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
root@pod:/mnt/workspace/llama.cpp#

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, with Opus 4.7 and disposable rootless podman containers.

@ServeurpersoCom ServeurpersoCom requested a review from a team as a code owner May 4, 2026 11:34
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels May 4, 2026
@jeffbolznv
Contributor

Please add a test to test-backend-ops that would reproduce the original failure without this change. Otherwise, it looks good to me.

Add explicit cpy test cases for BF16 <-> F32 in both directions.

Address review feedback from @jeffbolznv
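
The added cases presumably follow the existing test_cpy pattern in tests/test-backend-ops.cpp; a rough sketch (the exact constructor arguments are an assumption, inferred from the parameter names the harness prints):

```cpp
// Sketch only, not the actual diff: explicit BF16 <-> F32 copy cases next to the existing
// test_cpy entries in tests/test-backend-ops.cpp (argument order assumed: type_src, type_dst,
// ne, permute_src).
test_cases.emplace_back(new test_cpy(GGML_TYPE_BF16, GGML_TYPE_F32, {256, 2, 3, 4}, {0, 2, 1, 3}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_BF16, {256, 2, 3, 4}, {0, 2, 1, 3}));
```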
@ServeurpersoCom ServeurpersoCom requested a review from ggerganov as a code owner May 4, 2026 15:23
@jeffbolznv
Contributor

It looks like there was some existing test coverage that was correctly reporting "not supported":

  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]

How did you run into the assert? Was it through some other op that expands bf16 to f32? Or were you not running through the graph API?

@ServeurpersoCom
Contributor Author

It looks like there was some existing test coverage that was correctly reporting "not supported":

  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]

How did you run into the assert? Was it through some other op that expands bf16 to f32? Or were you not running through the graph API?

You're right, I'm dropping the redundant test commit. The assert came from a non-CPY path: ggml_vk_get_cpy_pipeline is called directly inside several matmul variants to materialize a contiguous copy before the matmul runs, and none of those call sites consult supports_op because the surface op is MUL_MAT (Victor214's case is a BF16 LoRA hitting the matmul non-contig path with src0->type == BF16).
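
At the graph level the trigger looks roughly like this (a hedged sketch of my own, not code from the PR or the downstream issue):

```cpp
#include "ggml.h"

// Hypothetical construction: a MUL_MAT whose BF16 src0 is a permuted, non-contiguous view.
// supports_op only ever sees MUL_MAT here, but the Vulkan backend internally asks
// ggml_vk_get_cpy_pipeline for a contiguous copy of src0, which is where the missing
// bf16 -> f32 pipeline was hit.
static struct ggml_tensor * build_noncontig_bf16_mul_mat(struct ggml_context * ctx) {
    struct ggml_tensor * w  = ggml_new_tensor_3d(ctx, GGML_TYPE_BF16, 64, 16, 4);
    struct ggml_tensor * x  = ggml_new_tensor_3d(ctx, GGML_TYPE_F32,  64,  8, 16);
    struct ggml_tensor * wp = ggml_permute(ctx, w, 0, 2, 1, 3); // BF16 view, no longer contiguous
    return ggml_mul_mat(ctx, wp, x);                            // surface op is MUL_MAT, not CPY
}
```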

root@pod:/mnt/workspace/llama.cpp# ./build-vulkan/bin/test-backend-ops -b Vulkan0 -o CPY 2>&1 | grep -E "type_src=(bf16,type_dst=f32|f32,type_dst=bf16),ne=\[256,2,3,4\]"
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=f32,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=f32,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=f32,type_dst=bf16,ne=[256,2,3,4],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]

@ServeurpersoCom
Contributor Author

ServeurpersoCom commented May 4, 2026

I'm checking my LoRA/LoKr merge code...

https://github.com/ServeurpersoCom/acestep.cpp/blob/master/src/adapter-merge.h
It is the graph API path; what makes you think otherwise?
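
Roughly, the merge looks like this at the graph level (a simplified sketch with illustrative names, not a copy of adapter-merge.h):

```cpp
#include "ggml.h"

// Hedged illustration of a GPU-side LoRA merge through the graph API. The ggml_cast calls
// lower to CPY nodes, so BF16-trained adapters need the bf16 -> f32 copy pipeline from this PR.
static struct ggml_tensor * merge_lora_sketch(struct ggml_context * ctx,
                                              struct ggml_tensor  * w,       // F32  [n_in, n_out]
                                              struct ggml_tensor  * lora_a,  // BF16 [r, n_in]
                                              struct ggml_tensor  * lora_b,  // BF16 [r, n_out]
                                              float alpha, float rank) {
    struct ggml_tensor * a32   = ggml_cast(ctx, lora_a, GGML_TYPE_F32); // bf16 -> f32 CPY
    struct ggml_tensor * b32   = ggml_cast(ctx, lora_b, GGML_TYPE_F32); // bf16 -> f32 CPY
    struct ggml_tensor * delta = ggml_mul_mat(ctx, a32, b32);           // [n_in, n_out]
    delta = ggml_scale(ctx, delta, alpha / rank);                       // LoRA scaling
    return ggml_add(ctx, w, delta);                                     // W' = W + (alpha/r)·B·A
}
```

In the real code the merged result is then written back into the weight buffer; the point here is only that BF16 adapter tensors enter the graph through casts that lower to CPY nodes.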

@jeffbolznv
Contributor

If we're missing test coverage for matmul with noncontig sources, that would be good to add.

@ServeurpersoCom
Contributor Author

If we're missing test coverage for matmul with noncontig sources, that would be good to add.

I'll check this.

@ServeurpersoCom
Contributor Author

Ah, I understand! Actually, there are 3 relevant states:

  • OK -> correct math (output matches CPU reference within tolerance)
  • FAIL -> incorrect math (NaN, Inf mismatch, or error above threshold)
  • Not supported -> the backend declares it can't handle the op and the test is skipped
    (not a failure) -> this was the case before this PR

With my patch I can confirm that the LoRA loads correctly and the conversion now reports "OK". Before the patch, "not supported" was the correct report.

@ServeurpersoCom
Contributor Author

ServeurpersoCom commented May 4, 2026

This is my final word:

With the PR:

root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -b Vulkan0 -o CPY 2>&1 | grep -E "type_src=bf16,type_dst=f32"
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): OK
root@pod:/mnt/workspace/llama.cpp#

Without the PR:

root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -b Vulkan0 -o CPY 2>&1 | grep -E "type_src=bf16,type_dst=f32"
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0],_src_transpose=0): not supported [Vulkan0]
root@pod:/mnt/workspace/llama.cpp#

@jeffbolznv
Contributor

"Not supported" wouldn't have triggered the assertion failure. I think you were on the right track that there's some matmul permutation case that wasn't handled. But if it's hard to find, ultimately I can live without it (we do have coverage for CPY).

@ServeurpersoCom
Contributor Author

ServeurpersoCom commented May 4, 2026

"Not supported" wouldn't have triggered the assertion failure. I think you were on the right track that there's some matmul permutation case that wasn't handled. But if it's hard to find, ultimately I can live without it (we do have coverage for CPY).

I'm still looking into it! But I'll create a follow-up PR to extend coverage.

The actual error in my logs is "Missing CPY op for types: bf16 f32".
I removed the word "assert"; it must have been a hallucination of my LLM (Opus 4.7, which I use for coding and translation) and it might have made you think of a different bug.
I've updated the first message of the PR. It's atomic and clean.

Original log (no assert):

load_backend: loaded Vulkan backend from C:\acestep.cpp\build\Release\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\acestep.cpp\build\Release\ggml-cpu-cascadelake.dll
[Load] DiT backend: Vulkan0 (CPU threads: 8)
[GGUF] .\models\AceStep_v15_XL_SFT_Q8.gguf: 830 tensors, data at offset 69056
[DiT] Self-attn: Q+K+V fused
[DiT] Cross-attn: Q+K+V fused
[DiT] MLP: gate+up fused
[Load] null_condition_emb found (CFG available)
[Safetensors] .\adapters\SideStepAdapter/adapter_model.safetensors: 704 tensors
[Adapter] adapter_config.json: alpha=128
Missing CPY op for types: bf16 f32
D:\A\acestep.cpp\ggml\src\ggml-vulkan\ggml-vulkan.cpp:7451: fatal error

