
Conversation

@ggerganov (Member) commented Oct 12, 2025

target #16528

  • Add Metal FA kernels for F32 K and V
  • Add Metal FA kernels for head size 32
  • Remove K and V casts with cacheless contexts (we should keep the casts for now)

Sample command for testing:

llama-embedding -hf ggml-org/bge-small-en-v1.5-Q8_0-GGUF -e -p "$(printf 'hello %.0s' {1..510})" --pooling cls -c 512 -fa on

@github-actions bot added the testing, ggml, and Apple Metal labels on Oct 12, 2025
@ggerganov (Member, Author) commented Oct 12, 2025

@JohannesGaessler @jeffbolznv Would it be possible to add support for F32 K and V tensors in the respective backends?

The issue is that these casts on master significantly increase the compute buffer size for embedding models (see #15586 (comment)):

llama.cpp/src/llama-graph.cpp, lines 1313 to 1326 in 4b2dae3:

// this can happen when KV cache is not used (e.g. an embedding model with non-causal attn)
if (k->type == GGML_TYPE_F32) {
    k = ggml_cast(ctx0, k, GGML_TYPE_F16);
}

if (v->type == GGML_TYPE_F32) {
    v = ggml_cast(ctx0, v, GGML_TYPE_F16);
}

cur = ggml_flash_attn_ext(ctx0, q, k, v, kq_mask, kq_scale, hparams.f_max_alibi_bias,
        hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);

cb(cur, LLAMA_TENSOR_NAME_FATTN, il);

If we remove the casts, the memory usage should be significantly reduced for this use case. But to remove them, the FA implementation has to support k->type == GGML_TYPE_F32 && v->type == GGML_TYPE_F32 in the ggml_flash_attn_ext() operator.
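
For reference, if the backends accepted F32 K and V directly, the call site quoted above would reduce to just the FA call. This is an illustrative sketch of the intended end state, not the final patch:

// K and V are passed to the FA op as F32, without the ggml_cast to F16;
// this requires ggml_flash_attn_ext() to handle GGML_TYPE_F32 K/V in each backend
cur = ggml_flash_attn_ext(ctx0, q, k, v, kq_mask, kq_scale, hparams.f_max_alibi_bias,
        hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);

cb(cur, LLAMA_TENSOR_NAME_FATTN, il);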

@JohannesGaessler (Collaborator) commented Oct 12, 2025

It's definitely possible, but it will require additional consideration w.r.t. SRAM limits. For the tile kernel, what would need to be done is to select FP16 vs. FP32 via a template parameter rather than the FAST_FP16_AVAILABLE macro, and to determine which kernel parameters, if any, can still be made to fit in SRAM when storing the KV data needs twice as much space as with FP16.
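
To put a rough number on the SRAM pressure, here is a standalone illustration (the 48 KiB shared-memory budget, head size and tile layout are assumptions for the example, not values taken from the CUDA backend): with a fixed per-block budget, FP32 K/V halves the number of KV rows a tile can hold compared to FP16.

#include <cstdio>

int main() {
    const int smem_bytes = 48 * 1024; // assumed per-block shared-memory budget
    const int head_dim   = 128;       // assumed head size D

    for (int elem_size : {2, 4}) {    // 2 bytes = FP16, 4 bytes = FP32 K/V elements
        const int bytes_per_kv_row = 2 * head_dim * elem_size; // one K row + one V row
        printf("%s K/V: %d KV rows fit per tile\n",
               elem_size == 2 ? "FP16" : "FP32", smem_bytes / bytes_per_kv_row);
    }
    return 0;
}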

@JohannesGaessler (Collaborator) commented:

Do the models for which this is relevant use GQA?

@JohannesGaessler (Collaborator) commented:

We could also consider making the operations preceding FA write back their data as FP16 in the first place. In terms of performance that would definitely be preferable for all CUDA/ROCm GPUs except for Pascal.

@ggerganov (Member, Author) commented:

Do the models for which this is relevant use GQA?

Generally yes.

If it would make the implementation simpler, maybe we can treat F32 K and V as just another "quantization" type, where the dequantize function is a cast to F16?
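
A minimal sketch of that idea (the function name and shape are hypothetical, modeled on the per-row dequantize routines used for real quantized types; ggml_fp32_to_fp16() is ggml's public scalar conversion helper):

#include "ggml.h"
#include <stdint.h>

// hypothetical "dequantize" for an F32 source: same per-row shape as the
// dequantize_row_* routines for real quantized types, but the conversion is
// just a narrowing cast from F32 to F16
static void dequantize_row_f32(const float * x, ggml_fp16_t * y, int64_t k) {
    for (int64_t i = 0; i < k; ++i) {
        y[i] = ggml_fp32_to_fp16(x[i]);
    }
}

(In practice this loop is essentially what ggml_fp32_to_fp16_row() already does.)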

@JohannesGaessler (Collaborator) commented:

For CUDA that can definitely be done with comparatively little effort, but it would not eliminate the additional memory use; it would just shift it from the compute buffer to the buffer pool in the CUDA backend.

@ggerganov ggerganov marked this pull request as ready for review October 12, 2025 13:58
@ggerganov ggerganov requested a review from slaren as a code owner October 12, 2025 13:58
@jeffbolznv (Collaborator) commented:

I think this should be relatively straightforward in the Vulkan backend; I'll look into it. This comment is how I'd expect to implement it (we dequantize while loading, so no extra memory usage):

If it would make the implementation simpler, maybe we can treat F32 K and V as just another "quantization" type, where the dequantize function is a cast to F16?

@jeffbolznv (Collaborator) commented:

Done for Vulkan in #16543

@JohannesGaessler (Collaborator) commented:

Basic CUDA support in #16546.

@ggerganov ggerganov requested a review from CISC as a code owner October 13, 2025 14:30