
Conversation

@ggerganov (Member) commented Oct 12, 2025

target #16528

  • Add Metal FA kernels for F32 K and V
  • Add Metal FA kernels for head size 32
  • Remove K and V casts with cacheless contexts (we should keep the casts for now)

Sample command for testing:

llama-embedding -hf ggml-org/bge-small-en-v1.5-Q8_0-GGUF -e -p "$(printf 'hello %.0s' {1..510})" --pooling cls -c 512 -fa on

@github-actions bot added the testing, ggml, and Apple Metal labels on Oct 12, 2025
@ggerganov (Member, Author) commented Oct 12, 2025

@JohannesGaessler @jeffbolznv Would it be possible to add support for F32 K and V tensors in the respective backends?

The issue is that these casts on master significantly increase the compute buffer size for embedding models (see #15586 (comment)):

llama.cpp/src/llama-graph.cpp, lines 1313 to 1326 in 4b2dae3:

// this can happen when KV cache is not used (e.g. an embedding model with non-causal attn)
if (k->type == GGML_TYPE_F32) {
    k = ggml_cast(ctx0, k, GGML_TYPE_F16);
}

if (v->type == GGML_TYPE_F32) {
    v = ggml_cast(ctx0, v, GGML_TYPE_F16);
}

cur = ggml_flash_attn_ext(ctx0, q, k, v, kq_mask, kq_scale, hparams.f_max_alibi_bias,
        hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);

cb(cur, LLAMA_TENSOR_NAME_FATTN, il);

If we remove the casts, the memory usage should be significantly reduced for this use case. But to remove them, the FA implementation has to support k->type == GGML_TYPE_F32 && v->type == GGML_TYPE_F32 in the ggml_flash_attn_ext() operator.
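
For reference, if the backends accepted F32 K and V directly, the call site quoted above would reduce to just the FA call. This is an illustrative sketch of the intended end state, not the final patch:

// K and V are passed to the FA op as F32, without the ggml_cast to F16;
// this requires ggml_flash_attn_ext() to handle GGML_TYPE_F32 K/V in each backend
cur = ggml_flash_attn_ext(ctx0, q, k, v, kq_mask, kq_scale, hparams.f_max_alibi_bias,
        hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);

cb(cur, LLAMA_TENSOR_NAME_FATTN, il);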

@JohannesGaessler (Collaborator) commented Oct 12, 2025

It's definitely possible, but it will require additional consideration w.r.t. SRAM limits. For the tile kernel, what would need to be done is to select FP16 vs. FP32 via a template parameter rather than the FAST_FP16_AVAILABLE macro, and to determine which kernel parameters, if any, can still be made to fit in SRAM when storing the KV data needs twice as much space as with FP16.
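
To put a rough number on the SRAM pressure, here is a standalone illustration (the 48 KiB shared-memory budget, head size and tile layout are assumptions for the example, not values taken from the CUDA backend): with a fixed per-block budget, FP32 K/V halves the number of KV rows a tile can hold compared to FP16.

#include <cstdio>

int main() {
    const int smem_bytes = 48 * 1024; // assumed per-block shared-memory budget
    const int head_dim   = 128;       // assumed head size D

    for (int elem_size : {2, 4}) {    // 2 bytes = FP16, 4 bytes = FP32 K/V elements
        const int bytes_per_kv_row = 2 * head_dim * elem_size; // one K row + one V row
        printf("%s K/V: %d KV rows fit per tile\n",
               elem_size == 2 ? "FP16" : "FP32", smem_bytes / bytes_per_kv_row);
    }
    return 0;
}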

@JohannesGaessler (Collaborator) commented:

Do the models for which this is relevant use GQA?

@JohannesGaessler (Collaborator) commented:

We could also consider making the operations preceding FA write back their data as FP16 in the first place. In terms of performance that would definitely be preferable for all CUDA/ROCm GPUs except for Pascal.

@ggerganov (Member, Author) commented:

Do the models for which this is relevant use GQA?

Generally yes.

If it would make the implementation simpler, maybe we can treat F32 K and V as just another "quantization" type, where the dequantize function is a cast to F16?
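
A minimal sketch of that idea (the function name and shape are hypothetical, modeled on the per-row dequantize routines used for real quantized types; ggml_fp32_to_fp16() is ggml's public scalar conversion helper):

#include "ggml.h"
#include <stdint.h>

// hypothetical "dequantize" for an F32 source: same per-row shape as the
// dequantize_row_* routines for real quantized types, but the conversion is
// just a narrowing cast from F32 to F16
static void dequantize_row_f32(const float * x, ggml_fp16_t * y, int64_t k) {
    for (int64_t i = 0; i < k; ++i) {
        y[i] = ggml_fp32_to_fp16(x[i]);
    }
}

(In practice this loop is essentially what ggml_fp32_to_fp16_row() already does.)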

@JohannesGaessler (Collaborator) commented:

For CUDA that can definitely be done with comparatively little effort, but it would not eliminate the additional memory use; it would just shift it from the compute buffer to the buffer pool in the CUDA backend.

@ggerganov ggerganov marked this pull request as ready for review October 12, 2025 13:58
@ggerganov ggerganov requested a review from slaren as a code owner October 12, 2025 13:58
@jeffbolznv (Collaborator) commented:

I think this should be relatively straightforward in the Vulkan backend; I'll look into it. This comment is how I'd expect to implement it (we dequantize while loading, so no extra memory usage):

If it would make the implementation simpler, maybe we can treat F32 K and V as just another "quantization" type, where the dequantize function is a cast to F16?

@jeffbolznv (Collaborator) commented:

Done for Vulkan in #16543

@JohannesGaessler (Collaborator) commented:

Basic CUDA support in #16546.

@ggerganov ggerganov requested a review from CISC as a code owner October 13, 2025 14:30