
Feature Request: Add kv-quant fa kernel variants for head sizes other than 128 #12989

@pl752

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Currently llama.cpp has many variants of the FlashAttention (FA) kernel for head size 128, but only a few for head sizes 64 and 256. As a result, using -ctk != f16 with a model whose head size is not 128 causes the attention op to fall back to the CPU.
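
For context, a minimal sketch of the dispatch behaviour described above, assuming a support check of roughly this shape (the names and the exact coverage below are illustrative, not the actual ggml-cuda code):

```cpp
enum kv_type { KV_F16, KV_Q8_0, KV_Q4_0 };

// Hypothetical support check: a specialized kernel must have been compiled in
// for this exact (head size, K type, V type) combination, otherwise the FA op
// is not offloaded and runs on the CPU instead.
bool fa_kernel_available(int head_size, kv_type type_k, kv_type type_v) {
    if (type_k == KV_F16 && type_v == KV_F16) {
        return true;          // f16 KV cache: kernels exist for the common head sizes
    }
    // Quantized KV cache: head size 128 has many type combinations compiled in,
    // 64 and 256 only a few (simplified to "none" here for illustration).
    return head_size == 128;
}
```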

Motivation

Llama 3.2 1B and Gemma 3 12B use head sizes 64 and 256 respectively, and both appear to be quite popular models for some applications.

Possible Implementation

More kernel templates need to be added. A flag such as GGML_CUDA_FA_ALL_KVQ_HS could also be added to opt into these templates, since I understand that adding them will increase compilation times dramatically.
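
A rough sketch of the idea, assuming the extra head sizes are provided as additional explicit template instantiations gated behind the proposed opt-in compile flag (the template and enum names are placeholders, not the real ggml-cuda symbols):

```cpp
enum kv_type { KV_F16, KV_Q8_0, KV_Q4_0 };

// One templated launcher per (head size, K type, V type) combination; in the
// real backend this would launch a CUDA kernel specialized for that case.
template <int HEAD_SIZE, kv_type TYPE_K, kv_type TYPE_V>
void flash_attn_kvq_case() {
    // kernel launch specialized for HEAD_SIZE and the quantized K/V layouts
}

// Always compiled: the common head size.
template void flash_attn_kvq_case<128, KV_Q8_0, KV_Q8_0>();
template void flash_attn_kvq_case<128, KV_Q4_0, KV_Q4_0>();

#ifdef GGML_CUDA_FA_ALL_KVQ_HS
// Opt-in extras: head size 64 (e.g. Llama 3.2 1B) and 256 (e.g. Gemma 3 12B).
// Keeping these behind the flag leaves default build times unchanged.
template void flash_attn_kvq_case< 64, KV_Q8_0, KV_Q8_0>();
template void flash_attn_kvq_case< 64, KV_Q4_0, KV_Q4_0>();
template void flash_attn_kvq_case<256, KV_Q8_0, KV_Q8_0>();
template void flash_attn_kvq_case<256, KV_Q4_0, KV_Q4_0>();
#endif
```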
