Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Currently llama.cpp has many variants of the FlashAttention (FA) kernel for head size (hs) 128, but only a few for hs 64 and 256. This causes a fallback to the CPU when `-ctk != f16` is used with models whose head size is not 128 (see the sketch below).
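
To illustrate, here is a minimal sketch of the dispatch pattern in question. All names (`flash_attn_ext`, `launch_flash_attn`) and the launch configuration are assumptions for the sake of the example, not llama.cpp's actual code; the point is that each (head size, KV cache type) combination needs its own template instantiation, and any combination without one has to fall back to the CPU:

```cuda
#include <cuda_runtime.h>

// Hypothetical FA kernel templated on head size and K/V cache element type.
template <int head_size, typename kv_t>
__global__ void flash_attn_ext(const float * Q, const kv_t * K, const kv_t * V,
                               float * dst, int n_kv) {
    // ... attention computation specialized for head_size and kv_t ...
}

// Hypothetical dispatcher: only instantiated (head size, KV type) pairs can
// run on the GPU; any other head size returns false and the caller falls
// back to the CPU path.
template <typename kv_t>
bool launch_flash_attn(int head_size, const float * Q, const kv_t * K,
                       const kv_t * V, float * dst, int n_kv, cudaStream_t s) {
    switch (head_size) {
        case  64: flash_attn_ext< 64, kv_t><<<1,  64, 0, s>>>(Q, K, V, dst, n_kv); return true;
        case 128: flash_attn_ext<128, kv_t><<<1, 128, 0, s>>>(Q, K, V, dst, n_kv); return true;
        case 256: flash_attn_ext<256, kv_t><<<1, 256, 0, s>>>(Q, K, V, dst, n_kv); return true;
        default:  return false; // no kernel instantiated for this head size
    }
}
```

With quantized KV caches, each `kv_t` multiplies the number of required instantiations, which is why hs 64 and 256 currently cover far fewer cases than hs 128.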
Motivation
Llama 3.2 1B and Gemma 3 12B use head sizes 64 and 256 respectively, and these seem to be quite popular models for some applications.
Possible Implementation
More kernel template instantiations need to be added. A flag such as GGML_CUDA_FA_ALL_KVQ_HS could also be added to make these extra templates opt-in, because I understand that compiling them will increase build times dramatically (see the sketch below).
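
As a rough sketch of how the opt-in could work (the flag name comes from the suggestion above; the kernel name and signature are assumptions carried over from the earlier sketch): the common hs 128 instantiations stay unconditional, while the extra head sizes are wrapped in a preprocessor guard, so default builds keep their current compilation times:

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Same hypothetical kernel template as in the sketch above.
template <int head_size, typename kv_t>
__global__ void flash_attn_ext(const float * Q, const kv_t * K, const kv_t * V,
                               float * dst, int n_kv) {
    // ... attention computation specialized for head_size and kv_t ...
}

// hs 128 instantiations are always built, as they are today.
template __global__ void flash_attn_ext<128, half>(
        const float *, const half *, const half *, float *, int);

#ifdef GGML_CUDA_FA_ALL_KVQ_HS
// Extra head sizes are compiled only when the proposed flag is defined,
// e.g. via -DGGML_CUDA_FA_ALL_KVQ_HS at build time. Quantized KV types
// would be instantiated the same way.
template __global__ void flash_attn_ext< 64, half>(
        const float *, const half *, const half *, float *, int);
template __global__ void flash_attn_ext<256, half>(
        const float *, const half *, const half *, float *, int);
#endif
```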