b9828

Latest

github-actions released this 27 Jun 23:15

ebd048f

opencl: flash attention improvement (#25069)

opencl: rework FA kernel for f16 and f32
opencl: flash-attention prefill prepass kernels

flash_attn_kv_pad_f16 pads the tail KV tile to a BLOCK_N multiple
flash_attn_mask_pad_f16 pads the matching mask tile
flash_attn_blk_f16 classifies each KV tile per query block as fully masked, mixed, or fully unmasked, so the main kernel can skip fully-masked tiles entirely and skip the mask lookup for fully-unmasked ones
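
The prepass described above can be illustrated with a minimal host-side sketch in Python. It is not the OpenCL code from the PR: the `BLOCK_M` and `BLOCK_N` values, the function names, and the additive mask convention (0.0 keeps a position, -inf drops it) are assumptions made for illustration only.

```python
import math

BLOCK_M = 4  # query rows per block (illustrative value, not taken from the PR)
BLOCK_N = 4  # KV columns per tile (illustrative value, not taken from the PR)

FULLY_MASKED, MIXED, FULLY_UNMASKED = 0, 1, 2


def pad_to_multiple(n, block):
    """Round n up to the next multiple of block."""
    return (n + block - 1) // block * block


def pad_mask(mask, n_kv_padded):
    """Pad every mask row with -inf so padded KV columns never contribute."""
    return [row + [-math.inf] * (n_kv_padded - len(row)) for row in mask]


def classify_tiles(mask):
    """Return classes[q_block][kv_tile] for an additive mask.

    Assumed convention: 0.0 keeps a position, -inf drops it.
    The mask must already be padded to a BLOCK_N multiple.
    """
    n_q, n_kv = len(mask), len(mask[0])
    assert n_kv % BLOCK_N == 0, "pad the mask first"
    classes = []
    for q0 in range(0, n_q, BLOCK_M):
        row_classes = []
        for k0 in range(0, n_kv, BLOCK_N):
            tile = [mask[q][k]
                    for q in range(q0, min(q0 + BLOCK_M, n_q))
                    for k in range(k0, k0 + BLOCK_N)]
            if all(v == -math.inf for v in tile):
                row_classes.append(FULLY_MASKED)    # main kernel may skip the tile
            elif all(v == 0.0 for v in tile):
                row_classes.append(FULLY_UNMASKED)  # main kernel may skip the mask read
            else:
                row_classes.append(MIXED)           # main kernel must apply the mask
        classes.append(row_classes)
    return classes


# 8 queries, 10 KV positions, causal mask with an offset of 2.
n_q, n_kv = 8, 10
mask = [[0.0 if j <= i + 2 else -math.inf for j in range(n_kv)] for i in range(n_q)]
n_kv_padded = pad_to_multiple(n_kv, BLOCK_N)  # 12
classes = classify_tiles(pad_mask(mask, n_kv_padded))
print(classes)  # [[1, 1, 0], [2, 1, 1]]
```

In this example one of the six tiles is skipped outright and one needs no mask lookup. The remaining four are mixed and take the regular masked path.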

opencl: FA kernels for q4_0 and q8_0
opencl: set_rows for f32 to q8_0/q4_0
opencl: dequant kernels for q4_0 and q8_0
opencl: add FA tile tuning table with override
opencl: wire host side for FA
opencl: q4_0 MoE tensors are also SOA'ed
opencl: cosmetic fix
opencl: refactor, also clarify some code paths in comments
opencl: fix infinity for -cl-finite-math-only
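
For the q8_0 entries in the list above, the following minimal Python sketch shows what a q8_0 quantize/dequantize round trip looks like. It follows the q8_0 block layout used by ggml (32 values per block, one scale per block, signed 8-bit quants). The function names are hypothetical, the scale is kept as a Python float rather than rounded to fp16, and none of this is the OpenCL kernel code from the PR.

```python
import math

QK8_0 = 32  # values per q8_0 block


def round_half_away(x):
    """Round to nearest integer, ties away from zero (like C roundf)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_q8_0(values):
    """Quantize floats into (scale, [int8 quants]) blocks of QK8_0 values."""
    assert len(values) % QK8_0 == 0
    blocks = []
    for i in range(0, len(values), QK8_0):
        chunk = values[i:i + QK8_0]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0                       # per-block scale
        inv_d = 1.0 / d if d != 0.0 else 0.0   # avoid dividing by zero for all-zero blocks
        qs = [round_half_away(v * inv_d) for v in chunk]
        blocks.append((d, qs))
    return blocks


def dequantize_q8_0(blocks):
    """Expand (scale, quants) blocks back to floats."""
    return [d * q for d, qs in blocks for q in qs]


values = [float(i - 16) for i in range(QK8_0)]  # -16.0 .. 15.0
blocks = quantize_q8_0(values)
restored = dequantize_q8_0(blocks)
max_err = max(abs(a - b) for a, b in zip(values, restored))
```

With the scale kept in full precision, the round-trip error per value is bounded by half the block scale. Storing the scale as fp16, as the real format does, adds a small extra error that this sketch ignores.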

Co-authored-by: Li He <lih@qti.qualcomm.com>

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

UI
