b9828

github-actions released this 27 Jun 23:15
ebd048f

opencl: flash attention improvement (#25069)

  • opencl: rework FA kernel for f16 and f32

  • opencl: flash-attention prefill prepass kernels

      • flash_attn_kv_pad_f16 pads the tail KV tile to a multiple of BLOCK_N
      • flash_attn_mask_pad_f16 pads the matching mask tile
      • flash_attn_blk_f16 classifies each KV tile per query block as fully masked, mixed, or fully unmasked, so the main kernel can skip fully masked tiles entirely and skip the mask lookup for fully unmasked ones

  • opencl: FA kernels for q4_0 and q8_0

  • opencl: set_rows for f32 to q8_0/q4_0

  • opencl: dequant kernels for q4_0 and q8_0

  • opencl: add FA tile tuning table with override

  • opencl: wire host side for FA

  • opencl: q4_0 MoE tensors are also converted to SoA (struct-of-arrays) layout

  • opencl: cosmetic fix

  • opencl: refactor, also clarify some code paths in comments

  • opencl: fix infinity for -cl-finite-math-only


Co-authored-by: Li He <lih@qti.qualcomm.com>
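
The prepass idea described above (pad the tail tile, then classify each KV tile per query block) can be illustrated with a small host-side sketch. This is not the OpenCL kernel code from the PR. It assumes an additive mask that holds 0.0 for visible positions and -inf for masked ones, and it assumes the padding is filled with -inf. The names pad_mask, classify_tiles, TileKind, and block_m are hypothetical; only BLOCK_N (block_n here) comes from the notes.

```python
from enum import IntEnum

NEG_INF = float("-inf")  # assumed "masked" value in an additive mask


class TileKind(IntEnum):
    # Hypothetical labels; the real kernel's encoding is not given in these notes.
    FULLY_MASKED = 0
    MIXED = 1
    FULLY_UNMASKED = 2


def pad_mask(mask, block_n):
    """Pad every mask row with -inf up to the next multiple of block_n."""
    n_kv = len(mask[0])
    padded = -(-n_kv // block_n) * block_n  # round up to a block_n multiple
    return [row + [NEG_INF] * (padded - n_kv) for row in mask]


def classify_tiles(mask, block_m, block_n):
    """Return one list of TileKind per query block, one entry per KV tile.

    Expects a mask whose width is already a multiple of block_n.
    """
    n_q, n_kv = len(mask), len(mask[0])
    kinds = []
    for q0 in range(0, n_q, block_m):
        rows = mask[q0:q0 + block_m]
        row_kinds = []
        for k0 in range(0, n_kv, block_n):
            vals = [v for row in rows for v in row[k0:k0 + block_n]]
            if all(v == NEG_INF for v in vals):
                row_kinds.append(TileKind.FULLY_MASKED)    # tile can be skipped
            elif all(v == 0.0 for v in vals):
                row_kinds.append(TileKind.FULLY_UNMASKED)  # no mask lookup needed
            else:
                row_kinds.append(TileKind.MIXED)           # needs per-element mask
        kinds.append(row_kinds)
    return kinds


# Causal mask: 4 queries, 5 keys, query q sees keys 0..q.
mask = [[0.0 if k <= q else NEG_INF for k in range(5)] for q in range(4)]
kinds = classify_tiles(pad_mask(mask, block_n=2), block_m=2, block_n=2)
for row in kinds:
    print([k.name for k in row])
# ['MIXED', 'FULLY_MASKED', 'FULLY_MASKED']
# ['FULLY_UNMASKED', 'MIXED', 'FULLY_MASKED']
```

In this sketch a main attention loop would skip FULLY_MASKED tiles outright, read the mask only for MIXED tiles, and treat FULLY_UNMASKED tiles as if no mask were present.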

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)
