Conversation

@LoserCheems
Collaborator

This pull request introduces significant enhancements to the Flash Attention forward pass implementation, focusing on kernel definitions, architecture-specific optimizations, and memory layout improvements. The changes aim to improve code maintainability, support for dynamic masks, and performance across different GPU architectures.

Kernel and Architecture Enhancements:

  • Introduced macros to streamline kernel definitions and handle unsupported architectures in flash_fwd_launch_template.h. This includes defining DEFINE_FLASH_FORWARD_KERNEL for cleaner kernel declarations and FLASH_UNSUPPORTED_ARCH for centralized error messaging (see the sketch after this list).
  • Added architecture-specific optimizations for compute capabilities (e.g., SM80+) by adjusting kernel configurations and memory usage based on GPU capabilities.
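
As a rough illustration, the two macros might look like the minimal sketch below. The parameter lists, the Flash_fwd_params signature, and the flash::compute_attn entry point are assumptions for illustration, not copied from the diff:

```cpp
// Minimal sketch, not the actual diff: declares a templated __global__
// forward kernel with a uniform signature so each instantiation avoids
// repeating the declaration boilerplate.
#define DEFINE_FLASH_FORWARD_KERNEL(kernel_name, ...)                      \
    template <typename Kernel_traits, __VA_ARGS__>                         \
    __global__ void kernel_name(const Flash_fwd_params params)

// Minimal sketch: one central place for the error emitted when the kernel
// is compiled for an architecture the implementation does not support.
#define FLASH_UNSUPPORTED_ARCH                                             \
    printf("FATAL: FlashAttention forward requires sm80 or newer.\n")

// Example use: the fast path compiles only on SM80+, everything else falls
// back to the shared error message.
DEFINE_FLASH_FORWARD_KERNEL(flash_fwd_kernel, bool Is_causal) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    flash::compute_attn<Kernel_traits, Is_causal>(params);  // assumed entry point
#else
    FLASH_UNSUPPORTED_ARCH;
#endif
}
```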

Dynamic Mask Support:

  • Enhanced dynamic mask memory allocation in Flash_fwd_kernel_traits by splitting kDynamicMaskBufferPerQuery into separate components (kMaskValuesSize, kNonZeroIndicesSize, etc.) for better modularity and clarity (see the sketch after this list).
  • Defined shared memory layouts (SmemLayoutDynamicMaskValues, SmemLayoutNonZeroIndices, etc.) to support dynamic masks with improved memory organization.
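
A minimal sketch of how the split accounting and the CuTe shared-memory layouts could fit together is shown below. The element types, the kKeysPerQuery bound, and the plain row-by-column shapes are assumptions for illustration, not the trait definitions from the diff:

```cpp
#include <cute/layout.hpp>  // cute::make_layout, cute::make_shape, cute::Int

// Minimal sketch of the relevant trait members, not the actual
// Flash_fwd_kernel_traits.
template <int kBlockM, int kKeysPerQuery>
struct Dynamic_mask_traits_sketch {
    // Per-query dynamic-mask storage, broken into named components instead
    // of one opaque byte count (element types are assumed).
    static constexpr int kMaskValuesSize     = kKeysPerQuery * sizeof(float);
    static constexpr int kNonZeroIndicesSize = kKeysPerQuery * sizeof(int);
    static constexpr int kDynamicMaskBufferPerQuery =
        kMaskValuesSize + kNonZeroIndicesSize;

    // Shared-memory layouts for the mask values and their non-zero indices,
    // expressed with CuTe so they compose with the existing tile layouts.
    using SmemLayoutDynamicMaskValues = decltype(cute::make_layout(
        cute::make_shape(cute::Int<kBlockM>{}, cute::Int<kKeysPerQuery>{})));
    using SmemLayoutNonZeroIndices = decltype(cute::make_layout(
        cute::make_shape(cute::Int<kBlockM>{}, cute::Int<kKeysPerQuery>{})));
};
```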

Performance Improvements:

  • Optimized kernel configurations for specific head dimensions (e.g., 32, 64, 96, 128, 192, 256) in flash_fwd_launch_template.h, ensuring efficient memory and thread usage based on GPU architecture and workload characteristics.
  • Adjusted shared memory size calculations to account for dynamic mask buffers and non-zero indices, ensuring efficient use of shared memory resources (a launch-side sketch follows this list).
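
Put together, the launch side might look like the sketch below. The tile-size arithmetic, the field names on Flash_fwd_params, and the exact shared-memory terms are assumptions for illustration; cudaFuncSetAttribute with cudaFuncAttributeMaxDynamicSharedMemorySize is the standard way to opt into the larger dynamic shared-memory carve-out on SM80+:

```cpp
// Minimal sketch of the per-head-dimension launch logic, not the actual diff.
template <typename Kernel_traits>
void run_flash_fwd_sketch(Flash_fwd_params &params, cudaStream_t stream) {
    // Shared-memory budget = Q/K/V tiles plus the per-query dynamic-mask
    // buffers (values + non-zero indices) described above.
    constexpr int kSmemMask = Kernel_traits::kBlockM *
                              Kernel_traits::kDynamicMaskBufferPerQuery;
    constexpr int smem_size = Kernel_traits::kSmemSize + kSmemMask;

    // One block per (M-tile, batch, head); field names are assumed.
    const int num_m_blocks =
        (params.seqlen_q + Kernel_traits::kBlockM - 1) / Kernel_traits::kBlockM;
    dim3 grid(num_m_blocks, params.b, params.h);

    auto kernel = &flash_fwd_kernel<Kernel_traits, /*Is_causal=*/false>;
    if (smem_size >= 48 * 1024) {
        // Above the 48 KB default, SM80+ requires explicitly raising the
        // dynamic shared-memory limit for this kernel.
        cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                             smem_size);
    }
    kernel<<<grid, Kernel_traits::kNThreads, smem_size, stream>>>(params);
}
```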

@LoserCheems
Collaborator Author

@wubingheng111

LoserCheems merged commit a843e46 into main on May 19, 2025.