Workspace #10

LoserCheems · 2025-05-22T02:37:55Z

This pull request introduces several changes to the flash_attention codebase, focusing on simplifying code, removing unused functionality, and improving memory layout and kernel generation. The most significant changes include the removal of split-key-value (split-KV) functionality, the introduction of a kernel generation script, and updates to memory layout and masking logic for better performance and maintainability.

Removal of Split-KV Functionality:

Removed split-KV-related kernel definitions and dispatch logic from flash_fwd_launch_template.h, simplifying the forward pass implementation. [1] [2]

Introduction of Kernel Generation Script:

Added generate_kernels.py, a script to auto-generate kernel instantiations for different head dimensions, data types, and causal configurations. This reduces manual kernel definition and improves maintainability.

Memory Layout and Masking Improvements:

Updated shared memory layouts in kernel_traits.h to optimize dynamic masking and zero-hold states, including new layouts for mask values, sort keys, and indices.
Refined causal masking logic in mask.h to eliminate unnecessary pointer checks and improve clarity.
Enhanced the dynamic mask application function in mask.h to use tensor-based shared memory buffers, improving type safety and flexibility.

Code Simplification:

Consolidated index_t type definition in flash.h to a single global typedef, removing redundant definitions in multiple structs. [1] [2]
Removed unused fields and pointers, such as causal_mask_ptr, from Flash_fwd_params and Flash_bwd_params structs in flash.h. [1] [2]

Documentation and References:

Added a reference link in kernel_traits.h to clarify the source of a memory layout definition, improving code documentation.

Copilot

Pull Request Overview

This pull request simplifies the flash_attention codebase by removing unused split-KV functionality, introducing a kernel generation script, and refactoring shared memory layouts and masking logic to improve performance and maintainability.

Removed split-KV kernel definitions and dispatch logic.
Added an auto-generation script for kernel instantiation.
Refined memory layout declarations and masking functions, and consolidated the index type definition in flash.h.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
csrc/src/static_switch.h	Introduced BOOL_SWITCH and related macros for conditional code.
csrc/src/softmax.h	Updated softmax reduction and scaling implementations.
csrc/src/philox.cuh	Added Philox RNG implementation for CUDA.
csrc/src/mask.h	Refactored causal and dynamic mask functions to remove pointer usage.
csrc/src/kernel_traits.h	Updated shared memory layout names and calculations for dynamic masking.
csrc/src/generate_kernels.py	New script for auto-generating kernel instantiations.
csrc/src/flash_fwd_launch_template.h	Removed split-KV kernel instantiations to simplify forward pass.
csrc/src/flash.h	Consolidated index type definition and removed unused causal mask pointer.

Copilot · 2025-05-22T02:38:23Z

csrc/src/flash.h

 constexpr int TOTAL_DIM = 0;
 constexpr int H_DIM = 1;
 constexpr int D_DIM = 2;
+typedef int64_t index_t;


The global typedef for index_t now makes local alias definitions in structs redundant. Ensure any previous local definitions are removed so that all modules use the global index_t consistently.

Copilot · 2025-05-22T02:38:24Z

csrc/src/kernel_traits.h


    // Dynamic mask related definitions
-    using SmemLayoutAtomZeroHold = decltype(
+    using SmemLayoutAtomMask = decltype(


[nitpick] The associated comment still refers to zero-hold states. Consider updating the comment to clarify that this layout is now used for dynamic mask values to avoid any confusion.

algo-home added 2 commits May 22, 2025 10:34

Support compute_attn_1rowblock

f8198ca

Great progress

fbce45c

LoserCheems requested a review from Copilot May 22, 2025 02:37

Copilot AI reviewed May 22, 2025

View reviewed changes

LoserCheems merged commit 9655697 into main May 22, 2025

LoserCheems deleted the workspace branch November 13, 2025 04:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Workspace #10

Workspace #10

Uh oh!

LoserCheems commented May 22, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI May 22, 2025

Uh oh!

Copilot AI May 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Workspace #10

Workspace #10

Uh oh!

Conversation

LoserCheems commented May 22, 2025

Removal of Split-KV Functionality:

Introduction of Kernel Generation Script:

Memory Layout and Masking Improvements:

Code Simplification:

Documentation and References:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI May 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 22, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants