Conversation

@LoserCheems
Collaborator

Introduce dedicated CUDA kernel implementations for causal multi-head attention across multiple head dimensions and precision types. Separate files improve compilation performance by reducing template instantiation overhead.
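
For illustration, here is a minimal sketch (not the repository's actual generate_kernels.py) of the per-configuration file emission this approach relies on. The CUDA-side names in the template string (run_mha_fwd_, run_mha_fwd_hdim*, Flash_fwd_params, the include path) and the file-naming scheme are assumptions used only to show the shape of an auto-generated specialization file:

```python
# Illustrative sketch only: emit one small .cu file per (dtype, head_dim, is_causal)
# combination so each template instantiation compiles in its own translation unit.
from dataclasses import dataclass
from pathlib import Path

# The C++ identifiers below are assumptions, not the repository's confirmed API.
CUDA_FILE_TEMPLATE = """// Auto-generated; see generate_kernels.py.
#include "flash_fwd_launch_template.h"

template<>
void run_mha_fwd_<{cpp_dtype}, {head_dim}, {is_causal}>(Flash_fwd_params &params, cudaStream_t stream) {{
    run_mha_fwd_hdim{head_dim}<{cpp_dtype}, {is_causal}>(params, stream);
}}
"""

DTYPE_MAP = {"fp16": "cutlass::half_t", "bf16": "cutlass::bfloat16_t"}

@dataclass
class Kernel:
    sm: int
    dtype: str
    head_dim: int
    is_causal: bool   # illustrative; the PR discusses both boolean and string-based forms
    direction: str    # "fwd" or "fwd_split"

    @property
    def filename(self) -> str:
        # Assumed naming scheme, e.g. flash_fwd_hdim32_bf16_causal_sm80.cu
        causal = "_causal" if self.is_causal else ""
        return f"flash_{self.direction}_hdim{self.head_dim}_{self.dtype}{causal}_sm{self.sm}.cu"

def write_kernel(kernel: Kernel, out_dir: Path) -> None:
    # Render the template for one configuration and write it as its own compilation unit.
    body = CUDA_FILE_TEMPLATE.format(
        cpp_dtype=DTYPE_MAP[kernel.dtype],
        head_dim=kernel.head_dim,
        is_causal="true" if kernel.is_causal else "false",
    )
    (out_dir / kernel.filename).write_text(body)
```

Because each emitted file contains a single explicit specialization, the build can compile configurations in parallel and recompile only the unit whose configuration changed.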

The new compilation units, by configuration:

- Creates a dedicated CUDA kernel implementation for causal multi-head attention with bfloat16 precision and head dimension 32. Separates kernel specializations into individual files to improve compilation performance by reducing template instantiation overhead.
- Introduces a dedicated compilation unit for the bfloat16 data type with 32-dimensional heads to improve build performance by parallelizing kernel compilation across multiple files. The template specialization delegates to the appropriate implementation while maintaining the existing interface.
- Implements specialized template instantiation for causal multi-head attention with 32-dimensional heads using half precision. Separates kernel implementations into dedicated files to improve compilation performance by reducing template instantiation overhead.
- Introduces a dedicated compilation unit for the flash attention forward pass with 32-dimensional heads using half precision on SM80. Splits kernel implementations into separate files to improve compilation speed by reducing template instantiation overhead.
- Implements template specialization for bfloat16 multi-head attention with 64-dimensional heads and causal masking on SM80. Splits kernel implementations into separate files to improve compilation performance and enable parallel builds.
- Introduces a specialized kernel implementation to improve compilation times by splitting different head dimensions into separate files. The kernel handles the multi-head attention forward pass with 64-dimensional heads using bfloat16 precision on SM80.
- Implements template specialization for FP16 causal multi-head attention with 64-dimensional heads on SM80. Splits kernel implementations into separate files to improve compilation speed and maintainability.
- Introduces a specialized CUDA kernel for the multi-head attention forward pass with 64-dimensional heads using half-precision floating point on SM80. Splits kernel implementations by head dimension to accelerate compilation, as noted in the auto-generated template specialization.
- Implements template specialization for the multi-head attention forward pass with 96-dimensional heads using bfloat16 precision and causal masking. Splits kernel implementations across separate files to improve compilation performance on SM80.
- Implements template specialization for the bfloat16 data type with head dimension 96 to support modular kernel compilation. Splits kernel implementations across separate files to reduce compilation time and improve build efficiency.
- Introduces a specialized CUDA kernel implementation for the multi-head attention forward pass with 96-dimensional heads using half-precision floating point and causal masking. Splits kernel implementations by head dimension into separate compilation units to improve build performance and reduce compilation time.
- Implements specialized template instantiation for 96-dimensional heads using half precision on SM80. Separates kernel implementations by head dimension to improve compilation performance, as noted in the auto-generated file structure.
- Introduces an auto-generated CUDA kernel implementation for the multi-head attention forward pass with 128-dimensional heads using bfloat16 precision and causal masking. Separates kernel specializations into individual files to improve compilation performance by reducing template instantiation overhead.
- Introduces a dedicated compilation unit for bfloat16 multi-head attention forward kernels with head dimension 128 on SM80. Splits kernel implementations into separate files to reduce compilation time and improve build parallelization.
- Creates a dedicated file for FP16 causal attention with 128-dimensional heads on SM80. Improves build performance by isolating the template instantiation in a separate compilation unit.
- Introduces a dedicated compilation unit for SM80 to optimize build times by splitting kernel implementations across separate files. Implements template specialization for half-precision floating-point operations with 128-dimensional attention heads.
- Implements a specialized CUDA kernel for the multi-head attention forward pass with bfloat16 precision, head dimension 192, and causal masking. Separates kernel implementations by head dimension to improve compilation speed and enable SM80-specific optimization.
- Implements template specialization for the multi-head attention forward pass with bfloat16 precision and 192-dimensional heads on SM80. Splits kernel implementations into separate files to improve compilation speed and enable targeted optimization for specific configurations.
- Implements a specialized CUDA kernel for the flash attention forward pass with 192-dimensional heads using half-precision floating point and causal masking. Splits kernel implementations by head dimension to improve compilation speed, following the auto-generation pattern for SM80.
- Implements template specialization for half-precision floating-point operations with 192-dimensional attention heads on SM80. Splits kernel implementations across separate files to improve compilation performance, as noted in the auto-generated code structure.
- Introduces a new auto-generated CUDA file that implements a specialized forward-pass kernel for multi-head attention with bfloat16 precision, 256-dimensional heads, and causal masking on SM80. Splits kernel implementations across separate files to improve compilation performance and maintainability.
- Creates a specialized CUDA kernel for bfloat16 flash attention with 256-dimensional heads on SM80. Splits kernel implementations by head dimension to improve compilation speed, as noted in the auto-generated file structure.
- Introduces a specialized CUDA kernel implementation for the flash attention forward pass with 256-dimensional heads using half-precision floating point and causal masking. Supports SM80 and follows the pattern of splitting different head dimensions into separate files to improve compilation performance.
- Splits the kernel implementation into a separate file to improve compilation speed for the flash attention forward pass with 256-dimensional heads using half precision on SM80. The template specialization enables an optimized execution path for this specific configuration while keeping the codebase modular.
- Splits flash attention kernels by head dimension to reduce compilation time. Creates a dedicated instantiation for 32-dimensional heads with bfloat16 precision and causal masking on SM80.
- Introduces an auto-generated CUDA kernel specialization to improve compilation performance by splitting different head dimensions into separate files. Implements template instantiation for the split-KV attention forward pass with 32-dimensional heads using bfloat16 precision on SM80.
- Introduces a specialized kernel file for head dimension 32 with FP16 precision and causal masking to accelerate compilation. Splits kernel instantiations across separate files as part of the compilation optimization strategy.
- Splits the head dimension 32 flash attention kernel into a separate file to improve compilation speed. The auto-generated template instantiation targets half precision on SM80 with split key-value (split-KV) dispatch.
- Introduces a specialized kernel file for head dimension 64 with bfloat16 precision and causal masking to improve compilation performance. Separates kernel instantiations into dedicated files to reduce build times and enable parallel compilation of different attention configurations.
- Splits flash attention forward kernels by head dimension to reduce compilation time. Creates a dedicated kernel instantiation for bfloat16 with 64-dimensional heads on SM80.
- Introduces a specialized kernel file for the bfloat16 data type with head dimension 128 and causal masking targeting SM80. Separates kernel implementations into individual files to reduce build times through parallel compilation and selective recompilation.
- Splits kernel compilation by head dimension to improve build times. The auto-generated file contains the template instantiation for bfloat16 with 128-dimensional heads targeting SM80.
- Introduces an auto-generated kernel file for head dimension 128 with FP16 precision and causal masking on SM80. Splits kernel implementations into separate files to accelerate compilation by reducing template instantiation overhead per compilation unit.
- Introduces a new compilation unit for FlashAttention forward-pass kernels with specific parameters (FP16, head dimension 128, SM80). Separates kernel instantiations into individual files to reduce compilation time and improve build parallelization.
- Introduces a dedicated CUDA kernel file for the flash attention forward pass with head dimension 192, the bfloat16 data type, causal masking, and SM80. Splits kernel implementations into separate files to reduce compilation time through modular organization.
- Introduces template specialization to improve compilation performance by splitting different head dimensions into separate files. Supports SM80 with split-KV dispatch functionality.
- Introduces a dedicated CUDA kernel file for the flash attention forward pass with 192-dimensional heads, FP16 precision, and causal masking on SM80. Splits kernel implementations into separate files to improve compilation speed and reduce build times for the flash attention library.
- Introduces a specialized kernel file for head dimension 192 using half precision to improve compilation performance through file splitting. The template instantiation targets SM80 and supports split-KV dispatch functionality.
- Splits flash attention kernels by head dimension to improve compilation speed. Creates a dedicated compilation unit for 256-dimensional bfloat16 causal attention on SM80.
- Creates a dedicated compilation unit for head dimension 256 using bfloat16 precision on SM80. Improves build performance by separating kernel instantiations across multiple files, reducing compilation time for the flash attention implementation.
- Introduces a dedicated CUDA kernel file for the flash attention forward pass with 256-dimensional heads and causal masking on SM80. Splits kernel implementations into separate files to accelerate compilation by enabling parallel compilation of different head-dimension variants.
- Splits flash attention forward kernels by head dimension to reduce compilation time. Creates a dedicated compilation unit for 256-dimensional heads using half precision on SM80.

Corrects is_causal field type from string to boolean for proper type safety.

Uncomments fwd_split template function to enable forward split kernel generation.

Extends kernel generation to include both fwd and fwd_split directions instead of only fwd.

Removes unused os import to clean up dependencies.
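
As a hedged sketch of how the enumeration might look after these changes (reusing the illustrative Kernel dataclass from the sketch above; the head dimensions and SM80 target come from the file list in this PR, everything else is assumed):

```python
import itertools
from typing import Generator

HEAD_DIMS = [32, 64, 96, 128, 192, 256]  # head dimensions covered by this PR
DTYPES = ["fp16", "bf16"]
DIRECTIONS = ["fwd", "fwd_split"]        # "bwd" remains disabled

def get_all_kernels() -> Generator["Kernel", None, None]:
    # Kernel is the illustrative dataclass from the earlier sketch.
    # Yield one configuration at a time instead of building a full list.
    for direction, dtype, head_dim, is_causal in itertools.product(
        DIRECTIONS, DTYPES, HEAD_DIMS, (False, True)
    ):
        yield Kernel(sm=80, dtype=dtype, head_dim=head_dim,
                     is_causal=is_causal, direction=direction)
```
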
LoserCheems requested review from Evanwu1125, SNHuan, Copilot and wubingheng111 and removed the request for Copilot on June 26, 2025 at 02:57.
LoserCheems added the feature (New feature request) label on Jun 26, 2025.


Updates the data type of the is_causal field in the Kernel class to support string-based causal configurations instead of simple boolean values.

This change enables more flexible causal masking options beyond just enabled/disabled states.

Improves memory usage by returning a generator instead of materializing all kernel objects in memory at once.

Updates import statement to remove unused List type and add Generator type.
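
A brief usage sketch of the streaming pattern this enables, assuming the illustrative get_all_kernels and write_kernel helpers from the earlier sketches and an assumed output directory:

```python
from pathlib import Path

out_dir = Path("csrc/src")            # assumed output directory, for illustration
for kernel in get_all_kernels():      # yields Kernel objects lazily, one at a time
    write_kernel(kernel, out_dir)     # emits one small .cu file per configuration
```
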
LoserCheems requested a review from Copilot on June 26, 2025 at 03:04.

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds specialized CUDA kernel implementations for causal multi-head attention via dedicated auto‐generated files to improve compilation performance. Key changes include updating the kernel generation script in generate_kernels.py (e.g., switching the return type for get_all_kernels to a Generator), and introducing numerous auto‐generated CUDA files that instantiate template functions for different head dimensions, precisions, and causal configurations.

Reviewed Changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| csrc/src/generate_kernels.py | Updated import list and adjusted get_all_kernels to return a Generator instead of a List; uncommented the "fwd_split" kernel template usage. |
| All auto-generated CUDA files | Added new CUDA kernel instantiations (both non-causal and causal variants) for various head dimensions and precision types. |

Relevant diff in csrc/src/generate_kernels.py:

```diff
-def get_all_kernels() -> List[Kernel]:
-    for direction in ["fwd"]: #, "fwd_split", "bwd"]:
+def get_all_kernels() -> Generator[Kernel, None, None]:
+    # for direction in ["fwd", "fwd_split", "bwd"]:
```
Copilot AI Jun 26, 2025


Consider removing or updating the commented-out options (e.g., 'bwd') if they are no longer supported to improve code clarity.

Suggested change:

```diff
-    # for direction in ["fwd", "fwd_split", "bwd"]:
```

LoserCheems merged commit f2aa162 into main on Jun 26, 2025.