Conversation

@LoserCheems
Collaborator

Introduce dedicated CUDA kernel implementations for causal multi-head attention across multiple head dimensions and precision types. Separate files improve compilation performance by reducing template instantiation overhead.
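
For illustration, here is a minimal sketch (not the repository's actual generate_kernels.py) of the per-configuration file emission this approach relies on. The CUDA-side names in the template string (run_mha_fwd_, run_mha_fwd_hdim*, Flash_fwd_params, the include path) and the file-naming scheme are assumptions used only to show the shape of an auto-generated specialization file:

```python
# Illustrative sketch only: emit one small .cu file per (dtype, head_dim, is_causal)
# combination so each template instantiation compiles in its own translation unit.
from dataclasses import dataclass
from pathlib import Path

# The C++ identifiers below are assumptions, not the repository's confirmed API.
CUDA_FILE_TEMPLATE = """// Auto-generated; see generate_kernels.py.
#include "flash_fwd_launch_template.h"

template<>
void run_mha_fwd_<{cpp_dtype}, {head_dim}, {is_causal}>(Flash_fwd_params &params, cudaStream_t stream) {{
    run_mha_fwd_hdim{head_dim}<{cpp_dtype}, {is_causal}>(params, stream);
}}
"""

DTYPE_MAP = {"fp16": "cutlass::half_t", "bf16": "cutlass::bfloat16_t"}

@dataclass
class Kernel:
    sm: int
    dtype: str
    head_dim: int
    is_causal: bool   # illustrative; the PR discusses both boolean and string-based forms
    direction: str    # "fwd" or "fwd_split"

    @property
    def filename(self) -> str:
        # Assumed naming scheme, e.g. flash_fwd_hdim32_bf16_causal_sm80.cu
        causal = "_causal" if self.is_causal else ""
        return f"flash_{self.direction}_hdim{self.head_dim}_{self.dtype}{causal}_sm{self.sm}.cu"

def write_kernel(kernel: Kernel, out_dir: Path) -> None:
    # Render the template for one configuration and write it as its own compilation unit.
    body = CUDA_FILE_TEMPLATE.format(
        cpp_dtype=DTYPE_MAP[kernel.dtype],
        head_dim=kernel.head_dim,
        is_causal="true" if kernel.is_causal else "false",
    )
    (out_dir / kernel.filename).write_text(body)
```

Because each emitted file contains a single explicit specialization, the build can compile configurations in parallel and recompile only the unit whose configuration changed.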

The new compilation units, by configuration:

- Creates a dedicated CUDA kernel implementation for causal multi-head attention with bfloat16 precision and head dimension 32. Separates kernel specializations into individual files to improve compilation performance by reducing template instantiation overhead.
- Introduces a dedicated compilation unit for the bfloat16 data type with 32-dimensional heads to improve build performance by parallelizing kernel compilation across multiple files. The template specialization delegates to the appropriate implementation while maintaining the existing interface.
- Implements specialized template instantiation for causal multi-head attention with 32-dimensional heads using half precision. Separates kernel implementations into dedicated files to improve compilation performance by reducing template instantiation overhead.
- Introduces a dedicated compilation unit for the flash attention forward pass with 32-dimensional heads using half precision on SM80. Splits kernel implementations into separate files to improve compilation speed by reducing template instantiation overhead.
- Implements template specialization for bfloat16 multi-head attention with 64-dimensional heads and causal masking on SM80. Splits kernel implementations into separate files to improve compilation performance and enable parallel builds.
- Introduces a specialized kernel implementation to improve compilation times by splitting different head dimensions into separate files. The kernel handles the multi-head attention forward pass with 64-dimensional heads using bfloat16 precision on SM80.
- Implements template specialization for FP16 causal multi-head attention with 64-dimensional heads on SM80. Splits kernel implementations into separate files to improve compilation speed and maintainability.
- Introduces a specialized CUDA kernel for the multi-head attention forward pass with 64-dimensional heads using half-precision floating point on SM80. Splits kernel implementations by head dimension to accelerate compilation, as noted in the auto-generated template specialization.
- Implements template specialization for the multi-head attention forward pass with 96-dimensional heads using bfloat16 precision and causal masking. Splits kernel implementations across separate files to improve compilation performance on SM80.
- Implements template specialization for the bfloat16 data type with head dimension 96 to support modular kernel compilation. Splits kernel implementations across separate files to reduce compilation time and improve build efficiency.
- Introduces a specialized CUDA kernel implementation for the multi-head attention forward pass with 96-dimensional heads using half-precision floating point and causal masking. Splits kernel implementations by head dimension into separate compilation units to improve build performance and reduce compilation time.
- Implements specialized template instantiation for 96-dimensional heads using half precision on SM80. Separates kernel implementations by head dimension to improve compilation performance, as noted in the auto-generated file structure.
- Introduces an auto-generated CUDA kernel implementation for the multi-head attention forward pass with 128-dimensional heads using bfloat16 precision and causal masking. Separates kernel specializations into individual files to improve compilation performance by reducing template instantiation overhead.
- Introduces a dedicated compilation unit for bfloat16 multi-head attention forward kernels with head dimension 128 on SM80. Splits kernel implementations into separate files to reduce compilation time and improve build parallelization.
- Creates a dedicated file for FP16 causal attention with 128-dimensional heads on SM80. Improves build performance by isolating the template instantiation in a separate compilation unit.
- Introduces a dedicated compilation unit for SM80 to optimize build times by splitting kernel implementations across separate files. Implements template specialization for half-precision floating-point operations with 128-dimensional attention heads.
- Implements a specialized CUDA kernel for the multi-head attention forward pass with bfloat16 precision, head dimension 192, and causal masking. Separates kernel implementations by head dimension to improve compilation speed and enable SM80-specific optimization.
- Implements template specialization for the multi-head attention forward pass with bfloat16 precision and 192-dimensional heads on SM80. Splits kernel implementations into separate files to improve compilation speed and enable targeted optimization for specific configurations.
- Implements a specialized CUDA kernel for the flash attention forward pass with 192-dimensional heads using half-precision floating point and causal masking. Splits kernel implementations by head dimension to improve compilation speed, following the auto-generation pattern for SM80.
- Implements template specialization for half-precision floating-point operations with 192-dimensional attention heads on SM80. Splits kernel implementations across separate files to improve compilation performance, as noted in the auto-generated code structure.
- Introduces a new auto-generated CUDA file that implements a specialized forward-pass kernel for multi-head attention with bfloat16 precision, 256-dimensional heads, and causal masking on SM80. Splits kernel implementations across separate files to improve compilation performance and maintainability.
- Creates a specialized CUDA kernel for bfloat16 flash attention with 256-dimensional heads on SM80. Splits kernel implementations by head dimension to improve compilation speed, as noted in the auto-generated file structure.
- Introduces a specialized CUDA kernel implementation for the flash attention forward pass with 256-dimensional heads using half-precision floating point and causal masking. Supports SM80 and follows the pattern of splitting different head dimensions into separate files to improve compilation performance.
- Splits the kernel implementation into a separate file to improve compilation speed for the flash attention forward pass with 256-dimensional heads using half precision on SM80. The template specialization enables an optimized execution path for this specific configuration while keeping the codebase modular.
- Splits flash attention kernels by head dimension to reduce compilation time. Creates a dedicated instantiation for 32-dimensional heads with bfloat16 precision and causal masking on SM80.
- Introduces an auto-generated CUDA kernel specialization to improve compilation performance by splitting different head dimensions into separate files. Implements template instantiation for the split-KV attention forward pass with 32-dimensional heads using bfloat16 precision on SM80.
- Introduces a specialized kernel file for head dimension 32 with FP16 precision and causal masking to accelerate compilation. Splits kernel instantiations across separate files as part of the compilation optimization strategy.
- Splits the head dimension 32 flash attention kernel into a separate file to improve compilation speed. The auto-generated template instantiation targets half precision on SM80 with split key-value (split-KV) dispatch.
- Introduces a specialized kernel file for head dimension 64 with bfloat16 precision and causal masking to improve compilation performance. Separates kernel instantiations into dedicated files to reduce build times and enable parallel compilation of different attention configurations.
- Splits flash attention forward kernels by head dimension to reduce compilation time. Creates a dedicated kernel instantiation for bfloat16 with 64-dimensional heads on SM80.
- Introduces a specialized kernel file for the bfloat16 data type with head dimension 128 and causal masking targeting SM80. Separates kernel implementations into individual files to reduce build times through parallel compilation and selective recompilation.
- Splits kernel compilation by head dimension to improve build times. The auto-generated file contains the template instantiation for bfloat16 with 128-dimensional heads targeting SM80.
- Introduces an auto-generated kernel file for head dimension 128 with FP16 precision and causal masking on SM80. Splits kernel implementations into separate files to accelerate compilation by reducing template instantiation overhead per compilation unit.
- Introduces a new compilation unit for FlashAttention forward-pass kernels with specific parameters (FP16, head dimension 128, SM80). Separates kernel instantiations into individual files to reduce compilation time and improve build parallelization.
- Introduces a dedicated CUDA kernel file for the flash attention forward pass with head dimension 192, the bfloat16 data type, causal masking, and SM80. Splits kernel implementations into separate files to reduce compilation time through modular organization.
- Introduces template specialization to improve compilation performance by splitting different head dimensions into separate files. Supports SM80 with split-KV dispatch functionality.
- Introduces a dedicated CUDA kernel file for the flash attention forward pass with 192-dimensional heads, FP16 precision, and causal masking on SM80. Splits kernel implementations into separate files to improve compilation speed and reduce build times for the flash attention library.
- Introduces a specialized kernel file for head dimension 192 using half precision to improve compilation performance through file splitting. The template instantiation targets SM80 and supports split-KV dispatch functionality.
- Splits flash attention kernels by head dimension to improve compilation speed. Creates a dedicated compilation unit for 256-dimensional bfloat16 causal attention on SM80.
- Creates a dedicated compilation unit for head dimension 256 using bfloat16 precision on SM80. Improves build performance by separating kernel instantiations across multiple files, reducing compilation time for the flash attention implementation.
- Introduces a dedicated CUDA kernel file for the flash attention forward pass with 256-dimensional heads and causal masking on SM80. Splits kernel implementations into separate files to accelerate compilation by enabling parallel compilation of different head-dimension variants.
- Splits flash attention forward kernels by head dimension to reduce compilation time. Creates a dedicated compilation unit for 256-dimensional heads using half precision on SM80.

Corrects is_causal field type from string to boolean for proper type safety.

Uncomments fwd_split template function to enable forward split kernel generation.

Extends kernel generation to include both fwd and fwd_split directions instead of only fwd.

Removes unused os import to clean up dependencies.
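
As a hedged sketch of how the enumeration might look after these changes (reusing the illustrative Kernel dataclass from the sketch above; the head dimensions and SM80 target come from the file list in this PR, everything else is assumed):

```python
import itertools
from typing import Generator

HEAD_DIMS = [32, 64, 96, 128, 192, 256]  # head dimensions covered by this PR
DTYPES = ["fp16", "bf16"]
DIRECTIONS = ["fwd", "fwd_split"]        # "bwd" remains disabled

def get_all_kernels() -> Generator["Kernel", None, None]:
    # Kernel is the illustrative dataclass from the earlier sketch.
    # Yield one configuration at a time instead of building a full list.
    for direction, dtype, head_dim, is_causal in itertools.product(
        DIRECTIONS, DTYPES, HEAD_DIMS, (False, True)
    ):
        yield Kernel(sm=80, dtype=dtype, head_dim=head_dim,
                     is_causal=is_causal, direction=direction)
```
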
LoserCheems requested review from Evanwu1125, SNHuan, Copilot and wubingheng111 and removed the request for Copilot on June 26, 2025 at 02:57.
LoserCheems added the feature (New feature request) label on Jun 26, 2025.


Updates the data type of the is_causal field in the Kernel class to support string-based causal configurations instead of simple boolean values.

This change enables more flexible causal masking options beyond just enabled/disabled states.

Improves memory usage by returning a generator instead of materializing all kernel objects in memory at once.

Updates import statement to remove unused List type and add Generator type.
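
A brief usage sketch of the streaming pattern this enables, assuming the illustrative get_all_kernels and write_kernel helpers from the earlier sketches and an assumed output directory:

```python
from pathlib import Path

out_dir = Path("csrc/src")            # assumed output directory, for illustration
for kernel in get_all_kernels():      # yields Kernel objects lazily, one at a time
    write_kernel(kernel, out_dir)     # emits one small .cu file per configuration
```
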
LoserCheems requested a review from Copilot on June 26, 2025 at 03:04.

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds specialized CUDA kernel implementations for causal multi-head attention via dedicated auto‐generated files to improve compilation performance. Key changes include updating the kernel generation script in generate_kernels.py (e.g., switching the return type for get_all_kernels to a Generator), and introducing numerous auto‐generated CUDA files that instantiate template functions for different head dimensions, precisions, and causal configurations.

Reviewed Changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| csrc/src/generate_kernels.py | Updated import list and adjusted get_all_kernels to return a Generator instead of a List; uncommented the "fwd_split" kernel template usage. |
| All auto-generated CUDA files | Added new CUDA kernel instantiations (both non-causal and causal variants) for various head dimensions and precision types. |

Relevant diff in csrc/src/generate_kernels.py:

```diff
-def get_all_kernels() -> List[Kernel]:
-    for direction in ["fwd"]: #, "fwd_split", "bwd"]:
+def get_all_kernels() -> Generator[Kernel, None, None]:
+    # for direction in ["fwd", "fwd_split", "bwd"]:
```
Copilot AI Jun 26, 2025


Consider removing or updating the commented-out options (e.g., 'bwd') if they are no longer supported to improve code clarity.

Suggested change:

```diff
-    # for direction in ["fwd", "fwd_split", "bwd"]:
```

LoserCheems merged commit f2aa162 into main on Jun 26, 2025.