Refactors CUDA implementation to use new interface #81
Conversation
Switches from a direct CUDA extension import to the standardized function interface for better maintainability and consistency. Simplifies the function call signature by removing manual tensor operations and using cleaner parameter passing through the new interface. Adds a proper null check to handle cases where the function is unavailable.
Pull Request Overview
Refactors the CUDA implementation in the forward performance benchmark to use a standardized function interface instead of a direct CUDA extension import. This improves maintainability and consistency by switching from the low-level flash_dmattn_cuda.fwd call to the higher-level flash_dmattn_func interface.
- Switches from a direct CUDA extension import to the standardized function interface
- Simplifies the function call signature by removing manual tensor operations and using cleaner parameter passing
- Adds a proper null check to handle cases where the function is unavailable
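As a rough sketch of the resulting call path (the import location, keyword names, and function name below are assumptions for illustration, not taken from this diff):

try:
    from flash_dmattn import flash_dmattn_func  # assumed import location
except ImportError:
    flash_dmattn_func = None

def fwd_benchmark_step(query_states, key_states, value_states, attn_mask=None):
    # Null check mirroring the PR description: bail out when the interface is unavailable.
    if flash_dmattn_func is None:
        raise RuntimeError("flash_dmattn_func is unavailable; check the CUDA build.")
    # The higher-level interface takes [batch, seq_len, num_heads, head_dim] tensors,
    # so only a transpose is needed instead of the old manual tensor plumbing.
    query_states = query_states.transpose(1, 2)   # [batch, query_len, num_heads, head_dim]
    key_states = key_states.transpose(1, 2)       # [batch, key_len, num_kv_heads, head_dim]
    value_states = value_states.transpose(1, 2)   # [batch, key_len, num_kv_heads, head_dim]
    return flash_dmattn_func(query_states, key_states, value_states, attn_mask=attn_mask)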
attn_mask = attn_mask.contiguous()  # [batch, num_kv_heads, query_len, key_len]
query_states = query_states.transpose(1, 2)  # [batch, query_len, num_heads, head_dim]
key_states = key_states.transpose(1, 2)  # [batch, key_len, num_kv_heads, head_dim]
value_states = value_states.transpose(1, 2)  # [batch, key_len, num_kv_heads, head_dim]
Copilot AI commented on Jul 30, 2025
The zoh_states calculation and usage have been removed, but the calculate_zoh_states function is still being called on line 241. This means the function computes zoh_states that are no longer used, and the new flash_dmattn_func interface may not handle the dynamic-masking behavior that zoh_states were intended to provide.
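One way to address this, sketched under the assumption that calculate_zoh_states only feeds the dynamic-masking path (the flag and the zoh_states_to_mask helper below are hypothetical, and the calculate_zoh_states signature is assumed):

def prepare_attn_mask(value_states, dt_proj, A, use_zoh_masking=False):
    # Only pay for calculate_zoh_states when its output is actually consumed downstream.
    if not use_zoh_masking:
        return None
    zoh_states = calculate_zoh_states(value_states, dt_proj, A)  # assumed signature
    return zoh_states_to_mask(zoh_states)  # hypothetical helper for the dynamic mask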
query_states = query_states.transpose(1, 2)  # [batch, query_len, num_heads, head_dim]
key_states = key_states.transpose(1, 2)  # [batch, key_len, num_kv_heads, head_dim]
value_states = value_states.transpose(1, 2)  # [batch, key_len, num_kv_heads, head_dim]
Copilot AI commented on Jul 30, 2025
The .contiguous() calls have been removed from the tensor operations. If the new flash_dmattn_func interface expects contiguous tensors, this could cause performance issues or errors. Consider adding .contiguous() calls back if the interface requires them.
Suggested change:
- query_states = query_states.transpose(1, 2)  # [batch, query_len, num_heads, head_dim]
- key_states = key_states.transpose(1, 2)  # [batch, key_len, num_kv_heads, head_dim]
- value_states = value_states.transpose(1, 2)  # [batch, key_len, num_kv_heads, head_dim]
+ query_states = query_states.transpose(1, 2).contiguous()  # [batch, query_len, num_heads, head_dim]
+ key_states = key_states.transpose(1, 2).contiguous()  # [batch, key_len, num_kv_heads, head_dim]
+ value_states = value_states.transpose(1, 2).contiguous()  # [batch, key_len, num_kv_heads, head_dim]