
Conversation

@LoserCheems
Collaborator

Removes the dynamic mask attention CUDA implementation without top-k computation (the no-topk variant) to simplify benchmark comparisons and reduce code complexity.

Updates test configurations to use head dimension of 128 instead of 32 for more realistic performance testing scenarios.

Adjusts benchmark output formatting to accommodate the reduced number of implementations being compared.

Copilot AI (Contributor) left a comment


Pull Request Overview

This PR removes the no-topk CUDA implementation from the benchmarks to simplify code and comparisons. The changes update test configurations to use a more realistic head dimension of 128 instead of 32, and adjust the benchmark output formatting to accommodate fewer implementations.

  • Removes the dynamic_mask_attention_cuda_no_topk function and all related benchmarking code
  • Updates test configurations to use head dimension 128 instead of 32 for more realistic scenarios (see the configuration sketch after this list)
  • Adjusts benchmark output formatting by reducing table width and removing no-topk columns
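
For context, a minimal sketch of what the updated test configuration might look like; the variable name and tuple layout are assumptions for illustration, not code taken from the PR diff:

# Hypothetical benchmark configurations: (batch_size, num_heads, seq_len, head_dim).
# The PR raises head_dim from 32 to 128 for more realistic workloads.
test_configs = [
    (1, 8, 1024, 128),
    (1, 8, 4096, 128),
]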

  flash_avg = time_avgs.get('flash', float('inf'))

- for impl_key in ['cuda', 'no_topk', 'triton', 'flex']:
+ for impl_key in ['cuda', 'triton', 'flex']:

Copilot AI Jul 10, 2025


The hardcoded list of implementation keys should be extracted to a constant or derived from the implementations dictionary to avoid maintenance issues when adding or removing implementations.

Suggested change
- for impl_key in ['cuda', 'triton', 'flex']:
+ for impl_key in [key for key in implementations.keys() if key != 'flash']:

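To illustrate the pattern the suggestion points at, here is a self-contained sketch; the implementations dictionary, the placeholder callables, and the timing numbers are all made up for illustration, not code from the PR:

implementations = {
    'flash': lambda: None,   # baseline SDPA path (placeholder)
    'cuda': lambda: None,    # dynamic mask attention CUDA kernel (placeholder)
    'triton': lambda: None,  # Triton implementation (placeholder)
    'flex': lambda: None,    # FlexAttention implementation (placeholder)
}
time_avgs = {'flash': 1.00, 'cuda': 0.62, 'triton': 0.71, 'flex': 0.80}  # ms, made up

flash_avg = time_avgs.get('flash', float('inf'))

# Deriving the loop keys from the dictionary means adding or removing an
# implementation needs no edit here, which is the maintenance issue the
# review comment flags.
for impl_key in (key for key in implementations if key != 'flash'):
    impl_avg = time_avgs.get(impl_key, float('inf'))
    speedup = flash_avg / impl_avg if impl_avg else float('inf')
    print(f"{impl_key}: {impl_avg:.2f} ms ({speedup:.2f}x vs flash)")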
Removes flash attention backend specification to use default SDPA behavior and enables attention mask usage.

Comments out most benchmark configurations to focus testing on window size variations, reducing benchmark execution time while maintaining core functionality testing.
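
A minimal sketch of what removing the backend pin implies, assuming the benchmark calls PyTorch's scaled_dot_product_attention directly; the tensor shapes are illustrative and the mask layout is an assumption:

import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 1024, 128  # head_dim echoes the PR's configs
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Boolean mask, True = attend. With no backend forced via
# torch.nn.attention.sdpa_kernel, PyTorch selects the backend itself, and
# passing attn_mask steers it away from paths that cannot honor a mask.
attn_mask = torch.ones(seq_len, seq_len, dtype=torch.bool).tril()

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)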
@LoserCheems LoserCheems merged commit 5f465b8 into main Jul 10, 2025