
Conversation

@LoserCheems
Collaborator

Removes the dynamic mask attention CUDA implementation without top-k computation (the no-topk variant) to simplify benchmark comparisons and reduce code complexity.

Updates test configurations to use head dimension of 128 instead of 32 for more realistic performance testing scenarios.

Adjusts benchmark output formatting to accommodate the reduced number of implementations being compared.

Copilot AI (Contributor) left a comment


Pull Request Overview

This PR removes the no-topk CUDA implementation from the benchmarks to simplify code and comparisons. The changes update test configurations to use a more realistic head dimension of 128 instead of 32, and adjust the benchmark output formatting to accommodate fewer implementations.

  • Removes the dynamic_mask_attention_cuda_no_topk function and all related benchmarking code
  • Updates test configurations to use head dimension 128 instead of 32 for more realistic scenarios (see the configuration sketch after this list)
  • Adjusts benchmark output formatting by reducing table width and removing no-topk columns
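
For context, a minimal sketch of what the updated test configuration might look like; the variable name and tuple layout are assumptions for illustration, not code taken from the PR diff:

# Hypothetical benchmark configurations: (batch_size, num_heads, seq_len, head_dim).
# The PR raises head_dim from 32 to 128 for more realistic workloads.
test_configs = [
    (1, 8, 1024, 128),
    (1, 8, 4096, 128),
]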

  flash_avg = time_avgs.get('flash', float('inf'))

- for impl_key in ['cuda', 'no_topk', 'triton', 'flex']:
+ for impl_key in ['cuda', 'triton', 'flex']:

Copilot AI Jul 10, 2025


The hardcoded list of implementation keys should be extracted to a constant or derived from the implementations dictionary to avoid maintenance issues when adding or removing implementations.

Suggested change
- for impl_key in ['cuda', 'triton', 'flex']:
+ for impl_key in [key for key in implementations.keys() if key != 'flash']:

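To illustrate the pattern the suggestion points at, here is a self-contained sketch; the implementations dictionary, the placeholder callables, and the timing numbers are all made up for illustration, not code from the PR:

implementations = {
    'flash': lambda: None,   # baseline SDPA path (placeholder)
    'cuda': lambda: None,    # dynamic mask attention CUDA kernel (placeholder)
    'triton': lambda: None,  # Triton implementation (placeholder)
    'flex': lambda: None,    # FlexAttention implementation (placeholder)
}
time_avgs = {'flash': 1.00, 'cuda': 0.62, 'triton': 0.71, 'flex': 0.80}  # ms, made up

flash_avg = time_avgs.get('flash', float('inf'))

# Deriving the loop keys from the dictionary means adding or removing an
# implementation needs no edit here, which is the maintenance issue the
# review comment flags.
for impl_key in (key for key in implementations if key != 'flash'):
    impl_avg = time_avgs.get(impl_key, float('inf'))
    speedup = flash_avg / impl_avg if impl_avg else float('inf')
    print(f"{impl_key}: {impl_avg:.2f} ms ({speedup:.2f}x vs flash)")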
Removes flash attention backend specification to use default SDPA behavior and enables attention mask usage.

Comments out most benchmark configurations to focus testing on window size variations, reducing benchmark execution time while maintaining core functionality testing.
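
A minimal sketch of what removing the backend pin implies, assuming the benchmark calls PyTorch's scaled_dot_product_attention directly; the tensor shapes are illustrative and the mask layout is an assumption:

import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 1024, 128  # head_dim echoes the PR's configs
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Boolean mask, True = attend. With no backend forced via
# torch.nn.attention.sdpa_kernel, PyTorch selects the backend itself, and
# passing attn_mask steers it away from paths that cannot honor a mask.
attn_mask = torch.ones(seq_len, seq_len, dtype=torch.bool).tril()

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)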
@LoserCheems LoserCheems merged commit 5f465b8 into main Jul 10, 2025