Conversation

@LoserCheems
Collaborator

Corrects the OOM condition logic to require that both the query and key lengths exceed the threshold, instead of either one.
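The tightened check can be sketched as a minimal predicate. The function name and the threshold value here are illustrative assumptions, not taken from the PR:

```python
# Hypothetical sketch of the corrected skip-on-OOM check: both lengths
# must exceed the threshold before a configuration is skipped.
OOM_THRESHOLD = 2048  # assumed value, for illustration only

def should_skip_config(query_len: int, key_len: int,
                       threshold: int = OOM_THRESHOLD) -> bool:
    """Return True when a benchmark configuration is likely to OOM."""
    # Before the fix the condition used `or`, skipping configs where
    # only one of the two lengths was large.
    return query_len > threshold and key_len > threshold
```

With `or`, a long-query/short-key case (which fits in memory) would have been skipped; `and` only skips when both dimensions are large.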

Captures the attention mask returned by the dynamic mask preparation function and passes it to the CUDA kernel, instead of incorrectly using the active mask.
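The capture-and-pass pattern can be sketched as below. The `prepare_dynamic_mask` body and `keep_ratio` parameter are hypothetical stand-ins (the real function lives in the repository and its signature may differ); the point is that the kernel call consumes the returned `attn_mask`, not the intermediate `active_mask`:

```python
def prepare_dynamic_mask(scores, keep_ratio=0.5):
    """Stub standing in for the real mask-preparation step (assumed shape):
    it derives a boolean active mask and an additive attention mask."""
    threshold = sorted(scores)[int(len(scores) * (1 - keep_ratio))]
    active_mask = [s >= threshold for s in scores]
    # Additive mask: 0.0 for kept positions, -inf for masked ones.
    attn_mask = [0.0 if keep else float("-inf") for keep in active_mask]
    return active_mask, attn_mask

def masked_kernel_input(scores):
    # Fixed pattern: capture BOTH outputs, and hand the kernel the
    # processed attn_mask rather than the boolean active_mask.
    active_mask, attn_mask = prepare_dynamic_mask(scores)
    return attn_mask
```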

Replaces hardcoded boolean literals with an is_causal variable for better maintainability.

Expands the benchmark configurations with larger sequence lengths, additional head dimensions, and higher-dimensional embeddings for more comprehensive performance coverage.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR refines out-of-memory (OOM) checks, corrects dynamic mask handling in the CUDA kernel call, introduces an is_causal flag for clarity, and expands benchmark configurations for broader performance coverage.

  • Tighten OOM logic to require both query and key lengths to exceed the threshold.
  • Capture the attention mask (attn_mask) and pass it correctly into the CUDA kernel, and add an is_causal variable in place of hardcoded True.
  • Expand benchmark scenarios with larger sequence lengths, additional head dimensions, and higher embedding sizes.

Adds comprehensive memory cleanup with garbage collection and CUDA cache clearing between test configurations to prevent memory issues during extended benchmarking.
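The cleanup step between configurations can be sketched as follows. `gc.collect()` and `torch.cuda.empty_cache()` are real APIs; the wrapper function name and the guard that lets the sketch run without torch or a GPU are assumptions for illustration:

```python
import gc
import importlib.util

def cleanup_between_configs() -> None:
    """Release Python garbage and, when torch with CUDA is present,
    return cached allocator blocks, so one configuration's leftover
    tensors cannot push the next configuration over the memory limit."""
    gc.collect()                      # drop unreachable Python objects
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached CUDA memory
```

In the benchmark loop this would be called after each configuration finishes, before the next set of tensors is allocated.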

Expands test configurations to include more diverse scenarios with varying sequence lengths from 4 to 4096 tokens and different head dimensions.

Fixes inconsistent attention mask handling by ensuring both Python and CUDA implementations properly use the processed attention mask from prepare_dynamic_mask.

Adds proper CUDA synchronization around timing measurements to ensure accurate performance comparisons.
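The synchronize-before-and-after timing pattern can be sketched generically. In the benchmark itself the hook would be `torch.cuda.synchronize`; here it is an injectable parameter (an assumption of this sketch) so the code runs without a GPU:

```python
import time
from typing import Callable, Optional

def timed_run(fn: Callable[[], object],
              synchronize: Optional[Callable[[], None]] = None):
    """Time fn(), draining pending device work before and after.

    Without synchronization, asynchronous CUDA kernel launches would
    make the wall-clock interval measure launch overhead rather than
    actual execution time.
    """
    if synchronize is not None:
        synchronize()          # finish any previously queued work
    start = time.perf_counter()
    result = fn()
    if synchronize is not None:
        synchronize()          # wait until the timed work completes
    return result, time.perf_counter() - start
```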
@LoserCheems merged commit 5c6a7d6 into main on Jun 30, 2025

Labels

bug Something isn't working


5 participants