Conversation

@LoserCheems
Collaborator

Corrects the OOM condition logic to require that both the query and key lengths exceed the threshold, instead of either one.
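The tightened check can be sketched as a minimal predicate. The function name and the threshold value here are illustrative assumptions, not taken from the PR:

```python
# Hypothetical sketch of the corrected skip-on-OOM check: both lengths
# must exceed the threshold before a configuration is skipped.
OOM_THRESHOLD = 2048  # assumed value, for illustration only

def should_skip_config(query_len: int, key_len: int,
                       threshold: int = OOM_THRESHOLD) -> bool:
    """Return True when a benchmark configuration is likely to OOM."""
    # Before the fix the condition used `or`, skipping configs where
    # only one of the two lengths was large.
    return query_len > threshold and key_len > threshold
```

With `or`, a long-query/short-key case (which fits in memory) would have been skipped; `and` only skips when both dimensions are large.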

Captures the attention mask returned by the dynamic mask preparation function and passes it to the CUDA kernel, instead of incorrectly using the active mask.
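The capture-and-pass pattern can be sketched as below. The `prepare_dynamic_mask` body and `keep_ratio` parameter are hypothetical stand-ins (the real function lives in the repository and its signature may differ); the point is that the kernel call consumes the returned `attn_mask`, not the intermediate `active_mask`:

```python
def prepare_dynamic_mask(scores, keep_ratio=0.5):
    """Stub standing in for the real mask-preparation step (assumed shape):
    it derives a boolean active mask and an additive attention mask."""
    threshold = sorted(scores)[int(len(scores) * (1 - keep_ratio))]
    active_mask = [s >= threshold for s in scores]
    # Additive mask: 0.0 for kept positions, -inf for masked ones.
    attn_mask = [0.0 if keep else float("-inf") for keep in active_mask]
    return active_mask, attn_mask

def masked_kernel_input(scores):
    # Fixed pattern: capture BOTH outputs, and hand the kernel the
    # processed attn_mask rather than the boolean active_mask.
    active_mask, attn_mask = prepare_dynamic_mask(scores)
    return attn_mask
```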

Replaces hardcoded boolean literals with an is_causal variable for better maintainability.

Expands the benchmark configurations with larger sequence lengths, additional head dimensions, and higher-dimensional embeddings for more comprehensive performance coverage.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR refines out-of-memory (OOM) checks, corrects dynamic mask handling in the CUDA kernel call, introduces an is_causal flag for clarity, and expands benchmark configurations for broader performance coverage.

  • Tighten OOM logic to require both query and key lengths to exceed the threshold.
  • Capture the attention mask (attn_mask) and pass it correctly into the CUDA kernel, and add an is_causal variable in place of hardcoded True.
  • Expand benchmark scenarios with larger sequence lengths, additional head dimensions, and higher embedding sizes.

Adds comprehensive memory cleanup with garbage collection and CUDA cache clearing between test configurations to prevent memory issues during extended benchmarking.
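The cleanup step between configurations can be sketched as follows. `gc.collect()` and `torch.cuda.empty_cache()` are real APIs; the wrapper function name and the guard that lets the sketch run without torch or a GPU are assumptions for illustration:

```python
import gc
import importlib.util

def cleanup_between_configs() -> None:
    """Release Python garbage and, when torch with CUDA is present,
    return cached allocator blocks, so one configuration's leftover
    tensors cannot push the next configuration over the memory limit."""
    gc.collect()                      # drop unreachable Python objects
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached CUDA memory
```

In the benchmark loop this would be called after each configuration finishes, before the next set of tensors is allocated.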

Expands test configurations to include more diverse scenarios with varying sequence lengths from 4 to 4096 tokens and different head dimensions.

Fixes inconsistent attention mask handling by ensuring both Python and CUDA implementations properly use the processed attention mask from prepare_dynamic_mask.

Adds proper CUDA synchronization around timing measurements to ensure accurate performance comparisons.
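The synchronize-before-and-after timing pattern can be sketched generically. In the benchmark itself the hook would be `torch.cuda.synchronize`; here it is an injectable parameter (an assumption of this sketch) so the code runs without a GPU:

```python
import time
from typing import Callable, Optional

def timed_run(fn: Callable[[], object],
              synchronize: Optional[Callable[[], None]] = None):
    """Time fn(), draining pending device work before and after.

    Without synchronization, asynchronous CUDA kernel launches would
    make the wall-clock interval measure launch overhead rather than
    actual execution time.
    """
    if synchronize is not None:
        synchronize()          # finish any previously queued work
    start = time.perf_counter()
    result = fn()
    if synchronize is not None:
        synchronize()          # wait until the timed work completes
    return result, time.perf_counter() - start
```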
@LoserCheems merged commit 5c6a7d6 into main on Jun 30, 2025

Labels

bug Something isn't working


5 participants