Fix modeling example #147
Conversation
Removes the Flex Attention forward function and its integration with BlockMask to streamline the attention mechanism. Updates the flash attention import to use the more specific flash_dynamic_mask_attention_forward function instead of the generic auto function.

Eliminates the complex prepare_dynamic_mask method that handled top-k selection and masking logic, replacing it with a simpler direct bias expansion approach. This reduces code complexity while maintaining the core dynamic mask attention functionality.

Changes the attention interface selection to prefer eager attention as the fallback when flash attention is unavailable, improving compatibility across different environments.
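As a rough sketch of that simplified path: the import location, helper names, and tensor shapes below are assumptions based on this description, not the exact modeling code in the repository.

```python
# Illustrative sketch only: names, shapes, and the import path are assumptions.
import torch

try:
    # Assumed import location; the PR only states that the specific function is
    # imported directly instead of a generic auto-selection helper.
    from flash_dmattn import flash_dynamic_mask_attention_forward
except ImportError:
    flash_dynamic_mask_attention_forward = None


def eager_attention_forward(*args, **kwargs):
    # Stand-in for the PyTorch reference implementation in the modeling file.
    raise NotImplementedError


def select_attention_interface():
    # Prefer the flash kernel when it is available, otherwise the eager fallback.
    if flash_dynamic_mask_attention_forward is not None:
        return flash_dynamic_mask_attention_forward
    return eager_attention_forward


def build_attn_bias(dt_states: torch.Tensor, query_len: int) -> torch.Tensor:
    # Direct bias expansion: dt_states assumed [batch, num_kv_heads, key_len],
    # broadcast across the query dimension instead of running the old
    # prepare_dynamic_mask top-k / BlockMask logic.
    return dt_states[:, :, None, :].expand(-1, -1, query_len, -1)
```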
Corrects improper indentation that was causing the flash dynamic mask attention interface assignment to be misaligned with the surrounding code block structure.
Pull Request Overview
This PR simplifies the Doge model attention implementation by removing unstable Flex Attention integration and streamlining the dynamic masking approach to use only Flash Dynamic Mask Attention and a PyTorch eager fallback.
Key changes made:
- Removed Flex Attention backend and BlockMask logic
- Replaced generic kernel auto-selection with explicit flash_dynamic_mask_attention_forward import
- Simplified dynamic mask preparation by directly creating attention bias from dt_states expansion
    attention_interface = flash_dynamic_mask_attention_forward
    ...
    attn_output = attention_interface(
    ...
    attention_mask = attention_mask.expand(-1, attn_bias.shape[1], -1, -1) if attention_mask is not None else None,  # attention_mask: batch, num_kv_heads, query_len, key_len
Copilot AI (Sep 3, 2025):
There's a trailing comma after the else clause, which makes this a tuple assignment instead of a simple assignment. Remove the comma so the variable holds the tensor (or None) rather than a one-element tuple.
Suggested change:

    - attention_mask = attention_mask.expand(-1, attn_bias.shape[1], -1, -1) if attention_mask is not None else None,  # attention_mask: batch, num_kv_heads, query_len, key_len
    + attention_mask = attention_mask.expand(-1, attn_bias.shape[1], -1, -1) if attention_mask is not None else None  # attention_mask: batch, num_kv_heads, query_len, key_len
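For reference, a quick standalone repro of why the comma matters (plain PyTorch, not the repository code):

```python
# Minimal repro of the issue flagged above: the trailing comma turns the
# right-hand side into a 1-tuple instead of a tensor (or None).
import torch

mask = torch.ones(2, 1, 4, 8)

wrong = mask.expand(-1, 3, -1, -1) if mask is not None else None,  # note the comma
right = mask.expand(-1, 3, -1, -1) if mask is not None else None

print(type(wrong))   # <class 'tuple'>
print(type(right))   # <class 'torch.Tensor'>
```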
    attention_interface: Callable = eager_attention_forward
    if flash_dynamic_mask_attention_forward is not None:
        attention_interface = flash_dynamic_mask_attention_forward
Copilot AI (Sep 3, 2025):
[nitpick] The variable assignment pattern assigns eager_attention_forward by default, then conditionally reassigns to flash_dynamic_mask_attention_forward. This could be simplified to a single conditional assignment for better readability.
Suggested change:

    - attention_interface: Callable = eager_attention_forward
    - if flash_dynamic_mask_attention_forward is not None:
    -     attention_interface = flash_dynamic_mask_attention_forward
    + attention_interface: Callable = (
    +     flash_dynamic_mask_attention_forward if flash_dynamic_mask_attention_forward is not None else eager_attention_forward
    + )
    attn_output, attn_weights = attention_interface(
        self,
        query_states,
        key_states,
        value_states,
        attn_mask=attn_mask,
        attn_bias=attn_bias,
        is_causal=self.is_causal,
        attention_mask=attention_mask,
        attention_bias=attn_bias,
        scale=self.scaling,
    )
Copilot AI (Sep 3, 2025):
The function call passes self as the first argument, but eager_attention_forward expects module as the first parameter. When flash_dynamic_mask_attention_forward is used, verify it also expects the module instance as the first parameter to ensure consistent API usage.
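For illustration, with hypothetical signatures: passing self positionally only works if both callables take the attention module instance as their first parameter.

```python
# Toy illustration (hypothetical signatures, not the repository code): calling
# attention_interface(self, ...) from inside the module binds `self` to the
# `module` parameter of whichever implementation is selected.
class DummyAttention:
    scaling = 0.125

    def forward(self, q, k, v, interface):
        return interface(self, q, k, v)  # `self` becomes `module` below


def eager_attention_forward(module, query, key, value):
    assert module.scaling == 0.125  # `module` is the DummyAttention instance
    return query, None
```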
Hi @yiyousong, please try this version. If it runs properly, we will merge it into main.
Enables the attention mechanism to accept and apply an optional attention bias tensor, allowing for more flexible attention patterns and improved model capabilities. The bias is added to attention weights before applying the attention mask, following standard transformer architecture practices.
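For reference, a minimal eager-style sketch of that ordering (bias first, then mask); the function name, argument names, and shapes are illustrative assumptions, not the repository's exact implementation.

```python
# Minimal sketch of "bias added before the mask". Assumed layout:
# q/k/v are [batch, heads, seq_len, head_dim]; attn_bias and attention_mask
# broadcast to [batch, heads, q_len, k_len]; the mask is additive (0 / -inf).
import torch
import torch.nn.functional as F

def eager_attention_with_bias(q, k, v, attn_bias=None, attention_mask=None, scale=None):
    scale = scale if scale is not None else q.shape[-1] ** -0.5
    attn_weights = torch.matmul(q, k.transpose(-2, -1)) * scale
    if attn_bias is not None:
        attn_weights = attn_weights + attn_bias        # bias is applied first ...
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask   # ... then the attention mask
    attn_weights = F.softmax(attn_weights, dim=-1)
    return torch.matmul(attn_weights, v), attn_weights

q = k = v = torch.randn(1, 2, 4, 8)
bias = torch.zeros(1, 2, 4, 4)
out, weights = eager_attention_with_bias(q, k, v, attn_bias=bias)
```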
Replaces the conditional block size logic with a fixed value of 64 to streamline the splitkv kernel configuration and eliminate branching based on head size. The previous conditional logic is preserved as a comment for reference.
Standardizes block dimensions to 64x64 across all head dimensions and updates shared memory thresholds for better GPU utilization. Changes the kernel selection logic to use a consistent 164KB threshold and provides detailed CTA count documentation for different GPU architectures (sm86/89, A100, H100). Improves memory efficiency by using smaller block sizes with better occupancy characteristics and enables compact memory layout flags for older architectures.
Activates kernel optimizations by setting optimization flags to true for both 128 and 256 head dimension configurations on sm86 and sm89 architectures. Adds memory usage comment for 256 head dimension case to document resource requirements.
Adjusts kernel parameters across different head dimensions to improve memory usage and performance on various GPU architectures. Updates shared memory requirements and CTA counts for better utilization on sm86, sm89, A100, and H100 GPUs. Enables double buffering and adjusts block sizes to reduce memory footprint while maintaining or improving performance across different hardware configurations.
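Purely illustrative pseudocode of the dispatch rule these commits describe; every name below is hypothetical, and the real selection lives in the CUDA launch code.

```python
# Illustrative only: hypothetical names mirroring the selection rule described
# above (64x64 tiles everywhere; the shared-memory budget decides the variant).
SMEM_THRESHOLD = 164 * 1024  # the "consistent 164KB threshold" mentioned above

def pick_kernel_config(max_smem_per_block: int) -> dict:
    if max_smem_per_block >= SMEM_THRESHOLD:  # larger budget, e.g. A100 / H100
        return {"block_m": 64, "block_n": 64, "double_buffer": True}
    # smaller budget, e.g. sm86 / sm89: same tile shape, compact memory layout
    return {"block_m": 64, "block_n": 64, "compact_layout": True}
```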
Description
This PR fixes issue #146 by simplifying the Doge model attention implementation. It removes unused / unstable Flex Attention integration (including BlockMask logic and kernel selection indirection) and streamlines the dynamic masking path to rely only on:

- Flash Dynamic Mask Attention (flash_dynamic_mask_attention_forward)
- the PyTorch eager fallback (eager_attention_forward)

The previous layered abstraction added complexity (Flex vs Flash vs fallback) without clear functional benefit, and the dynamic mask preparation logic (top-k selection + block pruning) was tightly coupled and harder to maintain. The new implementation keeps the core dynamic bias mechanism while being more readable, explicit, and robust across environments.
Type of Change
Related Issues
Changes Made
Code Changes
- Removed the Flex Attention forward path and the BlockMask import/usage.
- Imported the explicit flash_dynamic_mask_attention_forward instead of the generic auto-selection function.
- Removed the prepare_dynamic_mask method (top-k selection + BlockMask-style structures).
- Build attn_bias directly from dt_states expansion.
- Consolidated attention_mask / attention_bias handling into a unified interface (flash_dynamic_mask_attention_forward if present, else eager fallback).
- attention_mask broadcasting logic is explicit and only expanded once.

Documentation
Benchmarks / Runtime
Testing
Performed the following local validation steps:
- Instantiated the model from DogeConfig and executed a dummy batch with and without attention_mask.
- Verified the expanded bias shape [batch, num_kv_heads, query_len, key_len] aligns with the softmax input.
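A rough sketch of those steps; the import path and the config arguments below are assumptions, not the repository's actual module layout.

```python
# Rough sketch of the validation above. The module name and config arguments
# are assumptions; point the import at wherever the Doge modeling example lives.
import torch
from modeling_doge import DogeConfig, DogeForCausalLM  # hypothetical import path

config = DogeConfig(hidden_size=128, num_hidden_layers=2, num_attention_heads=4)
model = DogeForCausalLM(config).eval()

input_ids = torch.randint(0, config.vocab_size, (2, 16))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    with_mask = model(input_ids=input_ids, attention_mask=attention_mask)
    without_mask = model(input_ids=input_ids)  # no attention_mask

print(with_mask.logits.shape, without_mask.logits.shape)
```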
Breaking Changes
None functionally intended.
Removed: the implicit Flex Attention backend route (relevant only to anyone who manually forced Flex kernel usage).
Migration: no action required; the model automatically uses the flash kernel when it is installed, otherwise the eager fallback.
Checklist
CUDA-specific
Additional Notes
Follow-up: document the semantics of attention_bias in developer docs.