Refactor attention block smoothing for consistency #205
Conversation
Introduces a shared block smoothing helper so top-k and ReLU masks keep consistent voting behavior while avoiding duplicated logic. Stops coercing attention scores to float, keeping native dtypes when ranking.
Pull Request Overview
This PR refactors duplicate block smoothing logic in mask generation functions by extracting it into a reusable block_smooth function. The refactoring reduces code duplication while maintaining identical functionality.
Key changes:
- Introduced a new `block_smooth` helper function to encapsulate block-based mask smoothing logic (a rough sketch of such a helper follows this list)
- Replaced duplicate block smoothing code in both `topk_mask` and `relu_mask` with calls to the new helper
- Changed the dtype conversion for block counts from `torch.int32` to `torch.int64`
- Removed the redundant `block_size` validation and casting that existed in both functions
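For reference, here is a minimal sketch of what a majority-vote block smoothing helper along these lines might look like. The function body, reshaping strategy, and error handling are assumptions for illustration and not the PR's actual code:

```python
import torch

def block_smooth(attention_mask: torch.Tensor, key_len: int, block_size: int) -> torch.Tensor:
    # Hypothetical sketch: smooth a boolean mask of shape (..., key_len) by majority vote
    # over non-overlapping blocks of `block_size` along the key dimension.
    if not isinstance(block_size, int) or block_size <= 0:
        raise ValueError("block_size must be a positive integer")

    num_full = key_len // block_size
    full_len = num_full * block_size
    smoothed = attention_mask.clone()

    if num_full > 0:
        # Count True votes inside each full block and keep the block iff a strict majority is True.
        blocks = attention_mask[..., :full_len].reshape(*attention_mask.shape[:-1], num_full, block_size)
        votes = blocks.sum(dim=-1, dtype=torch.int64)  # per-block True counts
        keep = votes * 2 > block_size                  # strict majority vote
        smoothed[..., :full_len] = keep.unsqueeze(-1).expand_as(blocks).reshape(
            *attention_mask.shape[:-1], full_len
        )

    tail = key_len - full_len
    if tail > 0:
        # Vote over the ragged tail using its actual length.
        tail_votes = attention_mask[..., full_len:key_len].sum(dim=-1, dtype=torch.int64)
        smoothed[..., full_len:key_len] = (tail_votes * 2 > tail).unsqueeze(-1)
    return smoothed
```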
```python
    attention_mask: torch.Tensor,
    key_len: int,
    block_size: int,
):
```
Copilot AI · Nov 6, 2025
The newly introduced block_smooth function is missing a docstring. Add documentation explaining its purpose, parameters, and return value to maintain consistency with other functions in the module like topk_mask and relu_mask.
Suggested change:

```python
):
    """
    Smooths the attention mask by aggregating values in non-overlapping blocks along the key dimension.
    For each block of size `block_size`, the block is set to True if more than half of its elements are True;
    otherwise it is set to False. The tail (if `key_len` is not divisible by `block_size`) is handled similarly.

    Args:
        attention_mask (torch.Tensor): Boolean tensor of shape (..., key_len) representing the attention mask.
        key_len (int): The length of the key dimension.
        block_size (int): The size of each block for smoothing. Must be a positive integer.

    Returns:
        torch.Tensor: The smoothed attention mask tensor of the same shape as the input.
    """
```
```python
    key_len: int,
    block_size: int,
):
    if block_size <= 0:
```
Copilot AI · Nov 6, 2025
The validation only checks if block_size <= 0, but doesn't verify that block_size is an integer. The previous code included int(block_size) != block_size checks in both topk_mask and relu_mask. Consider adding integer validation here: if not isinstance(block_size, int) or block_size <= 0: to maintain the same input validation behavior.
Suggested change:

```diff
-    if block_size <= 0:
+    if not isinstance(block_size, int) or block_size <= 0:
```
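To see why the stricter check matters: a float such as 2.5 slips past the sign check but would break the downstream block arithmetic. A hypothetical guard along the suggested lines:

```python
block_size = 2.5  # hypothetical bad input: passes `block_size <= 0` but is not a valid block size
if not isinstance(block_size, int) or block_size <= 0:
    raise ValueError(f"block_size must be a positive integer, got {block_size!r}")
```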
Copilot AI · Nov 6, 2025
There is trailing whitespace on line 47. Remove the extra whitespace after the return statement.
```diff
 attention_bias = attention_bias.masked_fill(~attention_mask, min_dtype) if attention_mask is not None else attention_bias
 topk_values, topk_indices = torch.topk(
-    attention_bias.to(torch.float),
+    attention_bias,
```
Copilot AI · Nov 6, 2025
The .to(torch.float) conversion was removed from the torch.topk call. While this may be intentional to preserve the original dtype, it changes the existing behavior. If attention_bias is not already torch.float, this could affect numerical precision in the topk operation. Verify this is the intended behavior or document why the dtype conversion was removed.
Suggested change:

```diff
-    attention_bias,
+    attention_bias.to(torch.float),
```
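One way to sanity-check the behavior change (not part of the PR) is to compare the indices `torch.topk` selects with and without the float cast on a low-precision dtype:

```python
import torch

# Illustrative check, not from the PR: do native-dtype and float32 top-k agree on bfloat16 scores?
scores = torch.randn(4, 128).to(torch.bfloat16)
_, idx_native = torch.topk(scores, k=16, dim=-1)                 # rank in the native dtype
_, idx_float = torch.topk(scores.to(torch.float), k=16, dim=-1)  # rank after upcasting
print("index sets match:", torch.equal(idx_native, idx_float))
```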
Summary
Root Cause
Changes
Reproduction
Tests
Compatibility
Checklist