
Conversation

@LoserCheems
Collaborator

Summary

  • Refactors the attention block smoothing logic to ensure consistent voting behavior across top-k and ReLU masks.

Root Cause

  • The previous implementation contained duplicated logic for block smoothing, leading to inconsistencies.

Changes

  • Introduces a shared block smoothing helper function to eliminate duplicated logic (a sketch follows below).
  • Stops coercing attention scores to float, maintaining native data types during ranking.
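
A minimal sketch of what such a helper might look like, assuming the majority-vote semantics described in the review below; the function name matches the PR, but the body and signature here are illustrative, not the repository's actual code:

```python
import torch

def block_smooth(attention_mask: torch.Tensor, key_len: int, block_size: int) -> torch.Tensor:
    # Majority vote over non-overlapping blocks along the key dimension;
    # the ragged tail (when key_len % block_size != 0) is voted the same way.
    smoothed = attention_mask.clone()
    for start in range(0, key_len, block_size):
        end = min(start + block_size, key_len)
        block = attention_mask[..., start:end]
        votes = block.sum(dim=-1, keepdim=True)    # summing bools yields int64 counts
        smoothed[..., start:end] = votes * 2 > (end - start)
    return smoothed
```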

Reproduction

  • Not applicable; this is a refactor with no specific bug to reproduce.

Tests

  • Existing tests validate the functionality; no new tests added.

Compatibility

  • No backward compatibility issues identified.

Checklist

  • Linked issue provided
  • Adds or updates tests
  • Updates docs if needed
  • No perf regressions

Copilot AI review requested due to automatic review settings November 6, 2025 07:23
@LoserCheems LoserCheems merged commit e3bcf48 into main Nov 6, 2025
5 checks passed
Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors duplicate block smoothing logic in mask generation functions by extracting it into a reusable block_smooth function. The refactoring reduces code duplication while maintaining identical functionality.

Key changes:

  • Introduced a new block_smooth helper function to encapsulate block-based mask smoothing logic (a worked example follows this list)
  • Replaced duplicate block smoothing code in both topk_mask and relu_mask functions with calls to the new helper
  • Changed dtype conversion from torch.int32 to torch.int64 for block counts
  • Removed redundant block_size validation and casting that existed in both functions
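
A self-contained illustration of the majority-vote semantics on hypothetical data (assuming key_len divides evenly by block_size, so the tail case is omitted):

```python
import torch

mask = torch.tensor([[True, True, False, False, False, True, True, True]])
block_size = 4
blocks = mask.view(1, -1, block_size)   # (batch, n_blocks, block_size)
votes = blocks.sum(dim=-1)              # int64 counts, matching the new dtype choice
majority = votes * 2 > block_size       # "more than half" means a strict majority
smoothed = majority.unsqueeze(-1).expand_as(blocks).reshape(1, -1)
print(smoothed)
# tensor([[False, False, False, False,  True,  True,  True,  True]])
```

The first block has only 2 of 4 elements set, so it is cleared; the second has 3 of 4, so the whole block is kept.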


```python
    attention_mask: torch.Tensor,
    key_len: int,
    block_size: int,
):
```

Copilot AI Nov 6, 2025


The newly introduced block_smooth function is missing a docstring. Add documentation explaining its purpose, parameters, and return value to maintain consistency with other functions in the module like topk_mask and relu_mask.

Suggested change

```diff
 ):
+    """
+    Smooths the attention mask by aggregating values in non-overlapping blocks along the key dimension.
+    For each block of size `block_size`, the block is set to True if more than half of its elements are True,
+    otherwise it is set to False. The tail (if key_len is not divisible by block_size) is handled similarly.
+
+    Args:
+        attention_mask (torch.Tensor): Boolean tensor of shape (..., key_len) representing the attention mask.
+        key_len (int): The length of the key dimension.
+        block_size (int): The size of each block for smoothing. Must be a positive integer.
+
+    Returns:
+        torch.Tensor: The smoothed attention mask tensor of the same shape as the input.
+    """
```

```python
    key_len: int,
    block_size: int,
):
    if block_size <= 0:
```

Copilot AI Nov 6, 2025


The validation only checks if block_size <= 0, but doesn't verify that block_size is an integer. The previous code included int(block_size) != block_size checks in both topk_mask and relu_mask. Consider adding integer validation here: if not isinstance(block_size, int) or block_size <= 0: to maintain the same input validation behavior.

Suggested change

```diff
-    if block_size <= 0:
+    if not isinstance(block_size, int) or block_size <= 0:
```
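
A short, hypothetical demonstration of why the isinstance check matters, assuming the helper iterates blocks with integer-step arithmetic such as range:

```python
block_size = 2.5
print(block_size <= 0)              # False, so the size check alone lets a float through
try:
    list(range(0, 16, block_size))  # integer-step block iteration then fails
except TypeError as exc:
    print(exc)                      # 'float' object cannot be interpreted as an integer
```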

Comment on lines +47 to +48



Copilot AI Nov 6, 2025


There is trailing whitespace on line 47. Remove the extra whitespace after the return statement.

Suggested change (whitespace-only diff: the trailing spaces after the return statement are removed)
```diff
 attention_bias = attention_bias.masked_fill(~attention_mask, min_dtype) if attention_mask is not None else attention_bias
 topk_values, topk_indices = torch.topk(
-    attention_bias.to(torch.float),
+    attention_bias,
```

Copilot AI Nov 6, 2025


The .to(torch.float) conversion was removed from the torch.topk call. While this may be intentional to preserve the original dtype, it changes the existing behavior. If attention_bias is not already torch.float, this could affect numerical precision in the topk operation. Verify this is the intended behavior or document why the dtype conversion was removed.

Suggested change

```diff
-    attention_bias,
+    attention_bias.to(torch.float),
```
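
For context, a quick check of the dtype behavior torch.topk exhibits without the coercion (assuming a recent PyTorch build where topk supports reduced-precision dtypes on the target device):

```python
import torch

scores = torch.randn(2, 8, dtype=torch.bfloat16)
values, indices = torch.topk(scores, k=3, dim=-1)
print(values.dtype)   # torch.bfloat16, the input's native dtype is preserved
print(indices.dtype)  # torch.int64
```

Skipping the upcast avoids materializing a float32 copy of the bias tensor, at the cost of ranking in reduced precision, which is the concern raised above.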
