Add block-wise smoothing to attention mask #201
Conversation
Introduces an optional block_size to aggregate top-k selections along the key dimension using a majority vote, reducing fragmentation and encouraging locality in the dynamic mask. Validates block_size as a positive integer, handles remainder tails, and forwards the parameter through mask creation. Updates docs accordingly. Preserves previous behavior when unset and uses in-place ops for efficiency.
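A minimal sketch of the behavior described above, assuming a boolean top-k mask whose last dimension is the key dimension. The helper name smooth_blockwise is illustrative only; the real implementation lives inside the repository's mask-creation code and operates in place, whereas this version returns a new tensor:

```python
import torch

def smooth_blockwise(attention_mask: torch.Tensor, block_size: int) -> torch.Tensor:
    """Sketch of block-wise majority-vote smoothing of a boolean top-k mask."""
    if not isinstance(block_size, int) or block_size <= 0:
        raise ValueError("block_size must be a positive integer")
    key_len = attention_mask.shape[-1]
    full_len = (key_len // block_size) * block_size
    pieces = []
    if full_len > 0:
        # Group the first full_len key positions into blocks of block_size.
        blocks = attention_mask[..., :full_len].reshape(
            *attention_mask.shape[:-1], -1, block_size
        )
        # Majority vote: keep a block iff more than half of its entries are selected.
        keep = (blocks.sum(dim=-1, keepdim=True) * 2) > block_size
        pieces.append(keep.expand_as(blocks).reshape(*attention_mask.shape[:-1], full_len))
    if key_len > full_len:
        # The remainder tail, shorter than one block, gets its own vote.
        tail = attention_mask[..., full_len:]
        keep_tail = (tail.sum(dim=-1, keepdim=True) * 2) > tail.shape[-1]
        pieces.append(keep_tail.expand_as(tail))
    return torch.cat(pieces, dim=-1)
```

For example, with block_size=3 a block of [True, False, True] is kept while [True, False, False] is dropped, and a single-element tail of [True] survives its own vote:

```python
mask = torch.tensor([[True, False, True, True, False, False, True]])
print(smooth_blockwise(mask, block_size=3))
# -> tensor([[ True,  True,  True, False, False, False,  True]])
```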
Pull Request Overview
This PR adds an optional block_size parameter to enable block-wise aggregation and smoothing of attention masks after top-k selection. The feature applies a majority voting mechanism where blocks are kept if more than half of their elements are selected.
- Added block_size parameter to dynamic_mask() and create_mask() functions
- Implemented block-wise smoothing logic with majority voting for both full blocks and tail elements
- Added input validation for block_size to ensure it's a positive integer (sketched below)
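The validation itself is not quoted anywhere on this page, so the following is only a sketch of the kind of check the last bullet describes; the helper name and error message are assumptions, not code from this PR:

```python
def _validate_block_size(block_size) -> None:
    # Hypothetical helper; the actual check in dynamic_mask()/create_mask() may differ.
    if block_size is None:
        return  # unset: fall back to the previous, un-smoothed behavior
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(block_size, int) or isinstance(block_size, bool) or block_size <= 0:
        raise ValueError(f"block_size must be a positive integer, got {block_size!r}")
```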
block_shape = (*block_view.shape[:-1], full_len // block_size, block_size)
blocks = block_view.view(*block_shape)
block_counts = blocks.sum(dim=-1).to(torch.int32)
block_keep = (block_counts * 2) > block_size
Copilot AI commented on Oct 30, 2025
The condition (block_counts * 2) > block_size is a magic number pattern that obscures the majority voting logic. Consider extracting this as a named constant or adding an inline comment to clarify this represents 'keep block if more than 50% of elements are True'.
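One way to apply the comment-based variant of this suggestion is to annotate the quoted lines in place; the comment wording below is an illustration, not the maintainers':

```python
# Majority vote: keep the block when strictly more than half of its block_size
# elements were selected by top-k, i.e. count > block_size / 2, written as
# count * 2 > block_size to stay in integer arithmetic.
block_counts = blocks.sum(dim=-1).to(torch.int32)
block_keep = (block_counts * 2) > block_size
```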
tail_slice = attention_mask[..., full_len:]
tail_len = tail_slice.shape[-1]
tail_counts = tail_slice.sum(dim=-1, keepdim=True).to(torch.int32)
tail_keep = (tail_counts * 2) > tail_len
Copilot AI commented on Oct 30, 2025
Same as line 67, this condition (tail_counts * 2) > tail_len uses a magic number pattern. Consider adding a comment or using a named constant to clarify this represents the majority voting threshold (>50%).
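Since the identical test appears for both full blocks and the tail, the two comments could also be addressed at once with a small named helper; the name below is hypothetical, not code from this PR:

```python
import torch

def _majority_keep(counts: torch.Tensor, group_len: int) -> torch.Tensor:
    """True where strictly more than half of `group_len` elements are selected."""
    return (counts * 2) > group_len

# Intended call sites, matching the hunks quoted on this page:
#   block_keep = _majority_keep(block_counts, block_size)
#   tail_keep  = _majority_keep(tail_counts, tail_len)
```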
block_counts = blocks.sum(dim=-1).to(torch.int32)
block_keep = (block_counts * 2) > block_size
blocks.copy_(block_keep.unsqueeze(-1).expand_as(blocks))

if key_len > full_len:
    tail_slice = attention_mask[..., full_len:]
    tail_len = tail_slice.shape[-1]
    tail_counts = tail_slice.sum(dim=-1, keepdim=True).to(torch.int32)
Copilot AI commented on Oct 30, 2025
Converting to torch.int32 may cause unnecessary precision loss or overflow for large block sizes. Consider using torch.int64 for safer integer arithmetic, or document why int32 is sufficient for the expected use cases.
Suggested change:
-block_counts = blocks.sum(dim=-1).to(torch.int32)
+block_counts = blocks.sum(dim=-1).to(torch.int64)
 block_keep = (block_counts * 2) > block_size
 blocks.copy_(block_keep.unsqueeze(-1).expand_as(blocks))
 if key_len > full_len:
     tail_slice = attention_mask[..., full_len:]
     tail_len = tail_slice.shape[-1]
-    tail_counts = tail_slice.sum(dim=-1, keepdim=True).to(torch.int32)
+    tail_counts = tail_slice.sum(dim=-1, keepdim=True).to(torch.int64)
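For context on the overflow concern: each summed value is a count of boolean selections within one block (or the tail), so it is bounded by block_size (or tail_len). The check below is purely illustrative, with an assumed block_size value; whether to adopt the int64 suggestion or document why int32 suffices is a maintainer decision:

```python
import torch

block_size = 4096  # illustrative value, not taken from the PR
# Worst case: every position in every block is selected.
blocks = torch.ones(2, 8, block_size, dtype=torch.bool)
counts = blocks.sum(dim=-1)  # summing bools promotes to int64 by default
assert counts.max().item() == block_size
# The count can never exceed block_size, so int32 only overflows for
# block sizes beyond 2**31 - 1.
assert block_size <= torch.iinfo(torch.int32).max
```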
if key_len > full_len:
    tail_slice = attention_mask[..., full_len:]
    tail_len = tail_slice.shape[-1]
    tail_counts = tail_slice.sum(dim=-1, keepdim=True).to(torch.int32)
Copilot AI commented on Oct 30, 2025
Same as line 66, converting to torch.int32 may cause precision loss. Consider using torch.int64 for consistency and to handle edge cases with large tensors safely.
Suggested change:
-tail_counts = tail_slice.sum(dim=-1, keepdim=True).to(torch.int32)
+tail_counts = tail_slice.sum(dim=-1, keepdim=True).to(torch.int64)
Introduce an optional block_size parameter to enhance attention mask generation by aggregating top-k selections, reducing fragmentation, and improving locality. Validate block_size as a positive integer and ensure previous behavior remains intact when unset. Update documentation accordingly.