Fix Dynamic Mask Attention Integration in FlashAttention CUDA Kernel #12
Conversation
Co-authored-by: LoserCheems <124847097+LoserCheems@users.noreply.github.com>
Pull Request Overview
This PR ensures dynamic mask values are correctly applied to attention scores in the non-causal code path and standardizes in-code comments to English.
- Adds mask-application loops to the non-causal attention computation, mirroring the causal path
- Updates Chinese comments to clear English descriptions
```cpp
for (int mma = 0; mma < size<0>(acc_s); ++mma) {
    for (int mi = 0; mi < size<1>(acc_s); ++mi) {
        for (int ki = 0; ki < size<2>(acc_s); ++ki) {
            int m_idx = mi;
            int k_idx = ki;
            if (m_idx < kBlockM && k_idx < block_key_len) {
                auto mask_values_row = sDynamicMaskValues(m_idx, _);
                auto predicate_k_row = sPredicate(m_idx, _);
                if (predicate_k_row(k_idx)) {
                    acc_s(mma, mi, ki) += static_cast<ElementAccum>(mask_values_row(k_idx));
                }
            }
        }
    }
}
```
Copilot (AI) commented on May 22, 2025
[nitpick] The mask application loop is duplicated in both causal and non-causal paths; consider extracting it into a helper function to improve maintainability and avoid code duplication.
Suggested change:

```diff
-for (int mma = 0; mma < size<0>(acc_s); ++mma) {
-    for (int mi = 0; mi < size<1>(acc_s); ++mi) {
-        for (int ki = 0; ki < size<2>(acc_s); ++ki) {
-            int m_idx = mi;
-            int k_idx = ki;
-            if (m_idx < kBlockM && k_idx < block_key_len) {
-                auto mask_values_row = sDynamicMaskValues(m_idx, _);
-                auto predicate_k_row = sPredicate(m_idx, _);
-                if (predicate_k_row(k_idx)) {
-                    acc_s(mma, mi, ki) += static_cast<ElementAccum>(mask_values_row(k_idx));
-                }
-            }
-        }
-    }
-}
+apply_mask_to_scores<ElementAccum>(
+    acc_s, kBlockM, block_key_len, sDynamicMaskValues, sPredicate
+);
```
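Such a helper could look something like the sketch below, written against plain row-major containers rather than the CUTE tensor views the kernel actually uses. The helper name and parameter order merely echo the suggestion above; everything else is illustrative, not the real kernel code.

```cpp
#include <vector>
#include <cstddef>

// Simplified stand-in for the kernel's mask-application step: add the
// precomputed mask value to each attention score whose predicate is set.
// Rows index queries (m), columns index keys (k).
template <typename ElementAccum>
void apply_mask_to_scores(std::vector<std::vector<ElementAccum>>& acc_s,
                          std::size_t block_m, std::size_t block_key_len,
                          const std::vector<std::vector<ElementAccum>>& mask_values,
                          const std::vector<std::vector<bool>>& predicate) {
    for (std::size_t m = 0; m < block_m; ++m) {
        for (std::size_t k = 0; k < block_key_len; ++k) {
            if (predicate[m][k]) {
                acc_s[m][k] += mask_values[m][k];  // active key: apply mask bias
            }
        }
    }
}
```

Both the causal and non-causal paths could then call this single helper, so the predicate-gated addition lives in one place.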
```cpp
// Apply mask values to attention scores (zero_hold states contain mask values to add to attention scores)
for (int mma = 0; mma < size<0>(acc_s); ++mma) {
```
Copilot (AI) commented on May 22, 2025
Applying mask values via triple nested loops in the kernel may impact performance; consider fusing this operation with the sparse GEMM or leveraging vectorized operations to reduce overhead.
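One way to picture the fusion this comment suggests: instead of a second sweep over `acc_s`, the predicate-gated mask bias is added in the same pass that finalizes each score, i.e. in the GEMM epilogue. A rough CPU-side sketch, assuming simple dense inputs; the real kernel would do this inside its MMA epilogue over CUTE fragments, and all names here are illustrative:

```cpp
#include <vector>
#include <cstddef>

// Fused variant: compute each attention score and apply the predicate-gated
// mask bias in one pass, rather than a separate triple loop over acc_s.
inline std::vector<std::vector<float>> scores_with_fused_mask(
    const std::vector<std::vector<float>>& q,           // [block_m][d]
    const std::vector<std::vector<float>>& k,           // [block_k][d]
    const std::vector<std::vector<float>>& mask_values, // [block_m][block_k]
    const std::vector<std::vector<bool>>& predicate) {  // [block_m][block_k]
    std::size_t block_m = q.size(), block_k = k.size();
    std::vector<std::vector<float>> acc_s(block_m, std::vector<float>(block_k, 0.f));
    for (std::size_t m = 0; m < block_m; ++m) {
        for (std::size_t n = 0; n < block_k; ++n) {
            float s = 0.f;
            for (std::size_t d = 0; d < q[m].size(); ++d) {
                s += q[m][d] * k[n][d];        // dot(q_m, k_n)
            }
            if (predicate[m][n]) {
                s += mask_values[m][n];        // mask applied in the same pass
            }
            acc_s[m][n] = s;
        }
    }
    return acc_s;
}
```

The win is that each score is touched once while it is still in registers, instead of being written out and re-read by a separate mask loop.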
```cpp
auto mask_values_row = sDynamicMaskValues(m_idx, _);
auto predicate_k_row = sPredicate(m_idx, _);
```
Copilot (AI) commented on May 22, 2025
[nitpick] The use of _ as an index placeholder in sDynamicMaskValues(m_idx, _) and sPredicate(m_idx, _) may be unclear to readers; consider documenting or renaming this placeholder for better readability.
Suggested change:

```diff
-auto mask_values_row = sDynamicMaskValues(m_idx, _);
-auto predicate_k_row = sPredicate(m_idx, _);
+// `col_idx` represents the column index for the current row `m_idx`.
+auto mask_values_row = sDynamicMaskValues(m_idx, col_idx);
+auto predicate_k_row = sPredicate(m_idx, col_idx);
```
@copilot double check
I've double-checked the code changes and everything looks correct. The issue was that mask values weren't being applied to the attention scores in the non-causal path, which I fixed by adding the appropriate code block (lines 560-574). I also fixed a syntax error (an extra semicolon) and translated the remaining comments to English for consistency. Both the causal and non-causal paths now apply mask values to the attention scores in the same way. Commit 1e22daf contains these changes and should resolve the numerical discrepancies mentioned in issue #11.
Issue Summary
This PR fixes the integration of dynamic mask attention (DMA) into the FlashAttention CUDA kernel. The standalone mask computation in mask.h was working correctly, but the attention outputs showed large numerical discrepancies between the Python and CUDA implementations.
Root Cause
The core issue was that the mask values were computed correctly and used to set the predicates for sparse computation, but they were never added to the attention scores in the non-causal path.
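To make the discrepancy concrete, here is a minimal, purely illustrative reconstruction (not the actual kernel code): before the fix, the non-causal path produced the raw q·k score, while the Python reference also added the mask bias at every active position, so the two outputs differed by exactly that bias.

```cpp
// Illustrative only: the pre-fix non-causal path computed mask values (and
// used them for the sparsity predicate) but never added them to the score,
// so the CUDA output drifted from the Python reference by the mask bias.
struct ScorePair {
    float buggy;      // pre-fix non-causal path: mask bias omitted
    float reference;  // what the Python reference computes
};

inline ScorePair score_at(float qk_dot, float mask_value, bool predicate) {
    ScorePair out;
    out.buggy = qk_dot;
    out.reference = predicate ? qk_dot + mask_value : qk_dot;
    return out;
}
```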
Implementation Details
The fix adds a consistent pattern to both the causal and non-causal code paths:
- Add the mask values from sDynamicMaskValues to the attention scores (acc_s) for active keys

The PR also improves code readability by rewriting the in-code comments in English.
Testing
This change addresses the integration issues described in the issue, where the Python and CUDA implementations showed large numerical discrepancies even though the mask computation itself was correct.
Fixes #11.