
Conversation

@LoserCheems (Collaborator) commented Nov 7, 2025

Summary

  • Purpose: Introduce a compact dynamic-mask attention (DMA) path optimized with Triton that delivers substantially faster training while preserving exact numerical equivalence with the baseline DMA.
  • Outcome: On bf16 and large-window workloads, this implementation achieves ~1.6× end-to-end speedup for forward+backward with identical outputs and gradients to the original implementation.

Design

  • Compact representation: Gather K/V/B into compact buffers (CuK, CuV, CuB) using attn_indices and build a boolean mask CuM of shape [query_len × window_size]. This eliminates memory traffic and compute for keys outside the window; see the reference sketch at the end of this list.
  • Streaming softmax (forward): Iterate over compact blocks, apply bias and mask, compute stable log-sum-exp statistics (lse), and accumulate outputs. Skip fully-inactive tiles.
  • Backward rematerialization:
    • Compute Delta = Σ(o * do) per row in fp32.
    • For each column block, rematerialize scores/probabilities under CuM, accumulate dV/dK/dB/dQ in fp32, then cast to input dtype.
    • Scatter-add compact dK/dV/dB back to the original sequence dimension using attn_indices.
  • GQA mapping: Map Q heads to KV heads via h_h_k_ratio = nheads // nheads_k in both forward and backward.
  • Numerical stability: All accumulations are performed in fp32, with stable lse tracking identical to the baseline DMA. Causal masking is supported and pre-applied into CuM.
  • Constraints: head_dim ≤ 128, dtype in {fp16, bf16}, attn_indices is int64 and must be valid for the chosen window_size.
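  • Reference sketch (illustration only): the PyTorch snippet below mirrors the compact forward semantics described above — gathering via attn_indices, GQA head mapping, and bias-plus-mask softmax — without the streaming lse, tile skipping, or block-wise fp32 accumulation of the actual Triton kernels. Function and variable names are chosen for the example and are not part of the library API; the causal branch assumes query_len == key_len.

    import torch

    def compact_dma_forward_reference(query, key, value, attn_bias, attn_indices,
                                      is_causal=False, softmax_scale=None):
        # query: (B, H, Lq, D); key/value: (B, Hk, Lk, D); attn_bias: (B, Hk, Lk)
        # attn_indices: (B, Hk, W) int64 positions into key_len, sorted ascending
        B, H, Lq, D = query.shape
        Hk, W = key.shape[1], attn_indices.shape[-1]
        scale = softmax_scale if softmax_scale is not None else D ** -0.5
        h_h_k_ratio = H // Hk  # GQA/MQA: query head h reads KV head h // h_h_k_ratio

        # Gather the compact buffers CuK/CuV/CuB along the key dimension.
        idx = attn_indices[..., None].expand(-1, -1, -1, D)        # (B, Hk, W, D)
        cu_k = torch.gather(key, 2, idx)
        cu_v = torch.gather(value, 2, idx)
        cu_b = torch.gather(attn_bias, 2, attn_indices)            # (B, Hk, W)

        # Expand KV heads to query heads.
        cu_k = cu_k.repeat_interleave(h_h_k_ratio, dim=1)
        cu_v = cu_v.repeat_interleave(h_h_k_ratio, dim=1)
        cu_b = cu_b.repeat_interleave(h_h_k_ratio, dim=1)
        kv_pos = attn_indices.repeat_interleave(h_h_k_ratio, dim=1)  # original key positions

        # Scores over the compact window only: (B, H, Lq, W).
        scores = torch.einsum("bhqd,bhwd->bhqw", query.float(), cu_k.float()) * scale
        scores = scores + cu_b.float()[:, :, None, :]
        if is_causal:
            q_pos = torch.arange(Lq, device=query.device)[:, None]
            cu_m = kv_pos[:, :, None, :] <= q_pos                  # True = keep (CuM analogue)
            scores = scores.masked_fill(~cu_m, float("-inf"))

        probs = torch.softmax(scores, dim=-1)
        probs = torch.nan_to_num(probs)  # rows with no active keys would be NaN; the kernels skip such tiles
        out = torch.einsum("bhqw,bhwd->bhqd", probs, cu_v.float())
        return out.to(query.dtype)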

Changes

  • New fast path: flash_dmattn.flash_dmattn_triton_special.triton_dmattn_func(query, key, value, attn_bias, attn_indices, is_causal=False, softmax_scale=None).
  • Triton kernels:
    • _fwd_preprocess: gather K/V/B into CuK/CuV/CuB and construct CuM with row/col/causal masking.
    • _fwd_kernel: streaming softmax over compact tiles with stable lse.
    • _bwd_preprocess_do_o_dot: compute per-row Delta.
    • _bwd_kernel + _bwd_kernel_one_col_block: column-block backward with fp32 accumulators, then scatter back via indices.
  • Public API: No breaking changes; adds the triton_dmattn_func convenience entrypoint.
  • Behavior: Supports causal and non-causal; works with GQA/MQA (Q heads divisible by KV heads). Assumes attn_indices are valid indices into key_len.

Implementation notes

  • Safety: Double-masking (row and col) prevents OOB loads/stores on non-divisible tiles; causal is pre-baked into CuM.
  • Precision: Internal reductions are fp32; final outputs cast to the input dtype (fp16/bf16).
  • Performance: Skips fully inactive tiles; reduces memory bandwidth via compact buffers; autotune configs provided for common BLOCK_M/N and num_warps.
  • Indices: attn_indices must lie in [0, key_len); the provided generator (e.g., topk_indices) guarantees validity. If externally produced indices are used, optional guards can be added during the scatter step to drop invalid entries (see the sketch after this list).
  • Limits: Designed for head_dim ≤ 128; extending beyond may require additional kernel variants.
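  • Guarded scatter sketch (illustration only): a minimal PyTorch analogue of the scatter-add step with the optional validity guard mentioned above. Names are illustrative and not part of the library; the real kernels perform this inside the backward Triton kernels with fp32 accumulators.

    import torch

    def scatter_compact_grads(dcu_k, dcu_v, dcu_b, attn_indices, key_len):
        # dcu_k/dcu_v: (B, Hk, W, D) fp32 compact grads; dcu_b: (B, Hk, W); attn_indices: (B, Hk, W) int64
        B, Hk, W, D = dcu_k.shape
        dk = torch.zeros(B, Hk, key_len, D, dtype=dcu_k.dtype, device=dcu_k.device)
        dv = torch.zeros_like(dk)
        db = torch.zeros(B, Hk, key_len, dtype=dcu_b.dtype, device=dcu_b.device)

        # Optional guard for externally supplied indices: zero out-of-range entries
        # instead of letting them corrupt the scatter.
        valid = ((attn_indices >= 0) & (attn_indices < key_len)).to(dcu_k.dtype)
        safe_idx = attn_indices.clamp(0, key_len - 1)

        idx_d = safe_idx[..., None].expand(-1, -1, -1, D)
        dk.scatter_add_(2, idx_d, dcu_k * valid[..., None])
        dv.scatter_add_(2, idx_d, dcu_v * valid[..., None])
        db.scatter_add_(2, safe_idx, dcu_b * valid)
        return dk, dv, db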

Tests

  • Correctness: Exact numerical equivalence to the baseline DMA for forward and backward across large-window causal settings and GQA.
  • Performance (100 runs, bf16):
    • Config: batch=2, num_heads=16, num_kv_heads=8, query_len=8192, key_len=8192, head_dim=128, window_size=2048
    • Baseline DMA (triton) fwd+bwd: 29.750222ms ± 0.265306ms
    • Triton special fwd+bwd: 18.768802ms ± 0.224953ms
    • Speedup: ~1.59×
  • Minimal runnable example:
    import torch
    from flash_dmattn.flash_dmattn_triton_special import triton_dmattn_func
    from flash_dmattn.utils.mask import topk_indices
    
    device = 'cuda'
    dtype = torch.bfloat16
    batch, num_heads, num_kv_heads, query_len, key_len, head_dim, window_size = 2, 16, 8, 8192, 8192, 128, 2048
    
    # Shapes follow the description above: Q is (batch, num_heads, query_len, head_dim),
    # K/V are (batch, num_kv_heads, key_len, head_dim), bias is (batch, num_kv_heads, key_len).
    query = torch.randn(batch, num_heads, query_len, head_dim, device=device, dtype=dtype, requires_grad=True)
    key = torch.randn(batch, num_kv_heads, key_len, head_dim, device=device, dtype=dtype, requires_grad=True)
    value = torch.randn(batch, num_kv_heads, key_len, head_dim, device=device, dtype=dtype, requires_grad=True)
    attn_bias = torch.randn(batch, num_kv_heads, key_len, device=device, dtype=dtype, requires_grad=True)
    # Select the window_size keys with the largest bias per (batch, kv head); returns sorted int64 indices.
    attn_indices = topk_indices(attn_bias, window_size)
    
    out = triton_dmattn_func(query, key, value, attn_bias, attn_indices, is_causal=True)
    out.sum().backward()
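
  • Benchmark sketch (illustration only): timings like the ones above can be reproduced with a CUDA-event harness along these lines; this is not the benchmark script used for the reported numbers, and it reuses the tensors from the example above.

    def time_fwd_bwd(run, iters=100, warmup=10):
        # run() must execute one forward+backward pass.
        for _ in range(warmup):
            run()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        times = []
        for _ in range(iters):
            start.record()
            run()
            end.record()
            torch.cuda.synchronize()
            times.append(start.elapsed_time(end))  # milliseconds
        times = torch.tensor(times)
        return times.mean().item(), times.std().item()

    def one_step():
        # Clear grads so they do not accumulate across timed iterations.
        for t in (query, key, value, attn_bias):
            t.grad = None
        out = triton_dmattn_func(query, key, value, attn_bias, attn_indices, is_causal=True)
        out.sum().backward()

    mean_ms, std_ms = time_fwd_bwd(one_step)
    print(f"fwd+bwd: {mean_ms:.6f}ms ± {std_ms:.6f}ms")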
    

Documentation

  • API reference: Add triton_dmattn_func to the English API docs, including input shapes, dtype constraints, and notes on attn_indices.
  • Integration guide: Brief section showing how to compute attn_indices (e.g., via topk_indices; see the sketch after this list) and how to switch between the baseline and the special Triton path.
  • Performance notes: Document typical speedups and constraints (head_dim ≤ 128, bf16/fp16).
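  • Sketch for the integration guide (assumption): a minimal illustration of how a top-k helper over the bias could look, assuming the documented (batch_size, num_kv_heads, key_len) layout; the actual topk_indices in flash_dmattn/utils/mask.py may differ.

    import torch

    def topk_indices_sketch(attn_bias: torch.Tensor, window_size: int) -> torch.Tensor:
        # attn_bias: (batch_size, num_kv_heads, key_len)
        # Keep the window_size largest bias entries per (batch, kv head) row and return
        # their positions sorted ascending so gathered K/V blocks stay in sequence order.
        idx = attn_bias.topk(window_size, dim=-1).indices   # (batch_size, num_kv_heads, window_size)
        return idx.sort(dim=-1).values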

Checklist

  • Linked issue provided
  • API stabilised
  • Tests added or updated
  • Docs added or updated
  • No known performance regressions

Enables fused Triton forward/backward paths for dynamic masked attention to reduce padding overhead and deliver faster windowed attention execution.
Introduces reusable top-k extraction on the bias tensor to simplify downstream mask logic.
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds a specialized Triton implementation for flash dynamic masked attention along with a utility function to extract top-k indices from attention bias. The implementation introduces a gather-based approach where attention is computed only on a subset of key-value pairs selected by top-k indices.

Key changes:

  • Added topk_indices utility function to extract and sort top-k indices from attention bias
  • Implemented a new Triton-based flash attention variant that uses gathered K/V/bias values
  • Added preprocessing, forward, and backward kernels for the specialized implementation

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

Files reviewed:
  • flash_dmattn/utils/mask.py: Added topk_indices function to compute sorted top-k indices from attention bias
  • flash_dmattn/flash_dmattn_triton_special.py: New file implementing specialized Triton kernels for flash dynamic masked attention with gather-based optimization


(batch_size, num_kv_heads, key_len).
window_size (int): The number of top elements to consider for the mask.
**kwargs: Additional keyword arguments.

Copilot AI commented Nov 7, 2025: Trailing whitespace should be removed.

mask=valid_idx,
other=0.0,
)

Copilot AI commented Nov 7, 2025: Trailing whitespace should be removed.

mask=(start_n + offs_n) < window_size,
other=0.0,
)

Copilot AI commented Nov 7, 2025: Trailing whitespace should be removed.


# Compute dp
dp = tl.dot(do, tl.trans(v))

Copilot AI commented Nov 7, 2025: Trailing whitespace should be removed.

)

# We could have padded the head dimension
dq = dq[..., : do.shape[-1]]

Copilot AI commented Nov 7, 2025: Trailing whitespace should be removed (after `dq = dq[..., : do.shape[-1]]`).

dk = tl.zeros([BLOCK_N, BLOCK_HEADDIM], dtype=tl.float32)
db = tl.zeros([BLOCK_N], dtype=tl.float32)

# Load k and v, them will stay in SRAM throughout

Copilot AI commented Nov 7, 2025: Corrected spelling of 'them' to 'they'.

Suggested change:
-   # Load k and v, them will stay in SRAM throughout
+   # Load k and v, they will stay in SRAM throughout

acc_s += tl.where(m, 0, float("-inf"))

# Compute p
m_ij = tl.maximum(tl.max(acc_s, 1), lse_i)

Copilot AI commented Nov 7, 2025: The variable lse_i is initialized with negative infinity at line 359. When all mask elements are False and any_active prevents computation, lse_i remains negative infinity throughout. At line 516, this would store negative infinity values. Consider adding a comment explaining this edge case behavior or adding a guard to handle empty masks explicitly.

Suggested change:
-   m_ij = tl.maximum(tl.max(acc_s, 1), lse_i)
+   # Guard against the case where all mask elements are False and lse_i remains -inf.
+   if tl.all(~m):
+       # All elements are masked out; set m_ij to 0 (or another safe value).
+       m_ij = tl.zeros([acc_s.shape[0]], dtype=acc_s.dtype)
+   else:
+       m_ij = tl.maximum(tl.max(acc_s, 1), lse_i)


lse_i = tl.load(LSE + offs_m_curr)
# p = tl.exp(acc_s - lse_i[:, None])
p = tl.exp(acc_s - tl.where(lse_i > float("-inf"), lse_i, 0.0)[:, None])

Copilot AI commented Nov 7, 2025: This guard handles the case when lse_i is negative infinity, but the forward kernel at line 463 doesn't have the same protection. Consider applying consistent handling of negative-infinity LSE values in both forward and backward passes, or document why they differ.

assert nheads % nheads_k == 0, "Number of Q heads must be divisible by KV heads for GQA/MQA"
assert d <= 128, "FlashDynamicMaskAttention only support head dimensions up to 128"
seqlen_q_rounded = math.ceil(seqlen_q / 128) * 128
seqlen_k_rounded = math.ceil(seqlen_k / 128) * 128

Copilot AI commented Nov 7, 2025: Variable seqlen_k_rounded is not used.

Suggested change:
-   seqlen_k_rounded = math.ceil(seqlen_k / 128) * 128
+   # seqlen_k_rounded = math.ceil(seqlen_k / 128) * 128  # Removed unused variable

@LoserCheems merged commit 926bb35 into main on Nov 7, 2025