
Conversation

@LoserCheems
Collaborator

Summary

  • Add robust, numerically correct support for dBias accumulation when dbias has a broadcast query dimension (seqlen_q_bias == 1).
  • Fixes prior under-accumulation/overwrite issues by performing a cooperative block-wide reduction along M and writing a single row to global memory.
  • Improves API ergonomics by inferring "accumulate dBias" from dbias_row_stride == 0 in-kernel, with no need for an explicit flag.

Design

  • In-kernel inference: accum_dbias is inferred as (params.has_bias && params.dbias_row_stride == 0), which exactly captures seqlen_q_bias == 1.
  • Reduction strategy:
    • Materialize the MMA accumulator to shared memory as Element.
    • Reduce across M into row 0 per N-column (in ElementAccum to preserve precision).
    • Write only the first row to global using copy_MN with max_M=1 and proper N-tail predication.
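The reduction strategy above can be sketched serially (names are hypothetical; the actual kernel performs this cooperatively across threads over CUTE tensors, with Element typically bf16 rather than float):

```cpp
#include <cassert>
#include <vector>

// Serial sketch of the block-wide dBias reduction (hypothetical names).
// Element models the storage type (bf16 in the real kernel, float here for
// brevity); ElementAccum is float, matching the kernel's accumulation precision.
using Element = float;
using ElementAccum = float;

// Reduce an M x N shared-memory tile across M into row 0, accumulating in
// ElementAccum and casting back to Element only on the final writeback.
void reduce_dbias_rows(std::vector<Element>& smem, int M, int N) {
    for (int n = 0; n < N; ++n) {
        ElementAccum acc = 0;
        for (int m = 0; m < M; ++m) {
            acc += static_cast<ElementAccum>(smem[m * N + n]);
        }
        smem[n] = static_cast<Element>(acc);  // row 0 now holds the column sum
    }
}
```

Only row 0 of the tile is subsequently copied to global memory (the max_M=1 write), which is what makes the broadcast row correct instead of last-writer-wins.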

Alternatives considered:

  • Accumulate directly in global memory with atomicAdd: simpler to reason about for determinism, but higher overhead; retained as a future option via dbias_accum_ptr if needed.
  • Keep the old per-thread “first row” collapse: incorrect under row_stride == 0 due to last-writer-wins.

Changes

  • Kernel (flash_bwd_kernel.h):
    • Infer accum_dbias via params.dbias_row_stride == 0 and binfo.actual_seqlen_q > 1 (the latter is a micro-optimization).
    • Add shared-memory reduction across M; write a single row with max_M=1.
    • Maintain proper column predication (tail N tile).
    • Remove temporary debug prints.
  • No public Python API changes required.
    • Note: the host-side accum_dbias flag is no longer needed; kernel behavior is determined by the dbias layout (row_stride).

Implementation Notes

  • ElementAccum is used for the reduction; cast to Element only on writeback to preserve numerical stability.
  • We avoided extra atomics; if deterministic multi-split accumulation is required later, guard on dbias_accum_ptr (and deterministic) to choose an alternate path.
  • The stride check is reliable because set_params_dgrad sets dbias_row_stride = 0 exactly when seqlen_q_bias == 1 (broadcasted row).
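The host/kernel contract described in these notes can be illustrated with a small sketch (the struct is a simplified, hypothetical subset of the real Flash_bwd_params; set_dbias_layout stands in for the relevant part of set_params_dgrad):

```cpp
#include <cassert>
#include <cstdint>

// Simplified view of the backward params (hypothetical subset).
struct BwdParams {
    bool has_bias;
    int64_t dbias_row_stride;
};

// Mirrors the host-side convention: a broadcast query dimension
// (seqlen_q_bias == 1) is encoded as a zero row stride on dbias.
void set_dbias_layout(BwdParams& p, bool has_bias, int seqlen_q_bias,
                      int64_t row_stride) {
    p.has_bias = has_bias;
    p.dbias_row_stride = (seqlen_q_bias == 1) ? 0 : row_stride;
}

// In-kernel inference: accumulate dBias exactly when the row is broadcast.
bool infer_accum_dbias(const BwdParams& p) {
    return p.has_bias && p.dbias_row_stride == 0;
}
```

Because the stride encodes the layout unambiguously, the kernel needs no extra flag, and the host API surface shrinks accordingly.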

Tests

  • Benchmarks/Equivalence:
    • backward_equivalence.py: dBias now matches reference within bf16 tolerance.
    • User-provided metrics:
      • Shape: [1, 1, 1, 4196], dtype: bfloat16
      • Original range: [-552.0, 852.0], CUDA range: [-552.0, 852.0]
      • Max abs diff: 2.0; Mean abs diff: 0.01867676
      • Elements within tolerance: 100% (4196/4196)
      • Strict allclose (bf16 rtol=0.1, atol=0.1): PASS
  • Grad check:
    • grad_equivalence.py: should pass with similar tolerances; please re-run to confirm end-to-end.
  • Optional run steps (Windows PowerShell):
    pip install -e . --no-build-isolation
    python -c "from flash_dmattn import get_available_backends; print(get_available_backends())"
    python benchmarks\backward_equivalence.py
    python benchmarks\grad_equivalence.py
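The strict allclose criterion reported above corresponds to a check along these lines (a minimal sketch with the bf16 tolerances from the metrics, not the benchmark script itself; NumPy float32 stands in after upcasting):

```python
import numpy as np

def check_dbias(reference, cuda_result, rtol=0.1, atol=0.1):
    """Report max/mean abs diff and a strict allclose verdict (bf16 tolerances)."""
    ref = reference.astype(np.float32)
    out = cuda_result.astype(np.float32)
    diff = np.abs(ref - out)
    within = np.isclose(out, ref, rtol=rtol, atol=atol)
    return {
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
        "pct_within": 100.0 * float(within.mean()),
        "allclose": bool(np.allclose(out, ref, rtol=rtol, atol=atol)),
    }
```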

Docs

  • API/integration docs can mention that dBias accumulation for seqlen_q_bias == 1 is now inferred automatically from the stride; no manual flag is required.
  • Internal developer docs: note the shared-memory reduction pattern and the reason we write max_M=1.

Checklist

  • Linked issue provided
  • API stable (no breaking changes; kernel infers behavior from layout)
  • Tests added or updated (benchmarks serve as regression harness; strict bf16 allclose passes)
  • Docs added or updated (developer notes and integration hint)
  • No known performance regressions (shared-memory reduction adds minimal overhead; avoided atomics; tail-predication unchanged)

Simplifies backward parameters by removing an unused boolean that toggled bias accumulation.
Reduces maintenance surface and potential ABI issues; no behavior change expected.
Replaces atomic updates in the bias-gradient path with on-chip accumulation and a post-loop row-sum reduction, reducing contention and improving performance and determinism.

Derives the accumulation condition from layout (row stride == 0) and sequence length, drops auxiliary pointers/increments, and adds necessary synchronization to avoid shared-memory races when reusing buffers.

Cleans up zeroing and copy ordering and consolidates the final write to global memory.
Simplifies bias-gradient handling by deriving accumulation from the bias sequence-length condition, removing the redundant parameter and related plumbing.

Aligns zero-init of bias buffers with provided tensor options (no forced float), preventing mixed-precision dtype mismatches and improving correctness for MQA/GQA bias shapes.
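The dtype alignment described above can be sketched as follows (a NumPy sketch with float16 standing in for bf16; the actual code calls torch::zeros with the provided tensor options, and the function name here is hypothetical):

```python
import numpy as np

def alloc_dbias_expanded(bias_dtype, batch, heads, seqlen_q,
                         seqlen_k_rounded, broadcast_q):
    """Zero-init the dbias staging buffer in the same dtype as the bias
    tensor, rather than forcing float32, so downstream reduce/copy steps
    do not hit dtype mismatches for MQA/GQA-shaped biases."""
    shape = (batch, heads, 1 if broadcast_q else seqlen_q, seqlen_k_rounded)
    return np.zeros(shape, dtype=bias_dtype)
```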

Streamlines the backward API with no intended behavior changes beyond dtype fix.
Contributor

Copilot AI left a comment


Pull Request Overview

This PR implements robust dBias accumulation for the case when the query sequence dimension is broadcasted (seqlen_q_bias == 1), addressing numerical correctness issues in the backward pass. The kernel now automatically infers when to perform dBias accumulation based on the dbias_row_stride parameter, eliminating the need for an explicit flag in the API.

Key changes:

  • In-kernel inference of dBias accumulation mode from dbias_row_stride == 0 instead of using an explicit API parameter
  • Block-wide reduction across the M dimension to correctly accumulate gradients into a single row when broadcasting
  • Removal of the accum_dbias parameter from the API layer and struct definition

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Changed files:

  • csrc/flash_dmattn/src/flash_bwd_kernel.h: Implements robust dBias accumulation with block-wide reduction for the broadcasted query dimension; removes the atomic accumulation path.
  • csrc/flash_dmattn/src/flash.h: Removes the deprecated accum_dbias field from the Flash_bwd_params struct.
  • csrc/flash_dmattn/flash_api.cpp: Removes the accum_dbias parameter from function signatures and replaces its usage with direct seqlen_q_bias == 1 checks.


Comment on lines +982 to 983
? torch::zeros({batch_size, num_heads, 1, seqlen_k_rounded}, opts)
: torch::zeros({batch_size, num_heads, seqlen_q, seqlen_k_rounded}, opts)

Copilot AI Oct 22, 2025


The dtype for dbias_expanded allocation has changed from at::kFloat to the default opts dtype. This may cause precision loss if opts is not Float32, since the accumulation in the kernel uses ElementAccum (fp32). Consider explicitly preserving opts.dtype(at::kFloat) to match the kernel's accumulation precision.

Suggested change
? torch::zeros({batch_size, num_heads, 1, seqlen_k_rounded}, opts)
: torch::zeros({batch_size, num_heads, seqlen_q, seqlen_k_rounded}, opts)
? torch::zeros({batch_size, num_heads, 1, seqlen_k_rounded}, opts.dtype(at::kFloat))
: torch::zeros({batch_size, num_heads, seqlen_q, seqlen_k_rounded}, opts.dtype(at::kFloat))


cute::copy(smem_tiled_copy_PdS, tdBiasadBias, tdSsdS);

__syncthreads();

Copilot AI Oct 22, 2025


The reduction loop should document that it's performing a column-wise sum across all M rows into row 0 of shared memory, since this is the critical step for correct dBias accumulation when dbias_row_stride == 0.

Suggested change
__syncthreads();
__syncthreads();
// Perform a column-wise sum across all M rows into row 0 of shared memory.
// This is the critical step for correct dBias accumulation when dbias_row_stride == 0.

gmem_tiled_copy_dBias,
tBiassBias, tdBiasgdBias,
tBiascBias, tBiaspBias,
/*max_M=*/1

Copilot AI Oct 22, 2025


The max_M=1 parameter is critical to the correctness of this implementation as it ensures only the reduced row is written. This should have a comment explaining that we're writing only row 0 which contains the sum across all M rows.

@LoserCheems LoserCheems merged commit df5ade9 into main Oct 22, 2025
1 check passed
@LoserCheems LoserCheems deleted the accum_dbias branch October 27, 2025 08:56
@LoserCheems LoserCheems restored the accum_dbias branch October 27, 2025 08:56
