Conversation

@LoserCheems (Collaborator) commented Sep 8, 2025

Description

Revises documentation to reflect the transition from a two-stage ZOH-based approach to a unified sparse computation system with block-level skip logic.

Removes references to TopK selection and keep_window_size parameters in favor of direct mask and bias tensor inputs, simplifying the API while maintaining sparse computation benefits.

Key documentation updates include:

  • Replaces ZOH states and active masks with attention mask and bias tensors
  • Documents unified block-level skip logic for both forward and backward passes
  • Updates API signatures to reflect new required parameters
  • Adds comprehensive shared memory aliasing strategies
  • Documents LSE caching for numerical stability in backward pass
  • Updates performance models to reflect block-level sparsity benefits
  • Provides complete migration examples for existing codebases
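
For readers who want the new call pattern spelled out, here is a minimal plain-PyTorch sketch of the mask-and-bias semantics described above. It is a reference illustration only, not the CUDA kernel; the shapes, dtypes, and example values are assumptions made for this sketch.

```python
import torch

# Reference illustration of the mask + bias semantics (not the CUDA kernel).
batch, num_heads, seq_len, head_dim = 2, 8, 128, 64

q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_heads, seq_len, head_dim)
v = torch.randn(batch, num_heads, seq_len, head_dim)

# 4D mask/bias tensors replace the old ZOH states, active masks, and keep_window_size selection.
attention_mask = torch.ones(batch, num_heads, seq_len, seq_len)   # 1.0 = attend, 0.0 = skip
attention_bias = torch.zeros(batch, num_heads, seq_len, seq_len)  # added to attention scores

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5 + attention_bias
scores = scores.masked_fill(attention_mask == 0, float("-inf"))
out = scores.softmax(dim=-1) @ v   # [batch, num_heads, seq_len, head_dim]
```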

Type of Change

Please check the relevant option(s):

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance optimization
  • CUDA kernel improvement
  • Code refactoring

Related Issues

None.

Changes Made

Code Changes

  • Modified Python API
  • Updated CUDA kernels
  • Changed build system
  • Updated dependencies

Documentation

  • Updated README
  • Updated API documentation
  • Added examples
  • Updated benchmarks

Testing

Please describe the tests you ran to verify your changes:

  • Existing tests pass: python -m pytest tests/ -v
  • Added new tests for new functionality
  • Benchmarks show no performance regression
  • Tested on multiple GPU architectures (if applicable)

Test Configuration

  • OS: [e.g., Ubuntu 20.04]
  • Python: [e.g., 3.9.7]
  • PyTorch: [e.g., 2.1.0]
  • CUDA: [e.g., 11.8]
  • GPU: [e.g., RTX 4090]

Performance Impact

If this change affects performance, please provide benchmarks:

Before

# Benchmark results before your changes

After

# Benchmark results after your changes

Breaking Changes

If this PR introduces breaking changes, please describe:

  • What breaks
  • How users can migrate their code
  • Why the breaking change is necessary

Checklist

Please check all that apply:

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

CUDA-specific (if applicable)

  • CUDA kernels compile without warnings
  • Tested on SM 8.0+ architectures
  • Memory usage has been profiled
  • No memory leaks detected

Additional Notes

Any additional information that reviewers should know:

Screenshots (if applicable)

If your changes include visual elements or performance improvements, please add screenshots or graphs.

Replaces the packed-variant and variable-length sections with comprehensive Transformers integration documentation.

Adds detailed documentation for the flash_dynamic_mask_attention_forward function, with complete usage examples showing dynamic attention bias generation and flexible backend selection.

Reorganizes content structure to prioritize practical integration patterns over low-level API variants.

Includes backend comparison table and updated installation instructions for better developer onboarding.
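
Where the integration docs cover dynamic attention bias generation, the following is a hedged illustration of one way such a bias could be produced; the dt_proj projection and the broadcasting recipe are assumptions for this sketch, not the documented implementation.

```python
import torch

# Illustrative only: one possible way to generate a dynamic attention bias per key position.
batch, num_kv_heads, seq_len, head_dim = 2, 4, 128, 64
value_states = torch.randn(batch, num_kv_heads, seq_len, head_dim)

dt_proj = torch.nn.Linear(head_dim, 1)            # hypothetical learned projection
bias_per_key = dt_proj(value_states).squeeze(-1)  # [batch, num_kv_heads, seq_len]

# Broadcast to the 4D shape the attention call consumes: [batch, num_kv_heads, query_len, key_len].
attention_bias = bias_per_key[:, :, None, :].expand(-1, -1, seq_len, -1)
```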
Copilot AI review requested due to automatic review settings September 8, 2025 02:22

…-level optimizations

Restructures the README to highlight core kernel advantages, including native 4D mask/bias tensor processing and intelligent computation-skipping mechanisms.

Reorganizes feature sections to better showcase performance optimizations and separates basic usage from gradient computation examples.

Improves technical explanations by focusing on unified skip logic, memory access patterns, and complete gradient chain support rather than abstract integration concepts.

Updates code examples to demonstrate proper tensor shapes and sparse mask generation patterns for better user guidance.
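
To complement the mask-generation point above, here is a hypothetical sketch of building a sparse mask outside the kernel (the kernel itself no longer performs TopK selection, so mask construction is left to the caller; the keep value below is an illustrative choice, not a documented parameter):

```python
import torch

# Hypothetical external mask construction: keep the highest-bias key positions per query.
batch, num_kv_heads, seq_len, keep = 2, 4, 256, 64

attention_bias = torch.randn(batch, num_kv_heads, seq_len, seq_len)
topk_idx = attention_bias.topk(keep, dim=-1).indices

attention_mask = torch.zeros_like(attention_bias)
attention_mask.scatter_(-1, topk_idx, 1.0)  # 1.0 = attend, 0.0 = skip; zeros enable block-level skipping
```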
@LoserCheems requested a review from Copilot September 8, 2025 03:57
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR updates the documentation to reflect a transition from a two-stage ZOH-based approach to a unified sparse computation system with block-level skip logic. The changes remove references to TopK selection parameters and update API signatures to reflect direct mask and bias tensor inputs.

  • Comprehensive documentation update across API reference and README files
  • Updated API signatures to reflect unified block-level sparse computation
  • Added transformers integration documentation with complete usage examples

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

  • docs/api_reference.md: Major API documentation overhaul with new backend details, transformers integration section, and updated function signatures
  • README_zh.md: Chinese README updates with revised feature descriptions and usage examples
  • README.md: English README updates with revised feature descriptions and usage examples

- softcap: Softcap value for attention scores
- **kwargs: Additional arguments including:
    - is_causal: Whether to apply causal mask
    - keep_window_size: Size of window to keep

Copilot AI Sep 8, 2025

The keep_window_size parameter is documented here but the PR description mentions removing references to keep_window_size parameters. This appears inconsistent with the stated goal of the documentation update.

Suggested change (delete this line):
- keep_window_size: Size of window to keep

README.md (outdated diff excerpt)
                                device=device, dtype=dtype)
- attention_mask = torch.ones(batch_size, num_heads, seq_len, seq_len,
-                             device=device, dtype=dtype)
+ attention_mask = torch.ones(batch_size, num_heads, seq_len, seq_len, device=device, dtype=dtype)

Copilot AI Sep 8, 2025

The attention_bias tensor shape uses num_kv_heads but the attention_mask tensor on line 172 uses num_heads. This inconsistency in head dimensions should be clarified or made consistent.

Suggested change
- attention_mask = torch.ones(batch_size, num_heads, seq_len, seq_len, device=device, dtype=dtype)
+ attention_mask = torch.ones(batch_size, num_kv_heads, seq_len, seq_len, device=device, dtype=dtype)

Corrects the attention mask to use num_kv_heads instead of num_heads, for dimensional consistency with the attention bias and key/value tensors in the sparse attention implementation.

Updates both English and Chinese documentation examples.
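
For reference, a short sketch of the corrected shape convention under a grouped-query layout (the example values are illustrative assumptions):

```python
import torch

# Illustrative shapes after the fix: the mask and bias follow the key/value head count.
batch, num_heads, num_kv_heads, seq_len, head_dim = 1, 8, 2, 512, 64

query = torch.randn(batch, num_heads, seq_len, head_dim)
key = torch.randn(batch, num_kv_heads, seq_len, head_dim)
value = torch.randn(batch, num_kv_heads, seq_len, head_dim)

attention_mask = torch.ones(batch, num_kv_heads, seq_len, seq_len)   # matches key/value heads
attention_bias = torch.zeros(batch, num_kv_heads, seq_len, seq_len)  # matches key/value heads
```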
@LoserCheems merged commit cdfa83a into main Sep 8, 2025