Conversation

@LoserCheems (Collaborator) commented Sep 8, 2025

Description

Revises documentation to reflect the transition from a two-stage ZOH-based approach to a unified sparse computation system with block-level skip logic.

Removes references to TopK selection and keep_window_size parameters in favor of direct mask and bias tensor inputs, simplifying the API while maintaining sparse computation benefits.

Key documentation updates include:

  • Replaces ZOH states and active masks with attention mask and bias tensors
  • Documents unified block-level skip logic for both forward and backward passes
  • Updates API signatures to reflect new required parameters
  • Adds comprehensive shared memory aliasing strategies
  • Documents LSE caching for numerical stability in backward pass
  • Updates performance models to reflect block-level sparsity benefits
  • Provides complete migration examples for existing codebases
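
For readers who want the new call pattern spelled out, here is a minimal plain-PyTorch sketch of the mask-and-bias semantics described above. It is a reference illustration only, not the CUDA kernel; the shapes, dtypes, and example values are assumptions made for this sketch.

```python
import torch

# Reference illustration of the mask + bias semantics (not the CUDA kernel).
batch, num_heads, seq_len, head_dim = 2, 8, 128, 64

q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_heads, seq_len, head_dim)
v = torch.randn(batch, num_heads, seq_len, head_dim)

# 4D mask/bias tensors replace the old ZOH states, active masks, and keep_window_size selection.
attention_mask = torch.ones(batch, num_heads, seq_len, seq_len)   # 1.0 = attend, 0.0 = skip
attention_bias = torch.zeros(batch, num_heads, seq_len, seq_len)  # added to attention scores

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5 + attention_bias
scores = scores.masked_fill(attention_mask == 0, float("-inf"))
out = scores.softmax(dim=-1) @ v   # [batch, num_heads, seq_len, head_dim]
```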

Type of Change

Please check the relevant option(s):

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance optimization
  • CUDA kernel improvement
  • Code refactoring

Related Issues

None.

Changes Made

Code Changes

  • Modified Python API
  • Updated CUDA kernels
  • Changed build system
  • Updated dependencies

Documentation

  • Updated README
  • Updated API documentation
  • Added examples
  • Updated benchmarks

Testing

Please describe the tests you ran to verify your changes:

  • Existing tests pass: python -m pytest tests/ -v
  • Added new tests for new functionality
  • Benchmarks show no performance regression
  • Tested on multiple GPU architectures (if applicable)

Test Configuration

  • OS: [e.g., Ubuntu 20.04]
  • Python: [e.g., 3.9.7]
  • PyTorch: [e.g., 2.1.0]
  • CUDA: [e.g., 11.8]
  • GPU: [e.g., RTX 4090]

Performance Impact

If this change affects performance, please provide benchmarks:

Before

# Benchmark results before your changes

After

# Benchmark results after your changes

Breaking Changes

If this PR introduces breaking changes, please describe:

  • What breaks
  • How users can migrate their code
  • Why the breaking change is necessary

Checklist

Please check all that apply:

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

CUDA-specific (if applicable)

  • CUDA kernels compile without warnings
  • Tested on SM 8.0+ architectures
  • Memory usage has been profiled
  • No memory leaks detected

Additional Notes

Any additional information that reviewers should know:

Screenshots (if applicable)

If your changes include visual elements or performance improvements, please add screenshots or graphs.

Replaces the packed-variant and variable-length sections with comprehensive Transformers integration documentation.

Adds detailed documentation for the flash_dynamic_mask_attention_forward function, with complete usage examples showing dynamic attention bias generation and flexible backend selection.

Reorganizes content structure to prioritize practical integration patterns over low-level API variants.

Includes backend comparison table and updated installation instructions for better developer onboarding.
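
Where the integration docs cover dynamic attention bias generation, the following is a hedged illustration of one way such a bias could be produced; the dt_proj projection and the broadcasting recipe are assumptions for this sketch, not the documented implementation.

```python
import torch

# Illustrative only: one possible way to generate a dynamic attention bias per key position.
batch, num_kv_heads, seq_len, head_dim = 2, 4, 128, 64
value_states = torch.randn(batch, num_kv_heads, seq_len, head_dim)

dt_proj = torch.nn.Linear(head_dim, 1)            # hypothetical learned projection
bias_per_key = dt_proj(value_states).squeeze(-1)  # [batch, num_kv_heads, seq_len]

# Broadcast to the 4D shape the attention call consumes: [batch, num_kv_heads, query_len, key_len].
attention_bias = bias_per_key[:, :, None, :].expand(-1, -1, seq_len, -1)
```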
Copilot AI review requested due to automatic review settings September 8, 2025 02:22

…-level optimizations

Restructures the README to highlight core kernel advantages, including native 4D mask/bias tensor processing and intelligent computation-skipping mechanisms.

Reorganizes feature sections to better showcase performance optimizations and separates basic usage from gradient computation examples.

Improves technical explanations by focusing on unified skip logic, memory access patterns, and complete gradient chain support rather than abstract integration concepts.

Updates code examples to demonstrate proper tensor shapes and sparse mask generation patterns for better user guidance.
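
To complement the mask-generation point above, here is a hypothetical sketch of building a sparse mask outside the kernel (the kernel itself no longer performs TopK selection, so mask construction is left to the caller; the keep value below is an illustrative choice, not a documented parameter):

```python
import torch

# Hypothetical external mask construction: keep the highest-bias key positions per query.
batch, num_kv_heads, seq_len, keep = 2, 4, 256, 64

attention_bias = torch.randn(batch, num_kv_heads, seq_len, seq_len)
topk_idx = attention_bias.topk(keep, dim=-1).indices

attention_mask = torch.zeros_like(attention_bias)
attention_mask.scatter_(-1, topk_idx, 1.0)  # 1.0 = attend, 0.0 = skip; zeros enable block-level skipping
```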
@LoserCheems requested a review from Copilot September 8, 2025 03:57
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR updates the documentation to reflect a transition from a two-stage ZOH-based approach to a unified sparse computation system with block-level skip logic. The changes remove references to TopK selection parameters and update API signatures to reflect direct mask and bias tensor inputs.

  • Comprehensive documentation update across API reference and README files
  • Updated API signatures to reflect unified block-level sparse computation
  • Added transformers integration documentation with complete usage examples

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

  • docs/api_reference.md: Major API documentation overhaul with new backend details, transformers integration section, and updated function signatures
  • README_zh.md: Chinese README updates with revised feature descriptions and usage examples
  • README.md: English README updates with revised feature descriptions and usage examples

- softcap: Softcap value for attention scores
- **kwargs: Additional arguments including:
    - is_causal: Whether to apply causal mask
    - keep_window_size: Size of window to keep

Copilot AI Sep 8, 2025

The keep_window_size parameter is documented here but the PR description mentions removing references to keep_window_size parameters. This appears inconsistent with the stated goal of the documentation update.

Suggested change (delete this line):
- keep_window_size: Size of window to keep

README.md (outdated diff excerpt)
                                device=device, dtype=dtype)
- attention_mask = torch.ones(batch_size, num_heads, seq_len, seq_len,
-                             device=device, dtype=dtype)
+ attention_mask = torch.ones(batch_size, num_heads, seq_len, seq_len, device=device, dtype=dtype)

Copilot AI Sep 8, 2025

The attention_bias tensor shape uses num_kv_heads but the attention_mask tensor on line 172 uses num_heads. This inconsistency in head dimensions should be clarified or made consistent.

Suggested change
- attention_mask = torch.ones(batch_size, num_heads, seq_len, seq_len, device=device, dtype=dtype)
+ attention_mask = torch.ones(batch_size, num_kv_heads, seq_len, seq_len, device=device, dtype=dtype)

Corrects the attention mask to use num_kv_heads instead of num_heads, for dimensional consistency with the attention bias and key/value tensors in the sparse attention implementation.

Updates both English and Chinese documentation examples.
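
For reference, a short sketch of the corrected shape convention under a grouped-query layout (the example values are illustrative assumptions):

```python
import torch

# Illustrative shapes after the fix: the mask and bias follow the key/value head count.
batch, num_heads, num_kv_heads, seq_len, head_dim = 1, 8, 2, 512, 64

query = torch.randn(batch, num_heads, seq_len, head_dim)
key = torch.randn(batch, num_kv_heads, seq_len, head_dim)
value = torch.randn(batch, num_kv_heads, seq_len, head_dim)

attention_mask = torch.ones(batch, num_kv_heads, seq_len, seq_len)   # matches key/value heads
attention_bias = torch.zeros(batch, num_kv_heads, seq_len, seq_len)  # matches key/value heads
```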
@LoserCheems merged commit cdfa83a into main Sep 8, 2025