Merged
29 commits
6900b29
Adds backend auto-selection API
LoserCheems Nov 9, 2025
71add00
Fix docstring for flash_sparse_attn_func_auto to reflect correct func…
LoserCheems Nov 9, 2025
e02668c
Adds Flex flash-sparse attention hook
LoserCheems Nov 9, 2025
508d2d1
Remove unused files related to Flash Dynamic Mask Attention integration
LoserCheems Nov 9, 2025
ea4350a
Implement feature X to enhance user experience and optimize performance
LoserCheems Nov 9, 2025
6bf01c4
Introduces Triton sparse attention kernels
LoserCheems Nov 9, 2025
152c73a
Adds flash sparse attention interface
LoserCheems Nov 9, 2025
77e4e61
Clarifies install docs and performance layout
LoserCheems Nov 9, 2025
6a29931
Renames project to flash-sparse-attn
LoserCheems Nov 9, 2025
612b85c
Aligns security docs with FSA naming
LoserCheems Nov 9, 2025
9f5d48d
Renames package to flash_sparse_attn
LoserCheems Nov 9, 2025
13a0db0
Aligns repo links with new name
LoserCheems Nov 9, 2025
307a50e
Aligns citation with repo rename
LoserCheems Nov 9, 2025
186c725
Adds import helpers for sparse attention
LoserCheems Nov 9, 2025
0402b39
Adds flash sparse attention wrapper
LoserCheems Nov 9, 2025
6bb896f
Adds flash sparse attention utils
LoserCheems Nov 9, 2025
11a0862
Adds dynamic mask helpers
LoserCheems Nov 9, 2025
3dd3392
Adds shared unpadding utilities
LoserCheems Nov 9, 2025
df69839
Updates flash attention integration
LoserCheems Nov 9, 2025
7e3faab
Remove outdated documentation files for Flash Dynamic Mask Attention …
LoserCheems Nov 9, 2025
554e7e0
Align docs with sparse attention rename
LoserCheems Nov 9, 2025
ac95f25
Aligns Chinese doc with sparse attention
LoserCheems Nov 9, 2025
a0ed87d
Aligns benchmarks with sparse attn imports
LoserCheems Nov 9, 2025
b3ac56f
Renames flash attention variant
LoserCheems Nov 9, 2025
92c0fad
Aligns issue templates with FSA
LoserCheems Nov 9, 2025
4aa0153
Renames environment variables for sparse attention build configuration
LoserCheems Nov 9, 2025
fc64149
Renames the kernel generation script description to reflect sparse at…
LoserCheems Nov 9, 2025
1211c5b
Renames environment variable for sparse attention build configuration
LoserCheems Nov 9, 2025
8695288
Update CONTRIBUTING.md
LoserCheems Nov 9, 2025
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -1,6 +1,6 @@
---
name: Bug report
about: Create a report to help us improve Flash-DMA
about: Create a report to help us improve FSA
title: '[BUG REPORT] '
labels: ["bug"]
assignees:
@@ -39,7 +39,7 @@ python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {
**Additional context**
- OS: [e.g. Ubuntu 20.04, Windows 10, macOS 12]
- Python version: [e.g. 3.9.7]
- Flash-DMA version: [e.g. 0.1.0]
- FSA version: [e.g. 0.1.0]
- CUDA Compute Capability: [e.g. 8.6]

**Error traceback**
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.yml
@@ -1,5 +1,5 @@
name: Bug report
description: Create a report to help us improve Flash-DMA
description: Create a report to help us improve FSA
title: "[BUG REPORT] "
labels:
- bug
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -1,6 +1,6 @@
---
name: Feature request
about: Suggest an idea for Flash-DMA
about: Suggest an idea for FSA
title: '[FEATURE REQUEST] '
labels: ["feature"]
assignees:
@@ -44,4 +44,4 @@ Add any other context or screenshots about the feature request here.
If this feature is inspired by a paper or existing implementation, please provide:
- Link to paper/implementation
- Brief explanation of the technique
- Why it would be valuable for Flash-DMA users
- Why it would be valuable for FSA users
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/feature_request.yml
@@ -1,5 +1,5 @@
name: Feature request
description: Suggest an idea for FDMA
description: Suggest an idea for FSA
title: "[FEATURE REQUEST] "
labels:
- feature
@@ -16,7 +16,7 @@ body:
- type: markdown
attributes:
value: |
Help us understand the feature you are proposing and why it matters for Flash-DMA workflows.
Help us understand the feature you are proposing and why it matters for FSA workflows.
- type: textarea
id: problem
attributes:
6 changes: 3 additions & 3 deletions .github/workflows/_build.yml
@@ -172,12 +172,12 @@ jobs:

export MAX_JOBS=$([ "$MATRIX_CUDA_VERSION" == "129" ] && echo 1 || echo 2)
export NVCC_THREADS=2
export FLASH_DMATTN_FORCE_BUILD="TRUE"
export FLASH_DMATTN_FORCE_CXX11_ABI=${{ inputs.cxx11_abi }}
export FLASH_SPARSE_ATTENTION_FORCE_BUILD="TRUE"
export FLASH_SPARSE_ATTENTION_FORCE_CXX11_ABI=${{ inputs.cxx11_abi }}

# If specified, limit to a single compute capability to speed up build
if [ -n "${MATRIX_ARCH}" ]; then
export FLASH_DMATTN_CUDA_ARCHS="${MATRIX_ARCH}"
export FLASH_SPARSE_ATTENTION_CUDA_ARCHS="${MATRIX_ARCH}"
fi

# GH allows max 6h
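The renamed build variables can be exercised the same way in a local source build. A minimal sketch under stated assumptions: the value given to `FLASH_SPARSE_ATTENTION_CUDA_ARCHS` is illustrative (a single sm_80 / A100 target), and the exact accepted format should be checked against the project's `setup.py`.

```bash
# Minimal local source build using the renamed variables (values are illustrative).
export FLASH_SPARSE_ATTENTION_FORCE_BUILD="TRUE"   # always compile, never fetch a prebuilt wheel
export FLASH_SPARSE_ATTENTION_CUDA_ARCHS="80"      # assumed value format; limit to one compute capability
export MAX_JOBS=2                                  # cap parallel compile jobs, mirroring the CI step above
export NVCC_THREADS=2

pip install . --no-build-isolation
```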
2 changes: 1 addition & 1 deletion .github/workflows/manual_publish.yml
@@ -38,7 +38,7 @@ jobs:

- name: Build core package
env:
FLASH_DMATTN_SKIP_CUDA_BUILD: "TRUE"
FLASH_SPARSE_ATTENTION_SKIP_CUDA_BUILD: "TRUE"
run: |
python setup.py sdist --dist-dir=dist
ls -l dist
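Outside CI, the same renamed flag can be used to package the Python sources without compiling the CUDA extension; a minimal local sketch that mirrors the workflow step above:

```bash
# Build a source distribution only; the CUDA extension build is skipped.
export FLASH_SPARSE_ATTENTION_SKIP_CUDA_BUILD="TRUE"
python setup.py sdist --dist-dir=dist
ls -l dist
```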
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -1,8 +1,8 @@
cff-version: "1.2.0"
date-released: 2025-06
message: "If you use this software, please cite it using these metadata."
title: "Flash Dynamic Mask Attention: Trainable Dynamic Mask Sparse Attention"
url: "https://github.com/SmallDoges/flash-dmattn"
title: "Flash Sparse Attention: Trainable Dynamic Mask Sparse Attention"
url: "https://github.com/SmallDoges/flash-sparse-attention"
authors:
- family-names: Shi
given-names: Jingze
18 changes: 9 additions & 9 deletions CONTRIBUTING.md
@@ -4,7 +4,7 @@ Everyone is welcome to contribute, and we value everybody's contribution. Code c

It also helps us if you spread the word! Reference the library in blog posts about the awesome projects it made possible, shout out on Twitter every time it has helped you, or simply ⭐️ the repository to say thank you.

However you choose to contribute, please be mindful and respect our [code of conduct](https://github.com/SmallDoges/flash-dmattn/blob/main/CODE_OF_CONDUCT.md).
However you choose to contribute, please be mindful and respect our [code of conduct](https://github.com/SmallDoges/flash-sparse-attention/blob/main/CODE_OF_CONDUCT.md).

## Ways to contribute

@@ -16,7 +16,7 @@ There are several ways you can contribute to Flash-DMA:
* Contribute to the examples, benchmarks, or documentation.
* Improve CUDA kernel performance.

If you don't know where to start, there is a special [Good First Issue](https://github.com/SmallDoges/flash-dmattn/contribute) listing. It will give you a list of open issues that are beginner-friendly and help you start contributing to open-source.
If you don't know where to start, there is a special [Good First Issue](https://github.com/SmallDoges/flash-sparse-attention/contribute) listing. It will give you a list of open issues that are beginner-friendly and help you start contributing to open-source.

> All contributions are equally valuable to the community. 🥰

@@ -81,14 +81,14 @@ You will need basic `git` proficiency to contribute to Flash-DMA. You'll need **

### Development Setup

1. Fork the [repository](https://github.com/SmallDoges/flash-dmattn) by clicking on the **Fork** button.
1. Fork the [repository](https://github.com/SmallDoges/flash-sparse-attention) by clicking on the **Fork** button.

2. Clone your fork to your local disk, and add the base repository as a remote:

```bash
git clone https://github.com/<your Github handle>/flash-dmattn.git
cd flash-dmattn
git remote add upstream https://github.com/SmallDoges/flash-dmattn.git
git clone https://github.com/<your Github handle>/flash-sparse-attention.git
cd flash-sparse-attention
git remote add upstream https://github.com/SmallDoges/flash-sparse-attention.git
```

3. Create a new branch to hold your development changes:
@@ -157,7 +157,7 @@

### Tests

An extensive test suite is included to test the library behavior and performance. Tests can be found in the [tests](https://github.com/SmallDoges/flash-dmattn/tree/main/tests) folder and benchmarks in the [benchmarks](https://github.com/SmallDoges/flash-dmattn/tree/main/benchmarks) folder.
An extensive test suite is included to test the library behavior and performance. Tests can be found in the [tests](https://github.com/SmallDoges/flash-sparse-attention/tree/main/tests) folder and benchmarks in the [benchmarks](https://github.com/SmallDoges/flash-sparse-attention/tree/main/benchmarks) folder.

We use `pytest` for testing. From the root of the repository, run:
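A minimal invocation might look like the following; the flags and path are illustrative, assuming the suite lives in the `tests` folder linked above.

```bash
# Run the test suite from the repository root; -q keeps the output compact.
pytest -q tests/
```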

@@ -200,6 +200,6 @@ If you discover a security vulnerability, please send an e-mail to the maintaine

## Questions?

If you have questions about contributing, feel free to ask in the [GitHub Discussions](https://github.com/SmallDoges/flash-dmattn/discussions) or open an issue.
If you have questions about contributing, feel free to ask in the [GitHub Discussions](https://github.com/SmallDoges/flash-sparse-attention/discussions) or open an issue.

Thank you for contributing to Flash Dynamic Mask Attention! 🚀
Thank you for contributing to Flash Sparse Attention! 🚀
184 changes: 92 additions & 92 deletions README.md
@@ -45,95 +45,6 @@ Thus, a more effective approach is sparse attention: interacting each query with
- Further performance improvements for skipping memory access and computation


## Performance

We present the expected speedup of FSA over standard PyTorch SDPA under mask and bias conditions.

![FSA Performance Overview](assets/performance_overview.png)

---

### Forward Pass Performance

The following table shows the forward pass performance comparison between FSA and standard PyTorch SDPA on an NVIDIA A100-SXM4-80GB. Results are averaged over 3 runs after 2 warmup runs.

| Mode | Q len | K len | Window W | SDPA (ms) | FSA (ms) | Speedup |
|--------|-------|--------|----------|-----------|-----------|---------|
| Train | 256 | 256 | 1024 | 0.29 | 0.19 | 1.58x |
| Train | 512 | 512 | 1024 | 0.35 | 0.19 | 1.86x |
| Train | 1024 | 1024 | 1024 | 0.51 | 0.18 | 2.81x |
| Train | 2048 | 2048 | 1024 | 1.04 | 0.18 | 5.68x |
| Train | 4096 | 4096 | 1024 | 2.53 | 0.24 | 10.41x |
| Train | 8192 | 8192 | 1024 | 9.38 | 0.36 | 25.93x |
| Train | 16384 | 16384 | 1024 | 28.39 | 0.81 | 35.25x |
| Train | 32768 | 32768 | 1024 | 111.87 | 2.25 | 49.78x |
| Train | 32768 | 32768 | 32 | 113.19 | 2.10 | 53.97x |
| Train | 32768 | 32768 | 64 | 113.17 | 2.12 | 53.32x |
| Train | 32768 | 32768 | 128 | 113.14 | 2.10 | 53.78x |
| Train | 32768 | 32768 | 256 | 113.18 | 2.13 | 53.18x |
| Train | 32768 | 32768 | 512 | 113.19 | 2.17 | 52.17x |
| Train | 32768 | 32768 | 1024 | 113.19 | 2.24 | 50.45x |
| Train | 32768 | 32768 | 2048 | 113.15 | 2.39 | 47.35x |
| Train | 32768 | 32768 | 4096 | 113.16 | 2.67 | 42.39x |
| Train | 32768 | 32768 | 8192 | 113.11 | 3.20 | 35.29x |
| Train | 32768 | 32768 | 16384 | 113.15 | 3.97 | 28.51x |
| Train | 32768 | 32768 | 32768 | 113.11 | 4.90 | 23.10x |
| Infer | 1 | 256 | 1024 | 0.25 | 0.19 | 1.28x |
| Infer | 1 | 512 | 1024 | 0.25 | 0.19 | 1.27x |
| Infer | 1 | 1024 | 1024 | 0.25 | 0.20 | 1.28x |
| Infer | 1 | 2048 | 1024 | 0.25 | 0.20 | 1.24x |
| Infer | 1 | 4096 | 1024 | 0.25 | 0.19 | 1.29x |
| Infer | 1 | 8192 | 1024 | 0.25 | 0.20 | 1.25x |
| Infer | 1 | 16384 | 1024 | 0.25 | 0.19 | 1.29x |
| Infer | 1 | 32768 | 1024 | 0.27 | 0.20 | 1.33x |
| Infer | 1 | 65536 | 1024 | 0.42 | 0.20 | 2.10x |
| Infer | 1 | 131072 | 1024 | 0.72 | 0.20 | 3.65x |
| Infer | 1 | 262144 | 1024 | 1.31 | 0.22 | 6.06x |
| Infer | 1 | 524288 | 1024 | 2.49 | 0.24 | 10.45x |
| Infer | 1 | 524288 | 32 | 2.48 | 0.21 | 11.60x |
| Infer | 1 | 524288 | 64 | 2.44 | 0.21 | 11.66x |
| Infer | 1 | 524288 | 128 | 2.45 | 0.21 | 11.47x |
| Infer | 1 | 524288 | 256 | 2.43 | 0.21 | 11.47x |
| Infer | 1 | 524288 | 512 | 2.44 | 0.22 | 10.89x |
| Infer | 1 | 524288 | 1024 | 2.44 | 0.24 | 10.31x |
| Infer | 1 | 524288 | 2048 | 2.44 | 0.27 | 9.07x |
| Infer | 1 | 524288 | 4096 | 2.45 | 0.33 | 7.41x |
| Infer | 1 | 524288 | 8192 | 2.44 | 0.35 | 6.93x |
| Infer | 1 | 524288 | 16384 | 2.44 | 0.35 | 6.93x |
| Infer | 1 | 524288 | 32768 | 2.45 | 0.35 | 6.96x |
| Infer | 1 | 524288 | 65536 | 2.44 | 0.35 | 6.88x |

---

### Backward Pass Performance

The following table shows the backward pass performance comparison between FSA and standard PyTorch SDPA on an NVIDIA A100-SXM4-80GB. Results are averaged over 3 runs after 2 warmup runs.

| Mode | Q len | K len | Window W | SDPA-BWD (ms) | FSA-BWD (ms) | Speedup |
|-------|-------|--------|----------|---------------|---------------|---------|
| Train | 256 | 256 | 1024 | 0.42 | 0.62 | 0.7x |
| Train | 512 | 512 | 1024 | 0.56 | 0.60 | 0.9x |
| Train | 1024 | 1024 | 1024 | 0.94 | 0.61 | 1.5x |
| Train | 2048 | 2048 | 1024 | 1.79 | 0.69 | 2.6x |
| Train | 4096 | 4096 | 1024 | 3.76 | 1.08 | 3.5x |
| Train | 8192 | 8192 | 1024 | 14.39 | 2.06 | 7.0x |
| Train | 16384 | 16384 | 1024 | 39.56 | 4.97 | 8.0x |
| Train | 32768 | 32768 | 1024 | 142.07 | 25.63 | 5.5x |
| Train | 32768 | 32768 | 32 | 142.70 | 21.91 | 6.5x |
| Train | 32768 | 32768 | 64 | 142.65 | 22.29 | 6.4x |
| Train | 32768 | 32768 | 128 | 142.69 | 23.04 | 6.2x |
| Train | 32768 | 32768 | 256 | 142.69 | 24.27 | 5.9x |
| Train | 32768 | 32768 | 512 | 142.67 | 25.12 | 5.7x |
| Train | 32768 | 32768 | 1024 | 142.55 | 25.58 | 5.6x |
| Train | 32768 | 32768 | 2048 | 142.75 | 25.64 | 5.6x |
| Train | 32768 | 32768 | 4096 | 142.61 | 24.84 | 5.7x |
| Train | 32768 | 32768 | 8192 | 142.33 | 25.63 | 5.6x |
| Train | 32768 | 32768 | 16384 | 142.40 | 25.62 | 5.6x |
| Train | 32768 | 32768 | 32768 | 142.43 | 25.63 | 5.6x |

---


## Installation

### Requirements
@@ -150,14 +61,14 @@
You can install FSA via pre-compiled wheels:

```bash
pip install flash_sparse_attn --no-build-isolation
pip install flash-sparse-attn --no-build-isolation
```

Alternatively, you can compile and install from source:

```bash
git clone https://github.com/SmallDoges/flash_sparse_attn.git
cd flash_sparse_attn
git clone https://github.com/SmallDoges/flash-sparse-attn.git
cd flash-sparse-attn
pip install . --no-build-isolation
```
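Either install path can be sanity-checked by importing the renamed package; a minimal sketch, assuming only that the module name follows the `flash_sparse_attn` package rename in this PR:

```bash
# Confirm the package resolves under its new import name.
python -c "import flash_sparse_attn; print('flash_sparse_attn imported OK')"
```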

@@ -245,6 +156,95 @@ print(f"Bias gradient shape: {attn_bias.grad.shape}")
```


## Performance

We present the expected speedup of FSA over standard PyTorch SDPA under mask and bias conditions.

![FSA Performance Overview](assets/performance_overview.png)

---

### Forward Pass Performance

The following table shows the forward pass performance comparison between FSA and standard PyTorch SDPA on an NVIDIA A100-SXM4-80GB. Results are averaged over 3 runs after 2 warmup runs.

| Mode | Q len | K len | Window W | SDPA (ms) | FSA (ms) | Speedup |
|--------|-------|--------|----------|-----------|-----------|---------|
| Train | 256 | 256 | 1024 | 0.29 | 0.19 | 1.58x |
| Train | 512 | 512 | 1024 | 0.35 | 0.19 | 1.86x |
| Train | 1024 | 1024 | 1024 | 0.51 | 0.18 | 2.81x |
| Train | 2048 | 2048 | 1024 | 1.04 | 0.18 | 5.68x |
| Train | 4096 | 4096 | 1024 | 2.53 | 0.24 | 10.41x |
| Train | 8192 | 8192 | 1024 | 9.38 | 0.36 | 25.93x |
| Train | 16384 | 16384 | 1024 | 28.39 | 0.81 | 35.25x |
| Train | 32768 | 32768 | 1024 | 111.87 | 2.25 | 49.78x |
| Train | 32768 | 32768 | 32 | 113.19 | 2.10 | 53.97x |
| Train | 32768 | 32768 | 64 | 113.17 | 2.12 | 53.32x |
| Train | 32768 | 32768 | 128 | 113.14 | 2.10 | 53.78x |
| Train | 32768 | 32768 | 256 | 113.18 | 2.13 | 53.18x |
| Train | 32768 | 32768 | 512 | 113.19 | 2.17 | 52.17x |
| Train | 32768 | 32768 | 1024 | 113.19 | 2.24 | 50.45x |
| Train | 32768 | 32768 | 2048 | 113.15 | 2.39 | 47.35x |
| Train | 32768 | 32768 | 4096 | 113.16 | 2.67 | 42.39x |
| Train | 32768 | 32768 | 8192 | 113.11 | 3.20 | 35.29x |
| Train | 32768 | 32768 | 16384 | 113.15 | 3.97 | 28.51x |
| Train | 32768 | 32768 | 32768 | 113.11 | 4.90 | 23.10x |
| Infer | 1 | 256 | 1024 | 0.25 | 0.19 | 1.28x |
| Infer | 1 | 512 | 1024 | 0.25 | 0.19 | 1.27x |
| Infer | 1 | 1024 | 1024 | 0.25 | 0.20 | 1.28x |
| Infer | 1 | 2048 | 1024 | 0.25 | 0.20 | 1.24x |
| Infer | 1 | 4096 | 1024 | 0.25 | 0.19 | 1.29x |
| Infer | 1 | 8192 | 1024 | 0.25 | 0.20 | 1.25x |
| Infer | 1 | 16384 | 1024 | 0.25 | 0.19 | 1.29x |
| Infer | 1 | 32768 | 1024 | 0.27 | 0.20 | 1.33x |
| Infer | 1 | 65536 | 1024 | 0.42 | 0.20 | 2.10x |
| Infer | 1 | 131072 | 1024 | 0.72 | 0.20 | 3.65x |
| Infer | 1 | 262144 | 1024 | 1.31 | 0.22 | 6.06x |
| Infer | 1 | 524288 | 1024 | 2.49 | 0.24 | 10.45x |
| Infer | 1 | 524288 | 32 | 2.48 | 0.21 | 11.60x |
| Infer | 1 | 524288 | 64 | 2.44 | 0.21 | 11.66x |
| Infer | 1 | 524288 | 128 | 2.45 | 0.21 | 11.47x |
| Infer | 1 | 524288 | 256 | 2.43 | 0.21 | 11.47x |
| Infer | 1 | 524288 | 512 | 2.44 | 0.22 | 10.89x |
| Infer | 1 | 524288 | 1024 | 2.44 | 0.24 | 10.31x |
| Infer | 1 | 524288 | 2048 | 2.44 | 0.27 | 9.07x |
| Infer | 1 | 524288 | 4096 | 2.45 | 0.33 | 7.41x |
| Infer | 1 | 524288 | 8192 | 2.44 | 0.35 | 6.93x |
| Infer | 1 | 524288 | 16384 | 2.44 | 0.35 | 6.93x |
| Infer | 1 | 524288 | 32768 | 2.45 | 0.35 | 6.96x |
| Infer | 1 | 524288 | 65536 | 2.44 | 0.35 | 6.88x |

---

### Backward Pass Performance

The following table shows the backward pass performance comparison between FSA and standard PyTorch SDPA on an NVIDIA A100-SXM4-80GB. Results are averaged over 3 runs after 2 warmup runs.

| Mode | Q len | K len | Window W | SDPA-BWD (ms) | FSA-BWD (ms) | Speedup |
|-------|-------|--------|----------|---------------|---------------|---------|
| Train | 256 | 256 | 1024 | 0.42 | 0.62 | 0.7x |
| Train | 512 | 512 | 1024 | 0.56 | 0.60 | 0.9x |
| Train | 1024 | 1024 | 1024 | 0.94 | 0.61 | 1.5x |
| Train | 2048 | 2048 | 1024 | 1.79 | 0.69 | 2.6x |
| Train | 4096 | 4096 | 1024 | 3.76 | 1.08 | 3.5x |
| Train | 8192 | 8192 | 1024 | 14.39 | 2.06 | 7.0x |
| Train | 16384 | 16384 | 1024 | 39.56 | 4.97 | 8.0x |
| Train | 32768 | 32768 | 1024 | 142.07 | 25.63 | 5.5x |
| Train | 32768 | 32768 | 32 | 142.70 | 21.91 | 6.5x |
| Train | 32768 | 32768 | 64 | 142.65 | 22.29 | 6.4x |
| Train | 32768 | 32768 | 128 | 142.69 | 23.04 | 6.2x |
| Train | 32768 | 32768 | 256 | 142.69 | 24.27 | 5.9x |
| Train | 32768 | 32768 | 512 | 142.67 | 25.12 | 5.7x |
| Train | 32768 | 32768 | 1024 | 142.55 | 25.58 | 5.6x |
| Train | 32768 | 32768 | 2048 | 142.75 | 25.64 | 5.6x |
| Train | 32768 | 32768 | 4096 | 142.61 | 24.84 | 5.7x |
| Train | 32768 | 32768 | 8192 | 142.33 | 25.63 | 5.6x |
| Train | 32768 | 32768 | 16384 | 142.40 | 25.62 | 5.6x |
| Train | 32768 | 32768 | 32768 | 142.43 | 25.63 | 5.6x |

---


## Benchmarking

FSA provides comprehensive benchmarking tools to evaluate performance across different configurations: