Merged
29 commits
6900b29
Adds backend auto-selection API
LoserCheems Nov 9, 2025
71add00
Fix docstring for flash_sparse_attn_func_auto to reflect correct func…
LoserCheems Nov 9, 2025
e02668c
Adds Flex flash-sparse attention hook
LoserCheems Nov 9, 2025
508d2d1
Remove unused files related to Flash Dynamic Mask Attention integration
LoserCheems Nov 9, 2025
ea4350a
Implement feature X to enhance user experience and optimize performance
LoserCheems Nov 9, 2025
6bf01c4
Introduces Triton sparse attention kernels
LoserCheems Nov 9, 2025
152c73a
Adds flash sparse attention interface
LoserCheems Nov 9, 2025
77e4e61
Clarifies install docs and performance layout
LoserCheems Nov 9, 2025
6a29931
Renames project to flash-sparse-attn
LoserCheems Nov 9, 2025
612b85c
Aligns security docs with FSA naming
LoserCheems Nov 9, 2025
9f5d48d
Renames package to flash_sparse_attn
LoserCheems Nov 9, 2025
13a0db0
Aligns repo links with new name
LoserCheems Nov 9, 2025
307a50e
Aligns citation with repo rename
LoserCheems Nov 9, 2025
186c725
Adds import helpers for sparse attention
LoserCheems Nov 9, 2025
0402b39
Adds flash sparse attention wrapper
LoserCheems Nov 9, 2025
6bb896f
Adds flash sparse attention utils
LoserCheems Nov 9, 2025
11a0862
Adds dynamic mask helpers
LoserCheems Nov 9, 2025
3dd3392
Adds shared unpadding utilities
LoserCheems Nov 9, 2025
df69839
Updates flash attention integration
LoserCheems Nov 9, 2025
7e3faab
Remove outdated documentation files for Flash Dynamic Mask Attention …
LoserCheems Nov 9, 2025
554e7e0
Align docs with sparse attention rename
LoserCheems Nov 9, 2025
ac95f25
Aligns Chinese doc with sparse attention
LoserCheems Nov 9, 2025
a0ed87d
Aligns benchmarks with sparse attn imports
LoserCheems Nov 9, 2025
b3ac56f
Renames flash attention variant
LoserCheems Nov 9, 2025
92c0fad
Aligns issue templates with FSA
LoserCheems Nov 9, 2025
4aa0153
Renames environment variables for sparse attention build configuration
LoserCheems Nov 9, 2025
fc64149
Renames the kernel generation script description to reflect sparse at…
LoserCheems Nov 9, 2025
1211c5b
Renames environment variable for sparse attention build configuration
LoserCheems Nov 9, 2025
8695288
Update CONTRIBUTING.md
LoserCheems Nov 9, 2025
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -1,6 +1,6 @@
---
name: Bug report
about: Create a report to help us improve Flash-DMA
about: Create a report to help us improve FSA
title: '[BUG REPORT] '
labels: ["bug"]
assignees:
@@ -39,7 +39,7 @@ python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {
**Additional context**
- OS: [e.g. Ubuntu 20.04, Windows 10, macOS 12]
- Python version: [e.g. 3.9.7]
- Flash-DMA version: [e.g. 0.1.0]
- FSA version: [e.g. 0.1.0]
- CUDA Compute Capability: [e.g. 8.6]

**Error traceback**
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.yml
@@ -1,5 +1,5 @@
name: Bug report
description: Create a report to help us improve Flash-DMA
description: Create a report to help us improve FSA
title: "[BUG REPORT] "
labels:
- bug
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -1,6 +1,6 @@
---
name: Feature request
about: Suggest an idea for Flash-DMA
about: Suggest an idea for FSA
title: '[FEATURE REQUEST] '
labels: ["feature"]
assignees:
@@ -44,4 +44,4 @@ Add any other context or screenshots about the feature request here.
If this feature is inspired by a paper or existing implementation, please provide:
- Link to paper/implementation
- Brief explanation of the technique
- Why it would be valuable for Flash-DMA users
- Why it would be valuable for FSA users
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/feature_request.yml
@@ -1,5 +1,5 @@
name: Feature request
description: Suggest an idea for FDMA
description: Suggest an idea for FSA
title: "[FEATURE REQUEST] "
labels:
- feature
@@ -16,7 +16,7 @@ body:
- type: markdown
attributes:
value: |
Help us understand the feature you are proposing and why it matters for Flash-DMA workflows.
Help us understand the feature you are proposing and why it matters for FSA workflows.
- type: textarea
id: problem
attributes:
6 changes: 3 additions & 3 deletions .github/workflows/_build.yml
@@ -172,12 +172,12 @@ jobs:

export MAX_JOBS=$([ "$MATRIX_CUDA_VERSION" == "129" ] && echo 1 || echo 2)
export NVCC_THREADS=2
export FLASH_DMATTN_FORCE_BUILD="TRUE"
export FLASH_DMATTN_FORCE_CXX11_ABI=${{ inputs.cxx11_abi }}
export FLASH_SPARSE_ATTENTION_FORCE_BUILD="TRUE"
export FLASH_SPARSE_ATTENTION_FORCE_CXX11_ABI=${{ inputs.cxx11_abi }}

# If specified, limit to a single compute capability to speed up build
if [ -n "${MATRIX_ARCH}" ]; then
export FLASH_DMATTN_CUDA_ARCHS="${MATRIX_ARCH}"
export FLASH_SPARSE_ATTENTION_CUDA_ARCHS="${MATRIX_ARCH}"
fi

# GH allows max 6h
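The renamed build variables can be exercised the same way in a local source build. A minimal sketch under stated assumptions: the value given to `FLASH_SPARSE_ATTENTION_CUDA_ARCHS` is illustrative (a single sm_80 / A100 target), and the exact accepted format should be checked against the project's `setup.py`.

```bash
# Minimal local source build using the renamed variables (values are illustrative).
export FLASH_SPARSE_ATTENTION_FORCE_BUILD="TRUE"   # always compile, never fetch a prebuilt wheel
export FLASH_SPARSE_ATTENTION_CUDA_ARCHS="80"      # assumed value format; limit to one compute capability
export MAX_JOBS=2                                  # cap parallel compile jobs, mirroring the CI step above
export NVCC_THREADS=2

pip install . --no-build-isolation
```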
2 changes: 1 addition & 1 deletion .github/workflows/manual_publish.yml
@@ -38,7 +38,7 @@ jobs:

- name: Build core package
env:
FLASH_DMATTN_SKIP_CUDA_BUILD: "TRUE"
FLASH_SPARSE_ATTENTION_SKIP_CUDA_BUILD: "TRUE"
run: |
python setup.py sdist --dist-dir=dist
ls -l dist
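Outside CI, the same renamed flag can be used to package the Python sources without compiling the CUDA extension; a minimal local sketch that mirrors the workflow step above:

```bash
# Build a source distribution only; the CUDA extension build is skipped.
export FLASH_SPARSE_ATTENTION_SKIP_CUDA_BUILD="TRUE"
python setup.py sdist --dist-dir=dist
ls -l dist
```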
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -1,8 +1,8 @@
cff-version: "1.2.0"
date-released: 2025-06
message: "If you use this software, please cite it using these metadata."
title: "Flash Dynamic Mask Attention: Trainable Dynamic Mask Sparse Attention"
url: "https://github.com/SmallDoges/flash-dmattn"
title: "Flash Sparse Attention: Trainable Dynamic Mask Sparse Attention"
url: "https://github.com/SmallDoges/flash-sparse-attention"
authors:
- family-names: Shi
given-names: Jingze
18 changes: 9 additions & 9 deletions CONTRIBUTING.md
@@ -4,7 +4,7 @@ Everyone is welcome to contribute, and we value everybody's contribution. Code c

It also helps us if you spread the word! Reference the library in blog posts about the awesome projects it made possible, shout out on Twitter every time it has helped you, or simply ⭐️ the repository to say thank you.

However you choose to contribute, please be mindful and respect our [code of conduct](https://github.com/SmallDoges/flash-dmattn/blob/main/CODE_OF_CONDUCT.md).
However you choose to contribute, please be mindful and respect our [code of conduct](https://github.com/SmallDoges/flash-sparse-attention/blob/main/CODE_OF_CONDUCT.md).

## Ways to contribute

@@ -16,7 +16,7 @@ There are several ways you can contribute to Flash-DMA:
* Contribute to the examples, benchmarks, or documentation.
* Improve CUDA kernel performance.

If you don't know where to start, there is a special [Good First Issue](https://github.com/SmallDoges/flash-dmattn/contribute) listing. It will give you a list of open issues that are beginner-friendly and help you start contributing to open-source.
If you don't know where to start, there is a special [Good First Issue](https://github.com/SmallDoges/flash-sparse-attention/contribute) listing. It will give you a list of open issues that are beginner-friendly and help you start contributing to open-source.

> All contributions are equally valuable to the community. 🥰

@@ -81,14 +81,14 @@ You will need basic `git` proficiency to contribute to Flash-DMA. You'll need **

### Development Setup

1. Fork the [repository](https://github.com/SmallDoges/flash-dmattn) by clicking on the **Fork** button.
1. Fork the [repository](https://github.com/SmallDoges/flash-sparse-attention) by clicking on the **Fork** button.

2. Clone your fork to your local disk, and add the base repository as a remote:

```bash
git clone https://github.com/<your Github handle>/flash-dmattn.git
cd flash-dmattn
git remote add upstream https://github.com/SmallDoges/flash-dmattn.git
git clone https://github.com/<your Github handle>/flash-sparse-attention.git
cd flash-sparse-attention
git remote add upstream https://github.com/SmallDoges/flash-sparse-attention.git
```

3. Create a new branch to hold your development changes:
@@ -157,7 +157,7 @@

### Tests

An extensive test suite is included to test the library behavior and performance. Tests can be found in the [tests](https://github.com/SmallDoges/flash-dmattn/tree/main/tests) folder and benchmarks in the [benchmarks](https://github.com/SmallDoges/flash-dmattn/tree/main/benchmarks) folder.
An extensive test suite is included to test the library behavior and performance. Tests can be found in the [tests](https://github.com/SmallDoges/flash-sparse-attention/tree/main/tests) folder and benchmarks in the [benchmarks](https://github.com/SmallDoges/flash-sparse-attention/tree/main/benchmarks) folder.

We use `pytest` for testing. From the root of the repository, run:
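A minimal invocation might look like the following; the flags and path are illustrative, assuming the suite lives in the `tests` folder linked above.

```bash
# Run the test suite from the repository root; -q keeps the output compact.
pytest -q tests/
```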

@@ -200,6 +200,6 @@ If you discover a security vulnerability, please send an e-mail to the maintaine

## Questions?

If you have questions about contributing, feel free to ask in the [GitHub Discussions](https://github.com/SmallDoges/flash-dmattn/discussions) or open an issue.
If you have questions about contributing, feel free to ask in the [GitHub Discussions](https://github.com/SmallDoges/flash-sparse-attention/discussions) or open an issue.

Thank you for contributing to Flash Dynamic Mask Attention! 🚀
Thank you for contributing to Flash Sparse Attention! 🚀
184 changes: 92 additions & 92 deletions README.md
@@ -45,95 +45,6 @@ Thus, a more effective approach is sparse attention: interacting each query with
- Further performance improvements for skipping memory access and computation


## Performance

We present the expected speedup of FSA over standard PyTorch SDPA under mask and bias conditions.

![FSA Performance Overview](assets/performance_overview.png)

---

### Forward Pass Performance

The following table shows the forward pass performance comparison between FSA and standard PyTorch SDPA on an NVIDIA A100-SXM4-80GB. Results are averaged over 3 runs after 2 warmup runs.

| Mode | Q len | K len | Window W | SDPA (ms) | FSA (ms) | Speedup |
|--------|-------|--------|----------|-----------|-----------|---------|
| Train | 256 | 256 | 1024 | 0.29 | 0.19 | 1.58x |
| Train | 512 | 512 | 1024 | 0.35 | 0.19 | 1.86x |
| Train | 1024 | 1024 | 1024 | 0.51 | 0.18 | 2.81x |
| Train | 2048 | 2048 | 1024 | 1.04 | 0.18 | 5.68x |
| Train | 4096 | 4096 | 1024 | 2.53 | 0.24 | 10.41x |
| Train | 8192 | 8192 | 1024 | 9.38 | 0.36 | 25.93x |
| Train | 16384 | 16384 | 1024 | 28.39 | 0.81 | 35.25x |
| Train | 32768 | 32768 | 1024 | 111.87 | 2.25 | 49.78x |
| Train | 32768 | 32768 | 32 | 113.19 | 2.10 | 53.97x |
| Train | 32768 | 32768 | 64 | 113.17 | 2.12 | 53.32x |
| Train | 32768 | 32768 | 128 | 113.14 | 2.10 | 53.78x |
| Train | 32768 | 32768 | 256 | 113.18 | 2.13 | 53.18x |
| Train | 32768 | 32768 | 512 | 113.19 | 2.17 | 52.17x |
| Train | 32768 | 32768 | 1024 | 113.19 | 2.24 | 50.45x |
| Train | 32768 | 32768 | 2048 | 113.15 | 2.39 | 47.35x |
| Train | 32768 | 32768 | 4096 | 113.16 | 2.67 | 42.39x |
| Train | 32768 | 32768 | 8192 | 113.11 | 3.20 | 35.29x |
| Train | 32768 | 32768 | 16384 | 113.15 | 3.97 | 28.51x |
| Train | 32768 | 32768 | 32768 | 113.11 | 4.90 | 23.10x |
| Infer | 1 | 256 | 1024 | 0.25 | 0.19 | 1.28x |
| Infer | 1 | 512 | 1024 | 0.25 | 0.19 | 1.27x |
| Infer | 1 | 1024 | 1024 | 0.25 | 0.20 | 1.28x |
| Infer | 1 | 2048 | 1024 | 0.25 | 0.20 | 1.24x |
| Infer | 1 | 4096 | 1024 | 0.25 | 0.19 | 1.29x |
| Infer | 1 | 8192 | 1024 | 0.25 | 0.20 | 1.25x |
| Infer | 1 | 16384 | 1024 | 0.25 | 0.19 | 1.29x |
| Infer | 1 | 32768 | 1024 | 0.27 | 0.20 | 1.33x |
| Infer | 1 | 65536 | 1024 | 0.42 | 0.20 | 2.10x |
| Infer | 1 | 131072 | 1024 | 0.72 | 0.20 | 3.65x |
| Infer | 1 | 262144 | 1024 | 1.31 | 0.22 | 6.06x |
| Infer | 1 | 524288 | 1024 | 2.49 | 0.24 | 10.45x |
| Infer | 1 | 524288 | 32 | 2.48 | 0.21 | 11.60x |
| Infer | 1 | 524288 | 64 | 2.44 | 0.21 | 11.66x |
| Infer | 1 | 524288 | 128 | 2.45 | 0.21 | 11.47x |
| Infer | 1 | 524288 | 256 | 2.43 | 0.21 | 11.47x |
| Infer | 1 | 524288 | 512 | 2.44 | 0.22 | 10.89x |
| Infer | 1 | 524288 | 1024 | 2.44 | 0.24 | 10.31x |
| Infer | 1 | 524288 | 2048 | 2.44 | 0.27 | 9.07x |
| Infer | 1 | 524288 | 4096 | 2.45 | 0.33 | 7.41x |
| Infer | 1 | 524288 | 8192 | 2.44 | 0.35 | 6.93x |
| Infer | 1 | 524288 | 16384 | 2.44 | 0.35 | 6.93x |
| Infer | 1 | 524288 | 32768 | 2.45 | 0.35 | 6.96x |
| Infer | 1 | 524288 | 65536 | 2.44 | 0.35 | 6.88x |

---

### Backward Pass Performance

The following table shows the backward pass performance comparison between FSA and standard PyTorch SDPA on an NVIDIA A100-SXM4-80GB. Results are averaged over 3 runs after 2 warmup runs.

| Mode | Q len | K len | Window W | SDPA-BWD (ms) | FSA-BWD (ms) | Speedup |
|-------|-------|--------|----------|---------------|---------------|---------|
| Train | 256 | 256 | 1024 | 0.42 | 0.62 | 0.7x |
| Train | 512 | 512 | 1024 | 0.56 | 0.60 | 0.9x |
| Train | 1024 | 1024 | 1024 | 0.94 | 0.61 | 1.5x |
| Train | 2048 | 2048 | 1024 | 1.79 | 0.69 | 2.6x |
| Train | 4096 | 4096 | 1024 | 3.76 | 1.08 | 3.5x |
| Train | 8192 | 8192 | 1024 | 14.39 | 2.06 | 7.0x |
| Train | 16384 | 16384 | 1024 | 39.56 | 4.97 | 8.0x |
| Train | 32768 | 32768 | 1024 | 142.07 | 25.63 | 5.5x |
| Train | 32768 | 32768 | 32 | 142.70 | 21.91 | 6.5x |
| Train | 32768 | 32768 | 64 | 142.65 | 22.29 | 6.4x |
| Train | 32768 | 32768 | 128 | 142.69 | 23.04 | 6.2x |
| Train | 32768 | 32768 | 256 | 142.69 | 24.27 | 5.9x |
| Train | 32768 | 32768 | 512 | 142.67 | 25.12 | 5.7x |
| Train | 32768 | 32768 | 1024 | 142.55 | 25.58 | 5.6x |
| Train | 32768 | 32768 | 2048 | 142.75 | 25.64 | 5.6x |
| Train | 32768 | 32768 | 4096 | 142.61 | 24.84 | 5.7x |
| Train | 32768 | 32768 | 8192 | 142.33 | 25.63 | 5.6x |
| Train | 32768 | 32768 | 16384 | 142.40 | 25.62 | 5.6x |
| Train | 32768 | 32768 | 32768 | 142.43 | 25.63 | 5.6x |

---


## Installation

### Requirements
@@ -150,14 +61,14 @@
You can install FSA via pre-compiled wheels:

```bash
pip install flash_sparse_attn --no-build-isolation
pip install flash-sparse-attn --no-build-isolation
```

Alternatively, you can compile and install from source:

```bash
git clone https://github.com/SmallDoges/flash_sparse_attn.git
cd flash_sparse_attn
git clone https://github.com/SmallDoges/flash-sparse-attn.git
cd flash-sparse-attn
pip install . --no-build-isolation
```
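Either install path can be sanity-checked by importing the renamed package; a minimal sketch, assuming only that the module name follows the `flash_sparse_attn` package rename in this PR:

```bash
# Confirm the package resolves under its new import name.
python -c "import flash_sparse_attn; print('flash_sparse_attn imported OK')"
```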

@@ -245,6 +156,95 @@ print(f"Bias gradient shape: {attn_bias.grad.shape}")
```


## Performance

We present the expected speedup of FSA over standard PyTorch SDPA under mask and bias conditions.

![FSA Performance Overview](assets/performance_overview.png)

---

### Forward Pass Performance

The following table shows the forward pass performance comparison between FSA and standard PyTorch SDPA on an NVIDIA A100-SXM4-80GB. Results are averaged over 3 runs after 2 warmup runs.

| Mode | Q len | K len | Window W | SDPA (ms) | FSA (ms) | Speedup |
|--------|-------|--------|----------|-----------|-----------|---------|
| Train | 256 | 256 | 1024 | 0.29 | 0.19 | 1.58x |
| Train | 512 | 512 | 1024 | 0.35 | 0.19 | 1.86x |
| Train | 1024 | 1024 | 1024 | 0.51 | 0.18 | 2.81x |
| Train | 2048 | 2048 | 1024 | 1.04 | 0.18 | 5.68x |
| Train | 4096 | 4096 | 1024 | 2.53 | 0.24 | 10.41x |
| Train | 8192 | 8192 | 1024 | 9.38 | 0.36 | 25.93x |
| Train | 16384 | 16384 | 1024 | 28.39 | 0.81 | 35.25x |
| Train | 32768 | 32768 | 1024 | 111.87 | 2.25 | 49.78x |
| Train | 32768 | 32768 | 32 | 113.19 | 2.10 | 53.97x |
| Train | 32768 | 32768 | 64 | 113.17 | 2.12 | 53.32x |
| Train | 32768 | 32768 | 128 | 113.14 | 2.10 | 53.78x |
| Train | 32768 | 32768 | 256 | 113.18 | 2.13 | 53.18x |
| Train | 32768 | 32768 | 512 | 113.19 | 2.17 | 52.17x |
| Train | 32768 | 32768 | 1024 | 113.19 | 2.24 | 50.45x |
| Train | 32768 | 32768 | 2048 | 113.15 | 2.39 | 47.35x |
| Train | 32768 | 32768 | 4096 | 113.16 | 2.67 | 42.39x |
| Train | 32768 | 32768 | 8192 | 113.11 | 3.20 | 35.29x |
| Train | 32768 | 32768 | 16384 | 113.15 | 3.97 | 28.51x |
| Train | 32768 | 32768 | 32768 | 113.11 | 4.90 | 23.10x |
| Infer | 1 | 256 | 1024 | 0.25 | 0.19 | 1.28x |
| Infer | 1 | 512 | 1024 | 0.25 | 0.19 | 1.27x |
| Infer | 1 | 1024 | 1024 | 0.25 | 0.20 | 1.28x |
| Infer | 1 | 2048 | 1024 | 0.25 | 0.20 | 1.24x |
| Infer | 1 | 4096 | 1024 | 0.25 | 0.19 | 1.29x |
| Infer | 1 | 8192 | 1024 | 0.25 | 0.20 | 1.25x |
| Infer | 1 | 16384 | 1024 | 0.25 | 0.19 | 1.29x |
| Infer | 1 | 32768 | 1024 | 0.27 | 0.20 | 1.33x |
| Infer | 1 | 65536 | 1024 | 0.42 | 0.20 | 2.10x |
| Infer | 1 | 131072 | 1024 | 0.72 | 0.20 | 3.65x |
| Infer | 1 | 262144 | 1024 | 1.31 | 0.22 | 6.06x |
| Infer | 1 | 524288 | 1024 | 2.49 | 0.24 | 10.45x |
| Infer | 1 | 524288 | 32 | 2.48 | 0.21 | 11.60x |
| Infer | 1 | 524288 | 64 | 2.44 | 0.21 | 11.66x |
| Infer | 1 | 524288 | 128 | 2.45 | 0.21 | 11.47x |
| Infer | 1 | 524288 | 256 | 2.43 | 0.21 | 11.47x |
| Infer | 1 | 524288 | 512 | 2.44 | 0.22 | 10.89x |
| Infer | 1 | 524288 | 1024 | 2.44 | 0.24 | 10.31x |
| Infer | 1 | 524288 | 2048 | 2.44 | 0.27 | 9.07x |
| Infer | 1 | 524288 | 4096 | 2.45 | 0.33 | 7.41x |
| Infer | 1 | 524288 | 8192 | 2.44 | 0.35 | 6.93x |
| Infer | 1 | 524288 | 16384 | 2.44 | 0.35 | 6.93x |
| Infer | 1 | 524288 | 32768 | 2.45 | 0.35 | 6.96x |
| Infer | 1 | 524288 | 65536 | 2.44 | 0.35 | 6.88x |

---

### Backward Pass Performance

The following table shows the backward pass performance comparison between FSA and standard PyTorch SDPA on an NVIDIA A100-SXM4-80GB. Results are averaged over 3 runs after 2 warmup runs.

| Mode | Q len | K len | Window W | SDPA-BWD (ms) | FSA-BWD (ms) | Speedup |
|-------|-------|--------|----------|---------------|---------------|---------|
| Train | 256 | 256 | 1024 | 0.42 | 0.62 | 0.7x |
| Train | 512 | 512 | 1024 | 0.56 | 0.60 | 0.9x |
| Train | 1024 | 1024 | 1024 | 0.94 | 0.61 | 1.5x |
| Train | 2048 | 2048 | 1024 | 1.79 | 0.69 | 2.6x |
| Train | 4096 | 4096 | 1024 | 3.76 | 1.08 | 3.5x |
| Train | 8192 | 8192 | 1024 | 14.39 | 2.06 | 7.0x |
| Train | 16384 | 16384 | 1024 | 39.56 | 4.97 | 8.0x |
| Train | 32768 | 32768 | 1024 | 142.07 | 25.63 | 5.5x |
| Train | 32768 | 32768 | 32 | 142.70 | 21.91 | 6.5x |
| Train | 32768 | 32768 | 64 | 142.65 | 22.29 | 6.4x |
| Train | 32768 | 32768 | 128 | 142.69 | 23.04 | 6.2x |
| Train | 32768 | 32768 | 256 | 142.69 | 24.27 | 5.9x |
| Train | 32768 | 32768 | 512 | 142.67 | 25.12 | 5.7x |
| Train | 32768 | 32768 | 1024 | 142.55 | 25.58 | 5.6x |
| Train | 32768 | 32768 | 2048 | 142.75 | 25.64 | 5.6x |
| Train | 32768 | 32768 | 4096 | 142.61 | 24.84 | 5.7x |
| Train | 32768 | 32768 | 8192 | 142.33 | 25.63 | 5.6x |
| Train | 32768 | 32768 | 16384 | 142.40 | 25.62 | 5.6x |
| Train | 32768 | 32768 | 32768 | 142.43 | 25.63 | 5.6x |

---


## Benchmarking

FSA provides comprehensive benchmarking tools to evaluate performance across different configurations: