diff --git a/AUTHORS b/AUTHORS index 5874411..a5e4936 100644 --- a/AUTHORS +++ b/AUTHORS @@ -2,4 +2,6 @@ Jingze Shi, losercheems@gmail.com Yifan Wu, ywu012@connect.hkust-gz.edu.cn Bingheng Wu, wubingheng52136@gmail.com Yiran Peng, amagipeng@gmail.com -Tri Dao, trid@cs.stanford.edu \ No newline at end of file +Liangdong Wang, wangliangdong@baai.ac.cn +Guang Li, liuguang@baai.ac.cn +Yuyu Luo, yuyuluo@hkust-gz.edu.cn diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 0000000..d8f3d0e --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,50 @@ +cff-version: "1.2.0" +date-released: 2025-06 +message: "If you use this software, please cite it using these metadata." +title: "Flash Dynamic Mask Attention: Trainable Dynamic Mask Sparse Attention" +url: "https://github.com/SmallDoges/flash-dmattn" +authors: + - family-names: Shi + given-names: Jingze + email: losercheems@gmail.com + - family-names: Wu + given-names: Yifan + email: ywu012@connect.hkust-gz.edu.cn + - family-names: Wu + given-names: Bingheng + email: wubingheng52136@gmail.com + - family-names: Peng + given-names: Yiran + email: amagipeng@gmail.com + - family-names: Wang + given-names: Liangdong + email: wangliangdong@baai.ac.cn + - family-names: Liu + given-names: Guang + email: liuguang@baai.ac.cn + - family-names: Luo + given-names: Yuyu + email: yuyuluo@hkust-gz.edu.cn +preferred-citation: + type: article + authors: + - family-names: Shi + given-names: Jingze + - family-names: Wu + given-names: Yifan + - family-names: Wu + given-names: Bingheng + - family-names: Peng + given-names: Yiran + - family-names: Wang + given-names: Liangdong + - family-names: Liu + given-names: Guang + - family-names: Luo + given-names: Yuyu + title: "Trainable Dynamic Mask Sparse Attention" + year: 2025 + url: "https://arxiv.org/abs/2508.02124" + doi: "10.48550/arXiv.2508.02124" + journal: "arXiv preprint" + volume: "arXiv:2508.02124" diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 0000000..b586c53 --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,132 @@ +# Contributor Covenant Code of Conduct + +## Our Pledge + +We as members, contributors, and leaders pledge to make participation in our +community a harassment-free experience for everyone, regardless of age, body +size, visible or invisible disability, ethnicity, sex characteristics, gender +identity and expression, level of experience, education, socio-economic status, +nationality, personal appearance, race, caste, color, religion, or sexual +identity and orientation. + +We pledge to act and interact in ways that contribute to an open, welcoming, +diverse, inclusive, and healthy community. 
+ +## Our Standards + +Examples of behavior that contributes to a positive environment for our +community include: + +* Demonstrating empathy and kindness toward other people +* Being respectful of differing opinions, viewpoints, and experiences +* Giving and gracefully accepting constructive feedback +* Accepting responsibility and apologizing to those affected by our mistakes, + and learning from the experience +* Focusing on what is best not just for us as individuals, but for the overall + community + +Examples of unacceptable behavior include: + +* The use of sexualized language or imagery, and sexual attention or advances of + any kind +* Trolling, insulting or derogatory comments, and personal or political attacks +* Public or private harassment +* Publishing others' private information, such as a physical or email address, + without their explicit permission +* Other conduct which could reasonably be considered inappropriate in a + professional setting + +## Enforcement Responsibilities + +Community leaders are responsible for clarifying and enforcing our standards of +acceptable behavior and will take appropriate and fair corrective action in +response to any behavior that they deem inappropriate, threatening, offensive, +or harmful. + +Community leaders have the right and responsibility to remove, edit, or reject +comments, commits, code, wiki edits, issues, and other contributions that are +not aligned to this Code of Conduct, and will communicate reasons for moderation +decisions when appropriate. + +## Scope + +This Code of Conduct applies within all community spaces, and also applies when +an individual is officially representing the community in public spaces. +Examples of representing our community include using an official e-mail address, +posting via an official social media account, or acting as an appointed +representative at an online or offline event. + +## Enforcement + +Instances of abusive, harassing, or otherwise unacceptable behavior may be +reported to the community leaders responsible for enforcement at +losercheems@gmail.com. +All complaints will be reviewed and investigated promptly and fairly. + +All community leaders are obligated to respect the privacy and security of the +reporter of any incident. + +## Enforcement Guidelines + +Community leaders will follow these Community Impact Guidelines in determining +the consequences for any action they deem in violation of this Code of Conduct: + +### 1. Correction + +**Community Impact**: Use of inappropriate language or other behavior deemed +unprofessional or unwelcome in the community. + +**Consequence**: A private, written warning from community leaders, providing +clarity around the nature of the violation and an explanation of why the +behavior was inappropriate. A public apology may be requested. + +### 2. Warning + +**Community Impact**: A violation through a single incident or series of +actions. + +**Consequence**: A warning with consequences for continued behavior. No +interaction with the people involved, including unsolicited interaction with +those enforcing the Code of Conduct, for a specified period of time. This +includes avoiding interactions in community spaces as well as external channels +like social media. Violating these terms may lead to a temporary or permanent +ban. + +### 3. Temporary Ban + +**Community Impact**: A serious violation of community standards, including +sustained inappropriate behavior. 
+ +**Consequence**: A temporary ban from any sort of interaction or public +communication with the community for a specified period of time. No public or +private interaction with the people involved, including unsolicited interaction +with those enforcing the Code of Conduct, is allowed during this period. +Violating these terms may lead to a permanent ban. + +### 4. Permanent Ban + +**Community Impact**: Demonstrating a pattern of violation of community +standards, including sustained inappropriate behavior, harassment of an +individual, or aggression toward or disparagement of classes of individuals. + +**Consequence**: A permanent ban from any sort of public interaction within the +community. + +## Attribution + +This Code of Conduct is adapted from the [Contributor Covenant][homepage], +version 2.1, available at +[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. + +Community Impact Guidelines were inspired by +[Mozilla's code of conduct enforcement ladder][Mozilla CoC]. + +For answers to common questions about this code of conduct, see the FAQ at +[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at +[https://www.contributor-covenant.org/translations][translations]. + +[homepage]: https://www.contributor-covenant.org +[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html +[Mozilla CoC]: https://github.com/mozilla/diversity +[FAQ]: https://www.contributor-covenant.org/faq +[translations]: https://www.contributor-covenant.org/translations diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..d0368d1 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,205 @@ +# Contributing to Flash Dynamic Mask Attention + +Everyone is welcome to contribute, and we value everybody's contribution. Code contributions are not the only way to help the community. Answering questions, helping others, and improving the documentation are also immensely valuable. + +It also helps us if you spread the word! Reference the library in blog posts about the awesome projects it made possible, shout out on Twitter every time it has helped you, or simply ⭐️ the repository to say thank you. + +However you choose to contribute, please be mindful and respect our [code of conduct](https://github.com/SmallDoges/flash-dmattn/blob/main/CODE_OF_CONDUCT.md). + +## Ways to contribute + +There are several ways you can contribute to Flash-DMA: + +* Fix outstanding issues with the existing code. +* Submit issues related to bugs or desired new features. +* Implement new attention mechanisms or optimizations. +* Contribute to the examples, benchmarks, or documentation. +* Improve CUDA kernel performance. + +If you don't know where to start, there is a special [Good First Issue](https://github.com/SmallDoges/flash-dmattn/contribute) listing. It will give you a list of open issues that are beginner-friendly and help you start contributing to open-source. + +> All contributions are equally valuable to the community. 🥰 + +## Fixing outstanding issues + +If you notice an issue with the existing code and have a fix in mind, feel free to [start contributing](#create-a-pull-request) and open a Pull Request! + +## Submitting a bug-related issue or feature request + +Do your best to follow these guidelines when submitting a bug-related issue or a feature request. It will make it easier for us to come back to you quickly and with good feedback. + +### Did you find a bug? 
+The Flash-DMA library is robust and reliable thanks to users who report the problems they encounter. + +Before you report an issue, we would really appreciate it if you could **make sure the bug was not already reported** (use the search bar on GitHub under Issues). Your issue should also be related to bugs in the library itself, and not your code. + +Once you've confirmed the bug hasn't already been reported, please include the following information in your issue so we can quickly resolve it: + +* Your **OS type and version** and **Python**, **PyTorch**, and **CUDA** versions. +* Your **GPU model** and **CUDA Compute Capability**. +* A short, self-contained code snippet that allows us to reproduce the bug in less than 30s. +* The *full* traceback if an exception is raised. +* Attach any other additional information, like screenshots, you think may help. + +To get the environment information automatically, run: + +```bash +python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name() if torch.cuda.is_available() else \"None\"}')" +``` + +### Do you want a new feature? + +If there is a new feature you'd like to see in Flash-DMA, please open an issue and describe: + +1. What is the *motivation* behind this feature? Is it related to performance optimization, memory efficiency, or new attention mechanisms? + +2. Describe your requested feature in as much detail as possible. The more you can tell us about it, the better we'll be able to help you. + +3. Provide a *code snippet* that demonstrates the feature's usage. + +4. If the feature is related to a paper, please include a link. + +## Do you want to implement a new attention mechanism? + +New attention mechanisms and optimizations are constantly being developed. If you want to implement a new mechanism, please provide: + +* A short description of the attention mechanism and a link to the paper. +* Link to the implementation if it is open-sourced. +* Performance benchmarks compared to existing methods. +* CUDA compute capability requirements. + +## Do you want to add documentation? + +We're always looking for improvements to the documentation that make it clearer and more accurate. Please let us know how the documentation can be improved, such as typos and any content that is missing, unclear, or inaccurate. + +## Create a Pull Request + +Before writing any code, we strongly advise you to search through the existing PRs or issues to make sure nobody is already working on the same thing. + +You will need basic `git` proficiency to contribute to Flash-DMA. You'll need **Python 3.8+** and **CUDA 11.8+** to contribute. + +### Development Setup + +1. Fork the [repository](https://github.com/SmallDoges/flash-dmattn) by clicking on the **Fork** button. + +2. Clone your fork to your local disk, and add the base repository as a remote: + + ```bash + git clone https://github.com/<your-username>/flash-dmattn.git + cd flash-dmattn + git remote add upstream https://github.com/SmallDoges/flash-dmattn.git + ``` + +3. Create a new branch to hold your development changes: + + ```bash + git checkout -b a-descriptive-name-for-my-changes + ``` + + 🚨 **Do not** work on the `main` branch! + +4. Set up a development environment: + + ```bash + # Ensure CUDA environment is properly set up + export CUDA_HOME=/usr/local/cuda # Adjust path as needed + + # Install in development mode + pip install -e . + + # Install development dependencies + pip install pytest numpy + ``` + +5.
Develop the features in your branch. + + As you work on your code, you should make sure the test suite passes: + + ```bash + python -m pytest tests/ -v + ``` + + Flash-DMA also includes performance benchmarks. Run them to ensure your changes don't regress performance: + + ```bash + python benchmarks/forward_performance.py + python benchmarks/forward_equivalence.py + ``` + + For CUDA development, ensure your changes compile across supported architectures: + + ```bash + python setup.py build_ext --inplace + ``` + +6. Once you're happy with your changes, add changed files using `git add` and record your changes with `git commit`: + + ```bash + git add . + git commit -m "A descriptive commit message" + ``` + + Please write [good commit messages](https://chris.beams.io/posts/git-commit/). + +7. Go to your fork on GitHub and click on **Pull Request** to open a pull request. + +### Pull request checklist + +☐ The pull request title should summarize your contribution.
+☐ If your pull request addresses an issue, please mention the issue number in the pull request description so that the pull request and the issue are linked.
+☐ To indicate a work in progress, please prefix the title with `[WIP]`.
+☐ Make sure existing tests pass.
+☐ If adding a new feature, also add tests for it.
+☐ If implementing new CUDA kernels, ensure they work across all supported compute capabilities (SM 8.0+).
+☐ All public methods must have informative docstrings (see the sketch after this checklist).
+☐ Performance benchmarks should not regress significantly.
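+
+For the docstring item above, the sketch below shows roughly the level of detail we mean. The helper and its parameters are hypothetical and shown only for illustration; they are not part of the Flash-DMA API:
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def pad_to_block_size(attn_mask: torch.Tensor, block_size: int = 128) -> torch.Tensor:
+    """Pad an attention mask so its last dimension is a multiple of ``block_size``.
+
+    Args:
+        attn_mask: Float or boolean mask of shape ``(batch, seqlen)``.
+        block_size: Target multiple for the sequence dimension. Defaults to 128.
+
+    Returns:
+        A tensor of shape ``(batch, padded_seqlen)``, where ``padded_seqlen`` is the
+        smallest multiple of ``block_size`` not less than ``seqlen``; padded positions are 0.
+    """
+    seqlen = attn_mask.shape[-1]
+    pad = (-seqlen) % block_size  # distance to the next multiple of block_size
+    if pad == 0:
+        return attn_mask
+    return F.pad(attn_mask, (0, pad), value=0)
+```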
+ +### Tests + +An extensive test suite is included to test the library behavior and performance. Tests can be found in the [tests](https://github.com/SmallDoges/flash-dmattn/tree/main/tests) folder and benchmarks in the [benchmarks](https://github.com/SmallDoges/flash-dmattn/tree/main/benchmarks) folder. + +We use `pytest` for testing. From the root of the repository, run: + +```bash +python -m pytest tests/ -v +``` + +For performance testing: + +```bash +python -m pytest benchmarks/ -v +``` + +### CUDA Development Guidelines + +When contributing CUDA code: + +1. **Test across architectures**: Ensure your code works on SM 8.0, 9.0, and 10.0. +2. **Memory efficiency**: Profile memory usage and ensure no memory leaks. +3. **Performance**: Benchmark against existing implementations. +4. **Documentation**: Document kernel parameters and expected performance characteristics. + +### Code Style + +We follow standard Python code style guidelines: + +* Use descriptive variable names +* Add type hints where applicable +* Follow PEP 8 guidelines +* Add docstrings to all public functions + +For CUDA code: +* Use clear variable names +* Comment complex kernel logic +* Follow NVIDIA CUDA best practices + +## Security + +If you discover a security vulnerability, please send an e-mail to the maintainers. All security vulnerabilities will be promptly addressed. + +## Questions? + +If you have questions about contributing, feel free to ask in the [GitHub Discussions](https://github.com/SmallDoges/flash-dmattn/discussions) or open an issue. + +Thank you for contributing to Flash Dynamic Mask Attention! 🚀 diff --git a/README.md b/README.md index f88a67d..18ea4fb 100644 --- a/README.md +++ b/README.md @@ -179,34 +179,29 @@ python -c "import flash_dma_cuda; print('✅ Flash DMA CUDA extension imported s **Note**: Flash Dynamic Mask Attention requires CUDA compute capability 8.0+ for optimal performance. Earlier architectures are not supported. + ## Benchmarking Flash-DMA provides comprehensive benchmarking tools to evaluate performance across different configurations: ### Forward Pass Equivalence ```bash -python benchmarks/benchmark_forward_equivalence.py +python benchmarks/forward_equivalence.py ``` Validates numerical consistency between Python reference and CUDA implementation. ### Performance Benchmarking ```bash -python benchmarks/benchmark_forward_performance.py +python benchmarks/forward_performance.py ``` -Compares Flash-DMA against standard Flash Attention across various sequence lengths and batch sizes. +Compares Flash-DMA against standard SDPA across various sequence lengths and batch sizes. ### Gradient Computation ```bash -python benchmarks/benchmark_grad.py +python benchmarks/grad_equivalence.py ``` Tests backward pass implementation and gradient equivalence. -### Multi-Query Associative Recall -```bash -python benchmarks/benchmark_mqar.py -``` -Evaluates performance on long-range reasoning tasks. - ## Troubleshooting @@ -254,10 +249,37 @@ print_memory_stats() torch.cuda.empty_cache() ``` + +## Contributing + +We welcome contributions from the community! Flash-DMA is an open-source project and we value all types of contributions. + +### How to Contribute + +- **Report bugs**: Found a bug? Please [open an issue](https://github.com/SmallDoges/flash-dmattn/issues/new/choose) +- **Request features**: Have an idea for improvement? [Let us know](https://github.com/SmallDoges/flash-dmattn/issues/new/choose) +- **Submit code**: Ready to contribute code? 
Check our [Contributing Guide](CONTRIBUTING.md) +- **Improve docs**: Help us make the documentation better + +### Quick Start for Contributors + +1. Fork the repository +2. Create a feature branch: `git checkout -b feature-name` +3. Make your changes and test them +4. Submit a pull request + +For detailed instructions, see our [Contributing Guide](CONTRIBUTING.md). + +### Code of Conduct + +This project follows the [Contributor Covenant Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. + + ## License This project is licensed under the BSD 3-Clause License. See [LICENSE](LICENSE) for details. + ## Citation If you use Flash-DMA in your research, please cite: @@ -274,6 +296,7 @@ If you use Flash-DMA in your research, please cite: } ``` + ## Acknowledgments This project builds upon and integrates several excellent works: diff --git a/README_zh.md b/README_zh.md index a4551c2..96eede1 100644 --- a/README_zh.md +++ b/README_zh.md @@ -185,28 +185,22 @@ Flash-DMA 提供全面的基准测试工具,用于评估不同配置下的性 ### 前向传播等效性 ```bash -python benchmarks/benchmark_forward_equivalence.py +python benchmarks/forward_equivalence.py ``` 验证 Python 参考实现与 CUDA 实现之间的数值一致性。 ### 性能基准测试 ```bash -python benchmarks/benchmark_forward_performance.py +python benchmarks/forward_performance.py ``` -在各种序列长度和批大小下比较 Flash-DMA 与标准 Flash Attention。 +在各种序列长度和批大小下比较 Flash-DMA 与标准 SDPA。 ### 梯度计算 ```bash -python benchmarks/benchmark_grad.py +python benchmarks/grad_equivalence.py ``` 测试反向传播实现和梯度等效性。 -### 多查询联想回忆 -```bash -python benchmarks/benchmark_mqar.py -``` -评估长程推理任务的性能。 - ## 故障排除 @@ -254,6 +248,31 @@ print_memory_stats() torch.cuda.empty_cache() ``` + +## 贡献 + +我们欢迎社区的贡献!Flash-DMA 是一个开源项目,我们重视所有类型的贡献。 + +### 如何贡献 + +- **报告错误**: 发现了错误?请[提交 issue](https://github.com/SmallDoges/flash-dmattn/issues/new/choose) +- **功能请求**: 有改进想法?[告诉我们](https://github.com/SmallDoges/flash-dmattn/issues/new/choose) +- **提交代码**: 准备贡献代码?查看我们的[贡献指南](CONTRIBUTING.md) +- **改进文档**: 帮助我们完善文档 + +### 贡献者快速入门 + +1. Fork 仓库 +2. 创建功能分支: `git checkout -b feature-name` +3. 进行修改并测试 +4. 提交 Pull Request + +详细说明请参见我们的[贡献指南](CONTRIBUTING.md)。 + +### 行为准则 + +本项目遵循[贡献者公约行为准则](CODE_OF_CONDUCT.md)。参与时,您需要遵守此准则。 + ## 许可证 本项目采用 BSD 3-Clause 许可证。详情请参见 [LICENSE](LICENSE)。 diff --git a/SECURITY.md b/SECURITY.md new file mode 100644 index 0000000..020430e --- /dev/null +++ b/SECURITY.md @@ -0,0 +1,112 @@ +# Security Policy + +## Supported Versions + +We actively maintain and provide security updates for the following versions: + +| Version | Supported | +| ------- | ------------------ | +| Latest | :white_check_mark: | +| < Latest| :x: | + +## Security Considerations + +### CUDA Code Execution + +Flash Dynamic Mask Attention includes CUDA kernels and C++ extensions that execute on your GPU. When using this library: + +- Only install from trusted sources (official PyPI releases or verified builds) +- Be cautious when building from source with modifications +- Verify checksums when downloading pre-built binaries + +### Dependencies + +This library depends on: +- PyTorch (with CUDA support) +- NVIDIA CUTLASS library +- Standard Python scientific computing libraries + +We recommend keeping all dependencies up to date and using virtual environments for isolation. 
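+
+As a minimal sketch of that recommendation, the snippet below reports what is actually installed in the active environment before you run anything. It is illustrative only; the distribution names checked here are assumptions, so adjust them to match your setup:
+
+```python
+# Illustrative environment check; the package names below are assumptions about your setup.
+from importlib import metadata
+
+import torch
+
+for pkg in ("torch", "flash-dmattn"):
+    try:
+        print(f"{pkg}: {metadata.version(pkg)}")
+    except metadata.PackageNotFoundError:
+        print(f"{pkg}: not installed")
+
+print(f"CUDA build: {torch.version.cuda}")
+print(f"GPU available: {torch.cuda.is_available()}")
+```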
+ +### Memory Safety + +Our CUDA kernels are designed with memory safety in mind: +- Bounds checking is implemented where performance allows +- Memory allocation patterns are tested across different input sizes +- We use established patterns from Flash Attention and CUTLASS + +However, as with any low-level CUDA code: +- Very large input tensors may cause out-of-memory errors +- Invalid input shapes may cause undefined behavior +- Custom modifications to kernel code should be thoroughly tested + +## Reporting a Vulnerability + +If you discover a security vulnerability, please report it responsibly: + +**For security issues:** +- Email: losercheems@gmail.com +- Subject: [SECURITY] Flash-DMA Vulnerability Report +- Include: Detailed description, reproduction steps, and potential impact + +**For general bugs:** +- Use our [GitHub Issues](https://github.com/SmallDoges/flash-dmattn/issues) +- Follow our [contributing guidelines](CONTRIBUTING.md) + +## Response Timeline + +- **Acknowledgment**: Within 48 hours +- **Initial Assessment**: Within 1 week +- **Resolution**: Depends on severity and complexity + +Critical security issues will be prioritized and may result in emergency releases. + +## Security Best Practices + +When using Flash Dynamic Mask Attention: + +1. **Environment Isolation** + ```bash + # Use virtual environments + python -m venv flash_dma_env + source flash_dma_env/bin/activate # Linux/Mac + # or + flash_dma_env\Scripts\activate # Windows + ``` + +2. **Dependency Management** + ```bash + # Keep dependencies updated + pip install --upgrade torch flash-dmattn + ``` + +3. **Input Validation** + ```python + # Validate tensor shapes and dtypes before processing + assert query.dtype in [torch.float16, torch.bfloat16, torch.float32] + assert query.shape == key.shape == value.shape + ``` + +4. **Resource Monitoring** + ```python + # Monitor GPU memory usage + import torch + print(f"GPU Memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB") + ``` + +## Disclosure Policy + +- Confirmed vulnerabilities will be disclosed responsibly +- Security fixes will be released as soon as safely possible +- CVE numbers will be requested for significant vulnerabilities +- Credit will be given to security researchers who report issues responsibly + +## Contact + +For security-related questions or concerns: +- Primary: losercheems@gmail.com +- Project maintainers: See [AUTHORS](AUTHORS) file + +For general support: +- GitHub Issues: https://github.com/SmallDoges/flash-dmattn/issues +- Documentation: https://github.com/SmallDoges/flash-dmattn/tree/main/docs/