47 changes: 47 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,47 @@
---
name: Bug report
about: Create a report to help us improve Flash-DMA
title: '[BUG] '
labels: 'bug'
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Import flash_dmattn
2. Run the following code:
```python
# Paste your code here
```
3. See error
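
For reference, a minimal reproduction might look like the sketch below (the import path and `flash_dmattn_func` signature are assumptions for illustration, not the confirmed Flash-DMA API; adapt to your actual call):
```python
# Hypothetical minimal reproduction -- the import and function
# signature below are assumptions, not the confirmed Flash-DMA API.
import torch
from flash_dmattn import flash_dmattn_func  # assumed entry point

batch, seqlen, heads, head_dim = 1, 4096, 8, 64
q = torch.randn(batch, seqlen, heads, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

out = flash_dmattn_func(q, k, v)  # point out where the error occurs
print(out.shape)
```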

**Expected behavior**
A clear and concise description of what you expected to happen.

**Environment Information**
Please run the following and paste the output:
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name() if torch.cuda.is_available() else \"None\"}')"
```

**Additional context**
- OS: [e.g. Ubuntu 20.04, Windows 10, macOS 12]
- Python version: [e.g. 3.9.7]
- Flash-DMA version: [e.g. 0.1.0]
- CUDA Compute Capability: [e.g. 8.6]

**Error traceback**
If applicable, add the full error traceback:
```
Paste the full traceback here
```

**Debugging Information**
Add any other context about the problem here, including:
- Sequence lengths and batch sizes you're using
- Whether this works with standard PyTorch SDPA (see the comparison sketch below)
- Any custom modifications to the code
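
A cross-check against PyTorch SDPA along these lines is helpful to include (again, `flash_dmattn_func` and the `(batch, seqlen, heads, head_dim)` layout are assumptions; adjust to the real API):
```python
# Hedged SDPA cross-check; the import, signature, and tensor layout
# are assumptions -- adjust to the actual Flash-DMA API.
import torch
import torch.nn.functional as F
from flash_dmattn import flash_dmattn_func  # assumed entry point

q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

out = flash_dmattn_func(q, k, v)

# SDPA expects (batch, heads, seqlen, head_dim), hence the transposes.
ref = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)

print(f"max abs diff: {(out - ref).abs().max().item():.3e}")  # include this value
```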
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,39 @@
---
name: Feature request
about: Suggest an idea for Flash-DMA
title: '[FEATURE] '
labels: 'enhancement'
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is, e.g., "I'm always frustrated when [...]"

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Implementation details**
If you have thoughts on implementation:
- Would this require CUDA kernel changes?
- Does this affect the Python API?
- Are there performance implications?
- Any compatibility concerns with different GPU architectures?

**Use case**
Describe your specific use case:
- What sequence lengths are you working with?
- What is your target application (e.g., long document processing, code generation)?
- How would this feature improve your workflow?

**Additional context**
Add any other context or screenshots about the feature request here.

**Related work**
If this feature is inspired by a paper or existing implementation, please provide:
- Link to paper/implementation
- Brief explanation of the technique
- Why it would be valuable for Flash-DMA users
50 changes: 50 additions & 0 deletions .github/ISSUE_TEMPLATE/performance_issue.md
@@ -0,0 +1,50 @@
---
name: Performance issue
about: Report performance problems or optimization opportunities
title: '[PERFORMANCE] '
labels: 'performance'
assignees: ''

---

**Performance Issue Description**
Describe the performance problem you're experiencing.

**Current Performance**
Please provide benchmark results:
- Sequence length: [e.g., 4096, 8192, 16384]
- Batch size: [e.g., 1, 2, 4]
- Number of heads: [e.g., 16, 32]
- Head dimension: [e.g., 64, 128]
- Current speed: [e.g., 15.2 ms/iteration]
- Memory usage: [e.g., 8.5 GB]

**Expected Performance**
What performance would you expect, and why?
- Expected speed: [e.g., <10 ms/iteration]
- Comparison baseline: [e.g., PyTorch SDPA, Flash Attention]

**Environment Information**
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name() if torch.cuda.is_available() else \"None\"}')"
```

**Benchmark Code**
Provide the code you used for benchmarking:
```python
# Paste your benchmark code here
```
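
If you don't have a harness handy, a minimal timing loop along these lines works well (CUDA events give accurate GPU timing; the `flash_dmattn_func` call is an assumed API, so substitute your actual invocation):
```python
# Minimal benchmark sketch -- flash_dmattn_func is an assumed entry point.
import torch
from flash_dmattn import flash_dmattn_func  # assumed import

batch, seqlen, heads, head_dim = 2, 8192, 16, 64
q = torch.randn(batch, seqlen, heads, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Warm up so kernel compilation and caching do not skew the measurement.
for _ in range(10):
    flash_dmattn_func(q, k, v)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    flash_dmattn_func(q, k, v)
end.record()
torch.cuda.synchronize()

print(f"{start.elapsed_time(end) / iters:.2f} ms/iteration")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```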

**Profiling Information**
If you have profiling data (from nsys, nvprof, or PyTorch profiler), please include relevant excerpts.
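
If you have no trace yet, the built-in PyTorch profiler is an easy starting point; a sketch (with the same assumed `flash_dmattn_func` entry point):
```python
# PyTorch profiler sketch; flash_dmattn_func is an assumed entry point.
import torch
from torch.profiler import profile, ProfilerActivity
from flash_dmattn import flash_dmattn_func  # assumed import

q = torch.randn(1, 4096, 16, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        flash_dmattn_func(q, k, v)

# Paste the top rows of this table into the issue.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```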

**System Information**
- GPU model and memory: [e.g., RTX 4090 24GB]
- CUDA Compute Capability: [e.g., 8.9]
- CPU: [e.g., Intel i9-12900K]
- RAM: [e.g., 32GB DDR4]

**Additional Context**
- Is this a regression from a previous version?
- Have you tried different batch sizes or sequence lengths?
- Any specific attention patterns (causal, full, custom masks)?
93 changes: 93 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,93 @@
# Pull Request Template

## Description
Please provide a clear and concise description of your changes.

## Type of Change
Please check the relevant option(s):

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Performance optimization
- [ ] CUDA kernel improvement
- [ ] Code refactoring

## Related Issues
Please link any related issues:
- Fixes #(issue number)
- Related to #(issue number)

## Changes Made
Please describe the changes you made:

### Code Changes
- [ ] Modified Python API
- [ ] Updated CUDA kernels
- [ ] Changed build system
- [ ] Updated dependencies

### Documentation
- [ ] Updated README
- [ ] Updated API documentation
- [ ] Added examples
- [ ] Updated benchmarks

## Testing
Please describe the tests you ran to verify your changes:

- [ ] Existing tests pass: `python -m pytest tests/ -v`
- [ ] Added new tests for new functionality
- [ ] Benchmarks show no performance regression
- [ ] Tested on multiple GPU architectures (if applicable)

### Test Configuration
- OS: [e.g., Ubuntu 20.04]
- Python: [e.g., 3.9.7]
- PyTorch: [e.g., 2.1.0]
- CUDA: [e.g., 11.8]
- GPU: [e.g., RTX 4090]

## Performance Impact
If this change affects performance, please provide benchmarks:

### Before
```
# Benchmark results before your changes
```

### After
```
# Benchmark results after your changes
```

## Breaking Changes
If this PR introduces breaking changes, please describe:
- What breaks
- How users can migrate their code
- Why the breaking change is necessary

## Checklist
Please check all that apply:

- [ ] My code follows the project's style guidelines
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] Any dependent changes have been merged and published

### CUDA-specific (if applicable)
- [ ] CUDA kernels compile without warnings
- [ ] Tested on SM 8.0+ architectures
- [ ] Memory usage has been profiled
- [ ] No memory leaks detected

## Additional Notes
Any additional information that reviewers should know:

## Screenshots (if applicable)
If your changes include visual elements or performance improvements, please add screenshots or graphs.