47 changes: 47 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,47 @@
---
name: Bug report
about: Create a report to help us improve Flash-DMA
title: '[BUG] '
labels: 'bug'
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Import flash_dmattn
2. Run the following code:
```python
# Paste your code here
```
3. See error
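
For reference, a minimal reproduction might look like the sketch below (the import path and `flash_dmattn_func` signature are assumptions for illustration, not the confirmed Flash-DMA API; adapt to your actual call):
```python
# Hypothetical minimal reproduction -- the import and function
# signature below are assumptions, not the confirmed Flash-DMA API.
import torch
from flash_dmattn import flash_dmattn_func  # assumed entry point

batch, seqlen, heads, head_dim = 1, 4096, 8, 64
q = torch.randn(batch, seqlen, heads, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

out = flash_dmattn_func(q, k, v)  # point out where the error occurs
print(out.shape)
```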

**Expected behavior**
A clear and concise description of what you expected to happen.

**Environment Information**
Please run the following and paste the output:
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name() if torch.cuda.is_available() else \"None\"}')"
```

**Additional context**
- OS: [e.g. Ubuntu 20.04, Windows 10, macOS 12]
- Python version: [e.g. 3.9.7]
- Flash-DMA version: [e.g. 0.1.0]
- CUDA Compute Capability: [e.g. 8.6]

**Error traceback**
If applicable, add the full error traceback:
```
Paste the full traceback here
```

**Debugging Information**
Add any other context about the problem here, including:
- Sequence lengths and batch sizes you're using
- Whether this works with standard PyTorch SDPA (see the comparison sketch below)
- Any custom modifications to the code
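
A cross-check against PyTorch SDPA along these lines is helpful to include (again, `flash_dmattn_func` and the `(batch, seqlen, heads, head_dim)` layout are assumptions; adjust to the real API):
```python
# Hedged SDPA cross-check; the import, signature, and tensor layout
# are assumptions -- adjust to the actual Flash-DMA API.
import torch
import torch.nn.functional as F
from flash_dmattn import flash_dmattn_func  # assumed entry point

q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

out = flash_dmattn_func(q, k, v)

# SDPA expects (batch, heads, seqlen, head_dim), hence the transposes.
ref = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)

print(f"max abs diff: {(out - ref).abs().max().item():.3e}")  # include this value
```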
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,39 @@
---
name: Feature request
about: Suggest an idea for Flash-DMA
title: '[FEATURE] '
labels: 'enhancement'
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is, e.g., "I'm always frustrated when [...]"

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Implementation details**
If you have thoughts on implementation:
- Would this require CUDA kernel changes?
- Does this affect the Python API?
- Are there performance implications?
- Any compatibility concerns with different GPU architectures?

**Use case**
Describe your specific use case:
- What sequence lengths are you working with?
- What is your target application (e.g., long document processing, code generation)?
- How would this feature improve your workflow?

**Additional context**
Add any other context or screenshots about the feature request here.

**Related work**
If this feature is inspired by a paper or existing implementation, please provide:
- Link to paper/implementation
- Brief explanation of the technique
- Why it would be valuable for Flash-DMA users
50 changes: 50 additions & 0 deletions .github/ISSUE_TEMPLATE/performance_issue.md
@@ -0,0 +1,50 @@
---
name: Performance issue
about: Report performance problems or optimization opportunities
title: '[PERFORMANCE] '
labels: 'performance'
assignees: ''

---

**Performance Issue Description**
Describe the performance problem you're experiencing.

**Current Performance**
Please provide benchmark results:
- Sequence length: [e.g., 4096, 8192, 16384]
- Batch size: [e.g., 1, 2, 4]
- Number of heads: [e.g., 16, 32]
- Head dimension: [e.g., 64, 128]
- Current speed: [e.g., 15.2 ms/iteration]
- Memory usage: [e.g., 8.5 GB]

**Expected Performance**
What performance would you expect, and why?
- Expected speed: [e.g., <10 ms/iteration]
- Comparison baseline: [e.g., PyTorch SDPA, Flash Attention]

**Environment Information**
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name() if torch.cuda.is_available() else \"None\"}')"
```

**Benchmark Code**
Provide the code you used for benchmarking:
```python
# Paste your benchmark code here
```
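
If you don't have a harness handy, a minimal timing loop along these lines works well (CUDA events give accurate GPU timing; the `flash_dmattn_func` call is an assumed API, so substitute your actual invocation):
```python
# Minimal benchmark sketch -- flash_dmattn_func is an assumed entry point.
import torch
from flash_dmattn import flash_dmattn_func  # assumed import

batch, seqlen, heads, head_dim = 2, 8192, 16, 64
q = torch.randn(batch, seqlen, heads, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Warm up so kernel compilation and caching do not skew the measurement.
for _ in range(10):
    flash_dmattn_func(q, k, v)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    flash_dmattn_func(q, k, v)
end.record()
torch.cuda.synchronize()

print(f"{start.elapsed_time(end) / iters:.2f} ms/iteration")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```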

**Profiling Information**
If you have profiling data (from nsys, nvprof, or PyTorch profiler), please include relevant excerpts.
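
If you have no trace yet, the built-in PyTorch profiler is an easy starting point; a sketch (with the same assumed `flash_dmattn_func` entry point):
```python
# PyTorch profiler sketch; flash_dmattn_func is an assumed entry point.
import torch
from torch.profiler import profile, ProfilerActivity
from flash_dmattn import flash_dmattn_func  # assumed import

q = torch.randn(1, 4096, 16, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        flash_dmattn_func(q, k, v)

# Paste the top rows of this table into the issue.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```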

**System Information**
- GPU model and memory: [e.g., RTX 4090 24GB]
- CUDA Compute Capability: [e.g., 8.9]
- CPU: [e.g., Intel i9-12900K]
- RAM: [e.g., 32GB DDR4]

**Additional Context**
- Is this a regression from a previous version?
- Have you tried different batch sizes or sequence lengths?
- Any specific attention patterns (causal, full, custom masks)?
93 changes: 93 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,93 @@
# Pull Request Template

## Description
Please provide a clear and concise description of your changes.

## Type of Change
Please check the relevant option(s):

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Performance optimization
- [ ] CUDA kernel improvement
- [ ] Code refactoring

## Related Issues
Please link any related issues:
- Fixes #(issue number)
- Related to #(issue number)

## Changes Made
Please describe the changes you made:

### Code Changes
- [ ] Modified Python API
- [ ] Updated CUDA kernels
- [ ] Changed build system
- [ ] Updated dependencies

### Documentation
- [ ] Updated README
- [ ] Updated API documentation
- [ ] Added examples
- [ ] Updated benchmarks

## Testing
Please describe the tests you ran to verify your changes:

- [ ] Existing tests pass: `python -m pytest tests/ -v`
- [ ] Added new tests for new functionality
- [ ] Benchmarks show no performance regression
- [ ] Tested on multiple GPU architectures (if applicable)

### Test Configuration
- OS: [e.g., Ubuntu 20.04]
- Python: [e.g., 3.9.7]
- PyTorch: [e.g., 2.1.0]
- CUDA: [e.g., 11.8]
- GPU: [e.g., RTX 4090]

## Performance Impact
If this change affects performance, please provide benchmarks:

### Before
```
# Benchmark results before your changes
```

### After
```
# Benchmark results after your changes
```

## Breaking Changes
If this PR introduces breaking changes, please describe:
- What breaks
- How users can migrate their code
- Why the breaking change is necessary

## Checklist
Please check all that apply:

- [ ] My code follows the project's style guidelines
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] Any dependent changes have been merged and published

### CUDA-specific (if applicable)
- [ ] CUDA kernels compile without warnings
- [ ] Tested on SM 8.0+ architectures
- [ ] Memory usage has been profiled
- [ ] No memory leaks detected

## Additional Notes
Any additional information that reviewers should know:

## Screenshots (if applicable)
If your changes include visual elements or performance improvements, please add screenshots or graphs.