🎄 v0.4.1
What's Changed
- [GDN] fix oom on A6000 by @sustcsonglin in #622
- [KDA] Fix `_no_weight_decay` term by @yzhangcs in #629
- Lint all files by @zhiyuan1i in #628
- Fix kda.gate not enforcing contiguous mem layout by @Qubitium in #627
- Fix docstring for 'g' parameter shape by @GuoYiFantastic in #631
- Fix: Correct K/V dimension mismatch in path_attn bwd kernels, changing K/BK to V/BV for v and dv operations by @ReyJerry in #633
- KDA - fix: don't force fused_recurrent when in training mode with small sequences by @masc-it in #636
- [KDA]: Fuse beta.float().sigmoid() in fused_kda_gate by @zhiyuan1i in #642
- [Perf] add chunk_indices parameter to avoid redundant computation by @zhiyuan1i in #641
- Added a badge for Ask DeepWiki to the README to auto-refresh the wiki weekly by @richardodliu in #644
- [Tril] Enable precision autotune [skip test] by @zhiyuan1i in #646
- [BC] Capitalize all envs by @yzhangcs in #650
- deprecate `fused_chunk_gla` and `safe_exp`; fix kda exp mask by @sustcsonglin in #652
- [kda kernel optimization] implement token-parallel intra-chunk attention by @sustcsonglin in #653
- [KDA] Faster inter computation in 64x64 intra fwd by @yzhangcs in #658
- Add PTX softplus by @yzhangcs in #660
- [KDA] Support fused forget gate by @yzhangcs in #662
- [KDA] Remove beta from fused gate by @yzhangcs in #665
- [Softplus] Support AMD/Intel devices by @zhiyuan1i in #664
- [KDA] Fix oob bugs in intra fwd by @yzhangcs in #673
- [L2Norm] Avoid recompilation for variable-length inputs by @retonym in #669
- fix: set the dtype of RMSNorm to float32 to avoid precision underflow by @pprp in #676
- [KDA] Changed all exp to exp2 by @Nathancgy in #679
- [KDA] Fuse inplace add by @yzhangcs in #682
- Add head_dim parameter to NSA layer by @mutiann in #683
- [KDA] Fuse dAqk and dv by @yzhangcs in #689
- Temporary workaround to disable TritonGPUHoistTMEMAlloc in b_dk += tl.dot(tl.trans(b_dA), b_kb) by @rucnyz in #687
- [NSA] fix compression branch dkv kernel dk and dv pointer impl by @yibozhong in #690
- [KDA] fused bwd kernels inter and prepare wy by @Nathancgy in #688
- [GDN] Fix potential oob for long inputs by @yzhangcs in #692
- [GDN] Support beta in float32 by @AwesomeSeq in #693
- [GSA] Fix gate oob bugs by @yzhangcs in #694
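
For readers wondering what the `exp` to `exp2` change in #679 implies numerically, the sketch below shows the standard identity such a switch generally relies on: e^x = 2^(x · log2 e). It is an illustration in plain Python, not the KDA kernel code; the helper name `exp_via_exp2` is hypothetical, and where the `log2(e)` scaling is actually applied in the kernels is not stated in these notes.

```python
import math

# log2(e) = 1 / ln(2); scaling by this constant converts a natural-base
# exponent into a base-2 exponent.
LOG2_E = math.log2(math.e)


def exp_via_exp2(x: float) -> float:
    """Compute e**x as 2**(x * log2(e)). Hypothetical helper for illustration."""
    return 2.0 ** (x * LOG2_E)


# The two forms agree to floating-point precision.
for x in (-3.0, 0.0, 0.5, 4.0):
    assert math.isclose(exp_via_exp2(x), math.exp(x), rel_tol=1e-12)
```

A base-2 exponential is often preferred in GPU kernels because it tends to map more directly onto hardware instructions, with the constant scaling folded into a value computed earlier.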
New Contributors
- @Qubitium made their first contribution in #627
- @GuoYiFantastic made their first contribution in #631
- @ReyJerry made their first contribution in #633
- @masc-it made their first contribution in #636
- @retonym made their first contribution in #669
- @pprp made their first contribution in #676
- @mutiann made their first contribution in #683
Full Changelog: v0.4.0...v0.4.1