v0.2.12
What's Changed
- Fix TRTLLM NVFP4-out attention kernel scale factor dim issue by @elvischenv in #1460
- perf: add fast path to TopPRenormProbKernel for top_p >= 1.0, significantly boosting SGLang workloads by @TianyuZhang1214 in #1483
- fix: update cutedsl masked moe gemm by @yyihuang in #1488
- feat: Support fp8 qkv, fp16/bf16 out MHA for trtllm-gen. by @weireweire in #1490
- Add errors when dtype is anything other than int32 for ptr metadata by @pavanimajety in #1492
- refactor: unify autotuner for bmm_fp8 by @ttyio in #1479
- fix: update masked moe gemm fp4 tensor reshape by @yyihuang in #1495
- Revert "feat: Support fp8 qkv, fp16/bf16 out MHA for trtllm-gen. (#1490)" by @yzh119 in #1496
- fix(aot): unused compute in has_sm by @fecet in #1501
- fix: Replace cub Max/Min with cuda::maximum/minimum for cuda 13 compatibility by @yongwww in #1500
- doc: Update the masked grouped gemm doc by @kaixih in #1499
- Perf: support scale_a/scale_b instead of combined scale in cutlass bmm_fp8 by @ttyio in #1491
- feat: scaling at fp4 gemm epilogue by @yyihuang in #1498
- Add benchmark for cutedsl gemm by @fzyzcjy in #1502
- Do not import NVSHMEM in the AoT script unless explicitly requested by @nandor in #1506
- bugfix: Fix stream handling in cutedsl gemm by @fzyzcjy in #1509
- bump version to v0.2.12 by @yongwww in #1510
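The TopPRenormProbKernel fast path from #1483 relies on a simple observation. When top_p >= 1.0, the nucleus is the whole vocabulary, so renormalizing an already-normalized distribution leaves it unchanged and the sort-and-accumulate work can be skipped. The following is a minimal pure-Python sketch of top-p renormalization with that shortcut. It assumes a single, already-normalized probability row. `top_p_renorm` is a hypothetical helper and not FlashInfer's API, and the CUDA kernel's tie-breaking and thresholding details may differ.

```python
def top_p_renorm(probs: list[float], top_p: float) -> list[float]:
    """Zero out tokens outside the top-p nucleus and renormalize (illustrative sketch).

    Assumes `probs` is a single row that already sums to 1.0.
    """
    # Fast path: with top_p >= 1.0 every token survives, so renormalizing a
    # normalized distribution is a no-op and the sort can be skipped entirely.
    if top_p >= 1.0:
        return list(probs)

    # Visit token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Grow the nucleus until its cumulative mass reaches top_p
    # (the token that crosses the threshold is kept).
    kept = set()
    cumulative = 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens; everything else becomes 0.
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]
```

For example, `top_p_renorm([0.5, 0.3, 0.2], 0.7)` keeps the first two tokens and rescales them to roughly 0.625 and 0.375, while `top_p_renorm(p, 1.0)` returns a copy of `p` without sorting.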
New Contributors
- @elvischenv made their first contribution in #1460
- @TianyuZhang1214 made their first contribution in #1483
- @pavanimajety made their first contribution in #1492
- @fecet made their first contribution in #1501
Full Changelog: v0.2.11.post3...v0.2.12