v0.2.12

@yongwww released this 18 Aug 16:46
· 1090 commits to main since this release
ae1480c

What's Changed

  • Fix TRTLLM NVFP4-out attention kernel scale factor dim issue by @elvischenv in #1460
  • perf: add fast path to TopPRenormProbKernel for top_p >= 1.0, significantly boosting SGLang workloads by @TianyuZhang1214 in #1483
  • fix: update cutedsl masked moe gemm by @yyihuang in #1488
  • feat: Support fp8 qkv, fp16/bf16 out MHA for trtllm-gen. by @weireweire in #1490
  • Add errors when dtype is anything other than int32 for ptr metadata by @pavanimajety in #1492
  • refactor: unify autotuner for bmm_fp8 by @ttyio in #1479
  • fix: update masked moe gemm fp4 tensor reshape by @yyihuang in #1495
  • Revert "feat: Support fp8 qkv, fp16/bf16 out MHA for trtllm-gen. (#1490)" by @yzh119 in #1496
  • fix(aot): unused compute in has_sm by @fecet in #1501
  • fix: Replace cub Max/Min with cuda::maximum/minimum for cuda 13 compatibility by @yongwww in #1500
  • doc: Update the masked grouped gemm doc by @kaixih in #1499
  • Perf: support scale_a/scale_b instead of combined scale in cutlass bmm_fp8 by @ttyio in #1491
  • feat: scaling at fp4 gemm epilogue by @yyihuang in #1498
  • Add benchmark for cutedsl gemm by @fzyzcjy in #1502
  • Do not import NVSHMEM in the AoT script unless explicitly requested by @nandor in #1506
  • bugfix: Fix stream handling in cutedsl gemm by @fzyzcjy in #1509
  • bump version to v0.2.12 by @yongwww in #1510

Full Changelog: v0.2.11.post3...v0.2.12