Release v0.6.14

@github-actions released this 02 Jul 00:14
v0.6.14
19f1a41

What's Changed

  • fix: use explicit .ptr on scalar struct fields in monolithic MLA decode to get cleaner logs by @jdebache in #3458
  • fix: One sided MOE A2A warp token policy hangs in some cases, disable… by @djns99 in #3371
  • test: correct XQA NVFP4 skip reason to SM120/SM121 by @deng451e in #3510
  • fix XQA NVFP4 head dim by @Njuapp in #3534
  • tests: split test_trtllm_gen_attention.py into prefill / decode / decode-xqa shards by @bkryu in #3162
  • docs(misc): expose DiT/RoPE norms; new RSTs for GDN decode/prefill/Mamba by @kangbintNV in #3446
  • perf: cache cudaGetDeviceProperties in CudaDevice (fmha_v2) by @aws-jiadingg in #3522
  • FMHAv2 on SM120 for head_dim 256/512 + sliding-window masks by @dbari in #3518
  • feat: add FlashInfer Trace Apply by @yongwww in #3240
  • Fix smem race in FilteredTopK overflow refinement by @awgu in #3529
  • Extend Require Workspace Size and Disable Overlapping Critical Workspace Sections w/ 'auto' backend + autotuning for MLA decode by @Vinnie6167 in #3465
  • perf(sampling): Optimize top_k_top_p_sampling_from_logits/from_probs for large-vocab small-k sampling by @bkryu in #3461
  • Use an environment variable to control the number of reserved SMs for overlapping in TRT-LLM fused MoE by @jinyangyuan-nvidia in #3483
  • Remove excessive CuteDslMoEWrapper memory allocation by @nvjullin in #3404
  • docs: close v0.6.13 doc-check gaps + fix(moe) misleading topk_indices ICHECK message by @kangbintNV in #3546
  • Unified MoE API: MoELayer with cross-backend NVFP4 autotune by @aleozlx in #3093
  • Support smaller DSv4 sparse MLA head counts by @PerkzZheng in #3545
  • feat(bench): add GDN routines to flashinfer_benchmark.py by @bkryu in #3572
  • Add BF16 MoE SwiGLU OA params by @ruoqianguo in #3532
  • perf(gemm): update mm_fp4 b12x SM120 NVFP4 dense GEMM kernel by @yichengj0 in #3560
  • feat: add sm90 delta rule dsl prefill by @guangyunh-nv in #3477
  • fix: headDim=512 GQA decode (#3343) by @djmmoss in #3393
  • Add Qidi Sang to CODEOWNERS by @saltyminty in #3593
  • [feat] Add gated tanh-GELU (GeluTanh) activation to CUTLASS fused MoE (GEMMA 4) by @jhaotingc in #3501
  • [chore] Add jiahanc to GDN committer by @jiahanc in #3600
  • fix(ci): skip TRT-LLM Gen BF16 MoE SwiGLU test outside SM10.x by @ruoqianguo in #3580
  • fix(mla): warn when backend='auto' falls back to non-Blackwell kernel on SM>=100 by @elwhyjay in #3405
  • [fix] Fix SM100 GDN prefill hang by @jiahanc in #3581
  • feat: improve gated delta rule benchmark script by @guangyunh-nv in #3616
  • feat: delta rule work with zero length sequence by @guangyunh-nv in #3536
  • feat(cutile): introduce cuTile backend (mm_bf16 + bmm_bf16 + gemm_fp8_nt_groupwise) by @yifeis-nv in #3426
  • chore: remove leftover cpp srcs from #3477 by @guangyunh-nv in #3613
  • feat: add mxfp8 quant to moe a2a combine by @IwakuraRein in #3376
  • fix: scope trtllm-gen block_size guard to its own backend by @elwhyjay in #3428
  • feat(attention): add SM120 sparse MLA kernels by @lucifer1004 in #3395
  • test: match test output scaling with cutedsl kernel (scale before bf16 cast) by @saltyminty in #3596
  • fix(sampling): fix shared-memory race and out-of-range token id in SamplingFromLogitsKernel by @kahyunnam in #3624
  • feat(attention): head_dim=512 support for attention prefill & decode for Gemma 4 on SM120/121 by @bkryu in #3576
  • [fix] explicitly validate grouped_gemm_nt_masked layout contract to prevent NaN contamination by @kahyunnam in #3574
  • fix(gemm): relax b12x FP4 K constraint from 128 to 32 (TMA alignment) by @yichengj0 in #3646
  • Add MXFP8 MoE SwiGLU OA parameters by @ruoqianguo in #3504
  • Add support for non-multiple VEC_COLS in fused_rmsnorm_silu for bf16/fp8 by @xueweilnvidia in #3417
  • fix(topk): eliminate multi-CTA radix top-k stream hangs on SM120/SM121 by @waynehacking8 in #3615
  • feat: add sm120 delta rule dsl prefill by @guangyunh-nv in #3479
  • Add SM120 NVFP4 attention JIT path by @tiffany940107 in #3640
  • Add MXFP8 MoE GEMM entry (cute SM120 backend) by @CarstyYou in #3562
  • feat: Add BF16_FP4 GEMM with cuDNN and CuTe-DSL backends for SM120/121 for W4A16 workloads by @bkryu in #3597
  • perf(gdn): make GDN kernels compilation batch-size agnostic (support dynamic batch shapes for vLLM integration) by @kahyunnam in #3649
  • feat: add sm90 cp delta rule dsl by @guangyunh-nv in #3481
  • fix(sampling): terminate top-p search at adjacent float bounds by @djmmoss in #3623
  • fix: fix tinygemm barrier bug by @yweng0828 in #3630
  • feat(gdn): BF16 state recovery/decode kernel with per-request K and f… by @ameynaik-hub in #3502
  • feat(benchmark): add MLA --mla_is_var_seq / --mla_cute_dsl_impl knobs by @lunarz-dev in #3695
  • Enable SwapAB in mm_fp4 cute-dsl backend when M is not a multiple of 8. by @b8zhong in #3667
  • fix(moe): Fix unbounded weight-cache growth in b12x MoE by @bkryu in #3709
  • Add Relu2 + ungated MoE to CuteDSL MoE by @b8zhong in #3642
  • trtllm_batch_decode_with_kv_cache_mla trtllm-gen backend cum_seq_lens_q support by @saltyminty in #3238
  • fix: pass XQA NVFP4 scale-factor strides by @Njuapp in #3608
  • feat(attention): FP8 KV cache support for Hopper SM90 MLA by @qsang-nv in #3694
  • [GDN] sm100: support more state dtype by @Observer007 in #3715
  • feat: cuTile Grouped MXFP8 Quantization by @philipphack in #3657
  • Accept uint8 workspaces in CUTE DSL MLA decode by @leejnau in #3599
  • tests: split test_trtllm_gen_fused_moe.py into shards by @feih-nv in #3635
  • perf(moe): Enhance CuteDSL NVF4 MOE Perf by @liyuhannnnn in #3564
  • fix(mamba): reject SM120/SM121 in SSDCombined with a clear error by @waynehacking8 in #3668
  • docs(gemm): add missing .rst entries for mm_bf16_fp4 and prepare_bf16_fp4_weights by @kangbintNV in #3710
  • docs: add missing parameter entries to docstrings and env vars by @kangbintNV in #3627
  • rename back by @aleozlx in #3730
  • docs: document scale_major_mode param and FLASHINFER_AUTOTUNE_DIR env var by @kangbintNV in #3696
  • feat: add mxfp4/nvfp4 quant to moe a2a combine by @IwakuraRein in #3643
  • chore: fix enable_pdl for trtllm-gen routing and finalize kernel by @IwakuraRein in #3588
  • Bug fix (gdn): Layout contract fix from #3649 by @kahyunnam in #3693
  • Ameyn/fix fp32 mtp pool out indices by @ameynaik-hub in #3490
  • Prune moe tests by @aleozlx in #3733
  • Fix the output allocation consistency of trtllm-gen MoE APIs by @b8zhong in #3678
  • feat(comm): Support per-token LoRA Info in MoE a2a comm payloads by @JyChang012 in #3375
  • docs(gdn): document missing ssm_state_indices param in gated_delta_rule_mtp by @kangbintNV in #3725
  • feat(moe): enable DSFp8 + LoRA delta path by @zetacat in #3708
  • Improve GDN prefill perf by ~20-25% (mainloop efficiency) by @jhjpark in #3742
  • test: fix CUDA OOM in batch-prefill custom-mask test on 24GB CI GPUs by @waynehacking8 in #3609
  • Remove pagesize 16/32 assertion from xqa nvfp4 sm120 by @Njuapp in #3724

Full Changelog: v0.6.13rc2...v0.6.14