What's Changed
- fix: use explicit .ptr on scalar struct fields in monolithic MLA decode to get cleaner logs by @jdebache in #3458
- fix: One sided MOE A2A warp token policy hangs in some cases, disable… by @djns99 in #3371
- test: correct XQA NVFP4 skip reason to SM120/SM121 by @deng451e in #3510
- fix XQA NVFP4 head dim by @Njuapp in #3534
- tests: split test_trtllm_gen_attention.py into prefill / decode / decode-xqa shards by @bkryu in #3162
- docs(misc): expose DiT/RoPE norms; new RSTs for GDN decode/prefill/Mamba by @kangbintNV in #3446
- perf: cache cudaGetDeviceProperties in CudaDevice (fmha_v2) by @aws-jiadingg in #3522
- FMHAv2 on SM120 for head_dim 256/512 + sliding-window masks by @dbari in #3518
- feat: add FlashInfer Trace Apply by @yongwww in #3240
- Fix smem race in FilteredTopK overflow refinement by @awgu in #3529
- Extend Require Workspace Size and Disable Overlapping Critical Workspace Sections w/ 'auto' backend + autotuning for MLA decode by @Vinnie6167 in #3465
- perf(sampling): Optimize top_k_top_p_sampling_from_logits/from_probs for large-vocab small-k sampling by @bkryu in #3461
- Use an environment variable to control the number of reserved SMs for overlapping in TRT-LLM fused MoE by @jinyangyuan-nvidia in #3483
- Remove excessive CuteDslMoEWrapper memory allocation by @nvjullin in #3404
- docs: close v0.6.13 doc-check gaps + fix(moe) misleading topk_indices ICHECK message by @kangbintNV in #3546
- Unified MoE API: MoELayer with cross-backend NVFP4 autotune by @aleozlx in #3093
- Support smaller DSv4 sparse MLA head counts by @PerkzZheng in #3545
- feat(bench): add GDN routines to flashinfer_benchmark.py by @bkryu in #3572
- Add BF16 MoE SwiGLU OA params by @ruoqianguo in #3532
- perf(gemm): update mm_fp4 b12x SM120 NVFP4 dense GEMM kernel by @yichengj0 in #3560
- feat: add sm90 delta rule dsl prefill by @guangyunh-nv in #3477
- fix: headDim=512 GQA decode (#3343) by @djmmoss in #3393
- Add Qidi Sang to CODEOWNERS by @saltyminty in #3593
- [feat] Add gated tanh-GELU (GeluTanh) activation to CUTLASS fused MoE (GEMMA 4) by @jhaotingc in #3501
- [chore] Add jiahanc to GDN committer by @jiahanc in #3600
- fix(ci): skip TRT-LLM Gen BF16 MoE SwiGLU test outside SM10.x by @ruoqianguo in #3580
- fix(mla): warn when backend='auto' falls back to non-Blackwell kernel on SM>=100 by @elwhyjay in #3405
- [fix] Fix SM100 GDN prefill hang by @jiahanc in #3581
- feat: improve gated delta rule benchmark script by @guangyunh-nv in #3616
- feat: delta rule work with zero length sequence by @guangyunh-nv in #3536
- feat(cutile): introduce cuTile backend (mm_bf16 + bmm_bf16 + gemm_fp8_nt_groupwise) by @yifeis-nv in #3426
- chore: remove leftover cpp srcs from #3477 by @guangyunh-nv in #3613
- feat: add mxfp8 quant to moe a2a combine by @IwakuraRein in #3376
- fix: scope trtllm-gen block_size guard to its own backend by @elwhyjay in #3428
- feat(attention): add SM120 sparse MLA kernels by @lucifer1004 in #3395
- test: match test output scaling with cutedsl kernel (scale before bf16 cast) by @saltyminty in #3596
- fix(sampling): fix shared-memory race and out-of-range token id in SamplingFromLogitsKernel by @kahyunnam in #3624
- feat(attention): head_dim=512 support for attention prefill & decode for Gemma 4 on SM120/121 by @bkryu in #3576
- [fix] explicitly validate grouped_gemm_nt_masked layout contract to prevent NaN contamination by @kahyunnam in #3574
- fix(gemm): relax b12x FP4 K constraint from 128 to 32 (TMA alignment) by @yichengj0 in #3646
- Add MXFP8 MoE SwiGLU OA parameters by @ruoqianguo in #3504
- Add support for non-multiple VEC_COLS in fused_rmsnorm_silu for bf16/fp8 by @xueweilnvidia in #3417
- fix(topk): eliminate multi-CTA radix top-k stream hangs on SM120/SM121 by @waynehacking8 in #3615
- feat: add sm120 delta rule dsl prefill by @guangyunh-nv in #3479
- Add SM120 NVFP4 attention JIT path by @tiffany940107 in #3640
- Add MXFP8 MoE GEMM entry (cute SM120 backend) by @CarstyYou in #3562
- feat: Add BF16_FP4 GEMM with cuDNN and CuTe-DSL backends for SM120/121 for W4A16 workloads by @bkryu in #3597
- perf(gdn): make GDN kernels compilation batch-size agnostic (support dynamic batch shapes for vLLM integration) by @kahyunnam in #3649
- feat: add sm90 cp delta rule dsl by @guangyunh-nv in #3481
- fix(sampling): terminate top-p search at adjacent float bounds by @djmmoss in #3623
- fix: fix tinygemm barrier bug by @yweng0828 in #3630
- feat(gdn): BF16 state recovery/decode kernel with per-request K and f… by @ameynaik-hub in #3502
- feat(benchmark): add MLA --mla_is_var_seq / --mla_cute_dsl_impl knobs by @lunarz-dev in #3695
- Enable SwapAB in mm_fp4 cute-dsl backend when M is not a multiple of 8. by @b8zhong in #3667
- fix(moe): Fix unbounded weight-cache growth in b12x MoE by @bkryu in #3709
- Add Relu2 + ungated MoE to CuteDSL MoE by @b8zhong in #3642
- trtllm_batch_decode_with_kv_cache_mla trtllm-gen backend cum_seq_lens_q support by @saltyminty in #3238
- fix: pass XQA NVFP4 scale-factor strides by @Njuapp in #3608
- feat(attention): FP8 KV cache support for Hopper SM90 MLA by @qsang-nv in #3694
- [GDN] sm100: support more state dtype by @Observer007 in #3715
- feat: cuTile Grouped MXFP8 Quantization by @philipphack in #3657
- Accept uint8 workspaces in CUTE DSL MLA decode by @leejnau in #3599
- tests: split test_trtllm_gen_fused_moe.py into shards by @feih-nv in #3635
- perf(moe): Enhance CuteDSL NVFP4 MOE Perf by @liyuhannnnn in #3564
- fix(mamba): reject SM120/SM121 in SSDCombined with a clear error by @waynehacking8 in #3668
- docs(gemm): add missing .rst entries for mm_bf16_fp4 and prepare_bf16_fp4_weights by @kangbintNV in #3710
- docs: add missing parameter entries to docstrings and env vars by @kangbintNV in #3627
- rename back by @aleozlx in #3730
- docs: document scale_major_mode param and FLASHINFER_AUTOTUNE_DIR env var by @kangbintNV in #3696
- feat: add mxfp4/nvfp4 quant to moe a2a combine by @IwakuraRein in #3643
- chore: fix enable_pdl for trtllm-gen routing and finalize kernel by @IwakuraRein in #3588
- Bug fix (gdn): Layout contract fix from #3649 by @kahyunnam in #3693
- Ameyn/fix fp32 mtp pool out indices by @ameynaik-hub in #3490
- Prune moe tests by @aleozlx in #3733
- Fix the output allocation consistency of trtllm-gen MoE APIs by @b8zhong in #3678
- feat(comm): Support per-token LoRA Info in MoE a2a comm payloads by @JyChang012 in #3375
- docs(gdn): document missing ssm_state_indices param in gated_delta_rule_mtp by @kangbintNV in #3725
- feat(moe): enable DSFp8 + LoRA delta path by @zetacat in #3708
- Improve GDN prefill perf by ~20-25% (mainloop efficiency) by @jhjpark in #3742
- test: fix CUDA OOM in batch-prefill custom-mask test on 24GB CI GPUs by @waynehacking8 in #3609
- Remove pagesize 16/32 assertion from xqa nvfp4 sm120 by @Njuapp in #3724
New Contributors
- @deng451e made their first contribution in #3510
- @Njuapp made their first contribution in #3534
- @aws-jiadingg made their first contribution in #3522
- @awgu made their first contribution in #3529
- @ruoqianguo made their first contribution in #3532
- @yichengj0 made their first contribution in #3560
- @jhaotingc made their first contribution in #3501
- @elwhyjay made their first contribution in #3405
- @yifeis-nv made their first contribution in #3426
- @lucifer1004 made their first contribution in #3395
- @waynehacking8 made their first contribution in #3615
- @tiffany940107 made their first contribution in #3640
- @CarstyYou made their first contribution in #3562
- @feih-nv made their first contribution in #3635
- @liyuhannnnn made their first contribution in #3564
- @JyChang012 made their first contribution in #3375
- @zetacat made their first contribution in #3708
Full Changelog: v0.6.13rc2...v0.6.14