Release v0.6.14 · flashinfer-ai/flashinfer

What's Changed

fix: use explicit .ptr on scalar struct fields in monolithic MLA decode to get cleaner logs by @jdebache in #3458
fix: One sided MOE A2A warp token policy hangs in some cases, disable… by @djns99 in #3371
test: correct XQA NVFP4 skip reason to SM120/SM121 by @deng451e in #3510
fix XQA NVFP4 head dim by @Njuapp in #3534
tests: split test_trtllm_gen_attention.py into prefill / decode / decode-xqa shards by @bkryu in #3162
docs(misc): expose DiT/RoPE norms; new RSTs for GDN decode/prefill/Mamba by @kangbintNV in #3446
perf: cache cudaGetDeviceProperties in CudaDevice (fmha_v2) by @aws-jiadingg in #3522
FMHAv2 on SM120 for head_dim 256/512 + sliding-window masks by @dbari in #3518
feat: add FlashInfer Trace Apply by @yongwww in #3240
Fix smem race in FilteredTopK overflow refinement by @awgu in #3529
Extend Require Workspace Size and Disable Overlapping Critical Workspace Sections w/ 'auto' backend + autotuning for MLA decode by @Vinnie6167 in #3465
perf(sampling): Optimize top_k_top_p_sampling_from_logits/from_probs for large-vocab small-k sampling by @bkryu in #3461
Use an environment variable to control the number of reserved SMs for overlapping in TRT-LLM fused MoE by @jinyangyuan-nvidia in #3483
Remove excessive CuteDslMoEWrapper memory allocation by @nvjullin in #3404
docs: close v0.6.13 doc-check gaps + fix(moe) misleading topk_indices ICHECK message by @kangbintNV in #3546
Unified MoE API: MoELayer with cross-backend NVFP4 autotune by @aleozlx in #3093
Support smaller DSv4 sparse MLA head counts by @PerkzZheng in #3545
feat(bench): add GDN routines to flashinfer_benchmark.py by @bkryu in #3572
Add BF16 MoE SwiGLU OA params by @ruoqianguo in #3532
perf(gemm): update mm_fp4 b12x SM120 NVFP4 dense GEMM kernel by @yichengj0 in #3560
feat: add sm90 delta rule dsl prefill by @guangyunh-nv in #3477
fix: headDim=512 GQA decode (#3343) by @djmmoss in #3393
Add Qidi Sang to CODEOWNERS by @saltyminty in #3593
[feat] Add gated tanh-GELU (GeluTanh) activation to CUTLASS fused MoE (GEMMA 4) by @jhaotingc in #3501
[chore] Add jiahanc to GDN committer by @jiahanc in #3600
fix(ci): skip TRT-LLM Gen BF16 MoE SwiGLU test outside SM10.x by @ruoqianguo in #3580
fix(mla): warn when backend='auto' falls back to non-Blackwell kernel on SM>=100 by @elwhyjay in #3405
[fix] Fix SM100 GDN prefill hang by @jiahanc in #3581
feat: improve gated delta rule benchmark script by @guangyunh-nv in #3616
feat: delta rule work with zero length sequence by @guangyunh-nv in #3536
feat(cutile): introduce cuTile backend (mm_bf16 + bmm_bf16 + gemm_fp8_nt_groupwise) by @yifeis-nv in #3426
chore: remove leftover cpp srcs from #3477 by @guangyunh-nv in #3613
feat: add mxfp8 quant to moe a2a combine by @IwakuraRein in #3376
fix: scope trtllm-gen block_size guard to its own backend by @elwhyjay in #3428
feat(attention): add SM120 sparse MLA kernels by @lucifer1004 in #3395
test: match test output scaling with cutedsl kernel (scale before bf16 cast) by @saltyminty in #3596
fix(sampling): fix shared-memory race and out-of-range token id in SamplingFromLogitsKernel by @kahyunnam in #3624
feat(attention): head_dim=512 support for attention prefill & decode for Gemma 4 on SM120/121 by @bkryu in #3576
[fix] explicitly validate grouped_gemm_nt_masked layout contract to prevent NaN contamination by @kahyunnam in #3574
fix(gemm): relax b12x FP4 K constraint from 128 to 32 (TMA alignment) by @yichengj0 in #3646
Add MXFP8 MoE SwiGLU OA parameters by @ruoqianguo in #3504
Add support for non-multiple VEC_COLS in fused_rmsnorm_silu for bf16/fp8 by @xueweilnvidia in #3417
fix(topk): eliminate multi-CTA radix top-k stream hangs on SM120/SM121 by @waynehacking8 in #3615
feat: add sm120 delta rule dsl prefill by @guangyunh-nv in #3479
Add SM120 NVFP4 attention JIT path by @tiffany940107 in #3640
Add MXFP8 MoE GEMM entry (cute SM120 backend) by @CarstyYou in #3562
feat: Add BF16_FP4 GEMM with cuDNN and CuTe-DSL backends for SM120/121 for W4A16 workloads by @bkryu in #3597
perf(gdn): make GDN kernels compilation batch-size agnostic (support dynamic batch shapes for vLLM integration) by @kahyunnam in #3649
feat: add sm90 cp delta rule dsl by @guangyunh-nv in #3481
fix(sampling): terminate top-p search at adjacent float bounds by @djmmoss in #3623
fix: fix tinygemm barrier bug by @yweng0828 in #3630
feat(gdn): BF16 state recovery/decode kernel with per-request K and f… by @ameynaik-hub in #3502
feat(benchmark): add MLA --mla_is_var_seq / --mla_cute_dsl_impl knobs by @lunarz-dev in #3695
Enable SwapAB in mm_fp4 cute-dsl backend when M is not a multiple of 8. by @b8zhong in #3667
fix(moe): Fix unbounded weight-cache growth in b12x MoE by @bkryu in #3709
Add Relu2 + ungated MoE to CuteDSL MoE by @b8zhong in #3642
trtllm_batch_decode_with_kv_cache_mla trtllm-gen backend cum_seq_lens_q support by @saltyminty in #3238
fix: pass XQA NVFP4 scale-factor strides by @Njuapp in #3608
feat(attention): FP8 KV cache support for Hopper SM90 MLA by @qsang-nv in #3694
[GDN] sm100: support more state dtype by @Observer007 in #3715
feat: cuTile Grouped MXFP8 Quantization by @philipphack in #3657
Accept uint8 workspaces in CUTE DSL MLA decode by @leejnau in #3599
tests: split test_trtllm_gen_fused_moe.py into shards by @feih-nv in #3635
perf(moe): Enhance CuteDSL NVFP4 MOE Perf by @liyuhannnnn in #3564
fix(mamba): reject SM120/SM121 in SSDCombined with a clear error by @waynehacking8 in #3668
docs(gemm): add missing .rst entries for mm_bf16_fp4 and prepare_bf16_fp4_weights by @kangbintNV in #3710
docs: add missing parameter entries to docstrings and env vars by @kangbintNV in #3627
rename back by @aleozlx in #3730
docs: document scale_major_mode param and FLASHINFER_AUTOTUNE_DIR env var by @kangbintNV in #3696
feat: add mxfp4/nvfp4 quant to moe a2a combine by @IwakuraRein in #3643
chore: fix enable_pdl for trtllm-gen routing and finalize kernel by @IwakuraRein in #3588
Bug fix (gdn): Layout contract fix from #3649 by @kahyunnam in #3693
Ameyn/fix fp32 mtp pool out indices by @ameynaik-hub in #3490
Prune moe tests by @aleozlx in #3733
Fix the output allocation consistency of trtllm-gen MoE APIs by @b8zhong in #3678
feat(comm): Support per-token LoRA Info in MoE a2a comm payloads by @JyChang012 in #3375
docs(gdn): document missing ssm_state_indices param in gated_delta_rule_mtp by @kangbintNV in #3725
feat(moe): enable DSFp8 + LoRA delta path by @zetacat in #3708
Improve GDN prefill perf by ~20-25% (mainloop efficiency) by @jhjpark in #3742
test: fix CUDA OOM in batch-prefill custom-mask test on 24GB CI GPUs by @waynehacking8 in #3609
Remove pagesize 16/32 assertion from xqa nvfp4 sm120 by @Njuapp in #3724

New Contributors

@deng451e made their first contribution in #3510
@Njuapp made their first contribution in #3534
@aws-jiadingg made their first contribution in #3522
@awgu made their first contribution in #3529
@ruoqianguo made their first contribution in #3532
@yichengj0 made their first contribution in #3560
@jhaotingc made their first contribution in #3501
@elwhyjay made their first contribution in #3405
@yifeis-nv made their first contribution in #3426
@lucifer1004 made their first contribution in #3395
@waynehacking8 made their first contribution in #3615
@tiffany940107 made their first contribution in #3640
@CarstyYou made their first contribution in #3562
@feih-nv made their first contribution in #3635
@liyuhannnnn made their first contribution in #3564
@JyChang012 made their first contribution in #3375
@zetacat made their first contribution in #3708

Full Changelog: v0.6.13rc2...v0.6.14
