feat(cute_dsl/moe): deterministic balanced autotune profile inputs by leejnau · Pull Request #3286 · flashinfer-ai/flashinfer

leejnau · 2026-05-11T16:55:43Z

Summary

Reduces fi-vs-trt parity distance for CuteDSL NVFP4 MoE autotune selections by giving the autotuner a deterministic, balanced approx-max-load distribution for token_selected_experts during profiling, plus seeded random tensor initializers for the other dynamic inputs. Brings fi's autotune-profile machinery into structural alignment with trt-llm's inputs_pre_hook mechanism (all 6 of trt-llm's CuteDSL TuningConfigs set this hook).

Three commits on the branch:

8faa2359 — Path A+C: seed all 4 tensor_initializers + port TuningConfig.inputs_pre_hook + CuteDslMoEInputsHelper (the load-bearing change)
a009ca18 — small parity-cleanup: align use_cold_l2_cache=True with the rest of the codebase / trt-llm (provably no-op for DeepSeek-V3, value is forward-looking; flagged separately so reviewers don't conflate it with the A+C perf claim)
819bd648 — pin the CuteDslMoEInputsHelper input-layout contract with a small unit test

Empirical validation

10-run 45-cell sweep on NGC rc14, B200, DeepSeek-V3 (hidden=7168, intermediate=2048, num_experts=256, top_k=8):

	baseline	with this patch	delta
Overall mean |Δ%| from trt parity	8.48%	6.45%	−2.03pp
Small (N=1..256)	10.50%	7.44%	−3.07pp
Mid (N=512..2048)	6.54%	5.80%	−0.74pp
Large (N=4096..16384)	4.35%	4.13%	−0.21pp

Cells closer to parity: 19 / further: 7 / neutral: 19. Cells now within 1% of trt parity: 19 of 45 (was ~12 of 45 baseline).
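
As a reading aid for the table above, here is a minimal sketch of how a mean |Δ%| figure can be computed. It assumes Δ% is the signed percent difference of fi latency relative to trt latency for each cell; the PR does not spell out the formula, and the latency pairs below are hypothetical placeholders, not measurements from this sweep.

```python
from statistics import mean


def percent_delta(fi_ms: float, trt_ms: float) -> float:
    """Signed percent difference of fi latency relative to trt latency (assumed definition)."""
    return (fi_ms - trt_ms) / trt_ms * 100.0


def mean_abs_percent_delta(cells) -> float:
    """Average the absolute per-cell deltas, so slower and faster cells both count as distance."""
    return mean(abs(percent_delta(fi, trt)) for fi, trt in cells)


# Hypothetical (fi_ms, trt_ms) pairs for three cells; illustrative only.
cells = [(0.110, 0.100), (0.095, 0.100), (0.200, 0.200)]
result = mean_abs_percent_delta(cells)
print(f"mean |delta%| = {result:.2f}")
```

Taking the absolute value per cell before averaging is what makes this a distance from parity: a cell where fi is 5% faster counts the same as one where it is 5% slower.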

Trade-offs (deliberately retained for parity gain)

N=4096 EP=1 regresses from +1.24% to +8.89% (single cell; deterministic seed lands on a sub-optimal pick at this bucket).
N=2048 EP=16 still bimodal under the seeded distribution; mean +19.87%.

Test plan

pytest tests/moe/test_cute_dsl_fused_moe.py::TestInputsHelperContract — new unit test pins CuteDslMoEInputsHelper.inputs_pre_hook's input-layout contract (replaces inputs[2], passes through rest, deterministic across instances).
pytest tests/moe/test_cute_dsl_fused_moe.py — existing CuteDSL MoE tests pass (no behavior change on the fast path).
10-run 45-cell parity sweep on B200 rc14: all 450 cells parity_ok=True; numerical equivalence preserved.

Summary by CodeRabbit

New Features
- Optional input-preprocessing callback for autotuning profiles, preserved across derived configs
- Deterministic MoE profiling inputs via a new helper and seeded initialization for reproducible tuning
- Autotune runner enables a cold L2 cache option to improve MoE tuning realism
Tests
- Added CPU tests validating the input-preprocessing hook replaces the expected tensor, preserves others, and is deterministic

coderabbitai · 2026-05-11T16:56:00Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between 5b000bf and 640e32e.

📒 Files selected for processing (1)

flashinfer/fused_moe/cute_dsl/tuner.py

🚧 Files skipped from review as they are similar to previous changes (1)

flashinfer/fused_moe/cute_dsl/tuner.py

📝 Walkthrough

Adds an optional inputs_pre_hook to TuningConfig, implements CuteDslMoEInputsHelper to deterministically replace token_selected_experts for MoE autotuning, wires the helper and seeded input generation into the CuteDSL runner, and adds CPU tests validating the hook's contract and determinism.

Changes

MoE Autotuner Inputs Determinism

Layer / File(s)	Summary
Autotuner Hook Contract `flashinfer/autotuner.py`	`TuningConfig` gains optional `inputs_pre_hook`; `_apply_tuning_overrides` preserves it; `AutoTuner.choose_one` invokes the hook on synthesized tensors before per-tactic profiling.
CuteDslMoEInputsHelper Implementation `flashinfer/fused_moe/cute_dsl/_inputs_helper.py`	New `CuteDslMoEInputsHelper` computes balanced per-local-expert token counts, builds a deterministic `(num_tokens, top_k)` `token_selected_experts` via seeded per-expert `randperm`, and exposes `inputs_pre_hook` that replaces only that tensor and forwards others unchanged.
Runner Integration with Seeded Initialization `flashinfer/fused_moe/cute_dsl/tuner.py`	Runner imports and instantiates `CuteDslMoEInputsHelper`, seeds autotune tensor generators with `torch.Generator.manual_seed(515)`, sets `inputs_pre_hook=self._inputs_helper.inputs_pre_hook` on `TuningConfig`, and enables `use_cold_l2_cache=True`.
Contract Validation Tests `tests/moe/test_cute_dsl_fused_moe.py`	Adds `TestInputsHelperContract` that builds synthetic inputs matching wrapper layout, asserts index-2 `token_selected_experts` is replaced with a fresh tensor preserving shape/dtype while other inputs pass through by identity, and verifies determinism across helper instances.
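
The hook contract summarized in the table can be sketched in plain Python. This is a hedged illustration, not the code in `flashinfer/autotuner.py` or `_inputs_helper.py`: `SketchTuningConfig`, `SketchInputsHelper`, and `profile_inputs` are hypothetical names, lists stand in for tensors, the helper's seed value is assumed, and the replacement uses simple seeded sampling rather than the balanced per-expert distribution the real helper builds.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchTuningConfig:
    """Hypothetical stand-in for TuningConfig; only the optional hook is modeled."""
    inputs_pre_hook: Optional[Callable[[List[object]], List[object]]] = None


class SketchInputsHelper:
    """Hypothetical stand-in for CuteDslMoEInputsHelper."""

    def __init__(self, num_experts: int, top_k: int, seed: int = 515):
        self.num_experts = num_experts
        self.top_k = top_k
        self.seed = seed  # assumed value; fixed so every instance produces the same output

    def inputs_pre_hook(self, inputs: List[object]) -> List[object]:
        num_tokens = len(inputs[2])
        rng = random.Random(self.seed)  # fresh RNG per call keeps the result deterministic
        fresh = [rng.sample(range(self.num_experts), self.top_k) for _ in range(num_tokens)]
        out = list(inputs)  # shallow copy: every other input is forwarded by identity
        out[2] = fresh      # only index 2 (token_selected_experts) is replaced
        return out


def profile_inputs(config: SketchTuningConfig, synthesized: List[object]) -> List[object]:
    """Apply the hook, if present, to synthesized inputs before per-tactic profiling."""
    if config.inputs_pre_hook is not None:
        return config.inputs_pre_hook(synthesized)
    return synthesized


x, weights, scales = object(), object(), object()
token_selected_experts = [[0] * 8 for _ in range(4)]
inputs = [x, weights, token_selected_experts, scales]

config = SketchTuningConfig(inputs_pre_hook=SketchInputsHelper(256, 8).inputs_pre_hook)
hooked = profile_inputs(config, inputs)
hooked_again = SketchInputsHelper(256, 8).inputs_pre_hook(inputs)
```

The three properties the contract test is described as pinning show up here as: `hooked[2]` is a new object with the same shape, the other entries are the same objects, and a second helper instance yields an equal `hooked_again[2]`.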

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

flashinfer-ai/flashinfer#3025: Related edits to CuteDSL MoE autotuning inputs and token_selected_experts generation.
flashinfer-ai/flashinfer#2958: Modifies _apply_tuning_overrides and related tuning-config override logic touched here.
flashinfer-ai/flashinfer#3126: Changes to AutoTuner.choose_one profiling/input handling related to this hook integration.

Suggested reviewers

aleozlx
yzh119
IwakuraRein
samuellees
jiahanc
bkryu
jimmyzho
sricketts

🐰 I seed the tokens, one-two-three,
Reproducible hops for each expert tree,
A hook that swaps just one small slice,
Deterministic profiling — neat and precise,
Hop, tune, repeat — reproducible delight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat(cute_dsl/moe): deterministic balanced autotune profile inputs' accurately reflects the main changes, which add deterministic and balanced profiling inputs for CuteDSL MoE autotuning.
Description check	✅ Passed	The pull request description comprehensively covers the changes, objectives, empirical validation, trade-offs, and test plan, following the repository template structure.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

gemini-code-assist

Code Review

This pull request introduces a mechanism for deterministic and realistic input distributions during MoE autotuning by adding an "inputs_pre_hook" to the "TuningConfig". A new "CuteDslMoEInputsHelper" class implements balanced expert assignment via rejection sampling, replacing the previous random initialization. The "CuteDslMoETuner" is updated to utilize this helper and seeded tensor initializers. Review feedback identifies three issues: the "inputs_pre_hook" is not propagated during tuning overrides, a logic error in the expert assignment loop allows assignments when token counts are zero, and a multi-GPU indexing bug exists in the prioritization logic.

gemini-code-assist · 2026-05-11T17:03:06Z

+        for j, num_tokens_j in enumerate(num_tokens_per_expert):
+            selection_order_j = selection_orders[j].tolist()


If num_tokens_per_expert[j] is 0, the current logic will still assign expert j to one token because the inner loop over selection_order_j is entered and the break condition num_tokens_j <= 0 is only checked after the first assignment. Adding a check at the beginning of the expert loop prevents this and avoids unnecessary computation.

Suggested change

-        for j, num_tokens_j in enumerate(num_tokens_per_expert):
-            selection_order_j = selection_orders[j].tolist()
+        for j, num_tokens_j in enumerate(num_tokens_per_expert):
+            if num_tokens_j <= 0:
+                continue
+            selection_order_j = selection_orders[j].tolist()

This is actually done deliberately to match the TRT-LLM behavior:

https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py#L152-L160

gemini-code-assist · 2026-05-11T17:03:06Z

+            # experts are exhausted. (For typical MoE configs with
+            # num_local_experts < num_experts the prioritized list is
+            # empty; preserved for parity with trt's algorithm.)
+            limit = self.top_k - (self.num_experts - j)


The limit calculation uses the local expert index j, but it should use the global expert index j + self.local_expert_offset to correctly identify tokens at risk of not hitting top_k experts across the entire MoE layer. Without this, the prioritization logic is effectively disabled for all ranks except the first one (where offset=0), breaking parity with the intended algorithm on multi-GPU setups.

Suggested change

-            limit = self.top_k - (self.num_experts - j)
+            limit = self.top_k - (self.num_experts - (j + self.local_expert_offset))

This is actually done deliberately to match the TRT-LLM behavior:

https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py#L146-L147
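
To make the local-versus-global index question concrete, the sketch below evaluates both `limit` expressions for the DeepSeek-V3 shape from the PR summary at EP=16. It only computes the two formulas quoted in this thread; it does not model how `limit` is consumed downstream. The function name is hypothetical, and the offset of the last rank is assumed to be `rank * num_local_experts`.

```python
def prioritization_limits(num_experts, top_k, num_local_experts, local_expert_offset):
    """Return (local_limits, global_limits), one entry per local expert index j."""
    # Expression as ported (local j), matching the quoted diff line.
    local_limits = [top_k - (num_experts - j) for j in range(num_local_experts)]
    # Expression from the review suggestion (global j = j + local_expert_offset).
    global_limits = [
        top_k - (num_experts - (j + local_expert_offset))
        for j in range(num_local_experts)
    ]
    return local_limits, global_limits


# DeepSeek-V3: num_experts=256, top_k=8; EP=16 gives 16 local experts per rank.
# Last rank: offset assumed to be 15 * 16 = 240.
local_limits, global_limits = prioritization_limits(256, 8, 16, 240)
print("local :", local_limits)
print("global:", global_limits)
```

With the local index every value is far below zero, which is consistent with the in-source comment that the prioritized list is empty for typical configs. With the global index, only the last few experts of the last rank reach a positive value.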

_apply_tuning_overrides constructs a new TuningConfig with only a subset of the original config's fields, silently dropping the new inputs_pre_hook field. Any code path that activates override_tuning_buckets or override_round_up would lose the helper and fall back to random profile inputs, defeating the purpose of the balanced-load helper for those paths.

Addresses Gemini review comment on PR flashinfer-ai#3286.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…etracted)

All five audit-driven production PRs are now merged into flashinfer-ai/flashinfer:main (flashinfer-ai#3171, flashinfer-ai#3198, flashinfer-ai#3216, flashinfer-ai#3226, flashinfer-ai#3252). Updates the audit doc and the port-parity bench runbook to reflect this, plus retracts the wrapper-overhead framing for the residuals: CUDA graph replay captures host-side wrapper cost so it cannot be the residual's mechanism. The real residual is GPU-side non-kernel gap time, addressed by PR flashinfer-ai#3286 (deterministic balanced autotune profile) at large-N high-EP cells, and by `cab3ee50` (cutlass-dsl 4.3.x compat-hook removal, on nv-yunzheq's branch, not yet on main) for the small-N decode latency gap.

audit doc: adds a "2026-05-11 post-close update" section at the top listing the merged PR table, the in-flight downstream PRs (flashinfer-ai#3286 deterministic-profile, flashinfer-ai#3292 bench refresh), the wrapper-overhead retraction, and the now-stale thin-adapter refactor recommendation. The existing "Final state (2026-04-29)" section is retitled to "historical snapshot at audit close" since it represents a snapshot not the current authoritative state.

runbook: adds a status notice at the top noting PR flashinfer-ai#3252 is merged and the Step 4-7 collapse anticipated by the runbook's own "Future-proofing" note is now applicable. The collapsed flow is not yet re-validated end-to-end, so the existing 2/2-validated recipe is preserved until then.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ranches

Updates the "2026-05-11 post-close update" section:

- PR flashinfer-ai#3292 (bench refresh): MERGED 2026-05-11 as `4381afc1`. B200 verification on NGC rc14 confirmed PR flashinfer-ai#3126 obviates the f3beb60 pollution workaround at the source (CuteDSL 0.147 ms / TRTLLM 0.144 ms at bs=128 ep=8; clean band matching the May 6 baseline with the workaround applied).
- PR flashinfer-ai#3286 (A+C deterministic autotune profile): still OPEN with Gemini's high-priority comment about inputs_pre_hook propagation addressed in commit `36699b54`. Two medium-priority Gemini comments left for manual response; both mirror trt-llm's algorithm exactly so keeping for parity.

Adds a "Parked alternative branches" subsection listing two non-merged branches kept for future use:

- `bench-moe-deepseek-tighten-autotune-scope` (Codex follow-up to flashinfer-ai#3292, scopes autotune(True) to pre-warm only).
- `nvtx_microbenchmark` rebased to single clean commit on top of main (NVTX markers + cudaProfilerStart/Stop wrap; drops the redundant _force_autotune_off references).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…are accepted as-is

Earlier note said the two medium-priority Gemini comments were "left for manual response". Updated to reflect the actual policy: both mirror trt-llm's algorithm exactly (zero-token-expert loop entry and multi-GPU prioritization index), intentional for parity with trt-llm, no reply will be sent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…(pre-A+C snapshot)

Re-ran the canonical PR flashinfer-ai#2398 sweep (EP=1, 8, 16 × 15 N from 1 to 16384) on current main (post-flashinfer-ai#3292, before PR flashinfer-ai#3286 / A+C merges).

Headline at EP=1 N=128: CuteDSL 0.965 ms vs PR flashinfer-ai#2398's published 0.134 ms. nsys hardware-truth diagnostic at this cell showed the CuteDSL gemm kernel takes ~464 us per call at EP=1 vs ~63 us at EP=8, a 7.3x ratio matching the ~8x theoretical scaling from 256 vs 32 local experts in the grouped GEMM. Per-iter sum (2 gemms + routing + topk + small elementwise) ~988 us at EP=1, matching bench-reported 962 us within run-to-run noise.

Conclusion: today's bench is correctly measuring real GPU kernel work. PR flashinfer-ai#2398's published 134 us was physically impossible for the actual kernel work and should not be used as a baseline.

Adds:

- benchmarks/cute_dsl_moe_pr2398_rerun_2026_05_12_pre_a_c.csv
  Full 45-cell rerun data (3 EPs x 15 N).
- New "2026-05-12 PR flashinfer-ai#2398 rerun + nsys verdict (pre-A+C snapshot)" section in the audit doc with headline table, nsys verdict, and crossover analysis.

The data is a pre-A+C snapshot; PR flashinfer-ai#3286 (A+C deterministic autotune profile) is open as of this rerun. A post-merge full-matrix re-measurement is scheduled to produce the canonical post-audit reference numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qiching · 2026-05-12T18:09:49Z

+                selection_order_j = prioritized + [
+                    i for i in selection_order_j if i not in p_set
+                ]
+            for i in selection_order_j:


Assume num_tokens_on_curr_rank < num_local_experts. Then the base from divmod(num_tokens_on_curr_rank, num_local_experts) is 0, and every expert beyond the remainder should be assigned 0 tokens. For DSv3 with EP=16 and num_tokens=1: average=0.5, extra≈0.83, num_tokens_on_curr_rank=2, divmod(2,16)=(0,2), so experts 0 and 1 should each receive 1 token while experts 2–15 should each receive 0 tokens. However, the current logic causes experts 2–15 to incorrectly receive 1 additional token each.

So the result deviates from the intended allocation.

What do you think?

This is the same issue Gemini flagged earlier in this file, and your example reinforces this. However, TRT-LLM has identical behavior at https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py#L151-L159

Since the goal is a faithful port, ensuring FlashInfer's autotune sees the same profile inputs TRT-LLM does, fixing here would deviate. We could instead fix it here and file an issue to TRT-LLM.
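
The DSv3 EP=16 example from this thread can be reproduced with a small sketch. It is a simplified model under stated assumptions, not the ported function: the helper names are hypothetical, the first `extra` experts are assumed to receive the remainder, and the inner loop ignores details such as skipping tokens that already have top_k experts. It keeps only the property described in the review, namely that the remaining budget is checked after an assignment has been made.

```python
def balanced_counts(num_tokens_on_rank: int, num_local_experts: int):
    """Intended per-expert budget: base for everyone, plus one for the first `extra` experts."""
    base, extra = divmod(num_tokens_on_rank, num_local_experts)
    return [base + 1 if j < extra else base for j in range(num_local_experts)]


def assignments_with_late_check(counts, tokens_available: int):
    """Count assignments per expert when the stop condition runs only after assigning."""
    assigned = []
    for remaining in counts:
        made = 0
        for _token in range(tokens_available):
            made += 1           # the assignment happens first
            remaining -= 1
            if remaining <= 0:  # the budget is checked only afterwards
                break
        assigned.append(made)
    return assigned


# DSv3, EP=16, num_tokens=1: num_tokens_on_curr_rank=2 and 16 local experts.
counts = balanced_counts(2, 16)
ported = assignments_with_late_check(counts, tokens_available=1)
print("intended:", counts)
print("ported  :", ported)
```

In this model the intended budget is one token each for experts 0 and 1 and zero for the rest, while the late check lets every expert claim one assignment, which is the behavior tracked upstream rather than changed in this port.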

qiching · 2026-05-12T18:11:07Z

+            # experts are exhausted. (For typical MoE configs with
+            # num_local_experts < num_experts the prioritized list is
+            # empty; preserved for parity with trt's algorithm.)
+            limit = self.top_k - (self.num_experts - j)


I agree with Gemini; I think we should use the global index here:

global_j = j + self.local_expert_offset
limit = self.top_k - (self.num_experts - global_j)

This was deliberately done for parity with TRT-LLM. TRT-LLM uses local j identically at https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py#L146-L147, and the goal is port alignment. Switching to global j would deviate from that. We could instead fix it here and file an issue to TRT-LLM. (This would also apply to Gemini's other comment.)

qiching

Agree to merge. How about adding a comment in the code with the issue link for these two bug reports, to avoid repeated flagging?

leejnau · 2026-05-14T15:34:03Z

Agree to merge. How about adding a comment in the code with the issue link for these two bug reports, to avoid repeated flagging?

Filed both as a single upstream tracking issue: [Bug]: [CuteDSL MoE] generate_token_selected_experts: ghost-token assignment to zero-allocation experts + dead-on-multi-rank prioritization NVIDIA/TensorRT-LLM#14146
Added inline comments at both sites pointing to the TRT-LLM issue (commit 04b28b3) so future readers don't have to re-flag.

qiching

LGTM now! Thanks for your work.

nv-yunzheq · 2026-05-14T20:00:48Z

/bot run

flashinfer-bot · 2026-05-14T20:02:01Z

GitLab MR !671 has been created, and the CI pipeline #51324686 is currently running. I'll report back once the pipeline job completes.

nv-yunzheq · 2026-05-14T20:02:17Z

+                            shapes,
+                            dtype=torch.uint8,
+                            device=device,
+                            generator=torch.Generator(device=device).manual_seed(515),


Do you want to use a different seed value for each tensor initializer?

The shared seed is ok because the four lambdas draw from different shapes/dtypes/distributions, so the outputs are uncorrelated in practice despite sharing a seed.

…oken + multi-rank-prioritization quirks in generate_token_selected_experts

Adds in-source pointers to NVIDIA/TensorRT-LLM#14146 at the two sites Gemini + qiching flagged, so future readers can find the upstream tracking issue without re-flagging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nv-yunzheq · 2026-05-15T18:05:50Z

/bot run

flashinfer-bot · 2026-05-15T18:06:28Z

GitLab MR !671 has been updated with latest changes, and the CI pipeline #51428234 is currently running. I'll report back once the pipeline job completes.

Setting use_cold_l2_cache=True on CuteDslFusedMoENvfp4Runner.tuning_config (introduced in 9f466b2) triggers NaN in TestCuteDslMoEWrapper::test_wrapper_with_autotune on B200 CI when a prior test in the same file runs with use_cuda_graph=True.

A latent reference cycle in CuteDslMoEWrapper retains CUDA resources across test boundaries:

    CuteDslMoEWrapper -> _runner -> forward_impl (bound method) -> CuteDslMoEWrapper

Cold-L2 cycling during the next autotune profile interacts with the retained state and produces NaN. Four independent interventions all fix the failure on this branch:

- gc.collect() between tests
- Breaking the wrapper-runner cycle directly
- Nulling wrapper-owned CUDA resources
- Weakref trampoline replacing forward_impl=self._forward_with_tactic

Cold-L2 is the empirical trigger; the wrapper-runner reference cycle is the underlying defect. This change unsets the flag to unblock CI. The cycle itself will be addressed separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
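
The reference cycle described in this commit message can be reproduced with a few lines of plain Python. The sketch below uses hypothetical `Wrapper` and `Runner` stand-ins (no CUDA resources involved) and relies on CPython's reference counting: with the cyclic collector disabled, dropping the last user reference does not free the object until `gc.collect()` runs.

```python
import gc
import weakref


class Runner:
    """Stand-in for the runner: stores the callback with a strong reference."""

    def __init__(self, forward_impl):
        self.forward_impl = forward_impl


class Wrapper:
    """Stand-in for the wrapper: hands a bound method of itself to its runner."""

    def __init__(self):
        # A bound method holds a strong reference to self, which closes the cycle:
        # Wrapper -> _runner -> forward_impl (bound method) -> Wrapper
        self._runner = Runner(forward_impl=self._forward_with_tactic)

    def _forward_with_tactic(self, x):
        return x


def cycle_demo():
    finalized = []
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep the cyclic collector from running in the background
    try:
        wrapper = Wrapper()
        weakref.finalize(wrapper, finalized.append, True)
        del wrapper                        # user code drops its only reference
        alive_after_del = not finalized    # still alive: the cycle pins it
        gc.collect()                       # the cyclic collector breaks the cycle
        freed_after_collect = finalized == [True]
    finally:
        if gc_was_enabled:
            gc.enable()
    return alive_after_del, freed_after_collect


alive_after_del, freed_after_collect = cycle_demo()
print(alive_after_del, freed_after_collect)
```

This mirrors the first intervention in the list above: an explicit `gc.collect()` between tests releases whatever the cycle was pinning.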

nv-yunzheq · 2026-05-16T01:52:32Z

/bot run

flashinfer-bot · 2026-05-16T01:52:56Z

GitLab MR !671 has been updated with latest changes, and the CI pipeline #51471201 is currently running. I'll report back once the pipeline job completes.

…shinfer-ai#3340 fix

Updates the post-close audit doc and reproduction runbook to capture the new state of in-flight follow-up PRs after a latent CuteDslMoEWrapper lifecycle bug was uncovered on 2026-05-15.

Audit doc (cute_dsl_moe_port_audit.md):

- Downstream PRs status updated 2026-05-12 -> 2026-05-16
- PR flashinfer-ai#3286 entry: noted cold-L2 unset workaround (640e32e) and the surfaced wrapper-runner reference cycle
- PR flashinfer-ai#3340 entry added (weakref trampoline lifecycle fix)
- PR flashinfer-ai#3328 entry added (SGLang Phase 1 cudaMemsetAsync wrapper)
- Other residual follow-ups: SGLang Phase 1 promoted from deferred to in-flight; cold-L2 re-enable (task flashinfer-ai#108) added as new deferred item
- New dated section "2026-05-15 wrapper-runner reference cycle discovered + PR flashinfer-ai#3340 fix": full diagnostic chain, Codex's 4-way convergent experiments, weakref trampoline code, connection to task flashinfer-ai#103, and the bisectability process lesson from audit task flashinfer-ai#91

Runbook (cute_dsl_moe_port_runbook.md):

- "Last updated" 2026-05-11 -> 2026-05-16
- Added "In-flight follow-up PRs" section noting flashinfer-ai#3286 / flashinfer-ai#3340 / flashinfer-ai#3328 do not change the bench procedure

The audit's task flashinfer-ai#91 framing of use_cold_l2_cache=True as "provably no-op for DeepSeek V3" remains correct for the buffer-cycle bytes-math but is empirically falsified for the autotune profile path: cold-L2 cycling during profile interacts with the retained CUDA state from a prior CUDA-graph wrapper test (via the wrapper-runner reference cycle) and produces NaN. The cycle is the load-bearing defect; cold-L2 was the trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nv-yunzheq

Approved as the unit test passed

…e43023

PR flashinfer-ai#3286 (deterministic balanced autotune profile + cold-L2 unset workaround, all 6 commits squashed) merged to flashinfer-ai/main on 2026-05-17T04:17:35Z by nv-yunzheq.

Audit doc:

- Downstream PRs block: 2026-05-16 -> 2026-05-17; flashinfer-ai#3286 OPEN -> MERGED with squash-commit `ce430238`.
- Cold-L2 re-enable item: clarified that flashinfer-ai#3286 has merged; the deferral now waits only on PR flashinfer-ai#3340.
- 2026-05-12 PR flashinfer-ai#2398 rerun section: added "2026-05-17 update" line noting flashinfer-ai#3286 merge unblocks task flashinfer-ai#102 (post-A+C rerun).
- 2026-05-15 wrapper-cycle section: workaround paragraph past-tensed and amended with the merge SHA.

Runbook:

- "Last updated" 2026-05-16 -> 2026-05-17.
- In-flight follow-up PRs section split: flashinfer-ai#3286 moved to "Recently merged"; flashinfer-ai#3340 and flashinfer-ai#3328 retained as "Still in-flight"; closing note refocused on task flashinfer-ai#108 contingent on flashinfer-ai#3340.

Bench procedure (Steps 1-11) unchanged: flashinfer-ai#3286 affects autotune profile inputs, not the bench reproduction recipe. Once the audit branch rebases onto main >= ce43023, post-A+C bench numbers (task flashinfer-ai#102) are runnable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…4fe1ff

PR flashinfer-ai#3328 (SGLang Phase 1 dense moe_output_memset_inplace wrapper around cudaMemsetAsync) merged to flashinfer-ai/main on 2026-05-18T21:17:39Z by nv-yunzheq.

Audit doc:

- Header: "last updated 2026-05-17" -> "2026-05-18"; 31-day -> 32-day investigation window.
- Downstream PRs block: 2026-05-17 -> 2026-05-18; flashinfer-ai#3328 OPEN -> MERGED with squash-commit `34fe1ff0`; framing shifted from "Promotes follow-up flashinfer-ai#11 to in-flight" to "Closes follow-up flashinfer-ai#11".
- Other residual follow-ups: flashinfer-ai#3328 moved from "in flight" to "CLOSED".
- Follow-up flashinfer-ai#11 Phase 1 body: status "in flight ... awaiting review" -> "MERGED ... by nv-yunzheq".

Runbook:

- "Last updated" 2026-05-17 -> 2026-05-18.
- Recently merged follow-up PRs: split into two bullets (flashinfer-ai#3286 + flashinfer-ai#3328).
- Still in-flight follow-up PR(s): tightened to singular (only flashinfer-ai#3340 remains).

No bench-procedure changes (Phase-1 wrapper change is server-side, not bench-tool-side). Audit branch will pick up flashinfer-ai#3328's content on its next rebase onto main >= 34fe1ff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A force-tactic ablation on 2026-05-20 (5 cells × 5 forced runs, all parity_ok=True) demonstrates that forcing fi to use trt's autotune-selected tactic closes 77-100% of the gap at the EP=16/32 small/mid-N regression cells:

    EP=1 N=4096     : +8.3% → +0.0% (fully closed)
    EP=16 N=8       : +18.5% → -4.0% (fi wins after forcing)
    EP=16 N=1024    : +17.2% → +4.0% (77% closed)
    EP=16/32 large-N: fi already wins; forcing changes nothing

The bias is dominantly tactic-pick-driven. This invalidates the 2026-05-19 hypothesis (autotune-candidate-set divergence due to missing `ceil_div` filter in fi's `get_gemm2_valid_tactics`):

1. At the cells where the bias appears, `permuted_m` is always ≥ 2048 due to `get_max_num_permuted_tokens`'s formula. The `ceil_div(m, mma_tiler_mn[0]) ≥ cluster_shape_mn[0]` predicate is always true; the filter never rejects in practice.
2. fi DOES already call `can_implement` on both kernel templates at autotune time (tuner.py:435-467), and the kernel-level `can_implement` bodies are byte-identical between fi and trt (subagent diff verified).
3. So fi and trt accept the same candidate set at these cells. The candidate-set framing doesn't explain why fi picks different tactics.

The ACTUAL divergence is in the autotune TuningConfig:

- fi (tuner.py:295-364): `use_cold_l2_cache` intentionally unset (workaround for the latent wrapper-cycle bug per PR flashinfer-ai#3286's `640e32e7`)
- trt (cute_dsl_custom_ops.py:2032-2054): `use_cold_l2_cache=True`

With cold-L2 ON, autotune profile measures conservative timings and picks tactics that are robustly fast under cold-cache conditions. With cold-L2 OFF, autotune profile measures warm-L2 timings; some tactics look fast due to L2-hit reuse during back-to-back profile iterations but aren't actually faster in production. fi currently picks tactics that "look fast" under warm L2 but aren't robustly fast, which matches the ablation finding that trt's tactic is faster on fi than fi's autotune-chosen tactic.

Two file edits:

1. Header date stamp: 2026-05-19 → 2026-05-20; revised reading order; framing pivot from candidate-set to cold-L2.
2. New top-of-doc "2026-05-20 ablation + cold-L2 mechanism" section labeled "read this FIRST". Includes the ablation table, the candidate-set refutation, the cold-L2 mechanism, and the task status changes (flashinfer-ai#125 deleted as refuted; flashinfer-ai#108 promoted to load-bearing perf fix; new task flashinfer-ai#126 to stage the cold-L2 re-enable branch ready for flashinfer-ai#3340 landing).
3. The 2026-05-19 tactic-divergence section reordered to THIRD, marked HISTORICAL with a banner noting the mechanism conclusion has been revised. Observations (fi shows 9-10 distinct signatures, trt 1-3) still valid as empirical fact; only the proposed mechanism (missing ceil_div) was wrong.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
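
The "percent of gap closed" figures in the ablation can be checked with a short calculation. The formula below is an assumption about how the 77% figure was derived (share of the original gap removed by forcing trt's tactic); it reproduces the quoted value for EP=16 N=1024, and the inputs are the before/after percentages stated in the commit message.

```python
def gap_closed_percent(before_pct: float, after_pct: float) -> float:
    """Share of the original fi-vs-trt gap removed by forcing the tactic, in percent."""
    return (before_pct - after_pct) / before_pct * 100.0


# Before/after gaps as quoted in the ablation summary.
closures = {
    "EP=1 N=4096": gap_closed_percent(8.3, 0.0),
    "EP=16 N=8": gap_closed_percent(18.5, -4.0),
    "EP=16 N=1024": gap_closed_percent(17.2, 4.0),
}
for cell, closed in closures.items():
    print(f"{cell}: {closed:.1f}% closed")
```

Under this definition a value above 100% means forcing overshoots parity, which corresponds to the "fi wins after forcing" cell.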

Three edits in response to PR flashinfer-ai#3340 review:

1. flashinfer/fused_moe/cute_dsl/tuner.py: update the use_cold_l2_cache TuningConfig comment. The previous block described the wrapper-cycle bug that motivated unsetting the flag; with the cycle fixed by this PR, that description becomes stale on merge. Reworded to note the fix and that re-enabling cold-L2 is a follow-up.
2. tests/moe/test_cute_dsl_fused_moe.py test_cuda_graph_wrapper_lifetime_before_autotune: tighten `assert finalized` to `assert finalized == [True]` so the test verifies the finalizer fired exactly once.
3. tests/moe/test_cute_dsl_fused_moe.py Add test_cuda_graph_wrapper_lifetime_after_autotune: parallel to the existing pre-autotune lifetime test but wraps the warmup call in `with autotune(True):` to exercise the post-profiling code path (where the autotuner's profile pass actively reaches into the runner and the runner's reference back to the wrapper). Clears the autotuner cache immediately before the autotune block so a prior test's cache hit can't bypass the profile pass; asserts a `CuteDslMoEWrapper::run` entry exists in the cache afterward to confirm profiling actually ran. Same gc.disable() / try-finally harness as the existing test.

Branch was rebased onto current main HEAD to bring in the use_cold_l2_cache comment block (added in PR flashinfer-ai#3286).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

## 📌 Description

`CuteDslMoEWrapper` currently passes `self._forward_with_tactic` as a bound method into `CuteDslFusedMoENvfp4Runner`, creating a strong reference cycle: `wrapper -> runner -> bound method -> wrapper`. When the wrapper is used with `use_cuda_graph=True`, this can keep wrapper-owned CUDA graph resources alive after user code has dropped the wrapper, until Python cyclic GC eventually runs.

This PR replaces that bound-method callback with a weakref trampoline. The runner can still call into a live wrapper, but it no longer owns the wrapper lifetime. This prevents stale wrapper CUDA resources from surviving across same-process tests or later autotune runs.

## 🔍 Related Issues

#3286 #3301 #3252

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

Adds a focused regression test that warms a CUDA-graph wrapper, verifies it is finalized before cyclic GC, and then runs a subsequent autotuned wrapper call to ensure the output remains NaN-free.

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved handling and cleanup of CUDA-graph wrappers to prevent resource leaks and provide a clear error when a wrapper is no longer available.
* **Tests**
  * Added lifetime tests covering CUDA-graph wrappers before and after autotune; verify stable, non-NaN outputs during autotune.
* **Documentation**
  * Updated comment about cold-L2 cache behavior and noted follow-up to re-enable it once a related issue is addressed.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
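
A weakref trampoline of the kind this description refers to can be sketched as follows. This is an illustration under assumptions, not the merged implementation: `make_trampoline`, `Wrapper`, and `Runner` are hypothetical names, and raising `RuntimeError` for a dead wrapper is an assumed choice for the "clear error" behavior.

```python
import gc
import weakref


class Runner:
    """Stand-in for the runner: stores whatever callback it is given."""

    def __init__(self, forward_impl):
        self.forward_impl = forward_impl


def make_trampoline(owner, method_name: str):
    """Return a callable that reaches the owner through a weak reference only."""
    owner_ref = weakref.ref(owner)

    def trampoline(*args, **kwargs):
        target = owner_ref()
        if target is None:
            # Assumed error type; the source only says a clear error is provided.
            raise RuntimeError("wrapper is no longer available")
        return getattr(target, method_name)(*args, **kwargs)

    return trampoline


class Wrapper:
    """Stand-in for the wrapper: the runner no longer holds a strong reference back."""

    def __init__(self):
        self._runner = Runner(forward_impl=make_trampoline(self, "_forward_with_tactic"))

    def _forward_with_tactic(self, x):
        return x * 2


def trampoline_demo():
    finalized = []
    gc_was_enabled = gc.isenabled()
    gc.disable()  # prove the wrapper is freed without help from the cyclic collector
    try:
        wrapper = Wrapper()
        weakref.finalize(wrapper, finalized.append, True)
        live_result = wrapper._runner.forward_impl(21)  # works while the wrapper is alive
        runner = wrapper._runner
        del wrapper                                     # no cycle: freed immediately
        freed_without_gc = finalized == [True]
        try:
            runner.forward_impl(1)
            stale_call_raised = False
        except RuntimeError:
            stale_call_raised = True
    finally:
        if gc_was_enabled:
            gc.enable()
    return live_result, freed_without_gc, stale_call_raised


live_result, freed_without_gc, stale_call_raised = trampoline_demo()
print(live_result, freed_without_gc, stale_call_raised)
```

The `gc.disable()` harness follows the approach the lifetime tests are described as using: if the finalizer fires while the cyclic collector is off, the wrapper's lifetime is governed by reference counting alone.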

…8f4534

PR flashinfer-ai#3340 (cute_dsl: avoid MoE wrapper runner reference cycle) was merged 2026-05-21T16:56:58Z as squash-commit `18f45345` by nv-yunzheq. With flashinfer-ai#3286 (2026-05-17) and flashinfer-ai#3340 (2026-05-21) both landed, task flashinfer-ai#108 (re-enable `use_cold_l2_cache=True`) is unblocked; branch `cute-dsl-moe-tuner-reenable-cold-l2` (HEAD `97c89c2f` on top of upstream/main `18f45345`) is staged locally with a single-commit diff (comment swap + flag flip).

Audit doc updates:

- Header date stamp 2026-05-20 -> 2026-05-21; 34-day -> 35-day investigation window; framing line notes PR flashinfer-ai#3340 merged and task flashinfer-ai#108 unblocked. Underlying TuningConfig-divergence finding (2026-05-20) and reading order otherwise unchanged.
- "TuningConfig divergences" table row for `use_cold_l2_cache`: "gated on PR flashinfer-ai#3340" -> "PR flashinfer-ai#3340 merged 2026-05-21 as `18f45345`, task flashinfer-ai#108 now unblocked".
- "Task status changes 2026-05-20" block: task flashinfer-ai#108 marked unblocked, task flashinfer-ai#126 marked completed (branch staged locally), new task flashinfer-ai#129 added for the post-enable cold-L2 re-sweep that becomes the canonical post-audit reference.
- Status-caveat paragraph: "Re-enabling cold L2 cache (task flashinfer-ai#108) is blocked on PR flashinfer-ai#3340" -> "PR flashinfer-ai#3340 merged 2026-05-21 unblocked task flashinfer-ai#108; ... a cold-L2 re-sweep (task flashinfer-ai#129) becomes the canonical reference once it lands".
- Other candidate mechanisms list, cold-L2 bullet: gate language updated.
- Downstream PRs block: PR flashinfer-ai#3286 framing tightened ("PR flashinfer-ai#3340 (merged 2026-05-21) shipped the underlying cycle fix"); PR flashinfer-ai#3340 entry rewritten from "OPEN 2026-05-16, mergeable" to "MERGED 2026-05-21T16:56:58Z as squash-commit `18f45345`" with the three pre-squash commits documented (`e471ae73` fix, `7fda2948` test, `fb440413` qiching review responses).
- Residual follow-ups block: "Re-enable use_cold_l2_cache ... deferred until PR flashinfer-ai#3340" -> "unblocked 2026-05-21; branch staged locally; task flashinfer-ai#129 covers the canonical re-sweep".
- 2026-05-15 dated section: PR flashinfer-ai#3340 line "single commit `ebb192a4`, 2026-05-16" -> "opened 2026-05-16, MERGED 2026-05-21 as squash-commit `18f45345`".

Runbook updates:

- Header date stamp 2026-05-19 -> 2026-05-21; framing line notes PR flashinfer-ai#3340 merged and task flashinfer-ai#108 unblocked, task flashinfer-ai#129 added.
- "Still in-flight follow-up PR" block: PR flashinfer-ai#3340 moved from in-flight to recently-merged (third entry in that list); "in-flight" wording removed; final paragraph rewritten to note all three follow-ups landed, task flashinfer-ai#108 unblocked, task flashinfer-ai#129 covers the canonical post-audit re-sweep.

Memories (separate, not in this commit since they live on Mac): `reference_tactic_divergence_ep_scaling.md`, `reference_kernel_coverage_1to1.md`, `project_cutedsl_wrapper_cycle_bug.md`, `project_cutedsl_moe_fp4_port_audit_closed.md`, and `MEMORY.md` (index) have all received the same gate-lift / merged-status updates; the frontmatter name in `project_cutedsl_wrapper_cycle_bug.md` was updated to include the merge tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…TuningConfig (#3384)

## 📌 Description

Sets `use_cold_l2_cache=True` on the autotuner `TuningConfig` in `flashinfer/fused_moe/cute_dsl/tuner.py`, matching TRT-LLM's `CuteDslFusedMoENvfp4Runner.tuning_config`. With cold-L2 ON, the autotuner flushes L2 between profile iterations and measures conservative timings, so the picked tactic is robustly fast under cold-cache conditions; without it, back-to-back iterations of the same tactic benefit from L2-hit reuse and bias the pick toward tactics that look fast during profiling but aren't faster in production.

## 🔍 Related Issues

#3286 #3340

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved tuner's cold L2 cache measurement behavior for more accurate performance profiling.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
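The warm-cache bias described above can be shown with a toy simulation. This is a sketch under invented assumptions, not FlashInfer's autotuner: the function `pick_tactic`, the tactic names, and the per-iteration timings are hypothetical, and a single boolean stands in for L2 state. Each tactic has a cold time (first touch) and a warm time (data already in cache).

```python
def pick_tactic(tactics, iterations, flush_between_iterations):
    """Return (best_name, mean_times) for simulated back-to-back profiling.

    tactics maps a name to (cold_ms, warm_ms). Both numbers are invented
    for illustration and are not measurements of any real kernel.
    """
    means = {}
    for name, (cold_ms, warm_ms) in tactics.items():
        cache_warm = False  # every tactic starts profiling with a cold cache
        total = 0.0
        for _ in range(iterations):
            total += warm_ms if cache_warm else cold_ms
            # Without a flush, the next iteration hits data left in cache.
            cache_warm = not flush_between_iterations
        means[name] = total / iterations
    best = min(means, key=means.get)
    return best, means


# Hypothetical tactics: one gains a lot from cache reuse, one gains little.
TACTICS = {
    "reuse_heavy": (10.0, 4.0),
    "streaming": (8.0, 7.0),
}

warm_pick, warm_means = pick_tactic(TACTICS, 10, flush_between_iterations=False)
cold_pick, cold_means = pick_tactic(TACTICS, 10, flush_between_iterations=True)
```

With these made-up numbers, profiling without a flush selects `reuse_heavy`, because nine of its ten iterations run warm. Flushing between iterations selects `streaming`, the tactic that is faster when every call starts cold.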