feat(cute_dsl/moe): deterministic balanced autotune profile inputs #3286
Code Review
This pull request introduces a mechanism for deterministic and realistic input distributions during MoE autotuning by adding an "inputs_pre_hook" to the "TuningConfig". A new "CuteDslMoEInputsHelper" class implements balanced expert assignment via rejection sampling, replacing the previous random initialization. The "CuteDslMoETuner" is updated to utilize this helper and seeded tensor initializers. Review feedback identifies three issues: the "inputs_pre_hook" is not propagated during tuning overrides, a logic error in the expert assignment loop allows assignments when token counts are zero, and a multi-GPU indexing bug exists in the prioritization logic.
for j, num_tokens_j in enumerate(num_tokens_per_expert):
    selection_order_j = selection_orders[j].tolist()
If num_tokens_per_expert[j] is 0, the current logic will still assign expert j to one token because the inner loop over selection_order_j is entered and the break condition num_tokens_j <= 0 is only checked after the first assignment. Adding a check at the beginning of the expert loop prevents this and avoids unnecessary computation.
Suggested change:
- for j, num_tokens_j in enumerate(num_tokens_per_expert):
-     selection_order_j = selection_orders[j].tolist()
+ for j, num_tokens_j in enumerate(num_tokens_per_expert):
+     if num_tokens_j <= 0:
+         continue
+     selection_order_j = selection_orders[j].tolist()
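The effect described in this comment can be reproduced with a minimal sketch. It is a simplified stand-in, not the code under review: the `assign` function, its `skip_empty` flag, and the two-expert inputs are hypothetical, and the real loop applies further per-token conditions that are omitted here.

```python
def assign(num_tokens_per_expert, selection_orders, skip_empty):
    """Map each expert index to the token indices it receives.

    The break condition is checked only after an assignment, which is the
    behaviour the review comment describes.
    """
    assigned = {}
    for j, num_tokens_j in enumerate(num_tokens_per_expert):
        if skip_empty and num_tokens_j <= 0:
            continue  # the guard proposed in the suggested change
        for i in selection_orders[j]:
            assigned.setdefault(j, []).append(i)
            num_tokens_j -= 1
            if num_tokens_j <= 0:
                break
    return assigned


counts = [1, 0]            # expert 1 is supposed to receive no tokens
orders = [[0, 1], [1, 0]]  # hypothetical per-expert token selection orders

as_ported = assign(counts, orders, skip_empty=False)
guarded = assign(counts, orders, skip_empty=True)
```

Without the guard, expert 1 still picks up one token even though its count is 0; with the guard it receives none.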
This is actually done deliberately to match the TRT-LLM behavior:
# experts are exhausted. (For typical MoE configs with
# num_local_experts < num_experts the prioritized list is
# empty; preserved for parity with trt's algorithm.)
limit = self.top_k - (self.num_experts - j)
The limit calculation uses the local expert index j, but it should use the global expert index j + self.local_expert_offset to correctly identify tokens at risk of not hitting top_k experts across the entire MoE layer. Without this, the prioritization logic is effectively disabled for all ranks except the first one (where offset=0), breaking parity with the intended algorithm on multi-GPU setups.
Suggested change:
- limit = self.top_k - (self.num_experts - j)
+ limit = self.top_k - (self.num_experts - (j + self.local_expert_offset))
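To make the difference between the two index choices concrete, here is a small arithmetic sketch. The values are assumptions chosen for illustration (256 experts, top_k of 8, 16 local experts, and an offset of 240 for the last rank at EP=16); the `limit` helper only mirrors the expression quoted above and is not the library's function.

```python
def limit(top_k, num_experts, expert_index):
    # Same expression as in the quoted line, with the index made explicit.
    return top_k - (num_experts - expert_index)


top_k, num_experts, num_local_experts = 8, 256, 16  # assumed DSv3-like shape
local_expert_offset = 240                           # assumed: last rank at EP=16

with_local_index = [limit(top_k, num_experts, j) for j in range(num_local_experts)]
with_global_index = [
    limit(top_k, num_experts, j + local_expert_offset)
    for j in range(num_local_experts)
]
```

Under these assumptions the local index gives values from -248 to -233 on every rank, while the global index gives -8 to 7 on the last rank. That is consistent with the in-source comment that the prioritized list is empty for typical configurations when the local index is used.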
This is actually done deliberately to match the TRT-LLM behavior:
_apply_tuning_overrides constructs a new TuningConfig with only a subset of the original config's fields, silently dropping the new inputs_pre_hook field. Any code path that activates override_tuning_buckets or override_round_up would lose the helper and fall back to random profile inputs -- defeating the purpose of the balanced-load helper for those paths. Addresses Gemini review comment on PR flashinfer-ai#3286. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
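The failure mode in this commit message (rebuilding a config from a subset of its fields) can be sketched as follows. The `TuningConfig` below is a hypothetical three-field stand-in, not FlashInfer's class, and `dataclasses.replace` is shown as one way to keep unspecified fields, not necessarily the approach the fix took.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class TuningConfig:  # hypothetical stand-in with only three fields
    tuning_buckets: Tuple[int, ...] = ()
    round_up: bool = False
    inputs_pre_hook: Optional[Callable] = None


def apply_overrides_lossy(cfg, buckets):
    # Rebuilds the config from a hand-picked subset of fields, so any field
    # added later (here inputs_pre_hook) silently falls back to its default.
    return TuningConfig(tuning_buckets=buckets, round_up=cfg.round_up)


def apply_overrides_preserving(cfg, buckets):
    # Copies every field and overrides only the ones named explicitly.
    return replace(cfg, tuning_buckets=buckets)


def balanced_hook(inputs):  # hypothetical pre-hook
    return inputs


base = TuningConfig(inputs_pre_hook=balanced_hook)
lossy = apply_overrides_lossy(base, (1, 2, 4))
preserved = apply_overrides_preserving(base, (1, 2, 4))
```

Here `lossy.inputs_pre_hook` is `None` while `preserved.inputs_pre_hook` is still `balanced_hook`.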
…etracted)
All five audit-driven production PRs are now merged into flashinfer-ai/flashinfer:main (flashinfer-ai#3171, flashinfer-ai#3198, flashinfer-ai#3216, flashinfer-ai#3226, flashinfer-ai#3252). Updates the audit doc and the port-parity bench runbook to reflect this, plus retracts the wrapper-overhead framing for the residuals: CUDA graph replay captures host-side wrapper cost so it cannot be the residual's mechanism. The real residual is GPU-side non-kernel gap time, addressed by PR flashinfer-ai#3286 (deterministic balanced autotune profile) at large-N high-EP cells, and by `cab3ee50` (cutlass-dsl 4.3.x compat-hook removal, on nv-yunzheq's branch, not yet on main) for the small-N decode latency gap.
audit doc: adds a "2026-05-11 post-close update" section at the top listing the merged PR table, the in-flight downstream PRs (flashinfer-ai#3286 deterministic-profile, flashinfer-ai#3292 bench refresh), the wrapper-overhead retraction, and the now-stale thin-adapter refactor recommendation. The existing "Final state (2026-04-29)" section is retitled to "historical snapshot at audit close" since it represents a snapshot, not the current authoritative state.
runbook: adds a status notice at the top noting PR flashinfer-ai#3252 is merged and the Step 4-7 collapse anticipated by the runbook's own "Future-proofing" note is now applicable. The collapsed flow is not yet re-validated end-to-end, so the existing 2/2-validated recipe is preserved until then.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ranches
Updates the "2026-05-11 post-close update" section:
- PR flashinfer-ai#3292 (bench refresh): MERGED 2026-05-11 as `4381afc1`. B200 verification on NGC rc14 confirmed PR flashinfer-ai#3126 obviates the f3beb60 pollution workaround at the source (CuteDSL 0.147 ms / TRTLLM 0.144 ms at bs=128 ep=8, a clean band matching the May 6 baseline with the workaround applied).
- PR flashinfer-ai#3286 (A+C deterministic autotune profile): still OPEN with Gemini's high-priority comment about inputs_pre_hook propagation addressed in commit `36699b54`. Two medium-priority Gemini comments left for manual response; both mirror trt-llm's algorithm exactly so keeping for parity.
Adds a "Parked alternative branches" subsection listing two non-merged branches kept for future use:
- `bench-moe-deepseek-tighten-autotune-scope` (Codex follow-up to flashinfer-ai#3292, scopes autotune(True) to pre-warm only).
- `nvtx_microbenchmark` rebased to single clean commit on top of main (NVTX markers + cudaProfilerStart/Stop wrap; drops the redundant _force_autotune_off references).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…are accepted as-is Earlier note said the two medium-priority Gemini comments were "left for manual response". Updated to reflect the actual policy: both mirror trt-llm's algorithm exactly (zero-token-expert loop entry and multi-GPU prioritization index), intentional for parity with trt-llm, no reply will be sent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(pre-A+C snapshot)
Re-ran the canonical PR flashinfer-ai#2398 sweep (EP=1, 8, 16 × 15 N from 1 to 16384) on current main (post-flashinfer-ai#3292, before PR flashinfer-ai#3286 / A+C merges).
Headline at EP=1 N=128: CuteDSL 0.965 ms vs PR flashinfer-ai#2398's published 0.134 ms. nsys hardware-truth diagnostic at this cell showed the CuteDSL gemm kernel takes ~464 us per call at EP=1 vs ~63 us at EP=8, a 7.3x ratio matching the ~8x theoretical scaling from 256 vs 32 local experts in the grouped GEMM. Per-iter sum (2 gemms + routing + topk + small elementwise) ~988 us at EP=1, matching bench-reported 962 us within run-to-run noise.
Conclusion: today's bench is correctly measuring real GPU kernel work. PR flashinfer-ai#2398's published 134 us was physically impossible for the actual kernel work and should not be used as a baseline.
Adds:
- benchmarks/cute_dsl_moe_pr2398_rerun_2026_05_12_pre_a_c.csv: full 45-cell rerun data (3 EPs x 15 N).
- New "2026-05-12 PR flashinfer-ai#2398 rerun + nsys verdict (pre-A+C snapshot)" section in the audit doc with headline table, nsys verdict, and crossover analysis.
The data is a pre-A+C snapshot; PR flashinfer-ai#3286 (A+C deterministic autotune profile) is open as of this rerun. A post-merge full-matrix re-measurement is scheduled to produce the canonical post-audit reference numbers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
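The scaling argument in this commit message can be checked with a few lines of arithmetic. The inputs are the approximate figures quoted above (the variable names are illustrative), so the results are only as precise as those rounded measurements.

```python
# Approximate per-call gemm kernel times quoted in the commit message (microseconds).
gemm_us_ep1 = 464
gemm_us_ep8 = 63

measured_ratio = gemm_us_ep1 / gemm_us_ep8  # observed EP=1 vs EP=8 kernel-time ratio
theoretical_ratio = 256 / 32                # local experts at EP=1 vs EP=8

# Relative gap between the per-iteration kernel sum and the bench-reported time.
per_iter_sum_us = 988
bench_reported_us = 962
relative_gap = (per_iter_sum_us - bench_reported_us) / bench_reported_us
```

With these inputs the measured ratio lands between 7.3 and 7.4 against a theoretical 8, and the kernel sum differs from the bench figure by under 3%.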
selection_order_j = prioritized + [
    i for i in selection_order_j if i not in p_set
]
for i in selection_order_j:
Assume num_tokens_on_curr_rank < num_local_experts. Then the base from divmod(num_tokens_on_curr_rank, num_local_experts) is 0, and every expert beyond the remainder should be assigned 0 tokens. For DSv3 with EP=16 and num_tokens=1: average=0.5, extra≈0.83, num_tokens_on_curr_rank=2, and divmod(2, 16) = (0, 2), so experts 0 and 1 should each receive 1 token while experts 2–15 should each receive 0 tokens. However, the current logic causes experts 2–15 to incorrectly receive 1 additional token each.
So it deviates.
What do you think?
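The reviewer's arithmetic can be reproduced directly. This sketch assumes the per-expert counts come from a plain `divmod` split, as the comment states, and models the reported behaviour as every zero-count expert still receiving one token; both lists are illustrations, not output from the code under review.

```python
num_tokens_on_curr_rank = 2
num_local_experts = 16

base, remainder = divmod(num_tokens_on_curr_rank, num_local_experts)

# Intended split: the first `remainder` experts get base + 1, the rest get base.
intended = [base + 1 if j < remainder else base for j in range(num_local_experts)]

# Behaviour described in the comment: an expert whose count is 0 still
# receives one token because the loop body runs once before the check.
described = [max(count, 1) for count in intended]
```

The intended split sums to 2 tokens, while the described behaviour hands out 16.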
This is the same issue Gemini flagged earlier in this file, and your example reinforces this. However, TRT-LLM has identical behavior at https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py#L151-L159
Since the goal is a faithful port, ensuring FlashInfer's autotune sees the same profile inputs TRT-LLM does, fixing here would deviate. We could instead fix it here and file an issue to TRT-LLM.
# experts are exhausted. (For typical MoE configs with
# num_local_experts < num_experts the prioritized list is
# empty; preserved for parity with trt's algorithm.)
limit = self.top_k - (self.num_experts - j)
I agree with Gemini; I think we should use the global index here:
global_j = j + self.local_expert_offset
limit = self.top_k - (self.num_experts - global_j)
This was deliberately done for parity with TRT-LLM. TRT-LLM uses local j identically at https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py#L146-L147, and the goal is port alignment. Switching to global j would deviate from that. We could instead fix it here and file an issue to TRT-LLM. (This would also apply to Gemini's other comment.)
qiching
left a comment
Agree to merge. How about adding a code comment with the issue link at the sites of these two bug reports, to avoid repeated flagging?
qiching
left a comment
LGTM now! Thanks for your work.
/bot run
shapes,
dtype=torch.uint8,
device=device,
generator=torch.Generator(device=device).manual_seed(515),
Do you want to use a different seed value for each tensor initialization?
The shared seed is ok because the four lambdas draw from different shapes/dtypes/distributions, so the outputs are uncorrelated in practice despite sharing a seed.
…oken + multi-rank-prioritization quirks in generate_token_selected_experts Adds in-source pointers to NVIDIA/TensorRT-LLM#14146 at the two sites Gemini + qiching flagged, so future readers can find the upstream tracking issue without re-flagging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
04b28b3 to 5b000bf
/bot run
Setting use_cold_l2_cache=True on CuteDslFusedMoENvfp4Runner.tuning_config (introduced in 9f466b2) triggers NaN in TestCuteDslMoEWrapper::test_wrapper_with_autotune on B200 CI when a prior test in the same file runs with use_cuda_graph=True.
A latent reference cycle in CuteDslMoEWrapper retains CUDA resources across test boundaries:
CuteDslMoEWrapper -> _runner -> forward_impl (bound method) -> CuteDslMoEWrapper
Cold-L2 cycling during the next autotune profile interacts with the retained state and produces NaN. Four independent interventions all fix the failure on this branch:
- gc.collect() between tests
- Breaking the wrapper-runner cycle directly
- Nulling wrapper-owned CUDA resources
- Weakref trampoline replacing forward_impl=self._forward_with_tactic
Cold-L2 is the empirical trigger; the wrapper-runner reference cycle is the underlying defect. This change unsets the flag to unblock CI. The cycle itself will be addressed separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/bot run
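The retention mechanism named in this commit message can be illustrated without any CUDA code. In the sketch below, `Runner` and `Wrapper` are minimal hypothetical stand-ins for the real classes, and the behaviour shown relies on CPython's reference counting: an object caught in a cycle is only reclaimed when the cyclic garbage collector runs.

```python
import gc
import weakref


class Runner:  # hypothetical stand-in for the runner class
    def __init__(self, forward_impl):
        self.forward_impl = forward_impl


class Wrapper:  # hypothetical stand-in for the wrapper class
    def __init__(self):
        # The bound method holds a strong reference back to the wrapper,
        # closing the cycle wrapper -> runner -> bound method -> wrapper.
        self._runner = Runner(forward_impl=self._forward_with_tactic)

    def _forward_with_tactic(self, x):
        return x


def drop_wrapper():
    finalized = []
    wrapper = Wrapper()
    weakref.finalize(wrapper, finalized.append, True)
    del wrapper  # the only external reference is gone
    return finalized


gc.disable()  # keep the cyclic collector from running on its own schedule
try:
    finalized = drop_wrapper()
    alive_before_collect = finalized == []        # the cycle keeps it alive
    gc.collect()
    finalized_after_collect = finalized == [True]
finally:
    gc.enable()
```

This mirrors the first intervention in the list above: an explicit `gc.collect()` is what finally releases the wrapper.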
…shinfer-ai#3340 fix
Updates the post-close audit doc and reproduction runbook to capture the new state of in-flight follow-up PRs after a latent CuteDslMoEWrapper lifecycle bug was uncovered on 2026-05-15.
Audit doc (cute_dsl_moe_port_audit.md):
- Downstream PRs status updated 2026-05-12 -> 2026-05-16
- PR flashinfer-ai#3286 entry: noted cold-L2 unset workaround (640e32e) and the surfaced wrapper-runner reference cycle
- PR flashinfer-ai#3340 entry added (weakref trampoline lifecycle fix)
- PR flashinfer-ai#3328 entry added (SGLang Phase 1 cudaMemsetAsync wrapper)
- Other residual follow-ups: SGLang Phase 1 promoted from deferred to in-flight; cold-L2 re-enable (task flashinfer-ai#108) added as new deferred item
- New dated section "2026-05-15 wrapper-runner reference cycle discovered + PR flashinfer-ai#3340 fix": full diagnostic chain, Codex's 4-way convergent experiments, weakref trampoline code, connection to task flashinfer-ai#103, and the bisectability process lesson from audit task flashinfer-ai#91
Runbook (cute_dsl_moe_port_runbook.md):
- "Last updated" 2026-05-11 -> 2026-05-16
- Added "In-flight follow-up PRs" section noting flashinfer-ai#3286 / flashinfer-ai#3340 / flashinfer-ai#3328 do not change the bench procedure
The audit's task flashinfer-ai#91 framing of use_cold_l2_cache=True as "provably no-op for DeepSeek V3" remains correct for the buffer-cycle bytes-math but is empirically falsified for the autotune profile path: cold-L2 cycling during profile interacts with the retained CUDA state from a prior CUDA-graph wrapper test (via the wrapper-runner reference cycle) and produces NaN. The cycle is the load-bearing defect; cold-L2 was the trigger.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nv-yunzheq
left a comment
Approved as the unit test passed
…e43023
PR flashinfer-ai#3286 (deterministic balanced autotune profile + cold-L2 unset workaround, all 6 commits squashed) merged to flashinfer-ai/main on 2026-05-17T04:17:35Z by nv-yunzheq.
Audit doc:
- Downstream PRs block: 2026-05-16 -> 2026-05-17; flashinfer-ai#3286 OPEN -> MERGED with squash-commit `ce430238`.
- Cold-L2 re-enable item: clarified that flashinfer-ai#3286 has merged; the deferral now waits only on PR flashinfer-ai#3340.
- 2026-05-12 PR flashinfer-ai#2398 rerun section: added "2026-05-17 update" line noting flashinfer-ai#3286 merge unblocks task flashinfer-ai#102 (post-A+C rerun).
- 2026-05-15 wrapper-cycle section: workaround paragraph past-tensed and amended with the merge SHA.
Runbook:
- "Last updated" 2026-05-16 -> 2026-05-17.
- In-flight follow-up PRs section split: flashinfer-ai#3286 moved to "Recently merged"; flashinfer-ai#3340 and flashinfer-ai#3328 retained as "Still in-flight"; closing note refocused on task flashinfer-ai#108 contingent on flashinfer-ai#3340.
Bench procedure (Steps 1-11) unchanged: flashinfer-ai#3286 affects autotune profile inputs, not the bench reproduction recipe. Once the audit branch rebases onto main >= ce43023, post-A+C bench numbers (task flashinfer-ai#102) are runnable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4fe1ff
PR flashinfer-ai#3328 (SGLang Phase 1 dense moe_output_memset_inplace wrapper around cudaMemsetAsync) merged to flashinfer-ai/main on 2026-05-18T21:17:39Z by nv-yunzheq.
Audit doc:
- Header: "last updated 2026-05-17" -> "2026-05-18"; 31-day -> 32-day investigation window.
- Downstream PRs block: 2026-05-17 -> 2026-05-18; flashinfer-ai#3328 OPEN -> MERGED with squash-commit `34fe1ff0`; framing shifted from "Promotes follow-up flashinfer-ai#11 to in-flight" to "Closes follow-up flashinfer-ai#11".
- Other residual follow-ups: flashinfer-ai#3328 moved from "in flight" to "CLOSED".
- Follow-up flashinfer-ai#11 Phase 1 body: status "in flight ... awaiting review" -> "MERGED ... by nv-yunzheq".
Runbook:
- "Last updated" 2026-05-17 -> 2026-05-18.
- Recently merged follow-up PRs: split into two bullets (flashinfer-ai#3286 + flashinfer-ai#3328).
- Still in-flight follow-up PR(s): tightened to singular (only flashinfer-ai#3340 remains).
No bench-procedure changes (Phase-1 wrapper change is server-side, not bench-tool-side). Audit branch will pick up flashinfer-ai#3328's content on its next rebase onto main >= 34fe1ff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A force-tactic ablation on 2026-05-20 (5 cells × 5 forced runs, all
parity_ok=True) demonstrates that forcing fi to use trt's autotune-
selected tactic closes 77-100% of the gap at the EP=16/32 small/mid-N
regression cells:
EP=1 N=4096 : +8.3% → +0.0% (fully closed)
EP=16 N=8 : +18.5% → -4.0% (fi wins after forcing)
EP=16 N=1024 : +17.2% → +4.0% (77% closed)
EP=16/32 large-N: fi already wins; forcing changes nothing
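The "77-100% closed" figures follow from the before and after gaps in the table above. The sketch below recomputes them under the assumption that "fraction closed" means (before - after) / before; the dictionary keys are just labels for the three cells with numbers.

```python
# (gap before forcing, gap after forcing), in percent, from the table above.
gaps = {
    "EP=1 N=4096": (8.3, 0.0),
    "EP=16 N=8": (18.5, -4.0),
    "EP=16 N=1024": (17.2, 4.0),
}

# Fraction of the original gap removed by forcing trt's tactic.
fraction_closed = {
    cell: (before - after) / before for cell, (before, after) in gaps.items()
}
```

Under that definition EP=1 N=4096 is fully closed, EP=16 N=1024 comes out at about 77%, and EP=16 N=8 exceeds 100% because fi ends up ahead after forcing.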
The bias is dominantly tactic-pick-driven. This invalidates the
2026-05-19 hypothesis (autotune-candidate-set divergence due to
missing `ceil_div` filter in fi's `get_gemm2_valid_tactics`):
1. At the cells where the bias appears, `permuted_m` is always ≥
2048 due to `get_max_num_permuted_tokens`'s formula. The
`ceil_div(m, mma_tiler_mn[0]) ≥ cluster_shape_mn[0]` predicate
is always true; the filter never rejects in practice.
2. fi DOES already call `can_implement` on both kernel templates
at autotune time (tuner.py:435-467), and the kernel-level
`can_implement` bodies are byte-identical between fi and trt
(subagent diff verified).
3. So fi and trt accept the same candidate set at these cells.
The candidate-set framing doesn't explain why fi picks
different tactics.
The ACTUAL divergence is in the autotune TuningConfig:
- fi (tuner.py:295-364): `use_cold_l2_cache` intentionally unset
(workaround for the latent wrapper-
cycle bug per PR flashinfer-ai#3286's `640e32e7`)
- trt (cute_dsl_custom_ops.py:2032-2054): `use_cold_l2_cache=True`
With cold-L2 ON, autotune profile measures conservative timings and
picks tactics that are robustly fast under cold-cache conditions.
With cold-L2 OFF, autotune profile measures warm-L2 timings; some
tactics look fast due to L2-hit reuse during back-to-back profile
iterations but aren't actually faster in production. fi currently
picks tactics that "look fast" under warm L2 but aren't robustly
fast — which matches the ablation finding that trt's tactic is
faster on fi than fi's autotune-chosen tactic.
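A toy model shows how the choice of measurement condition can flip the autotuner's pick. All timings below are hypothetical numbers invented for illustration, not measurements, and `pick_fastest` is a stand-in for whatever selection rule the autotuner actually applies.

```python
# Hypothetical per-tactic profile timings in microseconds.
warm_l2_us = {"tactic_a": 50.0, "tactic_b": 55.0}  # back-to-back runs, L2 hits
cold_l2_us = {"tactic_a": 70.0, "tactic_b": 60.0}  # L2 flushed between runs


def pick_fastest(timings):
    """Return the tactic with the lowest measured time."""
    return min(timings, key=timings.get)


warm_pick = pick_fastest(warm_l2_us)
cold_pick = pick_fastest(cold_l2_us)
```

In this made-up case tactic_a wins under warm L2 only because it benefits more from cache reuse, while tactic_b is the better choice once the cache is cold.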
Three edits:
1. Header date stamp: 2026-05-19 → 2026-05-20; revised reading
order; framing pivot from candidate-set to cold-L2.
2. New top-of-doc "2026-05-20 ablation + cold-L2 mechanism"
section labeled "read this FIRST". Includes the ablation
table, the candidate-set refutation, the cold-L2 mechanism,
and the task status changes (flashinfer-ai#125 deleted as refuted; flashinfer-ai#108
promoted to load-bearing perf fix; new task flashinfer-ai#126 to stage the
cold-L2 re-enable branch ready for flashinfer-ai#3340 landing).
3. The 2026-05-19 tactic-divergence section reordered to THIRD,
marked HISTORICAL with a banner noting the mechanism conclusion
has been revised. Observations (fi shows 9-10 distinct
signatures, trt 1-3) still valid as empirical fact; only the
proposed mechanism (missing ceil_div) was wrong.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three edits in response to PR flashinfer-ai#3340 review:
1. flashinfer/fused_moe/cute_dsl/tuner.py: update the use_cold_l2_cache TuningConfig comment. The previous block described the wrapper-cycle bug that motivated unsetting the flag; with the cycle fixed by this PR, that description becomes stale on merge. Reworded to note the fix and that re-enabling cold-L2 is a follow-up.
2. tests/moe/test_cute_dsl_fused_moe.py, test_cuda_graph_wrapper_lifetime_before_autotune: tighten `assert finalized` to `assert finalized == [True]` so the test verifies the finalizer fired exactly once.
3. tests/moe/test_cute_dsl_fused_moe.py: add test_cuda_graph_wrapper_lifetime_after_autotune, parallel to the existing pre-autotune lifetime test but wrapping the warmup call in `with autotune(True):` to exercise the post-profiling code path (where the autotuner's profile pass actively reaches into the runner and the runner's reference back to the wrapper). Clears the autotuner cache immediately before the autotune block so a prior test's cache hit can't bypass the profile pass; asserts a `CuteDslMoEWrapper::run` entry exists in the cache afterward to confirm profiling actually ran. Same gc.disable() / try-finally harness as the existing test.
Branch was rebased onto current main HEAD to bring in the use_cold_l2_cache comment block (added in PR flashinfer-ai#3286).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## 📌 Description
`CuteDslMoEWrapper` currently passes `self._forward_with_tactic` as a bound method into `CuteDslFusedMoENvfp4Runner`, creating a strong reference cycle: `wrapper -> runner -> bound method -> wrapper`. When the wrapper is used with `use_cuda_graph=True`, this can keep wrapper-owned CUDA graph resources alive after user code has dropped the wrapper, until Python cyclic GC eventually runs.
This PR replaces that bound-method callback with a weakref trampoline. The runner can still call into a live wrapper, but it no longer owns the wrapper lifetime. This prevents stale wrapper CUDA resources from surviving across same-process tests or later autotune runs.
## 🔍 Related Issues
#3286 #3301 #3252
## 🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
### ✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.
> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).
## 🧪 Tests
Adds a focused regression test that warms a CUDA-graph wrapper, verifies it is finalized before cyclic GC, and then runs a subsequent autotuned wrapper call to ensure the output remains NaN-free.
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).
## Reviewer Notes
## Summary by CodeRabbit
* **Bug Fixes**
  * Improved handling and cleanup of CUDA-graph wrappers to prevent resource leaks and provide a clear error when a wrapper is no longer available.
* **Tests**
  * Added lifetime tests covering CUDA-graph wrappers before and after autotune; verify stable, non-NaN outputs during autotune.
* **Documentation**
  * Updated comment about cold-L2 cache behavior and noted follow-up to re-enable it once a related issue is addressed.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
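The weakref trampoline described in this PR can be sketched in plain Python. The classes and the `_make_trampoline` helper below are hypothetical stand-ins, not the merged implementation, and the immediate finalization relies on CPython's reference counting.

```python
import gc
import weakref


class Runner:  # hypothetical stand-in for the runner class
    def __init__(self, forward_impl):
        self.forward_impl = forward_impl


def _make_trampoline(wrapper):
    """Return a callable that reaches the wrapper through a weak reference."""
    wrapper_ref = weakref.ref(wrapper)

    def forward_impl(x):
        live = wrapper_ref()
        if live is None:
            raise RuntimeError("wrapper is no longer available")
        return live._forward_with_tactic(x)

    return forward_impl


class Wrapper:  # hypothetical stand-in for the wrapper class
    def __init__(self):
        # The runner holds only the trampoline, so it does not keep the
        # wrapper alive and no strong cycle is formed.
        self._runner = Runner(forward_impl=_make_trampoline(self))

    def _forward_with_tactic(self, x):
        return x * 2


gc.disable()  # show that no cyclic collection is needed
try:
    finalized = []
    wrapper = Wrapper()
    runner = wrapper._runner
    result = runner.forward_impl(21)
    weakref.finalize(wrapper, finalized.append, True)
    del wrapper
    freed_without_gc = finalized == [True]
    try:
        runner.forward_impl(1)
        stale_call_raised = False
    except RuntimeError:
        stale_call_raised = True
finally:
    gc.enable()
```

The wrapper is finalized as soon as the last strong reference is dropped, and a later call through the runner fails with an explicit error instead of touching stale state.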
…8f4534
PR flashinfer-ai#3340 (cute_dsl: avoid MoE wrapper runner reference cycle) was merged 2026-05-21T16:56:58Z as squash-commit `18f45345` by nv-yunzheq. With flashinfer-ai#3286 (2026-05-17) and flashinfer-ai#3340 (2026-05-21) both landed, task flashinfer-ai#108 (re-enable `use_cold_l2_cache=True`) is unblocked; branch `cute-dsl-moe-tuner-reenable-cold-l2` (HEAD `97c89c2f` on top of upstream/main `18f45345`) is staged locally with a single-commit diff (comment swap + flag flip).
Audit doc updates:
- Header date stamp 2026-05-20 -> 2026-05-21; 34-day -> 35-day investigation window; framing line notes PR flashinfer-ai#3340 merged and task flashinfer-ai#108 unblocked. Underlying TuningConfig-divergence finding (2026-05-20) and reading order otherwise unchanged.
- "TuningConfig divergences" table row for `use_cold_l2_cache`: "gated on PR flashinfer-ai#3340" -> "PR flashinfer-ai#3340 merged 2026-05-21 as `18f45345`, task flashinfer-ai#108 now unblocked".
- "Task status changes 2026-05-20" block: task flashinfer-ai#108 marked unblocked, task flashinfer-ai#126 marked completed (branch staged locally), new task flashinfer-ai#129 added for the post-enable cold-L2 re-sweep that becomes the canonical post-audit reference.
- Status-caveat paragraph: "Re-enabling cold L2 cache (task flashinfer-ai#108) is blocked on PR flashinfer-ai#3340" -> "PR flashinfer-ai#3340 merged 2026-05-21 unblocked task flashinfer-ai#108; ... a cold-L2 re-sweep (task flashinfer-ai#129) becomes the canonical reference once it lands".
- Other candidate mechanisms list, cold-L2 bullet: gate language updated.
- Downstream PRs block: PR flashinfer-ai#3286 framing tightened ("PR flashinfer-ai#3340 (merged 2026-05-21) shipped the underlying cycle fix"); PR flashinfer-ai#3340 entry rewritten from "OPEN 2026-05-16, mergeable" to "MERGED 2026-05-21T16:56:58Z as squash-commit `18f45345`" with the three pre-squash commits documented (`e471ae73` fix, `7fda2948` test, `fb440413` qiching review responses).
- Residual follow-ups block: "Re-enable use_cold_l2_cache ... deferred until PR flashinfer-ai#3340" -> "unblocked 2026-05-21; branch staged locally; task flashinfer-ai#129 covers the canonical re-sweep".
- 2026-05-15 dated section: PR flashinfer-ai#3340 line "single commit `ebb192a4`, 2026-05-16" -> "opened 2026-05-16, MERGED 2026-05-21 as squash-commit `18f45345`".
Runbook updates:
- Header date stamp 2026-05-19 -> 2026-05-21; framing line notes PR flashinfer-ai#3340 merged and task flashinfer-ai#108 unblocked, task flashinfer-ai#129 added.
- "Still in-flight follow-up PR" block: PR flashinfer-ai#3340 moved from in-flight to recently-merged (third entry in that list); "in-flight" wording removed; final paragraph rewritten to note all three follow-ups landed, task flashinfer-ai#108 unblocked, task flashinfer-ai#129 covers the canonical post-audit re-sweep.
Memories (separate, not in this commit since they live on Mac): reference_tactic_divergence_ep_scaling.md, reference_kernel_coverage_1to1.md, project_cutedsl_wrapper_cycle_bug.md, project_cutedsl_moe_fp4_port_audit_closed.md, MEMORY.md (index) have all received the same gate-lift / merged-status updates; project_cutedsl_wrapper_cycle_bug.md's frontmatter name updated to include the merge tag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TuningConfig (#3384)
## 📌 Description
Sets `use_cold_l2_cache=True` on the autotuner `TuningConfig` in `flashinfer/fused_moe/cute_dsl/tuner.py`, matching TRT-LLM's `CuteDslFusedMoENvfp4Runner.tuning_config`. With cold-L2 ON, the autotuner flushes L2 between profile iterations and measures conservative timings, so the picked tactic is robustly fast under cold-cache conditions; without it, back-to-back iterations of the same tactic benefit from L2-hit reuse and bias the pick toward tactics that look fast during profiling but aren't faster in production.
## 🔍 Related Issues
#3286 #3340
## 🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
### ✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.
> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).
## 🧪 Tests
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).
## Reviewer Notes
## Summary by CodeRabbit
* **Bug Fixes**
  * Improved tuner's cold L2 cache measurement behavior for more accurate performance profiling.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…etracted) All five audit-driven production PRs are now merged into flashinfer-ai/flashinfer:main (flashinfer-ai#3171, flashinfer-ai#3198, flashinfer-ai#3216, flashinfer-ai#3226, flashinfer-ai#3252). Updates the audit doc and the port-parity bench runbook to reflect this, plus retracts the wrapper-overhead framing for the residuals: CUDA graph replay captures host-side wrapper cost so it cannot be the residual's mechanism. The real residual is GPU-side non-kernel gap time, addressed by PR flashinfer-ai#3286 (deterministic balanced autotune profile) at large-N high-EP cells, and by `cab3ee50` (cutlass-dsl 4.3.x compat-hook removal, on nv-yunzheq's branch, not yet on main) for the small-N decode latency gap. audit doc: adds a "2026-05-11 post-close update" section at the top listing the merged PR table, the in-flight downstream PRs (flashinfer-ai#3286 deterministic-profile, flashinfer-ai#3292 bench refresh), the wrapper-overhead retraction, and the now-stale thin-adapter refactor recommendation. The existing "Final state (2026-04-29)" section is retitled to "historical snapshot at audit close" since it represents a snapshot not the current authoritative state. runbook: adds a status notice at the top noting PR flashinfer-ai#3252 is merged and the Step 4-7 collapse anticipated by the runbook's own "Future- proofing" note is now applicable. The collapsed flow is not yet re-validated end-to-end, so the existing 2/2-validated recipe is preserved until then. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ranches Updates the "2026-05-11 post-close update" section: - PR flashinfer-ai#3292 (bench refresh): MERGED 2026-05-11 as `4381afc1`. B200 verification on NGC rc14 confirmed PR flashinfer-ai#3126 obviates the f3beb60 pollution workaround at the source (CuteDSL 0.147 ms / TRTLLM 0.144 ms at bs=128 ep=8 — clean band matching the May 6 baseline with the workaround applied). - PR flashinfer-ai#3286 (A+C deterministic autotune profile): still OPEN with Gemini's high-priority comment about inputs_pre_hook propagation addressed in commit `36699b54`. Two medium-priority Gemini comments left for manual response; both mirror trt-llm's algorithm exactly so keeping for parity. Adds a "Parked alternative branches" subsection listing two non-merged branches kept for future use: - `bench-moe-deepseek-tighten-autotune-scope` (Codex follow-up to flashinfer-ai#3292, scopes autotune(True) to pre-warm only). - `nvtx_microbenchmark` rebased to single clean commit on top of main (NVTX markers + cudaProfilerStart/Stop wrap; drops the redundant _force_autotune_off references). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…are accepted as-is

An earlier note said the two medium-priority Gemini comments were "left for manual response". Updated to reflect the actual policy: both flagged behaviors (zero-token-expert loop entry and multi-GPU prioritization index) mirror trt-llm's algorithm exactly and are intentional for parity with trt-llm, so no reply will be sent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(pre-A+C snapshot)

Re-ran the canonical PR flashinfer-ai#2398 sweep (EP=1, 8, 16 × 15 N from 1 to 16384) on current main (post-flashinfer-ai#3292, before PR flashinfer-ai#3286 / A+C merges). Headline at EP=1 N=128: CuteDSL 0.965 ms vs PR flashinfer-ai#2398's published 0.134 ms.

nsys hardware-truth diagnostic at this cell showed the CuteDSL gemm kernel takes ~464 us per call at EP=1 vs ~63 us at EP=8, a 7.3x ratio matching the ~8x theoretical scaling from 256 vs 32 local experts in the grouped GEMM. Per-iter sum (2 gemms + routing + topk + small elementwise) ~988 us at EP=1, matching bench-reported 962 us within run-to-run noise.

Conclusion: today's bench is correctly measuring real GPU kernel work. PR flashinfer-ai#2398's published 134 us was physically impossible for the actual kernel work and should not be used as a baseline.

Adds:
- benchmarks/cute_dsl_moe_pr2398_rerun_2026_05_12_pre_a_c.csv: full 45-cell rerun data (3 EPs x 15 N).
- New "2026-05-12 PR flashinfer-ai#2398 rerun + nsys verdict (pre-A+C snapshot)" section in the audit doc with headline table, nsys verdict, and crossover analysis.

The data is a pre-A+C snapshot; PR flashinfer-ai#3286 (A+C deterministic autotune profile) is open as of this rerun. A post-merge full-matrix re-measurement is scheduled to produce the canonical post-audit reference numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…shinfer-ai#3340 fix

Updates the post-close audit doc and reproduction runbook to capture the new state of in-flight follow-up PRs after a latent CuteDslMoEWrapper lifecycle bug was uncovered on 2026-05-15.

Audit doc (cute_dsl_moe_port_audit.md):
- Downstream PRs status updated 2026-05-12 -> 2026-05-16
- PR flashinfer-ai#3286 entry: noted cold-L2 unset workaround (640e32e) and the surfaced wrapper-runner reference cycle
- PR flashinfer-ai#3340 entry added (weakref trampoline lifecycle fix)
- PR flashinfer-ai#3328 entry added (SGLang Phase 1 cudaMemsetAsync wrapper)
- Other residual follow-ups: SGLang Phase 1 promoted from deferred to in-flight; cold-L2 re-enable (task flashinfer-ai#108) added as new deferred item
- New dated section "2026-05-15 wrapper-runner reference cycle discovered + PR flashinfer-ai#3340 fix": full diagnostic chain, Codex's 4-way convergent experiments, weakref trampoline code, connection to task flashinfer-ai#103, and the bisectability process lesson from audit task flashinfer-ai#91

Runbook (cute_dsl_moe_port_runbook.md):
- "Last updated" 2026-05-11 -> 2026-05-16
- Added "In-flight follow-up PRs" section noting flashinfer-ai#3286 / flashinfer-ai#3340 / flashinfer-ai#3328 do not change the bench procedure

The audit's task flashinfer-ai#91 framing of use_cold_l2_cache=True as "provably no-op for DeepSeek V3" remains correct for the buffer-cycle bytes-math but is empirically falsified for the autotune profile path: cold-L2 cycling during profile interacts with the retained CUDA state from a prior CUDA-graph wrapper test (via the wrapper-runner reference cycle) and produces NaN. The cycle is the load-bearing defect; cold-L2 was the trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
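As a rough illustration of the "weakref trampoline" idea named in the commit above, the following is a minimal pure-Python sketch, not the code from PR flashinfer-ai#3340. `Wrapper`, `Runner`, and `_make_trampoline` are hypothetical stand-ins. It shows how routing a callback through a weak reference lets the owning object be freed by reference counting alone, without waiting for the cycle collector.

```python
import gc
import weakref


class Runner:
    """Hypothetical stand-in for a runner that calls back into its owner."""

    def __init__(self, callback):
        self.callback = callback

    def run(self, x):
        return self.callback(x)


class Wrapper:
    """Hypothetical stand-in for a wrapper that owns a runner."""

    def __init__(self, scale):
        self.scale = scale
        # Passing a bound method here would create wrapper -> runner -> wrapper.
        self.runner = Runner(_make_trampoline(self))

    def _forward(self, x):
        return x * self.scale


def _make_trampoline(owner):
    owner_ref = weakref.ref(owner)  # weak reference: does not keep owner alive

    def trampoline(x):
        target = owner_ref()
        if target is None:
            raise ReferenceError("wrapper was already destroyed")
        return target._forward(x)

    return trampoline


gc.disable()  # rely on reference counting only, to show no cycle exists
try:
    w = Wrapper(3)
    result = w.runner.run(2)
    probe = weakref.ref(w)
    del w
    freed_without_gc = probe() is None
finally:
    gc.enable()

print(result, freed_without_gc)
```

With a strong reference in place of `owner_ref`, the wrapper would stay alive after `del w` until the cycle collector ran, which is the kind of retained state the commit message describes.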
…e43023

PR flashinfer-ai#3286 (deterministic balanced autotune profile + cold-L2 unset workaround, all 6 commits squashed) merged to flashinfer-ai/main on 2026-05-17T04:17:35Z by nv-yunzheq.

Audit doc:
- Downstream PRs block: 2026-05-16 -> 2026-05-17; flashinfer-ai#3286 OPEN -> MERGED with squash-commit `ce430238`.
- Cold-L2 re-enable item: clarified that flashinfer-ai#3286 has merged; the deferral now waits only on PR flashinfer-ai#3340.
- 2026-05-12 PR flashinfer-ai#2398 rerun section: added "2026-05-17 update" line noting flashinfer-ai#3286 merge unblocks task flashinfer-ai#102 (post-A+C rerun).
- 2026-05-15 wrapper-cycle section: workaround paragraph past-tensed and amended with the merge SHA.

Runbook:
- "Last updated" 2026-05-16 -> 2026-05-17.
- In-flight follow-up PRs section split: flashinfer-ai#3286 moved to "Recently merged"; flashinfer-ai#3340 and flashinfer-ai#3328 retained as "Still in-flight"; closing note refocused on task flashinfer-ai#108 contingent on flashinfer-ai#3340.

Bench procedure (Steps 1-11) unchanged: flashinfer-ai#3286 affects autotune profile inputs, not the bench reproduction recipe. Once the audit branch rebases onto main >= ce43023, post-A+C bench numbers (task flashinfer-ai#102) are runnable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4fe1ff

PR flashinfer-ai#3328 (SGLang Phase 1 dense moe_output_memset_inplace wrapper around cudaMemsetAsync) merged to flashinfer-ai/main on 2026-05-18T21:17:39Z by nv-yunzheq.

Audit doc:
- Header: "last updated 2026-05-17" -> "2026-05-18"; 31-day -> 32-day investigation window.
- Downstream PRs block: 2026-05-17 -> 2026-05-18; flashinfer-ai#3328 OPEN -> MERGED with squash-commit `34fe1ff0`; framing shifted from "Promotes follow-up flashinfer-ai#11 to in-flight" to "Closes follow-up flashinfer-ai#11".
- Other residual follow-ups: flashinfer-ai#3328 moved from "in flight" to "CLOSED".
- Follow-up flashinfer-ai#11 Phase 1 body: status "in flight ... awaiting review" -> "MERGED ... by nv-yunzheq".

Runbook:
- "Last updated" 2026-05-17 -> 2026-05-18.
- Recently merged follow-up PRs: split into two bullets (flashinfer-ai#3286 + flashinfer-ai#3328).
- Still in-flight follow-up PR(s): tightened to singular (only flashinfer-ai#3340 remains).

No bench-procedure changes (Phase-1 wrapper change is server-side, not bench-tool-side). Audit branch will pick up flashinfer-ai#3328's content on its next rebase onto main >= 34fe1ff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A force-tactic ablation on 2026-05-20 (5 cells × 5 forced runs, all
parity_ok=True) demonstrates that forcing fi to use trt's autotune-
selected tactic closes 77-100% of the gap at the EP=16/32 small/mid-N
regression cells:
EP=1 N=4096 : +8.3% → +0.0% (fully closed)
EP=16 N=8 : +18.5% → -4.0% (fi wins after forcing)
EP=16 N=1024 : +17.2% → +4.0% (77% closed)
EP=16/32 large-N: fi already wins; forcing changes nothing
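The "percent closed" figures can be reproduced from the before/after gaps in the table. The sketch below assumes closure is defined as the fraction of the original gap removed, capped at 100% when fi ends up ahead; that definition is inferred from the quoted 77-100% range rather than stated in the commit.

```python
def gap_closed(before_pct: float, after_pct: float) -> float:
    """Fraction of the original fi-vs-trt gap removed by forcing the tactic.

    Capped at 1.0 when the sign flips (fi wins after forcing); the cap is an
    assumption made for this illustration.
    """
    if before_pct == 0:
        raise ValueError("no gap to close")
    return min(1.0, (before_pct - after_pct) / before_pct)


# (gap before forcing, gap after forcing), in percent, from the table above
cells = {
    "EP=1 N=4096": (8.3, 0.0),
    "EP=16 N=8": (18.5, -4.0),
    "EP=16 N=1024": (17.2, 4.0),
}

for name, (before, after) in cells.items():
    print(f"{name}: {gap_closed(before, after):.0%} closed")
```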
The bias is dominantly tactic-pick-driven. This invalidates the
2026-05-19 hypothesis (autotune-candidate-set divergence due to
missing `ceil_div` filter in fi's `get_gemm2_valid_tactics`):
1. At the cells where the bias appears, `permuted_m` is always ≥
2048 due to `get_max_num_permuted_tokens`'s formula. The
`ceil_div(m, mma_tiler_mn[0]) ≥ cluster_shape_mn[0]` predicate
is always true; the filter never rejects in practice.
2. fi DOES already call `can_implement` on both kernel templates
at autotune time (tuner.py:435-467), and the kernel-level
`can_implement` bodies are byte-identical between fi and trt
(subagent diff verified).
3. So fi and trt accept the same candidate set at these cells.
The candidate-set framing doesn't explain why fi picks
different tactics.
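Point 1 can be checked with a few lines of arithmetic. This sketch evaluates the quoted predicate for `m ≥ 2048`; the tile and cluster sizes are hypothetical values chosen for illustration, not the tactic list of either codebase.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive operands."""
    return -(-a // b)


def passes_filter(m: int, mma_tiler_m: int, cluster_m: int) -> bool:
    """The predicate quoted above: ceil_div(m, mma_tiler_mn[0]) >= cluster_shape_mn[0]."""
    return ceil_div(m, mma_tiler_m) >= cluster_m


# Hypothetical candidate shapes (assumptions for this sketch only).
MMA_TILER_M = (64, 128, 256)
CLUSTER_M = (1, 2, 4)

rejected = [
    (m, tile, cluster)
    for m in (2048, 4096, 16384)
    for tile in MMA_TILER_M
    for cluster in CLUSTER_M
    if not passes_filter(m, tile, cluster)
]
print("rejected at m >= 2048:", rejected)

# For contrast, a small m can fail the predicate with the same shapes.
print("m=128, tile=256, cluster=2 passes:", passes_filter(128, 256, 2))
```

Under these assumed shapes, the smallest quotient at `m = 2048` is `ceil_div(2048, 256) = 8`, which already exceeds the largest cluster dimension, so the filter would only matter for much smaller `m`.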
The ACTUAL divergence is in the autotune TuningConfig:
- fi (tuner.py:295-364): `use_cold_l2_cache` intentionally unset
(workaround for the latent wrapper-
cycle bug per PR flashinfer-ai#3286's `640e32e7`)
- trt (cute_dsl_custom_ops.py:2032-2054): `use_cold_l2_cache=True`
With cold-L2 ON, autotune profile measures conservative timings and
picks tactics that are robustly fast under cold-cache conditions.
With cold-L2 OFF, autotune profile measures warm-L2 timings; some
tactics look fast due to L2-hit reuse during back-to-back profile
iterations but aren't actually faster in production. fi currently
picks tactics that "look fast" under warm L2 but aren't robustly
fast — which matches the ablation finding that trt's tactic is
faster on fi than fi's autotune-chosen tactic.
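The ranking inversion described above can be shown with a toy model. The timings below are invented purely to show the mechanism and are not measurements from either codebase: a tactic that benefits more from L2 reuse wins a warm-cache profile but loses a cold-cache one.

```python
# Invented per-tactic timings in microseconds (illustration only).
timings_us = {
    "tactic_a": {"warm_l2": 58.0, "cold_l2": 71.0},  # leans heavily on L2 hits
    "tactic_b": {"warm_l2": 61.0, "cold_l2": 64.0},  # less cache-sensitive
}


def pick_tactic(timings: dict, condition: str) -> str:
    """Return the tactic with the lowest profiled time under one cache condition."""
    return min(timings, key=lambda name: timings[name][condition])


warm_pick = pick_tactic(timings_us, "warm_l2")
cold_pick = pick_tactic(timings_us, "cold_l2")
print(f"warm-L2 profile picks {warm_pick}; cold-L2 profile picks {cold_pick}")
```

If production behaves more like the cold-cache case, the warm-profile pick is the slower choice, which is the pattern the ablation reports.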
Three file edits:
1. Header date stamp: 2026-05-19 → 2026-05-20; revised reading
order; framing pivot from candidate-set to cold-L2.
2. New top-of-doc "2026-05-20 ablation + cold-L2 mechanism"
section labeled "read this FIRST". Includes the ablation
table, the candidate-set refutation, the cold-L2 mechanism,
and the task status changes (flashinfer-ai#125 deleted as refuted; flashinfer-ai#108
promoted to load-bearing perf fix; new task flashinfer-ai#126 to stage the
cold-L2 re-enable branch ready for flashinfer-ai#3340 landing).
3. The 2026-05-19 tactic-divergence section reordered to THIRD,
marked HISTORICAL with a banner noting the mechanism conclusion
has been revised. Observations (fi shows 9-10 distinct
signatures, trt 1-3) still valid as empirical fact; only the
proposed mechanism (missing ceil_div) was wrong.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…8f4534

PR flashinfer-ai#3340 (cute_dsl: avoid MoE wrapper runner reference cycle) was merged 2026-05-21T16:56:58Z as squash-commit `18f45345` by nv-yunzheq. With flashinfer-ai#3286 (2026-05-17) and flashinfer-ai#3340 (2026-05-21) both landed, task flashinfer-ai#108 (re-enable `use_cold_l2_cache=True`) is unblocked; branch `cute-dsl-moe-tuner-reenable-cold-l2` (HEAD `97c89c2f` on top of upstream/main `18f45345`) is staged locally with a single-commit diff (comment swap + flag flip).

Audit doc updates:
- Header date stamp 2026-05-20 -> 2026-05-21; 34-day -> 35-day investigation window; framing line notes PR flashinfer-ai#3340 merged and task flashinfer-ai#108 unblocked. Underlying TuningConfig-divergence finding (2026-05-20) and reading order otherwise unchanged.
- "TuningConfig divergences" table row for `use_cold_l2_cache`: "gated on PR flashinfer-ai#3340" -> "PR flashinfer-ai#3340 merged 2026-05-21 as `18f45345`, task flashinfer-ai#108 now unblocked".
- "Task status changes 2026-05-20" block: task flashinfer-ai#108 marked unblocked, task flashinfer-ai#126 marked completed (branch staged locally), new task flashinfer-ai#129 added for the post-enable cold-L2 re-sweep that becomes the canonical post-audit reference.
- Status-caveat paragraph: "Re-enabling cold L2 cache (task flashinfer-ai#108) is blocked on PR flashinfer-ai#3340" -> "PR flashinfer-ai#3340 merged 2026-05-21 unblocked task flashinfer-ai#108; ... a cold-L2 re-sweep (task flashinfer-ai#129) becomes the canonical reference once it lands".
- Other candidate mechanisms list, cold-L2 bullet: gate language updated.
- Downstream PRs block: PR flashinfer-ai#3286 framing tightened ("PR flashinfer-ai#3340 (merged 2026-05-21) shipped the underlying cycle fix"); PR flashinfer-ai#3340 entry rewritten from "OPEN 2026-05-16, mergeable" to "MERGED 2026-05-21T16:56:58Z as squash-commit `18f45345`" with the three pre-squash commits documented (`e471ae73` fix, `7fda2948` test, `fb440413` qiching review responses).
- Residual follow-ups block: "Re-enable use_cold_l2_cache ... deferred until PR flashinfer-ai#3340" -> "unblocked 2026-05-21; branch staged locally; task flashinfer-ai#129 covers the canonical re-sweep".
- 2026-05-15 dated section: PR flashinfer-ai#3340 line "single commit `ebb192a4`, 2026-05-16" -> "opened 2026-05-16, MERGED 2026-05-21 as squash-commit `18f45345`".

Runbook updates:
- Header date stamp 2026-05-19 -> 2026-05-21; framing line notes PR flashinfer-ai#3340 merged and task flashinfer-ai#108 unblocked, task flashinfer-ai#129 added.
- "Still in-flight follow-up PR" block: PR flashinfer-ai#3340 moved from in-flight to recently-merged (third entry in that list); "in-flight" wording removed; final paragraph rewritten to note all three follow-ups landed, task flashinfer-ai#108 unblocked, task flashinfer-ai#129 covers the canonical post-audit re-sweep.

Memories (separate, not in this commit since they live on Mac): reference_tactic_divergence_ep_scaling.md, reference_kernel_coverage_1to1.md, project_cutedsl_wrapper_cycle_bug.md, project_cutedsl_moe_fp4_port_audit_closed.md, MEMORY.md (index) have all received the same gate-lift / merged-status updates; project_cutedsl_wrapper_cycle_bug.md's frontmatter name updated to include the merge tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Reduces fi-vs-trt parity distance for CuteDSL NVFP4 MoE autotune selections by giving the autotuner a deterministic, balanced approx-max-load distribution for `token_selected_experts` during profiling, plus seeded random tensor initializers for the other dynamic inputs. Brings fi's autotune-profile machinery into structural alignment with trt-llm's `inputs_pre_hook` mechanism (all 6 of trt-llm's CuteDSL `TuningConfig`s set this hook).

Three commits on the branch:
- `8faa2359` (Path A+C): seed all 4 `tensor_initializers` + port `TuningConfig.inputs_pre_hook` + `CuteDslMoEInputsHelper` (the load-bearing change)
- `a009ca18` (small parity cleanup): align `use_cold_l2_cache=True` with the rest of the codebase / trt-llm (provably no-op for DeepSeek-V3, value is forward-looking; flagged separately so reviewers don't conflate it with the A+C perf claim)
- `819bd648`: pin the `CuteDslMoEInputsHelper` input-layout contract with a small unit test

Empirical validation
10-run 30-cell sweep on NGC rc14, B200, DeepSeek-V3 (hidden=7168, intermediate=2048, num_experts=256, top_k=8):
Cells closer to parity: 19 / further: 7 / neutral: 19. Cells now within 1% of trt parity: 19 of 45 (was ~12 of 45 baseline).
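For readers unfamiliar with what a deterministic, balanced `token_selected_experts` input looks like, here is a minimal stdlib-only sketch. It uses a seeded round-robin construction, which is a deliberately simpler technique than the helper in this PR (described in the review as rejection sampling). The function name and signature are hypothetical.

```python
import random
from collections import Counter


def balanced_token_selected_experts(num_tokens, num_experts, top_k, seed=0):
    """Assign top_k distinct experts per token with near-equal expert load.

    Hypothetical illustration: walks the expert ring in order starting from a
    seeded offset, so per-expert loads differ by at most one assignment.
    """
    if top_k > num_experts:
        raise ValueError("top_k must not exceed num_experts")
    rng = random.Random(seed)  # fixed seed: identical output on every call
    offset = rng.randrange(num_experts)
    rows = []
    for token in range(num_tokens):
        start = (offset + token * top_k) % num_experts
        rows.append([(start + j) % num_experts for j in range(top_k)])
    return rows


# DeepSeek-V3 routing shape from the sweep above: 256 experts, top_k=8.
rows = balanced_token_selected_experts(num_tokens=128, num_experts=256, top_k=8)
load = Counter(expert for row in rows for expert in row)
print("min/max tokens per expert:", min(load.values()), max(load.values()))
```

Because the output depends only on the shape and the seed, two profiling runs see the same routing pattern, which is the property the autotune profile relies on.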
Trade-offs (deliberately retained for parity gain)
Test plan
- `pytest tests/moe/test_cute_dsl_fused_moe.py::TestInputsHelperContract`: new unit test pins `CuteDslMoEInputsHelper.inputs_pre_hook`'s input-layout contract (replaces `inputs[2]`, passes through rest, deterministic across instances).
- `pytest tests/moe/test_cute_dsl_fused_moe.py`: existing CuteDSL MoE tests pass (no behavior change on the fast path).
- `parity_ok=True`; numerical equivalence preserved.

Summary by CodeRabbit
New Features
Tests