feat(cute_dsl/moe): deterministic balanced autotune profile inputs #3286

Merged
nv-yunzheq merged 6 commits into flashinfer-ai:main from leejnau:cute-dsl-moe-deterministic-profile
May 17, 2026
Conversation

@leejnau
Contributor

@leejnau leejnau commented May 11, 2026

Summary

Reduces fi-vs-trt parity distance for CuteDSL NVFP4 MoE autotune selections by giving the autotuner a deterministic, balanced approx-max-load distribution for token_selected_experts during profiling, plus seeded random tensor initializers for the other dynamic inputs. Brings fi's autotune-profile machinery into structural alignment with trt-llm's inputs_pre_hook mechanism (all 6 of trt-llm's CuteDSL TuningConfigs set this hook).

Three commits on the branch:

  • 8faa2359 — Path A+C: seed all 4 tensor_initializers + port TuningConfig.inputs_pre_hook + CuteDslMoEInputsHelper (the load-bearing change)
  • a009ca18 — small parity-cleanup: align use_cold_l2_cache=True with the rest of the codebase / trt-llm (provably no-op for DeepSeek-V3, value is forward-looking; flagged separately so reviewers don't conflate it with the A+C perf claim)
  • 819bd648 — pin the CuteDslMoEInputsHelper input-layout contract with a small unit test
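
For readers unfamiliar with the hook mechanism, the following is a minimal sketch of how a pre-hook can swap one profiling input while seeded initializers produce the rest. It uses only the Python standard library. `SketchTuningConfig`, `synthesize_inputs`, `balanced_hook`, and `profile_inputs` are hypothetical names for illustration and are not FlashInfer's or TRT-LLM's API; only the seed value 515 and the replaced index 2 are taken from this PR.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

Inputs = List[list]


@dataclass(frozen=True)
class SketchTuningConfig:
    """Hypothetical stand-in for a tuning config with an optional hook."""
    inputs_pre_hook: Optional[Callable[[Inputs], Inputs]] = None


def synthesize_inputs(seed: int = 515, num_tokens: int = 4) -> Inputs:
    """Seeded initializers: the same seed always yields the same inputs."""
    rng = random.Random(seed)
    hidden = [rng.random() for _ in range(num_tokens)]
    scales = [rng.random() for _ in range(num_tokens)]
    experts = [rng.randrange(8) for _ in range(num_tokens)]  # unbalanced
    return [hidden, scales, experts]


def balanced_hook(inputs: Inputs) -> Inputs:
    """Replace only index 2 (routing); forward every other input unchanged."""
    out = list(inputs)
    out[2] = [t % 8 for t in range(len(inputs[2]))]
    return out


def profile_inputs(config: SketchTuningConfig) -> Inputs:
    """Apply the hook, if any, before per-tactic profiling would start."""
    inputs = synthesize_inputs()
    if config.inputs_pre_hook is not None:
        inputs = config.inputs_pre_hook(inputs)
    return inputs
```

With the hook set, `profile_inputs` returns the same routing list on every call; without it, the seeded placeholder routing is passed through untouched.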

Empirical validation

10-run 45-cell sweep on NGC rc14, B200, DeepSeek-V3 (hidden=7168, intermediate=2048, num_experts=256, top_k=8):

| Bucket | baseline | with this patch | delta |
| --- | --- | --- | --- |
| Overall mean \|Δ%\| from trt parity | 8.48% | 6.45% | −2.03pp |
| Small (N=1..256) | 10.50% | 7.44% | −3.07pp |
| Mid (N=512..2048) | 6.54% | 5.80% | −0.74pp |
| Large (N=4096..16384) | 4.35% | 4.13% | −0.21pp |

Cells closer to parity: 19 / further: 7 / neutral: 19. Cells now within 1% of trt parity: 19 of 45 (was ~12 of 45 baseline).

Trade-offs (deliberately retained for parity gain)

  • N=4096 EP=1 regresses from +1.24% to +8.89% (single cell; deterministic seed lands on a sub-optimal pick at this bucket).
  • N=2048 EP=16 still bimodal under the seeded distribution; mean +19.87%.

Test plan

  • pytest tests/moe/test_cute_dsl_fused_moe.py::TestInputsHelperContract — new unit test pins CuteDslMoEInputsHelper.inputs_pre_hook's input-layout contract (replaces inputs[2], passes through rest, deterministic across instances).
  • pytest tests/moe/test_cute_dsl_fused_moe.py — existing CuteDSL MoE tests pass (no behavior change on the fast path).
  • 10-run 45-cell parity sweep on B200 rc14: all 450 cells parity_ok=True; numerical equivalence preserved.

Summary by CodeRabbit

  • New Features

    • Optional input-preprocessing callback for autotuning profiles, preserved across derived configs
    • Deterministic MoE profiling inputs via a new helper and seeded initialization for reproducible tuning
    • Autotune runner enables a cold L2 cache option to improve MoE tuning realism
  • Tests

    • Added CPU tests validating the input-preprocessing hook replaces the expected tensor, preserves others, and is deterministic

Review Change Stack

@coderabbitai
Contributor

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cc9f7496-03f1-4b87-9980-f0d10737c30c

📥 Commits

Reviewing files that changed from the base of the PR and between 5b000bf and 640e32e.

📒 Files selected for processing (1)
  • flashinfer/fused_moe/cute_dsl/tuner.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • flashinfer/fused_moe/cute_dsl/tuner.py

Walkthrough

Adds an optional inputs_pre_hook to TuningConfig, implements CuteDslMoEInputsHelper to deterministically replace token_selected_experts for MoE autotuning, wires the helper and seeded input generation into the CuteDSL runner, and adds CPU tests validating the hook's contract and determinism.

Changes

MoE Autotuner Inputs Determinism

| Layer | File(s) | Summary |
| --- | --- | --- |
| Autotuner Hook Contract | flashinfer/autotuner.py | TuningConfig gains optional inputs_pre_hook; _apply_tuning_overrides preserves it; AutoTuner.choose_one invokes the hook on synthesized tensors before per-tactic profiling. |
| CuteDslMoEInputsHelper Implementation | flashinfer/fused_moe/cute_dsl/_inputs_helper.py | New CuteDslMoEInputsHelper computes balanced per-local-expert token counts, builds a deterministic (num_tokens, top_k) token_selected_experts via seeded per-expert randperm, and exposes inputs_pre_hook that replaces only that tensor and forwards others unchanged. |
| Runner Integration with Seeded Initialization | flashinfer/fused_moe/cute_dsl/tuner.py | Runner imports and instantiates CuteDslMoEInputsHelper, seeds autotune tensor generators with torch.Generator.manual_seed(515), sets inputs_pre_hook=self._inputs_helper.inputs_pre_hook on TuningConfig, and enables use_cold_l2_cache=True. |
| Contract Validation Tests | tests/moe/test_cute_dsl_fused_moe.py | Adds TestInputsHelperContract that builds synthetic inputs matching wrapper layout, asserts index-2 token_selected_experts is replaced with a fresh tensor preserving shape/dtype while other inputs pass through by identity, and verifies determinism across helper instances. |
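
To make the "balanced, deterministic routing table" idea concrete, here is a small standard-library sketch. It is not the helper's actual algorithm: it swaps in a simpler greedy that gives each expert its quota by picking the tokens with the most free slots first, using a seeded shuffle only to break ties. The function names, the EP=1 assumption (all experts local), and the choice of `random` instead of `torch.randperm` are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import List


def balanced_token_selected_experts(num_tokens: int, top_k: int,
                                    num_experts: int,
                                    seed: int = 515) -> List[List[int]]:
    """Return num_tokens rows of top_k distinct experts with near-equal load."""
    assert top_k <= num_experts
    rng = random.Random(seed)
    # Spread num_tokens * top_k assignments as evenly as possible.
    base, rem = divmod(num_tokens * top_k, num_experts)
    quotas = [base + (1 if j < rem else 0) for j in range(num_experts)]
    need = [top_k] * num_tokens            # free slots left per token
    rows: List[List[int]] = [[] for _ in range(num_tokens)]
    for j, quota in enumerate(quotas):
        if quota <= 0:
            continue                       # nothing to assign for this expert
        order = list(range(num_tokens))
        rng.shuffle(order)                 # seeded per-expert permutation
        order.sort(key=lambda t: -need[t])  # stable sort: most free slots first
        for t in order[:quota]:
            rows[t].append(j)
            need[t] -= 1
    return rows


def expert_load(rows: List[List[int]]) -> Counter:
    """Count how many tokens were routed to each expert."""
    return Counter(e for row in rows for e in row)
```

Calling the function twice with the same arguments returns identical rows, which is the property the contract test in this PR checks for the real helper.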

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • aleozlx
  • yzh119
  • IwakuraRein
  • samuellees
  • jiahanc
  • bkryu
  • jimmyzho
  • sricketts

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(cute_dsl/moe): deterministic balanced autotune profile inputs' accurately reflects the main changes, which add deterministic and balanced profiling inputs for CuteDSL MoE autotuning. |
| Description check | ✅ Passed | The pull request description comprehensively covers the changes, objectives, empirical validation, trade-offs, and test plan, following the repository template structure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a mechanism for deterministic and realistic input distributions during MoE autotuning by adding an "inputs_pre_hook" to the "TuningConfig". A new "CuteDslMoEInputsHelper" class implements balanced expert assignment via rejection sampling, replacing the previous random initialization. The "CuteDslMoETuner" is updated to utilize this helper and seeded tensor initializers. Review feedback identifies three issues: the "inputs_pre_hook" is not propagated during tuning overrides, a logic error in the expert assignment loop allows assignments when token counts are zero, and a multi-GPU indexing bug exists in the prioritization logic.

Comment thread flashinfer/autotuner.py
Comment on lines +135 to +136
for j, num_tokens_j in enumerate(num_tokens_per_expert):
    selection_order_j = selection_orders[j].tolist()
Contributor

medium

If num_tokens_per_expert[j] is 0, the current logic will still assign expert j to one token because the inner loop over selection_order_j is entered and the break condition num_tokens_j <= 0 is only checked after the first assignment. Adding a check at the beginning of the expert loop prevents this and avoids unnecessary computation.

Suggested change
- for j, num_tokens_j in enumerate(num_tokens_per_expert):
-     selection_order_j = selection_orders[j].tolist()
+ for j, num_tokens_j in enumerate(num_tokens_per_expert):
+     if num_tokens_j <= 0:
+         continue
+     selection_order_j = selection_orders[j].tolist()
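
The effect described in this comment can be reproduced with a reduced model of the loop shape. This is a sketch under assumptions, not the PR's code: `assign` is a hypothetical function that only counts assignments, and it assumes the quota check sits after the assignment, as the comment states.

```python
from typing import List


def assign(num_tokens_per_expert: List[int], num_tokens: int,
           skip_empty: bool) -> List[int]:
    """Count how many tokens each expert ends up with."""
    assigned = [0] * len(num_tokens_per_expert)
    for j, num_tokens_j in enumerate(num_tokens_per_expert):
        if skip_empty and num_tokens_j <= 0:
            continue  # the guard proposed in this review
        for _token in range(num_tokens):
            assigned[j] += 1       # the assignment happens first...
            num_tokens_j -= 1
            if num_tokens_j <= 0:  # ...and the quota is checked afterwards
                break
    return assigned
```

With quotas `[1, 1, 0, 0]`, the unguarded loop gives every expert one token, while the guarded loop leaves the zero-quota experts empty.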

Contributor Author

# experts are exhausted. (For typical MoE configs with
# num_local_experts < num_experts the prioritized list is
# empty; preserved for parity with trt's algorithm.)
limit = self.top_k - (self.num_experts - j)
Contributor

medium

The limit calculation uses the local expert index j, but it should use the global expert index j + self.local_expert_offset to correctly identify tokens at risk of not hitting top_k experts across the entire MoE layer. Without this, the prioritization logic is effectively disabled for all ranks except the first one (where offset=0), breaking parity with the intended algorithm on multi-GPU setups.

Suggested change
- limit = self.top_k - (self.num_experts - j)
+ limit = self.top_k - (self.num_experts - (j + self.local_expert_offset))
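
A quick way to see the difference is to evaluate both formulas for one rank. The sketch below uses the DeepSeek-V3 values from the PR description (top_k=8, num_experts=256) with 16 local experts; the offset of 240 for the last rank assumes experts are partitioned contiguously across ranks, and `limits` is a hypothetical helper, not a function in the PR.

```python
from typing import List


def limits(top_k: int, num_experts: int, num_local_experts: int,
           local_expert_offset: int, use_global_index: bool) -> List[int]:
    """Evaluate limit = top_k - (num_experts - index) for each local expert."""
    out = []
    for j in range(num_local_experts):
        index = j + local_expert_offset if use_global_index else j
        out.append(top_k - (num_experts - index))
    return out


# Last rank of an EP=16 layout (assumed offset: 15 * 16 = 240).
local = limits(8, 256, 16, 240, use_global_index=False)
global_ = limits(8, 256, 16, 240, use_global_index=True)
```

With the local index every limit is far below zero, so a comparison against a non-negative per-token count can never match. With the global index the limit becomes non-negative only for the last top_k experts of the layer.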

Contributor Author

leejnau added a commit to leejnau/flashinfer that referenced this pull request May 11, 2026
_apply_tuning_overrides constructs a new TuningConfig with only a
subset of the original config's fields, silently dropping the new
inputs_pre_hook field. Any code path that activates
override_tuning_buckets or override_round_up would lose the helper
and fall back to random profile inputs -- defeating the purpose of the
balanced-load helper for those paths.

Addresses Gemini review comment on PR flashinfer-ai#3286.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
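
The failure mode in this commit message, a copy constructor that lists only some fields, is easy to reproduce in isolation. The sketch below is illustrative only: `SketchConfig`, `override_lossy`, and `override_preserving` are hypothetical names, and using `dataclasses.replace` is one possible way to keep a new field, not necessarily what the fix in FlashInfer does.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical config with a bucket list and an optional hook."""
    buckets: Tuple[int, ...] = ()
    inputs_pre_hook: Optional[Callable] = None


def override_lossy(cfg: SketchConfig, buckets: Tuple[int, ...]) -> SketchConfig:
    # Rebuilds from an explicit field subset: unlisted fields reset to defaults.
    return SketchConfig(buckets=buckets)


def override_preserving(cfg: SketchConfig,
                        buckets: Tuple[int, ...]) -> SketchConfig:
    # dataclasses.replace copies every field and overrides only the named ones.
    return replace(cfg, buckets=buckets)
```

The lossy variant silently returns a config whose hook is `None`, which is the "fall back to random profile inputs" behaviour described above.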
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 12, 2026
…etracted)

All five audit-driven production PRs are now merged into
flashinfer-ai/flashinfer:main (flashinfer-ai#3171, flashinfer-ai#3198, flashinfer-ai#3216, flashinfer-ai#3226, flashinfer-ai#3252).
Updates the audit doc and the port-parity bench runbook to reflect
this, plus retracts the wrapper-overhead framing for the residuals:
CUDA graph replay captures host-side wrapper cost so it cannot be
the residual's mechanism. The real residual is GPU-side non-kernel
gap time, addressed by PR flashinfer-ai#3286 (deterministic balanced autotune
profile) at large-N high-EP cells, and by `cab3ee50` (cutlass-dsl
4.3.x compat-hook removal, on nv-yunzheq's branch, not yet on main)
for the small-N decode latency gap.

audit doc: adds a "2026-05-11 post-close update" section at the top
listing the merged PR table, the in-flight downstream PRs (flashinfer-ai#3286
deterministic-profile, flashinfer-ai#3292 bench refresh), the wrapper-overhead
retraction, and the now-stale thin-adapter refactor recommendation.
The existing "Final state (2026-04-29)" section is retitled to
"historical snapshot at audit close" since it represents a snapshot
not the current authoritative state.

runbook: adds a status notice at the top noting PR flashinfer-ai#3252 is merged
and the Step 4-7 collapse anticipated by the runbook's own "Future-
proofing" note is now applicable. The collapsed flow is not yet
re-validated end-to-end, so the existing 2/2-validated recipe is
preserved until then.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 12, 2026
…ranches

Updates the "2026-05-11 post-close update" section:

- PR flashinfer-ai#3292 (bench refresh): MERGED 2026-05-11 as `4381afc1`.
  B200 verification on NGC rc14 confirmed PR flashinfer-ai#3126 obviates the
  f3beb60 pollution workaround at the source (CuteDSL 0.147 ms /
  TRTLLM 0.144 ms at bs=128 ep=8 — clean band matching the May 6
  baseline with the workaround applied).

- PR flashinfer-ai#3286 (A+C deterministic autotune profile): still OPEN with
  Gemini's high-priority comment about inputs_pre_hook propagation
  addressed in commit `36699b54`. Two medium-priority Gemini
  comments left for manual response; both mirror trt-llm's
  algorithm exactly so keeping for parity.

Adds a "Parked alternative branches" subsection listing two
non-merged branches kept for future use:

- `bench-moe-deepseek-tighten-autotune-scope` (Codex follow-up to
  flashinfer-ai#3292, scopes autotune(True) to pre-warm only).
- `nvtx_microbenchmark` rebased to single clean commit on top of
  main (NVTX markers + cudaProfilerStart/Stop wrap; drops the
  redundant _force_autotune_off references).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 12, 2026
…are accepted as-is

Earlier note said the two medium-priority Gemini comments were "left
for manual response".  Updated to reflect the actual policy: both
mirror trt-llm's algorithm exactly (zero-token-expert loop entry and
multi-GPU prioritization index), intentional for parity with trt-llm,
no reply will be sent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 12, 2026
…(pre-A+C snapshot)

Re-ran the canonical PR flashinfer-ai#2398 sweep (EP=1, 8, 16 × 15 N from 1 to
16384) on current main (post-flashinfer-ai#3292, before PR flashinfer-ai#3286 / A+C merges).
Headline at EP=1 N=128: CuteDSL 0.965 ms vs PR flashinfer-ai#2398's published
0.134 ms. nsys hardware-truth diagnostic at this cell showed the
CuteDSL gemm kernel takes ~464 us per call at EP=1 vs ~63 us at
EP=8 — a 7.3x ratio matching the ~8x theoretical scaling from
256 vs 32 local experts in the grouped GEMM. Per-iter sum
(2 gemms + routing + topk + small elementwise) ~988 us at EP=1,
matching bench-reported 962 us within run-to-run noise.

Conclusion: today's bench is correctly measuring real GPU kernel
work. PR flashinfer-ai#2398's published 134 us was physically impossible for
the actual kernel work and should not be used as a baseline.

Adds:
- benchmarks/cute_dsl_moe_pr2398_rerun_2026_05_12_pre_a_c.csv
  Full 45-cell rerun data (3 EPs x 15 N).
- New "2026-05-12 PR flashinfer-ai#2398 rerun + nsys verdict (pre-A+C snapshot)"
  section in the audit doc with headline table, nsys verdict, and
  crossover analysis.

The data is a pre-A+C snapshot; PR flashinfer-ai#3286 (A+C deterministic
autotune profile) is open as of this rerun. A post-merge
full-matrix re-measurement is scheduled to produce the canonical
post-audit reference numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
selection_order_j = prioritized + [
    i for i in selection_order_j if i not in p_set
]
for i in selection_order_j:
Collaborator

When num_tokens_on_curr_rank < num_local_experts, the base from divmod(num_tokens_on_curr_rank, num_local_experts) is 0, so the experts beyond the remainder should be assigned 0 tokens. For DSv3 with EP=16 and num_tokens=1: average=0.5, extra≈0.83, num_tokens_on_curr_rank=2, and divmod(2, 16) = (0, 2), so experts 0 and 1 should each receive 1 token while experts 2–15 should each receive 0 tokens. However, the current logic causes experts 2–15 to incorrectly receive 1 additional token each.

This deviates from the intended distribution.

What do you think?
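
The intended quotas in this example can be checked with a few lines. This is a sketch that assumes the remainder goes to the lowest-indexed experts, as the example states; `balanced_counts` is a hypothetical name, and the value num_tokens_on_curr_rank=2 is taken from the comment rather than recomputed.

```python
from typing import List


def balanced_counts(num_tokens_on_rank: int,
                    num_local_experts: int) -> List[int]:
    """Split tokens evenly; the first `remainder` experts get one extra."""
    base, remainder = divmod(num_tokens_on_rank, num_local_experts)
    return [base + (1 if j < remainder else 0)
            for j in range(num_local_experts)]


counts = balanced_counts(2, 16)  # the DSv3, EP=16, num_tokens=1 case
```

Here `counts` holds one token for experts 0 and 1 and zero for the remaining fourteen, so any assignment to experts 2–15 exceeds the quota.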

Contributor Author

This is the same issue Gemini flagged earlier in this file, and your example reinforces this. However, TRT-LLM has identical behavior at https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py#L151-L159

Since the goal is a faithful port, so that FlashInfer's autotune sees the same profile inputs TRT-LLM does, fixing it here would deviate from TRT-LLM. Alternatively, we could fix it here and file an issue with TRT-LLM.

# experts are exhausted. (For typical MoE configs with
# num_local_experts < num_experts the prioritized list is
# empty; preserved for parity with trt's algorithm.)
limit = self.top_k - (self.num_experts - j)
Collaborator

I agree with Gemini; I think we should use the global index here:

global_j = j + self.local_expert_offset
limit = self.top_k - (self.num_experts - global_j)

Contributor Author

This was deliberately done for parity with TRT-LLM. TRT-LLM uses local j identically at https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py#L146-L147, and the goal is port alignment. Switching to global j would deviate from that. We could instead fix it here and file an issue to TRT-LLM. (This would also apply to Gemini's other comment.)

Collaborator

@qiching qiching left a comment

Agree to merge. How about adding a comment in the code with the issue link for these two bug reports, to avoid repeated flagging?

@leejnau
Contributor Author

leejnau commented May 14, 2026

> Agree to merge. how about add a comment with the issue link to these two bug reports in the code to avoid repeated flagging?

Collaborator

@qiching qiching left a comment

LGTM now! Thanks for your work.

@nv-yunzheq
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !671 has been created, and the CI pipeline #51324686 is currently running. I'll report back once the pipeline job completes.

shapes,
dtype=torch.uint8,
device=device,
generator=torch.Generator(device=device).manual_seed(515),
Collaborator

Do you want to use different seed values for the tensor initializers?

Contributor Author

@leejnau leejnau May 14, 2026

The shared seed is ok because the four lambdas draw from different shapes/dtypes/distributions, so the outputs are uncorrelated in practice despite sharing a seed.
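
One property of the pattern in the snippet above is that each initializer builds a fresh generator from the seed, so its output does not depend on which other initializers ran first. The sketch below illustrates only that reproducibility point, using the standard-library `random.Random` as a stand-in for `torch.Generator`; the function names are hypothetical, and it does not test the statistical-correlation claim.

```python
import random
from typing import List

SEED = 515


def init_uniform(n: int) -> List[float]:
    """Fresh seeded generator per call: output is independent of call order."""
    rng = random.Random(SEED)
    return [rng.random() for _ in range(n)]


def init_bytes(n: int) -> List[int]:
    """Same seed, different distribution (integers in [0, 256))."""
    rng = random.Random(SEED)
    return [rng.randrange(256) for _ in range(n)]


# Contrast: one shared generator makes results depend on draw order.
shared = random.Random(SEED)
first_draw = [shared.random() for _ in range(4)]
second_draw = [shared.random() for _ in range(4)]
```

Calling `init_uniform(4)` before or after `init_bytes(8)` yields the same list, whereas the two draws from `shared` differ because the second continues the stream.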

…oken + multi-rank-prioritization quirks in generate_token_selected_experts

Adds in-source pointers to NVIDIA/TensorRT-LLM#14146
at the two sites Gemini + qiching flagged, so future readers can find
the upstream tracking issue without re-flagging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@leejnau leejnau force-pushed the cute-dsl-moe-deterministic-profile branch from 04b28b3 to 5b000bf Compare May 15, 2026 17:51
@leejnau leejnau requested a review from dhiraj113 as a code owner May 15, 2026 17:51
@nv-yunzheq
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !671 has been updated with latest changes, and the CI pipeline #51428234 is currently running. I'll report back once the pipeline job completes.

Setting use_cold_l2_cache=True on CuteDslFusedMoENvfp4Runner.tuning_config
(introduced in 9f466b2) triggers NaN in
TestCuteDslMoEWrapper::test_wrapper_with_autotune on B200 CI when a prior
test in the same file runs with use_cuda_graph=True.

A latent reference cycle in CuteDslMoEWrapper retains CUDA resources
across test boundaries:

  CuteDslMoEWrapper -> _runner -> forward_impl (bound method)
                              -> CuteDslMoEWrapper

Cold-L2 cycling during the next autotune profile interacts with the
retained state and produces NaN. Four independent interventions all
fix the failure on this branch:

  - gc.collect() between tests
  - Breaking the wrapper-runner cycle directly
  - Nulling wrapper-owned CUDA resources
  - Weakref trampoline replacing forward_impl=self._forward_with_tactic

Cold-L2 is the empirical trigger; the wrapper-runner reference cycle is
the underlying defect. This change unsets the flag to unblock CI. The
cycle itself will be addressed separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
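
The retention effect of such a cycle can be shown without any CUDA code. The sketch below is a simplified model in plain CPython: `Runner` and `Wrapper` are hypothetical stand-ins that mirror only the reference shape `wrapper -> runner -> bound method -> wrapper`, and automatic garbage collection is disabled during the check so the outcome is deterministic.

```python
import gc
import weakref


class Runner:
    def __init__(self, forward_impl):
        self.forward_impl = forward_impl  # strong reference to the callback


class Wrapper:
    def __init__(self):
        # A bound method holds a strong reference to self, closing the cycle.
        self._runner = Runner(self._forward)

    def _forward(self, x):
        return x


gc.disable()
try:
    wrapper = Wrapper()
    probe = weakref.ref(wrapper)
    del wrapper
    alive_before = probe() is not None  # refcounting cannot free a cycle
    gc.collect()                        # the cyclic collector can
    alive_after = probe() is not None
finally:
    gc.enable()
```

In this model the object survives `del` and is only released by `gc.collect()`, which is consistent with the first intervention listed above.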
@nv-yunzheq
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !671 has been updated with latest changes, and the CI pipeline #51471201 is currently running. I'll report back once the pipeline job completes.

leejnau added a commit to leejnau/flashinfer that referenced this pull request May 16, 2026
…shinfer-ai#3340 fix

Updates the post-close audit doc and reproduction runbook to capture the
new state of in-flight follow-up PRs after a latent CuteDslMoEWrapper
lifecycle bug was uncovered on 2026-05-15.

Audit doc (cute_dsl_moe_port_audit.md):
  - Downstream PRs status updated 2026-05-12 -> 2026-05-16
  - PR flashinfer-ai#3286 entry: noted cold-L2 unset workaround (640e32e) and the
    surfaced wrapper-runner reference cycle
  - PR flashinfer-ai#3340 entry added (weakref trampoline lifecycle fix)
  - PR flashinfer-ai#3328 entry added (SGLang Phase 1 cudaMemsetAsync wrapper)
  - Other residual follow-ups: SGLang Phase 1 promoted from deferred to
    in-flight; cold-L2 re-enable (task flashinfer-ai#108) added as new deferred item
  - New dated section "2026-05-15 wrapper-runner reference cycle
    discovered + PR flashinfer-ai#3340 fix" — full diagnostic chain, Codex's 4-way
    convergent experiments, weakref trampoline code, connection to
    task flashinfer-ai#103, and the bisectability process lesson from audit task flashinfer-ai#91

Runbook (cute_dsl_moe_port_runbook.md):
  - "Last updated" 2026-05-11 -> 2026-05-16
  - Added "In-flight follow-up PRs" section noting flashinfer-ai#3286 / flashinfer-ai#3340 / flashinfer-ai#3328
    do not change the bench procedure

The audit's task flashinfer-ai#91 framing of use_cold_l2_cache=True as "provably
no-op for DeepSeek V3" remains correct for the buffer-cycle bytes-math
but is empirically falsified for the autotune profile path — cold-L2
cycling during profile interacts with the retained CUDA state from a
prior CUDA-graph wrapper test (via the wrapper-runner reference cycle)
and produces NaN. The cycle is the load-bearing defect; cold-L2 was
the trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

@nv-yunzheq nv-yunzheq left a comment

Approved, as the unit test passed.

@nv-yunzheq nv-yunzheq merged commit ce43023 into flashinfer-ai:main May 17, 2026
30 of 31 checks passed
@leejnau leejnau deleted the cute-dsl-moe-deterministic-profile branch May 17, 2026 17:00
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 17, 2026
…e43023

PR flashinfer-ai#3286 (deterministic balanced autotune profile + cold-L2 unset
workaround, all 6 commits squashed) merged to flashinfer-ai/main on
2026-05-17T04:17:35Z by nv-yunzheq.

Audit doc:
  - Downstream PRs block: 2026-05-16 -> 2026-05-17; flashinfer-ai#3286 OPEN -> MERGED
    with squash-commit `ce430238`.
  - Cold-L2 re-enable item: clarified that flashinfer-ai#3286 has merged; the deferral
    now waits only on PR flashinfer-ai#3340.
  - 2026-05-12 PR flashinfer-ai#2398 rerun section: added "2026-05-17 update" line
    noting flashinfer-ai#3286 merge unblocks task flashinfer-ai#102 (post-A+C rerun).
  - 2026-05-15 wrapper-cycle section: workaround paragraph past-tensed
    and amended with the merge SHA.

Runbook:
  - "Last updated" 2026-05-16 -> 2026-05-17.
  - In-flight follow-up PRs section split: flashinfer-ai#3286 moved to "Recently
    merged"; flashinfer-ai#3340 and flashinfer-ai#3328 retained as "Still in-flight"; closing note
    refocused on task flashinfer-ai#108 contingent on flashinfer-ai#3340.

Bench procedure (Steps 1-11) unchanged — flashinfer-ai#3286 affects autotune profile
inputs, not the bench reproduction recipe. Once the audit branch rebases
onto main >= ce43023, post-A+C bench numbers (task flashinfer-ai#102) are runnable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 18, 2026
…4fe1ff

PR flashinfer-ai#3328 (SGLang Phase 1 dense moe_output_memset_inplace wrapper around
cudaMemsetAsync) merged to flashinfer-ai/main on 2026-05-18T21:17:39Z
by nv-yunzheq.

Audit doc:
  - Header: "last updated 2026-05-17" -> "2026-05-18"; 31-day -> 32-day
    investigation window.
  - Downstream PRs block: 2026-05-17 -> 2026-05-18; flashinfer-ai#3328 OPEN -> MERGED
    with squash-commit `34fe1ff0`; framing shifted from "Promotes
    follow-up flashinfer-ai#11 to in-flight" to "Closes follow-up flashinfer-ai#11".
  - Other residual follow-ups: flashinfer-ai#3328 moved from "in flight" to "CLOSED".
  - Follow-up flashinfer-ai#11 Phase 1 body: status "in flight ... awaiting review"
    -> "MERGED ... by nv-yunzheq".

Runbook:
  - "Last updated" 2026-05-17 -> 2026-05-18.
  - Recently merged follow-up PRs: split into two bullets (flashinfer-ai#3286 + flashinfer-ai#3328).
  - Still in-flight follow-up PR(s): tightened to singular (only flashinfer-ai#3340
    remains).

No bench-procedure changes (Phase-1 wrapper change is server-side, not
bench-tool-side). Audit branch will pick up flashinfer-ai#3328's content on its next
rebase onto main >= 34fe1ff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 20, 2026
A force-tactic ablation on 2026-05-20 (5 cells × 5 forced runs, all
parity_ok=True) demonstrates that forcing fi to use trt's autotune-
selected tactic closes 77-100% of the gap at the EP=16/32 small/mid-N
regression cells:

  EP=1 N=4096  : +8.3% → +0.0%   (fully closed)
  EP=16 N=8    : +18.5% → -4.0%  (fi wins after forcing)
  EP=16 N=1024 : +17.2% → +4.0%  (77% closed)
  EP=16/32 large-N: fi already wins; forcing changes nothing
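
The "percent closed" figures can be reproduced from the before and after gaps. This sketch assumes the fraction is defined as (before - after) / before, which matches the 77% figure quoted above but is not stated explicitly in the commit message; `gap_closed` is a hypothetical helper.

```python
def gap_closed(before_pct: float, after_pct: float) -> float:
    """Fraction of the original gap removed; values above 1.0 overshoot parity."""
    return (before_pct - after_pct) / before_pct


ep1_n4096 = gap_closed(8.3, 0.0)     # fully closed
ep16_n8 = gap_closed(18.5, -4.0)     # overshoots: fi ends up ahead
ep16_n1024 = gap_closed(17.2, 4.0)   # partially closed
```

Under this definition a value above 1.0 corresponds to the "fi wins after forcing" row.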

The bias is dominantly tactic-pick-driven. This invalidates the
2026-05-19 hypothesis (autotune-candidate-set divergence due to
missing `ceil_div` filter in fi's `get_gemm2_valid_tactics`):

  1. At the cells where the bias appears, `permuted_m` is always ≥
     2048 due to `get_max_num_permuted_tokens`'s formula. The
     `ceil_div(m, mma_tiler_mn[0]) ≥ cluster_shape_mn[0]` predicate
     is always true; the filter never rejects in practice.

  2. fi DOES already call `can_implement` on both kernel templates
     at autotune time (tuner.py:435-467), and the kernel-level
     `can_implement` bodies are byte-identical between fi and trt
     (subagent diff verified).

  3. So fi and trt accept the same candidate set at these cells.
     The candidate-set framing doesn't explain why fi picks
     different tactics.

The ACTUAL divergence is in the autotune TuningConfig:

  - fi  (tuner.py:295-364):  `use_cold_l2_cache` intentionally unset
                              (workaround for the latent wrapper-
                              cycle bug per PR flashinfer-ai#3286's `640e32e7`)
  - trt (cute_dsl_custom_ops.py:2032-2054):  `use_cold_l2_cache=True`

With cold-L2 ON, autotune profile measures conservative timings and
picks tactics that are robustly fast under cold-cache conditions.
With cold-L2 OFF, autotune profile measures warm-L2 timings; some
tactics look fast due to L2-hit reuse during back-to-back profile
iterations but aren't actually faster in production. fi currently
picks tactics that "look fast" under warm L2 but aren't robustly
fast — which matches the ablation finding that trt's tactic is
faster on fi than fi's autotune-chosen tactic.

Three file edits:

  1. Header date stamp: 2026-05-19 → 2026-05-20; revised reading
     order; framing pivot from candidate-set to cold-L2.

  2. New top-of-doc "2026-05-20 ablation + cold-L2 mechanism"
     section labeled "read this FIRST". Includes the ablation
     table, the candidate-set refutation, the cold-L2 mechanism,
     and the task status changes (flashinfer-ai#125 deleted as refuted; flashinfer-ai#108
     promoted to load-bearing perf fix; new task flashinfer-ai#126 to stage the
     cold-L2 re-enable branch ready for flashinfer-ai#3340 landing).

  3. The 2026-05-19 tactic-divergence section reordered to THIRD,
     marked HISTORICAL with a banner noting the mechanism conclusion
     has been revised. Observations (fi shows 9-10 distinct
     signatures, trt 1-3) still valid as empirical fact; only the
     proposed mechanism (missing ceil_div) was wrong.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 20, 2026
Three edits in response to PR flashinfer-ai#3340 review:

1. flashinfer/fused_moe/cute_dsl/tuner.py: update the use_cold_l2_cache
   TuningConfig comment. The previous block described the wrapper-cycle
   bug that motivated unsetting the flag; with the cycle fixed by this
   PR, that description becomes stale on merge. Reworded to note the
   fix and that re-enabling cold-L2 is a follow-up.

2. tests/moe/test_cute_dsl_fused_moe.py
   test_cuda_graph_wrapper_lifetime_before_autotune: tighten
   `assert finalized` to `assert finalized == [True]` so the test
   verifies the finalizer fired exactly once.

3. tests/moe/test_cute_dsl_fused_moe.py
   Add test_cuda_graph_wrapper_lifetime_after_autotune: parallel to
   the existing pre-autotune lifetime test but wraps the warmup call
   in `with autotune(True):` to exercise the post-profiling code path
   (where the autotuner's profile pass actively reaches into the
   runner and the runner's reference back to the wrapper). Clears the
   autotuner cache immediately before the autotune block so a prior
   test's cache hit can't bypass the profile pass; asserts a
   `CuteDslMoEWrapper::run` entry exists in the cache afterward to
   confirm profiling actually ran. Same gc.disable() / try-finally
   harness as the existing test.
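
The `gc.disable()` / try-finally harness and the `finalized == [True]` assertion described above can be sketched in plain Python. This is a minimal illustration, not the test from `tests/moe/test_cute_dsl_fused_moe.py`: `Resource`, `Cyclic`, and `finalized_without_cyclic_gc` are hypothetical stand-ins, and no CUDA objects are involved. It relies on CPython's reference counting to free acyclic objects immediately.

```python
import gc
import weakref


class Resource:
    """Hypothetical stand-in for a wrapper with no reference cycle."""


class Cyclic:
    """Hypothetical stand-in for a wrapper that stores its own bound method."""

    def __init__(self):
        # Bound method -> self: a strong reference cycle.
        self.callback = self.method

    def method(self):
        pass


def finalized_without_cyclic_gc(factory):
    """Return the finalizer log observed right after dropping the last name.

    Cyclic GC is disabled for the duration, so only reference counting
    can free the object.
    """
    finalized = []
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        obj = factory()
        weakref.finalize(obj, finalized.append, True)
        del obj
        # Snapshot while GC is still disabled so a later collection
        # cannot change the result.
        return list(finalized)
    finally:
        if was_enabled:
            gc.enable()


print(finalized_without_cyclic_gc(Resource))  # [True]
print(finalized_without_cyclic_gc(Cyclic))    # []
```

Comparing against `[True]` rather than relying on truthiness also rejects a finalizer that fired more than once, which is the tightening item 2 describes.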

Branch was rebased onto current main HEAD to bring in the
use_cold_l2_cache comment block (added in PR flashinfer-ai#3286).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nv-yunzheq pushed a commit that referenced this pull request May 21, 2026

## 📌 Description

`CuteDslMoEWrapper` currently passes `self._forward_with_tactic` as a
bound method into `CuteDslFusedMoENvfp4Runner`, creating a strong
reference cycle: `wrapper -> runner -> bound method -> wrapper`. When
the wrapper is used with `use_cuda_graph=True`, this can keep
wrapper-owned CUDA graph resources alive after user code has dropped the
wrapper, until Python cyclic GC eventually runs.

This PR replaces that bound-method callback with a weakref trampoline.
The runner can still call into a live wrapper, but it no longer owns the
wrapper lifetime. This prevents stale wrapper CUDA resources from
surviving across same-process tests or later autotune runs.
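
A minimal sketch of the weakref-trampoline pattern described above, assuming simplified stand-ins: `Wrapper`, `Runner`, and `make_trampoline` are hypothetical names, and only `_forward_with_tactic` mirrors an identifier from the PR. The actual FlashInfer implementation may differ in signature and error handling.

```python
import weakref


class Runner:
    """Stand-in for the runner: it stores a callback, never the wrapper."""

    def __init__(self, forward_fn):
        self.forward_fn = forward_fn


def make_trampoline(wrapper):
    """Build a callback that reaches the wrapper through a weak reference."""
    ref = weakref.ref(wrapper)

    def trampoline(*args, **kwargs):
        target = ref()
        if target is None:
            # The wrapper is gone; fail loudly instead of touching stale state.
            raise RuntimeError("wrapper is no longer available")
        return target._forward_with_tactic(*args, **kwargs)

    return trampoline


class Wrapper:
    """Stand-in for the wrapper; it owns the runner, not the other way round."""

    def __init__(self):
        # wrapper -> runner -> trampoline -> weakref: no strong cycle.
        self.runner = Runner(make_trampoline(self))

    def _forward_with_tactic(self, x, tactic=None):
        return x * 2


wrapper = Wrapper()
runner = wrapper.runner
print(runner.forward_fn(3))  # 6

del wrapper  # CPython frees the wrapper immediately: nothing holds it strongly
try:
    runner.forward_fn(3)
except RuntimeError as exc:
    print(exc)
```

Passing `self._forward_with_tactic` directly would make the runner hold a bound method, and therefore the wrapper, which is the cycle the description names.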

## 🔍 Related Issues

#3286
#3301
#3252

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

Adds a focused regression test that warms a CUDA-graph wrapper, verifies
it is finalized before cyclic GC, and then runs a subsequent autotuned
wrapper call to ensure the output remains NaN-free.

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
  * Improved handling and cleanup of CUDA-graph wrappers to prevent
    resource leaks and provide a clear error when a wrapper is no longer
    available.

* **Tests**
  * Added lifetime tests covering CUDA-graph wrappers before and after
    autotune; verify stable, non-NaN outputs during autotune.

* **Documentation**
  * Updated comment about cold-L2 cache behavior and noted follow-up to
    re-enable it once a related issue is addressed.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 21, 2026
…8f4534

PR flashinfer-ai#3340 (cute_dsl: avoid MoE wrapper runner reference cycle) was
merged 2026-05-21T16:56:58Z as squash-commit `18f45345` by
nv-yunzheq. With flashinfer-ai#3286 (2026-05-17) and flashinfer-ai#3340 (2026-05-21) both
landed, task flashinfer-ai#108 (re-enable `use_cold_l2_cache=True`) is unblocked;
branch `cute-dsl-moe-tuner-reenable-cold-l2` (HEAD `97c89c2f` on
top of upstream/main `18f45345`) is staged locally with a single-
commit diff (comment swap + flag flip).

Audit doc updates:
  - Header date stamp 2026-05-20 -> 2026-05-21; 34-day -> 35-day
    investigation window; framing line notes PR flashinfer-ai#3340 merged and
    task flashinfer-ai#108 unblocked. Underlying TuningConfig-divergence finding
    (2026-05-20) and reading order otherwise unchanged.
  - "TuningConfig divergences" table row for `use_cold_l2_cache`:
    "gated on PR flashinfer-ai#3340" -> "PR flashinfer-ai#3340 merged 2026-05-21 as
    `18f45345`, task flashinfer-ai#108 now unblocked".
  - "Task status changes 2026-05-20" block: task flashinfer-ai#108 marked
    unblocked, task flashinfer-ai#126 marked completed (branch staged locally),
    new task flashinfer-ai#129 added for the post-enable cold-L2 re-sweep that
    becomes the canonical post-audit reference.
  - Status-caveat paragraph: "Re-enabling cold L2 cache (task
    flashinfer-ai#108) is blocked on PR flashinfer-ai#3340" -> "PR flashinfer-ai#3340 merged 2026-05-21
    unblocked task flashinfer-ai#108; ... a cold-L2 re-sweep (task flashinfer-ai#129)
    becomes the canonical reference once it lands".
  - Other candidate mechanisms list, cold-L2 bullet: gate language
    updated.
  - Downstream PRs block: PR flashinfer-ai#3286 framing tightened ("PR flashinfer-ai#3340
    (merged 2026-05-21) shipped the underlying cycle fix"); PR
    flashinfer-ai#3340 entry rewritten from "OPEN 2026-05-16, mergeable" to
    "MERGED 2026-05-21T16:56:58Z as squash-commit `18f45345`"
    with the three pre-squash commits documented (`e471ae73` fix,
    `7fda2948` test, `fb440413` qiching review responses).
  - Residual follow-ups block: "Re-enable use_cold_l2_cache ...
    deferred until PR flashinfer-ai#3340" -> "unblocked 2026-05-21; branch
    staged locally; task flashinfer-ai#129 covers the canonical re-sweep".
  - 2026-05-15 dated section: PR flashinfer-ai#3340 line "single commit
    `ebb192a4`, 2026-05-16" -> "opened 2026-05-16, MERGED 2026-05-21
    as squash-commit `18f45345`".

Runbook updates:
  - Header date stamp 2026-05-19 -> 2026-05-21; framing line notes
    PR flashinfer-ai#3340 merged and task flashinfer-ai#108 unblocked, task flashinfer-ai#129 added.
  - "Still in-flight follow-up PR" block: PR flashinfer-ai#3340 moved from
    in-flight to recently-merged (third entry in that list);
    "in-flight" wording removed; final paragraph rewritten to
    note all three follow-ups landed, task flashinfer-ai#108 unblocked, task
    flashinfer-ai#129 covers the canonical post-audit re-sweep.

Memories (separate, not in this commit since they live on Mac):
  reference_tactic_divergence_ep_scaling.md,
  reference_kernel_coverage_1to1.md,
  project_cutedsl_wrapper_cycle_bug.md,
  project_cutedsl_moe_fp4_port_audit_closed.md,
  MEMORY.md (index)
have all received the same gate-lift / merged-status updates;
project_cutedsl_wrapper_cycle_bug.md's frontmatter name updated
to include the merge tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nv-yunzheq pushed a commit that referenced this pull request May 21, 2026
…TuningConfig (#3384)

## 📌 Description

Sets `use_cold_l2_cache=True` on the autotuner `TuningConfig` in
`flashinfer/fused_moe/cute_dsl/tuner.py`, matching TRT-LLM's
`CuteDslFusedMoENvfp4Runner.tuning_config`. With cold-L2 ON, the
autotuner flushes L2 between profile iterations and measures
conservative timings, so the picked tactic is robustly fast under
cold-cache conditions; without it, back-to-back iterations of the same
tactic benefit from L2-hit reuse and bias the pick toward tactics that
look fast during profiling but aren't faster in production.
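
The selection bias described above can be illustrated with a toy model. Everything here is hypothetical: the `Tactic` class, the tactic names, and the latency numbers are invented for illustration and are not measurements or autotuner APIs. The model assumes the first profile iteration is always cold and later iterations are warm unless L2 is flushed between them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tactic:
    name: str
    cold_us: float  # hypothetical latency when inputs are not resident in L2
    warm_us: float  # hypothetical latency when a prior iteration left data in L2


def profile(tactic, iters, cold_l2):
    """Mean latency over back-to-back iterations of a single tactic."""
    total = 0.0
    for i in range(iters):
        is_cold = cold_l2 or i == 0
        total += tactic.cold_us if is_cold else tactic.warm_us
    return total / iters


def pick(tactics, iters, cold_l2):
    """Name of the tactic with the lowest profiled mean latency."""
    return min(tactics, key=lambda t: profile(t, iters, cold_l2)).name


# Invented numbers: one tactic gains a lot from L2 reuse, the other little.
tactics = [
    Tactic("reuse_heavy", cold_us=120.0, warm_us=80.0),
    Tactic("streaming", cold_us=100.0, warm_us=95.0),
]

print(pick(tactics, iters=10, cold_l2=False))  # reuse_heavy
print(pick(tactics, iters=10, cold_l2=True))   # streaming
```

Under these assumptions the warm-L2 profile prefers the tactic that benefits from reuse, even though it is slower whenever its inputs are not already cached.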

## 🔍 Related Issues

#3286
#3340

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved tuner's cold L2 cache measurement behavior for more accurate
    performance profiling.


<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 21, 2026
…etracted)

All five audit-driven production PRs are now merged into
flashinfer-ai/flashinfer:main (flashinfer-ai#3171, flashinfer-ai#3198, flashinfer-ai#3216, flashinfer-ai#3226, flashinfer-ai#3252).
Updates the audit doc and the port-parity bench runbook to reflect
this, plus retracts the wrapper-overhead framing for the residuals:
CUDA graph replay captures host-side wrapper cost so it cannot be
the residual's mechanism. The real residual is GPU-side non-kernel
gap time, addressed by PR flashinfer-ai#3286 (deterministic balanced autotune
profile) at large-N high-EP cells, and by `cab3ee50` (cutlass-dsl
4.3.x compat-hook removal, on nv-yunzheq's branch, not yet on main)
for the small-N decode latency gap.

audit doc: adds a "2026-05-11 post-close update" section at the top
listing the merged PR table, the in-flight downstream PRs (flashinfer-ai#3286
deterministic-profile, flashinfer-ai#3292 bench refresh), the wrapper-overhead
retraction, and the now-stale thin-adapter refactor recommendation.
The existing "Final state (2026-04-29)" section is retitled to
"historical snapshot at audit close" since it represents a snapshot
not the current authoritative state.

runbook: adds a status notice at the top noting PR flashinfer-ai#3252 is merged
and the Step 4-7 collapse anticipated by the runbook's own "Future-
proofing" note is now applicable. The collapsed flow is not yet
re-validated end-to-end, so the existing 2/2-validated recipe is
preserved until then.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 21, 2026
…ranches

Updates the "2026-05-11 post-close update" section:

- PR flashinfer-ai#3292 (bench refresh): MERGED 2026-05-11 as `4381afc1`.
  B200 verification on NGC rc14 confirmed PR flashinfer-ai#3126 obviates the
  f3beb60 pollution workaround at the source (CuteDSL 0.147 ms /
  TRTLLM 0.144 ms at bs=128 ep=8 — clean band matching the May 6
  baseline with the workaround applied).

- PR flashinfer-ai#3286 (A+C deterministic autotune profile): still OPEN with
  Gemini's high-priority comment about inputs_pre_hook propagation
  addressed in commit `36699b54`. Two medium-priority Gemini
  comments left for manual response; both mirror trt-llm's
  algorithm exactly so keeping for parity.

Adds a "Parked alternative branches" subsection listing two
non-merged branches kept for future use:

- `bench-moe-deepseek-tighten-autotune-scope` (Codex follow-up to
  flashinfer-ai#3292, scopes autotune(True) to pre-warm only).
- `nvtx_microbenchmark` rebased to single clean commit on top of
  main (NVTX markers + cudaProfilerStart/Stop wrap; drops the
  redundant _force_autotune_off references).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 21, 2026
…are accepted as-is

Earlier note said the two medium-priority Gemini comments were "left
for manual response".  Updated to reflect the actual policy: both
mirror trt-llm's algorithm exactly (zero-token-expert loop entry and
multi-GPU prioritization index), intentional for parity with trt-llm,
no reply will be sent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 21, 2026
…(pre-A+C snapshot)

Re-ran the canonical PR flashinfer-ai#2398 sweep (EP=1, 8, 16 × 15 N from 1 to
16384) on current main (post-flashinfer-ai#3292, before PR flashinfer-ai#3286 / A+C merges).
Headline at EP=1 N=128: CuteDSL 0.965 ms vs PR flashinfer-ai#2398's published
0.134 ms. nsys hardware-truth diagnostic at this cell showed the
CuteDSL gemm kernel takes ~464 us per call at EP=1 vs ~63 us at
EP=8 — a 7.3x ratio matching the ~8x theoretical scaling from
256 vs 32 local experts in the grouped GEMM. Per-iter sum
(2 gemms + routing + topk + small elementwise) ~988 us at EP=1,
matching bench-reported 962 us within run-to-run noise.
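
The arithmetic behind this comparison can be checked directly. The sketch below uses only the rounded figures quoted in the message plus `num_experts=256` from the PR summary, and assumes experts are split evenly across EP ranks. Variable names are illustrative.

```python
num_experts = 256  # DeepSeek-V3, from the PR summary

# Local experts per rank, assuming an even split across EP ranks.
local_experts_ep1 = num_experts // 1  # 256
local_experts_ep8 = num_experts // 8  # 32
theoretical_ratio = local_experts_ep1 / local_experts_ep8

# Approximate per-call gemm times quoted from the nsys trace (microseconds).
gemm_us_ep1 = 464.0
gemm_us_ep8 = 63.0
measured_ratio = gemm_us_ep1 / gemm_us_ep8

# Per-iteration budget at EP=1: two gemms plus everything else.
per_iter_us = 988.0
non_gemm_us = per_iter_us - 2 * gemm_us_ep1

# Gap between the nsys per-iteration sum and the bench-reported time.
bench_us = 962.0
rel_gap = (per_iter_us - bench_us) / bench_us

print(f"theoretical scaling: {theoretical_ratio:.1f}x")
print(f"measured scaling:    {measured_ratio:.2f}x")
print(f"non-gemm time:       {non_gemm_us:.0f} us")
print(f"nsys vs bench gap:   {rel_gap:.1%}")
```

With these rounded inputs the measured ratio comes out near 7.4x (the message quotes 7.3x, presumably from unrounded timings), the two gemms account for 928 us of the 988 us per-iteration sum, and the nsys sum sits within about 3% of the bench figure.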

Conclusion: today's bench is correctly measuring real GPU kernel
work. PR flashinfer-ai#2398's published 134 us was physically impossible for
the actual kernel work and should not be used as a baseline.

Adds:
- benchmarks/cute_dsl_moe_pr2398_rerun_2026_05_12_pre_a_c.csv
  Full 45-cell rerun data (3 EPs x 15 N).
- New "2026-05-12 PR flashinfer-ai#2398 rerun + nsys verdict (pre-A+C snapshot)"
  section in the audit doc with headline table, nsys verdict, and
  crossover analysis.

The data is a pre-A+C snapshot; PR flashinfer-ai#3286 (A+C deterministic
autotune profile) is open as of this rerun. A post-merge
full-matrix re-measurement is scheduled to produce the canonical
post-audit reference numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 21, 2026
…shinfer-ai#3340 fix

Updates the post-close audit doc and reproduction runbook to capture the
new state of in-flight follow-up PRs after a latent CuteDslMoEWrapper
lifecycle bug was uncovered on 2026-05-15.

Audit doc (cute_dsl_moe_port_audit.md):
  - Downstream PRs status updated 2026-05-12 -> 2026-05-16
  - PR flashinfer-ai#3286 entry: noted cold-L2 unset workaround (640e32e) and the
    surfaced wrapper-runner reference cycle
  - PR flashinfer-ai#3340 entry added (weakref trampoline lifecycle fix)
  - PR flashinfer-ai#3328 entry added (SGLang Phase 1 cudaMemsetAsync wrapper)
  - Other residual follow-ups: SGLang Phase 1 promoted from deferred to
    in-flight; cold-L2 re-enable (task flashinfer-ai#108) added as new deferred item
  - New dated section "2026-05-15 wrapper-runner reference cycle
    discovered + PR flashinfer-ai#3340 fix" — full diagnostic chain, Codex's 4-way
    convergent experiments, weakref trampoline code, connection to
    task flashinfer-ai#103, and the bisectability process lesson from audit task flashinfer-ai#91

Runbook (cute_dsl_moe_port_runbook.md):
  - "Last updated" 2026-05-11 -> 2026-05-16
  - Added "In-flight follow-up PRs" section noting flashinfer-ai#3286 / flashinfer-ai#3340 / flashinfer-ai#3328
    do not change the bench procedure

The audit's task flashinfer-ai#91 framing of use_cold_l2_cache=True as "provably
no-op for DeepSeek V3" remains correct for the buffer-cycle bytes-math
but is empirically falsified for the autotune profile path — cold-L2
cycling during profile interacts with the retained CUDA state from a
prior CUDA-graph wrapper test (via the wrapper-runner reference cycle)
and produces NaN. The cycle is the load-bearing defect; cold-L2 was
the trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 21, 2026
…e43023

PR flashinfer-ai#3286 (deterministic balanced autotune profile + cold-L2 unset
workaround, all 6 commits squashed) merged to flashinfer-ai/main on
2026-05-17T04:17:35Z by nv-yunzheq.

Audit doc:
  - Downstream PRs block: 2026-05-16 -> 2026-05-17; flashinfer-ai#3286 OPEN -> MERGED
    with squash-commit `ce430238`.
  - Cold-L2 re-enable item: clarified that flashinfer-ai#3286 has merged; the deferral
    now waits only on PR flashinfer-ai#3340.
  - 2026-05-12 PR flashinfer-ai#2398 rerun section: added "2026-05-17 update" line
    noting flashinfer-ai#3286 merge unblocks task flashinfer-ai#102 (post-A+C rerun).
  - 2026-05-15 wrapper-cycle section: workaround paragraph past-tensed
    and amended with the merge SHA.

Runbook:
  - "Last updated" 2026-05-16 -> 2026-05-17.
  - In-flight follow-up PRs section split: flashinfer-ai#3286 moved to "Recently
    merged"; flashinfer-ai#3340 and flashinfer-ai#3328 retained as "Still in-flight"; closing note
    refocused on task flashinfer-ai#108 contingent on flashinfer-ai#3340.

Bench procedure (Steps 1-11) unchanged — flashinfer-ai#3286 affects autotune profile
inputs, not the bench reproduction recipe. Once the audit branch rebases
onto main >= ce43023, post-A+C bench numbers (task flashinfer-ai#102) are runnable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 21, 2026
…4fe1ff

PR flashinfer-ai#3328 (SGLang Phase 1 dense moe_output_memset_inplace wrapper around
cudaMemsetAsync) merged to flashinfer-ai/main on 2026-05-18T21:17:39Z
by nv-yunzheq.

Audit doc:
  - Header: "last updated 2026-05-17" -> "2026-05-18"; 31-day -> 32-day
    investigation window.
  - Downstream PRs block: 2026-05-17 -> 2026-05-18; flashinfer-ai#3328 OPEN -> MERGED
    with squash-commit `34fe1ff0`; framing shifted from "Promotes
    follow-up flashinfer-ai#11 to in-flight" to "Closes follow-up flashinfer-ai#11".
  - Other residual follow-ups: flashinfer-ai#3328 moved from "in flight" to "CLOSED".
  - Follow-up flashinfer-ai#11 Phase 1 body: status "in flight ... awaiting review"
    -> "MERGED ... by nv-yunzheq".

Runbook:
  - "Last updated" 2026-05-17 -> 2026-05-18.
  - Recently merged follow-up PRs: split into two bullets (flashinfer-ai#3286 + flashinfer-ai#3328).
  - Still in-flight follow-up PR(s): tightened to singular (only flashinfer-ai#3340
    remains).

No bench-procedure changes (Phase-1 wrapper change is server-side, not
bench-tool-side). Audit branch will pick up flashinfer-ai#3328's content on its next
rebase onto main >= 34fe1ff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>