engine: C-LEARN interpreter perf -- run -24% via array-view + opcode fusion#599
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 975b9b5510
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| tally(&fused, &mut p.fused_histogram); | ||
| let mut fused_stocks = module.compiled_stocks.as_ref().clone(); | ||
| fused_stocks.fuse_three_address(); | ||
| tally(&fused_stocks, &mut p.fused_histogram); |
There was a problem hiding this comment.
Exclude fused clones from
total_literals accounting
bytecode_profile() now calls tally() on fused and fused_stocks, but tally() increments p.total_literals as a side effect. Because those fused bytecodes are temporary clones of the same module bytecode, every module’s flow/stock literal table is counted twice, so total_literals is inflated and no longer reflects the compiled artifact size. This skews profiling output and can mislead perf/size comparisons that rely on literal counts.
Useful? React with 👍 / 👎.
Automated reviewReviewed the bytecode-fusion + array-view perf changes for correctness (not just the numeric guards in the PR). No blocking issues found. The correctness-critical surfaces I traced all hold:
Overall correctness verdict: correct. The patch preserves observable VM semantics; existing code/tests should not break. 🤖 Generated with Claude Code |
The fused-stream histogram added in the global/const operand-fusion commit tallied the fused flow/stock clones through the same closure that increments total_literals. Those clones share the real bytecode's literal table (fusion rewrites opcodes, never the literal pool), so every module's flow+stock literals were counted twice and total_literals was inflated (profiling-only; no VM behavior change). Split out a histogram-only tally for the fused streams and add a regression test asserting total_literals counts each phase once. Addresses the P2 review comment on #599.
flat_offset is a per-element hot spot (~19% of the C-LEARN run), hit at every vector-op/reducer dispatch site. It rebuilt a per-dimension sparse_lookup SmallVec and did per-index Option branching on every call, even for dense views with no sparse mappings (the common case). Skip that work when self.sparse is empty; numerically identical there (empty sparse implies actual_idx == idx). Measured C-LEARN run_to_end ~155 -> ~125 ms/iter (-20%); engine and simulate suites unchanged.
R2 fused subexpression loads (Load;Load;Op2 -> BinVarVar) but left
leaf assignments dst = a op b as Load a; Load b; BinOpAssign (3
dispatches + stack traffic). Add operator-specialized 3-operand
opcodes Assign{Add,Sub,Mul,Div}{VarVar,VarConst,ConstVar}{Curr,Next}
plus 2-operand stack-leaf forms, emitted by an extended
fuse_three_address pass at Vm::new. Operator-in-the-tag keeps the
payload at 3xu16 = 6 bytes, so size_of::<Opcode>() stays 8 (no
encoding growth) and dispatch is straight-line. These ops never
enter the symbolic layer (exhaustive unreachable! arm). Numerically
exact. C-LEARN: flow opcodes 26539 -> 25215 (973 sites), run_to_end
~124 -> ~121 ms (-2.2%), retired instructions -1.4%, branch-misses
-0.95%.
vm.rs was 32 lines under the 6000-line per-file cap, so the added
dispatch arms and tests pushed it over. Relocate the unrelated,
all-public-API set_value tests to a #[path]-included child module
vm_set_value_tests.rs (still crate::vm's child, so super::* reaches
the parent internals with no visibility changes) to restore headroom.
Pure relocation, no behaviour change.
PushStaticView clones the static view's dims/strides/dim_ids into a RuntimeView on every execution (~1M times/run on C-LEARN). smallvec 1.15 (built without the specialization feature) lowers SmallVec::clone to an element-wise Extend, so use SmallVec::from_slice (a single memcpy) for the Copy-element buffers and an empty allocation for the near-always-empty sparse list. Numerically identical; removes ~4.2% of retired run instructions. Wall-clock is unchanged here -- the C-LEARN run is dispatch-mispredict/latency-bound, not instruction-bound (IPC ~3.3), so the core already absorbed these tiny L1-resident copies -- but it is a strict efficiency improvement worth keeping.
R2 fused Load;Load;Op2 for module-relative var operands but left LoadGlobalVar operands and two-literal operands as separate loads. Extend the Vm::new-time fuse_three_address pass with pushing binop variants that read a global operand directly from curr[g] (absolute, no module_off): BinGlobalVar/BinVarGlobal/BinGlobalConst/BinConstGlobal/ BinGlobalGlobal/BinStackGlobal, plus BinConstConst for two-literal operands. Numerically exact (still computes l OP r at run time); the ops live only in the Vm execution copy (exhaustive unreachable! arm in symbolize_opcode), so CompiledSimulation stays symbolizable. C-LEARN: 1858 dispatches fused, branches -2.9%, branch-misses -3.8% (the dispatch loop is branch-bound); run ~120 -> ~118 ms (noise floor). size_of::<Opcode>() unchanged (8 bytes).
The fused-stream histogram added in the global/const operand-fusion commit tallied the fused flow/stock clones through the same closure that increments total_literals. Those clones share the real bytecode's literal table (fusion rewrites opcodes, never the literal pool), so every module's flow+stock literals were counted twice and total_literals was inflated (profiling-only; no VM behavior change). Split out a histogram-only tally for the fused streams and add a regression test asserting total_literals counts each phase once. Addresses the P2 review comment on #599.
Review — engine: C-LEARN interpreter perf (#599)I reviewed all six files with a focus on numerical faithfulness of the fusion passes: the The implementation holds up well. Each new dispatch arm matches its opcode definition; operand order is preserved for the non-commutative [P3] Stale comment contradicts the leaf-assign opcodes added in this same PR
Overall correctness: correctI found no correctness issues. The patch is numerically faithful, the new opcode set is handled exhaustively in every match, and the change is well covered by the added discriminating tests (operand order, the submodule global-vs- |
0c143bf to
047e192
Compare
Review: engine C-LEARN interpreter perf (opcode fusion + array-view fast paths)I did a line-by-line correctness audit of the new fused opcodes, the fusion pass, and the array-view fast paths. No blocking issues found — the patch looks correct. Things I verified:
Overall correctness verdict: correct. No bugs or test-breaking changes identified. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #599 +/- ##
========================================
Coverage 82.87% 82.87%
========================================
Files 260 261 +1
Lines 69576 69828 +252
========================================
+ Hits 57659 57872 +213
- Misses 11917 11956 +39 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
A hill-climbing pass on the bytecode VM interpreter, benchmarked on the C-LEARN model (our largest: ~53k MDL lines, 5726 root slots, 1000 Euler steps), LTM off, native release (
opt-level=3+ LTO). Each step was profiled, A/B-measured, and validated numerically.Cumulative:
run_to_end~155 -> ~118 ms/iter (about -24%). Every commit is numerically faithful (byte-exact vs Vensim'sRef.vdfvia theclearn_residual_exactnessguard, plus the full engine +simulatesuites green).Commits
flat_offsetdense fast-path (skip the per-elementsparse_lookupSmallVec when a view has no sparse mappings)flat_offset18.9% -> 8.6% of rundst = a op b(operator-specializedAssign{Add,Sub,Mul,Div}{VarVar,VarConst,ConstVar}{Curr,Next}+ stack-leaf forms)Opcodeat 8 bytes; branch-misses -0.95%to_runtime_viewmemcpy (SmallVec::from_slicevs element-wise clone)LoadGlobalVarand two-literal operands into pushing binops (BinGlobal*/BinConstConst)Both fusion passes run at
Vm::newon the execution bytecode, soCompiledSimulationstays a pure, salsa-cached, symbolizable artifact (the fused opcodes have an exhaustiveunreachable!arm insymbolize_opcode).size_of::<Opcode>()stays 8 bytes throughout.Key finding: the run is branch-mispredict-bound, not instruction-bound
A fresh
perfprofile attributes 68% of branch-misses to theeval_bytecodedispatch indirect branch; IPC is ~3.3. So wall-clock wins come from cutting dispatches (fewer indirect-branch mispredicts), not instruction count -- which is exactly why theto_runtime_viewmemcpy (-4.2% instructions) was wall-clock-neutral while the fusion passes (which cut dispatches) moved the needle and the branch-miss rate.Ruled out (data-driven NO-GOs)
run_to_endis allocation-free (the malloc/free seen in profiles is the one-time compile + the harness's per-iterVm::new/clone).opt-level=3(and would needunsafe).Open follow-ups (not in this PR)
become): the one remaining step-change for the branch-bound dispatch loop, but it requires nightly Rust (a toolchain decision).lookup/graphical-function hot path (~3-4% of run);flat_offsetstride-stepping for theBeginIternon-contiguous precompute.Validation
Every commit passed the full pre-commit hook. C-LEARN matches
Ref.vdfbyte-for-byte (clearn_residual_exactness) on all four commits. Operand order for the non-commutativeSub/Divand the global-vs-module_offslot distinction are covered by dedicated, verified-discriminating tests (injecting the bug fails the test and the exactness guard).