Replace VM register arrays with contiguous stack (+28% bytecode throughput) #235
Replace TGocciaRegisterArray (dynamic array) with a contiguous stack and typed pointer (PGocciaRegister) for the VM register file. Same for local cells.

This eliminates FPC's automatic refcount management (FPC_DYNARRAY_ASSIGN, FPC_DYNARRAY_CLEAR, FPC_DYNARRAY_INCR_REF, FPC_DYNARR_SETLENGTH, FPC_FINALIZE), which profiling showed consumed 16.6% of production execution time.

Design:
- FRegisterStack: single growing TGocciaRegisterArray (initial 4096 slots)
- FRegisterBase: integer index into the stack for the current frame
- FRegisters: PGocciaRegister pointer (= @FRegisterStack[FRegisterBase])
- Save/restore saves integer indices, not array references, so no refcounting
- All 507 FRegisters[X] access sites work unchanged (pointer indexing)
- Same pattern for FLocalCellStack / FLocalCells

Measured impact (production build, fib(20) bytecode):
- Before: 405 ns/call (median of 10 runs)
- After: 291 ns/call (median of 10 runs)
- Improvement: 28%

Profiler confirmation (macOS sample tool):
- FPC_DYNARRAY_CLEAR: 103 samples → 0
- FPC_DYNARRAY_ASSIGN: 93 samples → 0
- FPC_DYNARRAY_INCR_REF: 43 samples → 0
- FPC_DYNARR_SETLENGTH: 52 samples → 0
- FPC_FINALIZE: 57 samples → 0
- Total execution: 2100ms → 1605ms (-24%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Walkthrough
The PR refactors register and local cell management in the VM to use preallocated growable stacks with base/count offset tracking and pointer windows instead of dynamic arrays, with pointer arithmetic support enabled via compiler directives.

Estimated code review effort: 4 (Complex), ~45 minutes
Benchmark Results
274 benchmarks
Interpreted: 🟢 21 improved · 🔴 134 regressed · 119 unchanged · avg -1.7%
arraybuffer.js — Interp: 🟢 8, 🔴 3, 3 unch. · avg +1.0% · Bytecode: 🟢 5, 🔴 9 · avg +0.1%
arrays.js — Interp: 🔴 9, 10 unch. · avg -1.3% · Bytecode: 🟢 9, 🔴 10 · avg +0.3%
async-await.js — Interp: 🟢 2, 4 unch. · avg +1.3% · Bytecode: 🟢 1, 🔴 4, 1 unch. · avg -3.6%
classes.js — Interp: 🔴 20, 11 unch. · avg -1.6% · Bytecode: 🟢 16, 15 unch. · avg +5.5%
closures.js — Interp: 🟢 1, 🔴 4, 6 unch. · avg -1.3% · Bytecode: 🟢 10, 🔴 1 · avg +11.8%
collections.js — Interp: 🔴 11, 1 unch. · avg -3.5% · Bytecode: 🟢 2, 🔴 9, 1 unch. · avg -8.7%
destructuring.js — Interp: 🔴 14, 8 unch. · avg -2.6% · Bytecode: 🟢 9, 🔴 2, 11 unch. · avg +2.2%
fibonacci.js — Interp: 🔴 2, 6 unch. · avg -1.2% · Bytecode: 🟢 6, 2 unch. · avg +10.2%
for-of.js — Interp: 🔴 6, 1 unch. · avg -3.9% · Bytecode: 🔴 5, 2 unch. · avg -2.1%
helpers/bench-module.js — Interp: 0 · Bytecode: 0
iterators.js — Interp: 🔴 14, 6 unch. · avg -3.9% · Bytecode: 🟢 9, 🔴 5, 6 unch. · avg +0.7%
json.js — Interp: 🔴 6, 14 unch. · avg -1.7% · Bytecode: 🔴 17, 3 unch. · avg -6.6%
jsx.jsx — Interp: 🟢 5, 🔴 2, 14 unch. · avg +0.4% · Bytecode: 🟢 15, 6 unch. · avg +3.4%
modules.js — Interp: 🟢 1, 🔴 2, 6 unch. · avg -0.5% · Bytecode: 🟢 7, 🔴 1, 1 unch. · avg +45.9%
numbers.js — Interp: 🟢 1, 🔴 7, 3 unch. · avg -2.4% · Bytecode: 🟢 2, 🔴 9 · avg -2.5%
objects.js — Interp: 🔴 4, 3 unch. · avg -1.6% · Bytecode: 🟢 3, 🔴 2, 2 unch. · avg +1.5%
promises.js — Interp: 🟢 1, 🔴 4, 7 unch. · avg -0.9% · Bytecode: 🟢 1, 🔴 6, 5 unch. · avg -1.4%
regexp.js — Interp: 🔴 6, 5 unch. · avg -2.7% · Bytecode: 🟢 1, 🔴 8, 2 unch. · avg -3.8%
strings.js — Interp: 🔴 3, 8 unch. · avg -1.8% · Bytecode: 🔴 10, 1 unch. · avg -6.6%
typed-arrays.js — Interp: 🟢 2, 🔴 17, 3 unch. · avg -2.5% · Bytecode: 🟢 12, 🔴 6, 4 unch. · avg +1.7%
Measured on ubuntu-latest x64. Benchmark ranges compare cached main-branch min/max ops/sec with the PR run; overlapping ranges are treated as unchanged noise. Percentage deltas are secondary context.

Suite Timing
Measured on ubuntu-latest x64.
Summary
- Replace TGocciaRegisterArray (dynamic array) allocation with a contiguous register stack and PGocciaRegister typed pointer

Problem
Profiling the production bytecode VM with macOS sample revealed that FPC's automatic dynamic array refcounting consumed 16.6% of execution time, spread across these runtime helpers:
- FPC_DYNARRAY_CLEAR
- FPC_DYNARRAY_ASSIGN
- FPC_DYNARRAY_INCR_REF
- FPC_DYNARR_SETLENGTH
- FPC_FINALIZE

Every function call saved/restored FRegisters (a dynamic array), triggering 4+ refcount operations. For fib(20) with 21,891 calls, that's ~87,000 refcount operations.

Solution
Replace the per-call dynamic array with a single growing stack:
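A minimal sketch of that layout, not the PR's actual code: the names TRegister, PRegister, RegisterStack and EnterFrame are hypothetical stand-ins for the VM's types and fields, and the doubling growth policy is an assumption (only the initial 4096 slots come from the commit message). It also shows that a typed pointer can be indexed with the same Registers[X] syntax as a dynamic array.

```pascal
program RegisterStackSketch;
{$mode objfpc}{$H+}
{$POINTERMATH ON}   // allow P[i] indexing on typed pointers
{$ASSERTIONS ON}

type
  // Hypothetical stand-in for the VM's register record.
  TRegister = record
    Value: Int64;
  end;
  PRegister = ^TRegister;

var
  RegisterStack: array of TRegister; // one shared, growable backing store
  RegisterBase: Integer = 0;         // index of the current frame's first slot
  Registers: PRegister = nil;        // window: @RegisterStack[RegisterBase]

// Make sure the frame [Base, Base + Count) fits, then re-derive the window
// pointer. SetLength may move the array, so the pointer is refreshed here.
procedure EnterFrame(Base, Count: Integer);
var
  NewLen: Integer;
begin
  if Base + Count > Length(RegisterStack) then
  begin
    NewLen := Length(RegisterStack);
    if NewLen = 0 then
      NewLen := 4096;                // initial size from the commit message
    while NewLen < Base + Count do
      NewLen := NewLen * 2;          // assumed growth policy
    SetLength(RegisterStack, NewLen);
  end;
  RegisterBase := Base;
  Registers := @RegisterStack[RegisterBase];
end;

begin
  EnterFrame(0, 8);
  Registers[3].Value := 42;          // same syntax as indexing a dynamic array
  EnterFrame(8, 8);                  // callee frame starts after the caller's
  Registers[3].Value := 7;
  Assert(RegisterStack[3].Value = 42);
  Assert(RegisterStack[11].Value = 7);
  WriteLn('ok');
end.
```

Because Registers is a plain pointer rather than a managed dynamic array, assigning it does not go through the compiler's reference-counting helpers.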
Save/restore becomes integer index copies (zero refcounting):
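A sketch of what such a save/restore could look like, under the same assumptions as before: TSavedFrame, SetWindow and Call are hypothetical names, and the fixed frame size of 4 slots is chosen only for illustration. The saved state holds nothing but integers, so copying it involves no managed types.

```pascal
program FrameSaveRestoreSketch;
{$mode objfpc}{$H+}
{$POINTERMATH ON}
{$ASSERTIONS ON}

type
  TRegister = record Value: Int64; end;   // hypothetical register layout
  PRegister = ^TRegister;

  // Hypothetical saved-frame record: plain integers only.
  TSavedFrame = record
    RegisterBase: Integer;
    RegisterCount: Integer;
  end;

var
  RegisterStack: array of TRegister;
  RegisterBase, RegisterCount: Integer;
  Registers: PRegister;

// Point the register window at RegisterStack[Base .. Base + Count - 1].
procedure SetWindow(Base, Count: Integer);
begin
  RegisterBase := Base;
  RegisterCount := Count;
  Registers := @RegisterStack[Base];
end;

procedure Call(Depth: Integer);
var
  Saved: TSavedFrame;
begin
  // Save: two integer copies.
  Saved.RegisterBase := RegisterBase;
  Saved.RegisterCount := RegisterCount;

  // The callee's frame begins where the caller's frame ends.
  SetWindow(RegisterBase + RegisterCount, 4);
  Registers[0].Value := Depth;
  if Depth < 3 then
    Call(Depth + 1);
  Assert(Registers[0].Value = Depth);     // nested calls left this frame intact

  // Restore: re-derive the pointer from the saved index.
  SetWindow(Saved.RegisterBase, Saved.RegisterCount);
end;

begin
  SetLength(RegisterStack, 4096);
  SetWindow(0, 4);
  Registers[0].Value := -1;
  Call(0);
  Assert(RegisterBase = 0);
  Assert(Registers[0].Value = -1);
  WriteLn('ok');
end.
```

Restoring from the saved index rather than from a saved pointer is a deliberate choice in this sketch: if a callee had grown the stack, an old pointer could refer to memory that has since been moved, whereas an index stays valid.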
All 507 FRegisters[X] access sites in the dispatch loop work unchanged, because pointer indexing has the same syntax as array indexing in FPC.

Measurements (production build, macOS AArch64)
Profiler confirmation (all dynamic array functions eliminated):
- FPC_DYNARRAY_CLEAR
- FPC_DYNARRAY_ASSIGN
- FPC_DYNARRAY_INCR_REF
- FPC_DYNARR_SETLENGTH
- FPC_FINALIZE

Test plan
🤖 Generated with Claude Code