Skip to content

test: fix InterpreterBenchmark so it produces trustworthy numbers#2985

Merged
He-Pin merged 2 commits into
apache:mainfrom
He-Pin:perf-graph-interpreter-jit-friendly
May 21, 2026
Merged

test: fix InterpreterBenchmark so it produces trustworthy numbers#2985
He-Pin merged 2 commits into
apache:mainfrom
He-Pin:perf-graph-interpreter-jit-friendly

Conversation

@He-Pin
Copy link
Copy Markdown
Member

@He-Pin He-Pin commented May 21, 2026

Motivation

InterpreterBenchmark had two independent bugs that made its results unreliable, which becomes a problem the moment anyone wants to evaluate a GraphInterpreter-touching change against it.

  1. The benchmark body was new GraphInterpreterSpecKit { new TestSetup { ... } }. Because that ran inside @Benchmark, every invocation built (and never tore down) a fresh ActorSystem. Long iterations exhausted native threads and JMH ended up reporting empty results once the JVM ran out of resources.
  2. Inside the assembly we used GraphStages.identity[Int] once per slot, but GraphStages.identity is a singleton whose Inlet/Outlet shape is shared across every reference. Chaining N copies (numberOfIds = 5/10) collapses to a single shape and mis-wires the connections; the run logged a flood of Cannot pull port twice errors and ended up reporting nonsense throughput (5/10-stage configs faster than the 1-stage one).

Modification

  • Make InterpreterBenchmark itself extend GraphInterpreterSpecKit so JMH's @State(Scope.Benchmark) lifecycle reuses one ActorSystem across invocations, and add @TearDown(Level.Trial) to terminate it cleanly.
  • Define a local IdentityStage[T] with its own Inlet/Outlet per instance and use Vector.fill(numberOfIds)(new IdentityStage[Int]) so each slot in the chain is a distinct stage with a distinct shape.

No changes to production code — this PR only fixes the benchmark.

Result

The benchmark now runs to completion without leaking actor systems and produces stable, monotonic numbers (throughput decreases as numberOfIds grows, as expected). This restores it as a usable baseline for subsequent GraphInterpreter work.

JMH on this branch (JDK 25, G1, single thread, -i 5 -wi 3 -f 1 -t 1):

Benchmark                                             (numberOfIds)   Mode  Cnt      Score      Error   Units
InterpreterBenchmark.graph_interpreter_100k_elements              1  thrpt    5  45238.227 ± 3143.242  ops/ms
InterpreterBenchmark.graph_interpreter_100k_elements              5  thrpt    5  10526.376 ±  151.239  ops/ms
InterpreterBenchmark.graph_interpreter_100k_elements             10  thrpt    5   5350.558 ±  192.965  ops/ms

Pre-fix the 5/10-stage rows were both higher than the 1-stage row (i.e. wrong direction) because the singleton-shape bug meant the chain wasn't actually N stages long.

This is a benchmark-correctness fix, not a performance improvement. There is no production-code change here.

Tests

  • sbt 'bench-jmh/compile'
  • sbt 'bench-jmh/headerCheck; bench-jmh/scalafmtCheck'
  • sbt 'bench-jmh/Jmh/run -i 5 -wi 3 -f 1 -t 1 -rf json -rff /tmp/jmh.json .*InterpreterBenchmark.*' — completes cleanly, scores above.

References

None - benchmark-only fix surfaced while preparing to evaluate GraphInterpreter micro-optimizations against a trustworthy baseline.

He-Pin added 2 commits May 21, 2026 12:51
Motivation:
The previous shape `new GraphInterpreterSpecKit { new TestSetup { ... } }` ran
inside @benchmark, so each invocation built (and never tore down) a fresh
ActorSystem. Long iterations exhausted native threads and JMH reported empty
results once the JVM ran out of resources.

Modification:
Make the benchmark class itself extend GraphInterpreterSpecKit so JMH's
@State(Scope.Benchmark) lifecycle reuses one ActorSystem across all
invocations. Add @teardown(Level.Trial) to terminate it cleanly.

Result:
The benchmark now runs to completion and produces stable numbers, which is a
prerequisite for measuring follow-up GraphInterpreter optimizations.

Tests:
sbt 'bench-jmh/compile'
Motivation:
GraphStages.identity is a singleton whose Inlet/Outlet shape is shared across
every reference. Chaining N copies into the assembly (numberOfIds = 5/10)
collapses to a single shape and mis-wires the connections, which surfaced as a
runtime "Cannot pull port twice" error spam during the benchmark and produced
nonsense throughput numbers (5/10 stages reported as faster than 1).

Modification:
Define a local IdentityStage class with its own Inlet/Outlet per instance and
use Vector.fill(numberOfIds)(new IdentityStage[Int]).

Result:
The benchmark wires N distinct stages and produces stable, monotonic numbers
(throughput decreases as numberOfIds grows, as expected).

Tests:
sbt 'bench-jmh/compile'
Copy link
Copy Markdown
Member

@pjfanning pjfanning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@He-Pin He-Pin merged commit 956d478 into apache:main May 21, 2026
9 checks passed
@He-Pin He-Pin deleted the perf-graph-interpreter-jit-friendly branch May 21, 2026 09:14
He-Pin added a commit that referenced this pull request May 21, 2026
…ter chase hot path (#2986)

* test: stop leaking ActorSystem in InterpreterBenchmark per invocation

Motivation:
The previous shape `new GraphInterpreterSpecKit { new TestSetup { ... } }` ran
inside @benchmark, so each invocation built (and never tore down) a fresh
ActorSystem. Long iterations exhausted native threads and JMH reported empty
results once the JVM ran out of resources.

Modification:
Make the benchmark class itself extend GraphInterpreterSpecKit so JMH's
@State(Scope.Benchmark) lifecycle reuses one ActorSystem across all
invocations. Add @teardown(Level.Trial) to terminate it cleanly.

Result:
The benchmark now runs to completion and produces stable numbers, which is a
prerequisite for measuring follow-up GraphInterpreter optimizations.

Tests:
sbt 'bench-jmh/compile'

* test: use per-instance IdentityStage in InterpreterBenchmark

Motivation:
GraphStages.identity is a singleton whose Inlet/Outlet shape is shared across
every reference. Chaining N copies into the assembly (numberOfIds = 5/10)
collapses to a single shape and mis-wires the connections, which surfaced as a
runtime "Cannot pull port twice" error spam during the benchmark and produced
nonsense throughput numbers (5/10 stages reported as faster than 1).

Modification:
Define a local IdentityStage class with its own Inlet/Outlet per instance and
use Vector.fill(numberOfIds)(new IdentityStage[Int]).

Result:
The benchmark wires N distinct stages and produces stable, monotonic numbers
(throughput decreases as numberOfIds grows, as expected).

Tests:
sbt 'bench-jmh/compile'

* optimize: skip afterStageHasRun no-op finalize check in chase hot path

Motivation:
GraphInterpreter's chase loops dominate hot-path CPU in steady state — JMH
stack profiling on InterpreterBenchmark attributes ~50% of stream-related
samples to the two while loops at execute:449 / execute:460. Every chase
iteration calls afterStageHasRun(activeStage), which in steady state always
reads shutdownCounter(activeStage.stageId) and the per-stage finalized flag
only to discover the stage has not just completed and skip the body. That is
a per-event array load + null check + branch on the hottest path with no work
to do, which the JIT cannot fold away because the array is mutable shared
state.

Modification:
Track pendingFinalization: Boolean on the interpreter, set when a stage's
shutdownCounter decrements to 0 in completeConnection or transitions to 0
when KeepGoing is cleared in setKeepGoing. Gate the three hot-path
afterStageHasRun calls in execute() (post normal-dispatch and the two chase
loops) on the flag, resetting it before the call so cascaded completions
during finalization re-arm the flag correctly. The slow-frequency callers
(init, runAsyncInput) are left untouched.

Result:
JMH on InterpreterBenchmark (JDK 25, G1, single thread, -i 5 -wi 3 -f 1 -t 1):

  numberOfIds   baseline (ops/ms)    with patch (ops/ms)    delta
  1             45238 ± 3143         50952 ± 4784           +12.6%
  5             10526 ±  151         11242 ±  288           +6.8%   (CIs disjoint)
  10             5350 ±  193          5927 ±  173           +10.8%  (CIs disjoint)

Allocation rate stays at ~0.6 B/op — no GC impact. All stream-tests pass.

Tests:
- sbt 'stream/compile'
- sbt 'stream/mimaReportBinaryIssues' - clean
- sbt 'stream-tests/testOnly *fusing*' - 159 tests, all passed
- sbt 'stream-tests/testOnly *Flow*Spec' - 1208 tests, all passed
- sbt 'bench-jmh/Jmh/run -i 5 -wi 3 -f 1 -t 1 .*InterpreterBenchmark.*' - numbers above

References:
Refs #2985 - benchmark fix used to obtain trustworthy JMH numbers above.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants