optimize: skip afterStageHasRun no-op finalize check in GraphInterpreter chase hot path by He-Pin · Pull Request #2986 · apache/pekko

He-Pin · 2026-05-21T09:58:29Z

Motivation

GraphInterpreter's chase loops dominate hot-path CPU in steady state. JMH stack profiling on InterpreterBenchmark (numberOfIds=10) attributes ~50% of stream-related samples to the two while loops at execute:449 and execute:460, after deep JIT inlining of processPush / onPush / push / grab / chasePush.

Every chase iteration ends with afterStageHasRun(activeStage), which in steady state always reads shutdownCounter(activeStage.stageId) and the per-stage finalized flag, only to discover the stage has not just completed and skip the body. That is a per-event array load + null check + branch on the hottest path with no work to do, which the JIT cannot fold away because shutdownCounter is a mutable shared array.

Modification

Add pendingFinalization: Boolean on the interpreter, set when a stage's shutdownCounter decrements to 0 in completeConnection, or transitions to 0 when KeepGoing is cleared in setKeepGoing.
Gate the three hot-path afterStageHasRun(activeStage) call sites in execute() (post normal-dispatch and the two chase loops) on the flag, resetting it to false before invoking afterStageHasRun so cascaded completions during finalization re-arm the flag correctly.
The lower-frequency afterStageHasRun callers in init() and runAsyncInput are intentionally left untouched — they run once per stage / per async event and are not on the hot path.

The semantic invariant is preserved: any path that decrements shutdownCounter to 0 sets the flag, so any state where isStageCompleted(activeStage) could newly return true is guaranteed to be observed by the next gated call.

Result

JMH on InterpreterBenchmark (JDK 25, G1, single thread, -i 5 -wi 3 -f 1 -t 1):

numberOfIds   baseline (ops/ms)    with patch (ops/ms)    delta
1             45238 ± 3143         50952 ± 4784           +12.6%
5             10526 ±  151         11242 ±  288            +6.8%   (CIs disjoint)
10             5350 ±  193          5927 ±  173           +10.8%   (CIs disjoint)

numberOfIds=5 and =10 show non-overlapping 99.9% confidence intervals vs the same-tree baseline, so the gain is not noise. Allocation rate stays at ~0.6 B/op (0 GC counts in the measurement window) — no GC impact.

Tests

sbt 'stream/compile'
sbt 'stream/mimaReportBinaryIssues' — clean
sbt 'stream-tests/testOnly *fusing*' — 159 tests, all passed (covers GraphInterpreterSpec, GraphInterpreterPortsSpec, completion / cancel / fail paths)
sbt 'stream-tests/testOnly *Flow*Spec' — 1208 tests, all passed
sbt 'bench-jmh/Jmh/run -i 5 -wi 3 -f 1 -t 1 .*InterpreterBenchmark.*' — numbers above

References

Refs #2985 — relies on the InterpreterBenchmark correctness fix in that PR to obtain trustworthy JMH numbers. This branch contains the two commits from #2985 plus the optimization commit; if #2985 lands first, this branch will be rebased to drop those.

@benchmark

Motivation: The previous shape `new GraphInterpreterSpecKit { new TestSetup { ... } }` ran inside @benchmark, so each invocation built (and never tore down) a fresh ActorSystem. Long iterations exhausted native threads and JMH reported empty results once the JVM ran out of resources. Modification: Make the benchmark class itself extend GraphInterpreterSpecKit so JMH's @State(Scope.Benchmark) lifecycle reuses one ActorSystem across all invocations. Add @teardown(Level.Trial) to terminate it cleanly. Result: The benchmark now runs to completion and produces stable numbers, which is a prerequisite for measuring follow-up GraphInterpreter optimizations. Tests: sbt 'bench-jmh/compile'

Motivation: GraphStages.identity is a singleton whose Inlet/Outlet shape is shared across every reference. Chaining N copies into the assembly (numberOfIds = 5/10) collapses to a single shape and mis-wires the connections, which surfaced as a runtime "Cannot pull port twice" error spam during the benchmark and produced nonsense throughput numbers (5/10 stages reported as faster than 1). Modification: Define a local IdentityStage class with its own Inlet/Outlet per instance and use Vector.fill(numberOfIds)(new IdentityStage[Int]). Result: The benchmark wires N distinct stages and produces stable, monotonic numbers (throughput decreases as numberOfIds grows, as expected). Tests: sbt 'bench-jmh/compile'

Motivation: GraphInterpreter's chase loops dominate hot-path CPU in steady state — JMH stack profiling on InterpreterBenchmark attributes ~50% of stream-related samples to the two while loops at execute:449 / execute:460. Every chase iteration calls afterStageHasRun(activeStage), which in steady state always reads shutdownCounter(activeStage.stageId) and the per-stage finalized flag only to discover the stage has not just completed and skip the body. That is a per-event array load + null check + branch on the hottest path with no work to do, which the JIT cannot fold away because the array is mutable shared state. Modification: Track pendingFinalization: Boolean on the interpreter, set when a stage's shutdownCounter decrements to 0 in completeConnection or transitions to 0 when KeepGoing is cleared in setKeepGoing. Gate the three hot-path afterStageHasRun calls in execute() (post normal-dispatch and the two chase loops) on the flag, resetting it before the call so cascaded completions during finalization re-arm the flag correctly. The slow-frequency callers (init, runAsyncInput) are left untouched. Result: JMH on InterpreterBenchmark (JDK 25, G1, single thread, -i 5 -wi 3 -f 1 -t 1): numberOfIds baseline (ops/ms) with patch (ops/ms) delta 1 45238 ± 3143 50952 ± 4784 +12.6% 5 10526 ± 151 11242 ± 288 +6.8% (CIs disjoint) 10 5350 ± 193 5927 ± 173 +10.8% (CIs disjoint) Allocation rate stays at ~0.6 B/op — no GC impact. All stream-tests pass. Tests: - sbt 'stream/compile' - sbt 'stream/mimaReportBinaryIssues' - clean - sbt 'stream-tests/testOnly *fusing*' - 159 tests, all passed - sbt 'stream-tests/testOnly *Flow*Spec' - 1208 tests, all passed - sbt 'bench-jmh/Jmh/run -i 5 -wi 3 -f 1 -t 1 .*InterpreterBenchmark.*' - numbers above References: Refs apache#2985 - benchmark fix used to obtain trustworthy JMH numbers above.

Copilot

Pull request overview

This PR reduces GraphInterpreter.execute() hot-path overhead by avoiding an unconditional afterStageHasRun(activeStage) check on every dispatch/chase iteration, while also fixing InterpreterBenchmark to provide reliable JMH measurements for interpreter changes.

Changes:

Add a pendingFinalization flag that is set when a stage’s shutdownCounter transitions to 0, and use it to gate afterStageHasRun in the main dispatch path and both chase loops.
Update shutdown bookkeeping (completeConnection, setKeepGoing(false)) to arm pendingFinalization when a stage becomes eligible for finalization.
Fix InterpreterBenchmark lifecycle (reuse a single ActorSystem per trial + proper teardown) and replace GraphStages.identity usage with a per-instance identity stage to avoid shape reuse/miswiring.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
stream/src/main/scala/org/apache/pekko/stream/impl/fusing/GraphInterpreter.scala	Introduces `pendingFinalization` and gates `afterStageHasRun` in execute/chase hot paths; arms the flag when stages newly complete.
bench-jmh/src/main/scala/org/apache/pekko/stream/InterpreterBenchmark.scala	Makes the benchmark state reuse one `ActorSystem`, adds teardown, and uses a per-instance identity `GraphStage` to avoid singleton-shape wiring issues.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pjfanning

lgtm

He-Pin added 3 commits May 21, 2026 17:22

He-Pin requested review from Copilot and pjfanning May 21, 2026 11:15

Copilot started reviewing on behalf of He-Pin May 21, 2026 11:15 View session

He-Pin added this to the 2.0.0-M3 milestone May 21, 2026

Copilot AI reviewed May 21, 2026

View reviewed changes

pjfanning approved these changes May 21, 2026

View reviewed changes

He-Pin merged commit a505843 into apache:main May 21, 2026
9 checks passed

He-Pin deleted the perf-graph-interpreter-skip-finalize-check branch May 21, 2026 11:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimize: skip afterStageHasRun no-op finalize check in GraphInterpreter chase hot path#2986

optimize: skip afterStageHasRun no-op finalize check in GraphInterpreter chase hot path#2986
He-Pin merged 3 commits into
apache:mainfrom
He-Pin:perf-graph-interpreter-skip-finalize-check

He-Pin commented May 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

pjfanning left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

He-Pin commented May 21, 2026

Motivation

Modification

Result

Tests

References

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

pjfanning left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants