Skip to content

optimize: skip afterStageHasRun no-op finalize check in GraphInterpreter chase hot path#2986

Merged
He-Pin merged 3 commits into
apache:mainfrom
He-Pin:perf-graph-interpreter-skip-finalize-check
May 21, 2026
Merged

optimize: skip afterStageHasRun no-op finalize check in GraphInterpreter chase hot path#2986
He-Pin merged 3 commits into
apache:mainfrom
He-Pin:perf-graph-interpreter-skip-finalize-check

Conversation

@He-Pin
Copy link
Copy Markdown
Member

@He-Pin He-Pin commented May 21, 2026

Motivation

GraphInterpreter's chase loops dominate hot-path CPU in steady state. JMH stack profiling on InterpreterBenchmark (numberOfIds=10) attributes ~50% of stream-related samples to the two while loops at execute:449 and execute:460, after deep JIT inlining of processPush / onPush / push / grab / chasePush.

Every chase iteration ends with afterStageHasRun(activeStage), which in steady state always reads shutdownCounter(activeStage.stageId) and the per-stage finalized flag, only to discover the stage has not just completed and skip the body. That is a per-event array load + null check + branch on the hottest path with no work to do, which the JIT cannot fold away because shutdownCounter is a mutable shared array.

Modification

  • Add pendingFinalization: Boolean on the interpreter, set when a stage's shutdownCounter decrements to 0 in completeConnection, or transitions to 0 when KeepGoing is cleared in setKeepGoing.
  • Gate the three hot-path afterStageHasRun(activeStage) call sites in execute() (post normal-dispatch and the two chase loops) on the flag, resetting it to false before invoking afterStageHasRun so cascaded completions during finalization re-arm the flag correctly.
  • The lower-frequency afterStageHasRun callers in init() and runAsyncInput are intentionally left untouched — they run once per stage / per async event and are not on the hot path.

The semantic invariant is preserved: any path that decrements shutdownCounter to 0 sets the flag, so any state where isStageCompleted(activeStage) could newly return true is guaranteed to be observed by the next gated call.

Result

JMH on InterpreterBenchmark (JDK 25, G1, single thread, -i 5 -wi 3 -f 1 -t 1):

numberOfIds   baseline (ops/ms)    with patch (ops/ms)    delta
1             45238 ± 3143         50952 ± 4784           +12.6%
5             10526 ±  151         11242 ±  288            +6.8%   (CIs disjoint)
10             5350 ±  193          5927 ±  173           +10.8%   (CIs disjoint)

numberOfIds=5 and =10 show non-overlapping 99.9% confidence intervals vs the same-tree baseline, so the gain is not noise. Allocation rate stays at ~0.6 B/op (0 GC counts in the measurement window) — no GC impact.

Tests

  • sbt 'stream/compile'
  • sbt 'stream/mimaReportBinaryIssues' — clean
  • sbt 'stream-tests/testOnly *fusing*' — 159 tests, all passed (covers GraphInterpreterSpec, GraphInterpreterPortsSpec, completion / cancel / fail paths)
  • sbt 'stream-tests/testOnly *Flow*Spec' — 1208 tests, all passed
  • sbt 'bench-jmh/Jmh/run -i 5 -wi 3 -f 1 -t 1 .*InterpreterBenchmark.*' — numbers above

References

Refs #2985 — relies on the InterpreterBenchmark correctness fix in that PR to obtain trustworthy JMH numbers. This branch contains the two commits from #2985 plus the optimization commit; if #2985 lands first, this branch will be rebased to drop those.

He-Pin added 3 commits May 21, 2026 17:22
Motivation:
The previous shape `new GraphInterpreterSpecKit { new TestSetup { ... } }` ran
inside @benchmark, so each invocation built (and never tore down) a fresh
ActorSystem. Long iterations exhausted native threads and JMH reported empty
results once the JVM ran out of resources.

Modification:
Make the benchmark class itself extend GraphInterpreterSpecKit so JMH's
@State(Scope.Benchmark) lifecycle reuses one ActorSystem across all
invocations. Add @teardown(Level.Trial) to terminate it cleanly.

Result:
The benchmark now runs to completion and produces stable numbers, which is a
prerequisite for measuring follow-up GraphInterpreter optimizations.

Tests:
sbt 'bench-jmh/compile'
Motivation:
GraphStages.identity is a singleton whose Inlet/Outlet shape is shared across
every reference. Chaining N copies into the assembly (numberOfIds = 5/10)
collapses to a single shape and mis-wires the connections, which surfaced as a
runtime "Cannot pull port twice" error spam during the benchmark and produced
nonsense throughput numbers (5/10 stages reported as faster than 1).

Modification:
Define a local IdentityStage class with its own Inlet/Outlet per instance and
use Vector.fill(numberOfIds)(new IdentityStage[Int]).

Result:
The benchmark wires N distinct stages and produces stable, monotonic numbers
(throughput decreases as numberOfIds grows, as expected).

Tests:
sbt 'bench-jmh/compile'
Motivation:
GraphInterpreter's chase loops dominate hot-path CPU in steady state — JMH
stack profiling on InterpreterBenchmark attributes ~50% of stream-related
samples to the two while loops at execute:449 / execute:460. Every chase
iteration calls afterStageHasRun(activeStage), which in steady state always
reads shutdownCounter(activeStage.stageId) and the per-stage finalized flag
only to discover the stage has not just completed and skip the body. That is
a per-event array load + null check + branch on the hottest path with no work
to do, which the JIT cannot fold away because the array is mutable shared
state.

Modification:
Track pendingFinalization: Boolean on the interpreter, set when a stage's
shutdownCounter decrements to 0 in completeConnection or transitions to 0
when KeepGoing is cleared in setKeepGoing. Gate the three hot-path
afterStageHasRun calls in execute() (post normal-dispatch and the two chase
loops) on the flag, resetting it before the call so cascaded completions
during finalization re-arm the flag correctly. The slow-frequency callers
(init, runAsyncInput) are left untouched.

Result:
JMH on InterpreterBenchmark (JDK 25, G1, single thread, -i 5 -wi 3 -f 1 -t 1):

  numberOfIds   baseline (ops/ms)    with patch (ops/ms)    delta
  1             45238 ± 3143         50952 ± 4784           +12.6%
  5             10526 ±  151         11242 ±  288           +6.8%   (CIs disjoint)
  10             5350 ±  193          5927 ±  173           +10.8%  (CIs disjoint)

Allocation rate stays at ~0.6 B/op — no GC impact. All stream-tests pass.

Tests:
- sbt 'stream/compile'
- sbt 'stream/mimaReportBinaryIssues' - clean
- sbt 'stream-tests/testOnly *fusing*' - 159 tests, all passed
- sbt 'stream-tests/testOnly *Flow*Spec' - 1208 tests, all passed
- sbt 'bench-jmh/Jmh/run -i 5 -wi 3 -f 1 -t 1 .*InterpreterBenchmark.*' - numbers above

References:
Refs apache#2985 - benchmark fix used to obtain trustworthy JMH numbers above.
@He-Pin He-Pin requested review from Copilot and pjfanning May 21, 2026 11:15
@He-Pin He-Pin added this to the 2.0.0-M3 milestone May 21, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces GraphInterpreter.execute() hot-path overhead by avoiding an unconditional afterStageHasRun(activeStage) check on every dispatch/chase iteration, while also fixing InterpreterBenchmark to provide reliable JMH measurements for interpreter changes.

Changes:

  • Add a pendingFinalization flag that is set when a stage’s shutdownCounter transitions to 0, and use it to gate afterStageHasRun in the main dispatch path and both chase loops.
  • Update shutdown bookkeeping (completeConnection, setKeepGoing(false)) to arm pendingFinalization when a stage becomes eligible for finalization.
  • Fix InterpreterBenchmark lifecycle (reuse a single ActorSystem per trial + proper teardown) and replace GraphStages.identity usage with a per-instance identity stage to avoid shape reuse/miswiring.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
stream/src/main/scala/org/apache/pekko/stream/impl/fusing/GraphInterpreter.scala Introduces pendingFinalization and gates afterStageHasRun in execute/chase hot paths; arms the flag when stages newly complete.
bench-jmh/src/main/scala/org/apache/pekko/stream/InterpreterBenchmark.scala Makes the benchmark state reuse one ActorSystem, adds teardown, and uses a per-instance identity GraphStage to avoid singleton-shape wiring issues.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Member

@pjfanning pjfanning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@He-Pin He-Pin merged commit a505843 into apache:main May 21, 2026
9 checks passed
@He-Pin He-Pin deleted the perf-graph-interpreter-skip-finalize-check branch May 21, 2026 11:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants