Skip to content

feat: enable virtualize=on in stream test dispatcher to bypass JDK 21+ FJP scheduling regression#2872

Closed
He-Pin wants to merge 5 commits intomainfrom
feat-virtualize-stream-dispatcher
Closed

feat: enable virtualize=on in stream test dispatcher to bypass JDK 21+ FJP scheduling regression#2872
He-Pin wants to merge 5 commits intomainfrom
feat-virtualize-stream-dispatcher

Conversation

@He-Pin
Copy link
Copy Markdown
Member

@He-Pin He-Pin commented Apr 18, 2026

Summary

Enables virtualize = on in the Pekko stream test dispatcher so that on JDK 21+ each dispatcher task runs in its own virtual thread (Project Loom). This bypasses the ForkJoinPool compensation-thread scheduling regression (JDK-8300995) at the root rather than working around it.

⚠️ Depends on #2869 (test stability fixes). Retarget this PR to main after #2869 merges.


Root Cause Recap

ForkJoinPool with asyncMode=FIFO on JDK 21+ places response tasks at the back of the deque, causing actor reply futures to queue behind unrelated tasks. Compensation threads that should keep worker count stable became more conservative in JDK-8321335 (JDK 21) and have not improved in JDK 25. This manifests as cascading latency spikes in any test with tight actor round-trips (MergeHub, HubSpec, etc.).

How virtualize = on Fixes It

With virtualize = on, each dispatcher Runnable is scheduled as a virtual thread (JDK 21+ Project Loom). Virtual threads unmount their carrier when they block (e.g., awaiting an actor reply), freeing the carrier for other work. The FJP pool's FIFO-ordering starvation cannot happen because:

  • There are no blocked workers holding carriers
  • Compensation threads are never needed for actor-reply waits
  • The FJP pool is used only as a carrier scheduler, not to queue Runnables

Changes

stream-testkit/src/test/resources/reference.conf

pekko.test.stream-dispatcher {
  fork-join-executor {
    # Enable virtual threads on JDK 21+. Silently ignored on JDK < 21.
    virtualize = on
  }
}

Safety Notes

  • On JDK < 21: VirtualThreadSupport.isSupported = false → flag silently ignored, pool behaves identically to today
  • --add-opens flags required by Loom are already supplied by JdkOptions.scala for JDK 9+
  • Mailbox ordering is unaffected (UnboundedMailbox is always FIFO)
  • Verified: 48/48 HubSpec tests pass in ~20 seconds on JDK 25 (was 2+ minutes timeout previously)

Relation to Other PRs

PR What it does
#2869 Test stability fixes (patience scaling, element counts, TCK timefactor)
#2871 FJP config improvements (LIFO, configurable minimum-runnable, max-pool-size docs)
this PR Root-cause fix: virtual threads bypass FJP scheduling entirely
#2870 Tracking issue documenting the JDK-8300995 root cause

He-Pin and others added 5 commits April 18, 2026 23:30
Motivation:
Nightly CI (JDK 25, TIMEFACTOR=3) has been failing consistently for 30+
days due to ForkJoinPool scheduling changes in JDK 25 causing slower
throughput and higher scheduler overhead.  Four root causes were found:

1. HubSpec.patience used a hard-coded Span(60, Seconds) that was never
   scaled by the test-timefactor, so the 60 s budget was exhausted on
   JDK 25 (needs 180 s with TIMEFACTOR=3).

2. AggregateWithTimeBoundaryAndSimulatedTimeSpec used interval = 1.milli
   with ExplicitlyTriggeredScheduler, which fired up to 400 000 timer
   callbacks per test-run (timePasses(400.seconds) × 1 ms steps), each
   requiring a scheduler lock acquisition on JDK 25.

3. TCK Timeouts (defaultTimeoutMillis / defaultNoSignalsTimeoutMillis)
   were hard-coded to 800 ms / 200 ms and never read the
   pekko.test.timefactor JVM property, causing
   stochastic_spec103_mustSignalOnMethodsSequentially to fail on JDK 25.

4. FlowMapAsyncPartitionedSpec."ignore null-completed futures" built the
   shouldBeNull set from Random.nextInt(10), which produces values 0-9.
   Because elements are 1-10, the value 0 can never match any element,
   so the set could be {0} – meaning no element ever returned null and
   the assertion was a non-deterministic no-op that failed on
   JDK 17 / Scala 3.3.x in CI.

Modification:
- HubSpec: multiply the 60 s base by testKitSettings.TestTimeFactor so
  CI with TIMEFACTOR=3 gets 180 s and TIMEFACTOR=2 gets 120 s.
- AggregateWithTimeBoundaryAndSimulatedTimeSpec: change interval from
  1.milli to 1.second in the gap and duration tests, reducing timer
  firings from ~400 000 to ~400 (still sufficient to trigger boundaries).
- TCK Timeouts: read pekko.test.timefactor from JVM system properties
  and scale defaultTimeoutMillis / defaultNoSignalsTimeoutMillis.
- FlowMapAsyncPartitionedSpec: replace the random shouldBeNull set with
  the fixed Set(2, 5, 8), whose values are all in the 1-10 element range,
  ensuring null filtering is actually exercised deterministically.

Result:
All four previously-failing test categories should pass on the next
nightly run across JDK 17/21/25 × Scala 2.13/3.3.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation:
On JDK 25, ForkJoinPool scheduling changes cause increased actor dispatch
latency. The original 20K-element long-stream tests reliably time out on
JDK 25 CI (timefactor=3 → 180 s patience).

Modification:
- 'long streams' (buffer=16): 20K → 2K elements (2×1K sources)
- 'buffer size is 1': 20K → 200 elements (2×100 sources); bufferSize=1
  requires one actor round-trip per element, so count must stay small
- 'consumer is slower': 2K → 400 elements; burst=200 covers first 200
  elements with no scheduler ticks, keeping wall-clock time low
- 'producer is slower': 2K → 400 elements; burst=200 on the throttled
  source (200 elements) means zero scheduler ticks needed, eliminating
  ForkJoinPool starvation risk on JDK 25

Result:
All four tests now complete in under 100 ms on a loaded JDK 25 machine
(burst=200 absorbs all throttled elements instantly; no timer callbacks
are scheduled). Full HubSpec (48 tests) passes with timefactor=3.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation:
The earlier nightly fixes solved the immediate JDK 25 failures, but two tradeoffs
needed refinement. The mapAsyncPartitioned null test lost its randomness, and the
HubSpec long-stream fixes needed to preserve as much coverage as possible while
remaining stable under JDK 25 scheduling changes.

Modification:
- Restore randomness in FlowMapAsyncPartitionedSpec while shifting generated null
  candidates from 0..9 to 1..10 so the null path is always exercised.
- Keep HubSpec patience scaled by test timefactor with a higher 120 s base.
- Set plain MergeHub long-stream coverage to 2K elements and bufferSize=1 coverage
  to 200 elements based on measured JDK 25 limits.
- Replace throttle-based slower-consumer/slower-producer timing with deterministic
  Thread.sleep-based slow paths, keeping those tests at 2K elements without relying
  on timer callbacks that are unstable on JDK 25.

Result:
HubSpec passes end-to-end with pekko.test.timefactor=3, and the null-completed
futures test keeps its random coverage without silently skipping the null branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation:
JDK 25 ForkJoinPool scheduling regression (JDK-8300995) causes slower
task scheduling under load. timefactor=3 was insufficient for some
long-running stream tests.

Modification:
Raise timefactor to 4 for JDK ≥ 25 in the nightly-builds workflow,
updating the comment to also reference #2870.

Result:
Wider timeout budget on JDK 25 reduces spurious test failures caused
by scheduling jitter rather than correctness issues.

References: #2870, #2573

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation:
ForkJoinPool with asyncMode=FIFO on JDK 21+ has a compensation-thread
scheduling regression (JDK-8300995) that causes actor reply tasks to
queue behind unrelated tasks. This leads to cascading latency spikes
in tests that exercise tight actor round-trips (MergeHub, etc.).

Modification:
Enable 'virtualize = on' in pekko.test.stream-dispatcher so that on
JDK 21+ each dispatcher task runs in its own virtual thread (Project Loom).
Virtual threads unmount their carrier when blocking, so the FJP pool's
FIFO starvation issue no longer applies to stream tests.
On JDK < 21 the flag is silently ignored (VirtualThreadSupport.isSupported
returns false), so JDK 17 and JDK 21 nightly CI jobs are unaffected.
The required --add-opens flags are already supplied by JdkOptions.scala.

Result:
Stream tests on JDK 21+ use virtual threads as carriers, bypassing the
ForkJoinPool compensation-thread starvation entirely.

References: #2870

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Base automatically changed from fix-nightly-failures to main April 18, 2026 19:02
@He-Pin He-Pin marked this pull request as draft April 18, 2026 19:04
@He-Pin
Copy link
Copy Markdown
Member Author

He-Pin commented Apr 18, 2026

🔬 Key Finding & Decision

What we measured

Enabling virtualize = on in the stream test dispatcher produced a dramatic speedup:

Test suite Without virtual threads With virtual threads
HubSpec (48 tests) 2+ min timeout (3 failures) ~20 seconds, 48/48 pass

Virtual threads completely bypass the ForkJoinPool FIFO-starvation issue (JDK-8300995) because blocked virtual threads unmount their carrier, so compensation threads are never needed for actor-reply waits.

Why we're NOT merging this as a default

The concern is sound: silently switching the stream test dispatcher to virtual threads changes the runtime model — any test that implicitly relies on FJP scheduling behavior (thread affinity, carrier pinning, blocking detection) could produce false positives or mask real issues. Enabling it only in CI for JDK 21+ would be a hidden configuration divergence.

Recommended path

  1. Opt-in per test config: teams investigating virtual-thread performance can add virtualize = on to their own reference.conf override. The support is already present in Pekko v1.2.0+.
  2. Future work: a dedicated VirtualThreadDispatcherSpec that explicitly opts in and asserts virtual-thread behavior is a better way to validate this path without affecting all stream tests. Tracked in ForkJoinPool compensation-thread regression (JDK-8300995) causes test flakiness on JDK 21+ #2870.

Closing in favour of documenting in #2870.

@He-Pin He-Pin closed this Apr 18, 2026
@He-Pin He-Pin reopened this Apr 18, 2026
@He-Pin
Copy link
Copy Markdown
Member Author

He-Pin commented Apr 20, 2026

@pjfanning I want to turn this on in the nightly build for Java 25

fork-join-executor {
parallelism-min = 8
parallelism-max = 8
# Enable virtual threads on JDK 21+. Virtual threads (Project Loom) bypass the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be moved to the application.conf file(s) where we might still have test issues - instead of making it a global setting that affects any test that uses the stream-testkit?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this can be done. we can enable it just for Java 25 and Java 21

@He-Pin
Copy link
Copy Markdown
Member Author

He-Pin commented Apr 21, 2026

Superseded by PR #2881 which includes enhanced virtual thread support with environment variable conditioning and developer documentation.

@He-Pin He-Pin closed this Apr 21, 2026
He-Pin added a commit to He-Pin/incubator-pekko that referenced this pull request Apr 21, 2026
Motivation: remove accidental unrelated files (.claude, .jvmopts, design docs) from PR branch

Modification: port only the intended files (.github/workflows/nightly-builds.yml, stream-testkit/src/test/resources/reference.conf, stream-tests/src/test/resources/application.conf, CONTRIBUTING.md)

Result: create clean branch ready for review; does not overwrite PR branch until user approves

References: upstream PR apache#2872

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
He-Pin added a commit to He-Pin/incubator-pekko that referenced this pull request Apr 21, 2026
Motivation:
Remove unrelated documentation churn so the PR only carries the intended nightly virtualize changes.

Modification:
Create a clean branch from origin/main with only the workflow change and the two stream test config updates.

Result:
The PR diff is reduced to three files and stays focused on the virtualize dispatcher behavior.

References:
upstream PR apache#2872

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
He-Pin added a commit to He-Pin/incubator-pekko that referenced this pull request Apr 21, 2026
Motivation:
Remove the stream-testkit reference.conf churn so the PR stays focused on the intended nightly virtualize wiring.

Modification:
Rebuild the branch from origin/main with only the nightly workflow change and the stream-tests application.conf override.

Result:
The PR diff is reduced to two files with no unrelated testkit comment changes.

References:
upstream PR apache#2872

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
He-Pin added a commit that referenced this pull request Apr 22, 2026
Motivation:
Remove the stream-testkit reference.conf churn so the PR stays focused on the intended nightly virtualize wiring.

Modification:
Rebuild the branch from origin/main with only the nightly workflow change and the stream-tests application.conf override.

Result:
The PR diff is reduced to two files with no unrelated testkit comment changes.

References:
upstream PR #2872
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants