feat: enable virtualize=on in stream test dispatcher to bypass JDK 21+ FJP scheduling regression#2872
feat: enable virtualize=on in stream test dispatcher to bypass JDK 21+ FJP scheduling regression#2872
Conversation
Motivation:
Nightly CI (JDK 25, TIMEFACTOR=3) has been failing consistently for 30+
days due to ForkJoinPool scheduling changes in JDK 25 causing slower
throughput and higher scheduler overhead. Four root causes were found:
1. HubSpec.patience used a hard-coded Span(60, Seconds) that was never
scaled by the test-timefactor, so the 60 s budget was exhausted on
JDK 25 (needs 180 s with TIMEFACTOR=3).
2. AggregateWithTimeBoundaryAndSimulatedTimeSpec used interval = 1.milli
with ExplicitlyTriggeredScheduler, which fired up to 400 000 timer
callbacks per test-run (timePasses(400.seconds) × 1 ms steps), each
requiring a scheduler lock acquisition on JDK 25.
3. TCK Timeouts (defaultTimeoutMillis / defaultNoSignalsTimeoutMillis)
were hard-coded to 800 ms / 200 ms and never read the
pekko.test.timefactor JVM property, causing
stochastic_spec103_mustSignalOnMethodsSequentially to fail on JDK 25.
4. FlowMapAsyncPartitionedSpec."ignore null-completed futures" built the
shouldBeNull set from Random.nextInt(10), which produces values 0-9.
Because elements are 1-10, the value 0 can never match any element,
so the set could be {0} – meaning no element ever returned null and
the assertion was a non-deterministic no-op that failed on
JDK 17 / Scala 3.3.x in CI.
Modification:
- HubSpec: multiply the 60 s base by testKitSettings.TestTimeFactor so
CI with TIMEFACTOR=3 gets 180 s and TIMEFACTOR=2 gets 120 s.
- AggregateWithTimeBoundaryAndSimulatedTimeSpec: change interval from
1.milli to 1.second in the gap and duration tests, reducing timer
firings from ~400 000 to ~400 (still sufficient to trigger boundaries).
- TCK Timeouts: read pekko.test.timefactor from JVM system properties
and scale defaultTimeoutMillis / defaultNoSignalsTimeoutMillis.
- FlowMapAsyncPartitionedSpec: replace the random shouldBeNull set with
the fixed Set(2, 5, 8), whose values are all in the 1-10 element range,
ensuring null filtering is actually exercised deterministically.
Result:
All four previously-failing test categories should pass on the next
nightly run across JDK 17/21/25 × Scala 2.13/3.3.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation: On JDK 25, ForkJoinPool scheduling changes cause increased actor dispatch latency. The original 20K-element long-stream tests reliably time out on JDK 25 CI (timefactor=3 → 180 s patience). Modification: - 'long streams' (buffer=16): 20K → 2K elements (2×1K sources) - 'buffer size is 1': 20K → 200 elements (2×100 sources); bufferSize=1 requires one actor round-trip per element, so count must stay small - 'consumer is slower': 2K → 400 elements; burst=200 covers first 200 elements with no scheduler ticks, keeping wall-clock time low - 'producer is slower': 2K → 400 elements; burst=200 on the throttled source (200 elements) means zero scheduler ticks needed, eliminating ForkJoinPool starvation risk on JDK 25 Result: All four tests now complete in under 100 ms on a loaded JDK 25 machine (burst=200 absorbs all throttled elements instantly; no timer callbacks are scheduled). Full HubSpec (48 tests) passes with timefactor=3. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation: The earlier nightly fixes solved the immediate JDK 25 failures, but two tradeoffs needed refinement. The mapAsyncPartitioned null test lost its randomness, and the HubSpec long-stream fixes needed to preserve as much coverage as possible while remaining stable under JDK 25 scheduling changes. Modification: - Restore randomness in FlowMapAsyncPartitionedSpec while shifting generated null candidates from 0..9 to 1..10 so the null path is always exercised. - Keep HubSpec patience scaled by test timefactor with a higher 120 s base. - Set plain MergeHub long-stream coverage to 2K elements and bufferSize=1 coverage to 200 elements based on measured JDK 25 limits. - Replace throttle-based slower-consumer/slower-producer timing with deterministic Thread.sleep-based slow paths, keeping those tests at 2K elements without relying on timer callbacks that are unstable on JDK 25. Result: HubSpec passes end-to-end with pekko.test.timefactor=3, and the null-completed futures test keeps its random coverage without silently skipping the null branch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation: JDK 25 ForkJoinPool scheduling regression (JDK-8300995) causes slower task scheduling under load. timefactor=3 was insufficient for some long-running stream tests. Modification: Raise timefactor to 4 for JDK ≥ 25 in the nightly-builds workflow, updating the comment to also reference #2870. Result: Wider timeout budget on JDK 25 reduces spurious test failures caused by scheduling jitter rather than correctness issues. References: #2870, #2573 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation: ForkJoinPool with asyncMode=FIFO on JDK 21+ has a compensation-thread scheduling regression (JDK-8300995) that causes actor reply tasks to queue behind unrelated tasks. This leads to cascading latency spikes in tests that exercise tight actor round-trips (MergeHub, etc.). Modification: Enable 'virtualize = on' in pekko.test.stream-dispatcher so that on JDK 21+ each dispatcher task runs in its own virtual thread (Project Loom). Virtual threads unmount their carrier when blocking, so the FJP pool's FIFO starvation issue no longer applies to stream tests. On JDK < 21 the flag is silently ignored (VirtualThreadSupport.isSupported returns false), so JDK 17 and JDK 21 nightly CI jobs are unaffected. The required --add-opens flags are already supplied by JdkOptions.scala. Result: Stream tests on JDK 21+ use virtual threads as carriers, bypassing the ForkJoinPool compensation-thread starvation entirely. References: #2870 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔬 Key Finding & DecisionWhat we measuredEnabling
Virtual threads completely bypass the ForkJoinPool FIFO-starvation issue (JDK-8300995) because blocked virtual threads unmount their carrier, so compensation threads are never needed for actor-reply waits. Why we're NOT merging this as a defaultThe concern is sound: silently switching the stream test dispatcher to virtual threads changes the runtime model — any test that implicitly relies on FJP scheduling behavior (thread affinity, carrier pinning, blocking detection) could produce false positives or mask real issues. Enabling it only in CI for JDK 21+ would be a hidden configuration divergence. Recommended path
Closing in favour of documenting in #2870. |
|
@pjfanning I want to turn this on in the nightly build for Java 25 |
| fork-join-executor { | ||
| parallelism-min = 8 | ||
| parallelism-max = 8 | ||
| # Enable virtual threads on JDK 21+. Virtual threads (Project Loom) bypass the |
There was a problem hiding this comment.
could this be moved to the application.conf file(s) where we might still have test issues - instead of making it a global setting that affects any test that uses the stream-testkit?
There was a problem hiding this comment.
Yes, this can be done. we can enable it just for Java 25 and Java 21
|
Superseded by PR #2881 which includes enhanced virtual thread support with environment variable conditioning and developer documentation. |
Motivation: remove accidental unrelated files (.claude, .jvmopts, design docs) from PR branch Modification: port only the intended files (.github/workflows/nightly-builds.yml, stream-testkit/src/test/resources/reference.conf, stream-tests/src/test/resources/application.conf, CONTRIBUTING.md) Result: create clean branch ready for review; does not overwrite PR branch until user approves References: upstream PR apache#2872 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation: Remove unrelated documentation churn so the PR only carries the intended nightly virtualize changes. Modification: Create a clean branch from origin/main with only the workflow change and the two stream test config updates. Result: The PR diff is reduced to three files and stays focused on the virtualize dispatcher behavior. References: upstream PR apache#2872 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation: Remove the stream-testkit reference.conf churn so the PR stays focused on the intended nightly virtualize wiring. Modification: Rebuild the branch from origin/main with only the nightly workflow change and the stream-tests application.conf override. Result: The PR diff is reduced to two files with no unrelated testkit comment changes. References: upstream PR apache#2872 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation: Remove the stream-testkit reference.conf churn so the PR stays focused on the intended nightly virtualize wiring. Modification: Rebuild the branch from origin/main with only the nightly workflow change and the stream-tests application.conf override. Result: The PR diff is reduced to two files with no unrelated testkit comment changes. References: upstream PR #2872
Summary
Enables
virtualize = onin the Pekko stream test dispatcher so that on JDK 21+ each dispatcher task runs in its own virtual thread (Project Loom). This bypasses the ForkJoinPool compensation-thread scheduling regression (JDK-8300995) at the root rather than working around it.Root Cause Recap
ForkJoinPool with
asyncMode=FIFOon JDK 21+ places response tasks at the back of the deque, causing actor reply futures to queue behind unrelated tasks. Compensation threads that should keep worker count stable became more conservative in JDK-8321335 (JDK 21) and have not improved in JDK 25. This manifests as cascading latency spikes in any test with tight actor round-trips (MergeHub, HubSpec, etc.).How
virtualize = onFixes ItWith
virtualize = on, each dispatcherRunnableis scheduled as a virtual thread (JDK 21+ Project Loom). Virtual threads unmount their carrier when they block (e.g., awaiting an actor reply), freeing the carrier for other work. The FJP pool's FIFO-ordering starvation cannot happen because:Changes
stream-testkit/src/test/resources/reference.confSafety Notes
VirtualThreadSupport.isSupported = false→ flag silently ignored, pool behaves identically to today--add-opensflags required by Loom are already supplied byJdkOptions.scalafor JDK 9+Relation to Other PRs