fix: stabilise JDK 21+ / JDK 25 nightly test runs #2889
Merged
He-Pin merged 2 commits into apache:main on Apr 23, 2026
Motivation:
The JDK 21+ ForkJoinPool compensation-thread regression (JDK-8300995 / JDK-8321335) starves Pekko's actor and remote dispatchers during heavy ask/await workloads in tests, producing intermittent timeouts on the nightly matrix (see apache#2573, apache#2870). Several individual tests also rely on hardcoded timeouts that never scale with `pekko.test.timefactor`, so they flake even when the dispatcher itself is healthy.

Modification:
- nightly-builds.yml: when the JDK is 21 or newer, raise `fork-join-executor.minimum-runnable` to 4 for both `pekko.actor.default-dispatcher` and `pekko.remote.default-remote-dispatcher`. This pushes the pool to spawn compensation threads earlier and avoids long stalls under the new compensation policy without changing scheduling semantics (FIFO unchanged, no fairness regressions).
- TlsSpec: dilate the previously hardcoded 15s/17s timeouts in the ServerInitiatesViaTcp / CancellingRHSIgnoresBoth scenario so the timefactor actually applies on slower CI workers.
- EventSourcedStashOverflowSpec: pass an explicit 30s budget to `receiveMessages(stashCapacity)` (the default 12s on JDK 25 is not enough to drain 20k messages on slow runners).
- SteppingInmemJournal.step: bump the per-step ask timeout from 3s (dilated) to 10s (dilated) so PersistentActorRecoveryTimeoutSpec and similar tests retain headroom on JDK 17 with timefactor=2.

Result:
JDK 21+ nightly runs stop hitting compensation-thread starvation in the artery / remoting suites, and the four targeted timing fixes remove specific flakes observed in the latest nightly run on JDK 17 / 25. Production defaults are unchanged - the FJP override is applied only in the workflow.
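To make the "raise `minimum-runnable`, keep FIFO" idea concrete, here is a minimal sketch using the plain JDK `ForkJoinPool` constructor that exposes both knobs. It is only an illustration: Pekko builds its pools through its own configurator, and the class name `FifoPoolSketch`, the helper `newPool`, and the pool sizes chosen here are assumptions, not code from this PR.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not Pekko's dispatcher wiring.
final class FifoPoolSketch {

    // Builds a FIFO (asyncMode = true) pool with an explicit minimumRunnable.
    static ForkJoinPool newPool(int parallelism, int minimumRunnable) {
        return new ForkJoinPool(
            parallelism,
            ForkJoinPool.defaultForkJoinWorkerThreadFactory,
            null,                 // no uncaught-exception handler
            true,                 // asyncMode: FIFO scheduling stays as it was
            parallelism,          // corePoolSize (assumed equal to parallelism)
            parallelism + 256,    // maximumPoolSize (assumed headroom for compensation threads)
            minimumRunnable,      // the knob the workflow raises to 4 on JDK 21+
            null,                 // default saturation behaviour
            60L, TimeUnit.SECONDS // keep-alive for idle threads
        );
    }

    public static void main(String[] args) {
        ForkJoinPool pool = newPool(4, 4);
        System.out.println("asyncMode (FIFO) = " + pool.getAsyncMode());
        System.out.println("task result = " + pool.submit(() -> 21 * 2).join());
        pool.shutdown();
    }
}
```

Only the `minimumRunnable` argument differs from a pool built with the previous value of 1; the scheduling mode is untouched, which matches the "FIFO unchanged" claim above.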
Motivation:
The "resume after multiple failures if resume supervision is in place" case times out after the default 6-second ScalaFutures patience on the JDK 25 nightly matrix. The sibling suite stream-typed-tests MapAsyncPartitionedSpec received the same dilation treatment in apache#2884; this spec was missed.

Modification:
Override the spec's patienceConfig to use a dilated 30-second timeout so the whole suite tracks pekko.test.timefactor instead of a fixed 6s.

Result:
Supervised-resume and bulk-throughput cases no longer flake on JDK 25 CI when the environment is under contention.
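The arithmetic behind "dilated" timeouts can be sketched as follows. This is not the Pekko testkit implementation (there, `.dilated` reads `pekko.test.timefactor` from the test configuration); the class `DilationSketch` and the helper `dilated` are hypothetical names used only to show how a base timeout scales with the factor.

```java
import java.time.Duration;

// Hypothetical illustration of timeout dilation, not the testkit's code.
final class DilationSketch {

    // Scales a base timeout by the time factor, rounding to whole milliseconds.
    static Duration dilated(Duration base, double timeFactor) {
        return Duration.ofMillis(Math.round(base.toMillis() * timeFactor));
    }

    public static void main(String[] args) {
        double timeFactor = 2.0; // assumed value, as in the JDK 17 run cited in this PR
        System.out.println(dilated(Duration.ofSeconds(3), timeFactor));  // prints PT6S
        System.out.println(dilated(Duration.ofSeconds(10), timeFactor)); // prints PT20S
        System.out.println(dilated(Duration.ofSeconds(30), timeFactor)); // prints PT1M
    }
}
```

A hardcoded timeout corresponds to skipping the `dilated` call entirely, so it stays at its base value no matter how slow the CI worker is.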
He-Pin added a commit that referenced this pull request on Apr 23, 2026
* fix: auto-tune ForkJoinPool minimum-runnable on JDK 21+

  Motivation:
  JDK-8300995 / JDK-8321335 changed compensation-thread creation in ForkJoinPool asyncMode (FIFO) to be much more conservative. Pekko fork-join dispatchers using the prior default `minimum-runnable = 1` are then prone to starvation under blocking workloads on JDK 21+, which has shown up as flaky nightly runs (#2870) and is the root cause behind the workflow override added in #2889.

  Modification:
  * Introduce `ForkJoinExecutorConfigurator.resolveMinimumRunnable`, an internal helper that computes the effective `minimum-runnable` value from the configured value, the dispatcher parallelism, and the running JDK major version. A negative configured value (the new default `-1`) triggers the JDK-aware policy: on JDK 21+ the value becomes `min(8, max(1, parallelism / 2))`; on JDK < 21 it stays at `1`. Non-negative values are honoured verbatim, so explicit `0` still disables compensation entirely and explicit positive values (including `1`) keep their existing meaning.
  * Change `pekko.actor.default-dispatcher.fork-join-executor.minimum-runnable` in `reference.conf` to the sentinel `-1` and update the doc block to describe the new auto-selection rule.
  * Add `ForkJoinExecutorConfiguratorSpec` with three groups of assertions: (1) pure-function matrix on `resolveMinimumRunnable`; (2) directional checks asserting the auto policy strictly raises the value on JDK 21+ and never exceeds the documented cap of 8; (3) wiring integration that builds a `ForkJoinExecutorServiceFactory` from a real dispatcher config and verifies the resolved value reaches the factory (guarding against regressions of the resolver wiring).

  Result:
  Production users on JDK 21+ now benefit from the same starvation mitigation that #2889 bolted onto the nightly CI workflow. Source and binary compatibility are preserved (constructor defaults stay at `1`, no signature changes, no MiMa filter required). Users wanting to opt out can set `minimum-runnable = 1` (or any explicit value) to restore the previous behaviour.

* fix: address PR review feedback
  * License header: replace abbreviated header on the new ForkJoinExecutorConfiguratorSpec with the canonical Apache 2.0 header used by other clean-room test files in the project (per pjfanning's review comment).
  * Narrow auto-policy scope from JDK 21+ to JDK 25+: nightly evidence shows the asyncMode (FIFO) compensation-thread regression (JDK-8300995 / JDK-8321335) surfaces most clearly on the JDK 25 line, while JDK 21 has been running fine on the legacy default of 1 for years. Keep the default unchanged on JDK 21 to avoid a silent behaviour change for users who are not affected.
  * Document the new auto-tuning behaviour in both docs/src/main/paradox/dispatchers.md (classic) and docs/src/main/paradox/typed/dispatchers.md, including the opt-out instructions.
  * Update reference.conf doc comment, configurator scaladoc, and the spec assertions / pending guards to reflect the JDK 25+ scope.
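The resolution rule described in this commit message can be sketched as a pure function. This is a rough reading of the prose, not the actual `ForkJoinExecutorConfigurator.resolveMinimumRunnable` source: the class name, the parameter list, and in particular the `autoFromJdk` threshold parameter (21 in the first commit, 25 after the review feedback) are assumptions made for illustration.

```java
// Hypothetical sketch of the described policy, not the Pekko implementation.
final class MinimumRunnableSketch {

    static final int AUTO_CAP = 8; // documented upper bound of the auto policy

    // configured  : value from config; negative means "auto"
    // parallelism : dispatcher parallelism
    // jdkFeature  : running JDK major version
    // autoFromJdk : first JDK major version on which the auto policy applies (assumed parameter)
    static int resolveMinimumRunnable(int configured, int parallelism, int jdkFeature, int autoFromJdk) {
        if (configured >= 0) {
            return configured; // explicit values, including 0 and 1, are honoured verbatim
        }
        if (jdkFeature < autoFromJdk) {
            return 1; // legacy default on older JDKs
        }
        return Math.min(AUTO_CAP, Math.max(1, parallelism / 2));
    }

    public static void main(String[] args) {
        int jdk = Runtime.version().feature();
        System.out.println("running JDK feature version: " + jdk);
        System.out.println("auto, parallelism 16: " + resolveMinimumRunnable(-1, 16, jdk, 25));
        System.out.println("explicit 1 (opt-out): " + resolveMinimumRunnable(1, 16, jdk, 25));
    }
}
```

Under these assumptions a parallelism of 16 on JDK 25 resolves to 8, a parallelism of 4 resolves to 2, and any explicit non-negative setting bypasses the policy, which is the opt-out path mentioned above.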
Summary

Targets the recurring JDK 21+ / JDK 25 nightly flakes observed in the latest nightly run 24810320571. Two complementary moves: (1) bump `fork-join-executor.minimum-runnable` for the JDK 21+ test JVMs to fight the JDK-8300995 / JDK-8321335 compensation regression, and (2) plug four hardcoded test timeouts that ignore `pekko.test.timefactor`.

Why minimum-runnable, not LIFO

LIFO `task-peeking-mode` improves cache locality but causes head-of-line blocking and scheduling-fairness regressions across mailboxes, which makes it a poor default for actor workloads. Raising `minimum-runnable` keeps FIFO semantics and only changes how eagerly the pool spawns compensation threads when workers block, which is exactly the behaviour change needed to counter the JDK 21+ regression. The override is applied only in the workflow; production defaults stay at `1`.

Changes

- `.github/workflows/nightly-builds.yml`: when `javaVersion >= 21`, append `-Dpekko.actor.default-dispatcher.fork-join-executor.minimum-runnable=4`, and the same for `pekko.remote.default-remote-dispatcher`.
- `stream-tests/.../TlsSpec.scala`: add `.dilated` to the `15.seconds` and `17.seconds` timeouts in the `ServerInitiatesViaTcp` / `CancellingRHSIgnoresBoth` scenario. The hardcoded values ignored `timefactor`; the test failed at exactly 17s on JDK 25.
- `persistence-typed-tests/.../EventSourcedStashOverflowSpec.scala`: pass an explicit `30.seconds` budget to `probe.receiveMessages(stashCapacity)`.
- `persistence/.../journal/SteppingInmemJournal.scala`: bump the `step()` ask timeout from `3.seconds.dilated` to `10.seconds.dilated`. `PersistentActorRecoveryTimeoutSpec` failed at 6s (3s × timefactor=2) on JDK 17.

Out of scope (separate follow-ups)

- `AsyncDnsResolverIntegrationSpec`: four DNS-test failures on JDK 25 / Scala 2.13 trace to Docker BIND returning `SERVER_FAILURE`. This is an infrastructure issue, not test logic. Tracked separately.
- Auto-tuning `minimum-runnable` based on `Runtime.version().feature() >= 21` in `ForkJoinExecutorConfigurator` is a follow-up; this PR keeps production defaults unchanged.
- (`must be able to send messages concurrently preserving order`) tracked under "nightly tests (CI): java 25 runs have a lot of stream test failures" #2573.

Test plan

- `TlsSpec` ServerInitiatesViaTcp scenario passes on JDK 25
- `EventSourcedStashOverflowSpec` passes on JDK 25
- `PersistentActorRecoveryTimeoutSpec` passes on JDK 17 / Scala 3
- `must be able to send messages concurrently preserving order` no longer hits 40s timeout on JDK 21+

Related

- `minimum-runnable` configurability landed (default unchanged); this PR consumes it for CI