fix: stabilise JDK 21+ / JDK 25 nightly test runs by He-Pin · Pull Request #2889 · apache/pekko

He-Pin · 2026-04-23T04:34:47Z

Summary

Targets the recurring JDK 21+ / JDK 25 nightly flakes observed in the latest nightly run 24810320571. Two complementary moves: (1) bump `fork-join-executor.minimum-runnable` for the JDK 21+ test JVMs to counter the JDK-8300995 / JDK-8321335 compensation regression, and (2) fix four test timeouts that were either hardcoded (ignoring `pekko.test.timefactor`) or too tight for slow runners.

Why `minimum-runnable`, not LIFO

LIFO task-peeking-mode improves cache locality but causes head-of-line blocking and scheduling-fairness regressions across mailboxes — a poor default for actor workloads. Raising minimum-runnable keeps FIFO semantics and only changes how eagerly the pool spawns compensation threads when workers block, which is exactly the behaviour change needed to counter the JDK 21+ regression. The override is applied only in the workflow; production defaults stay at 1.
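
For readers unfamiliar with the knob, the sketch below shows where `minimumRunnable` sits in the JDK's own `ForkJoinPool` constructor (available since Java 9). It is only an illustration of the JDK API, not Pekko's configurator code: the class name, the `newPool` helper, the `parallelism + 256` thread cap and the 60-second keep-alive are all assumptions chosen for the example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

final class MinimumRunnableSketch {
    // Hypothetical helper: builds a FIFO (asyncMode = true) pool with an
    // explicit minimumRunnable value.
    static ForkJoinPool newPool(int parallelism, int minimumRunnable) {
        return new ForkJoinPool(
            parallelism,
            ForkJoinPool.defaultForkJoinWorkerThreadFactory,
            null,                 // no custom uncaught-exception handler
            true,                 // asyncMode: FIFO order for tasks that are never joined
            parallelism,          // corePoolSize
            parallelism + 256,    // maximumPoolSize: illustrative headroom for compensation threads
            minimumRunnable,      // minimum number of core threads that must stay unblocked
            null,                 // saturate predicate: keep the default behaviour
            60, TimeUnit.SECONDS  // keep-alive for idle threads (illustrative)
        );
    }

    public static void main(String[] args) {
        // Mirrors the CI override: minimum-runnable = 4 instead of the default 1.
        ForkJoinPool pool = newPool(8, 4);
        System.out.println("parallelism=" + pool.getParallelism()
            + " asyncMode=" + pool.getAsyncMode());
        pool.shutdown();
    }
}
```

A higher `minimumRunnable` asks the pool to create replacement threads sooner when workers block in a join or `ManagedBlocker`, up to `maximumPoolSize`; it does not change the task ordering selected by `asyncMode`.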

Changes

| File | Change | Why |
| --- | --- | --- |
| `.github/workflows/nightly-builds.yml` | When `javaVersion >= 21`, append `-Dpekko.actor.default-dispatcher.fork-join-executor.minimum-runnable=4` and the same for `pekko.remote.default-remote-dispatcher` | Eager compensation-thread creation under the JDK 21+ FJP regression; targets artery / remoting flakes |
| `stream-tests/.../TlsSpec.scala` | Add `.dilated` to the `15.seconds` and `17.seconds` timeouts in the `ServerInitiatesViaTcp / CancellingRHSIgnoresBoth` scenario | Hardcoded timeouts ignored `timefactor`; failed at exactly 17s on JDK 25 |
| `persistence-typed-tests/.../EventSourcedStashOverflowSpec.scala` | Pass an explicit `30.seconds` budget to `probe.receiveMessages(stashCapacity)` | Default 12s on JDK 25 not enough to drain 20k messages on slow runners |
| `persistence/.../journal/SteppingInmemJournal.scala` | Bump `step()` ask timeout from `3.seconds.dilated` to `10.seconds.dilated` | `PersistentActorRecoveryTimeoutSpec` failed at 6s (3s × timefactor=2) on JDK 17 |
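
The timeout arithmetic quoted in the table can be checked with a small sketch. `dilated` below is a hypothetical Java helper that only mimics what the table implies `.dilated` does (multiplying a timeout by `pekko.test.timefactor`); the class name and the millisecond rounding are assumptions, and this is not the testkit's implementation.

```java
import java.time.Duration;

final class DilationSketch {
    // Hypothetical stand-in for `.dilated`: scale a timeout by the time factor.
    static Duration dilated(Duration timeout, double timeFactor) {
        return Duration.ofMillis(Math.round(timeout.toMillis() * timeFactor));
    }

    public static void main(String[] args) {
        double timeFactor = 2.0; // the factor quoted for the failing JDK 17 run

        // Old per-step ask timeout: 3s dilated, the 6s at which the spec failed.
        System.out.println(dilated(Duration.ofSeconds(3), timeFactor));
        // New per-step ask timeout: 10s dilated.
        System.out.println(dilated(Duration.ofSeconds(10), timeFactor));
        // A hardcoded 17s stays at 17s regardless of the factor unless it is dilated.
        System.out.println(dilated(Duration.ofSeconds(17), timeFactor));
    }
}
```

With a factor of 2 the old `step()` budget is 6 seconds and the new one is 20 seconds, which is the headroom the change is after.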

Out of scope (separate follow-ups)

`AsyncDnsResolverIntegrationSpec`: four DNS-test failures on JDK 25 / Scala 2.13 trace to Docker BIND returning SERVER_FAILURE. This is an infrastructure issue, not test logic, and is tracked separately.
Auto-scaling `minimum-runnable` based on `Runtime.version().feature() >= 21` in `ForkJoinExecutorConfigurator` is a follow-up; this PR keeps production defaults unchanged.
Deeper artery profiling (40s timeouts on "must be able to send messages concurrently preserving order") is tracked under #2573 (nightly tests (CI): java 25 runs have a lot of stream test failures).

Test plan

Re-run nightly build manually after merge
Confirm TlsSpec ServerInitiatesViaTcp scenario passes on JDK 25
Confirm EventSourcedStashOverflowSpec passes on JDK 25
Confirm PersistentActorRecoveryTimeoutSpec passes on JDK 17 / Scala 3
Confirm artery "must be able to send messages concurrently preserving order" no longer hits the 40s timeout on JDK 21+

Related

#2573 "nightly tests (CI): java 25 runs have a lot of stream test failures": JDK 25 nightly flakiness tracker
#2870 "ForkJoinPool compensation-thread regression (JDK-8300995) causes test flakiness on JDK 21+": root-cause issue for the JDK-8300995 ForkJoinPool regression
#2871 "feat: make ForkJoinPool minimum-runnable configurable and improve pool documentation": `minimum-runnable` configurability landed (default unchanged); this PR consumes it for CI

Motivation:
The JDK 21+ ForkJoinPool compensation-thread regression (JDK-8300995 / JDK-8321335) starves Pekko's actor and remote dispatchers during heavy ask/await workloads in tests, producing intermittent timeouts on the nightly matrix (see apache#2573, apache#2870). Several individual tests also rely on hardcoded timeouts that never scale with `pekko.test.timefactor`, so they flake even when the dispatcher itself is healthy.

Modification:
- nightly-builds.yml: when the JDK is 21 or newer, raise `fork-join-executor.minimum-runnable` to 4 for both `pekko.actor.default-dispatcher` and `pekko.remote.default-remote-dispatcher`. This pushes the pool to spawn compensation threads earlier and avoids long stalls under the new compensation policy without changing scheduling semantics (FIFO unchanged, no fairness regressions).
- TlsSpec: dilate the previously hardcoded 15s/17s timeouts in the ServerInitiatesViaTcp / CancellingRHSIgnoresBoth scenario so the timefactor actually applies on slower CI workers.
- EventSourcedStashOverflowSpec: pass an explicit 30s budget to `receiveMessages(stashCapacity)` (the default 12s on JDK 25 is not enough to drain 20k messages on slow runners).
- SteppingInmemJournal.step: bump the per-step ask timeout from 3s (dilated) to 10s (dilated) so PersistentActorRecoveryTimeoutSpec and similar tests retain headroom on JDK 17 with timefactor=2.

Result:
JDK 21+ nightly runs stop hitting compensation-thread starvation in the artery / remoting suites, and the four targeted timing fixes remove specific flakes observed in the latest nightly run on JDK 17 / 25. Production defaults are unchanged - the FJP override is applied only in the workflow.

Motivation:
The "resume after multiple failures if resume supervision is in place" case times out after the default 6-second ScalaFutures patience on the JDK 25 nightly matrix. The sibling suite stream-typed-tests MapAsyncPartitionedSpec received the same dilation treatment in apache#2884; this spec was missed.

Modification:
Override the spec's patienceConfig to use a dilated 30-second timeout so the whole suite tracks pekko.test.timefactor instead of a fixed 6s.

Result:
Supervised-resume and bulk-throughput cases no longer flake on JDK 25 CI when the environment is under contention.

pjfanning

lgtm

* fix: auto-tune ForkJoinPool minimum-runnable on JDK 21+

  Motivation:
  JDK-8300995 / JDK-8321335 changed compensation-thread creation in ForkJoinPool asyncMode (FIFO) to be much more conservative. Pekko fork-join dispatchers using the prior default `minimum-runnable = 1` are then prone to starvation under blocking workloads on JDK 21+, which has shown up as flaky nightly runs (#2870) and is the root cause behind the workflow override added in #2889.

  Modification:
  * Introduce `ForkJoinExecutorConfigurator.resolveMinimumRunnable`, an internal helper that computes the effective `minimum-runnable` value from the configured value, the dispatcher parallelism, and the running JDK major version. A negative configured value (the new default `-1`) triggers the JDK-aware policy: on JDK 21+ the value becomes `min(8, max(1, parallelism / 2))`; on JDK < 21 it stays at `1`. Non-negative values are honoured verbatim, so explicit `0` still disables compensation entirely and explicit positive values (including `1`) keep their existing meaning.
  * Change `pekko.actor.default-dispatcher.fork-join-executor.minimum-runnable` in `reference.conf` to the sentinel `-1` and update the doc block to describe the new auto-selection rule.
  * Add `ForkJoinExecutorConfiguratorSpec` with three groups of assertions: (1) pure-function matrix on `resolveMinimumRunnable`; (2) directional checks asserting the auto policy strictly raises the value on JDK 21+ and never exceeds the documented cap of 8; (3) wiring integration that builds a `ForkJoinExecutorServiceFactory` from a real dispatcher config and verifies the resolved value reaches the factory (guarding against regressions of the resolver wiring).

  Result:
  Production users on JDK 21+ now benefit from the same starvation mitigation that #2889 bolted onto the nightly CI workflow. Source and binary compatibility are preserved (constructor defaults stay at `1`, no signature changes, no MiMa filter required). Users wanting to opt out can set `minimum-runnable = 1` (or any explicit value) to restore the previous behaviour.

* fix: address PR review feedback

  * License header: replace abbreviated header on the new ForkJoinExecutorConfiguratorSpec with the canonical Apache 2.0 header used by other clean-room test files in the project (per pjfanning's review comment).
  * Narrow auto-policy scope from JDK 21+ to JDK 25+: nightly evidence shows the asyncMode (FIFO) compensation-thread regression (JDK-8300995 / JDK-8321335) surfaces most clearly on the JDK 25 line, while JDK 21 has been running fine on the legacy default of 1 for years. Keep the default unchanged on JDK 21 to avoid a silent behaviour change for users who are not affected.
  * Document the new auto-tuning behaviour in both docs/src/main/paradox/dispatchers.md (classic) and docs/src/main/paradox/typed/dispatchers.md, including the opt-out instructions.
  * Update reference.conf doc comment, configurator scaladoc, and the spec assertions / pending guards to reflect the JDK 25+ scope.
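
The resolution rule described in these two commits can be sketched as a pure function. This is a Java illustration under stated assumptions, not the Scala helper itself: the class and method names, the parameter order and the use of integer division for `parallelism / 2` are hypothetical, and the JDK threshold is set to 25 to reflect the narrowed scope from the review follow-up (the first commit used 21).

```java
final class ResolveMinimumRunnableSketch {
    static final int AUTO = -1;              // sentinel default described in the commit message
    static final int AUTO_TUNE_MIN_JDK = 25; // assumed threshold after the review follow-up
    static final int AUTO_CAP = 8;           // documented cap for the auto policy

    // Hypothetical signature: configured value, dispatcher parallelism, JDK feature version.
    static int resolve(int configured, int parallelism, int jdkFeature) {
        if (configured >= 0) {
            return configured; // explicit values, including 0 and 1, are honoured verbatim
        }
        if (jdkFeature < AUTO_TUNE_MIN_JDK) {
            return 1;          // legacy default on older JDKs
        }
        // Auto policy: min(8, max(1, parallelism / 2)), assuming integer division.
        return Math.min(AUTO_CAP, Math.max(1, parallelism / 2));
    }

    public static void main(String[] args) {
        int jdk = Runtime.version().feature(); // major version of the running JDK
        for (int parallelism : new int[] {1, 4, 16, 64}) {
            System.out.println("jdk=" + jdk + " parallelism=" + parallelism
                + " -> minimum-runnable=" + resolve(AUTO, parallelism, jdk));
        }
    }
}
```

Passing the JDK version as a parameter rather than reading it inside the function is what makes a "pure-function matrix" test of the kind the commit describes straightforward.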

He-Pin requested a review from pjfanning April 23, 2026 04:41

He-Pin marked this pull request as draft April 23, 2026 05:27

He-Pin mentioned this pull request Apr 23, 2026

fix: auto-tune ForkJoinPool minimum-runnable on JDK 25+ #2890

Merged

5 tasks

He-Pin marked this pull request as ready for review April 23, 2026 06:10

pjfanning approved these changes Apr 23, 2026

He-Pin merged commit e72dd27 into apache:main Apr 23, 2026
9 checks passed

He-Pin deleted the fix/jdk21-jdk25-test-stability branch April 23, 2026 09:09

He-Pin mentioned this pull request Apr 25, 2026

fix: stabilise JDK 25 nightly flakes #2907

Merged

He-Pin merged 2 commits into apache:main from He-Pin:fix/jdk21-jdk25-test-stability
