
fix: stabilise JDK 21+ / JDK 25 nightly test runs #2889

Merged
He-Pin merged 2 commits into apache:main from He-Pin:fix/jdk21-jdk25-test-stability
Apr 23, 2026

Conversation

@He-Pin He-Pin commented Apr 23, 2026

Summary

Targets the recurring JDK 21+ / JDK 25 nightly flakes observed in the latest nightly run 24810320571. Two complementary moves: (1) bump fork-join-executor.minimum-runnable for the JDK 21+ test JVMs to fight the JDK-8300995 / JDK-8321335 compensation regression, and (2) plug four hardcoded test timeouts that ignore pekko.test.timefactor.

Why minimum-runnable, not LIFO

LIFO task-peeking-mode improves cache locality but causes head-of-line blocking and scheduling-fairness regressions across mailboxes — a poor default for actor workloads. Raising minimum-runnable keeps FIFO semantics and only changes how eagerly the pool spawns compensation threads when workers block, which is exactly the behaviour change needed to counter the JDK 21+ regression. The override is applied only in the workflow; production defaults stay at 1.
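
To show where this knob sits at the JDK level, here is a minimal sketch, not Pekko's actual wiring. It builds a plain `java.util.concurrent.ForkJoinPool` through the ten-argument JDK constructor (Java 9+), with `asyncMode = true` for FIFO and the `minimumRunnable` argument passed through. The object name, the `maximumPoolSize` headroom, and the keep-alive value are assumptions chosen purely for illustration.

```scala
import java.util.concurrent.{ForkJoinPool, TimeUnit}

object MinimumRunnableSketch {
  // Hypothetical helper, not Pekko's ForkJoinExecutorConfigurator.
  def fifoPool(parallelism: Int, minimumRunnable: Int): ForkJoinPool =
    new ForkJoinPool(
      parallelism,
      ForkJoinPool.defaultForkJoinWorkerThreadFactory,
      null,             // no custom UncaughtExceptionHandler
      true,             // asyncMode = true: FIFO scheduling
      parallelism,      // corePoolSize
      parallelism + 64, // maximumPoolSize: assumed headroom for compensation threads
      minimumRunnable,  // minimum core threads not blocked by a join or ManagedBlocker
      null,             // saturate predicate: keep the JDK default behaviour
      60L,              // keep-alive for idle threads (assumed value)
      TimeUnit.SECONDS)
}
```

Per the JDK documentation, the pool constructs new threads when too few unblocked threads exist, so a larger `minimumRunnable` should make it compensate sooner; `asyncMode` is untouched, which is why FIFO ordering is unaffected.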

Changes

| File | Change | Why |
|---|---|---|
| `.github/workflows/nightly-builds.yml` | When `javaVersion >= 21`, append `-Dpekko.actor.default-dispatcher.fork-join-executor.minimum-runnable=4`, and the same for `pekko.remote.default-remote-dispatcher` | Eager compensation-thread creation under the JDK 21+ FJP regression; targets artery / remoting flakes |
| `stream-tests/.../TlsSpec.scala` | Add `.dilated` to the `15.seconds` and `17.seconds` timeouts in the ServerInitiatesViaTcp / CancellingRHSIgnoresBoth scenario | Hardcoded timeouts ignored timefactor; failed at exactly 17s on JDK 25 |
| `persistence-typed-tests/.../EventSourcedStashOverflowSpec.scala` | Pass an explicit `30.seconds` budget to `probe.receiveMessages(stashCapacity)` | Default 12s on JDK 25 is not enough to drain 20k messages on slow runners |
| `persistence/.../journal/SteppingInmemJournal.scala` | Bump `step()` ask timeout from `3.seconds.dilated` to `10.seconds.dilated` | PersistentActorRecoveryTimeoutSpec failed at 6s (3s × timefactor=2) on JDK 17 |
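
The timeout arithmetic behind these rows can be sketched in plain Scala. This is a stand-in for the testkit's `.dilated`, which reads the factor from `pekko.test.timefactor`; the object and method names here are hypothetical, and the factor is passed in explicitly so the snippet needs no Pekko dependency.

```scala
import scala.concurrent.duration._

object DilationSketch {
  // Hypothetical stand-in for `.dilated`: scale a timeout by the time factor.
  def dilate(timeout: FiniteDuration, timeFactor: Double): FiniteDuration =
    (timeout * timeFactor) match {
      case finite: FiniteDuration => finite
      case other =>
        throw new IllegalArgumentException(s"Not a finite timeout: $other")
    }
}
```

With a factor of 2, `dilate(3.seconds, 2.0)` yields 6 seconds (the failure point quoted for SteppingInmemJournal) and `dilate(10.seconds, 2.0)` yields 20 seconds, whereas an undilated `17.seconds` stays at 17 seconds whatever the factor.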

Out of scope (separate follow-ups)

  • AsyncDnsResolverIntegrationSpec: the four DNS-test failures on JDK 25 / Scala 2.13 trace to Docker BIND returning SERVER_FAILURE, an infrastructure issue rather than a test-logic bug. Tracked separately.
  • Auto-scaling minimum-runnable based on Runtime.version().feature() >= 21 in ForkJoinExecutorConfigurator is a follow-up; this PR keeps production defaults unchanged.
  • Deeper artery profiling (40s timeouts on "must be able to send messages concurrently preserving order") is tracked under #2573, "nightly tests (CI): java 25 runs have a lot of stream test failures".

Test plan

  • Re-run nightly build manually after merge
  • Confirm TlsSpec ServerInitiatesViaTcp scenario passes on JDK 25
  • Confirm EventSourcedStashOverflowSpec passes on JDK 25
  • Confirm PersistentActorRecoveryTimeoutSpec passes on JDK 17 / Scala 3
  • Confirm the artery test "must be able to send messages concurrently preserving order" no longer hits the 40s timeout on JDK 21+

Related

Motivation:
The JDK 21+ ForkJoinPool compensation-thread regression (JDK-8300995 /
JDK-8321335) starves Pekko's actor and remote dispatchers during heavy
ask/await workloads in tests, producing intermittent timeouts on the
nightly matrix (see apache#2573, apache#2870).  Several individual tests also rely
on hardcoded timeouts that never scale with `pekko.test.timefactor`,
so they flake even when the dispatcher itself is healthy.

Modification:
- nightly-builds.yml: when the JDK is 21 or newer, raise
  `fork-join-executor.minimum-runnable` to 4 for both
  `pekko.actor.default-dispatcher` and
  `pekko.remote.default-remote-dispatcher`.  This pushes the pool to
  spawn compensation threads earlier and avoids long stalls under the
  new compensation policy without changing scheduling semantics
  (FIFO unchanged, no fairness regressions).
- TlsSpec: dilate the previously hardcoded 15s/17s timeouts in the
  ServerInitiatesViaTcp / CancellingRHSIgnoresBoth scenario so the
  timefactor actually applies on slower CI workers.
- EventSourcedStashOverflowSpec: pass an explicit 30s budget to
  `receiveMessages(stashCapacity)` (the default 12s on JDK 25 is not
  enough to drain 20k messages on slow runners).
- SteppingInmemJournal.step: bump the per-step ask timeout from 3s
  (dilated) to 10s (dilated) so PersistentActorRecoveryTimeoutSpec and
  similar tests retain headroom on JDK 17 with timefactor=2.
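
The workflow condition in the first bullet amounts to the following, restated as a small Scala function for illustration only. The real change lives in the YAML workflow; the object and method names are hypothetical, and the full property keys are assembled from the dispatcher paths and the `fork-join-executor.minimum-runnable` setting named above.

```scala
object NightlyJvmFlagsSketch {
  private val dispatchers = Seq(
    "pekko.actor.default-dispatcher",
    "pekko.remote.default-remote-dispatcher")

  // Hypothetical helper mirroring the workflow condition: JDK 21+ gets the override.
  def forkJoinOverrides(javaVersion: Int, minimumRunnable: Int = 4): Seq[String] =
    if (javaVersion >= 21)
      dispatchers.map(d => s"-D$d.fork-join-executor.minimum-runnable=$minimumRunnable")
    else Seq.empty
}
```

JDK 17 jobs receive no extra flags under this rule, so their dispatchers keep the default of 1.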

Result:
JDK 21+ nightly runs stop hitting compensation-thread starvation in the
artery / remoting suites, and the four targeted timing fixes remove
specific flakes observed in the latest nightly run on JDK 17 / 25.
Production defaults are unchanged - the FJP override is applied only
in the workflow.
@He-Pin He-Pin requested a review from pjfanning April 23, 2026 04:41
@He-Pin He-Pin marked this pull request as draft April 23, 2026 05:27
Motivation:
The "resume after multiple failures if resume supervision is in place"
case times out after the default 6-second ScalaFutures patience on the
JDK 25 nightly matrix. The sibling suite MapAsyncPartitionedSpec in
stream-typed-tests received the same dilation treatment in apache#2884;
this spec was missed.

Modification:
Override the spec's patienceConfig to use a dilated 30-second timeout
so the whole suite tracks pekko.test.timefactor instead of a fixed 6s.

Result:
Supervised-resume and bulk-throughput cases no longer flake on JDK 25
CI when the environment is under contention.
@He-Pin He-Pin marked this pull request as ready for review April 23, 2026 06:10

@pjfanning pjfanning left a comment

lgtm

@He-Pin He-Pin merged commit e72dd27 into apache:main Apr 23, 2026
9 checks passed
@He-Pin He-Pin deleted the fix/jdk21-jdk25-test-stability branch April 23, 2026 09:09
He-Pin added a commit that referenced this pull request Apr 23, 2026
* fix: auto-tune ForkJoinPool minimum-runnable on JDK 21+

Motivation:
JDK-8300995 / JDK-8321335 changed compensation-thread creation in
ForkJoinPool asyncMode (FIFO) to be much more conservative. Pekko fork-join
dispatchers using the prior default `minimum-runnable = 1` are then prone
to starvation under blocking workloads on JDK 21+, which has shown up as
flaky nightly runs (#2870) and is the root cause behind the workflow
override added in #2889.

Modification:
* Introduce `ForkJoinExecutorConfigurator.resolveMinimumRunnable`, an
  internal helper that computes the effective `minimum-runnable` value
  from the configured value, the dispatcher parallelism, and the running
  JDK major version. A negative configured value (the new default `-1`)
  triggers the JDK-aware policy: on JDK 21+ the value becomes
  `min(8, max(1, parallelism / 2))`; on JDK < 21 it stays at `1`.
  Non-negative values are honoured verbatim, so explicit `0` still
  disables compensation entirely and explicit positive values (including
  `1`) keep their existing meaning.
* Change `pekko.actor.default-dispatcher.fork-join-executor.minimum-runnable`
  in `reference.conf` to the sentinel `-1` and update the doc block to
  describe the new auto-selection rule.
* Add `ForkJoinExecutorConfiguratorSpec` with three groups of assertions:
  (1) pure-function matrix on `resolveMinimumRunnable`; (2) directional
  checks asserting the auto policy strictly raises the value on JDK 21+
  and never exceeds the documented cap of 8; (3) wiring integration that
  builds a `ForkJoinExecutorServiceFactory` from a real dispatcher config
  and verifies the resolved value reaches the factory (guarding against
  regressions of the resolver wiring).

Result:
Production users on JDK 21+ now benefit from the same starvation
mitigation that #2889 bolted onto the nightly CI workflow. Source and
binary compatibility are preserved (constructor defaults stay at `1`,
no signature changes, no MiMa filter required). Users wanting to opt
out can set `minimum-runnable = 1` (or any explicit value) to restore
the previous behaviour.
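
The selection rule described in this commit message can be restated as a pure function. The sketch below is a hypothetical re-statement, not the merged `resolveMinimumRunnable` source: the object name, parameter names, and the `autoFromJdk` threshold parameter are assumptions, with the threshold defaulting to 21 as this message describes.

```scala
object ResolveMinimumRunnableSketch {
  // Hypothetical re-statement of the rule in the commit message; not the merged source.
  def resolve(configured: Int, parallelism: Int, jdkFeature: Int, autoFromJdk: Int = 21): Int =
    if (configured >= 0) configured // explicit values (including 0 and 1) are honoured verbatim
    else if (jdkFeature >= autoFromJdk) math.min(8, math.max(1, parallelism / 2))
    else 1 // legacy default on older JDKs

  // Convenience variant reading the running JVM's feature version (Java 10+ API).
  def resolveForCurrentJvm(configured: Int, parallelism: Int): Int =
    resolve(configured, parallelism, Runtime.version().feature())
}
```

For the sentinel `-1` on JDK 21, a parallelism of 4 resolves to 2 and a parallelism of 16 or more resolves to the cap of 8; on JDK 17 the result is 1. Passing `autoFromJdk = 25` models the narrower scope adopted in the review-feedback commit that follows.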

* fix: address PR review feedback

* License header: replace abbreviated header on the new
  ForkJoinExecutorConfiguratorSpec with the canonical Apache 2.0
  header used by other clean-room test files in the project
  (per pjfanning's review comment).

* Narrow auto-policy scope from JDK 21+ to JDK 25+: nightly evidence
  shows the asyncMode (FIFO) compensation-thread regression
  (JDK-8300995 / JDK-8321335) surfaces most clearly on the JDK 25
  line, while JDK 21 has been running fine on the legacy default of
  1 for years. Keep the default unchanged on JDK 21 to avoid a
  silent behaviour change for users who are not affected.

* Document the new auto-tuning behaviour in both
  docs/src/main/paradox/dispatchers.md (classic) and
  docs/src/main/paradox/typed/dispatchers.md, including the opt-out
  instructions.

* Update reference.conf doc comment, configurator scaladoc, and the
  spec assertions / pending guards to reflect the JDK 25+ scope.