[fix][test] Flaky SameAuthParamsLookupAutoClusterFailoverTest #25566

Merged
merlimat merged 1 commit into apache:master from
lhotari:lh-fix-flaky-SameAuthParamsLookupAutoClusterFailoverTest
Apr 22, 2026

@lhotari lhotari commented Apr 22, 2026

Fixes #24526

Motivation

The SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover test is flaky. A prior fix (#25388) addressed one root cause (stale Healthy state in findFailoverTo), but the test still times out intermittently at the "Test recover 2 --> 1" step, where Awaitility waits up to 60s for the state to reach [Failed, Healthy, Healthy].

Root cause: with checkHealthyIntervalMs(300) and recoverThreshold(5), recovery of index 1 requires 5 successful probe cycles. Each cycle sequentially probes every index up to currentPulsarServiceIndex (so 3 probes per cycle once currentPulsarServiceIndex is 2). Each probe is bounded by a 3s deadline in SameAuthParamsLookupAutoClusterFailover.probeAvailable(). Under CI load, the accumulated wall time (cycles × probes × per-probe time) can approach or exceed the 60s Awaitility budget, especially if a transient probe failure of index 1 resets the PreRecover counter and forces another 5 cycles.
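The arithmetic behind that estimate can be sketched roughly as follows. This is a back-of-the-envelope model, not code from the PR: the class and method names are hypothetical, and it assumes the check interval is added on top of the probing time of each cycle (if the scheduler instead runs at a fixed rate, a cycle is bounded below by the larger of the two). The "one reset" scenario of 4 successful cycles, 1 failed cycle, then 5 successful cycles is likewise an illustrative assumption.

```java
public class RecoveryBudgetSketch {

    /** Wall time of one cycle, ASSUMING the interval is added on top of the probing time. */
    public static long cycleMillis(int probesPerCycle, long perProbeMillis, long intervalMillis) {
        return probesPerCycle * perProbeMillis + intervalMillis;
    }

    /** Wall time needed to complete the given number of cycles under the same assumption. */
    public static long recoveryMillis(int cycles, int probesPerCycle,
                                      long perProbeMillis, long intervalMillis) {
        return cycles * cycleMillis(probesPerCycle, perProbeMillis, intervalMillis);
    }

    public static void main(String[] args) {
        long budgetMillis = 60_000; // Awaitility budget quoted in the PR description

        // Every probe hits the 3s deadline, 3 probes per cycle, 300ms interval, no reset.
        long slowNoReset = recoveryMillis(5, 3, 3_000, 300);

        // Hypothetical reset: 4 good cycles + 1 failed cycle + 5 good cycles = 10 cycles.
        long slowOneReset = recoveryMillis(10, 3, 3_000, 300);

        // Hypothetical fast probes (50ms each) to show what the interval change buys.
        long fastOldInterval = recoveryMillis(5, 3, 50, 300);
        long fastNewInterval = recoveryMillis(5, 3, 50, 100);

        System.out.println("slow probes, no reset:  " + slowNoReset + " ms");
        System.out.println("slow probes, one reset: " + slowOneReset + " ms"
                + (slowOneReset > budgetMillis ? " (over budget)" : ""));
        System.out.println("fast probes, 300ms interval: " + fastOldInterval + " ms");
        System.out.println("fast probes, 100ms interval: " + fastNewInterval + " ms");
    }
}
```

Under this model the interval dominates when probes are fast, while a single counter reset with slow probes is enough to exceed the budget.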

Modifications

Reduce checkHealthyIntervalMs from 300ms to 100ms in the test. This tightens the cycle cadence so recovery completes sooner without weakening the threshold-based state machine: the full five-step PreRecover -> Healthy transition is still exercised.
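To illustrate why the interval can shrink without changing what the test exercises, here is a simplified model of a threshold-based recovery counter. It is a sketch under stated assumptions, not the logic in SameAuthParamsLookupAutoClusterFailover: the class, enum, and method names are hypothetical, and the model only tracks one index moving from Failed through PreRecover to Healthy, with any failed probe resetting the count.

```java
import java.util.List;

public class RecoverCounterSketch {

    /** Simplified states for a single index; the real class may track more. */
    public enum State { FAILED, PRE_RECOVER, HEALTHY }

    /**
     * Replays one probe result per cycle and returns the resulting state.
     * A failed probe resets the success counter (assumed behaviour, per the PR text).
     */
    public static State replay(List<Boolean> probeResults, int recoverThreshold) {
        State state = State.FAILED;
        int successes = 0;
        for (boolean ok : probeResults) {
            if (state == State.HEALTHY) {
                break; // recovery already completed
            }
            if (ok) {
                successes++;
                state = successes >= recoverThreshold ? State.HEALTHY : State.PRE_RECOVER;
            } else {
                successes = 0;
                state = State.FAILED;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        // Five clean cycles reach HEALTHY regardless of how far apart the cycles are.
        System.out.println(replay(List.of(true, true, true, true, true), 5));
        // A failure on the fifth cycle discards the four earlier successes.
        System.out.println(replay(List.of(true, true, true, true, false, true, true, true, true), 5));
    }
}
```

The number of cycles needed depends only on the probe results and the threshold; the check interval only changes how quickly those cycles elapse.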

Verifying this change

  • Make sure that the change passes the CI checks.

This change is already covered by existing tests, such as SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover. Verified locally with @Test(invocationCount = 5) (10 runs total across both enabledTls values): 10/10 passes, each invocation consistent at ~16.5s (vs. the flaky runs that previously exceeded the 60s Awaitility budget).

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc-not-needed

…LookupAutoClusterFailoverTest

Under CI load, the "Test recover 2 --> 1" step in testAutoClusterFailover
could exceed its 60s awaitility budget because each recovery cycle
sequentially probes every index up to currentPulsarServiceIndex with a
3s per-probe deadline, and recoverThreshold=5 requires five successful
cycles. Transient probe failures reset the PreRecover counter, compounding
delay.

Tightening checkHealthyIntervalMs from 300ms to 100ms shortens the cycle
cadence without weakening the threshold-based state machine — the full
five-step PreRecover -> Healthy transition is still exercised.
@merlimat merlimat merged commit 1071cc9 into apache:master Apr 22, 2026
80 of 82 checks passed
lhotari added a commit that referenced this pull request Apr 22, 2026
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 23, 2026
Development

Successfully merging this pull request may close these issues.

Flaky-test: SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover
