[fix][client] Reset higher-index states on recovery in SameAuthParamsLookupAutoClusterFailover by merlimat · Pull Request #25826 · apache/pulsar

merlimat · 2026-05-19T15:59:03Z

Motivation

SameAuthParamsLookupAutoClusterFailover maintains a per-service-index state array (Healthy, PreFail, Failed, PreRecover) updated by a periodic check loop. The check loop only probes indices 0..currentPulsarServiceIndex:

private void checkPulsarServices() {
    for (int i = 0; i <= currentPulsarServiceIndex; i++) {
        ...
    }
}

When we recover to a higher-priority service (currentPulsarServiceIndex decreases), the loop stops probing the higher-indexed services. If any of those indices were in a transient state at the moment of recovery (e.g., PreFail from a single timed-out probe), they get stuck there because nothing ever probes them again to flip them back to Healthy.

This causes SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover to fail intermittently:

Caused by: java.lang.AssertionError: Arrays differ at element [2]: Healthy != PreFail expected [Healthy] but found [PreFail]

Concretely: while currentPulsarServiceIndex is 2 and the test is waiting for index 1 to recover, a single transient probe failure on pulsar2 transitions state[2]: Healthy -> PreFail. A subsequent successful probe at index 1 reaches recoverThreshold and triggers updateServiceUrl(1), dropping currentPulsarServiceIndex to 1. From that point on the loop only probes indices 0 and 1, and state[2] stays at PreFail forever — the test's 3-minute await window never sees state[2] == Healthy.
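The sequence above can be reproduced in a small stand-alone model. This is only a sketch, not the Pulsar implementation: the class name `StuckStateSketch`, the `afterProbe` transition table, and the use of a threshold of one probe per transition are all assumptions made to keep the example short. Only the loop bound `0..current` is taken from the PR.

```java
import java.util.Arrays;

public class StuckStateSketch {
    enum State { Healthy, PreFail, Failed, PreRecover }

    // Assumed, simplified transitions: every threshold is treated as 1 probe.
    static State afterProbe(State s, boolean ok) {
        switch (s) {
            case Healthy: return ok ? State.Healthy : State.PreFail;
            case PreFail: return ok ? State.Healthy : State.Failed;
            case Failed:  return ok ? State.PreRecover : State.Failed;
            default:      return ok ? State.Healthy : State.Failed; // PreRecover
        }
    }

    // Same bound as the loop quoted in the PR: indices above `current` are skipped.
    static void checkCycle(State[] states, int current, boolean[] probeOk) {
        for (int i = 0; i <= current; i++) {
            states[i] = afterProbe(states[i], probeOk[i]);
        }
    }

    static State[] simulate() {
        State[] states = { State.Failed, State.PreRecover, State.Healthy };
        int current = 2;

        // url1 passes its probe while url2 has one transient failure.
        checkCycle(states, current, new boolean[] { false, true, false });

        // Recovery to index 1 with no reset of higher indices (the pre-fix behavior).
        current = 1;

        // url2 is healthy again, but it is never probed, so state[2] cannot change.
        for (int n = 0; n < 10; n++) {
            checkCycle(states, current, new boolean[] { false, true, true });
        }
        return states;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(simulate())); // [Failed, Healthy, PreFail]
    }
}
```

In this model, no number of later check cycles changes `state[2]`, which matches the assertion failure quoted above.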

Example failure: https://scans.gradle.com/s/7pttiiyo6yybc/tests/task/:pulsar-broker:test/details/org.apache.pulsar.broker.SameAuthParamsLookupAutoClusterFailoverTest/testAutoClusterFailover%5B2%5D(true)/1/output

Modifications

Production fix (SameAuthParamsLookupAutoClusterFailover.java): In updateServiceUrl, when recovering (target index < current index), reset state of indices above the new target to Healthy and zero their counters. The state of an unprobed index is not meaningful — resetting it ensures (a) a subsequent failover starts from a clean baseline if it needs to consider those services again, and (b) we don't leave stale transient state lying around.
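A minimal sketch of the reset described above, under stated assumptions: the class `RecoveryResetSketch`, the `counters` array, and the choice to reset every index up to the end of the array are hypothetical, and the actual switch of the client's service URL is omitted. It is not the code from the PR diff.

```java
import java.util.Arrays;

public class RecoveryResetSketch {
    enum State { Healthy, PreFail, Failed, PreRecover }

    final State[] states;
    final int[] counters; // hypothetical per-index probe counters
    int currentIndex;

    RecoveryResetSketch(State[] states, int[] counters, int currentIndex) {
        this.states = states;
        this.counters = counters;
        this.currentIndex = currentIndex;
    }

    void updateServiceUrl(int targetIndex) {
        if (targetIndex < currentIndex) {
            // Recovering: indices above the target will no longer be probed,
            // so clear whatever transient state they hold.
            for (int i = targetIndex + 1; i < states.length; i++) {
                states[i] = State.Healthy;
                counters[i] = 0;
            }
        }
        currentIndex = targetIndex;
    }

    static String demo() {
        RecoveryResetSketch f = new RecoveryResetSketch(
                new State[] { State.Failed, State.Healthy, State.PreFail },
                new int[] { 3, 0, 1 },
                2);
        f.updateServiceUrl(1);
        return Arrays.toString(f.states) + " " + Arrays.toString(f.counters)
                + " " + f.currentIndex;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [Failed, Healthy, Healthy] [3, 0, 0] 1
    }
}
```

Indices at or below the target are left untouched because the check loop keeps probing them and will update their state on its own.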

New deterministic test (SameAuthParamsLookupAutoClusterFailoverTest.testRecoveryResetsHigherIndexStaleState): uses the existing mock-based harness in pulsar-client to drive a precise probe sequence that reproduces the bug:

1. Failover 0 → 2 (url0 down, url1 down, url2 up). State becomes [Failed, Failed, Healthy].
2. url1 recovers; a check cycle transitions state[1]: Failed → PreRecover.
3. On the cycle that promotes state[1]: PreRecover → Healthy and triggers updateServiceUrl(1), url2 sees one failed probe so state[2]: Healthy → PreFail right before the index drops to 1.

After the recovery transition the check loop only iterates 0..1, so without the fix state[2] is stuck at PreFail forever. The test asserts state[2] == Healthy after the transition.
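One way to make such a probe sequence deterministic is to script the outcome of each probe per URL and per check cycle. The sketch below shows that idea without any mocking library. `ScriptedProbeSketch`, `enqueue`, and `probe` are hypothetical names, and the real test uses the existing mock-based harness in pulsar-client rather than this class.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class ScriptedProbeSketch {
    // One queue of scripted outcomes per URL index, consumed one per probe.
    private final List<Deque<Boolean>> script = new ArrayList<>();

    ScriptedProbeSketch(int urls) {
        for (int i = 0; i < urls; i++) {
            script.add(new ArrayDeque<>());
        }
    }

    ScriptedProbeSketch enqueue(int url, Boolean... outcomes) {
        script.get(url).addAll(Arrays.asList(outcomes));
        return this;
    }

    // Returns the next scripted outcome; an exhausted script counts as "up".
    boolean probe(int url) {
        Boolean next = script.get(url).poll();
        return next == null || next;
    }

    static List<String> demo() {
        ScriptedProbeSketch p = new ScriptedProbeSketch(3)
                .enqueue(0, false, false, false)  // url0 stays down
                .enqueue(1, false, true, true)    // url1 recovers after the failover
                .enqueue(2, true, true, false);   // url2 fails once, on the recovery cycle
        List<String> cycles = new ArrayList<>();
        for (int c = 0; c < 4; c++) {
            cycles.add(p.probe(0) + "," + p.probe(1) + "," + p.probe(2));
        }
        return cycles;
    }

    public static void main(String[] args) {
        for (String cycle : demo()) {
            System.out.println(cycle);
        }
    }
}
```

The third scripted cycle is the interesting one: url1 succeeds while url2 fails exactly once, which is the timing the flaky integration test only hit occasionally.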

I verified the test fails with the exact expected message when the fix is reverted:

AssertionError: state[2] should be reset to Healthy on recovery, not stuck at PreFail expected [Healthy] but found [PreFail]

Verifying this change

Covered by:

The new deterministic unit test SameAuthParamsLookupAutoClusterFailoverTest.testRecoveryResetsHigherIndexStaleState in pulsar-client (fails without the fix, passes with it).
The existing integration test SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover in pulsar-broker (both TLS and non-TLS variants). Locally I ran it 3 times with --rerun-tasks; all passed in ~90s each.

Does this pull request potentially affect one of the following parts:

…LookupAutoClusterFailover

The periodic health-check loop only iterates indices 0..currentPulsarServiceIndex. When we recover to a higher-priority service (currentPulsarServiceIndex decreases), the loop stops probing the higher-indexed services. If any of those indices were in a transient state at that moment (e.g., PreFail from a single timed-out probe), they get stuck there because nothing ever probes them again to flip them back to Healthy.

This caused SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover to fail intermittently: after recovering 2->1, the test asserts state[2]=Healthy, but a single transient probe failure on pulsar2 right before the recovery left state[2]=PreFail, and the loop never visits it again.

Reset state of indices above the new target to Healthy on recovery so they start fresh if a future failover needs to consider them again.

Adds testRecoveryResetsHigherIndexStaleState to the client-side SameAuthParamsLookupAutoClusterFailoverTest that uses a mocked LookupService to drive a precise probe sequence:

1. Failover 0 -> 2 (url0 down, url1 down, url2 up).
2. url1 recovers; state[1] Failed -> PreRecover.
3. On the cycle that promotes state[1] PreRecover -> Healthy and triggers updateServiceUrl(1), url2 sees one failed probe so state[2] flips Healthy -> PreFail right before the index drops to 1.

After the recovery transition, the check loop only iterates 0..1, so without the fix state[2] is stuck at PreFail forever. The test asserts state[2]=Healthy after the transition, which fails without the production fix and passes with it.

Verified the test fails with the exact expected message when the fix is reverted:

AssertionError: state[2] should be reset to Healthy on recovery, not stuck at PreFail expected [Healthy] but found [PreFail]

merlimat added the area/client label May 19, 2026

lhotari approved these changes May 20, 2026

lhotari merged commit 4229e19 into apache:master May 20, 2026
44 checks passed

lhotari merged 2 commits into apache:master from merlimat:mmerli/failover-reset-states-on-recovery
