[fix][client] Reset higher-index states on recovery in SameAuthParamsLookupAutoClusterFailover #25826
Merged
lhotari merged 2 commits on May 20, 2026
…LookupAutoClusterFailover
The periodic health-check loop only iterates indices 0..currentPulsarServiceIndex. When we recover to a higher-priority service (currentPulsarServiceIndex decreases), the loop stops probing the higher-indexed services. If any of those indices was in a transient state at that moment (e.g., PreFail from a single timed-out probe), it gets stuck there because nothing ever probes it again to flip it back to Healthy.
This caused SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover to fail intermittently: after recovering 2->1, the test asserts state[2]=Healthy, but a single transient probe failure on pulsar2 right before the recovery left state[2]=PreFail, and the loop never visits it again.
Reset the state of indices above the new target to Healthy on recovery so they start fresh if a future failover needs to consider them again.
Adds testRecoveryResetsHigherIndexStaleState to the client-side
SameAuthParamsLookupAutoClusterFailoverTest that uses a mocked
LookupService to drive a precise probe sequence:
1. Failover 0 -> 2 (url0 down, url1 down, url2 up).
2. url1 recovers; state[1] Failed -> PreRecover.
3. On the cycle that promotes state[1] PreRecover -> Healthy and
triggers updateServiceUrl(1), url2 sees one failed probe so
state[2] flips Healthy -> PreFail right before the index drops
to 1.
After the recovery transition, the check loop only iterates 0..1,
so without the fix state[2] is stuck at PreFail forever. The test
asserts state[2]=Healthy after the transition, which fails without
the production fix and passes with it.
Verified the test fails with the exact expected message when the fix
is reverted:
AssertionError: state[2] should be reset to Healthy on recovery,
not stuck at PreFail expected [Healthy] but found [PreFail]
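A scripted probe sequence like the one in steps 1-3 can be approximated without a mocking library. The sketch below is only an illustration: `ScriptedProbeSketch`, `enqueue`, and `probe` are hypothetical names, not part of the Pulsar client or of the actual test, which uses a mocked LookupService.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a mocked lookup: each URL gets a queue of
// pre-decided probe outcomes, so a test controls exactly which cycle fails.
public class ScriptedProbeSketch {
    private final Map<String, Deque<Boolean>> script = new HashMap<>();

    // Queue the outcomes that successive probes of `url` should return.
    void enqueue(String url, Boolean... outcomes) {
        script.computeIfAbsent(url, k -> new ArrayDeque<>())
              .addAll(Arrays.asList(outcomes));
    }

    // Return the next scripted outcome; an exhausted or missing script
    // is treated as "service is up" (an assumption of this sketch).
    boolean probe(String url) {
        Deque<Boolean> queue = script.get(url);
        if (queue == null || queue.isEmpty()) {
            return true;
        }
        return queue.poll();
    }

    public static void main(String[] args) {
        ScriptedProbeSketch probes = new ScriptedProbeSketch();
        // url2 is up, then sees exactly one failed probe, then is up again.
        probes.enqueue("url2", true, false);
        System.out.println(probes.probe("url2") + " "
                + probes.probe("url2") + " "
                + probes.probe("url2")); // prints "true false true"
    }
}
```

Placing the single `false` on the same cycle that promotes state[1] is what makes the race deterministic instead of timing-dependent.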
lhotari
approved these changes
May 20, 2026
Motivation
`SameAuthParamsLookupAutoClusterFailover` maintains a per-service-index state array (`Healthy`, `PreFail`, `Failed`, `PreRecover`) updated by a periodic check loop. The check loop only probes indices `0..currentPulsarServiceIndex`.
When we recover to a higher-priority service (`currentPulsarServiceIndex` decreases), the loop stops probing the higher-indexed services. If any of those indices was in a transient state at the moment of recovery (e.g., `PreFail` from a single timed-out probe), it gets stuck there because nothing ever probes it again to flip it back to `Healthy`.
This causes `SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover` to fail intermittently.
Concretely: while `currentPulsarServiceIndex` is 2 and the test is waiting for index 1 to recover, a single transient probe failure on pulsar2 transitions `state[2]: Healthy -> PreFail`. A subsequent successful probe at index 1 reaches `recoverThreshold` and triggers `updateServiceUrl(1)`, dropping `currentPulsarServiceIndex` to 1. From that point on the loop only probes indices 0 and 1, and `state[2]` stays at `PreFail` forever, so the test's 3-minute await window never sees `state[2] == Healthy`.
Example failure: https://scans.gradle.com/s/7pttiiyo6yybc/tests/task/:pulsar-broker:test/details/org.apache.pulsar.broker.SameAuthParamsLookupAutoClusterFailoverTest/testAutoClusterFailover%5B2%5D(true)/1/output
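The stuck state can be reproduced with a heavily simplified model of the loop bound. This is a sketch under assumptions, not the production code: `StuckStateSketch` and `checkCycle` are hypothetical names, and the thresholds and counters of the real state machine are collapsed into single-step transitions.

```java
import java.util.Arrays;

public class StuckStateSketch {
    enum State { Healthy, PreFail, Failed, PreRecover }

    // One simplified check cycle: only indices 0..currentIndex (inclusive)
    // are probed. probeOk[i] is the outcome a probe of service i would have.
    // Simplification (assumption): one success heals, one failure on a
    // Healthy service moves it to PreFail; real thresholds are ignored.
    static void checkCycle(State[] states, int currentIndex, boolean[] probeOk) {
        for (int i = 0; i <= currentIndex; i++) {
            if (probeOk[i]) {
                states[i] = State.Healthy;
            } else if (states[i] == State.Healthy) {
                states[i] = State.PreFail;
            }
        }
    }

    public static void main(String[] args) {
        // Snapshot right after recovering 2 -> 1: index 2 was left in PreFail.
        State[] states = {State.Failed, State.Healthy, State.PreFail};
        boolean[] probeOk = {false, true, true}; // pulsar2 is actually fine
        int currentIndex = 1;
        for (int cycle = 0; cycle < 100; cycle++) {
            checkCycle(states, currentIndex, probeOk);
        }
        // Index 2 is never visited, so it never returns to Healthy.
        System.out.println(Arrays.toString(states)); // [Failed, Healthy, PreFail]
    }
}
```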
Modifications
Production fix (`SameAuthParamsLookupAutoClusterFailover.java`): In `updateServiceUrl`, when recovering (target index < current index), reset the state of indices above the new target to `Healthy` and zero their counters. The state of an unprobed index is not meaningful; resetting it ensures (a) a subsequent failover starts from a clean baseline if it needs to consider those services again, and (b) we don't leave stale transient state lying around.
New deterministic test (`SameAuthParamsLookupAutoClusterFailoverTest.testRecoveryResetsHigherIndexStaleState`): uses the existing mock-based harness in `pulsar-client` to drive a precise probe sequence that reproduces the bug:
1. Failover 0 → 2 (url0 down, url1 down, url2 up): `[Failed, Failed, Healthy]`.
2. url1 recovers; `state[1]: Failed → PreRecover`.
3. On the cycle that promotes `state[1]: PreRecover → Healthy` and triggers `updateServiceUrl(1)`, url2 sees one failed probe so `state[2]: Healthy → PreFail` right before the index drops to 1.
After the recovery transition the check loop only iterates `0..1`, so without the fix `state[2]` is stuck at `PreFail` forever. The test asserts `state[2] == Healthy` after the transition.
I verified the test fails with the exact expected message when the fix is reverted: `AssertionError: state[2] should be reset to Healthy on recovery, not stuck at PreFail expected [Healthy] but found [PreFail]`
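The reset-on-recovery idea can be sketched as follows. This is a minimal illustration of the described behavior, not the merged diff: the class name, field names (`states`, `counters`, `currentIndex`), and the omission of the actual client URL switch are all assumptions.

```java
import java.util.Arrays;

public class RecoveryResetSketch {
    enum State { Healthy, PreFail, Failed, PreRecover }

    final State[] states;   // per-service-index state (hypothetical field name)
    final int[] counters;   // per-service-index probe counters (hypothetical)
    int currentIndex;

    RecoveryResetSketch(State[] states, int[] counters, int currentIndex) {
        this.states = states;
        this.counters = counters;
        this.currentIndex = currentIndex;
    }

    // Switch to targetIndex. On recovery (target < current), indices above
    // the new target will no longer be probed, so their state is reset to
    // Healthy and their counters are zeroed.
    void updateServiceUrl(int targetIndex) {
        if (targetIndex < currentIndex) {
            for (int i = targetIndex + 1; i < states.length; i++) {
                states[i] = State.Healthy;
                counters[i] = 0;
            }
        }
        currentIndex = targetIndex; // the real method also switches the client URL
    }

    public static void main(String[] args) {
        State[] states = {State.Failed, State.Healthy, State.PreFail};
        int[] counters = {3, 0, 1};
        RecoveryResetSketch failover = new RecoveryResetSketch(states, counters, 2);
        failover.updateServiceUrl(1); // recover 2 -> 1
        System.out.println(Arrays.toString(states));   // [Failed, Healthy, Healthy]
        System.out.println(Arrays.toString(counters)); // [3, 0, 0]
    }
}
```

Indices at or below the target are left untouched because the check loop keeps probing them and will update their state itself.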
Verifying this change
Covered by:
- `SameAuthParamsLookupAutoClusterFailoverTest.testRecoveryResetsHigherIndexStaleState` in `pulsar-client` (fails without the fix, passes with it).
- `SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover` in `pulsar-broker` (both TLS and non-TLS variants). Locally I ran it 3 times with `--rerun-tasks`; all passed in ~90s each.
Does this pull request potentially affect one of the following parts: