[fix][client] Reset higher-index states on recovery in SameAuthParamsLookupAutoClusterFailover #25826

Merged
lhotari merged 2 commits into apache:master from merlimat:mmerli/failover-reset-states-on-recovery
May 20, 2026

@merlimat merlimat commented May 19, 2026

Motivation

SameAuthParamsLookupAutoClusterFailover maintains a per-service-index state array (Healthy, PreFail, Failed, PreRecover) updated by a periodic check loop. The check loop only probes indices 0..currentPulsarServiceIndex:

private void checkPulsarServices() {
    for (int i = 0; i <= currentPulsarServiceIndex; i++) {
        ...
    }
}

When we recover to a higher-priority service (currentPulsarServiceIndex decreases), the loop stops probing the higher-indexed services. If any of those indices were in a transient state at the moment of recovery (e.g., PreFail from a single timed-out probe), they get stuck there because nothing ever probes them again to flip them back to Healthy.

This causes SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover to fail intermittently:

Caused by: java.lang.AssertionError: Arrays differ at element [2]: Healthy != PreFail expected [Healthy] but found [PreFail]

Concretely: while currentPulsarServiceIndex is 2 and the test is waiting for index 1 to recover, a single transient probe failure on pulsar2 transitions state[2]: Healthy -> PreFail. A subsequent successful probe at index 1 reaches recoverThreshold and triggers updateServiceUrl(1), dropping currentPulsarServiceIndex to 1. From that point on the loop only probes indices 0 and 1, and state[2] stays at PreFail forever — the test's 3-minute await window never sees state[2] == Healthy.
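
The sequence above can be reproduced in a toy model. This is only a sketch of the pre-fix behavior, not the real class: the name StaleStateDemo, the recover threshold of 2, the omitted fail threshold, and the choice to apply the recovery decision after the probes of a cycle are all assumptions made for illustration.

```java
// Toy model of the pre-fix check loop. All names, thresholds, and the exact
// transition rules are illustrative assumptions, not the real
// SameAuthParamsLookupAutoClusterFailover internals.
class StaleStateDemo {
    enum State { Healthy, PreFail, Failed, PreRecover }

    static final int RECOVER_THRESHOLD = 2; // assumed value

    // Snapshot right after a failover 0 -> 2: url0 and url1 down, url2 up.
    final State[] states = { State.Failed, State.Failed, State.Healthy };
    final int[] counters = new int[3];
    int currentIndex = 2;

    // One check cycle. Only indices 0..currentIndex are probed.
    // up[i] says whether the probe of service i succeeds in this cycle.
    void checkCycle(boolean... up) {
        int last = currentIndex;
        for (int i = 0; i <= last; i++) {
            probe(i, up[i]);
        }
        // Recover to the lowest healthy index below the current one.
        // No reset of higher indices here: this models the behavior before the fix.
        for (int i = 0; i < last; i++) {
            if (states[i] == State.Healthy) {
                currentIndex = i;
                break;
            }
        }
    }

    private void probe(int i, boolean ok) {
        switch (states[i]) {
            case Healthy:
                if (!ok) { states[i] = State.PreFail; counters[i] = 1; }
                break;
            case PreFail:
                // The fail threshold is omitted; a success flips back to Healthy.
                if (ok) { states[i] = State.Healthy; counters[i] = 0; }
                break;
            case Failed:
                if (ok) { states[i] = State.PreRecover; counters[i] = 1; }
                break;
            case PreRecover:
                if (!ok) {
                    states[i] = State.Failed;
                    counters[i] = 0;
                } else if (++counters[i] >= RECOVER_THRESHOLD) {
                    states[i] = State.Healthy;
                    counters[i] = 0;
                }
                break;
        }
    }
}
```

Calling checkCycle(false, true, true) and then checkCycle(false, true, false) drops currentIndex to 1 while leaving states[2] at PreFail. Any number of later checkCycle(false, true, true) calls never touches index 2 again, even though its probe would now succeed.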

Example failure: https://scans.gradle.com/s/7pttiiyo6yybc/tests/task/:pulsar-broker:test/details/org.apache.pulsar.broker.SameAuthParamsLookupAutoClusterFailoverTest/testAutoClusterFailover%5B2%5D(true)/1/output

Modifications

Production fix (SameAuthParamsLookupAutoClusterFailover.java): In updateServiceUrl, when recovering (target index < current index), reset state of indices above the new target to Healthy and zero their counters. The state of an unprobed index is not meaningful — resetting it ensures (a) a subsequent failover starts from a clean baseline if it needs to consider those services again, and (b) we don't leave stale transient state lying around.
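
A minimal sketch of the reset-on-recovery idea, assuming a simplified stand-in for the real class. FailoverStateSketch and its fields are hypothetical names, and resetting every index up to the end of the array (rather than only up to the previous current index) is an assumption; the actual patch may bound the loop differently.

```java
import java.util.Arrays;

// Simplified stand-in for the per-index bookkeeping (hypothetical names).
class FailoverStateSketch {
    enum State { Healthy, PreFail, Failed, PreRecover }

    final State[] states;
    final int[] counters;
    int currentIndex;

    FailoverStateSketch(int serviceCount) {
        states = new State[serviceCount];
        counters = new int[serviceCount];
        Arrays.fill(states, State.Healthy);
    }

    // Switch to targetIndex. On recovery (target < current), indices above the
    // target will no longer be probed, so their state and counters are cleared.
    void updateServiceUrl(int targetIndex) {
        if (targetIndex < currentIndex) {
            // Assumption: reset through the end of the array.
            for (int i = targetIndex + 1; i < states.length; i++) {
                states[i] = State.Healthy;
                counters[i] = 0;
            }
        }
        currentIndex = targetIndex;
    }
}
```

With currentIndex at 2 and states [Failed, Healthy, PreFail], calling updateServiceUrl(1) leaves index 0 untouched and resets index 2 to Healthy with a zero counter. A failover call (target above current) resets nothing.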

New deterministic test (SameAuthParamsLookupAutoClusterFailoverTest.testRecoveryResetsHigherIndexStaleState): uses the existing mock-based harness in pulsar-client to drive a precise probe sequence that reproduces the bug:

  1. Failover 0 → 2 (url0 down, url1 down, url2 up). State becomes [Failed, Failed, Healthy].
  2. url1 recovers; a check cycle transitions state[1]: Failed → PreRecover.
  3. On the cycle that promotes state[1]: PreRecover → Healthy and triggers updateServiceUrl(1), url2 sees one failed probe so state[2]: Healthy → PreFail right before the index drops to 1.

After the recovery transition the check loop only iterates 0..1, so without the fix state[2] is stuck at PreFail forever. The test asserts state[2] == Healthy after the transition.

I verified the test fails with the exact expected message when the fix is reverted:

AssertionError: state[2] should be reset to Healthy on recovery, not stuck at PreFail expected [Healthy] but found [PreFail]

Verifying this change

Covered by:

  • The new deterministic unit test SameAuthParamsLookupAutoClusterFailoverTest.testRecoveryResetsHigherIndexStaleState in pulsar-client (fails without the fix, passes with it).
  • The existing integration test SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover in pulsar-broker (both TLS and non-TLS variants). Locally I ran it 3 times with --rerun-tasks; all passed in ~90s each.

Does this pull request potentially affect one of the following parts:

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

…LookupAutoClusterFailover

The periodic health-check loop only iterates indices 0..currentPulsarServiceIndex.
When we recover to a higher-priority service (currentPulsarServiceIndex decreases),
the loop stops probing the higher-indexed services. If any of those indices were
in a transient state at that moment (e.g., PreFail from a single timed-out probe),
they get stuck there because nothing ever probes them again to flip them back
to Healthy.

This caused SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover
to fail intermittently: after recovering 2->1, the test asserts state[2]=Healthy,
but a single transient probe failure on pulsar2 right before the recovery left
state[2]=PreFail, and the loop never visits it again.

Reset state of indices above the new target to Healthy on recovery so they start
fresh if a future failover needs to consider them again.

Adds testRecoveryResetsHigherIndexStaleState to the client-side
SameAuthParamsLookupAutoClusterFailoverTest that uses a mocked
LookupService to drive a precise probe sequence:

1. Failover 0 -> 2 (url0 down, url1 down, url2 up).
2. url1 recovers; state[1] Failed -> PreRecover.
3. On the cycle that promotes state[1] PreRecover -> Healthy and
   triggers updateServiceUrl(1), url2 sees one failed probe so
   state[2] flips Healthy -> PreFail right before the index drops
   to 1.

After the recovery transition, the check loop only iterates 0..1,
so without the fix state[2] is stuck at PreFail forever. The test
asserts state[2]=Healthy after the transition, which fails without
the production fix and passes with it.

Verified the test fails with the exact expected message when the fix
is reverted:

    AssertionError: state[2] should be reset to Healthy on recovery,
    not stuck at PreFail expected [Healthy] but found [PreFail]
@lhotari lhotari merged commit 4229e19 into apache:master May 20, 2026
44 checks passed