[fix][test] Fix flaky SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover() test #25892

Open
oneby-wang wants to merge 1 commit into apache:master from oneby-wang:testAutoClusterFailover_flaky_fix

@oneby-wang
Contributor

Motivation

SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover can fail intermittently when run repeatedly with invocationCount = 100.

The failure reproduced as:

Gradle suite > Gradle test > org.apache.pulsar.broker.SameAuthParamsLookupAutoClusterFailoverTest > testAutoClusterFailover[42](false) FAILED
    org.awaitility.core.ConditionTimeoutException: Assertion condition Arrays differ at element [1]: Healthy != PreFail expected [Healthy] but found [PreFail] within 3 minutes.
        at org.apache.pulsar.broker.SameAuthParamsLookupAutoClusterFailoverTest.awaitStatesAndIndex(SameAuthParamsLookupAutoClusterFailoverTest.java:154)
        at org.apache.pulsar.broker.SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover(SameAuthParamsLookupAutoClusterFailoverTest.java:134)

The key signal in the logs is that two scheduled check threads (broker-service-url-check-2795-1 and broker-service-url-check-2796-1) were driving the same failover provider:

2026-05-29T22:55:15,723 - INFO - [broker-service-url-check-2796-1:SameAuthParamsLookupAutoClusterFailover] - Failover to low priority pulsar service [0] pulsar://localhost:53673 --> [2] pulsar://localhost:53683. States: [Failed, Failed, Healthy], Counters: [0, 0, 0] {}
2026-05-29T22:55:15,723 - INFO - [broker-service-url-check-2795-1:SameAuthParamsLookupAutoClusterFailover] - Failover to low priority pulsar service [0] pulsar://localhost:53673 --> [2] pulsar://localhost:53683. States: [Failed, Failed, Healthy], Counters: [0, 0, 0] {}
2026-05-29T22:55:21,734 - INFO - [broker-service-url-check-2796-1:SameAuthParamsLookupAutoClusterFailover] - Recover to high priority pulsar service [2] pulsar://localhost:53683 --> [1] pulsar://localhost:53683. States: [Failed, Healthy, Healthy], Counters: [0, 0, 0] {}
2026-05-29T22:55:21,734 - INFO - [broker-service-url-check-2795-1:SameAuthParamsLookupAutoClusterFailover] - Recover to high priority pulsar service [2] pulsar://localhost:53683 --> [1] pulsar://localhost:53683. States: [Failed, Healthy, Healthy], Counters: [0, 0, 0] {}
2026-05-29T22:55:22,114 - INFO - [broker-service-url-check-2796-1:SameAuthParamsLookupAutoClusterFailover] - Recover to high priority pulsar service [1] pulsar://localhost:53683 --> [0] pulsar://localhost:53683. States: [Healthy, Healthy, Healthy], Counters: [0, 0, 0] {}
2026-05-29T22:55:22,115 - WARN - [broker-service-url-check-2795-1:SameAuthParamsLookupAutoClusterFailover] - Failed to probe service availability {brokerServiceIndex=1, counters=[0, 0, 0], states=[Healthy, Healthy, Healthy], url=pulsar://localhost:53683}

The test passed the provider into PulsarClient.builder().serviceUrlProvider(failover). PulsarClientImpl already initializes the configured ServiceUrlProvider while building the client. The test then called failover.initialize(client) again manually, creating a second scheduled check task for the same provider instance.
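
The double initialization can be illustrated with a minimal JDK-only sketch. ToyProvider, checkersAfter and the one-hour delay are hypothetical names and values chosen for illustration; this is not the Pulsar implementation, only a stand-in for any provider whose initialize() schedules a periodic check and is not idempotent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DoubleInitSketch {

    /** Hypothetical stand-in for a provider whose initialize() schedules a periodic check. */
    static final class ToyProvider implements AutoCloseable {
        private final List<ScheduledExecutorService> checkers = new ArrayList<>();

        /** Not idempotent: every call starts another checker over the same provider state. */
        void initialize() {
            ScheduledExecutorService checker = Executors.newSingleThreadScheduledExecutor();
            // Long delay so the toy check never actually fires during the demo.
            checker.scheduleWithFixedDelay(() -> { }, 1, 1, TimeUnit.HOURS);
            checkers.add(checker);
        }

        int activeCheckers() {
            return checkers.size();
        }

        @Override
        public void close() {
            checkers.forEach(ScheduledExecutorService::shutdownNow);
        }
    }

    /** Returns how many checkers exist after calling initialize() the given number of times. */
    public static int checkersAfter(int initializeCalls) {
        try (ToyProvider provider = new ToyProvider()) {
            for (int i = 0; i < initializeCalls; i++) {
                provider.initialize();
            }
            return provider.activeCheckers();
        }
    }

    public static void main(String[] args) {
        // One call models the client builder initializing the provider.
        System.out.println("client build only: " + checkersAfter(1));
        // Two calls model the builder plus the redundant manual initialize(client).
        System.out.println("client build + manual initialize: " + checkersAfter(2));
    }
}
```

In this toy model the second call leaves two independent checkers running against one provider instance, which matches the two distinct thread names in the log excerpt above.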

Because both scheduled tasks mutate the same failover state, one task can recover the current index back to 0 while the other task is still probing index 1. A transient failed probe can move index 1 to PreFail; after the current index is 0, the check loop no longer visits index 1, so the test waits until timeout with [Healthy, PreFail, Healthy].
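
The interleaving described above can be replayed deterministically on a toy copy of the shared state. This is only a sketch under stated assumptions: the State enum, the replay method, and the rule that a check round probes only indexes up to the current one are simplifications made for illustration, not the provider's actual code.

```java
import java.util.Arrays;

public class InterleavingReplay {

    enum State { Healthy, PreFail, Failed }

    /** Replays the described interleaving step by step and returns the final states. */
    public static String replay() {
        State[] states = { State.Healthy, State.Healthy, State.Healthy };
        int currentIndex = 1; // both checkers have just recovered from [2] to [1]

        // Checker A: index 0 probes healthy, so it recovers the current index to 0.
        currentIndex = 0;

        // Checker B: still mid-probe on index 1; a transient failure marks it PreFail.
        states[1] = State.PreFail;

        // Later rounds (assumed rule): only indexes up to the current index are probed,
        // so index 1 is never revisited and its PreFail state is never cleared.
        for (int round = 0; round < 3; round++) {
            for (int i = 0; i <= currentIndex; i++) {
                states[i] = State.Healthy; // all later probes succeed
            }
        }
        return Arrays.toString(states);
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints [Healthy, PreFail, Healthy]
    }
}
```

With a single checker the probe of index 1 and the recovery to index 0 cannot overlap, so this stuck state should not arise in the same way.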

Modifications

Removed the redundant manual failover.initialize(client) call from SameAuthParamsLookupAutoClusterFailoverTest.

The provider lifecycle remains managed by PulsarClient through ClientBuilder.serviceUrlProvider(...).

Verifying this change

  • Make sure that the change passes the CI checks.

This change is already covered by existing tests:

  • ./gradlew :pulsar-broker:test --tests "org.apache.pulsar.broker.SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover" -PtestRetryCount=0 --no-configuration-cache

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Contributor

@void-ptr974 left a comment

LGTM. PulsarClientImpl already initializes the ServiceUrlProvider, so this removes the duplicate scheduled check task.

Member

@lhotari left a comment

LGTM, thanks!

3 participants