FlakyIT: ITHighAvailabilityTest #12653

@paul-rogers

The integration test ITHighAvailabilityTest failed in this build:

[ERROR] Failures: 
[ERROR]   ITHighAvailabilityTest.testCoordinatorCluster:207 » ISE Max number of retries[...

Details:

2022-06-14T18:33:50,356 INFO [main] org.apache.druid.testing.utils.DruidClusterAdminClient - 307 Temporary Redirect 
2022-06-14T18:33:50,356 INFO [main] org.apache.druid.testing.utils.ITRetryUtil - Trying attempt[0/240]...
2022-06-14T18:33:50,358 WARN [HttpClient-Netty-Worker-14] org.apache.druid.java.util.http.client.pool.ResourcePool - Resource at key[http://127.0.0.1:8590] was returned multiple times?
2022-06-14T18:33:50,358 ERROR [main] org.apache.druid.testing.utils.DruidClusterAdminClient - Error while waiting for [http://127.0.0.1:8590] to be ready
java.util.concurrent.ExecutionException: java.io.IOException: Connection reset by peer
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[guava-16.0.1.jar:?]
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[guava-16.0.1.jar:?]
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.1.jar:?]
	at org.apache.druid.testing.utils.DruidClusterAdminClient.lambda$waitUntilInstanceReady$1(DruidClusterAdminClient.java:268) ~[druid-integration-tests-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at org.apache.druid.testing.utils.ITRetryUtil.retryUntil(ITRetryUtil.java:61) ~[druid-integration-tests-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at org.apache.druid.testing.utils.ITRetryUtil.retryUntilTrue(ITRetryUtil.java:39) ~[druid-integration-tests-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at org.apache.druid.testing.utils.DruidClusterAdminClient.waitUntilInstanceReady(DruidClusterAdminClient.java:262) ~[druid-integration-tests-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at org.apache.druid.testing.utils.DruidClusterAdminClient.waitUntilOverlordTwoReady(DruidClusterAdminClient.java:140) ~[druid-integration-tests-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at org.apache.druid.tests.leadership.ITHighAvailabilityTest.lambda$swapLeadersAndWait$7(ITHighAvailabilityTest.java:263) ~[test-classes/:?]
	at org.apache.druid.tests.leadership.ITHighAvailabilityTest.swapLeadersAndWait(ITHighAvailabilityTest.java:266) ~[test-classes/:?]
	at org.apache.druid.tests.leadership.ITHighAvailabilityTest.testLeadershipChanges(ITHighAvailabilityTest.java:125) ~[test-classes/:?]

This PR did change this particular test case, but in a different test function. Some things to note:

  • This test passed on a previous run for this PR. The change that triggered the re-run was trivial: a change in a documentation file.
  • The logged error occurred on attempt 0 of 240, yet the reported failure is that the maximum number of retries was exceeded. Somehow the retry mechanism (which is generally over-aggressive) didn't kick in this time.
  • The log shows a 307 Temporary Redirect response just before the failure. A redirect is not an error in itself, but perhaps the tests don't handle the transient case in which a redirect occurs?
  • Perhaps unrelated, but the log also contains a warning from ResourcePool: "Resource at key[http://127.0.0.1:8590] was returned multiple times?"
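
To make the expected behavior concrete, here is a minimal sketch of a readiness check that treats both a connection reset and a 307 response as "not ready yet" and keeps retrying. This is not the code in ITRetryUtil or DruidClusterAdminClient: the class ReadinessRetrySketch, both method names, and the rule that only a 200 counts as ready are assumptions made for illustration.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch; not Druid's ITRetryUtil or DruidClusterAdminClient.
final class ReadinessRetrySketch {

    // Assumption: only a 200 means the instance is ready. A 307 Temporary
    // Redirect (e.g. a non-leader pointing at the leader) is "not ready yet".
    static boolean isReady(int httpStatus) {
        return httpStatus == 200;
    }

    // Runs the check up to maxAttempts times. An exception thrown by the
    // check (such as "Connection reset by peer") counts as a failed attempt
    // rather than aborting the loop. Returns the zero-based index of the
    // attempt that succeeded.
    static int retryUntilTrue(Callable<Boolean> check, int maxAttempts, long delayMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (Boolean.TRUE.equals(check.call())) {
                    return attempt;
                }
            } catch (Exception e) {
                // Transient failure: fall through and try again.
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting to retry", e);
            }
        }
        throw new IllegalStateException("Max number of retries[" + maxAttempts + "] exceeded");
    }

    public static void main(String[] args) {
        // Simulated sequence: connection reset, then 307, then 200.
        int[] calls = {0};
        int succeededAt = retryUntilTrue(() -> {
            int n = calls[0]++;
            if (n == 0) {
                throw new IOException("Connection reset by peer");
            }
            return isReady(n == 1 ? 307 : 200);
        }, 240, 10L);
        System.out.println("Ready after attempt " + succeededAt);
    }
}
```

If the real retry loop behaved like this, a single connection reset on attempt 0 would not surface as "Max number of retries" exceeded, which is why the log above looks suspicious.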

This particular test has been redone in the "new IT" PR, but we're stuck with the old version in the present PR.
