AtomicRegisterCoordinatorTests.testClusterRecoversAfterExceptionDuringSerialization failure #98606

DaveCTurner · 2023-08-17T15:01:31Z

The following test fails reproducibly in current main (af071cc):

./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.AtomicRegisterCoordinatorTests.testClusterRecoversAfterExceptionDuringSerialization" -Dtests.seed=E6F4C7552A14BE28

The text was updated successfully, but these errors were encountered:

elasticsearchmachine · 2023-08-17T15:01:55Z

Pinging @elastic/es-distributed (Team:Distributed)

It's possible (although very unlikely) that the `GatewayService` recovers the state, then fails over to a new master with unrecovered state, and then fails back to the original master, and only then performs the reroute that resets the flags which would trigger another state recovery attempt. This leaves the cluster in an unrecovered state until the next cluster state update. This commit resets the flags at the end of the recovery update rather than waiting until after the reroute, allowing a subsequent election to retry recovery again. Closes elastic#98606

It's possible (although very unlikely) that the `GatewayService` recovers the state, then fails over to a new master with unrecovered state, and then fails back to the original master, and only then performs the reroute that resets the flags which would trigger another state recovery attempt. This leaves the cluster in an unrecovered state until the next cluster state update. This commit resets the flags at the end of the recovery update rather than waiting until after the reroute, allowing a subsequent election to retry recovery again. Closes #98606

It's possible (although very unlikely) that the `GatewayService` recovers the state, then fails over to a new master with unrecovered state, and then fails back to the original master, and only then performs the reroute that resets the flags which would trigger another state recovery attempt. This leaves the cluster in an unrecovered state until the next cluster state update. This commit resets the flags at the end of the recovery update rather than waiting until after the reroute, allowing a subsequent election to retry recovery again. Closes elastic#98606

DaveCTurner added >test-failure Triaged test failures from CI :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Aug 17, 2023

elasticsearchmachine added the Team:Distributed Meta label for distributed team label Aug 17, 2023

DaveCTurner self-assigned this Aug 18, 2023

DaveCTurner mentioned this issue Aug 21, 2023

Reset GatewayService flags before reroute #98653

Merged

DaveCTurner closed this as completed in #98653 Aug 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AtomicRegisterCoordinatorTests.testClusterRecoversAfterExceptionDuringSerialization failure #98606

AtomicRegisterCoordinatorTests.testClusterRecoversAfterExceptionDuringSerialization failure #98606

DaveCTurner commented Aug 17, 2023

elasticsearchmachine commented Aug 17, 2023

AtomicRegisterCoordinatorTests.testClusterRecoversAfterExceptionDuringSerialization failure #98606

AtomicRegisterCoordinatorTests.testClusterRecoversAfterExceptionDuringSerialization failure #98606

Comments

DaveCTurner commented Aug 17, 2023

elasticsearchmachine commented Aug 17, 2023