AtomicRegisterCoordinatorTests.testClusterRecoversAfterExceptionDuringSerialization failure #98606

Closed
DaveCTurner opened this issue Aug 17, 2023 · 1 comment · Fixed by #98653
Labels: :Distributed/Cluster Coordination, Team:Distributed, >test-failure

Comments

@DaveCTurner

The following test fails reproducibly in current main (af071cc):

./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.AtomicRegisterCoordinatorTests.testClusterRecoversAfterExceptionDuringSerialization" -Dtests.seed=E6F4C7552A14BE28
DaveCTurner added the >test-failure and :Distributed/Cluster Coordination labels Aug 17, 2023
elasticsearchmachine added the Team:Distributed label Aug 17, 2023
@elasticsearchmachine

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Aug 18, 2023
It's possible (although very unlikely) that the `GatewayService`
recovers the state, then fails over to a new master with unrecovered
state, and then fails back to the original master, and only then
performs the reroute that resets the flags which would trigger another
state recovery attempt. This leaves the cluster in an unrecovered state
until the next cluster state update.

This commit resets the flags at the end of the recovery update rather
than waiting until after the reroute, allowing a subsequent election to
retry recovery.

Closes elastic#98606
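
The ordering problem described in the commit message can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `RecoveryFlagSketch`, `Recoverer`, `recoveryInProgress`, and `pendingReroute` are hypothetical names invented for illustration, the commit's "flags" are collapsed into a single boolean, and none of this is the actual `GatewayService` code. It shows why a flag that is cleared only after the reroute can cause a re-elected master to skip recovery, and how clearing it at the end of the recovery update avoids that.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical, simplified model of the race; NOT the real GatewayService. */
public class RecoveryFlagSketch {

    /** Stand-in for the master's recovery bookkeeping (all names are assumptions). */
    static final class Recoverer {
        private final AtomicBoolean recoveryInProgress = new AtomicBoolean();
        private final boolean resetAtEndOfRecoveryUpdate;
        private Runnable pendingReroute = () -> {};

        Recoverer(boolean resetAtEndOfRecoveryUpdate) {
            this.resetAtEndOfRecoveryUpdate = resetAtEndOfRecoveryUpdate;
        }

        /**
         * Called when this node is elected master and sees an unrecovered state.
         * Returns true if a recovery attempt was started, false if it was skipped.
         */
        boolean onElectedWithUnrecoveredState() {
            if (recoveryInProgress.compareAndSet(false, true) == false) {
                return false; // flag still set from an earlier attempt: recovery is skipped
            }
            // ... the recovery cluster state update would run here ...
            if (resetAtEndOfRecoveryUpdate) {
                recoveryInProgress.set(false); // new behaviour: reset with the recovery update
                pendingReroute = () -> {};
            } else {
                // old behaviour: the flag is only reset once the later reroute completes
                pendingReroute = () -> recoveryInProgress.set(false);
            }
            return true;
        }

        /** Simulates completion of the reroute that follows the recovery update. */
        void completeReroute() {
            pendingReroute.run();
            pendingReroute = () -> {};
        }
    }

    /** Replays the sequence from the commit message and reports whether recovery is retried. */
    static boolean secondElectionRetries(boolean resetAtEndOfRecoveryUpdate) {
        Recoverer original = new Recoverer(resetAtEndOfRecoveryUpdate);
        original.onElectedWithUnrecoveredState(); // first recovery on the original master
        // Failover to a new master with unrecovered state, then failback to the
        // original master, both happening before the reroute has completed.
        boolean retried = original.onElectedWithUnrecoveredState();
        original.completeReroute(); // the reroute only finishes afterwards
        return retried;
    }

    public static void main(String[] args) {
        System.out.println("reset after reroute, retried: " + secondElectionRetries(false));
        System.out.println("reset at end of update, retried: " + secondElectionRetries(true));
    }
}
```

In this model the first line prints `false` (the re-elected master skips recovery and the state stays unrecovered) and the second prints `true` (the subsequent election retries recovery).
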
DaveCTurner self-assigned this Aug 18, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Aug 21, 2023
DaveCTurner added a commit that referenced this issue Aug 22, 2023
Closes #98606
dreamquster pushed a commit to dreamquster/elasticsearch that referenced this issue Aug 26, 2023