Reset GatewayService flags before reroute #98653


DaveCTurner
Contributor

It's possible (although very unlikely) that the `GatewayService`
recovers the state, the cluster fails over to a new master with
unrecovered state and then fails back to the original master, and only
then does the original master perform the reroute that resets the
flags and so allows another state recovery attempt to be triggered.
This leaves the cluster in an unrecovered state until the next cluster
state update.

This commit resets the flags at the end of the recovery update rather
than waiting until after the reroute, allowing a subsequent election to
retry recovery.

Closes #98606
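
To make the ordering problem concrete, here is a minimal toy model in Java. It is not the Elasticsearch implementation: the class `RecoveryFlagSketch`, the single `recoveryInProgress` flag, and the deferred-task queue standing in for the reroute are all hypothetical names chosen for illustration. It only assumes that a recovery attempt is skipped while the flag is still set.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model only: names and structure are hypothetical, not the real GatewayService. */
class RecoveryFlagSketch {
    // Hypothetical guard flag: a new recovery attempt is skipped while it is set.
    private final AtomicBoolean recoveryInProgress = new AtomicBoolean(false);
    // Stand-in for work (the reroute) that completes some time after the recovery update.
    private final Queue<Runnable> deferredTasks = new ArrayDeque<>();
    private final boolean resetAtEndOfRecoveryUpdate;

    RecoveryFlagSketch(boolean resetAtEndOfRecoveryUpdate) {
        this.resetAtEndOfRecoveryUpdate = resetAtEndOfRecoveryUpdate;
    }

    /** Returns true if a recovery attempt actually started. */
    boolean tryRecoverState() {
        if (recoveryInProgress.compareAndSet(false, true) == false) {
            return false; // an earlier attempt still holds the flag
        }
        // ... the recovery cluster state update would be applied here ...
        if (resetAtEndOfRecoveryUpdate) {
            recoveryInProgress.set(false);   // ordering described in this PR
            deferredTasks.add(() -> {});     // the reroute no longer owns the reset
        } else {
            // Previous ordering: the flag is only cleared once the reroute completes.
            deferredTasks.add(() -> recoveryInProgress.set(false));
        }
        return true;
    }

    /** Runs any pending reroute work. */
    void runDeferredReroute() {
        Runnable task;
        while ((task = deferredTasks.poll()) != null) {
            task.run();
        }
    }

    /** Simulates: recover, fail over and fail back before the reroute runs, then retry. */
    static boolean retrySucceedsAfterFailback(boolean resetAtEndOfRecoveryUpdate) {
        RecoveryFlagSketch master = new RecoveryFlagSketch(resetAtEndOfRecoveryUpdate);
        master.tryRecoverState(); // first recovery; the reroute is still pending
        // Failover to a master with unrecovered state and failback happen here.
        return master.tryRecoverState();
    }

    public static void main(String[] args) {
        System.out.println("reset after reroute: retry started = "
            + retrySucceedsAfterFailback(false));
        System.out.println("reset at end of recovery update: retry started = "
            + retrySucceedsAfterFailback(true));
    }
}
```

In this model the retry is skipped when the reset waits for the reroute and goes ahead when the reset happens at the end of the recovery update, which is the behaviour the description above is aiming for.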

@DaveCTurner added the >bug, :Distributed/Cluster Coordination, v8.11.0, and v8.10.1 labels Aug 21, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine added the Team:Distributed label Aug 21, 2023
@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

Contributor

@fcofdez fcofdez left a comment

LGTM

@DaveCTurner DaveCTurner merged commit 2df0e68 into elastic:main Aug 22, 2023
12 checks passed
@DaveCTurner DaveCTurner deleted the 2023/08/21/testClusterRecoversAfterExceptionDuringSerialization branch August 22, 2023 13:34
dreamquster pushed a commit to dreamquster/elasticsearch that referenced this pull request Aug 26, 2023
@JVerwolf JVerwolf added v8.10.0 and removed v8.10.1 labels Aug 31, 2023
Labels
>bug, :Distributed/Cluster Coordination, Team:Distributed, v8.10.0, v8.11.0
Development

Successfully merging this pull request may close these issues.

AtomicRegisterCoordinatorTests.testClusterRecoversAfterExceptionDuringSerialization failure
4 participants