Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry) #54714

Closed
original-brownbear opened this issue Apr 3, 2020 · 2 comments · Fixed by #55376
Assignees
Labels
>bug :Distributed/Network Http and internode communication implementations :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@original-brownbear
Copy link
Member

original-brownbear commented Apr 3, 2020

Currently, the message that data nodes send to master when they are done snapshotting a shard is not exempt from the circuit breaker.
That means that if a data node sends a [internal:cluster/snapshot/update_snapshot_status] message and the master drops it because of the circuit breaker, the data node will not resend it unless master fails over.
This in turn leads to the master never finalising the shard snapshot in the cluster state and thus never finalising the snapshot overall, making it stuck until a master failover occurs.

We need to fix this by either exempting these messages from the circuit breaker or retrying them if they run into the circuit breaker.

Note: due to the small size of the shard status update request this is fairly unlikely to happen fortunately

EDIT: just discussing this with @ywelsch he pointed out that this is a general problem with TransportMasterNodeAction. We retry requests for master on master fail-overs but we don't retry on circuit breaker exceptions. So we ideally need a fix here that adds back-off retries to TransportMasterNodeAction.

@original-brownbear original-brownbear added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Apr 3, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear original-brownbear self-assigned this Apr 3, 2020
@original-brownbear original-brownbear changed the title Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry) Apr 3, 2020
@original-brownbear original-brownbear added the :Distributed/Network Http and internode communication implementations label Apr 3, 2020
@ywelsch
Copy link
Contributor

ywelsch commented Apr 3, 2020

There are other actions similar to TransportMasterNodeAction that suffer from similar issues, e.g. ShardStateAction. Assume for example that a primary fails to replicate a write to a replica, and therefore reaches out to the master to mark the replica as stale. If the master is circuitbreaking, the primary will fail itself (ReplicationOperation.onNoLongerPrimary), which is just horrible.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 6, 2020
Implements randomized, exponential backoff for remote master actions if they tripped
the master's circuit breakers.

Closes elastic#54714
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 17, 2020
Hotfix to not run into stuck snapshots because of master circuit breaking these requests.
Given that these requests are very small and much of the memory associated with them is already allocated
when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out
of memory should be very small.

Closes elastic#54714
original-brownbear added a commit that referenced this issue Apr 17, 2020
…55376)

Hotfix to not run into stuck snapshots because of master circuit breaking these requests.
Given that these requests are very small and much of the memory associated with them is already allocated
when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out
of memory should be very small.

Closes #54714
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 17, 2020
…lastic#55376)

Hotfix to not run into stuck snapshots because of master circuit breaking these requests.
Given that these requests are very small and much of the memory associated with them is already allocated
when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out
of memory should be very small.

Closes elastic#54714
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 17, 2020
…lastic#55376)

Hotfix to not run into stuck snapshots because of master circuit breaking these requests.
Given that these requests are very small and much of the memory associated with them is already allocated
when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out
of memory should be very small.

Closes elastic#54714
original-brownbear added a commit that referenced this issue Apr 17, 2020
…55376) (#55384)

Hotfix to not run into stuck snapshots because of master circuit breaking these requests.
Given that these requests are very small and much of the memory associated with them is already allocated
when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out
of memory should be very small.

Closes #54714
original-brownbear added a commit that referenced this issue Apr 17, 2020
…55376) (#55383)

Hotfix to not run into stuck snapshots because of master circuit breaking these requests.
Given that these requests are very small and much of the memory associated with them is already allocated
when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out
of memory should be very small.

Closes #54714
yyff pushed a commit to yyff/elasticsearch that referenced this issue Apr 17, 2020
…lastic#55376)

Hotfix to not run into stuck snapshots because of master circuit breaking these requests.
Given that these requests are very small and much of the memory associated with them is already allocated
when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out
of memory should be very small.

Closes elastic#54714
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Network Http and internode communication implementations :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs
Projects
None yet
3 participants