Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry) #54714

original-brownbear · 2020-04-03T12:31:25Z

Currently, the message that data nodes send to master when they are done snapshotting a shard is not exempt from the circuit breaker.
That means that if a data node sends a [internal:cluster/snapshot/update_snapshot_status] message and the master drops it because of the circuit breaker, the data node will not resend it unless master fails over.
This in turn leads to the master never finalising the shard snapshot in the cluster state and thus never finalising the snapshot overall, making it stuck until a master failover occurs.

We need to fix this by either exempting these messages from the circuit breaker or retrying them if they run into the circuit breaker.

Note: due to the small size of the shard status update request this is fairly unlikely to happen fortunately

EDIT: just discussing this with @ywelsch he pointed out that this is a general problem with TransportMasterNodeAction. We retry requests for master on master fail-overs but we don't retry on circuit breaker exceptions. So we ideally need a fix here that adds back-off retries to TransportMasterNodeAction.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2020-04-03T12:31:27Z

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

ywelsch · 2020-04-03T14:08:51Z

There are other actions similar to TransportMasterNodeAction that suffer from similar issues, e.g. ShardStateAction. Assume for example that a primary fails to replicate a write to a replica, and therefore reaches out to the master to mark the replica as stale. If the master is circuitbreaking, the primary will fail itself (ReplicationOperation.onNoLongerPrimary), which is just horrible.

Implements randomized, exponential backoff for remote master actions if they tripped the master's circuit breakers. Closes elastic#54714

Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes elastic#54714

…55376) Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes #54714

…lastic#55376) Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes elastic#54714

…55376) (#55384) Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes #54714

…55376) (#55383) Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes #54714

…lastic#55376) Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes elastic#54714

original-brownbear added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Apr 3, 2020

original-brownbear self-assigned this Apr 3, 2020

original-brownbear changed the title ~~Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message~~ Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry) Apr 3, 2020

original-brownbear added the :Distributed/Network Http and internode communication implementations label Apr 3, 2020

original-brownbear mentioned this issue Apr 6, 2020

Implement Retries if Master Tripped Circuit Breaker on Master Action #54821

Closed

original-brownbear mentioned this issue Apr 17, 2020

Exclude Snapshot Shard Status Update Requests from Circuit Breaker #55376

Merged

original-brownbear closed this as completed in #55376 Apr 17, 2020

original-brownbear mentioned this issue Apr 17, 2020

Exclude Snapshot Shard Status Update Requests from Circuit Breaker (#55376) #55383

Merged

original-brownbear mentioned this issue Apr 17, 2020

Exclude Snapshot Shard Status Update Requests from Circuit Breaker (#55376) #55384

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry) #54714

Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry) #54714

original-brownbear commented Apr 3, 2020 •

edited

elasticmachine commented Apr 3, 2020

ywelsch commented Apr 3, 2020

Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry) #54714

Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry) #54714

Comments

original-brownbear commented Apr 3, 2020 • edited

elasticmachine commented Apr 3, 2020

ywelsch commented Apr 3, 2020

original-brownbear commented Apr 3, 2020 •

edited