New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snapshot Can Become Stuck if Master Circuit-Breaks on Shard Snapshot Update Message (TransportMasterNodeAction does not retry) #54714
Comments
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore) |
There are other actions similar to |
Implements randomized, exponential backoff for remote master actions if they tripped the master's circuit breakers. Closes elastic#54714
Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes elastic#54714
…55376) Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes #54714
…lastic#55376) Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes elastic#54714
…lastic#55376) Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes elastic#54714
…55376) (#55384) Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes #54714
…55376) (#55383) Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes #54714
…lastic#55376) Hotfix to not run into stuck snapshots because of master circuit breaking these requests. Given that these requests are very small and much of the memory associated with them is already allocated when the circuit breaker kicks in, the risk of this change introducing a higher chance of master running out of memory should be very small. Closes elastic#54714
Currently, the message that data nodes send to master when they are done snapshotting a shard is not exempt from the circuit breaker.
That means that if a data node sends a
[internal:cluster/snapshot/update_snapshot_status]
message and the master drops it because of the circuit breaker, the data node will not resend it unless master fails over.This in turn leads to the master never finalising the shard snapshot in the cluster state and thus never finalising the snapshot overall, making it stuck until a master failover occurs.
We need to fix this by either exempting these messages from the circuit breaker or retrying them if they run into the circuit breaker.
Note: due to the small size of the shard status update request this is fairly unlikely to happen fortunately
EDIT: just discussing this with @ywelsch he pointed out that this is a general problem with
TransportMasterNodeAction
. We retry requests for master on master fail-overs but we don't retry on circuit breaker exceptions. So we ideally need a fix here that adds back-off retries toTransportMasterNodeAction
.The text was updated successfully, but these errors were encountered: