During shard relocation some requests might fail to be sent to a shard #13719
Labels
>bug
:Distributed/Distributed
A catch all label for anything in the Distributed Area. If you aren't sure, use this one.
help wanted
adoptme
resiliency
Team:Distributed
Meta label for distributed team
When shards relocate then there might be small window in time where requests fail to reach the relocating shard. This happens when a node that lags one cluster state behind has not realized yet that a node has relocated and the relocation source is already removed.
Below is a graphical representation of an example course of events. This caused us some trouble in test already because results were unexpected, see #13266. It affects all actions that inherit from
TransportBroadcastAction
,TransportBroadcastByNodeAction
and might also be problematic for others. For example: an optimize request might never reach a shard if it is relocating, indices stats may report wrong statistics, see #13266 (comment), etc.We should check if we can get away with just sending requests to relocation targets too for the affected actions or if we need to implement these kind of requests as replication action like we did for refresh and flush.
I: shard is relocating from n2 to n3
II: CS2 signals that n3 has started its shard and n2 can remove its own copy. shard on n2 is therefore closed. But n1 lags behind one cluster state and still expects an up and running primary on n2
III: if now n1 sends an optimize, indices stats request or the likes, it will send the request to n2 (based on CS1) but that does not not have the shard anymore.
The text was updated successfully, but these errors were encountered: