During shard relocation some requests might fail to be sent to a shard #13719

Open
brwe opened this Issue Sep 22, 2015 · 0 comments

Projects

None yet

2 participants

@brwe
Contributor
brwe commented Sep 22, 2015

When shards relocate then there might be small window in time where requests fail to reach the relocating shard. This happens when a node that lags one cluster state behind has not realized yet that a node has relocated and the relocation source is already removed.
Below is a graphical representation of an example course of events. This caused us some trouble in test already because results were unexpected, see #13266. It affects all actions that inherit from TransportBroadcastAction, TransportBroadcastByNodeAction and might also be problematic for others. For example: an optimize request might never reach a shard if it is relocating, indices stats may report wrong statistics, see #13266 (comment), etc.

We should check if we can get away with just sending requests to relocation targets too for the affected actions or if we need to implement these kind of requests as replication action like we did for refresh and flush.

recovery-issues

finally

I: shard is relocating from n2 to n3
II: CS2 signals that n3 has started its shard and n2 can remove its own copy. shard on n2 is therefore closed. But n1 lags behind one cluster state and still expects an up and running primary on n2
III: if now n1 sends an optimize, indices stats request or the likes, it will send the request to n2 (based on CS1) but that does not not have the shard anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment