When a primary is relocating from node_1 to node_2, there can be a short window in which the old primary has already been removed from node_1 (closed, not deleted) but the new primary on node_2 is still in POST_RECOVERY. In this state, indexing requests might be sent back and forth between node_1 and node_2 endlessly.
Course of events:
primary ([index]) relocates from node_1 to node_2
node_2 is done recovering, moves its shard to IndexShardState.POST_RECOVERY and sends a message to master that the shard is ShardRoutingState.STARTED
Cluster state 1:
node_1: [index] RELOCATING (ShardRoutingState), (STARTED from IndexShardState perspective on node_1)
node_2: [index] INITIALIZING (ShardRoutingState), (at this point already POST_RECOVERY from IndexShardState perspective on node_2)
master receives shard started and updates cluster state to:
Cluster state 2:
node_1: [index] no shard
node_2: [index] STARTED (ShardRoutingState), (at this point still in POST_RECOVERY from IndexShardState perspective on node_2)
master sends this to node_1 and node_2
node_1 receives the new cluster state and removes its shard because it is not allocated on node_1 anymore
index a document
At this point node_1 is already on cluster state 2 and no longer has the shard, so it forwards the request to node_2. But node_2 is behind on cluster state processing: it is still on cluster state 1, so it has the shard in IndexShardState.POST_RECOVERY and thinks node_1 holds the primary. It therefore sends the request back to node_1. This goes on until either node_2 catches up and processes cluster state 2, or both nodes run out of memory (OOM).
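The back-and-forth can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not Elasticsearch code: `Node`, `PRIMARY_BY_STATE`, `index_document`, `max_hops` and `node_2_catches_up_after` are hypothetical names invented for the example, and each node simply routes the request to whichever node holds the primary according to the last cluster state it has applied.

```python
from dataclasses import dataclass

# Hypothetical, simplified model: for each cluster state version, the node that
# holds the primary according to that state (state 1: node_1, state 2: node_2).
PRIMARY_BY_STATE = {1: "node_1", 2: "node_2"}


@dataclass
class Node:
    name: str
    applied_state: int  # latest cluster state version this node has processed

    def route(self) -> str:
        """Return the node that this node believes holds the primary."""
        return PRIMARY_BY_STATE[self.applied_state]


def index_document(nodes, start, max_hops=10, node_2_catches_up_after=None):
    """Forward an indexing request until a node accepts it or max_hops is hit.

    Returns (name of the accepting node or None, number of forwards).
    max_hops exists only so the simulation terminates; the scenario described
    above has no such limit.
    """
    current = start
    hops = 0
    while hops < max_hops:
        # Optionally let node_2 process cluster state 2 after some forwards.
        if node_2_catches_up_after is not None and hops >= node_2_catches_up_after:
            nodes["node_2"].applied_state = 2
        target = nodes[current].route()
        if target == current:
            return current, hops  # this node believes it is the primary
        current = target  # forward the request to the believed primary
        hops += 1
    return None, hops


# node_1 has applied cluster state 2, node_2 is still on cluster state 1.
stale = {"node_1": Node("node_1", 2), "node_2": Node("node_2", 1)}
print(index_document(stale, "node_1"))  # never accepted within max_hops

# Same start, but node_2 applies cluster state 2 after three forwards.
catching_up = {"node_1": Node("node_1", 2), "node_2": Node("node_2", 1)}
print(index_document(catching_up, "node_1", node_2_catches_up_after=3))
```

In the first call the request is never accepted, because each node keeps pointing at the other. In the second call the loop ends as soon as node_2 has applied cluster state 2 and recognizes itself as the primary.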
I will make a pull request with a test shortly
Here is a test that reproduces this: #12574
I think this will be closed by #15900
I've opened #16274 to address this issue.
Prevent TransportReplicationAction to route request based on stale local routing table
Update resiliency page
#14252, #7572, #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.
Update resiliency page (#17586)