…cation with ongoing replica recoveries
Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries,
ultimately leading to documents not being properly replicated.
Invariant 1: Document writes must be replicated based on the routing table of a cluster state that includes all shards which
have ongoing or finished recoveries. This is ensured by the fact that do not start a recovery that is not reflected by the
cluster state available on the primary node and we always sample a fresh cluster state before starting to replicate write
operations.
Invariant 2: Every operation that is not part of the snapshot taken for phase 2, must be succesfully indexed on the target
replica (pending shard level errors which will cause the target shard to be failed). To ensure this, we start replicating to
the target shard as soon as the recovery start and open it's engine before we take the snapshot. All operations that are
indexed after the snapshot was taken are guaranteed to arrive to the shard when it's ready to index them. Note that this also
means that the replication doesn't fail a shard if it's not yet ready to recieve operations - it's a normal part of a
recovering shard.
With primary relocations, the two invariants can be possibly violated. Let's consider a primary
relocating while there is another replica shard recovering from the primary shard.
Invariant 1 can be violated if the target of the primary relocation is so lagging on cluster state processing that it doesn't
even know about the new initializing replica. This is very rare in practice as replica recoveries take time to copy all the
index files but it is a theoretical gap that surfaces in testing scenarios.
Invariant 2 can be violated even if the target primary knows about the initializing replica. This can happen if the target
primary replicates an operation to the intializing shard and that operation arrives to the initializing shard before it opens
it's engine but arrives to the primary source after it has taken the snapshot of the translog. Those operations will be
currently missed on the new initializing replica.
The fix to reestablish invariant 1 is to ensure that the primary relocation target has a cluster state with all replica
recoveries that were successfully started on primary relocation source. The fix to reestablish invariant 2 is to check after
opening engine on the replica if the primary has been relocated in the meanwhile and fail the recovery.