
Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries #19287

Merged · 1 commit · Jul 19, 2016

Commits on Jul 19, 2016

  1. Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries
    
    Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries,
    ultimately leading to documents not being properly replicated.
    
    Invariant 1: Document writes must be replicated based on the routing table of a cluster state that includes all shards which
    have ongoing or finished recoveries. This is ensured by the fact that we do not start a recovery that is not reflected in the
    cluster state available on the primary node, and that we always sample a fresh cluster state before starting to replicate
    write operations.
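
    As an illustration only (the names below are not Elasticsearch classes), invariant 1 amounts to re-sampling the routing
    table at the moment of replication and fanning the write out to every copy it lists:

    ```java
    // Self-contained sketch of invariant 1 (illustrative names, not actual
    // Elasticsearch code): sample a fresh cluster state immediately before
    // replicating, so every recovery started so far is included in the fan-out.
    import java.util.List;

    final class Invariant1Sketch {
        record ShardCopy(String nodeId, boolean initializing) {}
        interface ClusterStateSource { List<ShardCopy> activeAndInitializingCopies(); }
        interface Transport { void sendWrite(String nodeId, String operation); }

        static void replicate(String operation, ClusterStateSource clusterState, Transport transport) {
            // Any recovery that has started is reflected in this freshly
            // sampled routing table, so no shard copy is skipped.
            for (ShardCopy copy : clusterState.activeAndInitializingCopies()) {
                transport.sendWrite(copy.nodeId(), operation);
            }
        }
    }
    ```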
    
    Invariant 2: Every operation that is not part of the snapshot taken for phase 2 must be successfully indexed on the target
    replica (pending shard-level errors, which will cause the target shard to be failed). To ensure this, we start replicating to
    the target shard as soon as the recovery starts and open its engine before we take the snapshot. All operations that are
    indexed after the snapshot was taken are thus guaranteed to arrive at the shard when it is ready to index them. Note that
    this also means that replication does not fail a shard that is not yet ready to receive operations - that is a normal state
    for a recovering shard.
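
    A minimal, self-contained sketch of that "not yet ready is normal" behaviour (illustrative names, not the actual
    Elasticsearch types):

    ```java
    // Sketch: a replicated write arriving before the recovering shard has opened
    // its engine is a benign, retryable condition rather than a shard failure.
    final class RecoveringShardSketch {
        static final class EngineNotReadyException extends RuntimeException {}

        private volatile boolean engineOpen = false;

        // Called early in recovery, before the phase-2 translog snapshot is taken.
        void openEngine() { engineOpen = true; }

        void applyReplicatedOperation(String operation) {
            if (!engineOpen) {
                // The primary treats this as "try again later", not as a failure.
                throw new EngineNotReadyException();
            }
            // ... index the operation ...
        }
    }
    ```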
    
    With primary relocations, both invariants can be violated. Consider a primary relocating while another replica shard is
    recovering from that primary.
    
    Invariant 1 can be violated if the target of the primary relocation lags so far behind on cluster state processing that it
    does not yet know about the new initializing replica. This is very rare in practice, as replica recoveries take time to copy
    all the index files, but it is a theoretical gap that surfaces in testing scenarios.
    
    Invariant 2 can be violated even if the target primary knows about the initializing replica. This can happen if the target
    primary replicates an operation to the initializing shard, and that operation arrives at the initializing shard before it
    opens its engine but arrives at the primary relocation source only after it has taken the snapshot of the translog. Such
    operations are currently missed on the new initializing replica.
    
    The fix to reestablish invariant 1 is to ensure that the primary relocation target has a cluster state that includes all
    replica recoveries that were successfully started on the primary relocation source. The fix to reestablish invariant 2 is to
    check, after opening the engine on the replica, whether the primary has been relocated in the meantime, and if so to fail the
    recovery (see the sketch below).
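
    A sketch of that invariant-2 check, assuming a fresh cluster state can be sampled after the engine opens (illustrative
    names, not the code touched by this commit):

    ```java
    // Sketch of the invariant-2 fix: after the recovery target opens its engine,
    // re-check whether the primary moved; if so, fail (and retry) the recovery
    // instead of risking that operations replicated by the old primary are missed.
    final class PostEngineOpenCheckSketch {
        record ClusterState(String primaryNodeId) {}

        static final class RetryRecoveryException extends RuntimeException {
            RetryRecoveryException(String message) { super(message); }
        }

        static void ensurePrimaryUnchanged(String recoverySourceNodeId, ClusterState freshState) {
            if (!freshState.primaryNodeId().equals(recoverySourceNodeId)) {
                throw new RetryRecoveryException(
                    "primary relocated while recovery was ongoing; failing recovery so it restarts");
            }
        }
    }
    ```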
    ywelsch committed Jul 19, 2016 · d7f99a4