Use index for peer recovery instead of translog #45136

Merged: 48 commits, Aug 2, 2019

Commits on Jun 19, 2019

  1. Create peer-recovery retention leases (elastic#43190)

    This creates a peer-recovery retention lease for every shard during recovery,
    ensuring that the replication group retains history for future peer recoveries.
    It also ensures that leases for active shard copies do not expire, and leases
    for inactive shard copies expire immediately if the shard is fully-allocated.
    
    Relates elastic#41536
    DaveCTurner committed Jun 19, 2019
    2ec1483

Commits on Jun 21, 2019

  1. dfa22bc

Commits on Jun 24, 2019

  1. f68fac4

Commits on Jun 26, 2019

  1. cb39840
  2. f5fdb75

Commits on Jun 28, 2019

  1. 00145cd

Commits on Jul 1, 2019

  1. cb6b0a9
  2. b328478
  3. Less sync

    DaveCTurner committed Jul 1, 2019
    7f7f84b
  4. 6bac16a
  5. Better test fix

    DaveCTurner committed Jul 1, 2019
    9941eb6
  6. Checkstyle

    DaveCTurner committed Jul 1, 2019
    f3fbb33
  7. Advance PRRLs to match GCP of tracked shards (elastic#43751)

    This commit adjusts the behaviour of the retention lease sync to first renew
    any peer-recovery retention leases where either:
    
    - the corresponding shard's global checkpoint has advanced, or
    
    - the lease is older than half of its expiry time
    
    Relates elastic#41536
    DaveCTurner committed Jul 1, 2019
    fbc4477
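
The two renewal conditions described in elastic#43751 can be sketched as a small predicate. This is a minimal Python illustration, not the Elasticsearch implementation: the `Lease` class, its field names, and the convention that a lease retains everything above the copy's global checkpoint (`global_checkpoint + 1`) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    # Hypothetical model of a peer-recovery retention lease.
    retaining_seq_no: int  # lowest sequence number the lease keeps
    timestamp_ms: int      # when the lease was last renewed


def should_renew(lease, global_checkpoint, now_ms, expiry_ms):
    """Return True if either renewal condition from the commit message holds."""
    # Assumption: the lease should retain operations above the global checkpoint.
    checkpoint_advanced = global_checkpoint + 1 > lease.retaining_seq_no
    # The lease is older than half of its expiry time.
    older_than_half_expiry = now_ms - lease.timestamp_ms > expiry_ms // 2
    return checkpoint_advanced or older_than_half_expiry
```

Renewing at half the expiry time leaves a margin before the lease could expire, even when the global checkpoint is not moving.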

Commits on Jul 3, 2019

  1. b1be151

Commits on Jul 4, 2019

  1. 291ff8d
  2. Remove PRRLs before performing file-based recovery (elastic#43928)

    If the primary performs a file-based recovery to a node that has (or recently
    had) a copy of the shard then it is possible that the persisted global
    checkpoint of the new copy is behind that of the old copy since file-based
    recoveries are somewhat destructive operations.
    
    Today we leave that node's PRRL in place during the recovery with the
    expectation that it can be used by the new copy. However this isn't the case if
    the new copy needs more history to be retained, because retention leases may
    only advance and never retreat.
    
    This commit addresses this by removing any existing PRRL during a file-based
    recovery: since we are performing a file-based recovery we have already
    determined that there isn't enough history for an ops-based recovery, so there
    is little point in keeping the old lease in place.
    
    Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14)
    
    Relates elastic#41536
    DaveCTurner committed Jul 4, 2019
    389c625
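
The reasoning in elastic#43928 rests on one invariant: a lease may only advance, so an existing lease cannot be moved back to retain more history. The sketch below models that invariant and the remove-then-recreate workaround. `LeaseTable` and `replace_lease_for_file_based_recovery` are hypothetical names, and the point at which the new lease is created is simplified compared with a real recovery.

```python
class LeaseTable:
    """Hypothetical table of leases keyed by node id."""

    def __init__(self):
        self._leases = {}  # node id -> retaining sequence number

    def add(self, node_id, retaining_seq_no):
        if node_id in self._leases:
            raise ValueError("lease already exists for " + node_id)
        self._leases[node_id] = retaining_seq_no

    def renew(self, node_id, retaining_seq_no):
        # Retention leases may only advance and never retreat.
        if retaining_seq_no < self._leases[node_id]:
            raise ValueError("retention leases may only advance")
        self._leases[node_id] = retaining_seq_no

    def remove(self, node_id):
        self._leases.pop(node_id, None)

    def get(self, node_id):
        return self._leases.get(node_id)


def replace_lease_for_file_based_recovery(table, node_id, retaining_seq_no):
    # Drop the stale lease first, then create a fresh one that may sit
    # behind the old lease's retaining sequence number.
    table.remove(node_id)
    table.add(node_id, retaining_seq_no)
```

With an old lease at sequence number 100, `renew(node_id, 50)` is rejected, whereas removing the lease and adding a new one at 50 succeeds.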

Commits on Jul 5, 2019

  1. Update BWC version for PRRLs (elastic#43958)

    This commit updates the version in which PRRLs are expected to exist to 7.4.0.
    DaveCTurner committed Jul 5, 2019
    ac2da33
  2. Return recovery to generic thread post-PRRL action (elastic#44000)

    Today we perform `TransportReplicationAction` derivatives during recovery, and
    these actions call their response handlers on the transport thread. This change
    moves the continued execution of the recovery back onto the generic threadpool.
    DaveCTurner committed Jul 5, 2019
    d016e79
  3. 76ff6e8
  4. Skip PRRL renewal on UNASSIGNED_SEQ_NO (elastic#44019)

    Today when renewing PRRLs we assert that any invalid "backwards" renewals must
    be because we are recovering the shard. In fact it's also possible to have
    `checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a
    tracked shard copy if the primary was just promoted and hasn't yet received
    checkpoints from all of its peers.
    
    This commit weakens the assertion to match.
    
    Caught by a [failure of the full cluster restart
    tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605)
    
    Relates elastic#41536
    DaveCTurner committed Jul 5, 2019
    da3c901
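
The case handled by elastic#44019 can be illustrated with a small helper that leaves a lease untouched while a copy's global checkpoint is still unknown. This is a hedged Python sketch rather than the actual assertion logic: the function name is hypothetical, and the rule that a renewed lease retains `global_checkpoint + 1` is an assumption. The value -2 for `UNASSIGNED_SEQ_NO` mirrors `SequenceNumbers.UNASSIGNED_SEQ_NO`.

```python
UNASSIGNED_SEQ_NO = -2  # mirrors SequenceNumbers.UNASSIGNED_SEQ_NO


def renewed_retaining_seq_no(current_retaining_seq_no, global_checkpoint):
    """Return the retaining sequence number after an attempted renewal."""
    if global_checkpoint == UNASSIGNED_SEQ_NO:
        # The newly promoted primary has not heard from this copy yet,
        # so there is nothing to renew against: keep the lease as it is.
        return current_retaining_seq_no
    # Assumption: never move a lease backwards, only forwards.
    return max(current_retaining_seq_no, global_checkpoint + 1)
```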

Commits on Jul 8, 2019

  1. Only call assertNotTransportThread if asserts on (elastic#44028)

    In elastic#44000 we introduced some calls to `assertNotTransportThread` that are
    executed whether assertions are enabled or not. Although they have no effect if
    assertions are disabled, we should have done it like this instead.
    DaveCTurner committed Jul 8, 2019
    c5ed201
  2. 9523445
  3. Create missing PRRLs after primary activation (elastic#44009)

    Today peer recovery retention leases (PRRLs) are created when starting a
    replication group from scratch and during peer recovery. However, if the
    replication group was migrated from nodes running a version which does not
    create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was
    relocated or promoted without first establishing all the expected leases.
    
    It's not possible to establish these leases before or during primary
    activation, so we must create them as soon as possible afterwards. This gives
    weaker guarantees about history retention, since there's a possibility that
    history will be discarded before it can be used. In practice such situations
    are expected to occur only rarely.
    
    This commit adds the machinery to create missing leases after primary
    activation, and strengthens the assertions about the existence of such leases
    in order to ensure that once all the leases do exist we never again enter a
    state where there's a missing lease.
    
    Relates elastic#41536
    DaveCTurner committed Jul 8, 2019
    bea2627
  4. Reduce number of replicas in cluster restart test

    The cluster in the full-cluster restart test only has 2 nodes, so we cannot
    fully allocate an index with 2 replicas.
    DaveCTurner committed Jul 8, 2019
    11e9880
  5. Only create missing PRRLs when appropriate

    Today PRRLs are not supported on closed indices or indices where soft deletes
    are disabled, but (confusingly) nor are they actively forbidden. This commit
    avoids creating them unnecessarily in unsupported situations.
    DaveCTurner committed Jul 8, 2019
    d7f7ebc
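
Taken together, elastic#44009 and the "Only create missing PRRLs when appropriate" commit above suggest a simple selection rule for which leases to create after primary activation. The following Python sketch assumes hypothetical inputs (sets of node ids and two boolean index properties) and does not reflect the real method signatures.

```python
def missing_lease_node_ids(tracked_node_ids, existing_lease_node_ids,
                           index_open, soft_deletes_enabled):
    """Return the node ids that still need a peer-recovery retention lease."""
    # PRRLs are not supported on closed indices or without soft deletes,
    # so nothing is created in those situations.
    if not (index_open and soft_deletes_enabled):
        return []
    # Every tracked copy without a lease gets one created.
    return sorted(n for n in tracked_node_ids if n not in existing_lease_node_ids)
```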

Commits on Jul 9, 2019

  1. Fix comment

    DaveCTurner committed Jul 9, 2019
    ba7c4be

Commits on Jul 11, 2019

  1. e12bde6

Commits on Jul 15, 2019

  1. b8bcc0b

Commits on Jul 20, 2019

  1. 40ea029

Commits on Jul 23, 2019

  1. 69c94f4
  2. Use global checkpoint as starting seq in ops-based recovery (elastic#43463)

    Today we use the local checkpoint of the safe commit on replicas as the
    starting sequence number of operation-based peer recovery. While this is
    a good choice due to its simplicity, we need to share this information
    between copies if we use retention leases in peer recovery. We can avoid
    this extra work if we use the global checkpoint as the starting sequence
    number.
    
    With this change, we will try to recover the replica locally up to the
    global checkpoint before performing peer recovery. This commit should
    also increase the chance of operation-based recovery.
    dnhatn committed Jul 23, 2019
    d15684d
  3. 06d9be6
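
The local recovery step from elastic#43463 above can be pictured as replaying locally held operations from the safe commit up to the global checkpoint, then asking the primary only for what follows. The Python sketch below is a simplified model: `recover_locally` and its arguments are hypothetical, and stopping at the first missing operation is an assumption made for the example.

```python
def recover_locally(safe_commit_local_checkpoint, global_checkpoint, local_ops):
    """Replay contiguous local operations up to the global checkpoint.

    Returns the replayed sequence numbers and the starting sequence
    number to request from the primary.
    """
    local_checkpoint = safe_commit_local_checkpoint
    replayed = []
    # Advance one operation at a time while the next one is held locally
    # and does not go beyond the global checkpoint.
    while local_checkpoint < global_checkpoint and (local_checkpoint + 1) in local_ops:
        local_checkpoint += 1
        replayed.append(local_checkpoint)
    return replayed, local_checkpoint + 1
```

For a safe commit at local checkpoint 5, a global checkpoint of 8, and local operations 6 to 10, the replica replays 6, 7 and 8 and then requests operations from 9 onwards. Operations 9 and 10 are above the global checkpoint and are not replayed locally.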

Commits on Jul 24, 2019

  1. Do not load global checkpoint to ReplicationTracker in local recovery step (elastic#44781)

    If we force allocate an empty or stale primary, the global checkpoint on
    replicas might be higher than the primary's as the local recovery step
    (introduced in elastic#43463) loads the previous (stale) global checkpoint into
    ReplicationTracker. There's no issue with the retention leases, because a new
    lease with a higher term will supersede the stale one.
    
    Relates elastic#43463
    dnhatn committed Jul 24, 2019
    6275cd7

Commits on Jul 25, 2019

  1. 96dd543

Commits on Jul 29, 2019

  1. 417a2ac
  2. b5c897c
  3. 446ebf0

Commits on Jul 30, 2019

  1. 7a247e5
  2. Skip local recovery for closed or frozen indices (elastic#44887)

    For closed and frozen indices, we should not recover the shard locally up to
    the global checkpoint before performing peer recovery, because that copy
    might have been offline when the index was closed/frozen.
    
    Relates elastic#43463
    Closes elastic#44855
    dnhatn committed Jul 30, 2019
    907bb55

Commits on Jul 31, 2019

  1. 0b066d3
  2. 2bae406

Commits on Aug 1, 2019

  1. 6960cf7
  2. Recover peers using history from Lucene (elastic#44853)

    Thanks to peer recovery retention leases we now retain the history needed to
    perform peer recoveries from the index instead of from the translog. This
    commit adjusts the peer recovery process to do so, and also adjusts it to use
    the existence of a retention lease to decide whether or not to attempt an
    operations-based recovery.
    
    Reverts elastic#38904 and elastic#42211
    Relates elastic#41536
    DaveCTurner committed Aug 1, 2019
    5322b00
  3. Reset starting seqno if fail to read last commit (elastic#45106)

    Previously, if the metadata snapshot was empty (either no commit was found
    or an error occurred), we did not compute the starting sequence number and
    used -2 to opt out of operation-based recovery. With elastic#43463, we have
    a starting sequence number before reading the last commit. Thus, we need to
    reset it if we fail to snapshot the store.
    
    Closes elastic#45072
    dnhatn committed Aug 1, 2019
    77720e8
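
The last two commit messages above (elastic#44853 and elastic#45106) interact: the target resets its starting sequence number to -2 when it cannot read its last commit, and the source uses the existence of a retention lease to decide whether to attempt an operations-based recovery. The Python sketch below models both sides under stated assumptions. The function names are hypothetical, `read_last_commit` stands in for snapshotting the store, and the comparison between the lease's retaining sequence number and the starting sequence number is an assumed simplification of the real check.

```python
UNASSIGNED_SEQ_NO = -2  # the "-2" mentioned in the commit message


def starting_seq_no_for_request(locally_recovered_start, read_last_commit):
    """Target side: reset the starting seqno if the last commit is unreadable."""
    try:
        metadata = read_last_commit()
    except OSError:
        metadata = None  # treat a read error like a missing commit
    if not metadata:
        return UNASSIGNED_SEQ_NO
    return locally_recovered_start


def can_recover_from_operations(starting_seq_no, lease_retaining_seq_no):
    """Source side: attempt ops-based recovery only with a usable lease."""
    if starting_seq_no == UNASSIGNED_SEQ_NO:
        return False  # the target opted out
    if lease_retaining_seq_no is None:
        return False  # no lease, so history is not guaranteed to be retained
    # Assumption: the lease must retain history from the starting seqno on.
    return lease_retaining_seq_no <= starting_seq_no
```

When operations-based recovery is ruled out, the remaining option is a file-based recovery, which is why the reset matters even though a starting sequence number was already computed.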

Commits on Aug 2, 2019

  1. 51778da
  2. aea938b
  3. 09ae1e6
  4. Remove stray file

    DaveCTurner committed Aug 2, 2019
    89b6a3b