
Retain history for peer recovery using leases #39133

Conversation

DaveCTurner
Contributor

Today if soft deletes are enabled then we discard history below the global
checkpoint, damaging our chances of being able to perform an operations-based
peer recovery. This change has each shard copy obtain a history retention lease
to retain history that is not included in its local safe commit, making an
operations-based peer recovery much more likely.

@DaveCTurner DaveCTurner added >bug WIP :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v8.0.0 v7.2.0 labels Feb 19, 2019
@DaveCTurner
Contributor Author

DaveCTurner commented Feb 19, 2019

This PR is WIP and represents the current state of my proof-of-concept. I do not anticipate merging it as-is since it is too large, but I would like to gather feedback to be sure I'm heading in the right direction here. I would particularly like opinions about the changes to RecoverySourceHandler.

The goal is that we can perform an operations-based recovery for all "reasonable" shard copies C:

  • There is a peer recovery retention lease L corresponding with C.
  • Every in-sync shard copy has a complete history of operations above the retained seqno of L.
  • The retained seqno r of L is no greater than the local checkpoint of the last safe commit of C.

Reasonable shard copies comprise all the copies that are currently being tracked, as well as all the copies that "might be a recovery target": if the shard is not fully allocated then any copy that has been tracked within the last index.soft_deletes.retention_lease.period (12h by default) might reasonably be a recovery target.
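
To make conditions 1 and 3 concrete, here is a minimal sketch of the predicate that would decide whether a copy is eligible for an operations-based recovery. The types and names are toy stand-ins, not the real IndexShard/ReplicationTracker code:

```java
// Hedged illustration only: Lease and ShardCopy stand in for the real classes.
final class OpsBasedRecoveryCheck {

    record Lease(String leaseId, long retainedSeqNo) {}

    record ShardCopy(String allocationId, long localCheckpointOfSafeCommit) {}

    /**
     * Condition 1: a peer-recovery retention lease L exists for the copy.
     * Condition 3: the retained seqno r of L is no greater than the copy's LCPoSC,
     * so the retained history joins up with the copy's safe commit.
     * Condition 2 (every in-sync copy keeps history above r) is what the lease
     * guarantees on the source side, so it is not re-checked here.
     */
    static boolean opsBasedRecoveryPossible(ShardCopy copy, Lease leaseOrNull) {
        return leaseOrNull != null
            && leaseOrNull.retainedSeqNo() <= copy.localCheckpointOfSafeCommit();
    }
}
```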

We also require that history is eventually released: in a stable cluster, for every operation with seqno s below the MSN of a replication group, eventually there are no leases that retain s:

  • Every active shard copy eventually advances its LCPoSC past s.
  • Every lease for an active shard copy eventually also passes s.
  • Every inactive shard copy eventually either becomes active or else its lease expires.

There are a few conceptually-separate pieces combined into this one rather large change:

  • Create peer recovery retention leases

    • For primary, on primary activation
    • For replicas, during peer recovery
  • Periodically renew peer-recovery retention leases

    • Include the local checkpoint of the safe commit (LCPoSC) in TransportReplicationAction responses
    • Add a scheduled task that runs an action to collect the LCPoSC on all peers if needed
  • Make peer recovery work together with retention leases

    • Use the existence of a retention lease as the deciding factor for performing an ops-based recovery
    • Reinstate recovery from history stored in Lucene if soft deletes are enabled

Follow-up work, not included here:

  • Adjust translog retention: should we retain translog generations according to retention leases too?

  • Improve integration with test framework

    • Add utility methods that wait for the peer-recovery retention leases to converge to their final state and release unnecessary history, for use in tests that assert things about history being released.
  • Make the ReplicaShardAllocator sensitive to leases, so that it prefers to select a location for each replica that only needs an ops-based recovery.

  • Seqno-based synced flush: if a copy has LCP == MSN then it needs no recovery.


BWC issues: during a rolling upgrade, we may migrate a primary onto a new node without first establishing the appropriate leases. They can't be established before or during this promotion, so we must weaken the assertions to apply only to sufficiently-newly-created indices, and weaken our requirements so that ops-based peer recovery is best-effort on older indices. We will still establish leases properly during peer recovery, and can establish them lazily on older indices, but they may not retain all the right history when first created.

Closed replicated indices issues: a closed index permits no replicated actions, but should not need any history to be retained. We cannot replay history into a closed index, so all recoveries must be file-based and there's no real need for leases. We cannot assert that all the copies of a replicated closed index have a corresponding lease. Maybe we should remove any peer-recovery retention leases when closing an index, or maybe we should let them expire naturally.


Plan:

  • Add LCPoSC to TransportReplicationAction responses and record it in CheckpointState
  • Schedule requests for LCPoSC updates whenever a CheckpointState hasn't caught up to the MSN
  • Create peer recovery retention leases as appropriate, and renew them according to the corresponding CheckpointState (see the sketch after this list).
  • Adjust peer recovery to work with retention leases
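
For the renewal steps above, a rough sketch of the intended behaviour (the names here are assumptions; the real plumbing belongs in ReplicationTracker and the TransportReplicationAction responses): the primary records the LCPoSC reported by each copy and periodically advances that copy's lease towards it, never moving it backwards.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; LeaseStore and the "peer_recovery/<allocation id>"
// lease id scheme are assumptions, not the actual API.
final class PeerRecoveryLeaseRenewer {

    interface LeaseStore {
        long retainedSeqNo(String leaseId);
        void renew(String leaseId, long newRetainedSeqNo);
    }

    /** Latest LCPoSC reported by each tracked copy, e.g. piggy-backed on replication responses. */
    private final Map<String, Long> lcpOfSafeCommitByAllocationId = new ConcurrentHashMap<>();

    private final LeaseStore leases;

    PeerRecoveryLeaseRenewer(LeaseStore leases) {
        this.leases = leases;
    }

    /** Called when a replica response (or the scheduled collection action) reports an LCPoSC. */
    void onLocalCheckpointOfSafeCommit(String allocationId, long lcpOfSafeCommit) {
        lcpOfSafeCommitByAllocationId.merge(allocationId, lcpOfSafeCommit, Long::max);
    }

    /** Runs on a schedule; advances each copy's lease up to its reported LCPoSC. */
    void renewLeases() {
        lcpOfSafeCommitByAllocationId.forEach((allocationId, lcp) -> {
            String leaseId = "peer_recovery/" + allocationId;
            if (leases.retainedSeqNo(leaseId) < lcp) {
                leases.renew(leaseId, lcp); // the retained seqno never moves backwards
            }
        });
    }
}
```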

Open questions:

  • If, during peer recovery, the recovery target were to recover as far as the global checkpoint using its own local history, then the peer-recovery retention leases would only need to depend on the global checkpoint and not the LCPoSC. Since we already share the global checkpoint, a bunch of plumbing would go away (see the sketch below).
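
A rough sketch of that alternative, with hypothetical method names (this is not how the recovery code is structured today): the target first replays its own translog up to the global checkpoint, and only then asks the source for the operations above it, so the lease only needs to retain history above the global checkpoint.

```java
// Hedged sketch of the "recover locally up to the GCP first" idea; RecoveryTarget,
// replayLocalTranslogUpTo and requestOperationsFrom are hypothetical names.
interface RecoveryTarget {
    long globalCheckpoint();                        // already shared with the primary
    long replayLocalTranslogUpTo(long seqNo);       // returns the seqno actually reached
    void requestOperationsFrom(long startingSeqNo); // ops-based phase from the source
}

final class LocalFirstRecovery {
    /**
     * Recover as far as possible from the target's own history, then fetch the rest.
     * With this ordering the peer-recovery lease only has to retain operations above
     * the global checkpoint rather than above the LCPoSC.
     */
    static void recover(RecoveryTarget target) {
        long gcp = target.globalCheckpoint();
        long recoveredUpTo = target.replayLocalTranslogUpTo(gcp);
        target.requestOperationsFrom(recoveredUpTo + 1);
    }
}
```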

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Member

dnhatn commented Feb 23, 2019

Thanks @DaveCTurner.

I have some points to discuss on the approach.

  1. I think we should use the retention leases API (and its transport actions) in peer recovery rather than integrating it into ReplicationTracker. We can extend the APIs for our needs.

  2. We should expire the peer-recovery retention leases. We should not prevent a replica from merging away deletes after replicas have been offline for a long period. It's perfectly fine for those replicas to copy segment files. Copying Lucene segment files is less expensive than replaying a large number of operations.

  3. We can use the global checkpoint as the retaining sequence number instead of the local checkpoint of the safe commit. We can do this by making a recovering replica start with the global checkpoint and recover its local translog up to the global checkpoint before accepting translog from the primary.

  4. With 3, can we install a retention lease only when a copy leaves the ReplicationGroup?

@DaveCTurner
Contributor Author

Hi @dnhatn

  1. I think we should use the retention leases API (and its transport actions) in peer recovery rather than integrating it into ReplicationTracker. We can extend the APIs for our needs.

I don't fully understand what you mean here, although moving functionality around is definitely possible in principle. Are you suggesting getting/adding the peer's history retention lease via transport actions rather than by talking directly to the primary IndexShard during peer recovery? Can you explain the advantage of that in more detail?

  2. We should expire the peer-recovery retention leases. We should not prevent a replica from merging away deletes after replicas have been offline for a long period. It's perfectly fine for those replicas to copy segment files. Copying Lucene segment files is less expensive than replaying a large number of operations.

I think it's important that we don't expire the retention leases of active shards, even if something weird like a clock discontinuity occurs, but the leases corresponding to inactive shards are already treated like any other retention lease and expire after index.soft_deletes.retention_lease.period. Perhaps I have misunderstood your point?

  3. We can use the global checkpoint as the retaining sequence number instead of the local checkpoint of the safe commit. We can do this by making a recovering replica start with the global checkpoint and recover its local translog up to the global checkpoint before accepting translog from the primary.

I think you're right (but I need to think about this more to be certain). However if I understand correctly this is not what we do today, right? On the positive side we already share the global checkpoint, so there would be no need for the periodic task introduced here. However we still need to retain history from the local checkpoint of the local shard copy's safe commit (I was hoping to be able to express history retention entirely in terms of leases) and I think we might want to do more initial coordination between source and target to establish whether the recovery will be file-based or operations-based before expending effort on replaying the local translog.

  4. With 3, can we install a retention lease only when a copy leaves the ReplicationGroup?

I don't think so, because then there would be no peer recovery retention lease after a cluster-wide disaster (e.g. power outage) with the consequence that many peer recoveries after the outage would have to fall back to copying files.

Today this test catches an exception and asserts that its proximate cause has
message `Random IOException` but occasionally this exception is wrapped two
layers deep, causing the test to fail. This commit adjusts the test to look at
the root cause of the exception instead.

      1> [2019-02-25T12:31:50,837][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotFileFailureDuringSnapshot] --> caught a top level exception, asserting what's expected
      1> org.elasticsearch.snapshots.SnapshotException: [test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] Snapshot could not be read
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:212) ~[main/:?]
      1> 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:135) ~[main/:?]
      1> 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:54) ~[main/:?]
      1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:127) ~[main/:?]
      1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:208) ~[main/:?]
      1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
      1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
      1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
      1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
      1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
      1> Caused by: org.elasticsearch.snapshots.SnapshotException: [test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] failed to get snapshots
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:564) ~[main/:?]
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:206) ~[main/:?]
      1> 	... 9 more
      1> Caused by: java.io.IOException: Random IOException
      1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.maybeIOExceptionOrBlock(MockRepository.java:275) ~[test/:?]
      1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.readBlob(MockRepository.java:317) ~[test/:?]
      1> 	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:101) ~[main/:?]
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreFormat.read(BlobStoreFormat.java:90) ~[main/:?]
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:560) ~[main/:?]
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:206) ~[main/:?]
      1> 	... 9 more

    FAILURE 0.59s J0 | SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot <<< FAILURES!
       > Throwable elastic#1: java.lang.AssertionError:
       > Expected: a string containing "Random IOException"
       >      but: was "[test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] failed to get snapshots"
       > 	at __randomizedtesting.SeedInfo.seed([B73CA847D4B4F52D:884E042D2D899330]:0)
       > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
       > 	at org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot(SharedClusterSnapshotRestoreIT.java:821)
       > 	at java.lang.Thread.run(Thread.java:748)
In a rolling upgrade from a version that pre-dates the introduction of
peer-recovery retention leases, eventually the primary lands on an upgraded
node. If this was a relocation or a promotion then it inherits the replicas of
the previous primary without performing peer recoveries, so there may not be
peer recovery retention leases for the other shard copies.

This change fixes this by adding leases as-needed on older indices.
@ywelsch ywelsch removed the v7.0.0 label Apr 4, 2019
@DaveCTurner DaveCTurner self-assigned this Apr 16, 2019
@DaveCTurner
Contributor Author

Closing in favour of meta-issue #41536

@DaveCTurner DaveCTurner deleted the 2019-02-19-soft-deletes-history-retention branch April 25, 2019 14:32
@DaveCTurner DaveCTurner restored the 2019-02-19-soft-deletes-history-retention branch April 25, 2019 14:32
@DaveCTurner DaveCTurner deleted the 2019-02-19-soft-deletes-history-retention branch July 23, 2022 10:44