
Retain history for peer recovery using leases #39133

Conversation

DaveCTurner
Contributor

Today if soft deletes are enabled then we discard history below the global
checkpoint, damaging our chances of being able to perform an operations-based
peer recovery. This change has each shard copy obtain a history retention lease
to retain history that is not included in its local safe commit, making an
operations-based peer recovery much more likely.

@DaveCTurner DaveCTurner added >bug WIP :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v8.0.0 v7.2.0 labels Feb 19, 2019
@DaveCTurner
Contributor Author

DaveCTurner commented Feb 19, 2019

This PR is WIP and represents the current state of my proof-of-concept. I do not anticipate merging it as-is since it is too large, but I would like to gather feedback to be sure I'm heading in the right direction here. I would particularly like opinions about the changes to RecoverySourceHandler.

The goal is that we can perform an operations-based recovery for all "reasonable" shard copies C:

  • There is a peer recovery retention lease L corresponding with C.
  • Every in-sync shard copy has a complete history of operations above the retained seqno of L.
  • The retained seqno r of L is no greater than the local checkpoint of the last safe commit of C.

Reasonable shard copies comprise all the copies that are currently being tracked, as well as all the copies that "might be a recovery target": if the shard is not fully allocated then any copy that has been tracked within the last index.soft_deletes.retention_lease.period (12h by default) might reasonably be a recovery target.
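
To make conditions 1 and 3 concrete, here is a minimal sketch of the predicate that would decide whether a copy is eligible for an operations-based recovery. The types and names are toy stand-ins, not the real IndexShard/ReplicationTracker code:

```java
// Hedged illustration only: Lease and ShardCopy stand in for the real classes.
final class OpsBasedRecoveryCheck {

    record Lease(String leaseId, long retainedSeqNo) {}

    record ShardCopy(String allocationId, long localCheckpointOfSafeCommit) {}

    /**
     * Condition 1: a peer-recovery retention lease L exists for the copy.
     * Condition 3: the retained seqno r of L is no greater than the copy's LCPoSC,
     * so the retained history joins up with the copy's safe commit.
     * Condition 2 (every in-sync copy keeps history above r) is what the lease
     * guarantees on the source side, so it is not re-checked here.
     */
    static boolean opsBasedRecoveryPossible(ShardCopy copy, Lease leaseOrNull) {
        return leaseOrNull != null
            && leaseOrNull.retainedSeqNo() <= copy.localCheckpointOfSafeCommit();
    }
}
```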

We also require that history is eventually released: in a stable cluster, for every operation with seqno s below the MSN of a replication group, eventually there are no leases that retain s:

  • Every active shard copy eventually advances its LCPoSC past s.
  • Every lease for an active shard copy eventually also passes s.
  • Every inactive shard copy eventually either becomes active or else its lease expires.

There are a few conceptually-separate pieces combined into this one rather large change:

  • Create peer recovery retention leases

    • For primary, on primary activation
    • For replicas, during peer recovery
  • Periodically renew peer-recovery retention leases

    • Include the local checkpoint of the safe commit (LCPoSC) in TransportReplicationAction responses
    • Add a scheduled task that runs an action to collect the LCPoSC on all peers if needed
  • Make peer recovery work together with retention leases

    • Use the existence of a retention lease as the deciding factor for performing an ops-based recovery
    • Reinstate recovery from history stored in Lucene if soft deletes are enabled

Follow-up work, not included here:

  • Adjust translog retention: should we retain translog generations according to retention leases too?

  • Improve integration with test framework

    • Add utility methods that wait for the peer-recovery retention leases to converge to their final state and release unnecessary history, for use in tests that assert things about history being released.
  • Make the ReplicaShardAllocator sensitive to leases, so that it prefers to select a location for each replica that only needs an ops-based recovery.

  • Seqno-based synced flush: if a copy has LCP == MSN then it needs no recovery.


BWC issues: during a rolling upgrade, we may migrate a primary onto a new node without first establishing the appropriate leases. They can't be established before or during this promotion, so we must weaken the assertions to apply only to sufficiently-newly-created indices, and weaken our requirements so that ops-based peer recovery is best-effort on older indices. We will still establish leases properly during peer recovery, and can establish them lazily on older indices, but they may not retain all the right history when first created.

Closed replicated indices issues: a closed index permits no replicated actions, but should not need any history to be retained. We cannot replay history into a closed index, so all recoveries must be file-based and there's no real need for leases. We cannot assert that all the copies of a replicated closed index have a corresponding lease. Maybe we should remove any peer-recovery retention leases when closing an index, or maybe we should let them expire naturally.


Plan:

  • Add LCPoSC to TransportReplicationAction responses and record it in CheckpointState
  • Schedule requests for LCPoSC updates whenever a CheckpointState hasn't caught up to the MSN
  • Create peer recovery retention leases as appropriate, and renew them according to the corresponding CheckpointState (see the sketch after this list).
  • Adjust peer recovery to work with retention leases
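
For the renewal steps above, a rough sketch of the intended behaviour (the names here are assumptions; the real plumbing belongs in ReplicationTracker and the TransportReplicationAction responses): the primary records the LCPoSC reported by each copy and periodically advances that copy's lease towards it, never moving it backwards.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; LeaseStore and the "peer_recovery/<allocation id>"
// lease id scheme are assumptions, not the actual API.
final class PeerRecoveryLeaseRenewer {

    interface LeaseStore {
        long retainedSeqNo(String leaseId);
        void renew(String leaseId, long newRetainedSeqNo);
    }

    /** Latest LCPoSC reported by each tracked copy, e.g. piggy-backed on replication responses. */
    private final Map<String, Long> lcpOfSafeCommitByAllocationId = new ConcurrentHashMap<>();

    private final LeaseStore leases;

    PeerRecoveryLeaseRenewer(LeaseStore leases) {
        this.leases = leases;
    }

    /** Called when a replica response (or the scheduled collection action) reports an LCPoSC. */
    void onLocalCheckpointOfSafeCommit(String allocationId, long lcpOfSafeCommit) {
        lcpOfSafeCommitByAllocationId.merge(allocationId, lcpOfSafeCommit, Long::max);
    }

    /** Runs on a schedule; advances each copy's lease up to its reported LCPoSC. */
    void renewLeases() {
        lcpOfSafeCommitByAllocationId.forEach((allocationId, lcp) -> {
            String leaseId = "peer_recovery/" + allocationId;
            if (leases.retainedSeqNo(leaseId) < lcp) {
                leases.renew(leaseId, lcp); // the retained seqno never moves backwards
            }
        });
    }
}
```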

Open questions:

  • If, during peer recovery, the recovery target were to recover as far as the global checkpoint using its own local history, then the peer-recovery retention leases would only need to depend on the global checkpoint and not the LCPoSC. Since we already share the global checkpoint, a bunch of plumbing would go away (see the sketch below).
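
A rough sketch of that alternative, with hypothetical method names (this is not how the recovery code is structured today): the target first replays its own translog up to the global checkpoint, and only then asks the source for the operations above it, so the lease only needs to retain history above the global checkpoint.

```java
// Hedged sketch of the "recover locally up to the GCP first" idea; RecoveryTarget,
// replayLocalTranslogUpTo and requestOperationsFrom are hypothetical names.
interface RecoveryTarget {
    long globalCheckpoint();                        // already shared with the primary
    long replayLocalTranslogUpTo(long seqNo);       // returns the seqno actually reached
    void requestOperationsFrom(long startingSeqNo); // ops-based phase from the source
}

final class LocalFirstRecovery {
    /**
     * Recover as far as possible from the target's own history, then fetch the rest.
     * With this ordering the peer-recovery lease only has to retain operations above
     * the global checkpoint rather than above the LCPoSC.
     */
    static void recover(RecoveryTarget target) {
        long gcp = target.globalCheckpoint();
        long recoveredUpTo = target.replayLocalTranslogUpTo(gcp);
        target.requestOperationsFrom(recoveredUpTo + 1);
    }
}
```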

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Member

dnhatn commented Feb 23, 2019

Thanks @DaveCTurner.

I have some points to discuss on the approach.

  1. I think we should use the retention leases API (and its transport actions) in peer recovery rather than integrating it into ReplicationTracker. We can extend the APIs for our needs.

  2. We should expire the peer-recovery retention leases. We should not prevent a replica from merging away deletes after replicas have been offline for a long period. It's perfectly fine for those replicas to copy segment files. Copying Lucene segment files is less expensive than replaying a large number of operations.

  3. We can use the global checkpoint as the retaining sequence number instead of the local checkpoint of the safe commit. We can do this by making a recovering replica start with the global checkpoint and recover its local translog up to the global checkpoint before accepting translog from the primary.

  4. With 3, can we install a retention lease only when a copy leaves the ReplicationGroup?

@DaveCTurner
Contributor Author

Hi @dnhatn

  1. I think we should use the retention leases API (and its transport actions) in peer recovery rather than integrating it into ReplicationTracker. We can extend the APIs for our needs.

I don't fully understand what you mean here, although moving functionality around is definitely possible in principle. Are you suggesting getting/adding the peer's history retention lease via transport actions rather than by talking directly to the primary IndexShard during peer recovery? Can you explain the advantage of that in more detail?

  2. We should expire the peer-recovery retention leases. We should not prevent a replica from merging away deletes after replicas have been offline for a long period. It's perfectly fine for those replicas to copy segment files. Copying Lucene segment files is less expensive than replaying a large number of operations.

I think it's important that we don't expire the retention leases of active shards, even if something weird like a clock discontinuity occurs, but the leases corresponding to inactive shards are already treated like any other retention lease and expire after index.soft_deletes.retention_lease.period. Perhaps I have misunderstood your point?

  3. We can use the global checkpoint as the retaining sequence number instead of the local checkpoint of the safe commit. We can do this by making a recovering replica start with the global checkpoint and recover its local translog up to the global checkpoint before accepting translog from the primary.

I think you're right (but I need to think about this more to be certain). However if I understand correctly this is not what we do today, right? On the positive side we already share the global checkpoint, so there would be no need for the periodic task introduced here. However we still need to retain history from the local checkpoint of the local shard copy's safe commit (I was hoping to be able to express history retention entirely in terms of leases) and I think we might want to do more initial coordination between source and target to establish whether the recovery will be file-based or operations-based before expending effort on replaying the local translog.

  4. With 3, can we install a retention lease only when a copy leaves the ReplicationGroup?

I don't think so, because then there would be no peer recovery retention lease after a cluster-wide disaster (e.g. power outage) with the consequence that many peer recoveries after the outage would have to fall back to copying files.

Today this test catches an exception and asserts that its proximate cause has
message `Random IOException` but occasionally this exception is wrapped two
layers deep, causing the test to fail. This commit adjusts the test to look at
the root cause of the exception instead.

      1> [2019-02-25T12:31:50,837][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotFileFailureDuringSnapshot] --> caught a top level exception, asserting what's expected
      1> org.elasticsearch.snapshots.SnapshotException: [test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] Snapshot could not be read
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:212) ~[main/:?]
      1> 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:135) ~[main/:?]
      1> 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:54) ~[main/:?]
      1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:127) ~[main/:?]
      1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:208) ~[main/:?]
      1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
      1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
      1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
      1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
      1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
      1> Caused by: org.elasticsearch.snapshots.SnapshotException: [test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] failed to get snapshots
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:564) ~[main/:?]
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:206) ~[main/:?]
      1> 	... 9 more
      1> Caused by: java.io.IOException: Random IOException
      1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.maybeIOExceptionOrBlock(MockRepository.java:275) ~[test/:?]
      1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.readBlob(MockRepository.java:317) ~[test/:?]
      1> 	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:101) ~[main/:?]
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreFormat.read(BlobStoreFormat.java:90) ~[main/:?]
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:560) ~[main/:?]
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:206) ~[main/:?]
      1> 	... 9 more

    FAILURE 0.59s J0 | SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot <<< FAILURES!
       > Throwable elastic#1: java.lang.AssertionError:
       > Expected: a string containing "Random IOException"
       >      but: was "[test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] failed to get snapshots"
       > 	at __randomizedtesting.SeedInfo.seed([B73CA847D4B4F52D:884E042D2D899330]:0)
       > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
       > 	at org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot(SharedClusterSnapshotRestoreIT.java:821)
       > 	at java.lang.Thread.run(Thread.java:748)
In a rolling upgrade from a version that pre-dates the introduction of
peer-recovery retention leases, eventually the primary lands on an upgraded
node. If this was a relocation or a promotion then it inherits the replicas of
the previous primary without performing peer recoveries, so there may not be
peer recovery retention leases for the other shard copies.

This change fixes this by adding leases as-needed on older indices.
@ywelsch ywelsch removed the v7.0.0 label Apr 4, 2019
@DaveCTurner DaveCTurner self-assigned this Apr 16, 2019
@DaveCTurner
Contributor Author

Closing in favour of meta-issue #41536

@DaveCTurner DaveCTurner deleted the 2019-02-19-soft-deletes-history-retention branch April 25, 2019 14:32
@DaveCTurner DaveCTurner restored the 2019-02-19-soft-deletes-history-retention branch April 25, 2019 14:32
@DaveCTurner DaveCTurner deleted the 2019-02-19-soft-deletes-history-retention branch July 23, 2022 10:44