Use peer recovery retention leases for indices without soft-deletes #50351

Merged
dnhatn merged 11 commits into elastic:master from translog-prrl on Dec 20, 2019

Conversation

Member

@dnhatn dnhatn commented Dec 19, 2019

Today, the replica allocator uses peer recovery retention leases (PRRLs) to select the best-matched copies when allocating replicas of indices with soft-deletes. We can employ the same mechanism for indices without soft-deletes because the retaining sequence number of a PRRL is the persisted global checkpoint (plus one) of that copy. If the primary and replica have the same retaining sequence number, then we should be able to perform a noop recovery. The reason is that we must be retaining translog up to the local checkpoint of the safe commit, which is at most the global checkpoint of either copy. The only limitation is that we might not cancel ongoing file-based recoveries with PRRLs for noop recoveries. We can't make the translog retention policy comply with PRRLs. We also have this problem with soft-deletes if a PRRL is about to expire.

A nice side-effect of this is that we can turn off translog retention once all shards have started. However, I prefer leaving translog retention disconnected from PRRLs.
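The noop-recovery condition described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code in this PR: `Lease`, `peerRecoveryLease`, `findLease`, and `canPerformNoopRecovery` are hypothetical names, and the lease id is assumed to be the `peer_recovery/` prefix followed by the node id.

```java
import java.util.List;
import java.util.Optional;

public class NoopRecoverySketch {

    // Prefix of peer recovery retention lease ids (the suffix being the node id is an assumption).
    static final String PRRL_PREFIX = "peer_recovery/";

    /** Simplified stand-in for a retention lease; NOT the Elasticsearch class. */
    static final class Lease {
        final String id;
        final long retainingSeqNo;

        Lease(String id, long retainingSeqNo) {
            this.id = id;
            this.retainingSeqNo = retainingSeqNo;
        }
    }

    /** A PRRL retains history from the copy's persisted global checkpoint plus one. */
    static Lease peerRecoveryLease(String nodeId, long persistedGlobalCheckpoint) {
        return new Lease(PRRL_PREFIX + nodeId, persistedGlobalCheckpoint + 1);
    }

    /** Looks up the PRRL held for the given node, if any. */
    static Optional<Lease> findLease(List<Lease> leases, String nodeId) {
        return leases.stream().filter(l -> l.id.equals(PRRL_PREFIX + nodeId)).findFirst();
    }

    /** True only if both copies have a PRRL and the retaining sequence numbers match. */
    static boolean canPerformNoopRecovery(List<Lease> leases, String primaryNodeId, String replicaNodeId) {
        Optional<Lease> primary = findLease(leases, primaryNodeId);
        Optional<Lease> replica = findLease(leases, replicaNodeId);
        return primary.isPresent() && replica.isPresent()
            && primary.get().retainingSeqNo == replica.get().retainingSeqNo;
    }

    public static void main(String[] args) {
        List<Lease> leases = List.of(
            peerRecoveryLease("node-primary", 41),
            peerRecoveryLease("node-in-sync", 41),
            peerRecoveryLease("node-behind", 17));
        System.out.println(canPerformNoopRecovery(leases, "node-primary", "node-in-sync")); // true
        System.out.println(canPerformNoopRecovery(leases, "node-primary", "node-behind"));  // false
        System.out.println(canPerformNoopRecovery(leases, "node-primary", "node-unknown")); // false
    }
}
```

A copy without a lease is treated as a non-match here, so the sketch never reports a noop recovery for a copy it knows nothing about.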

Relates #45136
Relates #46959

@dnhatn dnhatn added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v8.0.0 v7.6.0 labels Dec 19, 2019
@dnhatn dnhatn requested a review from ywelsch December 19, 2019 08:20
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Recovery)

Contributor

@ywelsch ywelsch left a comment

This is looking good. I've left one question about strengthening the assertions in the tests.

continue;
}
assertNotNull(retentionLeases);
for (Map<String, ?> retentionLease : retentionLeases) {
if (((String) retentionLease.get("id")).startsWith("peer_recovery/")) {
Contributor

this does not require that there is always a peer recovery retention lease. Should we require finding such a lease, and for the right node?

Member Author

++. Adjusted in 6071abd.
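One way the stricter check requested above could look is sketched below. It is not the actual change made in 6071abd; `nodesWithPeerRecoveryLease`, `assertLeaseForEveryNode`, and `lease` are hypothetical helpers, and it assumes the text after the `peer_recovery/` prefix is the node id.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RequirePeerRecoveryLeases {

    static final String PRRL_PREFIX = "peer_recovery/";

    /** Collects the node ids (assumed to be the id suffix) that hold a peer recovery retention lease. */
    static Set<String> nodesWithPeerRecoveryLease(List<Map<String, ?>> retentionLeases) {
        Set<String> nodeIds = new HashSet<>();
        for (Map<String, ?> retentionLease : retentionLeases) {
            Object id = retentionLease.get("id");
            if (id instanceof String && ((String) id).startsWith(PRRL_PREFIX)) {
                nodeIds.add(((String) id).substring(PRRL_PREFIX.length()));
            }
        }
        return nodeIds;
    }

    /** Fails unless every expected node has a peer recovery retention lease. */
    static void assertLeaseForEveryNode(List<Map<String, ?>> retentionLeases, Set<String> expectedNodeIds) {
        Set<String> missing = new HashSet<>(expectedNodeIds);
        missing.removeAll(nodesWithPeerRecoveryLease(retentionLeases));
        if (!missing.isEmpty()) {
            throw new AssertionError("no peer recovery retention lease for nodes " + missing);
        }
    }

    /** Hypothetical helper that builds a lease entry shaped like the parsed stats response. */
    static Map<String, Object> lease(String id) {
        Map<String, Object> lease = new HashMap<>();
        lease.put("id", id);
        return lease;
    }

    public static void main(String[] args) {
        List<Map<String, ?>> leases = List.of(lease("peer_recovery/node-a"), lease("some_other_lease"));
        assertLeaseForEveryNode(leases, Set.of("node-a")); // passes: node-a has a lease
        try {
            assertLeaseForEveryNode(leases, Set.of("node-a", "node-b"));
        } catch (AssertionError e) {
            System.out.println("failed as expected: " + e.getMessage());
        }
    }
}
```

Unlike a loop that only inspects leases that happen to exist, this version fails when a node that holds a copy has no lease at all.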

@dnhatn dnhatn requested a review from ywelsch December 19, 2019 18:09
Contributor

@ywelsch ywelsch left a comment

LGTM

@@ -1278,7 +1278,7 @@ public void testOperationBasedRecovery() throws Exception {
}
}
flush(index, true);
- ensurePeerRecoveryRetentionLeasesRenewedAndSynced(index);
+ ensurePeerRecoveryRetentionLeasesRenewedAndSynced(index, false);
Contributor

Should we set alwaysExists to true if minimumNodeVersion() is on or after 7.6 (after backport) (here as well as in other places)?

Member Author

Sure, will do.
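The suggestion above amounts to deriving `alwaysExists` from the oldest node version in the cluster. A rough sketch under assumptions: `NodeVersion` is a minimal stand-in rather than Elasticsearch's version class, and `minimumNodeVersion` takes a hypothetical list argument here so the example stays self-contained.

```java
import java.util.List;

public class AlwaysExistsSketch {

    /** Minimal major.minor pair; a stand-in, NOT Elasticsearch's Version class. */
    static final class NodeVersion {
        final int major;
        final int minor;

        NodeVersion(int major, int minor) {
            this.major = major;
            this.minor = minor;
        }

        boolean onOrAfter(NodeVersion other) {
            return major != other.major ? major > other.major : minor >= other.minor;
        }
    }

    // First version assumed to create PRRLs for indices without soft-deletes (after backport).
    static final NodeVersion V_7_6 = new NodeVersion(7, 6);

    /** Returns the oldest version among the given nodes (hypothetical signature). */
    static NodeVersion minimumNodeVersion(List<NodeVersion> nodeVersions) {
        if (nodeVersions.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        NodeVersion min = nodeVersions.get(0);
        for (NodeVersion v : nodeVersions) {
            if (min.onOrAfter(v)) {
                min = v; // v is the same as or older than the current minimum
            }
        }
        return min;
    }

    /** Leases can only be required everywhere once every node is on or after 7.6. */
    static boolean alwaysExists(List<NodeVersion> nodeVersions) {
        return minimumNodeVersion(nodeVersions).onOrAfter(V_7_6);
    }

    public static void main(String[] args) {
        System.out.println(alwaysExists(List.of(new NodeVersion(7, 5), new NodeVersion(7, 6)))); // false
        System.out.println(alwaysExists(List.of(new NodeVersion(7, 6), new NodeVersion(8, 0)))); // true
    }
}
```

In a mixed-version cluster the flag stays `false`, so the test would tolerate a missing lease until the last old node has been upgraded.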

Member Author

dnhatn commented Dec 20, 2019

Thanks Yannick!

@dnhatn dnhatn merged commit cec6678 into elastic:master Dec 20, 2019
@dnhatn dnhatn deleted the translog-prrl branch December 20, 2019 05:39
dnhatn added a commit that referenced this pull request Dec 24, 2019
…50351)

dnhatn added a commit that referenced this pull request Dec 24, 2019
dnhatn added a commit that referenced this pull request Dec 26, 2019
testCancelRecoveryDuringPhase1 uses a mock of IndexShard, which can't
create retention leases. We need to stub the createRetentionLease method.

Relates #50351 
Closes #50424
dnhatn added a commit that referenced this pull request Dec 26, 2019
…0486)

We forgot to establish peer recovery retention leases for relocating primaries 
without soft-deletes.

Relates #50351
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement v7.6.0 v8.0.0-alpha1
4 participants