Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create missing PRRLs after primary activation #44009

Conversation

DaveCTurner
Copy link
Contributor

Today peer recovery retention leases (PRRLs) are created when starting a
replication group from scratch and during peer recovery. However, if the
replication group was migrated from nodes running a version which does not
create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was
relocated or promoted without first establishing all the expected leases.

It's not possible to establish these leases before or during primary
activation, so we must create them as soon as possible afterwards. This gives
weaker guarantees about history retention, since there's a possibility that
history will be discarded before it can be used. In practice such situations
are expected to occur only rarely.

This commit adds the machinery to create missing leases after primary
activation, and strengthens the assertions about the existence of such leases
in order to ensure that once all the leases do exist we never again enter a
state where there's a missing lease.

Relates #41536

Today peer recovery retention leases (PRRLs) are created when starting a
replication group from scratch and during peer recovery. However, if the
replication group was migrated from nodes running a version which does not
create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was
relocated or promoted without first establishing all the expected leases.

It's not possible to establish these leases before or during primary
activation, so we must create them as soon as possible afterwards. This gives
weaker guarantees about history retention, since there's a possibility that
history will be discarded before it can be used. In practice such situations
are expected to occur only rarely.

This commit adds the machinery to create missing leases after primary
activation, and strengthens the assertions about the existence of such leases
in order to ensure that once all the leases do exist we never again enter a
state where there's a missing lease.

Relates elastic#41536
@DaveCTurner DaveCTurner added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Jul 5, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Copy link
Contributor Author

Note that this PR is against the peer-recovery-retention-leases-7.x branch rather than against peer-recovery-retention-leases. I think we do not need this change in master since a rolling upgrade to master would come from a 7.x version with PRRLs; even if we performed two rolling upgrades in such quick succession that the 7.x primary didn't have time to create any missing leases, it would still create leases for the 8.0 peers during the upgrade. I will open a PR against peer-recovery-retention-leases to strengthen the assertions in master to that effect, although we don't have any double-rolling-ugprade tests today so I am not sure how meaningful this will be.

@DaveCTurner
Copy link
Contributor Author

@elasticmachine please run elasticsearch-ci/2

I opened #44011 as the failure seems unrelated and occurs elsewhere.

ywelsch
ywelsch previously approved these changes Jul 5, 2019
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left two minor comments, looking good o.w.
I would like for @dnhatn to have a look as well though

case OLD:
Settings.Builder settings = Settings.builder()
.put(IndexMetaData.INDEX_NUMBER_OF_SHARDS_SETTING.getKey(), between(1, 5))
.put(IndexMetaData.INDEX_NUMBER_OF_REPLICAS_SETTING.getKey(), 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps randomly 0 or 1 replica?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ see 1f2a197.

@DaveCTurner
Copy link
Contributor Author

even if we performed two rolling upgrades in such quick succession that the 7.x primary didn't have time to create any missing leases, it would still create leases for the 8.0 peers during the upgrade

Bah this is incorrect. We know that every 8.0 peer will have a lease, but we can't be certain that every 7.x peer has one. Therefore this PR will need forward-porting to peer-recovery-retention-leases.

@DaveCTurner DaveCTurner dismissed ywelsch’s stale review July 5, 2019 14:21

Thanks @ywelsch, dismissing this pending @dnhatn taking a look too so I don't forget this is still pending.

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a small ask but LGTM. Thanks @DaveCTurner.

return false;
}

public void testCanRecoverFromStoreWithoutPeerRecoveryRetentionLease() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a full cluster restart test with 0-2 replicas then verify that after the cluster is upgraded, every copy has PRRL installed properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, yes. I pushed aa7ac54.

@DaveCTurner DaveCTurner merged commit 4b19a4b into elastic:peer-recovery-retention-leases-7.x Jul 8, 2019
@DaveCTurner DaveCTurner deleted the 2019-07-05-prrls-bwc-create-after-primary-activation branch July 8, 2019 11:13
DaveCTurner added a commit that referenced this pull request Jul 8, 2019
Today peer recovery retention leases (PRRLs) are created when starting a
replication group from scratch and during peer recovery. However, if the
replication group was migrated from nodes running a version which does not
create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was
relocated or promoted without first establishing all the expected leases.

It's not possible to establish these leases before or during primary
activation, so we must create them as soon as possible afterwards. This gives
weaker guarantees about history retention, since there's a possibility that
history will be discarded before it can be used. In practice such situations
are expected to occur only rarely.

This commit adds the machinery to create missing leases after primary
activation, and strengthens the assertions about the existence of such leases
in order to ensure that once all the leases do exist we never again enter a
state where there's a missing lease.

Relates #41536
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants