
Replica allocation consider no-op #42518

Conversation

henningandersen (Contributor)

This is a first step away from sync-ids. We now check if replica and
primary are identical using sequence numbers when determining where to
allocate a replica shard.

If an index is no longer indexed into, issuing a regular flush will now
be enough to ensure a no-op recovery is done.

This has the nice side-effect of ensuring that closed indices and frozen
indices choose existing shard copies with identical data over
file-overlap comparison, increasing the chance that we end up doing a
no-op recovery (only no-op and file-based recoveries are supported by
closed indices).

Relates #41400 and #33888

Supersedes #41784
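
To make the check concrete, here is a minimal, self-contained sketch of a sequence-number-based identity test in the spirit of this change; NoopRecoveryCheck and StoreInfo are invented names, not the actual Elasticsearch classes.

class NoopRecoveryCheck {
    // Invented names; this mirrors the condition quoted in the review below.
    record StoreInfo(long maxSeqNo, long provideRecoverySeqNo, long requireRecoverySeqNo) {}

    static boolean isNoopRecovery(StoreInfo primary, StoreInfo candidate) {
        // The copies are identical when they hold the same operations (equal
        // max sequence number) and the candidate needs nothing replayed: its
        // required recovery point is one past the shared maximum.
        return primary.maxSeqNo() == candidate.maxSeqNo()
            && primary.provideRecoverySeqNo() <= candidate.requireRecoverySeqNo()
            && candidate.requireRecoverySeqNo() == primary.maxSeqNo() + 1;
    }
}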

henningandersen added the >enhancement, :Distributed/Allocation, v8.0.0, and v7.3.0 labels on May 24, 2019
elasticmachine (Collaborator)

Pinging @elastic/es-distributed

Hopefully this makes the test succeed in CI too.
Now lock during file cleanup to protect snapshotRecoveryMetadata from
seeing half-copied data. snapshotRecoveryMetadata now handles peer
recovery and existing store recovery specifically, returning an empty
snapshot for other recovery types (local shards, restore snapshot).
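
A generic sketch of that locking pattern (names invented for illustration, not the actual Store API): take the same mutex around file cleanup and around taking the metadata snapshot, so a reader can never observe a half-cleaned set of files.

import java.util.concurrent.locks.ReentrantLock;

// Invented illustration of the mutual exclusion described above.
class StoreFiles {
    private final ReentrantLock mutex = new ReentrantLock();

    void cleanupFiles() {
        mutex.lock();
        try {
            // delete or replace store files; intermediate states stay invisible
        } finally {
            mutex.unlock();
        }
    }

    MetadataSnapshot snapshotRecoveryMetadata() {
        mutex.lock();
        try {
            return new MetadataSnapshot(); // built from a consistent view of the files
        } finally {
            mutex.unlock();
        }
    }

    record MetadataSnapshot() {}
}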
ywelsch requested a review from dnhatn on June 4, 2019
dnhatn (Member) left a comment

This looks great. Thanks @henningandersen. Would you mind splitting this PR into multiple smaller pieces?

/**
 * We test that a closed index makes no-op replica allocation only.
 */
public void testClosedIndexReplicaAllocation() throws Exception {

dnhatn:

I think this test passed with the current behaviour. Can we make a small PR for this test only?

 * Whenever we see a new data node, we clear the information we have on the primary to ensure it is at least as recent
 * as the start of the new node. This reduces the risk of making a decision on stale information from the primary.
 */
private void ensureAsyncFetchStorePrimaryRecency(RoutingAllocation allocation) {

dnhatn:

Can you make a separate PR for this enhancement?
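
As a rough, self-contained sketch of that recency idea (all names invented, not the actual RoutingAllocation or async-fetch machinery): remember which data nodes have been seen, and drop the cached primary info whenever a new one appears so the next allocation round re-fetches it.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Invented types for illustration only.
class PrimaryInfoCache {
    private final Set<String> knownDataNodeIds = new HashSet<>();
    private final Map<String, Long> primaryMaxSeqNoByShard = new HashMap<>();

    // If the cluster gained a data node, cached primary info may predate it,
    // so clear the cache and re-fetch before deciding where replicas go.
    void ensurePrimaryRecency(Set<String> currentDataNodeIds) {
        if (knownDataNodeIds.addAll(currentDataNodeIds)) {
            primaryMaxSeqNoByShard.clear();
        }
    }
}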

return primaryStore.hasSeqNoInfo()
    && primaryStore.maxSeqNo() == candidateStore.maxSeqNo()
    && primaryStore.provideRecoverySeqNo() <= candidateStore.requireRecoverySeqNo()
    && candidateStore.requireRecoverySeqNo() == primaryStore.maxSeqNo() + 1;

dnhatn:

Not sure if we need the last condition?
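
For intuition on what that last clause rules out (my reading of the snippet, not necessarily the author's), consider hypothetical numbers:

// Invented values for illustration:
//   primary:   maxSeqNo = 100, provideRecoverySeqNo = 90
//   candidate: maxSeqNo = 100, requireRecoverySeqNo = 95
// The first two clauses pass (100 == 100 and 90 <= 95), yet the candidate
// still requires operations 95..100 to be replayed, so this would not be a
// no-op recovery. The last clause only matches a candidate that needs
// nothing at all:
//   candidateStore.requireRecoverySeqNo() == primaryStore.maxSeqNo() + 1  // 101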

 * Finalize index recovery. Manipulate store files, clean up old files, generate a new empty translog and do other
 * housekeeping for retention leases.
 */
public void finalizeIndexRecovery(CheckedRunnable<IOException> manipulateStore, long globalCheckpoint,

dnhatn:

Can you also make a separate PR for this enhancement?
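
A generic illustration of the finalize steps listed in that javadoc (stubbed, invented methods; not the actual IndexShard API):

// Sketch only: the helpers are empty stubs standing in for real housekeeping.
class RecoveryFinalizer {
    void finalizeIndexRecovery(Runnable manipulateStore, long globalCheckpoint) {
        manipulateStore.run();                  // put the recovered store files in place
        cleanupOldFiles();                      // drop files the new commit no longer references
        createEmptyTranslog(globalCheckpoint);  // fresh translog anchored at the global checkpoint
        doRetentionLeaseHousekeeping();         // keep retention leases consistent with the new state
    }

    private void cleanupOldFiles() {}
    private void createEmptyTranslog(long globalCheckpoint) {}
    private void doRetentionLeaseHousekeeping() {}
}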

henningandersen (Contributor, Author)

Thanks for reviewing @dnhatn, I have marked this WIP and will split it into multiple PRs (and then close this one).

DaveCTurner (Contributor)

I have opened #46318 to track the underlying issue, and am closing this since we won't be merging it as it is now. We will certainly use it for inspiration in the work on #46318.

dnhatn added a commit that referenced this pull request on Sep 28, 2019
Today, we don't clear the shard info of the primary shard when a new
node joins; we thus risk making replica allocation decisions based on
stale information from the primary. The serious problem is that we can
cancel an ongoing recovery that is more advanced than the copy on the
new node because of the old info we have from the primary.

With this change, we ensure the shard info from the primary is not older
than the start of any node when allocating replicas.

Relates #46959

This work was done by Henning in #42518.
Co-authored-by: Henning Andersen <henning.andersen@elastic.co>
dnhatn added a commit that referenced this pull request on Oct 2, 2019