Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk operation fail to replicate operations when a mapping update times out #30379

Merged
merged 3 commits into from
May 4, 2018

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented May 4, 2018

Starting with the refactoring in #22778 (released in 5.3)

we may fail to properly replicate operation when a mapping update on master fails.
If a bulk operations needs a mapping update half way, it will send a request to the
master before continuing to index the operations. If that request times out or isn't
acked (i.e., even one node in the cluster didn't process it within 30s), we end up
throwing the exception and aborting the entire bulk. This is a problem because all
operations that were processed so far are not replicated any more to the replicas.
Although these operations were never "acked" to the user (we threw an error) it
cause the local checkpoint on the replicas to lag (on 6.x) and the primary and replica
to diverge.

This PR changes the logic to treat any mapping update failure as a document level
failure, meaning only the relevant indexing operation will fail.

Back port of #30244

we may fail to properly replicate operation when a mapping update on master fails.
If a bulk operations needs a mapping update half way, it will send a request to the
master before continuing to index the operations. If that request times out or isn't
acked (i.e., even one node in the cluster didn't process it within 30s), we end up
throwing the exception and aborting the entire bulk. This is a problem because all
operations that were processed so far are not replicated any more to the replicas.
Although these operations were never "acked" to the user (we threw an error) it
cause the local checkpoint on the replicas to lag (on 6.x) and the primary and replica
to diverge.

This PR changes the logic to treat *any* mapping update failure as a document level
failure, meaning only the relevant indexing operation will fail.

Backport of elastic#30244
@bleskes bleskes added >non-issue :Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. labels May 4, 2018
@bleskes bleskes requested a review from ywelsch May 4, 2018 08:23
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class BlockMasterServiceOnMaster extends SingleNodeDisruption {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5.x does not have a separate MasterService. Maybe just use BlockClusterStateProcessing instead of introducing this class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. Will do.

@bleskes bleskes merged commit e864427 into elastic:5.6 May 4, 2018
@bleskes bleskes deleted the bulk_mapping_doc_failure_5_x branch May 4, 2018 14:29
@bleskes
Copy link
Contributor Author

bleskes commented May 4, 2018

Thanks @ywelsch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >non-issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants