Bulk operation fail to replicate operations when a mapping update times out #30379

bleskes · 2018-05-04T08:23:42Z

Starting with the refactoring in #22778 (released in 5.3), we may fail to properly
replicate operations when a mapping update on the master fails. If a bulk operation
needs a mapping update halfway through, it sends a request to the master before
continuing to index the operations. If that request times out or isn't acked (i.e.,
even one node in the cluster didn't process it within 30s), we end up throwing the
exception and aborting the entire bulk. This is a problem because the operations that
were already processed are then never replicated to the replicas. Although these
operations were never "acked" to the user (we threw an error), this causes the local
checkpoint on the replicas to lag (on 6.x) and the primary and replica to diverge.

This PR changes the logic to treat any mapping update failure as a document level
failure, meaning only the relevant indexing operation will fail.
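
As a rough illustration of the idea (not the actual Elasticsearch code), the sketch below processes a bulk item by item and records a failed mapping update as a failure of that one item instead of letting the exception abort the whole request. Every name in it (`BulkMappingFailureSketch`, `ItemResult`, `MappingUpdater`, `executeBulk`) is hypothetical and made up for this example; the real change lives in the shard-level bulk action and is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

// Hypothetical sketch: none of these types exist in Elasticsearch.
class BulkMappingFailureSketch {

    /** Outcome of a single bulk item: success, or a document-level failure. */
    static final class ItemResult {
        final String docId;
        final Exception failure; // null when the item succeeded

        ItemResult(String docId, Exception failure) {
            this.docId = docId;
            this.failure = failure;
        }
    }

    /** Stand-in for the mapping update request sent to the master. */
    interface MappingUpdater {
        void updateMappings(String docId) throws Exception;
    }

    /**
     * Processes every item. A mapping update failure is attached to the item
     * that needed the update; the loop then continues, so items processed
     * before and after it still produce results that can be replicated.
     */
    static List<ItemResult> executeBulk(List<String> docIds,
                                        Predicate<String> needsMappingUpdate,
                                        MappingUpdater updater) {
        List<ItemResult> results = new ArrayList<>();
        for (String docId : docIds) {
            try {
                if (needsMappingUpdate.test(docId)) {
                    updater.updateMappings(docId); // may time out or not be acked
                }
                results.add(new ItemResult(docId, null));
            } catch (Exception e) {
                // Document-level failure: record it and keep going
                // instead of rethrowing and aborting the entire bulk.
                results.add(new ItemResult(docId, e));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<ItemResult> results = executeBulk(
            Arrays.asList("doc1", "doc2", "doc3"),
            id -> id.equals("doc2"), // only doc2 needs a mapping update
            id -> { throw new TimeoutException("mapping update timed out"); });
        for (ItemResult r : results) {
            System.out.println(r.docId + ": " + (r.failure == null ? "ok" : "failed (" + r.failure.getMessage() + ")"));
        }
    }
}
```

In this sketch only the item that triggered the mapping update carries a failure; the other items complete normally, which is the behaviour the PR description asks for.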

Backport of #30244


elasticmachine · 2018-05-04T08:23:44Z

Pinging @elastic/es-distributed

ywelsch

LGTM

ywelsch · 2018-05-04T11:19:21Z

test/framework/src/main/java/org/elasticsearch/test/disruption/BlockMasterServiceOnMaster.java

+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicReference;
+
+public class BlockMasterServiceOnMaster extends SingleNodeDisruption {


5.x does not have a separate MasterService. Maybe just use BlockClusterStateProcessing instead of introducing this class?

Good suggestion. Will do.

bleskes · 2018-05-04T14:30:09Z

Thanks @ywelsch

bleskes added the >non-issue and :Distributed/CRUD labels May 4, 2018

bleskes requested a review from ywelsch May 4, 2018 08:23

ywelsch approved these changes May 4, 2018


bleskes added 2 commits May 4, 2018 14:00

rename

55ec22d

remove

4a43e72

bleskes merged commit e864427 into elastic:5.6 May 4, 2018

bleskes deleted the bulk_mapping_doc_failure_5_x branch May 4, 2018 14:29
