Bulk operation fail to replicate operations when a mapping update times out #30244
Conversation
Pinging @elastic/es-distributed
first round. pretty intense, thanks for getting your hands dirty here
return new BulkItemResultHolder(null, indexResult, bulkItemRequest);
} else {
IndexResponse response = new IndexResponse(primary.shardId(), indexRequest.type(), indexRequest.id(),
switch (indexResult.getResultType()) {
what was wrong with a boolean here? What does this result type buy us?
yeah, that's a good question. I was unsure how to do this and ended up with the enum, but I'm not 100% happy with it either. I wanted IndexShard.applyIndexOperationOnPrimary to always return without the extra callbacks. That method can end in success, failure, or a required mapping change (I didn't want to make this a failure). The problem with booleans is that you need to coordinate them - i.e., make sure you never end up with both a failure and a required mapping update at the same time. I hope this clarifies things. I'm open to suggestions.
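To illustrate the tradeoff, here is a minimal sketch of the tri-state result being discussed; the constant names follow the Engine.Result.Type values quoted elsewhere in this review, but the standalone enum is hypothetical:

```java
// With two booleans (failed, mappingUpdateRequired) every caller must also rule
// out the illegal combination of both being true; a single enum makes the three
// outcomes of applying an operation on the primary mutually exclusive.
enum ResultType {
    SUCCESS,                 // the operation was applied to the engine
    FAILURE,                 // the operation failed at the document level
    MAPPING_UPDATE_REQUIRED  // the caller must update the mappings and retry
}
```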
I am ok with it. stuff like this is always tricky
default:
    throw new IllegalStateException("Unexpected request operation type on replica: "
        + docWriteRequest.opType().getLowercase());
}
if (result.getResultType() == Engine.Result.Type.MAPPING_UPDATE_REQUIRED) {
can't we have a second boolean instead of this context-sensitive type?
opsRecovered++;
recoveryState.getTranslog().incrementRecoveredOperations();
} catch (Exception e) {
    if (ExceptionsHelper.status(e) == RestStatus.BAD_REQUEST) {
        // mainly for MapperParsingException and Failure to detect xcontent
        logger.info("ignoring recovery of a corrupt translog entry", e);
    } else if (e instanceof RuntimeException) {
maybe use ExceptionsHelper#convertToRuntime
+1, I failed to find that utility. Thanks for pointing it out.
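A hedged sketch of how the quoted catch block could use the suggested utility; as far as I can tell, ExceptionsHelper.convertToRuntime rethrows a RuntimeException unchanged and wraps checked exceptions, which replaces the manual instanceof branch:

```java
} catch (Exception e) {
    if (ExceptionsHelper.status(e) == RestStatus.BAD_REQUEST) {
        // mainly for MapperParsingException and failure to detect xcontent
        logger.info("ignoring recovery of a corrupt translog entry", e);
    } else {
        // rethrows RuntimeExceptions as-is, wraps checked exceptions
        throw ExceptionsHelper.convertToRuntime(e);
    }
}
```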
return new BulkItemResultHolder(response, deleteResult, bulkItemRequest);
case FAILURE:
    return new BulkItemResultHolder(null, deleteResult, bulkItemRequest);
I think you should have all options listed here and not use default. Be explicit here please.
sure. I'll add an explicit line + a default for future additions to the enum
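For illustration, a hedged sketch of the agreed-on pattern: every enum constant listed explicitly, plus a default that fails hard if a constant is added later. The MAPPING_UPDATE_REQUIRED branch shown here is illustrative, not the PR's actual code:

```java
switch (deleteResult.getResultType()) {
    case SUCCESS:
        return new BulkItemResultHolder(response, deleteResult, bulkItemRequest);
    case FAILURE:
        return new BulkItemResultHolder(null, deleteResult, bulkItemRequest);
    case MAPPING_UPDATE_REQUIRED:
        // handled before we get here; listed so the switch stays explicit
        throw new IllegalStateException("unexpected mapping update");
    default:
        // fail hard on future enum additions that nobody handled
        throw new AssertionError("unknown result type: " + deleteResult.getResultType());
}
```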
Engine.IndexResult result;
result = primary.applyIndexOperationOnPrimary(request.version(), request.versionType(), sourceToParse,
    request.getAutoGeneratedTimestamp(), request.isRetry(), update -> mappingUpdater.verifyMappings(update, primary.shardId()));
this looks pretty much like the code in the delete part below. Can we maybe break it out into a shared routine and pass a closure to it to actually process the operation?
+1. I'll do this once the other discussion settles.
case MAPPING_UPDATE_REQUIRED:
    throw new IllegalArgumentException("unexpected mapping update: " + result.getRequiredMappingUpdate());
case SUCCESS:
    break;
please add a default case here to make sure we fail hard if we miss it.
postIndex(shardId, index, result.getFailure());
switch (result.getResultType()) {
    case SUCCESS:
        if (!index.origin().isRecovery()) {
use == false
🤣
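For readers unfamiliar with the convention: the Elasticsearch codebase prefers an explicit `== false` comparison over the `!` operator because the lone bang is easy to miss during review. A before/after sketch of the quoted line:

```java
// easy to overlook the negation when skimming:
if (!index.origin().isRecovery()) {
    // ...
}
// preferred style in this codebase; the comparison stands out:
if (index.origin().isRecovery() == false) {
    // ...
}
```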
postDelete(shardId, delete, result.getFailure());
switch (result.getResultType()) {
    case SUCCESS:
        if (!delete.origin().isRecovery()) {
use == false
@@ -449,6 +456,56 @@ public void testVerifyApiBlocksDuringPartition() throws Exception {
}

@TestLogging(
really?
would it help if I fold it into one line? :)
can't we remove it? why is it there?
Because any failure of this test is useless without these logs. I thought this is what we agreed on - i.e., we selectively enable detailed debug logging on these types of tests.
ok maybe leave a comment
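A hedged sketch of what the suggested comment could look like; the annotation value and the test name here are illustrative, not the PR's actual ones:

```java
// Detailed trace logging is enabled on purpose: a failure of this test is
// effectively undebuggable from the default logs, so we selectively turn on
// debug/trace logging for these disruption-style tests.
@TestLogging("_root:DEBUG,org.elasticsearch.action.bulk:TRACE,org.elasticsearch.indices.recovery:TRACE")
public void testMappingTimeoutOnPartition() throws Exception {
    // ...
}
```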
request.getAutoGeneratedTimestamp(), request.isRetry());
if (result.getResultType() == Engine.Result.Type.MAPPING_UPDATE_REQUIRED) {
    // double mapping update. We assume that the successful mapping update wasn't yet processed on the node
    // and retry the entire request again.
how do we expect this to happen? The request must have been processed on the node, otherwise it would not have been acked?
Do we have any tests that cover the behavior here?
That's a fair statement. I don't think this is possible given the current state of the code. I copied it here from the verify method of the mapping updater in order to preserve behavior. See here:

elasticsearch/server/src/main/java/org/elasticsearch/action/bulk/TransportShardBulkAction.java
Line 583 in 5e4d0b4:
throw new ReplicationOperation.RetryOnPrimaryException(shardId,
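For context, a hedged condensation of that preserved behavior, pieced together from the diffs quoted in this thread rather than copied verbatim:

```java
// the operation is retried after the mapping update was acked cluster-wide;
// if the engine *still* demands a mapping update, this node's cluster state
// hasn't caught up yet, so the whole request is retried on the primary
if (result.getResultType() == Engine.Result.Type.MAPPING_UPDATE_REQUIRED) {
    throw new ReplicationOperation.RetryOnPrimaryException(shardId,
        "Dynamic mappings are not available on the node that holds the primary yet");
}
```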
throw new ReplicationOperation.RetryOnPrimaryException(shardId,
    "Dynamic mappings are not available on the node that holds the primary yet");
}
Objects.requireNonNull(update);
should this be an assertion?
sure.
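A hedged before/after sketch of the change being agreed to (the assertion message is illustrative):

```java
// before: a runtime null check that also runs in production builds
Objects.requireNonNull(update);
// after: an invariant of the caller, verified only when assertions are enabled
assert update != null : "mapping update is mandatory for this result type";
```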
private Translog.Location translogLocation;
private long took;

protected Result(Operation.TYPE operationType, Exception failure, long version, long seqNo) {
    Objects.requireNonNull(failure);
You can inline this directly with the assignment:
this.failure = Objects.requireNonNull(failure);
+1
LGTM
@ywelsch thanks. I addressed your comments. Care to take another look?
sample packaging tests
run sample packaging tests
We have encountered the same issue with Elasticsearch v6.2.4. Can you merge it into the 6.2 branch also?
Hi, this would be extremely helpful. Thanks!
Bulk operation fail to replicate operations when a mapping update times out (#30379)

Starting with the refactoring in #22778 (released in 5.3) we may fail to properly replicate operations when a mapping update on the master fails. If a bulk operation needs a mapping update halfway through, it will send a request to the master before continuing to index the operations. If that request times out or isn't acked (i.e., even one node in the cluster didn't process it within 30s), we end up throwing the exception and aborting the entire bulk. This is a problem because all operations that were processed so far are no longer replicated to the replicas. Although these operations were never "acked" to the user (we threw an error), this causes the local checkpoint on the replicas to lag (on 6.x) and the primary and replicas to diverge. This PR changes the logic to treat *any* mapping update failure as a document level failure, meaning only the relevant indexing operation will fail. Backport of #30244
@bleskes any plans to backport it to 6.2? Thanks for your amazing help.
We are actively working on releasing 6.3.0, which will contain this fix. Once 6.3.0 is out, we will no longer release a new patch release in the 6.2.x series. The reason is that 6.3 is a minor release and is fully compatible with previous 6.x releases. You can upgrade to it just as easily as to a 6.2.x release using the standard rolling upgrade mechanism.
Thanks @bleskes, is it safe to assume that 6.3.0 will be released in the next few days?
We can't say anything concrete. I don't think the next few days is realistic.
Got it. Thanks for responding.
@bleskes, do you have any idea about the 6.3 launch? I mean, is the plan to launch in a few days, a few months, or several months?
@candreoliveira I can't say for sure, but several months is extremely unlikely.
@bleskes any more updates on when 6.3 will be released? This specific issue is causing us a lot of problems.
@jt6211 still working on stabilizing it... soon is all I can say at this point.
Synced-flush consists of three steps: (1) force-flush on every active copy; (2) check for ongoing indexing operations; (3) seal copies if there's no change since step 1. If some indexing operations are completed on the primary but not on the replicas, then the Lucene commits from step 1 on the replicas won't be the same as the primary's, yet step 2 would still pass if it's executed once all pending operations are done. Once step 2 passes, we will incorrectly emit the "out of sync" warning message although nothing is wrong here. Relates #28464 Relates #30244
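A hedged pseudocode outline of the three steps described in that commit message; every identifier below is illustrative and none of these names are the actual SyncedFlushService API:

```java
// Hypothetical outline of the synced-flush protocol described above.
void syncedFlush(String shardId) {
    // step 1: force-flush every active copy and record each copy's commit id
    Map<String, String> commits = forceFlushAllCopies(shardId);
    // step 2: check for ongoing indexing operations on the shard
    if (ongoingIndexingOperations(shardId) > 0) {
        return; // cannot seal while writes are still in flight
    }
    // step 3: seal each copy only if its commit is unchanged since step 1
    sealCopiesIfUnchanged(shardId, commits);
}
```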
Starting with the refactoring in #22778 (released in 5.3) we may fail to properly replicate operations when a mapping update on the master fails. If a bulk operation needs a mapping update halfway through, it will send a request to the master before continuing to index the operations. If that request times out or isn't acked (i.e., even one node in the cluster didn't process it within 30s), we end up throwing the exception and aborting the entire bulk. This is a problem because all operations that were processed so far are no longer replicated to the replicas. Although these operations were never "acked" to the user (we threw an error), this causes the local checkpoint on the replicas to lag (on 6.x) and the primary and replicas to diverge.
This PR does a couple of things:

- It treats any mapping update failure as a document level failure, meaning only the relevant indexing operation fails (see the sketch after this list).
- It refactors IndexShard.applyIndexOperationOnPrimary and similar methods for simpler execution; we no longer use exceptions when a mapping update was successful.

I think we need to do more work here (the fact that a single slow node can prevent those mapping updates from being acked and thus fail operations is bad), but I want to keep this as small as I can (it is already too big).
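To make the first point concrete, a hedged sketch of the resulting primary-side flow. The applyIndexOperationOnPrimary call mirrors the diffs quoted above, while mappingUpdater.updateMappings and exceptionToResult are illustrative stand-ins, not the PR's actual helpers:

```java
Engine.IndexResult result = primary.applyIndexOperationOnPrimary(request.version(), request.versionType(),
    sourceToParse, request.getAutoGeneratedTimestamp(), request.isRetry(),
    update -> mappingUpdater.verifyMappings(update, primary.shardId()));
if (result.getResultType() == Engine.Result.Type.MAPPING_UPDATE_REQUIRED) {
    try {
        // this cluster-level update can time out or fail to be acked within 30s
        mappingUpdater.updateMappings(result.getRequiredMappingUpdate(), primary.shardId(), request.type());
    } catch (Exception e) {
        // the crux of the fix: record a document level failure for this item only,
        // instead of throwing and aborting (and un-replicating) the whole bulk
        return new BulkItemResultHolder(null, exceptionToResult(e), bulkItemRequest);
    }
    // mappings are in place now; apply the operation again
    result = primary.applyIndexOperationOnPrimary(request.version(), request.versionType(),
        sourceToParse, request.getAutoGeneratedTimestamp(), request.isRetry(),
        update -> mappingUpdater.verifyMappings(update, primary.shardId()));
}
```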
Note that this needs to go to 5.x but I'm not sure how cleanly it will backport. I'll evaluate once this has been reviewed and put into 7.0 & 6.x.