
Do not optimize append-only operation if normal operation with higher seq# was seen #28787

Merged: 10 commits merged into elastic:master on Mar 26, 2018

Conversation

@dnhatn (Member) commented Feb 22, 2018

When processing an append-only operation, the primary knows that the operation can only conflict with another instance of the same operation, since its ID was freshly generated. However, this property doesn't hold on replicas. As soon as a document with an auto-generated ID has been indexed on the primary, it can be exposed to a search, and users can issue a follow-up operation on it. In extremely rare cases, the follow-up operation can arrive and be processed on a replica before the original append-only request. In that case we can't simply proceed with the append-only request and blindly add it to the index without consulting the version map. The following scenario can cause a difference between primary and replica.

  1. Primary indexes an auto-gen-id doc. (id=X, v=1, s#=20)
  2. A refresh cycle happens on primary
  3. The new doc is picked up and modified - say by a delete by query request - Primary gets a delete doc (id=X, v=2, s#=30)
  4. Delete doc is processed first on the replica (id=X, v=2, s#=30)
  5. The indexing operation arrives on the replica; since it's an auto-gen-id request and the retry marker is lower, we put it into Lucene without any check. The replica now has a doc the primary doesn't have.

To deal with a potential conflict between an append-only operation and a normal operation on replicas, we need to rely on sequence numbers. This commit maintains the max seq# of non-append-only operations seen on the replica, and applies the optimization to an append-only operation only if its seq# is higher than the seq# of every non-append-only operation seen so far.
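The marker logic described above can be sketched as a standalone class. This is illustrative only: the field name mirrors the PR's diff, everything else is simplified away from the real InternalEngine.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the replica-side guard this PR introduces;
// not the actual InternalEngine code.
class AppendOnlyOptimizationGuard {
    // Highest seq# of any non-append-only operation seen on this replica;
    // -1 means none seen yet.
    private final AtomicLong maxSeqNoOfNonAppendOnlyOperations = new AtomicLong(-1);

    // Record the seq# of a normal (non-append-only) index or delete operation.
    void onNonAppendOnlyOperation(long seqNo) {
        maxSeqNoOfNonAppendOnlyOperations.updateAndGet(curr -> Math.max(seqNo, curr));
    }

    // An append-only operation may skip the version-map lookup only if its
    // seq# is above every non-append-only seq# seen so far.
    boolean canOptimizeAppendOnly(long seqNo) {
        return seqNo > maxSeqNoOfNonAppendOnlyOperations.get();
    }
}
```

In the scenario above, the delete (s#=30) arrives first on the replica and raises the marker to 30, so the delayed append-only op (s#=20) fails the check and goes through the regular version-map path instead.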

@dnhatn dnhatn added >enhancement v7.0.0 v6.3.0 :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. labels Feb 22, 2018
@s1monw (Contributor) left a comment:

I left some suggestions but LGTM otherwise

    private void updateMaxSeqNoOfNonAppendOnlyOperations(Operation op) {
        assert op.origin() != Operation.Origin.PRIMARY;
        maxSeqNoOfNonAppendOnlyOperations.updateAndGet(curr -> Math.max(op.seqNo(), curr));
        assert maxSeqNoOfNonAppendOnlyOperations.get() >= op.seqNo();
Contributor:

this needs a message

return index.seqNo() <= maxSeqNoOfNonAppendOnlyOperations.get();
}

private void updateMaxSeqNoOfNonAppendOnlyOperations(Operation op) {
Contributor:

maybe just inline this into the planIndexingAsNonPrimary method? I think that would be cleaner.

*/
assert canOptimizeAddDocument(index);
assert index.origin() != Operation.Origin.PRIMARY;
return index.seqNo() <= maxSeqNoOfNonAppendOnlyOperations.get();
Contributor:

I think you should inline this into planIndexingAsNonPrimary then we don't need all the asserts

@DaveCTurner (Contributor) commented:

Thanks @dnhatn - I think we should agree on elastic/elasticsearch-formal-models#28 before we proceed with this. In particular, it's not yet clear that this is the only problem in this area, and we haven't settled whether we think replicas should have a different version map that only considers seq#s (the document version numbers are now only of importance on the primary).

@dnhatn (Member, Author) commented Feb 23, 2018

@DaveCTurner I understand your concern. Boaz and I discussed this and agreed to start the implementation for this and #28790, hoping that our reasoning for these holds up. Certainly, we need to keep the formal model and the implementation in sync.

@bleskes (Contributor) left a comment:

This is great. I left some nits. As discussed before, we should wait for blessing via @DaveCTurner 's work before merging.

break;
final String key = entry.getKey();
if (key.equals(MAX_UNSAFE_AUTO_ID_TIMESTAMP_COMMIT_ID)) {
maxUnsafeAutoIdTimestamp.set(Math.max(maxUnsafeAutoIdTimestamp.get(), Long.parseLong(entry.getValue())));
Contributor:

question - why the leniency with max?

Member Author:

I removed the max expr.

Contributor:

Looks like the max is still here for time stamps? maybe assert it's -1 (for both seq# and timestamp)
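The stricter handling suggested here might look roughly like the following sketch. This is hypothetical: only the commit-data key name mirrors the snippet above, the rest is simplified for illustration.

```java
import java.util.Map;

// Hypothetical sketch of the reviewer's suggestion: when restoring the marker
// from Lucene commit user data, assert it still holds its initial value (-1)
// instead of leniently taking a max. Not the merged code.
class CommitDataRestoreSketch {
    static final String MAX_UNSAFE_AUTO_ID_TIMESTAMP_COMMIT_ID = "max_unsafe_auto_id_timestamp";
    long maxUnsafeAutoIdTimestamp = -1;

    void restore(Map<String, String> commitUserData) {
        String value = commitUserData.get(MAX_UNSAFE_AUTO_ID_TIMESTAMP_COMMIT_ID);
        if (value != null) {
            assert maxUnsafeAutoIdTimestamp == -1
                : "timestamp already set to [" + maxUnsafeAutoIdTimestamp + "]";
            maxUnsafeAutoIdTimestamp = Long.parseLong(value);
        }
    }
}
```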

assert index.version() == 1L : "can optimize on replicas but incoming version is [" + index.version() + "]";
plan = IndexingStrategy.optimizedAppendOnly(index.seqNo());
} else {
if (appendOnlyRequest == false) {
Contributor:

can we introduce a method similar to mayHaveBeenIndexedBefore that does the check and also updates the maxSeqNoOfNonAppendOnlyOperations? I think it's good to have both marker handling consistent.

Contributor:

Apparently @s1monw preferred the reverse. I'm fine with leaving it as is.
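For context, the combined check-and-update shape proposed above might have looked roughly like this. This is a hypothetical sketch; the thread indicates the PR left the check and the marker update separate.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a single method, analogous in spirit to
// mayHaveBeenIndexedBefore, that both checks an incoming op against the
// marker and advances it. Not the merged code.
class CombinedMarkerSketch {
    private final AtomicLong maxSeqNoOfNonAppendOnlyOperations = new AtomicLong(-1);

    // For an append-only op: report whether it may conflict with an earlier
    // normal op. For a normal op: advance the marker and report no conflict.
    boolean mayConflictWithNonAppendOnly(long seqNo, boolean appendOnly) {
        if (appendOnly) {
            return seqNo <= maxSeqNoOfNonAppendOnlyOperations.get();
        }
        maxSeqNoOfNonAppendOnlyOperations.updateAndGet(curr -> Math.max(seqNo, curr));
        return false;
    }
}
```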

}
// doc1 is delayed and arrived after a non-append-only op.
final long seqNoDoc1 = localCheckpointTracker.generateSeqNo();
Engine.IndexResult regularDoc = engine.index(replicaIndexForDoc(
Contributor:

can we also test a delete?

@bleskes (Contributor) commented Mar 3, 2018

One more nit - can we make the PR title more exact about what the code changes? Something like do not optimize if a normal operation was seen with a higher seq# ...

@dnhatn dnhatn changed the title Do not optimize append-only operation if it may have been exposed Do not optimize append-only op if normal op with higher seq# was seen Mar 4, 2018
@dnhatn (Member, Author) commented Mar 4, 2018

@bleskes I've addressed your comments. Would you please take another look? Thank you.

@DaveCTurner (Contributor) commented:

I have updated elastic/elasticsearch-formal-models#28 as we discussed, so that's awaiting further comments. AIUI the model covers both this PR and #28790. From a first look, this seems to be quite different from the modelled solution and I'd be more comfortable if we brought them closer together (from either/both ends). In particular, we don't yet seem to be looking at the sequence numbers of replication requests here and I think we should.

I'm still travelling and haven't had a lot of sleep so expect more useful/correct/accurate feedback at a later date. I'm sending this now in case it's useful to those in more westerly timezones in the meantime.

@bleskes (Contributor) commented Mar 8, 2018

> In particular, we don't yet seem to be looking at the sequence numbers of replication requests here and I think we should.

@DaveCTurner can you clarify what you mean? We use the seq# of the incoming op to update the maxSeqNoOfNonAppendOnlyOperations field?

@DaveCTurner (Contributor) commented:

Apologies, that was written on my phone and is inaccurate. I meant that the modelled solution stores seq#s in the version map.

DaveCTurner added a commit to DaveCTurner/elasticsearch-formal-models that referenced this pull request Mar 26, 2018
@DaveCTurner (Contributor) left a comment:

LGTM.

@dnhatn (Member, Author) commented Mar 26, 2018

Thanks @s1monw, @bleskes and @DaveCTurner for reviewing.

@dnhatn dnhatn merged commit 0ac89a3 into elastic:master Mar 26, 2018
@dnhatn dnhatn deleted the append-only-marker branch March 26, 2018 21:01
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 27, 2018
* master:
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
@dnhatn dnhatn changed the title Do not optimize append-only op if normal op with higher seq# was seen Do not optimize append-only if seen normal op with higher seq# Mar 27, 2018
@dnhatn dnhatn changed the title Do not optimize append-only if seen normal op with higher seq# Do not optimize append-only operation if normal operation with higher seq# was seen Mar 27, 2018
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 27, 2018
* master: (40 commits)
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  elastic#28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (elastic#29156)
  Add file permissions checks to precommit task
  Remove execute mode bit from source files
  Optimize the composite aggregation for match_all and range queries (elastic#28745)
  [Docs] Add rank_eval size parameter k (elastic#29218)
  [DOCS] Remove ignore_z_value parameter link
  Docs: Update docs/index_.asciidoc (elastic#29172)
  Docs: Link C++ client lib elasticlient (elastic#28949)
  [DOCS] Unregister repository instead of deleting it (elastic#29206)
  Docs: HighLevelRestClient#multiSearch (elastic#29144)
  Add Z value support to geo_shape
  Remove type casts in logging in server component (elastic#28807)
  Change BroadcastResponse from ToXContentFragment to ToXContentObject (elastic#28878)
  REST : Split `RestUpgradeAction` into two actions (elastic#29124)
  Add error file docs to important settings
  ...
martijnvg added a commit that referenced this pull request Mar 28, 2018
* es/master: (22 commits)
  Fix building Javadoc JARs on JDK for client JARs (#29274)
  Require JDK 10 to build Elasticsearch (#29174)
  Decouple NamedXContentRegistry from ElasticsearchException (#29253)
  Docs: Update generating test coverage reports (#29255)
  [TEST] Fix issue with HttpInfo passed invalid parameter
  Remove all dependencies from XContentBuilder (#29225)
  Fix sporadic failure in CompositeValuesCollectorQueueTests
  Propagate ignore_unmapped to inner_hits (#29261)
  TEST: Increase timeout for testPrimaryReplicaResyncFailed
  REST client: hosts marked dead for the first time should not be immediately retried (#29230)
  TEST: Use different translog dir for a new engine
  Make SearchStats implement Writeable (#29258)
  [Docs] Spelling and grammar changes to reindex.asciidoc (#29232)
  Do not optimize append-only if seen normal op with higher seqno (#28787)
  [test] packaging: gradle tasks for groovy tests (#29046)
  Prune only gc deletes below local checkpoint (#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  #28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (#29156)
  Add file permissions checks to precommit task
  ...
DaveCTurner added a commit to elastic/elasticsearch-formal-models that referenced this pull request Mar 28, 2018
This models how indexing and deletion operations are handled on the replica,
including the optimisations for append-only operations and the interaction with
Lucene commits and the version map.

It incorporates

- elastic/elasticsearch#28787
- elastic/elasticsearch#28790
- elastic/elasticsearch#29276
- a proposal to always prune tombstones
dnhatn added a commit that referenced this pull request Mar 28, 2018
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.3.0 v7.0.0-beta1
5 participants