
Do not optimize append-only operation if normal operation with higher seq# was seen #28787

Merged: 10 commits merged into elastic:master on Mar 26, 2018

Conversation

@dnhatn (Member) commented Feb 22, 2018

When processing an append-only operation, the primary knows that the operation can only conflict with another instance of the same operation, since its ID was freshly generated. However, this property doesn't hold on replicas. As soon as a document with an auto-generated ID has been indexed on the primary, it can be exposed to a search, and users can issue a follow-up operation on it. In extremely rare cases, the follow-up operation can arrive and be processed on a replica before the original append-only request. In that case we can't simply proceed with the append-only request and blindly add it to the index without consulting the version map. The following scenario can cause a difference between primary and replica.

  1. Primary indexes an auto-gen-id doc. (id=X, v=1, s#=20)
  2. A refresh cycle happens on primary
  3. The new doc is picked up and modified - say by a delete by query request - Primary gets a delete doc (id=X, v=2, s#=30)
  4. Delete doc is processed first on the replica (id=X, v=2, s#=30)
  5. The indexing operation arrives on the replica; since it's an auto-gen-id request and the retry marker is lower, we put it into Lucene without any check. The replica now has a doc the primary doesn't have.

To deal with a potential conflict between an append-only operation and a normal operation on replicas, we need to rely on sequence numbers. This commit maintains the max seq# of non-append-only operations seen on the replica, and applies the optimization to an append-only operation only if its seq# is higher than the seq# of every non-append-only operation seen so far.
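The marker logic described above can be sketched as a standalone class. This is illustrative only: the field name mirrors the PR's diff, everything else is simplified away from the real InternalEngine.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the replica-side guard this PR introduces;
// not the actual InternalEngine code.
class AppendOnlyOptimizationGuard {
    // Highest seq# of any non-append-only operation seen on this replica;
    // -1 means none seen yet.
    private final AtomicLong maxSeqNoOfNonAppendOnlyOperations = new AtomicLong(-1);

    // Record the seq# of a normal (non-append-only) index or delete operation.
    void onNonAppendOnlyOperation(long seqNo) {
        maxSeqNoOfNonAppendOnlyOperations.updateAndGet(curr -> Math.max(seqNo, curr));
    }

    // An append-only operation may skip the version-map lookup only if its
    // seq# is above every non-append-only seq# seen so far.
    boolean canOptimizeAppendOnly(long seqNo) {
        return seqNo > maxSeqNoOfNonAppendOnlyOperations.get();
    }
}
```

In the scenario above, the delete (s#=30) arrives first on the replica and raises the marker to 30, so the delayed append-only op (s#=20) fails the check and goes through the regular version-map path instead.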

@dnhatn dnhatn added >enhancement v7.0.0 v6.3.0 :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. labels Feb 22, 2018
@s1monw (Contributor) left a comment:

I left some suggestions but LGTM otherwise

    private void updateMaxSeqNoOfNonAppendOnlyOperations(Operation op) {
        assert op.origin() != Operation.Origin.PRIMARY;
        maxSeqNoOfNonAppendOnlyOperations.updateAndGet(curr -> Math.max(op.seqNo(), curr));
        assert maxSeqNoOfNonAppendOnlyOperations.get() >= op.seqNo();
Contributor:

this needs a message

return index.seqNo() <= maxSeqNoOfNonAppendOnlyOperations.get();
}

private void updateMaxSeqNoOfNonAppendOnlyOperations(Operation op) {
Contributor:

maybe just inline this into the planIndexingAsNonPrimary method? I think that would be cleaner.

*/
assert canOptimizeAddDocument(index);
assert index.origin() != Operation.Origin.PRIMARY;
return index.seqNo() <= maxSeqNoOfNonAppendOnlyOperations.get();
Contributor:

I think you should inline this into planIndexingAsNonPrimary then we don't need all the asserts

@DaveCTurner (Contributor) commented:

Thanks @dnhatn - I think we should agree on elastic/elasticsearch-formal-models#28 before we proceed with this. In particular, it's not yet clear that this is the only problem in this area, and we haven't settled whether we think replicas should have a different version map that only considers seq#s (the document version numbers are now only of importance on the primary).

@dnhatn (Member, Author) commented Feb 23, 2018

@DaveCTurner I understand your concern. Boaz and I discussed this and agreed to start the implementation for this and #28790, hoping that our reasoning for these holds up. Certainly, we need to keep the formal model and the implementation in sync.

@bleskes (Contributor) left a comment:

This is great. I left some nits. As discussed before, we should wait for blessing via @DaveCTurner 's work before merging.

break;
final String key = entry.getKey();
if (key.equals(MAX_UNSAFE_AUTO_ID_TIMESTAMP_COMMIT_ID)) {
maxUnsafeAutoIdTimestamp.set(Math.max(maxUnsafeAutoIdTimestamp.get(), Long.parseLong(entry.getValue())));
Contributor:

question - why the leniency with max?

Member Author:

I removed the max expr.

Contributor:

Looks like the max is still here for time stamps? maybe assert it's -1 (for both seq# and timestamp)
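The stricter handling suggested here might look roughly like the following sketch. This is hypothetical: only the commit-data key name mirrors the snippet above, the rest is simplified for illustration.

```java
import java.util.Map;

// Hypothetical sketch of the reviewer's suggestion: when restoring the marker
// from Lucene commit user data, assert it still holds its initial value (-1)
// instead of leniently taking a max. Not the merged code.
class CommitDataRestoreSketch {
    static final String MAX_UNSAFE_AUTO_ID_TIMESTAMP_COMMIT_ID = "max_unsafe_auto_id_timestamp";
    long maxUnsafeAutoIdTimestamp = -1;

    void restore(Map<String, String> commitUserData) {
        String value = commitUserData.get(MAX_UNSAFE_AUTO_ID_TIMESTAMP_COMMIT_ID);
        if (value != null) {
            assert maxUnsafeAutoIdTimestamp == -1
                : "timestamp already set to [" + maxUnsafeAutoIdTimestamp + "]";
            maxUnsafeAutoIdTimestamp = Long.parseLong(value);
        }
    }
}
```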

assert index.version() == 1L : "can optimize on replicas but incoming version is [" + index.version() + "]";
plan = IndexingStrategy.optimizedAppendOnly(index.seqNo());
} else {
if (appendOnlyRequest == false) {
Contributor:

can we introduce a method similar to mayHaveBeenIndexedBefore that does the check and also updates the maxSeqNoOfNonAppendOnlyOperations? I think it's good to have both marker handling consistent.

Contributor:

Apparently @s1monw preferred the reverse. I'm fine with leaving it as is.
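For context, the combined check-and-update shape proposed above might have looked roughly like this. This is a hypothetical sketch; the thread indicates the PR left the check and the marker update separate.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a single method, analogous in spirit to
// mayHaveBeenIndexedBefore, that both checks an incoming op against the
// marker and advances it. Not the merged code.
class CombinedMarkerSketch {
    private final AtomicLong maxSeqNoOfNonAppendOnlyOperations = new AtomicLong(-1);

    // For an append-only op: report whether it may conflict with an earlier
    // normal op. For a normal op: advance the marker and report no conflict.
    boolean mayConflictWithNonAppendOnly(long seqNo, boolean appendOnly) {
        if (appendOnly) {
            return seqNo <= maxSeqNoOfNonAppendOnlyOperations.get();
        }
        maxSeqNoOfNonAppendOnlyOperations.updateAndGet(curr -> Math.max(seqNo, curr));
        return false;
    }
}
```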

}
// doc1 is delayed and arrived after a non-append-only op.
final long seqNoDoc1 = localCheckpointTracker.generateSeqNo();
Engine.IndexResult regularDoc = engine.index(replicaIndexForDoc(
Contributor:

can we also test a delete?

@bleskes (Contributor) commented Mar 3, 2018

One more nit - can we make the PR title more exact about what the code changes? Something like do not optimize if a normal operation was seen with a higher seq# ...

@dnhatn dnhatn changed the title Do not optimize append-only operation if it may have been exposed Do not optimize append-only op if normal op with higher seq# was seen Mar 4, 2018
@dnhatn (Member, Author) commented Mar 4, 2018

@bleskes I've addressed your comments. Would you please take another look? Thank you.

@DaveCTurner (Contributor) commented:

I have updated elastic/elasticsearch-formal-models#28 as we discussed, so that's awaiting further comments. AIUI the model covers both this PR and #28790. From a first look, this seems to be quite different from the modelled solution and I'd be more comfortable if we brought them closer together (from either/both ends). In particular, we don't yet seem to be looking at the sequence numbers of replication requests here and I think we should.

I'm still travelling and haven't had a lot of sleep so expect more useful/correct/accurate feedback at a later date. I'm sending this now in case it's useful to those in more westerly timezones in the meantime.

@bleskes (Contributor) commented Mar 8, 2018

> In particular, we don't yet seem to be looking at the sequence numbers of replication requests here and I think we should.

@DaveCTurner can you clarify what you mean? We use the seq# of the incoming op to update the maxSeqNoOfNonAppendOnlyOperations field?

@DaveCTurner (Contributor) commented:

Apologies, that was written on my phone and is inaccurate. I meant that the modelled solution stores seq#s in the version map.

DaveCTurner added a commit to DaveCTurner/elasticsearch-formal-models that referenced this pull request Mar 26, 2018
@DaveCTurner (Contributor) left a comment:

LGTM.

@dnhatn (Member, Author) commented Mar 26, 2018

Thanks @s1monw, @bleskes and @DaveCTurner for reviewing.

@dnhatn dnhatn merged commit 0ac89a3 into elastic:master Mar 26, 2018
@dnhatn dnhatn deleted the append-only-marker branch March 26, 2018 21:01
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 27, 2018
* master:
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
@dnhatn dnhatn changed the title Do not optimize append-only op if normal op with higher seq# was seen Do not optimize append-only if seen normal op with higher seq# Mar 27, 2018
@dnhatn dnhatn changed the title Do not optimize append-only if seen normal op with higher seq# Do not optimize append-only operation if normal operation with higher seq# was seen Mar 27, 2018
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 27, 2018
* master: (40 commits)
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  elastic#28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (elastic#29156)
  Add file permissions checks to precommit task
  Remove execute mode bit from source files
  Optimize the composite aggregation for match_all and range queries (elastic#28745)
  [Docs] Add rank_eval size parameter k (elastic#29218)
  [DOCS] Remove ignore_z_value parameter link
  Docs: Update docs/index_.asciidoc (elastic#29172)
  Docs: Link C++ client lib elasticlient (elastic#28949)
  [DOCS] Unregister repository instead of deleting it (elastic#29206)
  Docs: HighLevelRestClient#multiSearch (elastic#29144)
  Add Z value support to geo_shape
  Remove type casts in logging in server component (elastic#28807)
  Change BroadcastResponse from ToXContentFragment to ToXContentObject (elastic#28878)
  REST : Split `RestUpgradeAction` into two actions (elastic#29124)
  Add error file docs to important settings
  ...
martijnvg added a commit that referenced this pull request Mar 28, 2018
* es/master: (22 commits)
  Fix building Javadoc JARs on JDK for client JARs (#29274)
  Require JDK 10 to build Elasticsearch (#29174)
  Decouple NamedXContentRegistry from ElasticsearchException (#29253)
  Docs: Update generating test coverage reports (#29255)
  [TEST] Fix issue with HttpInfo passed invalid parameter
  Remove all dependencies from XContentBuilder (#29225)
  Fix sporadic failure in CompositeValuesCollectorQueueTests
  Propagate ignore_unmapped to inner_hits (#29261)
  TEST: Increase timeout for testPrimaryReplicaResyncFailed
  REST client: hosts marked dead for the first time should not be immediately retried (#29230)
  TEST: Use different translog dir for a new engine
  Make SearchStats implement Writeable (#29258)
  [Docs] Spelling and grammar changes to reindex.asciidoc (#29232)
  Do not optimize append-only if seen normal op with higher seqno (#28787)
  [test] packaging: gradle tasks for groovy tests (#29046)
  Prune only gc deletes below local checkpoint (#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  #28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (#29156)
  Add file permissions checks to precommit task
  ...
DaveCTurner added a commit to elastic/elasticsearch-formal-models that referenced this pull request Mar 28, 2018
This models how indexing and deletion operations are handled on the replica,
including the optimisations for append-only operations and the interaction with
Lucene commits and the version map.

It incorporates

- elastic/elasticsearch#28787
- elastic/elasticsearch#28790
- elastic/elasticsearch#29276
- a proposal to always prune tombstones
dnhatn added a commit that referenced this pull request Mar 28, 2018
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.3.0 v7.0.0-beta1
5 participants