
Prune only gc deletes below the local checkpoint #28790

Merged
merged 15 commits into from Mar 26, 2018

Conversation

@dnhatn
Contributor

commented Feb 22, 2018

Once a document is deleted and Lucene is refreshed, we will not be able to look up the version/seq# associated with that delete in Lucene. As conflicting operations can still be indexed, we need another mechanism to remember these deletes. Therefore deletes should still be stored in the Version Map, even after Lucene is refreshed. Obviously, we can't remember all deletes forever, so a trimming mechanism is needed. Currently, we remember deletes for at least 1 minute (the default GC deletes cycle) and clean them up periodically. This is, at the moment, the best we can do on the primary for user-facing APIs, but this arbitrary time limit is problematic for replicas. Furthermore, we can't rely on the primary and replicas doing the trimming in a synchronized manner, and failing to do so results in the replica and primary making different decisions. The following scenario can cause inconsistency between primary and replica:

  1. Primary index doc (index, id=1, v2)
  2. Network packet issue causes index operation to back off and wait
  3. Primary deletes doc (delete, id=1, v3)
  4. Replica processes delete (delete, id=1, v3)
  5. 1+ minute passes (GC deletes runs on the replica)
  6. The indexing op is finally sent to the replica, which now processes it because it has forgotten about the delete.
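The failure mode above can be sketched as a tiny standalone simulation. These classes and method names are hypothetical, not the actual Elasticsearch engine code; the point is only that age-based pruning alone lets a stale, out-of-order index op through once the tombstone is forgotten:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not real Elasticsearch classes): a replica that prunes
// delete tombstones by age alone can wrongly apply a stale out-of-order index op.
public class StaleOpDemo {
    static final long GC_DELETES_MS = 60_000;

    static class Tombstone {
        final long version;
        final long time;
        Tombstone(long version, long time) { this.version = version; this.time = time; }
    }

    static final Map<String, Long> lucene = new HashMap<>();          // id -> latest indexed version
    static final Map<String, Tombstone> tombstones = new HashMap<>(); // the version map's deletes

    static void replicaDelete(String id, long version, long now) {
        lucene.remove(id);
        tombstones.put(id, new Tombstone(version, now));
    }

    // The flawed, time-only trim: forgets deletes after one GC cycle.
    static void pruneByTimeOnly(long now) {
        tombstones.values().removeIf(t -> now - t.time > GC_DELETES_MS);
    }

    // Returns true if the op is applied, false if detected as stale and discarded.
    static boolean replicaIndex(String id, long version) {
        long known = lucene.getOrDefault(id,
                tombstones.containsKey(id) ? tombstones.get(id).version : -1L);
        if (version <= known) {
            return false; // stale: a newer version (or delete) is known
        }
        lucene.put(id, version);
        return true;
    }

    public static void main(String[] args) {
        replicaDelete("1", 3, 0);               // step 4: replica applies (delete, id=1, v3)
        pruneByTimeOnly(61_000);                // step 5: 1+ minute passes, tombstone is pruned
        boolean applied = replicaIndex("1", 2); // step 6: delayed (index, id=1, v2) arrives
        // prints: stale op applied (doc resurrected)
        System.out.println(applied ? "stale op applied (doc resurrected)" : "stale op discarded");
    }
}
```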

We can rely on sequence numbers to prevent this issue. If we prune only deletes whose seq# is at most the local checkpoint, a replica will correctly remember what it needs. The correctness argument is as follows:

Suppose o1 and o2 are two operations on the same document with seq#(o1) < seq#(o2), and o2 arrives before o1 on the replica. o2 is processed normally since it arrives first; when o1 arrives it should be discarded:

  • If seq#(o1) <= LCP, then it will not be added to Lucene, as it was already previously added.
  • If seq#(o1) > LCP, then it depends on the nature of o2:
    • If o2 is a delete then its seq# is recorded in the VersionMap, since seq#(o2) > seq#(o1) > LCP,
      so a lookup can find it and determine that o1 is stale.
    • If o2 is an indexing operation then its seq# is either in Lucene (if refreshed) or in the VersionMap (if not yet refreshed), so a real-time lookup can find it and determine that o1 is stale.
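The resulting rule (prune a tombstone only when it is both older than one GC cycle and at or below the local checkpoint) can be condensed into a small standalone sketch. This is a hypothetical form for illustration; the real check is LiveVersionMap's canRemoveTombstone operating on a DeleteVersionValue:

```java
// Hypothetical condensed form of the fixed prune rule: a delete tombstone may
// be removed only when it is BOTH older than one GC cycle AND its seq# is at
// or below the local checkpoint, so deletes above the LCP are always retained
// for out-of-order conflict detection.
public class TombstonePruneRule {
    static boolean canRemoveTombstone(long maxTimestampToPrune, long maxSeqNoToPrune,
                                      long tombstoneTime, long tombstoneSeqNo) {
        // strict '<' keeps every tombstone for at least one full GC cycle
        final boolean isTooOld = tombstoneTime < maxTimestampToPrune;
        // safe: all ops at or below the local checkpoint are already in Lucene
        final boolean isSafe = tombstoneSeqNo <= maxSeqNoToPrune;
        return isTooOld && isSafe;
    }

    public static void main(String[] args) {
        long maxTimestampToPrune = 1_000; // now minus the GC interval
        long localCheckpoint = 5;
        // old enough but above the LCP: must be kept
        System.out.println(canRemoveTombstone(maxTimestampToPrune, localCheckpoint, 0, 7)); // false
        // old enough and at/below the LCP: may be pruned
        System.out.println(canRemoveTombstone(maxTimestampToPrune, localCheckpoint, 0, 5)); // true
    }
}
```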

In this PR, we prefer to deploy a single trimming strategy, which satisfies both requirements, on primary and replicas because:

  • It's simpler: there is no need to distinguish whether an engine is running in primary mode, in replica mode, or being promoted.
  • If a replica is subsequently promoted, the user experience is fully maintained, as that replica remembers deletes for the last GC cycle.

However, the version map may consume less memory if we deploy two different trimming strategies for primary and replicas.

Member

left a comment

Thx @dnhatn. I left some questions on the tests. Otherwise looks good. Like #28787, let's wait for @DaveCTurner.

Also - I'm not comfortable with the fact that we skip the local checkpoint check (i.e., not adding to lucene if seq# <= local checkpoint) for optimized operations. It's technically correct but I'd rather have that as a hard rule that's visible in the code. Can you do a follow-up for that?

equalTo(Sets.union(trimmedDeletes, rememberedDeletes)));
engine.refresh("test");
// Only prune deletes below the local checkpoint.
engine.maybePruneDeletes();


bleskes Mar 3, 2018

Member

how does this relate to clock? also refresh already maybe prunes deletes

@dnhatn

Contributor Author

commented Mar 4, 2018

@bleskes You're correct. We don't need a manual clock here. I've removed the clock and also the prune call. I will make a follow-up as you suggested.

dnhatn added 4 commits Mar 7, 2018
# Conflicts:
#	server/src/main/java/org/elasticsearch/index/engine/LiveVersionMap.java
@dnhatn

Contributor Author

commented Mar 9, 2018

@bleskes I've mixed time-based and sequence-based conditions into testPruneOnlyDeletesAtMostLocalCheckpoint as we discussed. Can you take a look? Thank you.

dnhatn added 2 commits Mar 9, 2018
Member

left a comment

Barring discussions with @DaveCTurner about the TLA+ model, this LGTM. I left some nits that don't need another review.

@@ -4572,4 +4577,67 @@ public void testStressUpdateSameDocWhileGettingIt() throws IOException, Interrup
}
}
}

public void testPruneOnlyDeletesAtMostLocalCheckpoint() throws Exception {
IOUtils.close(engine, store);


bleskes Mar 12, 2018

Member

can we try not to mess with the class's engine? I'd rather have a local one enclosed in a try-with-resources


dnhatn Mar 26, 2018

Author Contributor

Done

clock.set(randomLongBetween(gcInterval, deleteBatch + gcInterval));
engine.refresh("test");
tombstones.removeIf(v -> v.seqNo < gapSeqNo && v.time < clock.get() - gcInterval);
assertThat(engine.getDeletedTombstones(), containsInAnyOrder(tombstones.toArray()));


bleskes Mar 12, 2018

Member

don't we need to check the size too?


dnhatn Mar 26, 2018

Author Contributor

We don't have to, as containsInAnyOrder also does a size check.

shards.startAll();
final IndexShard primary = shards.getPrimary();
final IndexShard replica = shards.getReplicas().get(0);
final TimeValue gcInterval = TimeValue.timeValueMillis(scaledRandomIntBetween(1, 1000));


bleskes Mar 12, 2018

Member

I think we can just set this to something very small (10ms?) and also set ThreadPool#ESTIMATED_TIME_INTERVAL_SETTING to 0?


dnhatn Mar 26, 2018

Author Contributor

done

-        final boolean isTooOld = currentTime - versionValue.time > pruneInterval;
+    private boolean canRemoveTombstone(long maxTimestampToPrune, long maxSeqNoToPrune, DeleteVersionValue versionValue) {
+        // check if the value is old enough and safe to be removed
+        final boolean isTooOld = versionValue.time < maxTimestampToPrune;


DaveCTurner Mar 13, 2018

Contributor

Why not <=?


dnhatn Mar 13, 2018

Author Contributor

Because we would like to keep delete tombstones for at least one GC cycle.

-        final boolean isTooOld = currentTime - versionValue.time > pruneInterval;
+        final boolean isTooOld = versionValue.time < maxTimestampToPrune;
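The boundary case behind that answer can be checked directly. In this hypothetical standalone form (mirroring the diff above, with maxTimestampToPrune computed as now minus the GC interval), strict < retains a tombstone whose timestamp is exactly one GC cycle old, while <= would prune it:

```java
// Boundary check for the strict '<' comparison: a tombstone deleted exactly
// one GC cycle ago is still retained, so deletes survive for at least one
// full GC cycle.
public class StrictBoundary {
    public static void main(String[] args) {
        long now = 60_000, gcInterval = 60_000;
        long maxTimestampToPrune = now - gcInterval; // == 0
        long tombstoneTime = 0;                      // deleted exactly one GC cycle ago
        System.out.println(tombstoneTime < maxTimestampToPrune);  // false: kept with '<'
        System.out.println(tombstoneTime <= maxTimestampToPrune); // true: would be pruned with '<='
    }
}
```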
@s1monw
s1monw approved these changes Mar 13, 2018
Contributor

left a comment

LGTM too!

DaveCTurner added a commit to DaveCTurner/elasticsearch-formal-models that referenced this pull request Mar 26, 2018
Contributor

left a comment

LGTM.

dnhatn added 4 commits Mar 26, 2018
# Conflicts:
#	server/src/test/java/org/elasticsearch/index/engine/LiveVersionMapTests.java
#	server/src/test/java/org/elasticsearch/index/replication/ESIndexLevelReplicationTestCase.java
@dnhatn

Contributor Author

commented Mar 26, 2018

Thanks @bleskes, @s1monw and @DaveCTurner for reviewing.

@dnhatn dnhatn merged commit 8795760 into elastic:master Mar 26, 2018
2 checks passed
CLA Commit author is a member of Elasticsearch
Details
elasticsearch-ci Build finished.
Details
@dnhatn dnhatn deleted the dnhatn:gc-delete branch Mar 26, 2018
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 27, 2018
* master:
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 27, 2018
* master: (40 commits)
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  elastic#28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (elastic#29156)
  Add file permissions checks to precommit task
  Remove execute mode bit from source files
  Optimize the composite aggregation for match_all and range queries (elastic#28745)
  [Docs] Add rank_eval size parameter k (elastic#29218)
  [DOCS] Remove ignore_z_value parameter link
  Docs: Update docs/index_.asciidoc (elastic#29172)
  Docs: Link C++ client lib elasticlient (elastic#28949)
  [DOCS] Unregister repository instead of deleting it (elastic#29206)
  Docs: HighLevelRestClient#multiSearch (elastic#29144)
  Add Z value support to geo_shape
  Remove type casts in logging in server component (elastic#28807)
  Change BroadcastResponse from ToXContentFragment to ToXContentObject (elastic#28878)
  REST : Split `RestUpgradeAction` into two actions (elastic#29124)
  Add error file docs to important settings
  ...
martijnvg added a commit that referenced this pull request Mar 28, 2018
* es/master: (22 commits)
  Fix building Javadoc JARs on JDK for client JARs (#29274)
  Require JDK 10 to build Elasticsearch (#29174)
  Decouple NamedXContentRegistry from ElasticsearchException (#29253)
  Docs: Update generating test coverage reports (#29255)
  [TEST] Fix issue with HttpInfo passed invalid parameter
  Remove all dependencies from XContentBuilder (#29225)
  Fix sporadic failure in CompositeValuesCollectorQueueTests
  Propagate ignore_unmapped to inner_hits (#29261)
  TEST: Increase timeout for testPrimaryReplicaResyncFailed
  REST client: hosts marked dead for the first time should not be immediately retried (#29230)
  TEST: Use different translog dir for a new engine
  Make SearchStats implement Writeable (#29258)
  [Docs] Spelling and grammar changes to reindex.asciidoc (#29232)
  Do not optimize append-only if seen normal op with higher seqno (#28787)
  [test] packaging: gradle tasks for groovy tests (#29046)
  Prune only gc deletes below local checkpoint (#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  #28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (#29156)
  Add file permissions checks to precommit task
  ...
DaveCTurner added a commit to elastic/elasticsearch-formal-models that referenced this pull request Mar 28, 2018
This models how indexing and deletion operations are handled on the replica,
including the optimisations for append-only operations and the interaction with
Lucene commits and the version map.

It incorporates

- elastic/elasticsearch#28787
- elastic/elasticsearch#28790
- elastic/elasticsearch#29276
- a proposal to always prune tombstones
dnhatn added a commit that referenced this pull request Mar 28, 2018
@colings86 colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019