Throw back replica local checkpoint on new primary #25452

jasontedor · 2017-06-28T18:38:29Z

This commit causes a replica to throw back its local checkpoint to the global checkpoint when learning of a new primary through a replica operation.

Relates #10708, relates #25355

ywelsch · 2017-06-29T15:55:23Z

core/src/main/java/org/elasticsearch/index/seqno/LocalCheckpointTracker.java

+        assert checkpoint <= this.checkpoint;
+        processedSeqNo.clear();
+        firstProcessedSeqNo = checkpoint + 1;
+        nextSeqNo = checkpoint + 1;


I think that resetting nextSeqNo is incorrect. Assume that the primary-replica resync fails and that the shard here is then promoted to primary. In that case it would reuse sequence numbers and overwrite operations it already had. I'll reach out to discuss.

We had a very long discussion about this. The solution here is fine if we add a follow-up that resets the local checkpoint tracker state on a primary during promotion: the newly promoted primary needs to reset its local checkpoint and mark the sequence numbers in its translog as completed to reestablish the state of the local checkpoint tracker, and it has to do this before filling the gaps.

Also, such a follow-up will introduce a test that captures the problem here, namely that if we do not do something as outlined above, in this scenario a newly promoted primary can overwrite history.

ywelsch · 2017-06-30T07:38:29Z

core/src/main/java/org/elasticsearch/index/shard/IndexShard.java

+                                    getLocalCheckpoint(),
+                                    globalCheckpoint,
+                                    globalCheckpoint);
+                            getEngine().seqNoService().resetLocalCheckpoint(globalCheckpoint);


The global checkpoint that is provided by the new primary might be lower than the global checkpoint that we currently have (e.g. the failed primary did communicate the latest global checkpoint to us, but not to the newly appointed primary).
First we have to update the global checkpoint, then use the newly computed global checkpoint to reset the local checkpoint, otherwise the local checkpoint could end up below the global checkpoint.

I pushed 5e9d79f.
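The ordering concern in this thread can be illustrated with a minimal sketch. The class `SketchReplica` and its fields are hypothetical and heavily simplified; they are not the actual `IndexShard` code. The sketch assumes that the global checkpoint never moves backwards and that the local checkpoint is only ever lowered by the reset.

```java
// Hypothetical, heavily simplified replica; names are illustrative only and not Elasticsearch APIs.
class SketchReplica {
    private long primaryTerm;
    private long globalCheckpoint;
    private long localCheckpoint;

    SketchReplica(long primaryTerm, long globalCheckpoint, long localCheckpoint) {
        this.primaryTerm = primaryTerm;
        this.globalCheckpoint = globalCheckpoint;
        this.localCheckpoint = localCheckpoint;
    }

    // Called when a replica operation arrives carrying the primary's term and global checkpoint.
    void onReplicaOperation(long operationPrimaryTerm, long globalCheckpointFromPrimary) {
        if (operationPrimaryTerm <= primaryTerm) {
            return; // no new primary detected, nothing to reset
        }
        primaryTerm = operationPrimaryTerm;
        // Step 1: update the global checkpoint first. The value from the new primary may be
        // lower than the one we already know, so we keep the maximum (assumption of this sketch).
        globalCheckpoint = Math.max(globalCheckpoint, globalCheckpointFromPrimary);
        // Step 2: only now throw the local checkpoint back, using the newly computed global
        // checkpoint, so the local checkpoint cannot end up below the global checkpoint.
        localCheckpoint = Math.min(localCheckpoint, globalCheckpoint);
    }

    long getPrimaryTerm() { return primaryTerm; }
    long getGlobalCheckpoint() { return globalCheckpoint; }
    long getLocalCheckpoint() { return localCheckpoint; }
}
```

For example, a replica that already knows a global checkpoint of 5 and then hears from a new primary that only knows 3 keeps 5 and resets its local checkpoint to 5, not 3.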

ywelsch

I requested a change in how nextSeqNo is updated and left 2 nits.

ywelsch · 2017-07-01T09:11:10Z

core/src/main/java/org/elasticsearch/index/seqno/LocalCheckpointTracker.java

+        assert checkpoint <= this.checkpoint;
+        processedSeqNo.clear();
+        firstProcessedSeqNo = checkpoint + 1;
+        nextSeqNo = checkpoint + 1;


Thinking about this some more, I agree with the assessment we had, except for one thing: we should not reset the nextSeqNo variable, which is exposed as getMaxSeqNo. Otherwise, when writing out segments, the max sequence number that we take from the local checkpoint tracker would be incorrect, i.e. there could be a document in the segment whose sequence number is above the max.

Put differently, nextSeqNo is not tied to the bit set (which represents the pending confirmation marker). Instead it tracks the actual translog.
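The distinction drawn here can be sketched as follows. `SketchCheckpointTracker` is a hypothetical, simplified stand-in written for illustration; it borrows the field names from the diff above but is not the real `LocalCheckpointTracker` and makes no claim about its internals beyond what this review shows. The point of the sketch is that the reset clears the bit set and lowers the checkpoint while leaving nextSeqNo untouched.

```java
import java.util.BitSet;

// Hypothetical, simplified tracker for illustration; not the Elasticsearch class.
class SketchCheckpointTracker {
    private final BitSet processedSeqNo = new BitSet();
    private long firstProcessedSeqNo; // sequence number represented by bit 0
    private long checkpoint;          // all sequence numbers <= checkpoint are completed
    private long nextSeqNo;           // next sequence number to hand out

    SketchCheckpointTracker(long maxSeqNo, long localCheckpoint) {
        this.checkpoint = localCheckpoint;
        this.firstProcessedSeqNo = localCheckpoint + 1;
        this.nextSeqNo = maxSeqNo + 1;
    }

    long generateSeqNo() {
        return nextSeqNo++;
    }

    void markSeqNoAsCompleted(long seqNo) {
        if (seqNo >= nextSeqNo) {
            nextSeqNo = seqNo + 1; // keep the max in step with what we have seen
        }
        if (seqNo <= checkpoint) {
            return; // already covered by the checkpoint
        }
        processedSeqNo.set(Math.toIntExact(seqNo - firstProcessedSeqNo));
        // Advance the checkpoint over any contiguous run of completed sequence numbers.
        while (processedSeqNo.get(Math.toIntExact(checkpoint + 1 - firstProcessedSeqNo))) {
            checkpoint++;
        }
    }

    void resetCheckpoint(long newCheckpoint) {
        if (newCheckpoint > checkpoint) {
            throw new IllegalArgumentException("can only throw the checkpoint back");
        }
        processedSeqNo.clear();
        firstProcessedSeqNo = newCheckpoint + 1;
        checkpoint = newCheckpoint;
        // nextSeqNo is deliberately left alone: it reflects operations that already exist,
        // so getMaxSeqNo() must not go backwards.
    }

    long getCheckpoint() { return checkpoint; }
    long getMaxSeqNo() { return nextSeqNo - 1; }
}
```

In this sketch, resetting from a checkpoint of 4 to 2 leaves getMaxSeqNo() at 4, so a later generateSeqNo() returns 5 rather than reusing 3.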

ywelsch · 2017-07-01T09:16:28Z

core/src/main/java/org/elasticsearch/index/shard/IndexShard.java

@@ -2057,7 +2057,16 @@ public void acquireReplicaOperationPermit(final long operationPrimaryTerm, final
                            assert operationPrimaryTerm > primaryTerm :
                                "shard term already update.  op term [" + operationPrimaryTerm + "], shardTerm [" + primaryTerm + "]";
                            primaryTerm = operationPrimaryTerm;
+                            logger.trace(
+                                    "detected new primary with primary term [{}], "
+                                            + "resetting local checkpoint from [{}] to [{}], "


This log line is incorrect: at this point we don't yet know the value to which we are going to reset the local checkpoint. It is only determined after setting the global checkpoint in the line below. I think it's easiest to move the logging one line below and use getGlobalCheckpoint(). I would also leave out the part which says "updating global checkpoint to {}", as the given value might be below the current global checkpoint, which could make this message misleading (we already have trace logging for the global checkpoint updates).

ywelsch · 2017-07-01T09:17:44Z

core/src/test/java/org/elasticsearch/index/seqno/LocalCheckpointTrackerTests.java

-                    .build()),
-            SequenceNumbersService.NO_OPS_PERFORMED,
-            SequenceNumbersService.NO_OPS_PERFORMED
+                IndexSettingsModule.newIndexSettings(


why reformat?

jasontedor · 2017-07-03T20:49:02Z

Thanks @ywelsch, I have addressed your feedback.

ywelsch

I think there is one more edge-case that needs to be covered (when global checkpoint is SequenceNumbersService.UNASSIGNED_SEQ_NO), otherwise PR looks good to me.

ywelsch · 2017-07-04T13:20:28Z

core/src/test/java/org/elasticsearch/index/seqno/LocalCheckpointTrackerTests.java

+
+    public void testResetCheckpoint() {
+        final int operations = 1024 - scaledRandomIntBetween(0, 1024);
+        int maxSeqNo = Math.toIntExact(SequenceNumbersService.NO_OPS_PERFORMED);


neat, I did not know about this method

ywelsch · 2017-07-04T13:29:12Z

core/src/main/java/org/elasticsearch/index/shard/IndexShard.java

+                                    operationPrimaryTerm,
+                                    getLocalCheckpoint(),
+                                    getGlobalCheckpoint());
+                            getEngine().seqNoService().resetLocalCheckpoint(getGlobalCheckpoint());


As we are feeding this method the current global checkpoint, which could be still unknown, is it possible that we call resetLocalCheckpoint with SequenceNumbersService.UNASSIGNED_SEQ_NO? If so, I think that that would be bad. The method resetLocalCheckpoint should have an assertion, similar to the constructor. Also we need to make sure to special-case this.

I do not think this is possible after we address #25415. A newly created primary will update its local checkpoint to -1 and calculate a global checkpoint of -1. Replicas that recover from this primary will receive a global checkpoint of -1 that they would maintain if promoted. Similarly for relocation. Thus I think that we will never see -2 here.

I think we should only add an assertion here.

To recap a discussion we had via another channel: we do have to worry about -2 here in the case when a primary on 5.x dies and a replica on 6.x is promoted and initiates a re-sync to another 6.x replica. I pushed d1e0ec2.
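One plausible shape for such a special case is sketched below. The conversation does not show what d1e0ec2 actually does, so the mapping chosen here (falling back to "no operations performed" when the global checkpoint is unknown) is an assumption, and `SketchResetTarget` with its method is hypothetical. The two constants mirror the values -1 and -2 discussed in this thread.

```java
// Hypothetical helper for illustration; not part of Elasticsearch.
class SketchResetTarget {
    static final long NO_OPS_PERFORMED = -1L;  // value discussed above for a fresh primary
    static final long UNASSIGNED_SEQ_NO = -2L; // global checkpoint not yet known

    // Picks the value the local checkpoint would be reset to for a given global checkpoint.
    static long localCheckpointResetTarget(long currentGlobalCheckpoint) {
        if (currentGlobalCheckpoint == UNASSIGNED_SEQ_NO) {
            // Assumption of this sketch: an unknown global checkpoint is treated as
            // "no operations performed" so the tracker is never reset to -2.
            return NO_OPS_PERFORMED;
        }
        return currentGlobalCheckpoint;
    }
}
```

With this helper, `localCheckpointResetTarget(-2)` yields -1, while any known global checkpoint is passed through unchanged.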

jasontedor · 2017-07-04T14:21:52Z

@ywelsch I addressed your feedback, would you look again?

ywelsch

LGTM.

jasontedor requested a review from ywelsch June 28, 2017 18:38

bleskes mentioned this pull request Jun 28, 2017

Sequence Numbers related work slated for 6.0.0 #25355

Closed

9 tasks

Throw back replica local checkpoint on new primary

2115f4a

This commit causes a replica to throw back its local checkpoint to the global checkpoint when learning of a new primary through a replica operation.

jasontedor force-pushed the local-checkpoint-throwback branch from f09a99f to 2115f4a Compare June 28, 2017 18:39

jasontedor changed the title ~~Throwback replica local checkpoint on new primary~~ Throw back replica local checkpoint on new primary Jun 28, 2017

Checkstyle

8f74e92

ywelsch reviewed Jun 29, 2017

ywelsch reviewed Jun 30, 2017

Fix order and beef up test

5e9d79f

ywelsch suggested changes Jul 1, 2017

jasontedor added 4 commits July 3, 2017 16:02

Do not reset max seq no

1e7cee9

Fix logging statement

f33925d

Revert accidental format changes

cbe568b

Remove import

0174af4

ywelsch suggested changes Jul 4, 2017

jasontedor added 2 commits July 4, 2017 10:20

Add assertion

385c948

jasontedor added 6 commits July 4, 2017 10:48

Fix test

93e751f

Fix tests

e8e4544

Revert "Fix tests"

49742d4

This reverts commit e8e4544.

Revert "Fix test"

5064e8c

This reverts commit 93e751f.

Revert "Add assertion"

5640e2a

This reverts commit 385c948.

Special case

d1e0ec2

jasontedor requested a review from ywelsch July 4, 2017 16:30

jasontedor added 2 commits July 4, 2017 15:25

Fix test

e1bc5c7

Cleanup

6d67289

ywelsch approved these changes Jul 5, 2017

jasontedor merged commit 7dcd81b into elastic:master Jul 5, 2017

jasontedor deleted the local-checkpoint-throwback branch July 5, 2017 13:17

bleskes mentioned this pull request Jul 10, 2017

Add Sequence Numbers to write operations #10708

Closed

64 tasks

clintongormley added :Sequence IDs >enhancement v6.0.0 labels Jul 10, 2017

colings86 added v6.0.0-beta1 and removed v6.0.0 labels Jul 31, 2017

clintongormley added the :Distributed/Engine label and removed the :Sequence IDs label Feb 14, 2018
