CCR: Following primary should process operations once #34288

Merged
merged 13 commits into elastic:master from dnhatn:ccr-index-once on Oct 10, 2018

Conversation

@dnhatn (Member) commented on Oct 4, 2018

Today we rewrite the operations from the leader with the term of the
following primary because the follower should own its history. The
problem is that a newly promoted primary may re-assign its term to
operations that the previous primary had already replicated to the
replicas. If this happens, some operations with the same seq_no may be
assigned different terms. This would not work well for the future
optimistic locking that uses a combination of seq_no and primary term.

This change ensures that the primary of a follower only processes an
operation if that operation was not processed before. The skipped
operations are guaranteed to be delivered to the replicas via either
primary-replica resync or peer recovery. However, the primary must not
acknowledge the request until the global checkpoint is at least the
highest seq_no of all skipped ops (i.e., until they have all been
processed on every replica).

Relates #31751
Relates #31113
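
A hedged sketch of the idea only, not the PR's actual code; hasProcessedBefore and waitForGlobalCheckpoint are hypothetical helpers standing in for the real bookkeeping:

    import org.elasticsearch.index.engine.Engine;
    import org.elasticsearch.index.seqno.SequenceNumbers;
    import org.elasticsearch.index.shard.IndexShard;
    import org.elasticsearch.index.translog.Translog;

    import java.io.IOException;
    import java.util.List;

    // Sketch only: apply leader operations on the following primary, skipping those
    // that an earlier primary already processed, and defer the acknowledgement until
    // the global checkpoint covers every skipped operation.
    static void applyLeaderOperations(IndexShard primary, List<Translog.Operation> operations) throws IOException {
        long maxSeqNoOfSkippedOps = SequenceNumbers.NO_OPS_PERFORMED;
        for (Translog.Operation op : operations) {
            if (hasProcessedBefore(primary, op)) { // hypothetical: op was seen under a previous term
                // Replicas are guaranteed to receive this op via primary-replica resync
                // or peer recovery, so the primary does not replicate it again.
                maxSeqNoOfSkippedOps = Math.max(maxSeqNoOfSkippedOps, op.seqNo());
            } else {
                primary.applyTranslogOperation(op, Engine.Operation.Origin.PRIMARY);
            }
        }
        // Only acknowledge once the global checkpoint is at least the highest skipped
        // seq_no, i.e. every skipped op has been processed on every replica.
        waitForGlobalCheckpoint(primary, maxSeqNoOfSkippedOps);
    }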

@dnhatn added the >non-issue and :Distributed/CCR (Issues around the Cross Cluster State Replication features) labels on Oct 4, 2018
@elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

@bleskes (Contributor) left a comment

Looking awesome. I left some nits and suggestions.

}
}
}
assert appliedOperations.size() == sourceOperations.size() || waitingForGlobalCheckpoint != SequenceNumbers.UNASSIGNED_SEQ_NO;
bleskes (Contributor):

can you add a message?

for (final Translog.Operation operation : request.getOperations()) {
final Engine.Result result = replica.applyTranslogOperation(operation, Engine.Operation.Origin.REPLICA);
if (result.getResultType() != Engine.Result.Type.SUCCESS) {
assert false : "failure should never happens on replicas; op=[" + operation + "] error=" + result.getFailure() + "]";
bleskes (Contributor):

doc level failure (normal failures are OK from an algorithmic perspective).

listener.onFailure(e);
} else {
assert waitingForGlobalCheckpoint <= gcp : waitingForGlobalCheckpoint + " > " + gcp;
fillResponse.run();
bleskes (Contributor):

fillResponse can throw an already-closed exception. We should make sure we deal with exceptions here correctly.

bleskes (Contributor):

Maybe wrap the listener using ActionListener#wrap, which does the right things and will simplify the code here too.
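
For illustration, a minimal sketch of that suggestion, assuming a hypothetical buildResponse helper; ActionListener#wrap forwards any exception thrown while building the response to the delegate's onFailure:

    import org.elasticsearch.action.ActionListener;
    import org.elasticsearch.common.CheckedFunction;

    // Sketch only: if building the response throws (e.g. an already-closed exception),
    // ActionListener.wrap routes it to the delegate's onFailure instead of letting it escape.
    static <T> ActionListener<Long> respondWhenGlobalCheckpointAdvances(
            ActionListener<T> delegate, CheckedFunction<Long, T, Exception> buildResponse) {
        return ActionListener.wrap(
                globalCheckpoint -> delegate.onResponse(buildResponse.apply(globalCheckpoint)),
                delegate::onFailure);
    }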


@Override
public NoOpResult noOp(NoOp noOp) {
// TODO: Make sure we process NoOp once.
bleskes (Contributor):

why can't we do this now in this PR in the same way?

dnhatn (Member Author):

This is because NoOps don't have an _id and they are processed without the _id lock. I am not sure if we need to introduce a fake _id (for locking purposes) for NoOps, so I prefer to do it in a separate PR where we can see it more clearly.

bleskes (Contributor):

+1

}
}
for (IndexShard replica : follower.getReplicas()) {
try (Translog.Snapshot rSnapshot = replica.newChangesSnapshot("test", 0, Long.MAX_VALUE, false)) {
bleskes (Contributor):

same comment - can we check the content of the ops?

expectThrows(ElasticsearchTimeoutException.class, () -> listener.actionGet(TimeValue.timeValueMillis(1)));

shard.updateGlobalCheckpointOnReplica(randomLongBetween(waitingForGlobalCheckpoint, shard.getLocalCheckpoint()), "test");
assertThat(listener.actionGet(TimeValue.timeValueSeconds(5)).getMaxSeqNo(), equalTo(shard.seqNoStats().getMaxSeqNo()));
bleskes (Contributor):

can we make this just get()? I'm not so comfortable with 5s (it's short), but also we typically let the suite time out so we can get a thread dump (although I suspect it won't be that helpful here, it might be).

long waitingForGlobalCheckpoint = randomLongBetween(-1, shard.getGlobalCheckpoint());
CcrWritePrimaryResult primaryResult = new CcrWritePrimaryResult(request, null, shard, waitingForGlobalCheckpoint, logger);
primaryResult.respond(listener);
assertThat(listener.actionGet(TimeValue.timeValueSeconds(5)).getMaxSeqNo(), equalTo(shard.seqNoStats().getMaxSeqNo()));
bleskes (Contributor):

same comment

for (Engine.Operation op : operations) {
Engine.Operation.Origin nonPrimary = randomValueOtherThan(Engine.Operation.Origin.PRIMARY,
() -> randomFrom(Engine.Operation.Origin.values()));
Engine.Result result = applyOperation(followingEngine, op, nonPrimary);
bleskes (Contributor):

any chance we can also check that this wasn't indexed to lucene? maybe doc counts?
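
As an illustration, a hedged sketch of what such a check could look like, assuming the test's followingEngine and an expectedDocCount tracked elsewhere in the test:

    // Sketch only: verify the skipped operations were not indexed into Lucene by
    // comparing the live doc count of the following engine with the expected count.
    followingEngine.refresh("test");
    try (Engine.Searcher searcher = followingEngine.acquireSearcher("test")) {
        assertThat(searcher.reader().numDocs(), equalTo(expectedDocCount));
    }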

@dnhatn (Member Author) commented on Oct 9, 2018

@bleskes Thanks so much for taking the time to review this. I have addressed your suggestions. Could you please have another look?

@dnhatn requested a review from bleskes on October 9, 2018 18:49
@bleskes (Contributor) left a comment

LGTM

final SeqNoStats seqNoStats = primary.seqNoStats();
// return a fresh global checkpoint after the operations have been replicated for the shard follow task
bleskes (Contributor):

why lose the comment?

@dnhatn (Member Author) commented on Oct 10, 2018

Thanks @bleskes.

@dnhatn merged commit 33791ac into elastic:master on Oct 10, 2018
@dnhatn deleted the ccr-index-once branch on October 10, 2018 19:40
dnhatn added a commit that referenced this pull request Oct 10, 2018
This issue was resolved by #34288.

Closes #33337
Relates #34288
dnhatn added a commit that referenced this pull request Oct 11, 2018
dnhatn added a commit that referenced this pull request Oct 11, 2018
@dnhatn (Member Author) commented on Oct 11, 2018

Sadly, we might hit a deadlock if the FollowTask has more fetchers than writers.

Suppose the leader has two operations [seq#0, seq#1]; the FollowTask has two fetchers with fetch-size=1 and one writer with write-size=1.

  1. The FollowTask issues two concurrent fetch requests: {from_seq_no: 0, num_ops:1} and {from_seq_no: 1, num_ops:1}
  2. The request which fetches [seq#1] completes first; it then triggers a write request containing only seq#1
  3. The primary of the follower fails after it has replicated seq#1 to its replicas
  4. Since the old primary did not respond, the FollowTask resends the previous write request containing seq#1
  5. The new primary already has seq#1; thus it won't replicate seq#1 to the replicas but will wait for the global checkpoint to advance to at least seq#1.

The problem is that the FollowTask has only one writer, and that writer is waiting for seq#0, which won't be delivered until the writer completes.

One solution that I see is to delay a write request if there is a gap between the last write request and the next one (the fetched operations are already sorted by seq_no).
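
A hedged sketch of such a gap check (not the eventual fix; buffer and lastWrittenSeqNo stand in for the FollowTask's existing bookkeeping):

    import org.elasticsearch.index.translog.Translog;

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: hand a batch to the writer only while the buffered ops form a
    // contiguous run after the last written seq_no; stop at the first gap so the
    // writer is never blocked waiting for an op that has not been fetched yet.
    static List<Translog.Operation> nextWriteBatch(List<Translog.Operation> buffer, // sorted by seq_no
                                                   long lastWrittenSeqNo, int maxBatchSize) {
        List<Translog.Operation> batch = new ArrayList<>();
        long expectedSeqNo = lastWrittenSeqNo + 1;
        for (Translog.Operation op : buffer) {
            if (op.seqNo() != expectedSeqNo || batch.size() >= maxBatchSize) {
                break; // gap (or full batch): delay the write request until the missing ops arrive
            }
            batch.add(op);
            expectedSeqNo++;
        }
        return batch;
    }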

@dnhatn (Member Author) commented on Oct 11, 2018

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Oct 11, 2018
dnhatn added a commit that referenced this pull request Oct 19, 2018
Since #34288, we might hit deadlock if the FollowTask has more fetchers
than writers. This can happen in the following scenario:

Suppose the leader has two operations [seq#0, seq#1]; the FollowTask has
two fetchers and one writer.

1. The FollowTask issues two concurrent fetch requests: {from_seq_no: 0,
num_ops:1} and {from_seq_no: 1, num_ops:1} to read seq#0 and seq#1
respectively.

2. The second request, which fetches seq#1, completes first and then
triggers a write request containing only seq#1.

3. The primary of a follower fails after it has replicated seq#1 to
replicas.

4. Since the old primary did not respond, the FollowTask issues another
write request containing seq#1 (resend the previous write request).

5. The new primary has seq#1 already; thus it won't replicate seq#1 to
replicas but will wait for the global checkpoint to advance at least
seq#1.

The problem is that the FollowTask has only one writer, and that writer
is waiting for seq#0, which won't be delivered until the writer completes.

This PR proposes to replicate existing operations with the old primary
term (instead of the current term) on the follower. In particular, when
the following primary detects that it has already processed an operation,
it looks up the term of the existing operation with the same seq_no
in the Lucene index and rewrites the incoming operation with that old
term before replicating it to the following replicas. This approach is
wait-free but requires soft-deletes on the follower.

Relates #34288
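
A hedged sketch of that approach; alreadyProcessed, lookupPrimaryTermForSeqNo, and rewriteWithTerm are hypothetical stand-ins for the engine machinery that soft-deletes enables:

    import org.elasticsearch.index.shard.IndexShard;
    import org.elasticsearch.index.translog.Translog;

    import java.io.IOException;

    // Sketch only: an operation the follower primary has already processed keeps the
    // term it was originally indexed with, so replicas always see one seq_no/term pair.
    static Translog.Operation resolveOperationForReplication(IndexShard followerPrimary,
                                                             long currentPrimaryTerm,
                                                             Translog.Operation op) throws IOException {
        if (alreadyProcessed(followerPrimary, op)) {
            // Look up the term of the existing op with the same seq_no in the Lucene index
            // (soft-deletes keep it addressable) and reuse it instead of the current term.
            long existingTerm = lookupPrimaryTermForSeqNo(followerPrimary, op.seqNo());
            return rewriteWithTerm(op, existingTerm);
        }
        // Not seen before: rewrite with the current term of the following primary as usual.
        return rewriteWithTerm(op, currentPrimaryTerm);
    }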
dnhatn added a commit that referenced this pull request Oct 20, 2018
dnhatn added a commit that referenced this pull request Oct 21, 2018
dnhatn added a commit that referenced this pull request Oct 21, 2018
kcm pushed a commit that referenced this pull request Oct 30, 2018
kcm pushed a commit that referenced this pull request Oct 30, 2018
kcm pushed a commit that referenced this pull request Oct 30, 2018
kcm pushed a commit that referenced this pull request Oct 30, 2018