
Rewrite shard follow node task logic #31581

Merged
merged 46 commits into elastic:ccr from ccr_follow_shard_task_rewrite_2 on Jul 10, 2018

Conversation

martijnvg
Member

@martijnvg martijnvg commented Jun 26, 2018

The current shard follow mechanism is complex and does not give us easy ways to gain visibility into the system (e.g. why we are falling behind).
The main reason it is complex is that the current design is highly asynchronous. Also, in the current model it is hard to apply backpressure
other than by reducing the number of concurrent reads from the leader shard.

This PR has the following changes:

  • Rewrote the shard follow task to coordinate the shard follow mechanism between a leader and follower shard in a single-threaded manner.
    This allows for better unit testing and makes it easier to add stats.
  • All write operations read from the shard changes api are added to a buffer instead of being sent directly to the bulk shard operations api.
    This makes it possible to apply backpressure (see the sketch below): a limit controls how many write ops are allowed in the buffer, after which no new reads
    will be performed until the number of ops drops below that limit.
  • The shard changes api includes the current global checkpoint on the leader shard copy. This allows reading to be a more self-sufficient process,
    instead of relying on a background thread to fetch the leader shard's global checkpoint.
  • Reading write operations from the leader shard (via the shard changes api) is a separate step from writing those operations (via the bulk shard operations api),
    whereas before a read would immediately result in a write.
  • The bulk shard operations api returns the local checkpoint on the follow primary shard, to keep the shard follow task up to date with what has been written.
  • Moved the shard follow logic that was previously in ShardFollowTasksExecutor to ShardFollowNodeTask.
  • Moved over the changes from #31242 ([CCR] Made shard follow task more resilient against node failure) to make the shard follow mechanism resilient to node and shard failures.

Relates to #30086
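
As a rough illustration of the buffer-based backpressure described above, here is a sketch of the read-coordination loop. It is illustrative only, not the exact implementation; names such as hasReadBudget, maxConcurrentReadBatches and maxWriteBufferSize follow the ones discussed later in this review.

    // Illustrative sketch only; the real ShardFollowNodeTask differs in detail.
    private synchronized void coordinateReads() {
        while (hasReadBudget() && lastRequestedSeqno < leaderGlobalCheckpoint) {
            numConcurrentReads++;
            long from = lastRequestedSeqno + 1;
            long maxRequiredSeqNo = Math.min(leaderGlobalCheckpoint, from + maxReadSize - 1);
            sendShardChangesRequest(from, maxReadSize, maxRequiredSeqNo);
            lastRequestedSeqno = maxRequiredSeqNo;
        }
    }

    private boolean hasReadBudget() {
        assert Thread.holdsLock(this);
        // Backpressure: no new reads while the buffer holds more ops than the configured
        // limit, or while the maximum number of concurrent reads is already in flight.
        return numConcurrentReads < maxConcurrentReadBatches && buffer.size() < maxWriteBufferSize;
    }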

@martijnvg martijnvg added review :Distributed/CCR Issues around the Cross Cluster State Replication features labels Jun 26, 2018
@martijnvg martijnvg requested a review from bleskes June 26, 2018 14:06
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@@ -274,6 +287,15 @@ protected Response newResponse() {
break;
}
}
} catch (IllegalStateException e) {
// TODO: handle peek reads better.
Member Author

I know this is not cool. I'm still thinking about this.

Contributor

can you clarify what's happening here? if we move to maximum size semantics (rather than a hard limit) I don't think this is a problem? we just return empty.

Member Author

So I changed the size semantics, allowing fewer documents to be returned than were requested (as long as there are no gaps). But then I ran into a problem: when the shard follow task requested a specific range of ops it knew to be there, a replica shard copy may not have had them yet and thus fewer ops were returned. This was before I added the logic to deal with fewer operations being returned because the max_translog_bytes limit was met, and I realize now that this problem should be solved in the same way as the max translog bytes limit scenario. So I can safely change the maximum size semantics in the shard changes api.

Contributor

I don't get it. Can you point us to the place where the ISE is coming from? Is it due to the source not being available?

Member Author

This is gone now. I did make a change to LuceneChangesSnapshot's rangeCheck(...) method to not be strict about not returning up to toSeqNo.

@martijnvg martijnvg force-pushed the ccr_follow_shard_task_rewrite_2 branch from 038d90c to 516dcb7 Compare June 27, 2018 05:29
Contributor

@bleskes bleskes left a comment

Thanks @martijnvg for putting it up so quickly. I did an initial pass, mostly focused on ShardFollowNodeTask.

@@ -54,9 +54,9 @@ public Response newResponse() {
public static class Request extends SingleShardRequest<Request> {

private long minSeqNo;
private long maxSeqNo;
private Long maxSeqNo;
Contributor

can we switch everything for maxSeqNo to size? (I struggle to find a size name with an operation count in it; maybe people prefer maxOperationCount, and maxOperationSizeInBytes as the byte limiter.)

Contributor

@bleskes bleskes Jun 27, 2018

also I don't think this should be nullable (I'll comment separately about how to change the reason why it's null now)

this.operations = operations;
}

public long getIndexMetadataVersion() {
return indexMetadataVersion;
}

public long getLeaderGlobalCheckpoint() {
Contributor

can we just call it global checkpoint? everything here is leader.

Contributor

ping

@@ -274,6 +287,15 @@ protected Response newResponse() {
break;
}
}
} catch (IllegalStateException e) {
// TODO: handle peek reads better.
Contributor

can you clarify what's happening here? if we move to maximum size semantics (rather than a hard limit) I don't think this is a problem? we just return empty.

String description,
TaskId parentTask,
Map<String, String> headers,
Client leaderClient,
Contributor

Instead of passing the clients in the constructor, I would like to make this class abstract, where all the methods that require a client are abstract. Then the PersistentTaskExecutor can instantiate an implementation that delegates async requests via clients, but tests can do something else (synchronously return something, throw exceptions, or whatever).

Contributor

can't this be done with a filter client as well? I don't think we should do this with abstract classes; it will make things more complicated.
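
To make the two options concrete, here is a hypothetical sketch of the abstract-method seam; the method names and signatures are made up for this sketch, and the filter-client alternative would instead keep the class concrete and inject a wrapped Client in tests.

    // Hypothetical sketch of the testing seam discussed above.
    abstract class ShardFollowNodeTask extends AllocatedPersistentTask {

        // The production implementation delegates to the leader/follower clients;
        // unit tests can override these to return canned responses or throw exceptions.
        protected abstract void innerSendShardChangesRequest(long from, int maxOperationCount,
                                                             Consumer<ShardChangesAction.Response> handler,
                                                             Consumer<Exception> errorHandler);

        protected abstract void innerSendBulkShardOperationsRequest(List<Translog.Operation> operations,
                                                                    LongConsumer handler,
                                                                    Consumer<Exception> errorHandler);
    }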

void start(long leaderGlobalCheckpoint, long followGlobalCheckpoint) {
this.lastRequestedSeqno = followGlobalCheckpoint;
this.processedGlobalCheckpoint = followGlobalCheckpoint;
this.leaderGlobalCheckpoint = leaderGlobalCheckpoint;
Contributor

I think we can set this to the followGlobalCheckpoint and send a peek request. No need to preflight imo.


private synchronized void coordinateWrites() {
while (true) {
if (buffer.isEmpty()) {
Contributor

same request - please have a method called haveWriteBudget and do while(haveWriteBudget() && buffer.isEmpty() == false) {
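
A sketch of the suggested loop shape; hasWriteBudget and the batch-draining helper are illustrative, not the final code.

    private synchronized void coordinateWrites() {
        while (hasWriteBudget() && buffer.isEmpty() == false) {
            numConcurrentWrites++;
            // drainBatch() is a hypothetical helper that removes up to the configured
            // maximum number of operations (bounded by the byte limit) from the buffer.
            sendBulkShardOperationsRequest(drainBatch());
        }
    }

    private boolean hasWriteBudget() {
        assert Thread.holdsLock(this);
        return numConcurrentWrites < maxConcurrentWriteBatches;
    }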

}
}

private void sendShardChangesRequest(long from, Long to) {
Contributor

@bleskes bleskes Jun 27, 2018

the current to parameter represents a hard upper bound this request is responsible for. Can we name it something that reflects this, e.g. requireOperationsUpTo, and also never set it to null? if need be we can set it to from or the leader global checkpoint (and it shouldn't be used as the size limit of the request).

e -> handleFailure(e, () -> sendShardChangesRequest(from, to)));
}

private synchronized void handleResponse(long from, Long to, ShardChangesAction.Response response) {
Contributor

there is no need for synchronizing here; better to sync maybeUpdateMapping if need be?

private void handleFailure(Exception e, Runnable task) {
assert e != null;
if (shouldRetry(e)) {
if (isStopped() == false && retryCounter.incrementAndGet() <= RETRY_LIMIT) {
Contributor

@bleskes bleskes Jun 27, 2018

this retry counter is tricky as we need to have a budget that allows all current readers/writers to fail on a network hiccup. There's also the question of how people know what happened when the task has failed (where we might need support from persistent tasks). I think we can leave this for now but have to deal with it in a follow up.

Member Author

this retry counter is tricky as we need to have a budget that allows all current readers/writers to fail on a network hiccup.

Do you mean have a counter for reading, writing and mapping updates?

There's also the question of how people know what happened when the task has failed (where we might need support from persistent tasks).

👍

I think we can leave this for now but have to deal with it in a follow up.

Agreed

// TODO: What other exceptions should be retried?
return NetworkExceptionHelper.isConnectException(e) ||
NetworkExceptionHelper.isCloseConnectionException(e) ||
e instanceof ActionTransportException ||
Contributor

what is this one?

Contributor

do we want to reuse org.elasticsearch.action.support.TransportActions#isShardNotAvailableException ?

Contributor

++

Member Author

what is this one?

Not sure. I removed this one.

do we want to reuse TransportActions#isShardNotAvailableException?

Makes sense when TransportSingleShardAction bubbles this exception up.
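
So shouldRetry(...) would end up roughly like this; the exact exception set is still open in this thread, and this is only a sketch.

    private boolean shouldRetry(Exception e) {
        // Retry on transient network problems and on shard-not-available conditions
        // bubbled up by TransportSingleShardAction.
        return NetworkExceptionHelper.isConnectException(e) ||
            NetworkExceptionHelper.isCloseConnectionException(e) ||
            TransportActions.isShardNotAvailableException(e);
    }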

String description,
TaskId parentTask,
Map<String, String> headers,
Client leaderClient,
Contributor

can't this be done with a filter client as well? I don't think we should do this with abstract classes; it will make things more complicated.

numConcurrentReads++;
long from = lastRequestedSeqno + 1;
long to = from + maxReadSize <= leaderGlobalCheckpoint ? from + maxReadSize : leaderGlobalCheckpoint;
LOGGER.debug("{}[{}] read [{}/{}]", params.getFollowShardId(), numConcurrentReads, from, to);
Contributor

IMO let's drop them all. If you have to keep them, make them trace; you can also just add them back if you need it.

scheduler.accept(TimeValue.timeValueMillis(500), this::coordinateReads);
}
} else {
if (numConcurrentReads == 0) {
Contributor

++

// TODO: What other exceptions should be retried?
return NetworkExceptionHelper.isConnectException(e) ||
NetworkExceptionHelper.isCloseConnectionException(e) ||
e instanceof ActionTransportException ||
Contributor

++

request.setMinSeqNo(from);
request.setMaxSeqNo(to);
request.setMaxTranslogsBytes(params.getMaxTranslogBytes());
leaderClient.execute(ShardChangesAction.INSTANCE, request, new ActionListener<ShardChangesAction.Response>() {
Contributor

you can use ActionListener.wrap(handler::accept, errorHandler::accept) instead

Member Author

I will do that
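
For reference, the ActionListener.wrap shortcut mentioned above would look roughly like this; handler and errorHandler stand for the response and error consumers already passed into the method.

    leaderClient.execute(ShardChangesAction.INSTANCE, request,
        ActionListener.wrap(handler::accept, errorHandler::accept));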

this.processedGlobalCheckpoint = processedGlobalCheckpoint;
this.numberOfConcurrentReads = numberOfConcurrentReads;
Contributor

please make sure they are positive or don't use vint to serialize them

Member Author

there is validation for that in the api that creates the persistent task
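
To illustrate the serialization concern, a minimal sketch (out is a StreamOutput; whether to validate up front or switch encodings is the open question here):

    // writeVInt/writeVLong assume non-negative values, so either validate up front...
    if (numberOfConcurrentReads < 0) {
        throw new IllegalArgumentException("numberOfConcurrentReads must be >= 0");
    }
    out.writeVInt(numberOfConcurrentReads);
    // ...or use a signed-friendly encoding for values that may legitimately be negative,
    // e.g. a not-yet-assigned checkpoint:
    out.writeZLong(processedGlobalCheckpoint);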


ShardFollowNodeTask(long id, String type, String action, String description, TaskId parentTask, Map<String, String> headers) {
ShardFollowNodeTask(long id,
String type,
Contributor

can we not have every arg on a separate line please?


private volatile int numConcurrentReads = 0;
private volatile int numConcurrentWrites = 0;
private volatile long processedGlobalCheckpoint = 0;
Contributor

those variables seem to be read together. I wonder if we should not make them volatile but rather synchronize their reading?

Member Author

Not sure. I don't want to make the getStatus() call synchronized, as it can be invoked when the list tasks api is used.


private static final Logger LOGGER = Loggers.getLogger(ShardFollowNodeTask.class);

final Client leaderClient;
Contributor

why is this pkg private?

Member Author

This is gone now. It was for something silly.

@@ -274,6 +287,15 @@ protected Response newResponse() {
break;
}
}
} catch (IllegalStateException e) {
// TODO: handle peek reads better.
Contributor

I don't get it. Can you point us to the place where the ISE is coming from? Is it due to the source not being available?

@martijnvg
Member Author

@bleskes @simonw I've updated the PR.

… response after its replication has been completed
@martijnvg
Member Author

@bleskes Thanks for reviewing. I think I've addressed all of your comments.

Contributor

@bleskes bleskes left a comment

Thx @martijnvg - I left some final comments based on our discussions.

if (maxConcurrentWriteBatches < 1) {
throw new IllegalArgumentException("maxConcurrentWriteBatches must be larger than 0");
}
if (maxWriteBufferSize < 1) {
Contributor

can we also check that the time values are non-null?

@@ -251,10 +272,21 @@ void start(Request request, String clusterNameAlias, IndexMetaData leaderIndexMe
.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
for (int i = 0; i < numShards; i++) {
final int shardId = i;
String taskId = followIndexMetadata.getIndexUUID() + "-" + shardId;
TimeValue retryTimeout = ShardFollowNodeTask.DEFAULT_RETRY_TIMEOUT;
if (request.retryTimeout != null) {
Contributor

I don't think these can be null?

this.idleShardChangesRequestDelay = idleShardChangesRequestDelay;
}

void start(long followGlobalCheckpoint) {
Contributor

nit: followerGlobalCheckpoint ?

while (hasReadBudget() && lastRequestedSeqno < globalCheckpoint) {
numConcurrentReads++;
long from = lastRequestedSeqno + 1;
LOGGER.trace("{}[{}] read [{}/{}]", params.getFollowShardId(), numConcurrentReads, from, maxReadSize);
Contributor

nit: move this after the maxRequiredSeqno and use maxRequiredSeqno for the range?

lastRequestedSeqno = maxRequiredSeqno;
}

if (numConcurrentReads == 0) {
Contributor

I just realized this has a bug - we need to check if we have budget here too (add this to the list of tests :)) - we may have no readers because the write buffer is full.

Member Author

In the case that the buffer is full and we don't start a read, the shard follow task comes to a halt, because when the write buffer has been consumed nothing else will happen, as this was the last point at which a read would have been started.

Member Author

so maybe when handling bulk shard operation responses we should also check whether a shard changes request should be fired off?

Contributor

so maybe when handling bulk shard operation responses we should also check whether a shard changes request should be fired off?

heh - I'm pretty sure I saw we did in some previous iterations. And yes - it must allow reads, because we block reads when we don't consume the write buffer. In a sense the right moment is when you coordinate writes and have consumed some buffer elements. Both are fine with me.

@@ -90,6 +358,12 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
{
builder.field(PROCESSED_GLOBAL_CHECKPOINT_FIELD.getPreferredName(), processedGlobalCheckpoint);
}
{
Contributor

nit - why do we need these extra {}? just inline this in the previous block?

Member Author

I started doing this here, because I agreed with this comment: https://github.com/elastic/x-pack-elasticsearch/pull/4290#discussion_r179411015

Contributor

one set of {} is awesome. But you can inline all fields in one set?

Member Author

yes, that makes sense.

ops[i] = buffer.remove();
sumEstimatedSize += ops[i].estimateSize();
if (sumEstimatedSize > params.getMaxBatchSizeInBytes()) {
ops = Arrays.copyOf(ops, i + 1);
Contributor

I wonder if we should use an ArrayList with initial capacity. We can then change the request etc to use List<> instead of array

Member Author

In that case we always make a copy of the underlying array in ArrayList, while in this case we only make a copy if maxBatchSizeInBytes limit has been reached.

Contributor

In that case we always make a copy of the underlying array in ArrayList,

Not if you change the request and such to use lists.

while in this case we only make a copy if maxBatchSizeInBytes limit has been reached.

which may be frequent (i.e., every request), which is why I was considering making a change.
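
A sketch of the list-based variant being discussed; batchSize here stands in for the configured maximum operation count and the names mirror the quoted code above, but this is illustrative only.

    List<Translog.Operation> ops = new ArrayList<>(Math.min(batchSize, buffer.size()));
    long sumEstimatedSize = 0L;
    while (ops.size() < batchSize && buffer.isEmpty() == false) {
        Translog.Operation op = buffer.remove();
        ops.add(op);
        sumEstimatedSize += op.estimateSize();
        if (sumEstimatedSize > params.getMaxBatchSizeInBytes()) {
            // Same semantics as the array version: the op that crossed the byte budget
            // is still included, but no further ops are drained, and no array copy is needed.
            break;
        }
    }
    sendBulkShardOperationsRequest(ops);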

e -> handleFailure(e, () -> sendShardChangesRequest(from, maxOperationCount, maxRequiredSeqNo)));
}

private void handleReadResponse(long from, int maxOperationCount, long maxRequiredSeqNo, ShardChangesAction.Response response) {
Contributor

What do you think of:

    private void handleReadResponse(long from, int maxOperationCount, long maxRequiredSeqNo, ShardChangesAction.Response response) {
        maybeUpdateMapping(response.getIndexMetadataVersion(), () -> {
            synchronized (ShardFollowNodeTask.this) {
                globalCheckpoint = Math.max(globalCheckpoint, response.getGlobalCheckpoint());
                final long newMinRequiredSeqNo;
                if (response.getOperations().length == 0) {
                    newMinRequiredSeqNo = from;
                } else {
                    assert response.getOperations()[0].seqNo() == from : 
                        "first operation is not what we asked for. From is [" + from + "], got " + response.getOperations()[0];
                    buffer.addAll(Arrays.asList(response.getOperations()));
                    final long maxSeqNo = response.getOperations()[response.getOperations().length - 1].seqNo();
                    assert maxSeqNo==
                        Arrays.stream(response.getOperations()).mapToLong(Translog.Operation::seqNo).max().getAsLong();
                    newMinRequiredSeqNo = maxSeqNo + 1;
                    // update last requested seq no as we may have gotten more than we asked for and we don't want to ask it again.
                    lastRequestedSeqno = Math.max(lastRequestedSeqno, maxSeqNo);
                    assert lastRequestedSeqno <= globalCheckpoint: 
                        "lastRequestedSeqno [" + lastRequestedSeqno + "] is larger than the global checkpoint [" + globalCheckpoint + "]";
                    coordinateWrites();
                }
                
                if (newMinRequiredSeqNo < maxRequiredSeqNo) {
                    int newSize = (int) (maxRequiredSeqNo - newMinRequiredSeqNo) + 1;
                    LOGGER.trace("{} received [{}] ops, still missing [{}/{}], continuing to read...",
                        params.getFollowShardId(), response.getOperations().length, newMinRequiredSeqNo, maxRequiredSeqNo);
                    sendShardChangesRequest(newMinRequiredSeqNo, newSize, maxRequiredSeqNo);
                } else {
                    // read is completed, decrement
                    numConcurrentReads--;
                    if (response.getOperations().length == 0 && globalCheckpoint == lastRequestedSeqno)  {
                        // we got nothing and we have no reason to believe asking again will get us more, treat shard as idle and delay
                        // future requests
                        LOGGER.trace("{} received no ops and no known ops to fetch, scheduling to coordinate reads", 
                            params.getFollowShardId());
                        scheduler.accept(idleShardChangesRequestDelay, this::coordinateReads);
                    } else {
                        coordinateReads();
                    }
                }
            }
        });
    }

PS - note the difference in handling of lastRequestedSeqno - I think the way you had it had a bug.

}

public static class Status implements Task.Status {

public static final String NAME = "shard-follow-node-task-status";

static final ParseField PROCESSED_GLOBAL_CHECKPOINT_FIELD = new ParseField("processed_global_checkpoint");
static final ParseField NUMBER_OF_CONCURRENT_READS_FIELD = new ParseField("number_of_concurrent_reads");
Contributor

I'm still missing the buffer size, the max requested seq no, the leader global checkpoint, the follower global checkpoint, etc. I'm fine with a follow-up for those, but that's what I meant.

@Override
protected void respondIfPossible(Exception ex) {
assert Thread.holdsLock(this);
// maybe invoked multiple times, but that is ok as global checkpoint does not go backwards
Contributor

I wonder if we should override the respond method - this one is called once when the replication operation is completed (but post-write actions may still be in flight). Seems like cleaner logic to follow, and we can write a comment there as to why we do it (get a fresh global checkpoint once the current batch has been fully replicated).

@martijnvg
Member Author

@bleskes I've updated the PR.

ops.get(ops.size() - 1).seqNo(), ops.size());
sendBulkShardOperationsRequest(ops);

// In case that buffer is higher than max write buffer size then reads may all have been stopped,
Contributor

it feels a bit weird to have this here, in the coordinateWrites loop. Since it's always safe to call coordinateReads, how about just calling it once in handleWriteResponse and be done with it?
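
The suggestion amounts to something like the following sketch; handleWriteResponse, followerLocalCheckpoint and getLocalCheckpoint are illustrative names based on this PR's description of the bulk shard operations response, not the actual API.

    private synchronized void handleWriteResponse(BulkShardOperationsResponse response) {
        // The response carries the local checkpoint on the follow primary, which keeps the
        // task up to date with what has been written (accessor name is hypothetical).
        followerLocalCheckpoint = Math.max(followerLocalCheckpoint, response.getLocalCheckpoint());
        numConcurrentWrites--;
        assert numConcurrentWrites >= 0;
        coordinateWrites();
        // Calling coordinateReads() here is always safe and restarts reading if it was
        // paused because the write buffer had grown beyond the configured limit.
        coordinateReads();
    }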

Contributor

@bleskes bleskes left a comment

LGTM, assuming you agree with my feedback. There is no need for another cycle if you do.

}

public static class Status implements Task.Status {

public static final String NAME = "shard-follow-node-task-status";

static final ParseField PROCESSED_GLOBAL_CHECKPOINT_FIELD = new ParseField("processed_global_checkpoint");
static final ParseField GLOBAL_CHECKPOINT_FIELD = new ParseField("leader_global_checkpoint");
Contributor

can we align the name with the field (LEADER_GLOBAL_CHECKPOINT_FIELD & FOLLOWER_GLOBAL_CHECKPOINT_FIELD)?

@martijnvg martijnvg merged commit 8e1ef0c into elastic:ccr Jul 10, 2018
martijnvg added a commit that referenced this pull request Jul 10, 2018