
[CCR] Improve shard follow task's retryable error handling #33371

Conversation

martijnvg (Member)

Improve the failure handling of retryable errors by retrying remote calls in
an exponential backoff-like manner. The delay between retries will not be
longer than the configured maximum retry delay, and retryable errors will be
retried indefinitely.

Relates to #30086
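For readers skimming the thread, a minimal self-contained sketch of the behavior described above follows. It is illustrative only: the class, method, and constant names (RetrySketch, attempt, MAX_RETRY_DELAY_MILLIS, isRetryable) are invented here, and the real ShardFollowNodeTask schedules retries through the task infrastructure rather than a raw executor.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative stand-in for the retry behavior, not the actual PR code.
final class RetrySketch {

    private static final ScheduledExecutorService SCHEDULER = Executors.newSingleThreadScheduledExecutor();
    private static final long MAX_RETRY_DELAY_MILLIS = 500; // stand-in for the configured max_retry_delay

    static void attempt(Runnable remoteCall, int currentRetry) {
        try {
            remoteCall.run();
        } catch (Exception e) {
            if (isRetryable(e)) {
                // Retryable errors are retried indefinitely; only the delay between attempts
                // grows, and it is never longer than the configured maximum.
                long delay = Math.min(50L * (1L << Math.min(currentRetry, 10)), MAX_RETRY_DELAY_MILLIS);
                SCHEDULER.schedule(() -> attempt(remoteCall, currentRetry + 1), delay, TimeUnit.MILLISECONDS);
            } else {
                throw new RuntimeException(e); // non-retryable errors fail the task immediately
            }
        }
    }

    private static boolean isRetryable(Exception e) {
        // The real task checks for specific exception types (e.g. connection and transport errors).
        return e instanceof java.net.ConnectException;
    }
}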

@martijnvg added the >enhancement, review, v7.0.0, :Distributed/CCR, and v6.5.0 labels on Sep 4, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

Improve failure handling of retryable errors by retrying remote calls in
an exponential backoff-like manner. The delay between retries will not be
longer than the configured maximum retry delay, and retryable errors will be
retried indefinitely.

Relates to elastic#30086
@martijnvg force-pushed the ccr_improve_shard_follow_task_retryable_failure_handling branch from 05d0518 to a717f44 on September 4, 2018 10:11
@jasontedor (Member) left a comment


I left a comment about the delay computation.

    } else {
        markAsFailed(e);
    }
}

private boolean shouldRetry(Exception e) {
static long computeDelay(int currentRetry, long maxRetryDelayInMillis) {
    long expectedBackOff = Math.round(10 * Math.exp(0.8d * currentRetry) - 1);
jasontedor (Member)

This is not quite what I think of as exponential backoff. The problem that I see with an implementation like this is that we can have a thundering herd problem. If there is some failure that causes a bunch of tasks to simultaneously fail (e.g., say that we have a bunch of outstanding fetch tasks waiting for a response, and the network connection breaks, failing all of them), all of the retries will keep waking up at the same time, itself potentially causing issues due to a herd.

Typically there is a random component in exponential backoff to avoid this herding. As a first approach, what I suggest here is: choose a random value k between 0 and 2^number of retries - 1, then retry after k * delay seconds. We can cap this at the max retry delay.
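A minimal sketch of what that suggestion could look like, assuming a positive base delay; the method name jitteredDelayMillis and its parameters are illustrative, not the code that ended up in this PR:

// Pick a random k in [0, 2^currentRetry - 1] and wait k * baseDelay, capped at max_retry_delay.
static long jitteredDelayMillis(int currentRetry, long baseDelayMillis, long maxRetryDelayMillis,
                                java.util.Random random) {
    long n = Math.round(Math.pow(2, currentRetry));   // 2^currentRetry (saturates for very large values)
    long k = (long) (random.nextDouble() * n);        // uniform-ish value in [0, n - 1]
    // Capping before the multiplication keeps the result within max_retry_delay and avoids overflow.
    return k >= maxRetryDelayMillis / baseDelayMillis ? maxRetryDelayMillis : k * baseDelayMillis;
}

The random factor spreads tasks that failed at the same instant across the whole delay window, which is what breaks up the herd.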

martijnvg (Member, Author)

Right, I did think about this, but when I looked at the ExponentialBackoffIterator class to see how backoff was implemented elsewhere, I decided to use that approach. I will change this.

martijnvg (Member, Author)

I've changed how the delay is computed. With this randomness the retrying mechanism is much more solid.

@jasontedor (Member) left a comment

I left a comment.

    } else {
        markAsFailed(e);
    }
}

private boolean shouldRetry(Exception e) {
static long computeDelay(int currentRetry, long maxRetryDelayInMillis) {
    long n = Math.round(Math.pow(2, currentRetry - 1));
jasontedor (Member)

I think that we need to guard against overflow here!
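For illustration, one way to guard that term is to clamp the exponent before shifting, instead of letting the rounded double saturate; this is a sketch, not necessarily how the PR resolved it:

static long boundedPowerOfTwo(int currentRetry) {
    // Math.pow returns a double, so very large exponents silently become huge or infinite values;
    // clamping the exponent keeps the result a sane long well below any realistic retry delay.
    int exponent = Math.min(Math.max(currentRetry - 1, 0), 24);
    return 1L << exponent;
}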

@@ -56,20 +57,22 @@
*/
public abstract class ShardFollowNodeTask extends AllocatedPersistentTask {

public static final SecureRandom RANDOM_INSTANCE = new SecureRandom();
jasontedor (Member)

We can use Randomness#get here.
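Randomness#get returns the shared java.util.Random source (reproducible when running under the test framework), so something along these lines could replace the dedicated SecureRandom field; the surrounding method randomDelayMillis is purely illustrative:

import java.util.Random;
import org.elasticsearch.common.Randomness;

static long randomDelayMillis(long upperBoundMillis) {
    Random random = Randomness.get(); // reproducible under tests, no SecureRandom field needed
    return (long) (random.nextDouble() * upperBoundMillis);
}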

"] times, aborting...", e));
}
if (shouldRetry(e) && isStopped() == false) {
LOGGER.debug(new ParameterizedMessage("{} error during follow shard task, retrying...", params.getFollowShardId()), e);
jasontedor (Member)

We should also include in this debug log the number of times that we have retried so far.
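A sketch of what that log line could look like, mirroring the quoted context; the retryCounter field name is an assumption, not necessarily the field used in the PR:

LOGGER.debug(new ParameterizedMessage(
        "{} error during follow shard task after [{}] retries, retrying...",
        params.getFollowShardId(), retryCounter.get()), e);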

@martijnvg (Member, Author)

@jasontedor I've updated the PR.

jasontedor and others added 3 commits September 11, 2018 08:48
* master: (91 commits)
  Preserve cluster settings on full restart tests (elastic#33590)
  Use IndexWriter.getFlushingBytes() rather than tracking it ourselves (elastic#33582)
  Fix upgrading of list settings (elastic#33589)
  Add read-only Engine (elastic#33563)
  HLRC: Add ML get categories API (elastic#33465)
  SQL: Adds MONTHNAME, DAYNAME and QUARTER functions (elastic#33411)
  Add predicate_token_filter (elastic#33431)
  Fix Replace function. Adds more tests to all string functions. (elastic#33478)
  [ML] Rename input_fields to column_names in file structure (elastic#33568)
  Add full cluster restart base class (elastic#33577)
  Validate list values for settings (elastic#33503)
  Copy and validatie soft-deletes setting on resize (elastic#33517)
  Test: Fix package name
  SQL: Fix result column names for arithmetic functions (elastic#33500)
  Upgrade to latest Lucene snapshot (elastic#33505)
  Enable not wiping cluster settings after REST test (elastic#33575)
  MINOR: Remove Dead Code in SearchScript (elastic#33569)
  [Test] Remove duplicate method in TestShardRouting (elastic#32815)
  mute test on windows
  Update beats template to include apm-server metrics (elastic#33286)
  ...
@jasontedor (Member) left a comment

LGTM.

@jasontedor (Member) commented on Sep 11, 2018

Separately, I think we want some validation on max_retry_delay, as today it can be negative and can also be set too large. Maybe:

diff --git a/x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/FollowIndexAction.java b/x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/FollowIndexAction.java
index 26a9d11eb32..4b235b55991 100644
--- a/x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/FollowIndexAction.java
+++ b/x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/action/FollowIndexAction.java
@@ -62,6 +62,8 @@ import java.util.concurrent.atomic.AtomicInteger;
 import java.util.concurrent.atomic.AtomicReferenceArray;
 import java.util.stream.Collectors;
 
+import static org.elasticsearch.action.ValidateActions.addValidationError;
+
 public class FollowIndexAction extends Action<AcknowledgedResponse> {
 
     public static final FollowIndexAction INSTANCE = new FollowIndexAction();
@@ -210,9 +212,19 @@ public class FollowIndexAction extends Action<AcknowledgedResponse> {
             return maxBatchOperationCount;
         }
 
+        private static final TimeValue MAX_MAX_RETRY_DELAY = TimeValue.timeValueMinutes(5);
+
         @Override
         public ActionRequestValidationException validate() {
-            return null;
+            ActionRequestValidationException e = null;
+            if (maxRetryDelay.millis() <= 0) {
+                e = addValidationError("max_retry_delay must be positive but was [" + maxRetryDelay.getStringRep() + "]", e);
+            }
+            if (maxRetryDelay.millis() > MAX_MAX_RETRY_DELAY.millis()) {
+                e = addValidationError(
+                        "max_retry_delay must be less than [" + MAX_MAX_RETRY_DELAY + "] but was [" + maxRetryDelay.getStringRep() + "]", e);
+            }
+            return e;
         }
 
         @Override

What do you think?

@martijnvg merged commit 96c49e5 into elastic:master on Sep 12, 2018
martijnvg added a commit that referenced this pull request Sep 12, 2018
Improve failure handling of retryable errors by retrying remote calls in
an exponential backoff-like manner. The delay between retries will not be
longer than the configured maximum retry delay, and retryable errors will be
retried indefinitely.

Relates to #30086
jasontedor added a commit to martijnvg/elasticsearch that referenced this pull request Sep 12, 2018
* master: (43 commits)
  [HLRC][ML] Add ML put datafeed API to HLRC (elastic#33603)
  Update AWS SDK to 1.11.406  in repository-s3 (elastic#30723)
  Expose CCR stats to monitoring (elastic#33617)
  [Docs] Update match-query.asciidoc (elastic#33610)
  TEST: Adjust rollback condition when shard is empty
  [CCR] Improve shard follow task's retryable error handling (elastic#33371)
  Forbid negative `weight` in Function Score Query (elastic#33390)
  Clarify context suggestions filtering and boosting (elastic#33601)
  Disable CCR REST endpoints if CCR disabled (elastic#33619)
  Lower version on full cluster restart settings test
  Upgrade remote cluster settings (elastic#33537)
  NETWORKING: http.publish_host Should Contain CNAME (elastic#32806)
  Add test coverage for global checkpoint listeners
  Reset replica engine to global checkpoint on promotion (elastic#33473)
  HLRC: ML Delete Forecast API (elastic#33526)
  Remove debug logging in full cluster restart tests (elastic#33612)
  Expose CCR to the transport client (elastic#33608)
  Mute testIndexDeletionWhenNodeRejoins
  SQL: Make Literal a NamedExpression (elastic#33583)
  [DOCS] Adds missing built-in user information (elastic#33585)
  ...
jasontedor added a commit to martijnvg/elasticsearch that referenced this pull request Sep 12, 2018
* master: (128 commits)
  [HLRC][ML] Add ML put datafeed API to HLRC (elastic#33603)
  Update AWS SDK to 1.11.406  in repository-s3 (elastic#30723)
  Expose CCR stats to monitoring (elastic#33617)
  [Docs] Update match-query.asciidoc (elastic#33610)
  TEST: Adjust rollback condition when shard is empty
  [CCR] Improve shard follow task's retryable error handling (elastic#33371)
  Forbid negative `weight` in Function Score Query (elastic#33390)
  Clarify context suggestions filtering and boosting (elastic#33601)
  Disable CCR REST endpoints if CCR disabled (elastic#33619)
  Lower version on full cluster restart settings test
  Upgrade remote cluster settings (elastic#33537)
  NETWORKING: http.publish_host Should Contain CNAME (elastic#32806)
  Add test coverage for global checkpoint listeners
  Reset replica engine to global checkpoint on promotion (elastic#33473)
  HLRC: ML Delete Forecast API (elastic#33526)
  Remove debug logging in full cluster restart tests (elastic#33612)
  Expose CCR to the transport client (elastic#33608)
  Mute testIndexDeletionWhenNodeRejoins
  SQL: Make Literal a NamedExpression (elastic#33583)
  [DOCS] Adds missing built-in user information (elastic#33585)
  ...