Defer reroute when starting shards #44433

Conversation

DaveCTurner
Contributor

Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

This may improve some of the situations related to #42738 and #42105.
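The starvation mechanism described above can be sketched with a toy priority queue (plain Java, not Elasticsearch code; the `Priority` enum and task names here are purely illustrative): equal-priority tasks run in submission order, so a steady stream of `URGENT` reroutes would delay every `NORMAL` task, while a single deferred `NORMAL` reroute lets other master work interleave.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class PrioritySketch {
    enum Priority { URGENT, HIGH, NORMAL } // lower ordinal runs first

    record Task(Priority priority, long seq, String name) {}

    /** Drains tasks in (priority, submission-order), like a master task queue. */
    static List<String> drain(List<Task> tasks) {
        PriorityQueue<Task> queue = new PriorityQueue<>(
            Comparator.comparing(Task::priority).thenComparingLong(Task::seq));
        queue.addAll(tasks);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().name());
        }
        return order;
    }

    public static void main(String[] args) {
        List<String> order = drain(List.of(
            new Task(Priority.NORMAL, 0, "put-mapping"),
            new Task(Priority.URGENT, 1, "shard-started [0]"),
            new Task(Priority.URGENT, 2, "shard-started [1]"),
            // with this change, the follow-up reroute is one NORMAL task:
            new Task(Priority.NORMAL, 3, "deferred reroute")));
        System.out.println(order);
        // → [shard-started [0], shard-started [1], put-mapping, deferred reroute]
    }
}
```

Before the change, each `shard-started` event would also have enqueued an `URGENT` reroute, so a continuous trickle of them could keep the `NORMAL` tasks waiting indefinitely.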

@DaveCTurner DaveCTurner added >enhancement :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 v7.4.0 labels Jul 16, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor Author

DaveCTurner commented Jul 16, 2019

Note to reviewers: don't let the size of the diff (+462 −515 across 61 files changed) put you off too much. The largest part of this is adjusting all the test cases that use an `AllocationService` to start some shards, because with this change they also need to perform a separate reroute. Apart from that it's just a bit of plumbing to get the `RerouteService` in place, and boilerplate for the already-deprecated setting that can be removed in `master` in a followup.

@original-brownbear
Member

@DaveCTurner seems there's a problem with the setting change on CI and it's probably just missing a warning exclusion in the rest tests:

java.lang.AssertionError: unexpected warning headers expected null, but was:<[299 Elasticsearch-8.0.0-SNAPSHOT-cf9953f "[cluster.routing.allocation.shard_state.reroute.priority] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."]>

We cannot set the priority in all InternalTestClusters because the deprecation
warning makes some tests unhappy. This commit adds a specific test instead.
@DaveCTurner
Contributor Author

Bit more complicated than that - I thought I could get away with something simple but got bitten by the deprecation logger, so had to add a specific test in 5fc2e7e.

@original-brownbear
Member

Jenkins run elasticsearch-ci/2 (just the xpack license test failing)

Member

@original-brownbear original-brownbear left a comment


Code looks good, just one question.

Also, to me the change to NORMAL priority/not-inline-execution makes sense here but maybe Yannick should confirm as well :)
I wonder though if we really should hide this change as much as it's hidden now. Couldn't we add a note on this change stating that this might be a problem if your cluster is already badly configured (master overloaded + lots of shard fluctuation?) but is an optimization for a healthy setup? That way users affected negatively by this might be more likely to either understand/fix the problem or report back in a more meaningful way?

@DaveCTurner
Contributor Author

Sure. There's no way to express changes to the migration docs in a PR against master because the docs don't exist here, but I propose this in 7.x:

diff --git a/docs/reference/migration/migrate_7_4.asciidoc b/docs/reference/migration/migrate_7_4.asciidoc
index ebfca7d25c1..63315f9fb65 100644
--- a/docs/reference/migration/migrate_7_4.asciidoc
+++ b/docs/reference/migration/migrate_7_4.asciidoc
@@ -67,4 +67,20 @@ unsupported on buckets created after September 30th 2020.
 Starting in version 7.4, a `+` in a URL will be encoded as `%2B` by all REST API functionality. Prior versions handled a `+` as a single space.
 If your application requires handling `+` as a single space you can return to the old behaviour by setting the system property
 `es.rest.url_plus_as_space` to `true`. Note that this behaviour is deprecated and setting this system property to `true` will cease
-to be supported in version 8.
\ No newline at end of file
+to be supported in version 8.
+
+[float]
+[[breaking_74_cluster_changes]]
+=== Cluster changes
+
+[float]
+==== Rerouting after starting a shard runs at lower priority
+
+After starting each shard the elected master node must perform a reroute to
+search for other shards that could be allocated. In particular, when creating
+an index it is this task that allocates the replicas once the primaries have
+started. In versions prior to 7.4 this task runs at priority `URGENT`, but
+starting in version 7.4 its priority is reduced to `NORMAL`. In a
+well-configured cluster this reduces the amount of work the master must do, but
+means that a cluster with a master that is overloaded with other tasks at
+`HIGH` or `URGENT` priority may take longer to allocate all replicas.

Member

@original-brownbear original-brownbear left a comment


LGTM :) thanks @DaveCTurner

Contributor

@ywelsch ywelsch left a comment


I've left a few small comments, looking good otherwise.

@DaveCTurner DaveCTurner requested a review from ywelsch July 17, 2019 11:28
Contributor

@ywelsch ywelsch left a comment


LGTM

@DaveCTurner DaveCTurner merged commit 51fb95e into elastic:master Jul 18, 2019
@DaveCTurner DaveCTurner deleted the 2019-07-16-defer-reroute-when-starting-shards branch July 18, 2019 05:39
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 18, 2019
* Defer reroute when starting shards

Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

This may improve some of the situations related to elastic#42738 and elastic#42105.

* Specific test case for followup priority setting

We cannot set the priority in all InternalTestClusters because the deprecation
warning makes some tests unhappy. This commit adds a specific test instead.

* Checkstyle

* Cluster state always changed here

* Assert consistency of routing nodes

* Restrict setting only to reasonable priorities
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 18, 2019
The change in elastic#44433 introduces a state in which the cluster has no relocating
shards but still has a pending reroute task which might start a shard
relocation. `TransportSearchFailuresIT` failed on a PR build seemingly because
it did not wait for this pending task to complete too, reporting more active shards
than expected:

    2> java.lang.AssertionError:
      Expected: <9>
           but: was <10>
          at __randomizedtesting.SeedInfo.seed([4057CA4301FE95FA:207EC88573747235]:0)
          at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
          at org.junit.Assert.assertThat(Assert.java:956)
          at org.junit.Assert.assertThat(Assert.java:923)
          at org.elasticsearch.search.basic.TransportSearchFailuresIT.testFailedSearchWithWrongQuery(TransportSearchFailuresIT.java:97)

This commit addresses this failure by waiting until there are neither pending
tasks nor shard relocations in progress.
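The idea behind that fix can be sketched generically (a hypothetical helper, not the actual test code): poll until *both* "no shard relocations in progress" and "no pending master tasks" hold, since either condition alone can be momentarily true while the other still implies that shards may yet move.

```java
import java.util.function.BooleanSupplier;

public class AwaitQuiescence {
    /** Returns true once both conditions hold within maxAttempts polls. */
    static boolean awaitQuiet(BooleanSupplier noRelocations,
                              BooleanSupplier noPendingTasks,
                              int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (noRelocations.getAsBoolean() && noPendingTasks.getAsBoolean()) {
                return true;
            }
            // real test code would sleep / back off between polls
        }
        return false;
    }

    public static void main(String[] args) {
        int[] pending = {2}; // two pending reroute tasks drain over time
        boolean quiet = awaitQuiet(() -> true, () -> pending[0]-- <= 0, 10);
        System.out.println(quiet); // prints "true" once the queue drains
    }
}
```

Waiting only on `noRelocations` reproduces the failure above: the cluster looks quiet even though a queued reroute is about to start another relocation.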
DaveCTurner added a commit that referenced this pull request Jul 18, 2019
Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

Backport of #44433 and #44543.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 19, 2019
In elastic#44433 we introduced a temporary (immediately deprecated) escape-hatch
setting to control the priority of the reroute scheduled after starting a batch
of shards. This commit removes this setting in `master`, fixing the followup
reroute's priority at `NORMAL`.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 1, 2019
Adds a `waitForEvents(Priority.LANGUID)` to the cluster health request in
`ESIntegTestCase#waitForRelocation()` to deal with the case that this health
request returns successfully despite the fact that there is a pending reroute task which
will relocate another shard.

Relates elastic#44433
Fixes elastic#45003
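A rough model of what `waitForEvents(Priority.LANGUID)` buys (plain Java, not Elasticsearch internals; names are illustrative): the health request effectively enqueues a marker at the lowest priority, so it completes only after every task already pending on the master, including a queued reroute, has been processed.

```java
import java.util.ArrayList;
import java.util.List;

public class WaitForEventsSketch {
    /** Processes the queue in order; the health response "returns" only once
     *  the lowest-priority marker is reached, i.e. after everything that was
     *  already pending (such as a deferred reroute) has run. */
    static List<String> processedBeforeHealthResponds(List<String> pendingTasks) {
        List<String> processed = new ArrayList<>();
        for (String task : pendingTasks) {
            processed.add(task);
            if (task.equals("LANGUID marker")) {
                break; // health request can now respond
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> done = processedBeforeHealthResponds(List.of(
            "pending reroute",    // might still relocate a shard
            "LANGUID marker"));   // health responds only after this runs
        System.out.println(done); // → [pending reroute, LANGUID marker]
    }
}
```

Without the marker, the health request could respond while the reroute was still queued, which is exactly the race the commit above closes.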
@mfussenegger mfussenegger mentioned this pull request Mar 26, 2020
seut pushed a commit to crate/crate that referenced this pull request Oct 5, 2020
This adds the `IndexBalanceTests` ES tests based on
elastic/elasticsearch@5dda2b0
because the more recent version requires a not-yet-backported patch
(elastic/elasticsearch#44433).