Defer reroute when starting shards #44433

Conversation

DaveCTurner
Contributor

Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

This may improve some of the situations related to #42738 and #42105.
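The starvation mechanism described above can be sketched with a toy priority queue (plain Java, not Elasticsearch code; the `Priority` enum and task names here are purely illustrative): equal-priority tasks run in submission order, so a steady stream of `URGENT` reroutes would delay every `NORMAL` task, while a single deferred `NORMAL` reroute lets other master work interleave.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class PrioritySketch {
    enum Priority { URGENT, HIGH, NORMAL } // lower ordinal runs first

    record Task(Priority priority, long seq, String name) {}

    /** Drains tasks in (priority, submission-order), like a master task queue. */
    static List<String> drain(List<Task> tasks) {
        PriorityQueue<Task> queue = new PriorityQueue<>(
            Comparator.comparing(Task::priority).thenComparingLong(Task::seq));
        queue.addAll(tasks);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().name());
        }
        return order;
    }

    public static void main(String[] args) {
        List<String> order = drain(List.of(
            new Task(Priority.NORMAL, 0, "put-mapping"),
            new Task(Priority.URGENT, 1, "shard-started [0]"),
            new Task(Priority.URGENT, 2, "shard-started [1]"),
            // with this change, the follow-up reroute is one NORMAL task:
            new Task(Priority.NORMAL, 3, "deferred reroute")));
        System.out.println(order);
        // → [shard-started [0], shard-started [1], put-mapping, deferred reroute]
    }
}
```

Before the change, each `shard-started` event would also have enqueued an `URGENT` reroute, so a continuous trickle of them could keep the `NORMAL` tasks waiting indefinitely.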

@DaveCTurner DaveCTurner added >enhancement :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 v7.4.0 labels Jul 16, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor Author

DaveCTurner commented Jul 16, 2019

Note to reviewers: don't let the size of the diff (+462 −515 across 61 files changed) put you off too much. The largest part of this is adjusting all the test cases that use an `AllocationService` to start some shards, because with this change they also need to perform a separate reroute. Apart from that it's just a bit of plumbing to get the `RerouteService` in place, and boilerplate for the already-deprecated setting that can be removed in `master` in a followup.

@original-brownbear
Member

@DaveCTurner seems there's a problem with the setting change on CI and it's probably just missing a warning exclusion in the rest tests:

java.lang.AssertionError: unexpected warning headers expected null, but was:<[299 Elasticsearch-8.0.0-SNAPSHOT-cf9953f "[cluster.routing.allocation.shard_state.reroute.priority] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."]>

We cannot set the priority in all InternalTestClusters because the deprecation
warning makes some tests unhappy. This commit adds a specific test instead.
@DaveCTurner
Contributor Author

Bit more complicated than that - I thought I could get away with something simple but got bitten by the deprecation logger, so had to add a specific test in 5fc2e7e.

@original-brownbear
Member

Jenkins run elasticsearch-ci/2 (just the xpack license test failing)

Member

@original-brownbear original-brownbear left a comment


Code looks good, just one question.

Also, to me the change to NORMAL priority/not-inline-execution makes sense here but maybe Yannick should confirm as well :)
I wonder though if we really should hide this change as much as it's hidden now. Couldn't we add a note on this change stating that this might be a problem if your cluster is already badly configured (master overloaded + lots of shard fluctuation?) but is an optimization for a healthy setup? That way users affected negatively by this might be more likely to either understand/fix the problem or report back in a more meaningful way?

@DaveCTurner
Contributor Author

Sure. There's no way to express changes to the migration docs in a PR against master because the docs don't exist here, but I propose this in 7.x:

diff --git a/docs/reference/migration/migrate_7_4.asciidoc b/docs/reference/migration/migrate_7_4.asciidoc
index ebfca7d25c1..63315f9fb65 100644
--- a/docs/reference/migration/migrate_7_4.asciidoc
+++ b/docs/reference/migration/migrate_7_4.asciidoc
@@ -67,4 +67,20 @@ unsupported on buckets created after September 30th 2020.
 Starting in version 7.4, a `+` in a URL will be encoded as `%2B` by all REST API functionality. Prior versions handled a `+` as a single space.
 If your application requires handling `+` as a single space you can return to the old behaviour by setting the system property
 `es.rest.url_plus_as_space` to `true`. Note that this behaviour is deprecated and setting this system property to `true` will cease
-to be supported in version 8.
\ No newline at end of file
+to be supported in version 8.
+
+[float]
+[[breaking_74_cluster_changes]]
+=== Cluster changes
+
+[float]
+==== Rerouting after starting a shard runs at lower priority
+
+After starting each shard the elected master node must perform a reroute to
+search for other shards that could be allocated. In particular, when creating
+an index it is this task that allocates the replicas once the primaries have
+started. In versions prior to 7.4 this task runs at priority `URGENT`, but
+starting in version 7.4 its priority is reduced to `NORMAL`. In a
+well-configured cluster this reduces the amount of work the master must do, but
+means that a cluster with a master that is overloaded with other tasks at
+`HIGH` or `URGENT` priority may take longer to allocate all replicas.

Member

@original-brownbear original-brownbear left a comment


LGTM :) thanks @DaveCTurner

Contributor

@ywelsch ywelsch left a comment


I've left a few small comments, looking good otherwise.

@DaveCTurner DaveCTurner requested a review from ywelsch July 17, 2019 11:28
Contributor

@ywelsch ywelsch left a comment


LGTM

@DaveCTurner DaveCTurner merged commit 51fb95e into elastic:master Jul 18, 2019
@DaveCTurner DaveCTurner deleted the 2019-07-16-defer-reroute-when-starting-shards branch July 18, 2019 05:39
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 18, 2019
* Defer reroute when starting shards

Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

This may improve some of the situations related to elastic#42738 and elastic#42105.

* Specific test case for followup priority setting

We cannot set the priority in all InternalTestClusters because the deprecation
warning makes some tests unhappy. This commit adds a specific test instead.

* Checkstyle

* Cluster state always changed here

* Assert consistency of routing nodes

* Restrict setting only to reasonable priorities
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 18, 2019
The change in elastic#44433 introduces a state in which the cluster has no relocating
shards but still has a pending reroute task which might start a shard
relocation. `TransportSearchFailuresIT` failed on a PR build seemingly because
it did not wait for this pending task to complete too, reporting more active shards
than expected:

    2> java.lang.AssertionError:
      Expected: <9>
           but: was <10>
          at __randomizedtesting.SeedInfo.seed([4057CA4301FE95FA:207EC88573747235]:0)
          at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
          at org.junit.Assert.assertThat(Assert.java:956)
          at org.junit.Assert.assertThat(Assert.java:923)
          at org.elasticsearch.search.basic.TransportSearchFailuresIT.testFailedSearchWithWrongQuery(TransportSearchFailuresIT.java:97)

This commit addresses this failure by waiting until there are neither pending
tasks nor shard relocations in progress.
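The idea behind that fix can be sketched generically (a hypothetical helper, not the actual test code): poll until *both* "no shard relocations in progress" and "no pending master tasks" hold, since either condition alone can be momentarily true while the other still implies that shards may yet move.

```java
import java.util.function.BooleanSupplier;

public class AwaitQuiescence {
    /** Returns true once both conditions hold within maxAttempts polls. */
    static boolean awaitQuiet(BooleanSupplier noRelocations,
                              BooleanSupplier noPendingTasks,
                              int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (noRelocations.getAsBoolean() && noPendingTasks.getAsBoolean()) {
                return true;
            }
            // real test code would sleep / back off between polls
        }
        return false;
    }

    public static void main(String[] args) {
        int[] pending = {2}; // two pending reroute tasks drain over time
        boolean quiet = awaitQuiet(() -> true, () -> pending[0]-- <= 0, 10);
        System.out.println(quiet); // prints "true" once the queue drains
    }
}
```

Waiting only on `noRelocations` reproduces the failure above: the cluster looks quiet even though a queued reroute is about to start another relocation.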
DaveCTurner added a commit that referenced this pull request Jul 18, 2019
Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

Backport of #44433 and #44543.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 19, 2019
In elastic#44433 we introduced a temporary (immediately deprecated) escape-hatch
setting to control the priority of the reroute scheduled after starting a batch
of shards. This commit removes this setting in `master`, fixing the followup
reroute's priority at `NORMAL`.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 1, 2019
Adds a `waitForEvents(Priority.LANGUID)` to the cluster health request in
`ESIntegTestCase#waitForRelocation()` to deal with the case that this health
request returns successfully despite the fact that there is a pending reroute task which
will relocate another shard.

Relates elastic#44433
Fixes elastic#45003
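A rough model of what `waitForEvents(Priority.LANGUID)` buys (plain Java, not Elasticsearch internals; names are illustrative): the health request effectively enqueues a marker at the lowest priority, so it completes only after every task already pending on the master, including a queued reroute, has been processed.

```java
import java.util.ArrayList;
import java.util.List;

public class WaitForEventsSketch {
    /** Processes the queue in order; the health response "returns" only once
     *  the lowest-priority marker is reached, i.e. after everything that was
     *  already pending (such as a deferred reroute) has run. */
    static List<String> processedBeforeHealthResponds(List<String> pendingTasks) {
        List<String> processed = new ArrayList<>();
        for (String task : pendingTasks) {
            processed.add(task);
            if (task.equals("LANGUID marker")) {
                break; // health request can now respond
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> done = processedBeforeHealthResponds(List.of(
            "pending reroute",    // might still relocate a shard
            "LANGUID marker"));   // health responds only after this runs
        System.out.println(done); // → [pending reroute, LANGUID marker]
    }
}
```

Without the marker, the health request could respond while the reroute was still queued, which is exactly the race the commit above closes.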
@mfussenegger mfussenegger mentioned this pull request Mar 26, 2020
seut pushed a commit to crate/crate that referenced this pull request Oct 5, 2020
This adds the `IndexBalanceTests` ES tests based on
elastic/elasticsearch@5dda2b0
because the more recent version requires a not-yet-backported patch
(elastic/elasticsearch#44433).