Avoid parallel reroutes in DiskThresholdMonitor #43381

@DaveCTurner (Contributor) commented Jun 19, 2019

Today the DiskThresholdMonitor limits the frequency with which it submits
reroute tasks, but it might still submit these tasks faster than the master can
process them if, for instance, each reroute takes over 60 seconds. This causes
a problem since the reroute task runs with priority IMMEDIATE and is always
scheduled when there is a node over the high watermark, so this can starve any
other pending tasks on the master.

This change avoids further updates from the monitor while its last task(s) are
still in progress, and it measures the time of each update from the completion
time of the reroute task rather than its start time, to allow a larger window
for other tasks to run.

Fixes #40174
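
The two ideas in the description (skip new updates while a reroute is still running, and measure the interval from the reroute's completion rather than its start) can be illustrated with a minimal sketch. This is not the actual `DiskThresholdMonitor` code: the class `ThrottledRerouter`, its fields, and the injected clock are all hypothetical names chosen for the example.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the monitor's throttling logic; not the real DiskThresholdMonitor. */
final class ThrottledRerouter {
    private final AtomicBoolean inProgress = new AtomicBoolean(false);
    private final AtomicLong lastCompletedMillis;
    private final long minIntervalMillis;
    private final LongSupplier clock; // injected so the example can be driven by a fake clock

    ThrottledRerouter(long minIntervalMillis, LongSupplier clock) {
        this.minIntervalMillis = minIntervalMillis;
        this.clock = clock;
        // Pretend a reroute completed one full interval ago so the first attempt is allowed.
        this.lastCompletedMillis = new AtomicLong(clock.getAsLong() - minIntervalMillis);
    }

    /**
     * Starts a reroute unless one is still running or the last one completed too recently.
     * The reroute is handed a callback that it must run when it finishes.
     * Returns true if a reroute was started.
     */
    boolean maybeReroute(Consumer<Runnable> reroute) {
        if (inProgress.compareAndSet(false, true) == false) {
            return false; // the previous reroute has not finished yet
        }
        if (clock.getAsLong() - lastCompletedMillis.get() < minIntervalMillis) {
            inProgress.set(false);
            return false; // too soon after the previous completion
        }
        reroute.accept(() -> {
            // The interval is measured from completion, not from submission.
            lastCompletedMillis.set(clock.getAsLong());
            inProgress.set(false);
        });
        return true;
    }
}
```

With a 60 second interval, a reroute that itself takes two minutes blocks further submissions for those two minutes and then for another 60 seconds after it completes, which leaves room for other pending tasks.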

@elasticmachine (Collaborator) commented Jun 19, 2019

}

final ImmutableOpenMap<String, DiskUsage> usages = info.getNodeLeastAvailableDiskUsages();
if (usages == null) {

@DaveCTurner (Author, Contributor) commented Jun 19, 2019

Probably best to look at this bit ignoring whitespace changes - I removed a level of indentation.

@ywelsch (Contributor) commented Jun 21, 2019

good tip

DaveCTurner added 4 commits Jun 19, 2019
Contributor left a comment

I'm still not super happy that we are sending a task to be executed at priority IMMEDIATE. I would rather have this call RoutingService. In that case, we could also avoid this whole business of tracking whether there is already a call in progress (that's taken care of by RoutingService). WDYT?

@DaveCTurner (Contributor, Author) commented Jun 24, 2019

I agree on the priority thing, but the RoutingService still uses HIGH priority and doesn't offer a notification on completion to keep the frequency low. I could add such a thing if you'd like?

@ywelsch (Contributor) commented Jun 25, 2019

> I agree on the priority thing, but the RoutingService still uses HIGH priority and doesn't offer a notification on completion to keep the frequency low. I could add such a thing if you'd like?

I think HIGH priority is ok for now. I wonder why we need the notification on completion. What does it keep the frequency low of? If we're batching calls, it's fine to have multiple pending attempts?

DaveCTurner requested a review from ywelsch Jun 26, 2019

@DaveCTurner (Contributor, Author) commented Jun 26, 2019

Ok I have added to the RoutingService the ability to listen for completion, and adjusted the DiskThresholdMonitor to make use of this. @ywelsch would you take another look?
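
As a rough illustration of what batching reroute requests with completion notification could look like, here is a minimal sketch. It is not the real `RoutingService`: the class `BatchingRerouter`, the use of a plain `Executor` in place of the master's task queue, and the `Runnable` listeners are assumptions made for this example, and failure handling is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

/** Hypothetical sketch of a batching reroute service; not the real RoutingService. */
final class BatchingRerouter {
    private final Executor executor;    // stands in for the master's task queue
    private final Runnable rerouteWork; // the actual reroute computation
    private final Object mutex = new Object();
    private List<Runnable> pendingListeners = null; // non-null while a batch is queued

    BatchingRerouter(Executor executor, Runnable rerouteWork) {
        this.executor = executor;
        this.rerouteWork = rerouteWork;
    }

    /** Requests a reroute; the listener runs once the reroute covering this request has finished. */
    void reroute(Runnable onCompletion) {
        final List<Runnable> batch;
        synchronized (mutex) {
            if (pendingListeners != null) {
                pendingListeners.add(onCompletion); // join the batch that is already queued
                return;
            }
            batch = pendingListeners = new ArrayList<>();
            batch.add(onCompletion);
        }
        executor.execute(() -> {
            synchronized (mutex) {
                pendingListeners = null; // requests arriving from now on start a new batch
            }
            rerouteWork.run();
            batch.forEach(Runnable::run); // notify every caller that joined this batch
        });
    }
}
```

In this sketch several callers that ask for a reroute before the queued task starts share a single execution, and each of them is still told when it completes, which is what lets a caller such as the monitor time its next check from the completion.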

Contributor left a comment

I've left two small asks. Looking good o.w.

@@ -379,7 +379,7 @@ public void clusterStatePublished(ClusterChangedEvent clusterChangedEvent) {
     if (logger.isTraceEnabled()) {
         logger.trace("{}, scheduling a reroute", reason);
     }
-    routingService.reroute(reason);
+    routingService.reroute(reason, ActionListener.wrap(() -> logger.trace("{}, reroute completed", reason)));

@ywelsch (Contributor) commented Jun 26, 2019

this also logs the same line on an exception :/
I would prefer two different log lines, and the failure one with the exception (same for other places in this PR)
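
The point about separate log lines can be sketched as follows. The `CompletionListener` interface below is a minimal stand-in with the same success/failure shape as a typical listener, defined here only so the example is self-contained; it is not the Elasticsearch `ActionListener`, and the log sink is a plain `Consumer<String>` rather than a real logger.

```java
import java.util.function.Consumer;

/** Minimal stand-in for a success/failure listener; not the Elasticsearch class. */
interface CompletionListener<T> {
    void onResponse(T response);

    void onFailure(Exception e);

    /** Builds a listener from separate success and failure handlers. */
    static <T> CompletionListener<T> wrap(Consumer<T> success, Consumer<Exception> failure) {
        return new CompletionListener<T>() {
            @Override
            public void onResponse(T response) {
                success.accept(response);
            }

            @Override
            public void onFailure(Exception e) {
                failure.accept(e);
            }
        };
    }
}

final class RerouteLogging {
    /** Returns a listener that logs completion and failure as two distinct lines. */
    static CompletionListener<Void> loggingListener(String reason, Consumer<String> log) {
        return CompletionListener.wrap(
            ignored -> log.accept(reason + ", reroute completed"),
            e -> log.accept(reason + ", reroute failed: " + e)); // the failure line carries the exception
    }
}
```

A single-callback wrapper runs the same code on both paths, so a failed reroute would be reported as completed; two handlers keep the outcomes distinguishable in the logs.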

@DaveCTurner (Author, Contributor) commented Jun 26, 2019

Fixed in ce5946b

if (nodes.contains(node) == false) {
nodeHasPassedWatermark.remove(node);
}

@ywelsch (Contributor) commented Jun 26, 2019

assert that rerouteAction is set?

@DaveCTurner (Author, Contributor) commented Jun 26, 2019

Fixed in c1d6ee0

DaveCTurner added 3 commits Jun 26, 2019

@DaveCTurner (Contributor, Author) commented Jun 26, 2019

@elasticmachine please run elasticsearch-ci/2

ywelsch self-requested a review Jun 27, 2019

Contributor left a comment

LGTM

…k-threshold-monitor
DaveCTurner merged commit 448acea into elastic:master Jun 30, 2019

8 checks passed
CLA: All commits in pull request signed
elasticsearch-ci/1: Build finished.
elasticsearch-ci/2: Build finished.
elasticsearch-ci/bwc: Build finished.
elasticsearch-ci/default-distro: Build finished.
elasticsearch-ci/docbldesx: Build finished.
elasticsearch-ci/oss-distro-docs: Build finished.
elasticsearch-ci/packaging-sample: Build finished.

DaveCTurner deleted the DaveCTurner:2019-06-19-avoid-parallel-rerouting-in-disk-threshold-monitor branch Jun 30, 2019
DaveCTurner added a commit that referenced this pull request Jun 30, 2019
Today the `DiskThresholdMonitor` limits the frequency with which it submits
reroute tasks, but it might still submit these tasks faster than the master can
process them if, for instance, each reroute takes over 60 seconds. This causes
a problem since the reroute task runs with priority `IMMEDIATE` and is always
scheduled when there is a node over the high watermark, so this can starve any
other pending tasks on the master.

This change avoids further updates from the monitor while its last task(s) are
still in progress, and it measures the time of each update from the completion
time of the reroute task rather than its start time, to allow a larger window
for other tasks to run.

It also now makes use of the `RoutingService` to submit the reroute task, in
order to batch this task with any other pending reroutes. It enhances the
`RoutingService` to notify its listeners on completion.

Fixes #40174
Relates #42559