
Improve work stealing for scaling situations #4920

Draft: wants to merge 4 commits into base: main
Conversation

@fjetter (Member) commented Jun 15, 2021

These are a few preliminary fixes which would close #4471. While this does not necessarily account for opportunity cost as I described in #4471 (comment), it still remedies the situation by

  1. Allowing idle workers to be part of the decision when scheduling a task. The worker objective will take care of the rest and should assign the task to the proper worker if it is not overloaded. This might backfire and should be tested in another context.
  2. Relaxing the work stealing ratio calculation, which emits sentinel values in situations that are arguably not ones we should skip. As I argued in Poor work scheduling when cluster adapts size #4471 (comment), there are situations where we should consider stealing even if the task is incredibly cheap to compute, since it might allow for further parallelism, in particular if the cluster is partially idling.
  3. A minor tuning not connected to the problem: I chose to pick the thief not based on round robin but on the worker objective, as we do for the initial decision. We might need to tweak this since it doesn't account for in-flight occupancy and might cause another overload of a worker. (A rough sketch combining points 1 and 3 appears below.)

Tests missing (TODO), feedback welcome
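A rough sketch of how points 1 and 3 combine (the helper and parameter names are illustrative, not the PR's actual code): treat idle workers as potential thieves alongside workers that already hold a dependency, then pick the thief via the scheduler's worker objective instead of round robin.

```python
from functools import partial

def pick_thief(scheduler, ts, idle_workers, dep_holders):
    # Candidate thieves: idle workers plus workers already holding a dependency.
    thieves = set(idle_workers) | set(dep_holders)
    if not thieves:
        return None
    # worker_objective prefers the worker where the task can start earliest.
    return min(thieves, key=partial(scheduler.worker_objective, ts))
```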

Review thread on distributed/stealing.py (outdated, resolved):

```python
thief = min(
    thieves, key=partial(self.scheduler.worker_objective, ts)
)
```
Member:

How many thieves can there be here? If the answer is "maybe a lot" then we might need to be careful here.

Member:

I'm also curious how much impact this choice in particular had on the workload we're testing against.

@fjetter (Member Author) replied Jun 15, 2021:

How many thieves can there be here? If the answer is "maybe a lot" then we might need to be careful here.

Worst case would be if all but one worker are idling; then we'd have W-1 potential thieves. I thought about sampling this down to 20 or something like that to avoid any extreme cases. Sampling and picking the best of the sample still sounds better than random. However...

I'm also curious how much impact this choice in particular had on the workload we're testing against.

Nothing at all. This is the one change which didn't actually impact this particular example; it just felt right. That's one of the things I want to test on different workloads.

Member Author:

In particular for workloads with actual payload data. If the dependents don't have data, this is a rather extreme stealing edge case, and I'd like to verify that this also behaves nicely for proper payloads.

Member:

In decide_worker we choose the best worker if there are less than 20 workers and we choose an incrementing worker if there are more than 20 workers.

In practice though I doubt that it matters much. The difference in occupancy between thieves and victims means that we're making the world significantly better with the dumb change. If this results in a suboptimal world that's ok... we'll just improve it again in a future cycle if it becomes a problem.
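A schematic sketch of the decide_worker pattern described above (not the actual distributed implementation): exhaustive search by the objective when the candidate pool is small, a cheap incrementing choice when it is large.

```python
def pick_worker(candidates, objective, counter, cutoff=20):
    workers = list(candidates)
    if len(workers) < cutoff:
        return min(workers, key=objective)    # best worker by the objective
    return workers[counter % len(workers)]    # incrementing / round-robin choice
```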

Member Author:

In decide_worker we choose the best worker if there are less than 20 workers and we choose an incrementing worker if there are more than 20 workers.

I also thought about just using decide_worker. After all, why would the logic be any different (given that decide_worker takes a whitelist, i.e. the thieves)? However, the important logic is in the objective function, which is why I picked that. If this thing turns out to be valuable, we can have a closer look.

In practice though I doubt that it matters much. The difference in occupancy between thieves and victims means that we're making the world significantly better with the dumb change.

True. As I said, the current problem doesn't actually trigger anything where this would matter. If I cannot prove an improvement, I'll remove it again.

The difference in occupancy

The only thing I would like to leverage here is to pick a thief with dependencies, if one exists. That's implicitly done by the worker objective but could also be made explicit. My gut tells me this can make a difference, but you might be right that this is only marginal.

we'll just improve it again in a future cycle if it becomes a problem.

There will not be another cycle to correct this unless the worker becomes saturated soon-ish, will there? Non-saturated workers are not victimized.

Member:

There will not be another cycle to correct this unless the worker becomes saturated soon-ish, will there? Non-saturated workers are not victimized.

Correct, but the stolen task probably won't be run immediately anyway. It's probably at the end of the queue. As the computation continues progressing and some workers become more free before others the saturated/non-saturated split will grow and trigger another cycle of transfers if necessary.

Review thread on distributed/scheduler.py (outdated, resolved).
@fjetter (Member Author) commented Jun 15, 2021:

FYI: tests break because the stealing now behaves differently and the tests reflect that (good!). I just wanted to get this out early in case somebody has input already.

```python
# previous signature
ts: TaskState, all_workers, valid_workers: set, objective
# proposed signature
ts: TaskState,
all_workers: set,
valid_workers: Optional[set],
```
Member:

Cython doesn't handle typing objects like Optional yet, though it already happily accepts None and checks for that.

Suggested change:

```diff
-valid_workers: Optional[set],
+valid_workers: set,
```
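A minimal sketch of the Cython-friendly pattern being suggested (the body is heavily simplified and illustrative, not the PR's actual decide_worker): annotate with the concrete type and keep an explicit None check at runtime instead of relying on typing.Optional.

```python
def decide_worker(ts, all_workers: set, valid_workers: set, objective):
    # valid_workers is annotated as `set` but may still be passed as None to
    # mean "no restriction"; the explicit check replaces Optional[set].
    candidates = all_workers if valid_workers is None else valid_workers
    return min(candidates, key=objective)
```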

Member Author:

I added Optional mostly for type checking, e.g. in IDEs. Is Cython having trouble with this, or does it simply ignore it?

@jakirkham (Member) left a comment:

AIUI these could be kind of expensive if we don't need them, which is why the original code didn't coerce them to sets. However, it was that way even prior to Cythonizing the Scheduler.

Two further review threads on distributed/scheduler.py (outdated, resolved).
@mrocklin (Member):

Summarizing things a bit here, I'm seeing two changes:

  1. We don't only look at workers with relevant dependencies, we also consider idle workers. I agree that this is a big benefit.
  2. We don't do a round robin, but instead use the worker objective. I agree that this is often a benefit, but I am concerned about introducing checks whose cost depends on the number of workers for each task. I think that we may want to defend against this. There are a few options:
    • Round-robin defends against this but results in suboptimal scheduling. One answer is that we keep doing round-robin until we identify that it results in user problems. I wouldn't be surprised if it never did over the lifetime of a computation, because we'll continue correcting things.
    • Use the worker objective if the number of thieves is small (less than twenty) but fall back to round robin otherwise (we do this elsewhere).
    • Select a random sample of the thieves (maybe ten?) and then use the worker objective on that sample (idea from @gjoseph92). Maybe this gives us the best of both worlds? A sketch of this option follows below.
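A hedged sketch of the third option (illustrative names; worker_objective is the scheduler method referenced in this thread): sample the thieves first, then apply the worker objective to the sample.

```python
import random
from functools import partial

def pick_thief(scheduler, ts, thieves, sample_size=10):
    candidates = list(thieves)
    if len(candidates) > sample_size:
        # Cap the per-task cost: only score a random subset of the thieves.
        candidates = random.sample(candidates, sample_size)
    return min(candidates, key=partial(scheduler.worker_objective, ts))
```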

@fjetter (Member Author) commented Jun 17, 2021:

Change 2B: relax the sentinel values in steal_time_ratio. While it is OK to use a short path for no-dependency tasks to save compute cost, we should not disallow work stealing for generally small or expensive tasks (<0.005 or cost_multiplier > 100). I believe these two cases should not be short-circuited but rather be dealt with using the usual logic. If it is too expensive and doesn't pay off, they are not stolen, but there is a remote chance that it would still pay off to move them in extreme situations (as in this example). If we do not remove these sentinels, they are forever blacklisted and will never be stolen. (See the paraphrased sketch below.)
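A paraphrased sketch of the sentinel behaviour being discussed (not the verbatim distributed/stealing.py code; the level calculation in particular is simplified). Returning (None, None) marks a task as never stealable; the proposal is to drop the two commented-out short circuits and let the usual cost/level logic decide instead.

```python
def steal_time_ratio(ts, compute_time, nbytes_deps, bandwidth, latency=0.001):
    if not ts.dependencies:
        return 0, 0                    # nothing to transfer, trivially stealable
    # if compute_time < 0.005:         # sentinel the PR proposes to drop
    #     return None, None
    transfer_time = nbytes_deps / bandwidth + latency
    cost_multiplier = transfer_time / max(compute_time, 1e-9)
    # if cost_multiplier > 100:        # sentinel the PR proposes to drop
    #     return None, None
    level = min(int(cost_multiplier), 14)   # simplified binning into stealable levels
    return cost_multiplier, level
```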

Select a random sample of the thieves (maybe ten?) and then use the worker objective on that sample (idea from @gjoseph92)

I like this idea so much that I wish it was my own ;) #4920 (comment)

@mrocklin (Member):

we should not disallow work stealing for generally small or expensive tasks (<0.005 or cost_multiplier > 100). I believe these two cases should not be short-circuited but rather be dealt with using the usual logic

That's ok with me

I like this idea so much that I wish it was my own ;) #4920 (comment)

Whoops! Well then it sounds like it must be a great idea :)

@fjetter (Member Author) commented Jun 17, 2021:

Whoops! Well then it sounds like it must be a great idea :)

I'm not entirely certain to be honest. This is one of the cases where I would really love some data. After all, what are the chances that a sampling would keep the one or two workers which carry dependencies? If I cannot back this up with data, I'm inclined to roll this back.

However, overall I'd really love to have some micro benchmarks on this to know how bad this actually is. The comments the two of you left behind about performance make 100% sense but it is all a bit fuzzy and sometimes hard to spot.

For instance, I would've thought that the following operations would be A) faster and B) identical [screenshot of the micro-benchmark omitted], but I was wrong. Twice.

Anyhow, this is off-topic, but I would love to discuss the topic of benchmarking with a few people (maybe at the next dev meeting).

@mrocklin (Member):

After all, what are the chances that a sampling would keep the one or two workers which carry dependencies?

Ah, my thinking (or rather Gabe's thinking) was that you would keep the workers with dependencies in the set, but then mix in a few others from outside of the set.
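A sketch of that refinement (illustrative helper names; who_has is the scheduler's per-task bookkeeping of which workers hold a result): keep every candidate thief that already holds a dependency and top the sample up with a few workers from outside that group before applying the worker objective.

```python
import random
from functools import partial

def sample_thieves(ts, thieves, n_extra=5):
    # Thieves that already hold at least one dependency always stay in the set.
    holders = {ws for dep in ts.dependencies for ws in dep.who_has if ws in thieves}
    others = [ws for ws in thieves if ws not in holders]
    extra = random.sample(others, min(n_extra, len(others)))
    return holders | set(extra)

def pick_thief(scheduler, ts, thieves):
    # Assumes `thieves` is non-empty; falls back to the full set if sampling
    # leaves nothing to choose from.
    candidates = sample_thieves(ts, thieves) or set(thieves)
    return min(candidates, key=partial(scheduler.worker_objective, ts))
```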

@fjetter (Member Author) commented Jun 17, 2021:

Ah, my thinking (or rather Gabe's thinking) was that you would keep the workers with dependencies in the set

Right... that sounds straightforward. There is a saying in Germany, "Manchmal sieht man den Wald vor lauter Bäumen nicht", which translates loosely to "Sometimes you lose sight of the forest because there are too many trees."
No idea if this makes sense to you, but I think that's a good idea :D (and better than mine, after all)

@mrocklin (Member):

Sometimes you lose sight of the forest because there are too many trees

This saying is commonly known in English as well. It's a good saying.

@mrocklin (Member) left a review thread on distributed/scheduler.py (outdated, resolved).
@fjetter (Member Author) commented Jun 21, 2021:

(Code still needs to be updated)

I removed all the changes around decide_worker and solely focused on the blacklisting of cost_multiplier > 100 and duration < 0.005 tasks (which are often the same). Below you can see the stealing tab (the tab includes fixes, PR upcoming), which shows the stealing activity and occupancy over time.

main / default:

[Screenshot 2021-06-21 at 17 31 32 (stealing dashboard tab)]

Without blacklisting "expensive" tasks:

[Screenshot 2021-06-21 at 17 27 38 (stealing dashboard tab)]

What we can see is that the occupancy (timeseries chart, top) is balanced much more effectively if the small tasks are not blacklisted. I do not have proper perf reports or screenshots, but the task stream density behaves similarly: with well-balanced occupancy the task stream is much denser.

@mrocklin (Member):

I spoke with @fjetter about this after replicating some of this work in #6115.

It sounds like there isn't a major blocker here. I'm going to see if I can push it through today. @fjetter if there are any major concerns you have on this PR that weren't listed above then please let me know.

@github-actions (Contributor):

Unit Test Results

16 files ±0 · 16 suites ±0 · 8h 3m 25s ⏱️ (+51m 57s)
2 737 tests +3: 2 620 ✔️ passed (−32), 81 💤 skipped (±0), 36 failed (+35)
21 781 runs +24: 20 596 ✔️ passed (−85), 1 080 💤 skipped (+5), 105 failed (+104)

For more details on these failures, see this check.

Results for commit 31e3b9f, compared against base commit 6a3cbd3.

Comment on lines +457 to +458:

```python
if not thief:
    thief = thieves[i % len(thieves)]
```
Collaborator:

nit: might be easier to read if this was moved into _maybe_pick_thief (and it was no longer maybe).

Member Author:

Right, I figured this is nicer because it felt weird passing i to _maybe_pick_thief.

```python
# If there are potential thieves with dependencies we
# should prefer them and pick the one which works best.
# Otherwise just random/round robin
if thieves_with_data:
```
Collaborator:

If, say, just 1 idle worker holds dependencies, is it possible for that worker to get slammed with new tasks because of this? What has to happen between picking a worker as a thief and it getting removed from the idle set?

Member Author:

If, say, just 1 idle worker holds dependencies, is it possible for that worker to get slammed with new tasks because of this?

Yes, this may hurt, particularly for sub-topologies like

```mermaid
flowchart BT
A --> B1
A --> B2
A --> B3
```

where A is very small and B* is very compute heavy, i.e. steal_ratio will always be small and tasks are even allowed to be stolen from "public", i.e. non-saturated workers. I believe for other cases, i.e. if tasks are just stolen from saturated workers, this should be OK since decide_worker is much smarter with initial task placement.

What has to happen between picking a worker as a thief and it getting removed from the idle set?

The steal request must be confirmed, which updates occupancies and recalculates the idle set. Somewhere on this PR a comment of mine suggests that _maybe_pick_thief should incorporate in_flight_occupancy to account for the time until this happens (a hedged sketch of that idea follows below).
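A hedged sketch of that suggestion (the adjustment itself is illustrative): bias the worker objective by the occupancy already committed through unconfirmed steal requests, so a single idle dependency-holder is not picked over and over before the confirmations arrive. in_flight_occupancy is the WorkStealing bookkeeping referenced above.

```python
def adjusted_objective(stealing, scheduler, ts, ws):
    # worker_objective returns a tuple whose first element is the estimated
    # start time; add the occupancy of not-yet-confirmed steals for this worker.
    start_time, *rest = scheduler.worker_objective(ts, ws)
    return (start_time + stealing.in_flight_occupancy[ws], *rest)
```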

@fjetter (Member Author) commented Apr 14, 2022:

if there are any major concerns you have on this PR that weren't listed above then please let me know.

Dropping the sentinels will emphasize the importance of proper measurements (#6115 (comment)) which is where I ran out of time last year.

I myself would've only picked this up again after having a few solid benchmarks up and running; alternatively a lot of manual testing, or maybe both.

Successfully merging this pull request may close these issues: Poor work scheduling when cluster adapts size (#4471).

4 participants