
Improve work stealing for scaling situations #4920

Draft: wants to merge 4 commits into base: main
Conversation

@fjetter (Member) commented Jun 15, 2021

These are a few preliminary fixes which would close #4471. While this does not necessarily account for opportunity cost as I described in #4471 (comment), it still remedies the situation by

  1. Allowing idle workers to be part of the decision when scheduling a task. The worker objective will take care of the rest and should assign the task to the proper worker if it is not overloaded. This might backfire and should be tested in another context.
  2. Relaxing the work stealing ratio calculation, which emits sentinel values in situations that are arguably not ones we should skip. As I argued in Poor work scheduling when cluster adapts size #4471 (comment), there are situations where we should consider stealing even if the task is incredibly cheap to compute, since it might allow for further parallelism, in particular if the cluster is partially idling.
  3. A minor tuning not connected to the problem: I chose to pick the thief not based on round robin but on the worker objective, as we do for the initial decision. We might need to tweak this since it doesn't account for in-flight occupancy and might cause another overload of a worker. (A rough sketch combining points 1 and 3 appears below.)

Tests missing (TODO), feedback welcome
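A rough sketch of how points 1 and 3 combine (the helper and parameter names are illustrative, not the PR's actual code): treat idle workers as potential thieves alongside workers that already hold a dependency, then pick the thief via the scheduler's worker objective instead of round robin.

```python
from functools import partial

def pick_thief(scheduler, ts, idle_workers, dep_holders):
    # Candidate thieves: idle workers plus workers already holding a dependency.
    thieves = set(idle_workers) | set(dep_holders)
    if not thieves:
        return None
    # worker_objective prefers the worker where the task can start earliest.
    return min(thieves, key=partial(scheduler.worker_objective, ts))
```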

Review thread on distributed/stealing.py (outdated, resolved):

```python
thief = min(
    thieves, key=partial(self.scheduler.worker_objective, ts)
)
```
Member:

How many thieves can there be here? If the answer is "maybe a lot" then we might need to be careful here.

Member:

I'm also curious how much impact this choice in particular had on the workload we're testing against.

@fjetter (Member Author) replied Jun 15, 2021:

How many thieves can there be here? If the answer is "maybe a lot" then we might need to be careful here.

Worst case would be if all but one worker are idling; then we'd have W-1 potential thieves. I thought about sampling this down to 20 or something like that to avoid any extreme cases. Sampling and picking the best of the sample still sounds better than random. However...

I'm also curious how much impact this choice in particular had on the workload we're testing against.

Nothing at all. This is the one change which didn't actually impact this particular example; it just felt right. That's one of the things I want to test on different workloads.

Member Author:

In particular for workloads with actual payload data. If the dependents don't have data, this is a rather extreme stealing edge case, and I'd like to verify that this also behaves nicely for proper payloads.

Member:

In decide_worker we choose the best worker if there are less than 20 workers and we choose an incrementing worker if there are more than 20 workers.

In practice though I doubt that it matters much. The difference in occupancy between thieves and victims means that we're making the world significantly better with the dumb change. If this results in a suboptimal world that's ok... we'll just improve it again in a future cycle if it becomes a problem.
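A schematic sketch of the decide_worker pattern described above (not the actual distributed implementation): exhaustive search by the objective when the candidate pool is small, a cheap incrementing choice when it is large.

```python
def pick_worker(candidates, objective, counter, cutoff=20):
    workers = list(candidates)
    if len(workers) < cutoff:
        return min(workers, key=objective)    # best worker by the objective
    return workers[counter % len(workers)]    # incrementing / round-robin choice
```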

Member Author:

In decide_worker we choose the best worker if there are less than 20 workers and we choose an incrementing worker if there are more than 20 workers.

I also thought about just using decide_worker. After all, why would the logic be any different (given that decide_worker takes a whitelist, i.e. the thieves)? However, the important logic is in the objective function, which is why I picked that. If this thing turns out to be valuable, we can have a closer look.

In practice though I doubt that it matters much. The difference in occupancy between thieves and victims means that we're making the world significantly better with the dumb change.

True. As I said, the current problem doesn't actually trigger anything where this would matter. If I cannot prove an improvement, I'll remove it again.

The difference in occupancy

The only thing I would like to leverage here is to pick a thief with dependencies, if one exists. That's implicitly done by the worker objective but could also be made explicit. My gut tells me this can make a difference, but you might be right that this is only marginal.

we'll just improve it again in a future cycle if it becomes a problem.

There will not be another cycle to correct this unless the worker becomes saturated soon-ish, will there? Non-saturated workers are not victimized.

Member:

There will not be another cycle to correct this unless the worker becomes saturated soon-ish, will there? Non-saturated workers are not victimized.

Correct, but the stolen task probably won't be run immediately anyway. It's probably at the end of the queue. As the computation continues progressing and some workers become more free before others the saturated/non-saturated split will grow and trigger another cycle of transfers if necessary.

Review thread on distributed/scheduler.py (outdated, resolved).
@fjetter (Member Author) commented Jun 15, 2021:

FYI: tests break because the stealing now behaves differently and the tests reflect that (good!). I just wanted to get this out early in case somebody has input already.

```python
# previous signature
ts: TaskState, all_workers, valid_workers: set, objective
# proposed signature
ts: TaskState,
all_workers: set,
valid_workers: Optional[set],
```
Member:

Cython doesn't handle typing objects like Optional yet, though it already happily accepts None and checks for that.

Suggested change:

```diff
-valid_workers: Optional[set],
+valid_workers: set,
```
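A minimal sketch of the Cython-friendly pattern being suggested (the body is heavily simplified and illustrative, not the PR's actual decide_worker): annotate with the concrete type and keep an explicit None check at runtime instead of relying on typing.Optional.

```python
def decide_worker(ts, all_workers: set, valid_workers: set, objective):
    # valid_workers is annotated as `set` but may still be passed as None to
    # mean "no restriction"; the explicit check replaces Optional[set].
    candidates = all_workers if valid_workers is None else valid_workers
    return min(candidates, key=objective)
```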

Member Author:

I added Optional mostly for type checking, e.g. in IDEs. Is Cython having trouble with this, or does it simply ignore it?

@jakirkham (Member) left a comment:

AIUI these could be kind of expensive if we don't need them, which is why the original code didn't coerce them to sets. However, it was that way even prior to Cythonizing the Scheduler.

Two further review threads on distributed/scheduler.py (outdated, resolved).
@mrocklin (Member):

Summarizing things a bit here, I'm seeing two changes:

  1. We don't only look at workers with relevant dependencies, we also consider idle workers. I agree that this is a big benefit.
  2. We don't do a round robin, but instead use the worker objective. I agree that this is often a benefit, but I am concerned about introducing checks whose cost depends on the number of workers for each task. I think that we may want to defend against this. There are a few options:
    • Round-robin defends against this but results in suboptimal scheduling. One answer is that we keep doing round-robin until we identify that it results in user problems. I wouldn't be surprised if it never did over the lifetime of a computation, because we'll continue correcting things.
    • Use the worker objective if the number of thieves is small (less than twenty) but fall back to round robin otherwise (we do this elsewhere).
    • Select a random sample of the thieves (maybe ten?) and then use the worker objective on that sample (idea from @gjoseph92). Maybe this gives us the best of both worlds? A sketch of this option follows below.
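A hedged sketch of the third option (illustrative names; worker_objective is the scheduler method referenced in this thread): sample the thieves first, then apply the worker objective to the sample.

```python
import random
from functools import partial

def pick_thief(scheduler, ts, thieves, sample_size=10):
    candidates = list(thieves)
    if len(candidates) > sample_size:
        # Cap the per-task cost: only score a random subset of the thieves.
        candidates = random.sample(candidates, sample_size)
    return min(candidates, key=partial(scheduler.worker_objective, ts))
```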

@fjetter (Member Author) commented Jun 17, 2021:

Change 2B: relax the sentinel values in steal_time_ratio. While it is OK to use a short path for no-dependency tasks to save compute cost, we should not disallow work stealing for generally small or expensive tasks (<0.005 or cost_multiplier > 100). I believe these two cases should not be short-circuited but rather be dealt with using the usual logic. If it is too expensive and doesn't pay off, they are not stolen, but there is a remote chance that it would still pay off to move them in extreme situations (as in this example). If we do not remove these sentinels, they are forever blacklisted and will never be stolen. (See the paraphrased sketch below.)
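A paraphrased sketch of the sentinel behaviour being discussed (not the verbatim distributed/stealing.py code; the level calculation in particular is simplified). Returning (None, None) marks a task as never stealable; the proposal is to drop the two commented-out short circuits and let the usual cost/level logic decide instead.

```python
def steal_time_ratio(ts, compute_time, nbytes_deps, bandwidth, latency=0.001):
    if not ts.dependencies:
        return 0, 0                    # nothing to transfer, trivially stealable
    # if compute_time < 0.005:         # sentinel the PR proposes to drop
    #     return None, None
    transfer_time = nbytes_deps / bandwidth + latency
    cost_multiplier = transfer_time / max(compute_time, 1e-9)
    # if cost_multiplier > 100:        # sentinel the PR proposes to drop
    #     return None, None
    level = min(int(cost_multiplier), 14)   # simplified binning into stealable levels
    return cost_multiplier, level
```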

Select a random sample of the thieves (maybe ten?) and then use the worker objective on that sample (idea from @gjoseph92)

I like this idea so much that I wish it was my own ;) #4920 (comment)

@mrocklin (Member):

we should not disallow work stealing for generally small or expensive tasks (<0.005 or cost_multiplier > 100). I believe these two cases should not be short-circuited but rather be dealt with using the usual logic

That's ok with me

I like this idea so much that I wish it was my own ;) #4920 (comment)

Whoops! Well then it sounds like it must be a great idea :)

@fjetter (Member Author) commented Jun 17, 2021:

Whoops! Well then it sounds like it must be a great idea :)

I'm not entirely certain to be honest. This is one of the cases where I would really love some data. After all, what are the chances that a sampling would keep the one or two workers which carry dependencies? If I cannot back this up with data, I'm inclined to roll this back.

However, overall I'd really love to have some micro benchmarks on this to know how bad this actually is. The comments the two of you left behind about performance make 100% sense but it is all a bit fuzzy and sometimes hard to spot.

For instance, I would've thought that the following operations would be A) faster and B) identical [screenshot of the micro-benchmark omitted], but I was wrong. Twice.

Anyhow, this is off-topic, but I would love to discuss the topic of benchmarking with a few people (maybe at the next dev meeting).

@mrocklin (Member):

After all, what are the chances that a sampling would keep the one or two workers which carry dependencies?

Ah, my thinking (or rather Gabe's thinking) was that you would keep the workers with dependencies in the set, but then mix in a few others from outside of the set.
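A sketch of that refinement (illustrative helper names; who_has is the scheduler's per-task bookkeeping of which workers hold a result): keep every candidate thief that already holds a dependency and top the sample up with a few workers from outside that group before applying the worker objective.

```python
import random
from functools import partial

def sample_thieves(ts, thieves, n_extra=5):
    # Thieves that already hold at least one dependency always stay in the set.
    holders = {ws for dep in ts.dependencies for ws in dep.who_has if ws in thieves}
    others = [ws for ws in thieves if ws not in holders]
    extra = random.sample(others, min(n_extra, len(others)))
    return holders | set(extra)

def pick_thief(scheduler, ts, thieves):
    # Assumes `thieves` is non-empty; falls back to the full set if sampling
    # leaves nothing to choose from.
    candidates = sample_thieves(ts, thieves) or set(thieves)
    return min(candidates, key=partial(scheduler.worker_objective, ts))
```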

@fjetter (Member Author) commented Jun 17, 2021:

Ah, my thinking (or rather Gabe's thinking) was that you would keep the workers with dependencies in the set

Right... that sounds straightforward. There is a saying in Germany, "Manchmal sieht man den Wald vor lauter Bäumen nicht", which translates loosely to "Sometimes you lose sight of the forest because there are too many trees."
No idea if this makes sense to you, but I think that's a good idea :D (and better than mine, after all)

@mrocklin (Member):

Sometimes you lose sight of the forest because there are too many trees

This saying is commonly known in English as well. It's a good saying.

@mrocklin (Member) left a review thread on distributed/scheduler.py (outdated, resolved).
@fjetter (Member Author) commented Jun 21, 2021:

(Code still needs to be updated)

I removed all the changes around decide_worker and solely focused on the blacklisting of cost_multiplier > 100 and duration < 0.005 tasks (which are often the same). Below you can see the stealing tab (the tab includes fixes, PR upcoming), which shows the stealing activity and occupancy over time.

main / default:

[Screenshot 2021-06-21 at 17 31 32 (stealing dashboard tab)]

Without blacklisting "expensive" tasks:

[Screenshot 2021-06-21 at 17 27 38 (stealing dashboard tab)]

What we can see is that the occupancy (timeseries chart, top) is balanced much more effectively if the small tasks are not blacklisted. I do not have proper perf reports or screenshots, but the task stream density behaves similarly: with well-balanced occupancy the task stream is much denser.

@mrocklin (Member):

I spoke with @fjetter about this after replicating some of this work in #6115.

It sounds like there isn't a major blocker here. I'm going to see if I can push it through today. @fjetter if there are any major concerns you have on this PR that weren't listed above then please let me know.

@github-actions (Contributor):

Unit Test Results

16 files ±0 · 16 suites ±0 · 8h 3m 25s ⏱️ (+51m 57s)
2 737 tests +3: 2 620 ✔️ passed (−32), 81 💤 skipped (±0), 36 failed (+35)
21 781 runs +24: 20 596 ✔️ passed (−85), 1 080 💤 skipped (+5), 105 failed (+104)

For more details on these failures, see this check.

Results for commit 31e3b9f, compared against base commit 6a3cbd3.

Comment on lines +457 to +458:

```python
if not thief:
    thief = thieves[i % len(thieves)]
```
Collaborator:

nit: might be easier to read if this was moved into _maybe_pick_thief (and it was no longer maybe).

Member Author:

Right, I figured this is nicer because it felt weird passing i to _maybe_pick_thief.

```python
# If there are potential thieves with dependencies we
# should prefer them and pick the one which works best.
# Otherwise just random/round robin
if thieves_with_data:
```
Collaborator:

If, say, just 1 idle worker holds dependencies, is it possible for that worker to get slammed with new tasks because of this? What has to happen between picking a worker as a thief and it getting removed from the idle set?

Member Author:

If, say, just 1 idle worker holds dependencies, is it possible for that worker to get slammed with new tasks because of this?

Yes, this may hurt, particularly for sub-topologies like

```mermaid
flowchart BT
A --> B1
A --> B2
A --> B3
```

where A is very small and B* is very compute heavy, i.e. steal_ratio will always be small and tasks are even allowed to be stolen from "public", i.e. non-saturated workers. I believe for other cases, i.e. if tasks are just stolen from saturated workers, this should be OK since decide_worker is much smarter with initial task placement.

What has to happen between picking a worker as a thief and it getting removed from the idle set?

The steal request must be confirmed, which updates occupancies and recalculates the idle set. Somewhere on this PR a comment of mine suggests that _maybe_pick_thief should incorporate in_flight_occupancy to account for the time until this happens (a hedged sketch of that idea follows below).
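A hedged sketch of that suggestion (the adjustment itself is illustrative): bias the worker objective by the occupancy already committed through unconfirmed steal requests, so a single idle dependency-holder is not picked over and over before the confirmations arrive. in_flight_occupancy is the WorkStealing bookkeeping referenced above.

```python
def adjusted_objective(stealing, scheduler, ts, ws):
    # worker_objective returns a tuple whose first element is the estimated
    # start time; add the occupancy of not-yet-confirmed steals for this worker.
    start_time, *rest = scheduler.worker_objective(ts, ws)
    return (start_time + stealing.in_flight_occupancy[ws], *rest)
```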

@fjetter (Member Author) commented Apr 14, 2022:

if there are any major concerns you have on this PR that weren't listed above then please let me know.

Dropping the sentinels will emphasize the importance of proper measurements (#6115 (comment)) which is where I ran out of time last year.

I myself would've only picked this up again after having a few solid benchmarks up and running; alternatively a lot of manual testing, or maybe both.

Successfully merging this pull request may close these issues: Poor work scheduling when cluster adapts size (#4471).

4 participants