Round-robin worker selection makes poor choices with worker-saturation > 1.0 #7197

Open
gjoseph92 opened this issue Oct 26, 2022 · 6 comments · May be fixed by #7248 or #7280
Labels: bug, scheduling

Comments

@gjoseph92 (Collaborator)

test_wait_first_completed is failing in #7191, with the worker-saturation value set to 1.1

@gen_cluster(client=True)
async def test_wait_first_completed(c, s, a, b):
    event = Event()
    x = c.submit(block_on_event, event)
    y = c.submit(block_on_event, event)
    z = c.submit(inc, 2)
    done, not_done = await wait([x, y, z], return_when="FIRST_COMPLETED")
    assert done == {z}
    assert not_done == {x, y}
    assert z.status == "finished"
    assert x.status == "pending"
    assert y.status == "pending"
    await event.set()

It works fine with 1.0, but fails with 1.1 because the round-up logic from #7116 allows workers to be oversaturated.

It blocks forever because the worker with 1 thread gets assigned [block_on_event, inc], and the worker with 2 threads gets assigned [block_on_event]. It should be the other way around.

The culprit seems to be the round-robin logic, which only applies in rare situations like this one, where the cluster is small but still larger than the TaskGroup being assigned:

# TODO if `is_rootish` would always return True for tasks without dependencies,
# we could remove all this logic. The rootish assignment logic would behave
# more or less the same as this, maybe without guaranteed round-robin though?
# This path is only reachable when `ts` doesn't have dependencies, but its
# group is also smaller than the cluster.
# Fastpath when there are no related tasks or restrictions
worker_pool = self.idle or self.workers
# FIXME idle and workers are SortedDict's declared as dicts
# because sortedcontainers is not annotated
wp_vals = cast("Sequence[WorkerState]", worker_pool.values())
n_workers: int = len(wp_vals)
if n_workers < 20:  # smart but linear in small case
    ws = min(wp_vals, key=operator.attrgetter("occupancy"))
    assert ws
    if ws.occupancy == 0:
        # special case to use round-robin; linear search
        # for next worker with zero occupancy (or just
        # land back where we started).
        wp_i: WorkerState
        start: int = self.n_tasks % n_workers
        i: int
        for i in range(n_workers):
            wp_i = wp_vals[(i + start) % n_workers]
            if wp_i.occupancy == 0:
                ws = wp_i
                break

If I update is_rootish like so:

diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index cf240240..802df12d 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -3043,6 +3043,8 @@ class SchedulerState:
         """
         if ts.resource_restrictions or ts.worker_restrictions or ts.host_restrictions:
             return False
+        if not ts.dependencies:
+            return True
         tg = ts.group
         # TODO short-circuit to True if `not ts.dependencies`?
         return (

the test passes.

cc @fjetter @crusaderky

@fjetter (Member) commented Oct 26, 2022

I'm not a big fan of the round robin in general

xref #6974

I'm a bit nervous about the change to is_rootish since I'm having a hard time estimating the impact


FWIW, I'm wondering whether this test even makes sense the way it is written. I'd like to see a permutation of the order in which we submit these tasks; I bet main would be failing as well.
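
Sketching what such a permuted-order test might look like (the test name and explicit keys are hypothetical; it reuses the Event, block_on_event, inc, wait, and gen_cluster helpers from the test above):

import itertools

@gen_cluster(client=True)
async def test_wait_first_completed_any_order(c, s, a, b):
    # Try every submission order of the two blocking tasks ("x", "y") and the quick task ("z").
    for order in sorted(itertools.permutations(["x", "y", "z"])):
        event = Event()
        suffix = "-".join(order)
        futs = {}
        for name in order:
            if name == "z":
                futs[name] = c.submit(inc, 2, key=f"inc-{suffix}")
            else:
                futs[name] = c.submit(block_on_event, event, key=f"{name}-{suffix}")
        done, not_done = await wait(list(futs.values()), return_when="FIRST_COMPLETED")
        assert done == {futs["z"]}, order
        await event.set()
        # Let the blocked tasks finish before starting the next permutation.
        await wait(list(futs.values()))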

@gjoseph92 (Collaborator, Author)

Yeah, I'm also not a fan of the round robin. The is_rootish change isn't necessarily what I think we should do; it's just a quick way of showing that the round-robin logic is the issue here, since the change makes these tasks skip that code path.

@gjoseph92 (Collaborator, Author)

For posterity, here's why this test currently fails with worker-saturation > 1.0:

We have 3 tasks, and a cluster with 3 threads—2 on one worker, 1 on another.

Even though the tasks are clearly root tasks (they have no dependencies, #7274), there aren't enough of them to be considered root-ish (#7273). So we don't use the queuing-related code path, even though queuing is enabled.

Instead, we select workers for them via this old round-robin logic. Then this happens:

  1. Select 1-threaded worker for first block_on_event task. All workers have 0 occupancy, so we pick the worker whose address comes first lexicographically. This is fine.
  2. Select 2-threaded worker for the second block_on_event task. It has the minimum occupancy. This is fine.
  3. Select 1-threaded worker for the inc task. This is bad; we should have picked the 2-threaded worker. But both workers have 0.5 occupancy right now, and we don't scale occupancy by the number of worker threads. So since they're equal, lexicographical sort is the tie-breaker, and we pick the first worker.

Why does this even pass with queuing off, or with worker-saturation == 1.0? Because in that case, once a task is assigned to the 1-threaded worker, it's removed from the idle set. So this round-robin code doesn't get the chance to make a bad choice, because that worker is no longer a candidate. With worker-saturation == 1.1, the worker isn't removed from idle, since it's allowed to be oversaturated.
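
As a sketch of that round-up behavior (my reading of #7116, not the scheduler's actual code), the saturation threshold is roughly ceil(nthreads * worker-saturation):

import math

def is_fully_saturated(n_processing: int, nthreads: int, saturation: float) -> bool:
    # A worker is considered full (and leaves `idle`) once it has at least
    # ceil(nthreads * saturation) tasks processing.
    return n_processing >= max(math.ceil(nthreads * saturation), 1)

# The 1-threaded worker after it receives the first block_on_event task:
print(is_fully_saturated(1, nthreads=1, saturation=1.0))  # True  -> leaves idle, can't get inc
print(is_fully_saturated(1, nthreads=1, saturation=1.1))  # False -> ceil(1.1) == 2, stays idle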

I'd argue this is simply a bug in the round-robin code path: it doesn't sort by quite the right criteria. It's inconsistent with worker_objective, for example. We shouldn't be relying on the implementation of idle to keep us from picking bad workers. If we just do this:

diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index eb5828bf..fd487cc8 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -2212,7 +2212,7 @@ class SchedulerState:
             wp_vals = cast("Sequence[WorkerState]", worker_pool.values())
             n_workers: int = len(wp_vals)
             if n_workers < 20:  # smart but linear in small case
-                ws = min(wp_vals, key=operator.attrgetter("occupancy"))
+                ws = min(wp_vals, key=lambda ws: ws.occupancy / ws.nthreads)
                 assert ws
                 if ws.occupancy == 0:
                     # special case to use round-robin; linear search

then the test passes at 1.0, 1.1, and inf.
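
A toy illustration of the tie-break (FakeWorker is a stand-in for WorkerState, just for the example):

from dataclasses import dataclass
import operator

@dataclass
class FakeWorker:
    address: str
    nthreads: int
    occupancy: float

# The situation in step 3 above: each worker has one 0.5s task processing.
a = FakeWorker("tcp://127.0.0.1:1234", nthreads=1, occupancy=0.5)
b = FakeWorker("tcp://127.0.0.1:1235", nthreads=2, occupancy=0.5)

# Raw occupancy ties, so min() keeps the first candidate (the 1-threaded worker):
print(min([a, b], key=operator.attrgetter("occupancy")).address)       # tcp://127.0.0.1:1234
# Normalizing by nthreads prefers the worker with an open thread:
print(min([a, b], key=lambda ws: ws.occupancy / ws.nthreads).address)  # tcp://127.0.0.1:1235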

@gjoseph92 (Collaborator, Author)

Another issue with this round-robin code path: it doesn't take memory into consideration. This test, to be added in #7248, fails on main right now both with queuing on and off:

@gen_cluster(
    client=True,
    nthreads=[("", 2)] * 3,
)
async def test_decide_worker_memory_tiebreaker_idle_cluster(c, s, *workers):
    big = c.submit(lambda: "x" * 4096)
    await big
    f2 = c.submit(inc, 1)
    await f2
    f3 = c.submit(inc, 2)
    await f3
    assert all(len(w.state.tasks) == 1 for w in workers), [
        w.state.tasks for w in workers
    ]
    last = c.submit(inc, 3)
    await last
    big_ts = s.tasks[big.key]
    last_ts = s.tasks[last.key]
    assert first(last_ts.who_has) is not first(big_ts.who_has)

This is the case @crusaderky cares about. In an idle cluster, if some workers have keys in memory and others don't, we shouldn't keep adding more tasks to the workers that already have data in memory.
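
One possible shape for such a tie-breaker, sketched against the min() call above (this alone wouldn't fix the test, since the zero-occupancy round-robin special case would still override it, but it shows the idea):

# Sketch: break occupancy ties by how much data the worker already holds,
# the same way worker_objective uses ws.nbytes as a secondary key.
ws = min(
    wp_vals,
    key=lambda ws: (ws.occupancy / ws.nthreads, ws.nbytes),
)

In the test above, all three workers have zero occupancy, but the worker holding the 4 KiB string would lose the nbytes tie-break, so last would land on one of the other workers.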

@gjoseph92 (Collaborator, Author)

Finally, the >20 workers case (which would be heavily exercised in most real-world, non-local clusters, especially with the Futures API) is only covered by one test: #7275. Obviously, both "pick the worker with open threads" and "pick the worker with the least memory" go out the window with >20 workers, since we're not taking the min at all, just picking round-robin.

Given that selecting the min worker only takes a tiny amount of time, even on larger clusters (#7246 (comment)), I'm not sure the >20 round-robin case is worth the worse decision-making and the complexity of another code path with different behavior and almost no testing.

That's why eliminating this round-robin code path and consolidating it into the decide_worker_rootish_* functions seems like the best long-term approach to me. Even starting with #6974 would improve the situation, since worker_objective already gets this occupancy and nbytes metric right.
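
For reference, my rough paraphrase of what worker_objective computes for a non-actor task (simplified from distributed/scheduler.py; see the real implementation for details):

def worker_objective_sketch(ts, ws, bandwidth):
    # Bytes of dependencies that would need to be transferred to this worker.
    comm_bytes = sum(
        dts.get_nbytes() for dts in ts.dependencies if ws not in dts.who_has
    )
    # Estimated start time: per-thread load plus transfer time...
    start_time = ws.occupancy / ws.nthreads + comm_bytes / bandwidth
    # ...with stored bytes as the tie-breaker.
    return (start_time, ws.nbytes)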

@gjoseph92 gjoseph92 linked a pull request Nov 9, 2022 that will close this issue
@gjoseph92 (Collaborator, Author)

Note that this was alleviated, but not entirely fixed, by #7278. (It's still not sorting by a great objective function, but totally-full candidates are now removed regardless of worker-saturation.)
