Round-robin worker selection makes poor choices with worker-saturation > 1.0 #7197

Open
gjoseph92 opened this issue Oct 26, 2022 · 6 comments · May be fixed by #7248 or #7280
Labels: bug, scheduling

Comments

@gjoseph92 (Collaborator)

test_wait_first_completed is failing in #7191, with the worker-saturation value set to 1.1

@gen_cluster(client=True)
async def test_wait_first_completed(c, s, a, b):
    event = Event()
    x = c.submit(block_on_event, event)
    y = c.submit(block_on_event, event)
    z = c.submit(inc, 2)
    done, not_done = await wait([x, y, z], return_when="FIRST_COMPLETED")
    assert done == {z}
    assert not_done == {x, y}
    assert z.status == "finished"
    assert x.status == "pending"
    assert y.status == "pending"
    await event.set()

It works fine with 1.0, but fails with 1.1 because the round-up logic from #7116 allows workers to be oversaturated.

It blocks forever because the worker with 1 thread gets assigned [block_on_event, inc], and the worker with 2 threads gets assigned [block_on_event]. It should be the other way around.

The culprit seems to be the round-robin logic, which only applies in rare situations like this one, where the cluster is small but still larger than the TaskGroup being assigned:

# TODO if `is_rootish` would always return True for tasks without dependencies,
# we could remove all this logic. The rootish assignment logic would behave
# more or less the same as this, maybe without guaranteed round-robin though?
# This path is only reachable when `ts` doesn't have dependencies, but its
# group is also smaller than the cluster.
# Fastpath when there are no related tasks or restrictions
worker_pool = self.idle or self.workers
# FIXME idle and workers are SortedDict's declared as dicts
# because sortedcontainers is not annotated
wp_vals = cast("Sequence[WorkerState]", worker_pool.values())
n_workers: int = len(wp_vals)
if n_workers < 20:  # smart but linear in small case
    ws = min(wp_vals, key=operator.attrgetter("occupancy"))
    assert ws
    if ws.occupancy == 0:
        # special case to use round-robin; linear search
        # for next worker with zero occupancy (or just
        # land back where we started).
        wp_i: WorkerState
        start: int = self.n_tasks % n_workers
        i: int
        for i in range(n_workers):
            wp_i = wp_vals[(i + start) % n_workers]
            if wp_i.occupancy == 0:
                ws = wp_i
                break

If I update is_rootish like so:

diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index cf240240..802df12d 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -3043,6 +3043,8 @@ class SchedulerState:
         """
         if ts.resource_restrictions or ts.worker_restrictions or ts.host_restrictions:
             return False
+        if not ts.dependencies:
+            return True
         tg = ts.group
         # TODO short-circuit to True if `not ts.dependencies`?
         return (

the test passes.

cc @fjetter @crusaderky

@fjetter (Member) commented Oct 26, 2022

I'm not a big fan of the round robin in general

xref #6974

I'm a bit nervous about the change to is_rootish since I'm having a hard time estimating the impact


FWIW, I'm wondering whether this test even makes sense the way it is written. I'd like to see a permutation of the order in which we submit these tasks; I bet main would be failing as well.
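
Sketching what such a permuted-order test might look like (the test name and explicit keys are hypothetical; it reuses the Event, block_on_event, inc, wait, and gen_cluster helpers from the test above):

import itertools

@gen_cluster(client=True)
async def test_wait_first_completed_any_order(c, s, a, b):
    # Try every submission order of the two blocking tasks ("x", "y") and the quick task ("z").
    for order in sorted(itertools.permutations(["x", "y", "z"])):
        event = Event()
        suffix = "-".join(order)
        futs = {}
        for name in order:
            if name == "z":
                futs[name] = c.submit(inc, 2, key=f"inc-{suffix}")
            else:
                futs[name] = c.submit(block_on_event, event, key=f"{name}-{suffix}")
        done, not_done = await wait(list(futs.values()), return_when="FIRST_COMPLETED")
        assert done == {futs["z"]}, order
        await event.set()
        # Let the blocked tasks finish before starting the next permutation.
        await wait(list(futs.values()))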

@gjoseph92 (Collaborator, Author)

Yeah, I'm also not a fan of the round robin. The is_rootish change isn't necessarily what I think we should do; it's just a quick way of showing that the round-robin logic is the issue here, since the change makes these tasks skip that code path.

@gjoseph92 (Collaborator, Author)

For posterity, here's why this test currently fails with worker-saturation > 1.0:

We have 3 tasks, and a cluster with 3 threads—2 on one worker, 1 on another.

Even though the tasks are clearly root tasks (they have no dependencies, #7274), there aren't enough of them to be considered root-ish (#7273). So we don't use the queuing-related code path, even though queuing is enabled.

Instead, we select workers for them via this old round-robin logic. Then this happens:

  1. Select 1-threaded worker for first block_on_event task. All workers have 0 occupancy, so we pick the worker whose address comes first lexicographically. This is fine.
  2. Select 2-threaded worker for the second block_on_event task. It has the minimum occupancy. This is fine.
  3. Select 1-threaded worker for the inc task. This is bad; we should have picked the 2-threaded worker. But both workers have 0.5 occupancy right now, and we don't scale occupancy by the number of worker threads. So since they're equal, lexicographical sort is the tie-breaker, and we pick the first worker.

Why does this even pass with queuing off, or with worker-saturation == 1.0? Because in that case, once a task is assigned to the 1-threaded worker, it's removed from the idle set. So this round-robin code doesn't get the chance to make a bad choice, because that worker is no longer a candidate. With worker-saturation == 1.1, the worker isn't removed from idle, since it's allowed to be oversaturated.
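
As a sketch of that round-up behavior (my reading of #7116, not the scheduler's actual code), the saturation threshold is roughly ceil(nthreads * worker-saturation):

import math

def is_fully_saturated(n_processing: int, nthreads: int, saturation: float) -> bool:
    # A worker is considered full (and leaves `idle`) once it has at least
    # ceil(nthreads * saturation) tasks processing.
    return n_processing >= max(math.ceil(nthreads * saturation), 1)

# The 1-threaded worker after it receives the first block_on_event task:
print(is_fully_saturated(1, nthreads=1, saturation=1.0))  # True  -> leaves idle, can't get inc
print(is_fully_saturated(1, nthreads=1, saturation=1.1))  # False -> ceil(1.1) == 2, stays idle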

I'd argue this is simply a bug in the round-robin code path: it doesn't sort by quite the right criteria. It's inconsistent with worker_objective, for example. We shouldn't be relying on the implementation of idle to keep us from picking bad workers. If we just do this:

diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index eb5828bf..fd487cc8 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -2212,7 +2212,7 @@ class SchedulerState:
             wp_vals = cast("Sequence[WorkerState]", worker_pool.values())
             n_workers: int = len(wp_vals)
             if n_workers < 20:  # smart but linear in small case
-                ws = min(wp_vals, key=operator.attrgetter("occupancy"))
+                ws = min(wp_vals, key=lambda ws: ws.occupancy / ws.nthreads)
                 assert ws
                 if ws.occupancy == 0:
                     # special case to use round-robin; linear search

then the test passes at 1.0, 1.1, and inf.
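
A toy illustration of the tie-break (FakeWorker is a stand-in for WorkerState, just for the example):

from dataclasses import dataclass
import operator

@dataclass
class FakeWorker:
    address: str
    nthreads: int
    occupancy: float

# The situation in step 3 above: each worker has one 0.5s task processing.
a = FakeWorker("tcp://127.0.0.1:1234", nthreads=1, occupancy=0.5)
b = FakeWorker("tcp://127.0.0.1:1235", nthreads=2, occupancy=0.5)

# Raw occupancy ties, so min() keeps the first candidate (the 1-threaded worker):
print(min([a, b], key=operator.attrgetter("occupancy")).address)       # tcp://127.0.0.1:1234
# Normalizing by nthreads prefers the worker with an open thread:
print(min([a, b], key=lambda ws: ws.occupancy / ws.nthreads).address)  # tcp://127.0.0.1:1235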

@gjoseph92 (Collaborator, Author)

Another issue with this round-robin code path: it doesn't take memory into consideration. This test, to be added in #7248, fails on main right now both with queuing on and off:

@gen_cluster(
    client=True,
    nthreads=[("", 2)] * 3,
)
async def test_decide_worker_memory_tiebreaker_idle_cluster(c, s, *workers):
    big = c.submit(lambda: "x" * 4096)
    await big
    f2 = c.submit(inc, 1)
    await f2
    f3 = c.submit(inc, 2)
    await f3
    assert all(len(w.state.tasks) == 1 for w in workers), [
        w.state.tasks for w in workers
    ]
    last = c.submit(inc, 3)
    await last
    big_ts = s.tasks[big.key]
    last_ts = s.tasks[last.key]
    assert first(last_ts.who_has) is not first(big_ts.who_has)

This is the case @crusaderky cares about. In an idle cluster, if some workers have keys in memory and others don't, we shouldn't keep adding more tasks to the workers that already have data in memory.
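
One possible shape for such a tie-breaker, sketched against the min() call above (this alone wouldn't fix the test, since the zero-occupancy round-robin special case would still override it, but it shows the idea):

# Sketch: break occupancy ties by how much data the worker already holds,
# the same way worker_objective uses ws.nbytes as a secondary key.
ws = min(
    wp_vals,
    key=lambda ws: (ws.occupancy / ws.nthreads, ws.nbytes),
)

In the test above, all three workers have zero occupancy, but the worker holding the 4 KiB string would lose the nbytes tie-break, so last would land on one of the other workers.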

@gjoseph92 (Collaborator, Author)

Finally, the >20 workers case (which would be heavily exercised in most real-world, non-local clusters, especially with the Futures API) is only covered by one test: #7275. Obviously, both "pick the worker with open threads" and "pick the worker with the least memory" go out the window with >20 workers, since we're not taking the min at all, just picking round-robin.

Given that selecting the min worker only takes a tiny amount of time, even on larger clusters (#7246 (comment)), I'm not sure the >20 round-robin case is worth the worse decision-making and the complexity of another code path with different behavior and almost no testing.

That's why eliminating this round-robin code path and consolidating it into the decide_worker_rootish_* functions seems like the best long-term approach to me. Even starting with #6974 would improve the situation, since worker_objective already gets this occupancy and nbytes metric right.
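
For reference, my rough paraphrase of what worker_objective computes for a non-actor task (simplified from distributed/scheduler.py; see the real implementation for details):

def worker_objective_sketch(ts, ws, bandwidth):
    # Bytes of dependencies that would need to be transferred to this worker.
    comm_bytes = sum(
        dts.get_nbytes() for dts in ts.dependencies if ws not in dts.who_has
    )
    # Estimated start time: per-thread load plus transfer time...
    start_time = ws.occupancy / ws.nthreads + comm_bytes / bandwidth
    # ...with stored bytes as the tie-breaker.
    return (start_time, ws.nbytes)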

@gjoseph92 gjoseph92 linked a pull request Nov 9, 2022 that will close this issue
@gjoseph92 (Collaborator, Author)

Note that this was alleviated, but not entirely fixed, by #7278. (It's still not sorting by a great objective function, but totally-full candidates are now removed regardless of worker-saturation.)
