worker-saturation impacts balancing in work-stealing #7085

Closed
hendrikmakait opened this issue Sep 29, 2022 · 2 comments · Fixed by #7278
Labels: bug, scheduling, stealing

Comments

hendrikmakait (Member) commented Sep 29, 2022

When worker-saturation is not inf, workers are only classified as idle if they are not full:

if (
    self.is_unoccupied(ws, occ, p)
    if math.isinf(self.WORKER_SATURATION)
    else not _worker_full(ws, self.WORKER_SATURATION)
):
    ...  # only workers that pass this check end up in self.idle

While this behavior is desired for withholding root tasks (it was introduced in #6614), work-stealing also relies on the classification of idle workers to identify thieves. Limiting idle to workers that are not saturated according to worker-saturation delays balancing decisions until workers are almost out of work and reduces our ability to interleave computation of the remaining tasks with gathering the dependencies of stolen ones.
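For context, the saturation check behaves roughly as follows (a simplified sketch mirroring distributed.scheduler._worker_full, not the exact scheduler code):

def _worker_full(ws, saturation_factor: float) -> bool:
    # A worker counts as "full" once it has at least
    # nthreads * saturation_factor tasks processing.
    return len(ws.processing) >= max(ws.nthreads * saturation_factor, 1)

# Consequence: with worker-saturation = 1.0, a single-threaded worker with one
# processing task is already "full". It is never classified as idle, so
# work-stealing never picks it as a thief, no matter how low its occupancy is.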

Reproducer
Add the following test case to test_steal.py:

@pytest.mark.parametrize("queue", [True, False])
@pytest.mark.parametrize("recompute_saturation", [True, False])
@pytest.mark.parametrize(
    "inp,expected",
    [
        (
            [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0]],
            [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]],
        ),  # balance many tasks
    ],
)
def test_balance_interacts_with_worker_saturation(
    inp, expected, queue, recompute_saturation
):
    async def test_balance_(*args, **kwargs):
        await assert_balanced(inp, expected, recompute_saturation, *args, **kwargs)

    config = {
        "distributed.scheduler.default-task-durations": {str(i): 1 for i in range(10)},
        "distributed.scheduler.worker-saturation": 1.0 if queue else float("inf"),
    }
    gen_cluster(client=True, nthreads=[("", 1)] * len(inp), config=config)(
        test_balance_
    )()

FAILED distributed/tests/test_steal.py::test_balance_interacts_with_worker_saturation[inp0-expected0-True-True] - Exception: Expected: [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]]; got: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0]]
FAILED distributed/tests/test_steal.py::test_balance_interacts_with_worker_saturation[inp0-expected0-False-True] - Exception: Expected: [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]]; got: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0]]

cc @gjoseph92

gjoseph92 (Collaborator) commented:
FYI recompute_saturation has been removed, so the test is now:

@pytest.mark.parametrize("queue", [True, False])
@pytest.mark.parametrize(
    "inp,expected",
    [
        (
            [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0]],
            [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]],
        ),  # balance many tasks
    ],
)
def test_balance_interacts_with_worker_saturation(inp, expected, queue):
    async def test_balance_(*args, **kwargs):
        await assert_balanced(inp, expected, *args, **kwargs)

    config = {
        "distributed.scheduler.default-task-durations": {str(i): 1 for i in range(10)},
        "distributed.scheduler.worker-saturation": 1.0 if queue else float("inf"),
    }
    gen_cluster(client=True, nthreads=[("", 1)] * len(inp), config=config)(
        test_balance_
    )()

gjoseph92 (Collaborator) commented:
Yeah, looks like what's happening here is that because both workers are single-threaded and have processing tasks, idle is empty, so balance doesn't try to do anything.

Without queuing, the worker with 1 task is still considered idle because its occupancy is less than half the average occupancy, even though all its threads are in use.
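To put numbers on it (illustrative only, using the 1-second default task duration set in the test):

# Worker A holds 12 one-second tasks, worker B holds 1; both have one thread.
occ_a, occ_b = 12.0, 1.0
avg = (occ_a + occ_b) / 2                  # 6.5
b_idle_without_queuing = occ_b < avg / 2   # 1.0 < 3.25 -> True, B can be a thief
# With worker-saturation = 1.0, B's single thread already has a task, so B is
# "full", never lands in idle, and no rebalancing happens.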

Stepping back though, how realistic is this situation? What you've created here is a sort of root task overproduction: many more tasks are in processing on one worker than its threads can handle. These tasks aren't queued because they aren't classified as root-ish: they have worker restrictions (set in assert_balanced), i.e. you explicitly asked for them all to go to one worker.
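(Rough sketch of the relevant part of Scheduler.is_rootish; the restriction check is what matters here, and the exact thresholds are approximate:)

def is_rootish_sketch(ts, total_nthreads: int) -> bool:
    # Any restriction disqualifies a task from being root-ish, so it bypasses
    # the scheduler-side queue and goes straight into processing.
    if ts.worker_restrictions or ts.host_restrictions or ts.resource_restrictions:
        return False
    # Otherwise, root-ish roughly means a large task group with few, small deps.
    tg = ts.group
    return len(tg) > 2 * total_nthreads and len(tg.dependencies) < 5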

I feel like this test is trying to simulate a scale-up case: you start with one worker, submit lots of tasks, then another one joins right at the end of submission. If queuing were on (and there were no worker restrictions), the first worker wouldn't have gotten all the tasks in the first place. There'd be no need to rebalance, since most tasks would be in the scheduler-side queue and would naturally be assigned to the new worker evenly. (Modulo #7274 and #7273 of course.)

So yes, if someone had exactly this use case—using worker restrictions but wanting tasks to get stolen anyway, starting with 1 worker and scaling up—then it won't rebalance when queuing is on. But if we care about the broad case—what if you submit lots of tasks to a 1-worker cluster, then scale up—I don't think this is a relevant test with queuing on.

Here's a test showing queuing balances when scaling up: #7284.
