Respect task ordering when making worker assignments #4922

Open · wants to merge 7 commits into main

Conversation

mrocklin (Member)

In situations where tasks have many related tasks and few dependencies among them, we try to co-schedule those tasks onto similar workers according to their dask.ordering. We do this in the hope that it reduces the communication burden on their dependents.

# If our group is large with few dependencies
# Then assign sequential tasks to similar workers, even if occupancy isn't ideal
if len(ts._group) > nthreads * 2 and sum(map(len, ts._group._dependencies)) < 5:
Member

  • Isn't the length of all dependencies of a TG potentially very expensive? The length of a group iterates over all TaskStates in a given group. For some topologies, this would require us to iterate over all tasks (-1), wouldn't it?
  • Is there any way to reason about the numeric values here? I think I'm still lacking intuition for TGs to tell how stable this heuristic is.

Member Author

The length of a group iterates over all TaskStates in a given group

The code looks like this

    def __init__(self, ...):
        self._states = {"memory": 0, "processing": 0, ...}

    def __len__(self):
        return sum(self._states.values())

So it's not as bad as it sounds. However, iterating over a dict of even a few elements could still be a concern. If so, we could always keep a _len value around; it would be cheap to maintain.
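
A minimal sketch of that idea, assuming some transition-style hook exists to update the cached count (class and method names here are illustrative, not the actual TaskGroup implementation):

    class TaskGroup:
        def __init__(self):
            self._states = {"memory": 0, "processing": 0, "released": 0}
            self._len = 0  # cached total across all states

        def transition(self, old, new):
            # Update per-state counts; `None` means the task is entering or
            # leaving the group entirely.
            if old is not None:
                self._states[old] -= 1
            if new is not None:
                self._states[new] += 1
            if old is None and new is not None:
                self._len += 1
            elif old is not None and new is None:
                self._len -= 1

        def __len__(self):
            return self._len  # O(1) instead of summing the dict each time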

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to reason about the numeric values here? I think I'm still lacking intuition for TGs to tell how stable this heuristic is.

The 2 is because we want more than one task per worker to be allocated. If there are at least as many workers as tasks then we're unlikely to co-schedule any tasks onto the same worker, so the point is moot.

The < 5 is really saying "we want there to be almost no dependencies for the tasks in this group, but we're going to accept a common case of all tasks depending on some parameter or something like an abstract zarr file". We're looking for cases where the dependency won't significantly affect the distribution of tasks throughout the cluster. This could be len(dependencies) in (0, 1) but we figured we'd allow a couple of these just in case.

I expect that the distribution here will be bi-modal with tasks either in (0, 1) or in the hundreds or thousands. Five seemed like a good separator value in that distribution. I think that, given the distribution, this choice is stable and defensible.
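
Restating the heuristic as a standalone predicate may make the thresholds easier to see; this is only an illustration of the check quoted above (attribute names like group.dependencies are assumptions, not the scheduler's exact API):

    def looks_root_ish(group, total_nthreads):
        # Many more tasks in the group than threads in the cluster,
        # so several tasks will land on each worker ...
        many_tasks = len(group) > total_nthreads * 2
        # ... and almost no tasks across all of the groups this group
        # depends on (0 or 1 in practice; < 5 leaves a little slack).
        few_total_deps = sum(len(dep) for dep in group.dependencies) < 5
        return many_tasks and few_total_deps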

Member

The code looks like this

Right, that's state as in {Running, Memory, Released}, not state as in TaskState, and it's an aggregated dict of counts. I was a bit thrown off when I saw that, but that's perfectly fine.

I expect that the distribution here will be bi-modal with tasks either in (0, 1) or in the hundreds or thousands.

Thanks for the detailed description. I think I was thrown off by the TaskGroup semantics again. I was thinking about our typical tree reductions, where the split factor is usually something like 8 or 16. Those are the situations where one would want to group all of the dependencies for the first reduction.
However, at the group level this should amount to a trivial dependency count of one, correct?

Then, five is conservative, I agree 👍

Member Author

Perhaps state should have been called state_counts. Oh well.

Ah, it's not len(ts._group._dependencies) which is what you're describing, I think. It's sum(map(len, ts._group._dependencies)) < 5.

We're counting up all of the dependencies for all of the tasks that are like this task. So in a tree reduction, this number would likely be in the thousands for any non-trivially sized computation. It is non-zero and less than five only in cases like the following:

a1 a2 a3 a4 a5 a6 a7 a8
 \  \  \  \  /  /  /  /
            b
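
As a toy illustration of why the total-dependency count separates these shapes (plain Python dicts here, not the scheduler's TaskGroup objects):

    # Fan-in like the a1..a8 -> b diagram above: the a-group has zero
    # dependencies in total, so it is "root-ish".
    fan_in = {f"a{i}": [] for i in range(1, 9)}
    fan_in["b"] = [f"a{i}" for i in range(1, 9)]

    # Tree reduction: each reduction task depends on several inputs, so the
    # reduction group's total dependency count grows with the graph.
    tree = {f"x{i}": [] for i in range(8)}
    tree.update({f"r{i}": [f"x{2 * i}", f"x{2 * i + 1}"] for i in range(4)})

    def group_total_deps(graph, prefix):
        return sum(len(deps) for key, deps in graph.items() if key.startswith(prefix))

    print(group_total_deps(fan_in, "a"))  # 0 -> passes the < 5 check
    print(group_total_deps(tree, "r"))    # 8 -> scales with graph size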

Member Author

Really we're looking for cases where the number of dependencies, amortized over all similar tasks, is near-zero.

Member Author

This is the "ish" in "root-ish" tasks that we sometimes talk about here.

Member

Perhaps state should have been called state_counts. Oh well.

naming is hard

We're counting up all of the dependencies for all of the tasks that are like this task. So in a tree reduction, this number would likely be in the thousands for any non-trivially sized computation. It is non-zero and less than five only in cases like the following:
Really we're looking for cases where the number of dependencies, amortized over all similar tasks, is near-zero.
This is the "ish" in "root-ish" tasks that we sometimes talk about here.

I think I get it now. That's an interesting approach to gauging the local topology. What I'm currently wondering is whether this, or a closely related metric (e.g. the ratio of group dependents to dependencies), could be used to estimate whether a task has the potential to increase or decrease parallelism. That would be an interesting metric for work stealing.

Anyhow, I don't want to increase the scope here; this is a discussion we can delay. I'll let the professionals get back to work. Thanks!

Member Author

I think that it could be a useful metric for memory consuming/producing tasks.

It's also, yes, a good metric for increasing parallelism. My experience though is that we are always in a state of abundant parallelism, and that scheduling to increase parallelism is not worth considering in our domain.

Instead we should focus our scheduling decisions to reduce memory use and free intermediate tasks quickly.

This gets test_scheduler.py::test_reschedule to pass
@mrocklin (Member Author)

The test failure is distributed/tests/test_scheduler.py::test_memory, which I don't understand particularly well. Unfortunately I'm not able to make it fail locally. cc'ing @crusaderky in case he has any quick suggestions on what might be going on, or why allocating tasks differently to machines might affect that test.

@gjoseph92 (Collaborator)

Running this code (uncomment the first bit to generate the zarr array)

import xarray as xr
import dask.array as da
from distributed import Client, LocalCluster
import coiled


if __name__ == "__main__":
    cluster = LocalCluster(
        processes=True, n_workers=4, threads_per_worker=1, memory_limit=0
    )
    client = Client(cluster)

    # Write a zarr array to disk (requires 100GB free disk space!)
    # Comment this out once you've run it once.
    # data = da.zeros((12500000, 1000), chunks=(12500000, 1))
    # ds = xr.Dataset({"data": (("x", "y"), data)})
    # ds.to_zarr("test.zarr")
    # print("Saved zarr")

    # Do the same array-sum example, but from zarr.
    ds_zarr = xr.open_zarr("test.zarr")
    with coiled.performance_report("zarr-4899.html"):
        ds_zarr.sum("y").compute()

causes a lot of transfers using this branch (performance report) compared to #4899 (performance report). I believe this is because, when moving on to a new worker, this is still using the typical candidate-restricting logic—see commit message of 0fbb75e for an explanation.

@mrocklin (Member Author)

Ah, right. Would this be solved by your trick of including a few workers from the general pool into the mix?

We might also consider applying the root-ish check when we check for dependencies. If there are far fewer dependencies than tasks in this group then we just fall back to the all_workers case.

@gjoseph92 (Collaborator)

Would this be solved by your trick of including a few workers from the general pool into the mix?

Yes, but I think we should not consider dependencies at all when selecting candidates in this case:

# Previous worker is fully assigned, so pick a new worker.
# Since this is a root-like task, we should ignore the placement of its dependencies while selecting workers.
# Every worker is going to end up running this type of task eventually, and any dependencies will have to be
# transferred to all workers, so there's no gain from only considering workers where the dependencies already live.
# Indeed, we _must_ consider all workers, otherwise we would keep picking the same "new" worker(s) every time,
# since there are only N workers to choose from that actually have the dependency (where N <= n_deps).
ignore_deps_while_picking = True

So rather than picking candidates as usual and then adding a few random workers, I think we should only use random workers in this instance. The whole point of the "ish" in root-ish tasks is that it's a case where we've decided dependencies don't matter.
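
A rough sketch of that selection, assuming a flat pool of workers (the names all_workers and n_candidates are placeholders, not the scheduler's actual variables):

    import random

    def pick_root_ish_candidates(all_workers, n_candidates=20):
        # For root-ish tasks, ignore where the dependencies live and sample
        # candidates from the whole pool so every worker gets a share.
        workers = list(all_workers)
        if len(workers) <= n_candidates:
            return workers
        return random.sample(workers, n_candidates)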

@mrocklin (Member Author)

So rather than picking candidates as usual and then adding a few random workers, I think we should only use random workers in this instance. The whole point of the "ish" in root-ish tasks is that it's a case where we've decided dependencies don't matter.

There are two proposals that came out of conversation here:

  1. Mix in random workers to the candidates, probably good to do in general
  2. Consider not looking at dependencies at all in some cases. This might be ...
  • A check similar to what we do today for root-ish tasks: a large number of tasks in the group and a small number of total dependencies
  • Something fancier, like looking at the number of bytes across all dependencies, amortized over all tasks, perhaps comparing communication cost to computation cost (sketched below)

They might both make sense
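
For the fancier variant, one way to frame it, purely as an illustration (all names and the 10% threshold here are assumptions, not anything in this PR):

    def deps_negligible(dep_nbytes, n_tasks_in_group, avg_task_duration,
                        bandwidth, factor=0.1):
        # Cost of moving the group's dependencies once, spread across every
        # task in the group, compared against the cost of computing one task.
        amortized_transfer = (dep_nbytes / bandwidth) / n_tasks_in_group
        return amortized_transfer < factor * avg_task_duration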
