Co-assign root-ish tasks #4967
Conversation
It doesn't like `sum(wws._nthreads for wws in valid_workers)` since that requires defining a closure.
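For illustration only, a closure-free equivalent might look like the sketch below; `valid_workers` and the `WorkerState` objects with an `_nthreads` attribute are assumed from the surrounding scheduler code, and this is not the actual patch:

```python
def total_nthreads(valid_workers):
    # A generator expression defines an inner function (closure), which is
    # what the tooling objects to here; an explicit accumulation loop avoids it.
    n = 0
    for wws in valid_workers:
        n += wws._nthreads
    return n
```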
Woot?
Here are a couple of possible commits: https://github.com/dask/distributed/compare/main...mrocklin:decide_worker/co-assign-relatives-2?expand=1
@Kirill888 great talk! For memory issues, I think you might find the work in this PR interesting. I don't know precisely what you're running into, but given how you described your problem, I give this PR a 50% chance of making you happy.
If tests pass I'm good to merge. However, @gjoseph92 it might make sense to run with this in the wild a bit. Maybe tomorrow is "try a bunch of pangeo workloads" day?
trivial_deps = {f"k{i}": delayed(object()) for i in range(ndeps)}
# TODO is there a simpler (non-blockwise) way to make this sort of graph?
TODO still relevant?
valid_workers is None
and len(group) > self._total_nthreads * 2
and sum(map(len, group._dependencies)) < 5
I'm wondering if it is possible to construct a test which helps us with this cutoff condition. One of the motivators for this heuristic is that it likely acts as a good binary classifier for graphs, where most have either very small or very large numbers here. AFAIU, this is an unproven assumption.
I'm interested in a test for this particular boundary condition for two reasons:
- It would help us identify regressions if this boundary is accidentally moved
- How good is our classifier? Might it be 40% better if this value were 10? A well-written test could help with such an analysis
I won't push hard on this if it proves too difficult or others disagree on its value. I'm just having a hard time trusting heuristics if I can change them without tests breaking.
For the current test, I can do any of the following without the test breaking:
- Remove `len(group) > self._total_nthreads * 2` entirely
- Increase the boundary for total thread count, e.g. `len(group) > self._total_nthreads * 10` (it breaks eventually if pushed further)
- Increase the boundary for dependencies to `sum(map(len, group._dependencies)) < 100` (increased further, it breaks)
In 90% of cases the number here is 0 (like `da.random.random`) or 1 (like `da.from_zarr`). I can imagine, but can't actually come up with, cases where this might be 2 (like `da.from_zarr(zarr_array, parameter=some_dask_thing)`).
I get the aversion to magic numbers. This one feels pretty safe to me though.
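To make the cutoff more concrete, here is a standalone sketch of the classifier under discussion; the names (`looks_rootish`, `total_nthreads`, `dep_sizes`) are hypothetical stand-ins for the scheduler's real state, not its API:

```python
def looks_rootish(group_size: int, total_nthreads: int, dep_sizes: list) -> bool:
    """Hypothetical standalone version of the cutoff: treat a task group as
    'root-ish' when it is much larger than the cluster and has almost no
    dependency keys in total (the real code also requires valid_workers is None)."""
    return group_size > total_nthreads * 2 and sum(dep_sizes) < 5

# The cases discussed above:
assert looks_rootish(1000, 8, [])          # 0 deps, like da.random.random
assert looks_rootish(1000, 8, [1])         # 1 tiny dep, like da.from_zarr
assert not looks_rootish(10, 8, [1])       # group not much larger than the cluster
assert not looks_rootish(1000, 8, [3, 3])  # too many dependency keys
```

A boundary test could probe exactly where this flips (at `total_nthreads * 2` and at 5 summed dependency keys), which would catch the accidental moves described above.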
@@ -2336,6 +2378,7 @@ def decide_worker(self, ts: TaskState) -> WorkerState:
        partial(self.worker_objective, ts),
    )
else:
    # Fastpath when there are no related tasks or restrictions
I'm realizing that this codepath will now only be rarely triggered (when there are 0 deps, but also the TaskGroup is small). Do we need to add this round-robining into our selection of a new worker for root-ish tasks? (Since we know we'll be running the tasks on every worker, I'm not sure it matters much that we may always start with the same one in an idle cluster.)
Possibly. More broadly, this is probably a good reminder that while we run this on some larger example computations, we should also remember to look at some profiles of the scheduler to see if/how things have changed.
@gjoseph92 did you run into issues with this yet? I'm curious, have you tried using many workers? (for some sensible definition of many)
@gjoseph92 did this come up in profiling? This seems like the only pending comment. I'd like to get this in if possible.
It did not come up in profiling, and I haven't run into any issues with it. I feel pretty confident that round-robining is irrelevant when we're running TaskGroups larger than the cluster. I mostly brought it up because this branch is now pretty long and complicated for a codepath that we'll almost never go down. But maybe that's okay.
have you tried using many workers?
I haven't tried pangeo-style workloads with >30 workers, but I have tried my standard shuffle-profile with this, which prompted 91aee92, which I need to look into a little more.
My hope is that we can get this in today or tomorrow. Is that hope achievable? If not, do you have a sense for what a reasonable deadline would be?
Today.
I shall prepare the happy dance.
While looking into dask#5083 I happened to notice that the dashboard felt very sluggish. I profiled with py-spy and discovered that the scheduler was spending 20% of runtime calculating `sum(map(len, group._dependencies)) < 5`! A quick print statement showed some task groups depended on 25,728 other groups (each of size 1). We can easily skip those. I originally had this conditional in dask#4967 but we removed it for simplicity: dask#4967 (comment); turns out it was relevant after all!
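For illustration, one way to avoid paying for the full sum on groups with tens of thousands of dependencies is to short-circuit once the cutoff is reached. The sketch below is an assumption about the shape of such a fix, not the actual patch:

```python
def few_total_deps(dependency_groups, cutoff=5):
    """Return True only if the total number of keys across all dependency
    groups is below the cutoff, bailing out early rather than computing
    sum(map(len, dependency_groups)) in full."""
    total = 0
    for group in dependency_groups:
        total += len(group)
        if total >= cutoff:
            return False
    return True
```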
If a dependency is already on every worker—or will end up on every worker regardless, because many things depend on it—we should ignore it when selecting our candidate workers. Otherwise, we'll end up considering every worker as a candidate, which is 1) slow and 2) often leads to poor choices (xref dask#5253, dask#5324).
Just like with root-ish tasks, this is particularly important at the beginning. Say we have a bunch of tasks `x, 0`..`x, 10` that each depend on `root, 0`..`root, 10` respectively, but every `x` also depends on one task called `everywhere`. If `x, 0` is ready first, but `root, 0` and `everywhere` live on different workers, it appears as though we have a legitimate choice to make: do we schedule near `root, 0`, or near `everywhere`? But if we choose to go closer to `everywhere`, we might have a short-term gain, but we've taken a spot that could have gone to better use in the near future. Say that `everywhere` worker is about to complete `root, 6`. Now `x, 6` may run on yet another worker (because `x, 0` is already running where it should have gone). This can cascade through all the `x`s, until we've transferred most `root` tasks to different workers (on top of `everywhere`, which we have to transfer everywhere no matter what).
The principle of this is the same as dask#4967: be more forward-looking in worker assignment and accept a little short-term slowness to ensure that downstream tasks have to transfer less data. This PR is a binary choice, but I think we could actually generalize to some weight in `worker_objective` like: the more dependents or replicas a task has, the less weight we should give to the workers that hold it. I wonder if, barring significant data transfer imbalance, having stronger affinity for the more "rare" keys will tend to lead to better placement.
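A rough sketch of the candidate-selection idea described above; `Dep`, `who_has`, and `n_dependents` are hypothetical stand-ins, not the scheduler's actual attributes:

```python
from dataclasses import dataclass, field

@dataclass
class Dep:
    """Hypothetical stand-in for a dependency's scheduling state."""
    who_has: set = field(default_factory=set)  # workers currently holding the key
    n_dependents: int = 0                      # how many tasks depend on it

def candidate_workers(dependencies, all_workers, fraction=0.5):
    """Ignore dependencies that are already on (or, because so many tasks
    depend on them, will end up on) most workers; they shouldn't steer
    placement. Fall back to all workers if everything is 'everywhere'."""
    cutoff = len(all_workers) * fraction
    candidates = set()
    for dep in dependencies:
        if len(dep.who_has) >= cutoff or dep.n_dependents >= cutoff:
            continue  # effectively everywhere; don't let it drive the choice
        candidates.update(dep.who_has)
    return candidates or set(all_workers)
```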
In `decide_worker`, rather than spreading out root tasks as much as possible, schedule consecutive (by priority order) root(ish) tasks on the same worker. This ensures the dependencies of a reduction start out on the same worker, reducing future data transfer.
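As a toy illustration of that idea (not the scheduler's implementation), co-assignment by priority order might look roughly like this:

```python
def coassign_roots(root_tasks_by_priority, workers):
    """Assign contiguous slices of priority-ordered root tasks to the same
    worker, so neighbouring tasks (and hence the inputs of a later
    reduction) start out on the same machine."""
    per_worker = -(-len(root_tasks_by_priority) // len(workers))  # ceil division
    return {
        task: workers[i // per_worker]
        for i, task in enumerate(root_tasks_by_priority)
    }

# Example: 8 root tasks on 2 workers -> first half on "a", second half on "b"
print(coassign_roots([f"root-{i}" for i in range(8)], ["a", "b"]))
```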