
Alternative scheduling for new tasks #2940

Closed · wants to merge 1 commit

Conversation

@TomAugspurger (Member) commented Aug 7, 2019

Rather than placing new tasks with no dependencies on the first
idle worker, we try placing them on a worker executing tasks they're
a co-dependency with. This helps to reduce memory usage of graphs like

        a-1   a-2  a-3   a-4  a-5  a-6
          \    |    /     \    |    /
              b-1             b-2

This is meant to address #2602. Will require some testing.

I'm writing up a bunch of benchmarks on synthetic workloads now. Will try out on some real workloads as well.
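
The gist of the heuristic, as a rough sketch (the TaskState stand-in and pick_worker_for_new_task are illustrative names, not the actual scheduler code):

    from dataclasses import dataclass, field

    # Sketch only: these stand-ins mirror the shape of the scheduler's task
    # state (dependents / dependencies / processing_on); they are not the
    # real distributed internals.
    @dataclass
    class TaskState:
        name: str
        dependents: list = field(default_factory=list)
        dependencies: list = field(default_factory=list)
        processing_on: object = None  # worker currently running this task, if any

    def pick_worker_for_new_task(ts, idle_workers):
        # Prefer a worker that is already running one of ts's siblings,
        # i.e. another dependency of one of ts's dependents.
        for dts in ts.dependents:
            for sts in dts.dependencies:
                if sts is not ts and sts.processing_on is not None:
                    return sts.processing_on  # co-locate with the sibling
        # Otherwise fall back to the existing behaviour: first idle worker.
        return idle_workers[0] if idle_workers else None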

Rather than placing new tasks with no dependencies on the first
idle worker, we try placing them on a worker executing tasks they're
a co-dependency with.

The review comments below are on this excerpt from the diff:

    # If time weren't an issue, we might find the worker with the
    # most siblings. But that's expensive.
    #
    for sts in dts.dependencies:

Member:
There are several situations where a single task has very many dependents. In these cases I think that we'll hit N^2 scaling and bring things down.

Member:

What about cases where we don't have siblings, but cousins n-times removed?

a1
|
a2    b1
 \   /
   c

Member Author:

Yes, my initial attempt took a full count of where each of our co-dependencies was running. That blew up very quickly. The early break once we find a co-dependency was a first attempt to avoid that.

Member Author:

This approach won't help in that case (I think a-1 and b-1 are niblings 😄).


a-1   a-2   a-3   a-4
   \  /        \  /
   b-1         b-2

Member:

Note that all ascii art diagrams in the codebase so far have computation going from bottom to top. This is also the way that visualize works.

@mrocklin (Member) commented Aug 7, 2019

A long while ago we used to schedule things differently if they all came in in the same batch. We wouldn't do things one by one; we would take all of the initially free tasks, sort them by their dask.order value, and then partition them among the workers in that order. This worked well because nodes that have similar ordering values are likely to be closely related.
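
A rough sketch of that batch strategy (assign_initial_batch and the even chunking are illustrative, not the code that used to live in the scheduler):

    from dask.order import order

    def assign_initial_batch(dsk, free_keys, workers):
        # Sort the initially-free tasks by their dask.order value, then hand
        # out contiguous runs to the workers so that closely related tasks
        # (similar ordering values) land on the same worker.
        priorities = order(dsk)                          # key -> ordering value
        ordered = sorted(free_keys, key=priorities.get)  # the O(n log n) sort
        n = -(-len(ordered) // len(workers))             # ceil(len / n_workers)
        return {w: ordered[i * n:(i + 1) * n] for i, w in enumerate(workers)}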

However this works poorly if...

  • There are many leaves and our sorting algorithm is O(n log n)
  • New tasks come in that aren't part of this
  • These tasks actually all have a single dependency
  • The workers have some pre-existing load that we want to take into account
  • Our graph isn't just a big collections computation; the ordering between tasks is really important, and so we want to strongly prioritize the tasks that ordering suggests, spreading them across all of the workers rather than placing them all on one.

@mrocklin (Member) commented Aug 7, 2019

As an aside, a common cause of the graphs you're dealing with is, I think, not doing high-level-graph fusion aggressively enough. If we fused data ingestion operations the way we currently fuse blockwise, this situation would occur much less frequently. This is a less general solution, but handling it well would be an unambiguous benefit, while core scheduling always has tradeoffs.

I don't know the exact operation that you're trying to deal with, but it might be better handled by bringing operations like read_parquet, from_array, and others, under the Blockwise banner.

@TomAugspurger (Member Author):

> A long while ago we used to schedule things differently if they all came in in the same batch. We wouldn't do things one by one; we would take all of the initially free tasks, sort them by their dask.order value, and then partition them among the workers in that order.

That's interesting to hear. I briefly looked into trying to fix things earlier on since it's so hard to satisfy the "schedule co-dependencies together" goal this late in the scheduling process (at the single-task level). I didn't explore it much, since it seems to go against how things are done currently.

> If we fused data ingestion operations the way we currently fuse blockwise, this situation would occur much less frequently.

Does overly aggressive fusion have a negative impact when you have multiple threads per worker? For example, with

             b-1                b-2
          /  / \  \          /  / \  \
         /  |   |  \        /  |   |  \
       a-1 a-2 a-3 a-4    a-5 a-6 a-7 a-8

we might want to ensure that a-1 through a-4 end up on the same machine, but we might not want to fuse them.


I'll look into blockwise a bit. Perhaps updating Xarray's open_mfdataset to use it would yield some improvements.

@mrocklin (Member) commented Aug 7, 2019

> Does overly aggressive fusion have a negative impact when you have multiple threads per worker?

Maybe, but it's not common with collections (which is where we'll get high level blockwise fusion), where we commonly have far more partitions than we have threads.

> I'll look into blockwise a bit. Perhaps updating Xarray's open_mfdataset to use it would yield some improvements.

The challenge is that blockwise currently expects to operate on Dask collections. There isn't a clean way of using it to start up a new graph.

@TomAugspurger (Member Author):

I'm not actively working on this at the moment. Closing to clear the backlog.
