
Imbalanced scheduling of non-root tasks with resources #8177

Open
crusaderky opened this issue Sep 11, 2023 · 4 comments · May be fixed by #8283

crusaderky (Collaborator) commented Sep 11, 2023

From https://dask.discourse.group/t/only-1-worker-is-running-when-the-dag-is-forking/2192

Non-root tasks that declare resources do not distribute evenly across the cluster; instead they pile up on a single worker.

import time
import dask
from distributed import Client


@dask.delayed
def f():
    return 1

@dask.delayed
def g(x, y):
    time.sleep(2)
    return x + y



ops = []
root = f()
#root = 1

for i in range(4):
    with dask.annotate(resources={"cores": 100}):
        nonroot = g(root, i)
        ops.append(nonroot)

with Client(n_workers=2, threads_per_worker=4, resources={"cores": 128}):
    t0 = time.time()
    dask.compute(*ops)
    t1 = time.time()

print("compute time:", t1 - t0)

Expected: 4s
Actual: 8s

distributed.scheduler.worker-saturation does not seem to make a difference. Having fewer or more than 5 tasks (the threshold for is_rootish) doesn't seem to have an impact either (as long as there are fewer tasks than threads).
Uncommenting root = 1, thus making the tasks with resources actually root (not just root-ish), makes the issue disappear.
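
For reference, a hypothetical check (not part of the original report) to confirm the placement programmatically: reuse ops from the reproducer, compute it as futures, and ask the scheduler which worker holds each result key (assuming the resource annotations survive graph optimization):

from distributed import Client

with Client(n_workers=2, threads_per_worker=4, resources={"cores": 128}) as client:
    futures = client.compute(ops)  # `ops` from the reproducer above
    client.gather(futures)
    # who_has maps each result key to the worker(s) holding it; per this
    # report, all keys are expected to end up on a single worker.
    print(client.who_has(futures))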

After increasing the number of tasks from 6 to 100, this is what I see on the dashboard:
[dashboard screenshot]

fjetter (Member) commented Sep 11, 2023

distributed.scheduler.worker-saturation does not seem to make a difference.

Tasks with resource restrictions (or any other restrictions) are always considered non-rootish, so the worker-saturation setting does not apply to them.
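
A rough standalone paraphrase (illustrative only, not the actual distributed source) of that rule: any restriction short-circuits the root-ish classification before the task-group-size heuristic is even reached.

from dataclasses import dataclass, field

@dataclass
class TaskStateSketch:
    # Toy stand-in for the scheduler's TaskState; field names are illustrative.
    resource_restrictions: dict = field(default_factory=dict)
    worker_restrictions: set = field(default_factory=set)
    host_restrictions: set = field(default_factory=set)

def is_rootish_sketch(ts: TaskStateSketch) -> bool:
    # Any restriction disqualifies the task from root-ish scheduling outright...
    if ts.resource_restrictions or ts.worker_restrictions or ts.host_restrictions:
        return False
    # ...otherwise the real scheduler goes on to weigh the size of the task
    # group against the cluster and the number of dependencies.
    return True

print(is_rootish_sketch(TaskStateSketch(resource_restrictions={"cores": 100})))  # False
print(is_rootish_sketch(TaskStateSketch()))                                      # True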

fjetter (Member) commented Sep 11, 2023

I was triaging this issue before and could track it down to this line

candidates = {wws for dts in ts.dependencies for wws in dts.who_has}

Essentially, once a dependency is in memory, dependents are forced to be scheduled on that worker; only work stealing allows us to redistribute tasks afterwards. I have a WIP branch open where I change this behavior, but the impact on performance is highly nontrivial (based on the A/B tests), and I dropped further investigation due to lack of time.
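
A standalone toy (not the scheduler's actual code) showing the consequence of that candidate selection: when candidates are limited to workers that already hold a dependency, every dependent of a single in-memory root collapses onto one worker.

who_has = {"f": {"worker-a"}}                        # the root result lives on worker-a
dependencies = {f"g-{i}": {"f"} for i in range(4)}   # four dependents, as in the reproducer

for key, deps in dependencies.items():
    candidates = {w for dep in deps for w in who_has[dep]}
    # worker-a is the only candidate, so every g task is assigned there;
    # only work stealing could move it to the idle worker later.
    print(key, "->", sorted(candidates))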

PierrickPochelu commented

Thanks for locating this issue in the code. It is understandable that work stealing may have a negative impact on performance in some situations, while in others it is desirable. I would suggest offering multiple scheduler strategies to suit the user's needs.

fjetter self-assigned this Nov 15, 2023