
Change split_every=False default for DataFrame reductions like mean? #9450

Status: Open
gjoseph92 opened this issue Sep 1, 2022 · 4 comments
Labels: dataframe · enhancement (Improve existing functionality or make things work better) · needs attention (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.)

Comments

gjoseph92 (Collaborator) commented:

Most DataFrame reductions like mean, count, etc. use a hardcoded split_every=False default. This means we apply the reduction to every partition, then combine all of those results into one in a single task.
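The pattern can be sketched in plain Python (hypothetical helper names, not dask's actual implementation): each partition is reduced to a tiny intermediate, and one final task gathers and combines every intermediate at once.

```python
# Sketch of the split_every=False pattern for mean (not dask's code):
# reduce each partition independently, then combine everything in ONE task.

def partition_mean_chunk(partition):
    # Per-partition reduction: emit a tiny intermediate (sum, count).
    return (sum(partition), len(partition))

def mean_aggregate(chunks):
    # The single final task: must gather every intermediate before it runs.
    total = sum(s for s, _ in chunks)
    count = sum(n for _, n in chunks)
    return total / count

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
intermediates = [partition_mean_chunk(p) for p in partitions]
result = mean_aggregate(intermediates)  # 45 / 9 = 5.0
```

Each intermediate is a single (sum, count) pair, which is why transferring them is cheap; the cost discussed below is in the fan-in of the final task, not in data volume.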

This makes sense because it keeps the graph smaller. And the intermediate results should be tiny (a single row), so transferring them is relatively cheap. It's ideal for a local scheduler.

[Task graph visualization: split_every=False (current)]

[Task graph visualization: split_every=None]

The problem is that this doesn't overlap communication and computation well when you have large numbers of partitions in a distributed cluster.

  1. The final task can't be scheduled until every input is done, so none of the transfers can start until every input has already finished computing.
  2. The final task has to transfer lots of individual keys from many different workers. Workers have limits on how many other workers they're allowed to talk to at once (and how much data they're allowed to gather at once). Trying to fetch 1k or 10k dependencies as input to a single task could be quite slow, even if each input is very small, especially when there are many workers.

Whereas with a tree reduction, you can start reducing before all the inputs are done (overlapping communication and computation), and avoid the massive all-to-one transfers.

So I'm wondering if we should change the split_every=False default for reductions like sum, mean, etc. Perhaps None (to use the default), or perhaps something higher (16? 32?) to recognize that the pieces of data being transferred are small, so we can combine more of them at once and have a smaller graph.
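To make the trade-off concrete, here is a small sketch (plain Python, not dask internals) counting how many combine rounds a tree reduction needs when each combine task merges at most split_every inputs. split_every=False is modeled as one round that merges everything at once.

```python
import math

def reduction_rounds(npartitions, split_every):
    # Number of combine rounds in a tree reduction where each task
    # merges at most `split_every` inputs. split_every=False means
    # a single round with unbounded fan-in.
    if split_every is False:
        return 1
    rounds = 0
    n = npartitions
    while n > 1:
        n = math.ceil(n / split_every)
        rounds += 1
    return rounds

# With 10_000 partitions:
print(reduction_rounds(10_000, False))  # one task with 10_000 inputs
print(reduction_rounds(10_000, 32))     # a shallow tree, fan-in <= 32 per task
```

A higher split_every trades tree depth (and graph size) against per-task fan-in, which is why a value like 16 or 32 may be a reasonable middle ground for these tiny intermediates.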

cc @jrbourbeau @ian-r-rose @jsignell @rjzamora

gjoseph92 added the dataframe and enhancement (Improve existing functionality or make things work better) labels on Sep 1, 2022
jrbourbeau (Member) commented:

Thanks for writing this up @gjoseph92, it's an interesting idea

Trying to fetch 1k or 10k dependencies as input to a single task could be quite slow, even if each input is very small, especially when there are many workers.

Do you have a sense for what level of impact this has in practice?

gjoseph92 (Collaborator, Author) commented:

We should measure it! For the ~1k partition range, I think it's more of a UX issue than a performance one: the 2–3 second lag at the end, where nothing is moving on the dashboard and the only progress bar left is dataframe-count-agg 0/1, feels awkward but doesn't slow you down substantially.

If you had 50k partitions, my guess is it might be in the 20 second range ballpark (depending on number of workers).

rjzamora (Member) commented Sep 2, 2022:

This seems like something worth investigating. My gut tells me that the split_every default (8) is probably not right for these small-data aggregations, but that False is probably wrong at large scale. Perhaps we just want to use a number that behaves like False until the number of partitions gets "large" enough (128, 256, 512, 1024, ...?).

gjoseph92 (Collaborator, Author) commented:

Wouldn't split_every=128 behave like False for <128 partitions? So maybe that's what we want.
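That equivalence is easy to check with a toy model (a sketch of a tree reduction with bounded fan-in, not dask internals): below 128 partitions the tree collapses to a single combine task, exactly like split_every=False.

```python
import math

def tree_depth(npartitions, split_every):
    # Depth of a tree reduction whose combine tasks each merge
    # at most `split_every` inputs.
    depth = 0
    n = npartitions
    while n > 1:
        n = math.ceil(n / split_every)
        depth += 1
    return depth

# Fewer than 128 partitions: one combine task, same shape as False.
print(tree_depth(100, 128))    # 1
# Past 128 partitions, an extra combine level appears.
print(tree_depth(1_000, 128))  # 2
```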

github-actions bot added the needs attention label on Oct 3, 2022