
Change split_every=False default for DataFrame reductions like mean? #9450

Status: Open
gjoseph92 opened this issue Sep 1, 2022 · 4 comments
Labels: dataframe · enhancement (Improve existing functionality or make things work better) · needs attention (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.)

Comments

gjoseph92 (Collaborator) commented:

Most DataFrame reductions like mean, count, etc. use a hardcoded split_every=False default. This means we apply the reduction to every partition, then combine all of those results into one in a single task.
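The pattern can be sketched in plain Python (hypothetical helper names, not dask's actual implementation): each partition is reduced to a tiny intermediate, and one final task gathers and combines every intermediate at once.

```python
# Sketch of the split_every=False pattern for mean (not dask's code):
# reduce each partition independently, then combine everything in ONE task.

def partition_mean_chunk(partition):
    # Per-partition reduction: emit a tiny intermediate (sum, count).
    return (sum(partition), len(partition))

def mean_aggregate(chunks):
    # The single final task: must gather every intermediate before it runs.
    total = sum(s for s, _ in chunks)
    count = sum(n for _, n in chunks)
    return total / count

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
intermediates = [partition_mean_chunk(p) for p in partitions]
result = mean_aggregate(intermediates)  # 45 / 9 = 5.0
```

Each intermediate is a single (sum, count) pair, which is why transferring them is cheap; the cost discussed below is in the fan-in of the final task, not in data volume.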

This makes sense because it keeps the graph smaller. And the intermediate results should be tiny (a single row), so transferring them is relatively cheap. It's ideal for a local scheduler.

[Task graph visualization: split_every=False (current)]

[Task graph visualization: split_every=None]

The problem is that this doesn't overlap communication and computation well when you have large numbers of partitions in a distributed cluster.

  1. The final task can't be scheduled until every input is done, so none of the transfers can start until every input has already finished computing.
  2. The final task has to transfer lots of individual keys from many different workers. Workers have limits on how many other workers they're allowed to talk to at once (and how much data they're allowed to gather at once). Trying to fetch 1k or 10k dependencies as input to a single task could be quite slow, even if each input is very small, especially when there are many workers.

Whereas with a tree reduction, you can start reducing before all the inputs are done (overlapping communication and computation), and avoid the massive all-to-one transfers.

So I'm wondering if we should change the split_every=False default for reductions like sum, mean, etc. Perhaps None (to use the default), or perhaps something higher (16? 32?) to recognize that the pieces of data being transferred are small, so we can combine more of them at once and have a smaller graph.
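To make the trade-off concrete, here is a small sketch (plain Python, not dask internals) counting how many combine rounds a tree reduction needs when each combine task merges at most split_every inputs. split_every=False is modeled as one round that merges everything at once.

```python
import math

def reduction_rounds(npartitions, split_every):
    # Number of combine rounds in a tree reduction where each task
    # merges at most `split_every` inputs. split_every=False means
    # a single round with unbounded fan-in.
    if split_every is False:
        return 1
    rounds = 0
    n = npartitions
    while n > 1:
        n = math.ceil(n / split_every)
        rounds += 1
    return rounds

# With 10_000 partitions:
print(reduction_rounds(10_000, False))  # one task with 10_000 inputs
print(reduction_rounds(10_000, 32))     # a shallow tree, fan-in <= 32 per task
```

A higher split_every trades tree depth (and graph size) against per-task fan-in, which is why a value like 16 or 32 may be a reasonable middle ground for these tiny intermediates.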

cc @jrbourbeau @ian-r-rose @jsignell @rjzamora

gjoseph92 added the dataframe and enhancement (Improve existing functionality or make things work better) labels on Sep 1, 2022
jrbourbeau (Member) commented:

Thanks for writing this up @gjoseph92, it's an interesting idea

Trying to fetch 1k or 10k dependencies as input to a single task could be quite slow, even if each input is very small, especially when there are many workers.

Do you have a sense for what level of impact this has in practice?

gjoseph92 (Collaborator, Author) commented:

We should measure it! For the ~1k partition range, I think it's more of a UX issue than a performance one: the 2–3 second lag at the end, where nothing is moving on the dashboard and the only progress bar left is dataframe-count-agg 0/1, feels awkward but doesn't slow you down substantially.

If you had 50k partitions, my guess is it might be in the 20 second range ballpark (depending on number of workers).

rjzamora (Member) commented Sep 2, 2022:

This seems like something worth investigating. My gut tells me that the split_every default (8) is probably not right for these small-data aggregations, but that False is probably wrong at large scale. Perhaps we just want to use a number that behaves like False until the number of partitions gets "large" enough (128, 256, 512, 1024, ...?).

gjoseph92 (Collaborator, Author) commented:

Wouldn't split_every=128 behave like False for <128 partitions? So maybe that's what we want.
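That equivalence is easy to check with a toy model (a sketch of a tree reduction with bounded fan-in, not dask internals): below 128 partitions the tree collapses to a single combine task, exactly like split_every=False.

```python
import math

def tree_depth(npartitions, split_every):
    # Depth of a tree reduction whose combine tasks each merge
    # at most `split_every` inputs.
    depth = 0
    n = npartitions
    while n > 1:
        n = math.ceil(n / split_every)
        depth += 1
    return depth

# Fewer than 128 partitions: one combine task, same shape as False.
print(tree_depth(100, 128))    # 1
# Past 128 partitions, an extra combine level appears.
print(tree_depth(1_000, 128))  # 2
```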

github-actions bot added the needs attention label on Oct 3, 2022