Change `split_every=False` default for DataFrame reductions like `mean`? #9450
Labels: `dataframe`, `enhancement` (Improve existing functionality or make things work better), `needs attention` (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.)
Most DataFrame reductions like `mean`, `count`, etc. use a hardcoded `split_every=False` default. This means that we'll apply the reduction to every partition, then, in a single task, combine all those results into one.

This makes sense because it keeps the graph smaller, and the intermediate results should be tiny (a single row), so transferring them is relatively cheap. It's ideal for a local scheduler.
[Task-graph screenshots: `split_every=False` (current) vs `split_every=None`]
The problem is that this doesn't overlap communication and computation well when you have large numbers of partitions in a distributed cluster. With a tree reduction, by contrast, you can start reducing before all the inputs are done (overlapping communication and computation), and avoid the massive all-to-one transfers.
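A tree reduction with a fan-in can be sketched as follows (again a stdlib-only illustration under assumed names, not Dask's implementation): intermediates are merged in groups of `split_every`, so merging can begin as soon as any one group of inputs is ready, and each round adds one level to the graph instead of one giant combine.

```python
# Hypothetical sketch of a tree reduction over (sum, count) intermediates
# with a configurable split_every fan-in.

def tree_combine(chunks, split_every=2):
    # Repeatedly merge groups of `split_every` pairs until one remains;
    # each while-loop iteration corresponds to a level of the task graph.
    while len(chunks) > 1:
        chunks = [
            (sum(s for s, _ in group), sum(c for _, c in group))
            for group in (chunks[i:i + split_every]
                          for i in range(0, len(chunks), split_every))
        ]
    total, count = chunks[0]
    return total / count

# Per-partition (sum, count) intermediates, as in the flat case.
chunks = [(sum(p), len(p)) for p in [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]]
mean = tree_combine(chunks, split_every=2)
```

A larger `split_every` trades graph depth for fan-in: fewer rounds and a smaller graph, but each combine task waits on more inputs.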
So I'm wondering if we should change the `split_every=False` default for reductions like `sum`, `mean`, etc. Perhaps to `None` (to use the default), or perhaps to something higher (16? 32?) to recognize that the pieces of data being transferred are small, so we can combine more of them at once and have a smaller graph.

cc @jrbourbeau @ian-r-rose @jsignell @rjzamora