Change Dask Bag partitioning scheme #10294
Conversation
Marking as draft while I look at the test failures.
```python
evens = db.from_sequence(range(0, hi, 2), npartitions=npartitions)
odds = db.from_sequence(range(1, hi, 2), npartitions=npartitions)
pairs = db.zip(evens, odds)
assert pairs.npartitions == npartitions
```
This assertion was only true because the length of the bag happened to be a multiple of 100. Updated it to be a little more robust.
```python
if len(seq) <= 100:
    partition_size = int(math.ceil(len(seq) / npartitions))
else:
    partition_size = max(1, int(math.floor(len(seq) / npartitions)))
```
I tweaked this so that the new heuristic is only applied on bags with more than 100 items, similar to automatic partitioning. We could remove this and just use floor/max for everything but a number of tests would need to be changed as it would break some assumptions we have with small bags today.
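As a rough sketch (the helper name is hypothetical, not the actual Dask source), the two-branch heuristic described above amounts to:

```python
import math

def partition_size_for(n_items: int, npartitions: int) -> int:
    # Hypothetical helper mirroring the branch above: small bags keep the
    # old ceil-based sizing, while larger bags switch to floor so the
    # resulting partition count overshoots rather than undershoots.
    if n_items <= 100:
        return int(math.ceil(n_items / npartitions))
    return max(1, int(math.floor(n_items / npartitions)))
```

For example, 300 items over 200 requested partitions gives a partition size of 1, so every requested partition receives work.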
cc @rjzamora
Thanks for the ping here @jacobtomlinson!
I spent some time thinking about this today, and found myself surprisingly distressed by the sawtooth behavior this PR is addressing.
My personal opinion is that the output partition count should always satisfy two criteria:
- It should be exactly equal to `npartitions` when the argument is not `None` and is `<= len(seq)`.
- When neither `npartitions` nor `partition_size` is specified, we should calculate a default `npartitions` value that monotonically increases with respect to `len(seq)`.
For example, I believe the following code (while a bit rough) will satisfy both of these criteria:
```python
remainder = 0
seq = list(seq)
seq_len = len(seq)
if partition_size is None:
    if npartitions is None:
        if seq_len <= 100:
            npartitions = seq_len
        else:
            npartitions = math.floor(100.0 * math.sqrt(seq_len / 100.0))
        npartitions = max(1, npartitions)
    partition_size = int(math.floor(seq_len / npartitions))
    remainder = seq_len % npartitions
if remainder:
    start_size = partition_size + 1
    start_count = start_size * remainder
    parts = list(partition_all(start_size, seq[:start_count])) + list(
        partition_all(partition_size, seq[start_count:])
    )
else:
    parts = list(partition_all(partition_size, seq))
```
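To illustrate the balanced behaviour of this suggestion without depending on `partition_all`, here is a small self-contained sketch (the helper name is mine, not from the PR) that splits a sequence into exactly `npartitions` pieces whose sizes differ by at most one:

```python
from itertools import islice

def balanced_partitions(seq, npartitions):
    # The first `remainder` partitions get one extra element, so exactly
    # `npartitions` partitions come out whenever npartitions <= len(seq).
    seq = list(seq)
    base, remainder = divmod(len(seq), npartitions)
    it = iter(seq)
    sizes = [base + 1] * remainder + [base] * (npartitions - remainder)
    return [list(islice(it, size)) for size in sizes]

# 300 items into 200 partitions: 100 partitions of 2 and 100 of 1.
parts = balanced_partitions(range(300), 200)
```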
Thanks @rjzamora I did look into trying to achieve that, but it does mean that partitions will be different sizes. I don't know if that is a problem or not.

With your proposed code problem 1 becomes: (plot not captured) And problem 2 becomes: (plot not captured)

However, if you specify something like `npartitions=200`:

```python
In [1]: import dask.bag as db
   ...: from collections import Counter
   ...: Counter(db.from_sequence(range(300), npartitions=200).map_partitions(len).compute())
Out[1]: Counter({2: 100, 1: 100})
```

I feel like intuitively from […]
If the user specifies this, sure.
I think this is not a hard requirement, just the ordinary case. I think you can reshape/rechunk an array to have all sorts of weird shapes.
I'm a bit confused why your plot ends at 200. Shouldn't this also grow without bounds? I agree with @rjzamora that […]
👍
That plot is the case where the user specifies 200 partitions. So if there are only 10 items in the bag it will have 10 partitions, nothing we can do about that. But when they get above 200 items then it plateaus there because that's what the user specified.
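In other words (a trivial sketch, with a hypothetical helper name), the effective partition count for an explicit `npartitions` is capped by the number of items, which is exactly the plateau in the plot:

```python
def effective_npartitions(n_items: int, requested: int) -> int:
    # A bag cannot have more partitions than items, so the curve rises
    # with n_items and then plateaus at the user-requested count.
    return min(n_items, requested)
```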
I agree. I think we should merge this as is, then follow up with another PR for @rjzamora's suggestion.
Yes. Agree that this PR is an improvement “as is”, and that my suggestions can be considered in a follow-up.
In #10291 we discuss two challenges we are seeing with scaling Dask Bag to large numbers of workers (200+). This PR addresses both of them.
Problem 1
The default partitioning scheme of `dask.bag.from_sequence` limits the number of potential tasks at 199 (if the number of items in the bag is 199) and trends towards 100. As discussed with @fjetter, it would be nice if this default scheme continued to increase the number of tasks, just at a non-exponential rate.

This change uses `math.sqrt` to grow the number of tasks at the rate of a square root. This allows any number of workers to be saturated with tasks given enough items in the bag.
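As a sketch of that growth curve (assuming the same `100 * sqrt(n / 100)` shape that appears in the suggested code earlier in the thread; the helper name is mine, not the Dask source):

```python
import math

def default_npartitions(n_items: int) -> int:
    # Up to 100 items each item can be its own partition; beyond that the
    # default keeps growing without bound, but only at square-root rate.
    if n_items <= 100:
        return max(1, n_items)
    return max(1, math.floor(100.0 * math.sqrt(n_items / 100.0)))
```

Under this shape, 400 items give 200 default partitions and 10,000 items give 1,000, so the count never plateaus.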
Problem 2
Setting the number of partitions equal to the number of workers feels like a good way to explicitly ensure Dask Bag scales to utilize all workers. However, due to the rounding done in the current scheme, this doesn't actually saturate all of the workers unless `nitems % workers == 0`.

This change switches the `partition_size` calculation to use floor instead of ceil, so that the number of tasks overshoots the requested partition count and trends towards it, rather than undershooting by up to 50%. It feels better to have too many tasks than not enough.
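A small sketch of the difference (the helper is illustrative, not the Dask implementation): with 250 items on 200 workers, ceil-based sizing produces far fewer tasks than workers, while floor-based sizing overshoots slightly and keeps everyone busy.

```python
import math

def n_tasks(n_items: int, n_workers: int, rounding) -> int:
    # Number of partitions produced when each partition holds
    # rounding(n_items / n_workers) items.
    size = max(1, int(rounding(n_items / n_workers)))
    return math.ceil(n_items / size)

ceil_tasks = n_tasks(250, 200, math.ceil)    # partition size 2 -> 125 tasks
floor_tasks = n_tasks(250, 200, math.floor)  # partition size 1 -> 250 tasks
```

When `n_items` divides evenly (e.g. 400 items on 200 workers), both roundings agree, which is why the old scheme only saturated workers when `nitems % workers == 0`.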