Bag repartition partition_size #6371
Conversation
dask/bag/core.py (outdated)

    graph = HighLevelGraph.from_collections(new_name, dsk, dependencies=[self])
    return Bag(graph, name=new_name, npartitions=npartitions)
    if sum([npartitions is not None, partition_size is not None]) != 1:
Suggested change:

    -    if sum([npartitions is not None, partition_size is not None]) != 1:
    +    if npartitions is not None and partition_size is not None:
@TomAugspurger I don't think that suggestion actually catches the same things as the current sum. IIUC, the current if is the not-XOR between npartitions is not None and partition_size is not None (letting it be True when either both or neither are defined), while your suggestion would only catch the case when both are defined.

A possible shortening would be:

    -    if sum([npartitions is not None, partition_size is not None]) != 1:
    +    if (npartitions is None) == (partition_size is None):

(This casts the is None checks to bools with the parentheses and compares them with ==, which is identical to not-XOR for bools, so the explicit not is no longer needed.)

I don't have enough knowledge of the dask codebase to comment on this PR in general.
Yeah, that's right: the purpose of that check is to ensure that exactly one of npartitions and partition_size is set. Checking that both are not None only catches the case where both are set. @sroets' way works, or not (bool(npartitions) ^ bool(partition_size)) is another option.
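To make the discussion above concrete, here is a minimal standalone sketch (the function names are hypothetical, not from the PR) comparing the three checks on every combination of given/missing arguments:

```python
def check_sum(npartitions, partition_size):
    # The check in the PR: error unless exactly one argument is given.
    return sum([npartitions is not None, partition_size is not None]) != 1

def check_eq(npartitions, partition_size):
    # The suggested shortening: == on bools is not-XOR, so this is equivalent.
    return (npartitions is None) == (partition_size is None)

def check_and(npartitions, partition_size):
    # The `and` variant: only detects the both-given case.
    return npartitions is not None and partition_size is not None

cases = [(None, None), (2, None), (None, "1MB"), (2, "1MB")]
# The sum form and the ==(not-XOR) form agree on every combination...
assert all(check_sum(n, p) == check_eq(n, p) for n, p in cases)
# ...but the `and` form misses the neither-given case:
assert check_and(None, None) is False  # no error would be raised
assert check_sum(None, None) is True   # error correctly flagged
```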
    return _split_partitions(bag, nsplits, new_name)

    def repartition_size(bag, size):
This seems to share a lot of code with dask.dataframe.core.repartition_size. Do you see opportunities for deduplication?
Yeah, that's a good point. At first glance it looks like they could share most of the code in that method. The only differences are:
- total_mem_usage
- _split_partitions (the guts of which are the same, but the way the return value is built differs; this could be easily handled)
- the dataframe code uses pandas Series and NumPy in places where bag uses functions from toolz; that could probably be handled easily as well.

So I think a way to deduplicate would be to move the repartition_size method to a shared space (is base.py the place for this?). The reason I didn't start with that is that it seems like it might be awkward for that one method to be shared while most of the other operations are not. Let me know what you think.
@TomAugspurger there is definitely duplicated code between bag and dataframe with respect to repartitioning. I'm just not sure of the best way to make them share code in the existing project structure.

It feels odd to put the shared code in DaskMethodsMixin (which is already shared between bag and dataframe), as that seems to house very generic methods. Putting it in utils, which is also already shared, doesn't seem quite right either, since those seem like general-purpose methods not tied to needing a dask object.

So I guess a third option would be to put the various partitioning methods in a separate utils file, called partition_utils.py or something like that, which only dataframe and bag import. Does that sound like a good path forward, or am I missing some other way to share the code?
FWIW, I'm comfortable with some code duplication here. The complexity of factoring this out might be more than the complexity of duplication given that this is only going to happen twice.
dask/bag/core.py (outdated)

    def total_mem_usage(bag):
        return sys.getsizeof(bag)
You might want to try dask.sizeof.sizeof instead here. I don't think that sys.getsizeof has the behavior that you want.
    In [1]: import sys

    In [2]: sys.getsizeof(list(range(1000000)))
    Out[2]: 9000120

    In [3]: sys.getsizeof({"x": list(range(1000000))})
    Out[3]: 248

    In [6]: from dask.sizeof import sizeof

    In [7]: sizeof({"x": list(range(1000000))})
    Out[7]: 37000482
dask/bag/core.py (outdated)

    graph = HighLevelGraph.from_collections(new_name, dsk, dependencies=[self])
    return Bag(graph, name=new_name, npartitions=npartitions)
    if not bool(npartitions) ^ bool(partition_size):
0 is a valid argument for npartitions, so I think this would be more accurate as:

    -    if not bool(npartitions) ^ bool(partition_size):
    +    if not (npartitions is None) ^ (partition_size is None):
I think this hasn't been done yet, correct?
FWIW, in other places where we're checking that exactly one input is given we use something like:

Lines 1164 to 1178 in 2411b0a

which I personally find clearer.
Cool, I went with that here.
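The falsy-argument pitfall flagged above can be shown in two lines (a hypothetical standalone illustration, not code from the PR): bool() treats an explicit npartitions=0 the same as "not given", while comparing against None does not.

```python
# npartitions=0 is explicitly given; partition_size is not.
npartitions, partition_size = 0, None

# bool(0) is False, so the bool-based XOR wrongly reports "not exactly one given":
assert (not (bool(npartitions) ^ bool(partition_size))) is True

# The `is None` form distinguishes an explicit 0 from a missing argument:
assert (not ((npartitions is None) ^ (partition_size is None))) is False
```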
dask/bag/core.py (outdated)

    This can be used to reduce or increase the number of partitions
    of the bag.
    """
    new_name = "repartition-%d-%s" % (npartitions, tokenize(bag, npartitions))
Can we move this down a few lines since it might not be needed?
Whoops, I thought I had addressed that. I moved it down a couple of lines as suggested.
Force-pushed from e9dd2b3 to 9635892
dask/sizeof.py (outdated)

    )

    @sizeof.register(chain)
I'm not sure if this belongs here (since it doesn't compute the current size of a chain object, but rather how much space the object would take up after compute is run), or in the total_mem_usage method in bag/core.py.

But the point of adding this was to make sure that repartitioning a bag multiple times in a row works: on subsequent repartitions, all the partitions were itertools.chain objects, and dask.sizeof.sizeof didn't have a way to accurately gauge their memory usage.
hmm. That is an interesting question. This seems like the right place to me. The point of the sizeof functions is to keep track of how many bytes the system would need to keep the object in memory.
We'll want to make sure that this isn't burning a consumable, and that this data will still be around for future use. Probably before we repartition we'll want to reify all iterators into lists anyway.
Yes, good point about not consuming the iterator. I decided to try just making a copy of the chain object to preserve the original, and using this same calculation.
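The copy-before-measuring idea above can be sketched standalone (sys.getsizeof stands in for dask.sizeof.sizeof here, and the function name is hypothetical; this also relies on chain objects being copyable via their pickle support, which holds on Python ≤ 3.13):

```python
import sys
from copy import deepcopy
from itertools import chain

def sizeof_chain(obj):
    # Deep-copy the chain and iterate only the copy, so the original
    # one-shot iterator is not consumed by the size measurement.
    copied = deepcopy(obj)
    return sum(sys.getsizeof(item) for item in copied)

c = chain([1, 2, 3], [4, 5])
nbytes = sizeof_chain(c)
assert nbytes > 0
assert list(c) == [1, 2, 3, 4, 5]  # the original iterator is still intact
```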
Force-pushed from 33fd1f6 to 2dc48fa
OK, I fixed what was making one of the tests flaky. Separate question: the …

It looks like … Here are the relevant docs: https://docs.python.org/3/library/sys.html#sys.getsizeof

Ah yeah, my bad, that's right. I got mixed up because I forgot integers inherited from …

Ok, I think this is good to go! @dask/maintenance
dask/bag/core.py (outdated)

    def total_mem_usage(bag):
        return sizeof(bag)
It looks like this can be removed and we can just use sizeof instead.
dask/bag/core.py (outdated)

    # 1. split each partition that is larger than partition size
    nsplits = [1 + mem_usage // size for mem_usage in mem_usages]
    if any((nsplit > 1 for nsplit in nsplits)):
        split_name = "repartition-split-{}-{}".format(size, tokenize(bag))
I recommend that we exclude the full integer size here and instead put it in the tokenization call. These names are used in things like diagnostics and the dashboard, and having values like repartition-split-10248258572 show up in a progress bar is probably not ideal. If you want to include a text version I'm ok with that, but I would probably defer to just tokenizing everything.
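The split arithmetic quoted above is easy to illustrate with made-up numbers (these values are purely hypothetical):

```python
size = 100                    # target partition size in bytes
mem_usages = [50, 120, 310]   # measured size of each existing partition

# Each partition is split into 1 + floor(mem_usage / size) pieces,
# so only partitions larger than `size` end up split at all.
nsplits = [1 + mem_usage // size for mem_usage in mem_usages]
assert nsplits == [1, 2, 4]
```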
I added a couple of small comments. I haven't gone into the details of the algorithm though. I'm curious, has anyone tried this out in practice?
The failures seem unrelated to the changes here.
Force-pushed from 6938121 to 96b4673
Hi, just wanted to check whether there are any further comments or suggestions here?
Sorry for the delay @joshreback. I am going to test this out locally and then it should be good to go!
Hmm. I am confused by this:

    from dask.sizeof import sizeof
    import dask.bag as db

    b = db.from_sequence([1, 2, 3, 4, 5, 6])
    size = sizeof(b)
    # 48

    new = b.repartition(partition_size=size)
    new.npartitions
    # 18
Looks like … I agree those results are pretty odd looking, but I would guess the situation where someone wants a partition size that's smaller than an indivisible partition is kind of rare (edit: actually I have no idea if that situation comes up a lot, but I'd guess that repartitioning in that way would not be that useful in practice?). Not sure what the best way forward is... I could try repartitioning followed by culling partitions that would come out as empty? Let me know if that seems reasonable. EDIT: From what I can tell, that seems to require calling …
…ertools chain object; tweak tests
… to make them independent of exact sizeof numbers
… are all of the same size
Force-pushed from 8c5f84e to 69a647f
It sounds like the …

I don't think there is a …

Ah! Thanks for pointing that out @joshreback. In that case I think this is good to go!
jrbourbeau
left a comment
Thanks for the PR @joshreback! This looks close to done and I'm looking forward to seeing it merged :) In particular, it was good to see the several tests you added to ensure repartitioning acts as expected.

I've left a few small comments, and it looks like there are a couple of comments from @jsignell that are still TODO.
dask/bag/tests/test_bag.py (outdated)

    def test_repartition_partition_size_complex_dtypes():
        import numpy as np
Since NumPy isn't a required dependency for Dask bag, we'll want to skip this test if NumPy isn't installed. Pytest has a convenient pytest.importorskip function for this. Here's an example of it in use:
dask/bag/tests/test_bag.py, lines 369 to 376 in 69a647f
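A minimal sketch of the pattern being suggested (the test body here is hypothetical, not the PR's actual test): pytest.importorskip imports the module if available and skips the whole test otherwise.

```python
import pytest

def test_repartition_partition_size_complex_dtypes():
    # Skips this test cleanly when NumPy is not installed,
    # instead of failing with an ImportError.
    np = pytest.importorskip("numpy")
    assert np.arange(3).sum() == 3
```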
dask/bag/core.py (outdated)

    import operator
    import uuid
    import warnings
    from dask.sizeof import sizeof
Nitpick: could you please make this a relative import and move it down a few lines to be with the rest of the Dask imports?
Lines 39 to 63 in 69a647f
For reference, most modules in the codebase follow a pattern of stdlib imports, then third-party imports, then Dask imports. Not a big deal, but just wanted to point out this convention.
dask/bag/core.py (outdated)

    Notes
    -----
    Exactly one of `npartitions` or `partition_size` should be specified.
Suggested change:

    -    Exactly one of `npartitions` or `partition_size` should be specified.
    +    Exactly one of ``npartitions`` or ``partition_size`` should be specified.
dask/bag/core.py (outdated)

    if isinstance(bag, chain):
        bag = reify(deepcopy(bag))
Could you add a small comment on why this isinstance check + deepcopy is needed? That'll help us later when we come across this :)
@jsignell my mistake, I lost track of a couple of your comments. But I think I have addressed those now, plus the suggestions from @jrbourbeau. Let me know if I missed anything!
jrbourbeau
left a comment
Thanks for your work on this @joshreback! This is in …
black dask
flake8 dask

Closes #6197

Hope it's alright; it didn't look like anyone was going to take up that issue, so I just took a shot at it to get more familiar with the codebase.

A couple of refactors (moving iter_chunks to utils, making repartition_size and repartition_npartitions share code), but generally the approach is adapted from how this is implemented for dataframes. Appreciate any feedback!