Bag repartition partition_size #6371
Conversation
dask/bag/core.py (outdated)

    graph = HighLevelGraph.from_collections(new_name, dsk, dependencies=[self])
    return Bag(graph, name=new_name, npartitions=npartitions)
    if sum([npartitions is not None, partition_size is not None]) != 1:
Suggested change:

    -    if sum([npartitions is not None, partition_size is not None]) != 1:
    +    if npartitions is not None and partition_size is not None:
@TomAugspurger I don't think that suggestion actually catches the same things as the current sum. IIUC, the current if is the not-XOR between npartitions is not None and partition_size is not None (letting it be True when either both or neither are defined), while your suggestion would only catch the case when both are defined.

A possible shortening would be:

    -    if sum([npartitions is not None, partition_size is not None]) != 1:
    +    if (npartitions is None) == (partition_size is None):

(This casts the is None checks to bools with the parentheses and compares them with ==, which is identical to not-XOR for bools, so the explicit not is no longer needed.)

I don't have enough knowledge of the dask codebase to comment on this PR in general.
Yeah, that's right: the purpose of that check is to ensure that exactly one of npartitions and partition_size is set. Checking that both are not None only catches the case where both are set. @sroets' way works, or not (bool(npartitions) ^ bool(partition_size)) is another option.
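To make the discussion above concrete, here is a minimal standalone sketch (the function names are hypothetical, not from the PR) comparing the three checks on every combination of given/missing arguments:

```python
def check_sum(npartitions, partition_size):
    # The check in the PR: error unless exactly one argument is given.
    return sum([npartitions is not None, partition_size is not None]) != 1

def check_eq(npartitions, partition_size):
    # The suggested shortening: == on bools is not-XOR, so this is equivalent.
    return (npartitions is None) == (partition_size is None)

def check_and(npartitions, partition_size):
    # The `and` variant: only detects the both-given case.
    return npartitions is not None and partition_size is not None

cases = [(None, None), (2, None), (None, "1MB"), (2, "1MB")]
# The sum form and the ==(not-XOR) form agree on every combination...
assert all(check_sum(n, p) == check_eq(n, p) for n, p in cases)
# ...but the `and` form misses the neither-given case:
assert check_and(None, None) is False  # no error would be raised
assert check_sum(None, None) is True   # error correctly flagged
```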
    return _split_partitions(bag, nsplits, new_name)

    def repartition_size(bag, size):
This seems to share a lot of code with dask.dataframe.core.repartition_size. Do you see opportunities for deduplication?
Yeah, that's a good point. At first glance it looks like they could share most of the code in that method. The only differences are:
- total_mem_usage
- _split_partitions (the guts of which are the same, but the way the return value is built differs; this could be easily handled)
- the dataframe code uses pandas Series and NumPy in places where bag uses functions from toolz; that could probably be handled easily as well.

So I think a way to deduplicate would be to move the repartition_size method to a shared space (is base.py the place for this?). The reason I didn't start with that is that it seems like it might be awkward for that one method to be shared while most of the other operations are not. Let me know what you think.
@TomAugspurger there is definitely duplicated code between bag and dataframe with respect to repartitioning. I'm just not sure of the best way to make them share code in the existing project structure.

It feels odd to put the shared code in DaskMethodsMixin (which is already shared between bag and dataframe), as that seems to house very generic methods. Putting it in utils, which is also already shared, doesn't seem quite right either, since those seem like general-purpose methods not tied to needing a dask object.

So I guess a third option would be to put the various partitioning methods in a separate utils file, called partition_utils.py or something like that, which only dataframe and bag import. Does that sound like a good path forward, or am I missing some other way to share the code?
FWIW, I'm comfortable with some code duplication here. The complexity of factoring this out might be more than the complexity of duplication given that this is only going to happen twice.
dask/bag/core.py (outdated)

    def total_mem_usage(bag):
        return sys.getsizeof(bag)
You might want to try dask.sizeof.sizeof instead here. I don't think that sys.getsizeof has the behavior that you want.
    In [1]: import sys

    In [2]: sys.getsizeof(list(range(1000000)))
    Out[2]: 9000120

    In [3]: sys.getsizeof({"x": list(range(1000000))})
    Out[3]: 248

    In [6]: from dask.sizeof import sizeof

    In [7]: sizeof({"x": list(range(1000000))})
    Out[7]: 37000482
dask/bag/core.py (outdated)

    graph = HighLevelGraph.from_collections(new_name, dsk, dependencies=[self])
    return Bag(graph, name=new_name, npartitions=npartitions)
    if not bool(npartitions) ^ bool(partition_size):
0 is a valid argument for npartitions, so I think this would be more accurate as:

    -    if not bool(npartitions) ^ bool(partition_size):
    +    if not (npartitions is None) ^ (partition_size is None):
I think this hasn't been done yet, correct?
FWIW, in other places where we're checking that exactly one input is given we use something like:

Lines 1164 to 1178 in 2411b0a

which I personally find clearer.
Cool, I went with that here.
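The falsy-argument pitfall flagged above can be shown in two lines (a hypothetical standalone illustration, not code from the PR): bool() treats an explicit npartitions=0 the same as "not given", while comparing against None does not.

```python
# npartitions=0 is explicitly given; partition_size is not.
npartitions, partition_size = 0, None

# bool(0) is False, so the bool-based XOR wrongly reports "not exactly one given":
assert (not (bool(npartitions) ^ bool(partition_size))) is True

# The `is None` form distinguishes an explicit 0 from a missing argument:
assert (not ((npartitions is None) ^ (partition_size is None))) is False
```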
dask/bag/core.py (outdated)

    This can be used to reduce or increase the number of partitions
    of the bag.
    """
    new_name = "repartition-%d-%s" % (npartitions, tokenize(bag, npartitions))
Can we move this down a few lines since it might not be needed?
Whoops, I thought I had addressed that. I moved it down a couple of lines as suggested.
Force-pushed from e9dd2b3 to 9635892
dask/sizeof.py (outdated)

    )

    @sizeof.register(chain)
I'm not sure if this belongs here (since it doesn't compute the current size of a chain object, but rather how much space the object would take up after compute is run), or in the total_mem_usage method in bag/core.py.

But the point of adding this was to make sure that repartitioning a bag multiple times in a row works: on subsequent repartitions, all the partitions were itertools.chain objects, and dask.sizeof.sizeof didn't have a way to accurately gauge their memory usage.
hmm. That is an interesting question. This seems like the right place to me. The point of the sizeof functions is to keep track of how many bytes the system would need to keep the object in memory.
We'll want to make sure that this isn't burning a consumable, and that this data will still be around for future use. Probably before we repartition we'll want to reify all iterators into lists anyway.
Yes, good point about not consuming the iterator. I decided to try just making a copy of the chain object to preserve the original, and using this same calculation.
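The copy-before-measuring idea above can be sketched standalone (sys.getsizeof stands in for dask.sizeof.sizeof here, and the function name is hypothetical; this also relies on chain objects being copyable via their pickle support, which holds on Python ≤ 3.13):

```python
import sys
from copy import deepcopy
from itertools import chain

def sizeof_chain(obj):
    # Deep-copy the chain and iterate only the copy, so the original
    # one-shot iterator is not consumed by the size measurement.
    copied = deepcopy(obj)
    return sum(sys.getsizeof(item) for item in copied)

c = chain([1, 2, 3], [4, 5])
nbytes = sizeof_chain(c)
assert nbytes > 0
assert list(c) == [1, 2, 3, 4, 5]  # the original iterator is still intact
```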
Force-pushed from 33fd1f6 to 2dc48fa
OK, I fixed what was making one of the tests flaky. Separate question: the …

It looks like … Here are the relevant docs: https://docs.python.org/3/library/sys.html#sys.getsizeof

Ah yeah, my bad, that's right. I got mixed up because I forgot integers inherited from …

Ok, I think this is good to go! @dask/maintenance
dask/bag/core.py (outdated)

    def total_mem_usage(bag):
        return sizeof(bag)
It looks like this can be removed and we can just use sizeof instead.
dask/bag/core.py (outdated)

    # 1. split each partition that is larger than partition size
    nsplits = [1 + mem_usage // size for mem_usage in mem_usages]
    if any((nsplit > 1 for nsplit in nsplits)):
        split_name = "repartition-split-{}-{}".format(size, tokenize(bag))
I recommend that we exclude the full integer size here and instead put it in the tokenization call. These names are used in things like diagnostics and the dashboard, and having values like repartition-split-10248258572 show up in a progress bar is probably not ideal. If you want to include a text version I'm ok with that, but I would probably defer to just tokenizing everything.
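The split arithmetic quoted above is easy to illustrate with made-up numbers (these values are purely hypothetical):

```python
size = 100                    # target partition size in bytes
mem_usages = [50, 120, 310]   # measured size of each existing partition

# Each partition is split into 1 + floor(mem_usage / size) pieces,
# so only partitions larger than `size` end up split at all.
nsplits = [1 + mem_usage // size for mem_usage in mem_usages]
assert nsplits == [1, 2, 4]
```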
I added a couple of small comments. I haven't gone into the details of the algorithm though. I'm curious, has anyone tried this out in practice?
The failures seem unrelated to the changes here.
Force-pushed from 6938121 to 96b4673
Hi, just wanted to check whether there are any further comments or suggestions here?
Sorry for the delay @joshreback. I am going to test this out locally and then it should be good to go!
Hmm. I am confused by this:

    from dask.sizeof import sizeof
    import dask.bag as db

    b = db.from_sequence([1, 2, 3, 4, 5, 6])
    size = sizeof(b)
    # 48

    new = b.repartition(partition_size=size)
    new.npartitions
    # 18
Looks like … I agree those results are pretty odd looking, but I would guess the situation where someone wants a partition size that's smaller than an indivisible partition is kind of rare (edit: actually I have no idea if that situation comes up a lot, but I'd guess that repartitioning in that way would not be that useful in practice?). Not sure what the best way forward is... I could try repartitioning followed by culling partitions that would come out as empty? Let me know if that seems reasonable. EDIT: From what I can tell, that seems to require calling …
…ertools chain object; tweak tests
… to make them independent of exact sizeof numbers
… are all of the same size
Force-pushed from 8c5f84e to 69a647f
It sounds like the …

I don't think there is a …

Ah! Thanks for pointing that out @joshreback. In that case I think this is good to go!
jrbourbeau
left a comment
Thanks for the PR @joshreback! This looks close to done and I'm looking forward to seeing it merged :) In particular, it was good to see the several tests you added to ensure repartitioning acts as expected.

I've left a few small comments, and it looks like there are a couple of comments from @jsignell that are still TODO.
dask/bag/tests/test_bag.py (outdated)

    def test_repartition_partition_size_complex_dtypes():
        import numpy as np
Since NumPy isn't a required dependency for Dask bag, we'll want to skip this test if NumPy isn't installed. Pytest has a convenient pytest.importorskip function for this. Here's an example of it in use:
dask/bag/tests/test_bag.py, lines 369 to 376 in 69a647f
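A minimal sketch of the pattern being suggested (the test body here is hypothetical, not the PR's actual test): pytest.importorskip imports the module if available and skips the whole test otherwise.

```python
import pytest

def test_repartition_partition_size_complex_dtypes():
    # Skips this test cleanly when NumPy is not installed,
    # instead of failing with an ImportError.
    np = pytest.importorskip("numpy")
    assert np.arange(3).sum() == 3
```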
dask/bag/core.py (outdated)

    import operator
    import uuid
    import warnings
    from dask.sizeof import sizeof
Nitpick: could you please make this a relative import and move it down a few lines to be with the rest of the Dask imports?
Lines 39 to 63 in 69a647f
For reference, most modules in the codebase follow a pattern of stdlib imports, then third-party imports, then Dask imports. Not a big deal, but just wanted to point out this convention.
dask/bag/core.py (outdated)

    Notes
    -----
    Exactly one of `npartitions` or `partition_size` should be specified.
Suggested change:

    -    Exactly one of `npartitions` or `partition_size` should be specified.
    +    Exactly one of ``npartitions`` or ``partition_size`` should be specified.
dask/bag/core.py (outdated)

    if isinstance(bag, chain):
        bag = reify(deepcopy(bag))
Could you add a small comment on why this isinstance check + deepcopy is needed? That'll help us later when we come across this :)
@jsignell my mistake, I lost track of a couple of your comments. But I think I have addressed those now, plus the suggestions from @jrbourbeau. Let me know if I missed anything!
jrbourbeau
left a comment
Thanks for your work on this @joshreback! This is in …
black dask
flake8 dask

Closes #6197

Hope it's alright; it didn't look like anyone was going to take up that issue, so I just took a shot at it to get more familiar with the codebase.

A couple of refactors (moving iter_chunks to utils, making repartition_size and repartition_npartitions share code), but generally the approach is adapted from how this is implemented for dataframes. Appreciate any feedback!