Adding chunk & type information to dask high level graphs #7309
Conversation
Copying over info from #7141 (comment)
EDIT: this is not quite accurate, the keys of …
Anyway, continuing on...
This has been slower going than I'd like, because I keep getting failing tests locally when I run pytest on the master branch. Some of it is this (#7291), but my workaround of increasing …
Update: deleting everything (GitHub repo & conda env) then starting again did the trick.

Are dataframe divisions analogous to dask array chunks? I think I can find the dtype by looking at what …

The tests all pass if I run pytest with only the changes made to …

Also, the errors/failures I saw earlier in …

Found the common thread!
Here is a better minimal reproducible example for our test failures (example is from the docs here):

```python
import dask.array as da
import zarr as zr

c = (2, 2)
d = da.ones((10, 11), chunks=c)
z1 = zr.open_array('lazy.zarr', shape=d.shape, dtype=d.dtype, chunks=c)
d1 = d.store(z1, compute=False, return_stored=True)  # fails with KeyError
```
The problem seems to be here (lines 1028 to 1034 in 850472d).

Commenting out line 1029 will cause the same kind of failure when passing a …
tl;dr

What is happening: Most tests are failing because we're calling … (line 1029 in 850472d). So then the dictionary … Separately, dask/dask/array/tests/test_array_core.py (line 3907 in 850472d) …, which means that when we make a new dask array, …

What will we do about it? We need to work out: …
@rjzamora & @crusaderky - Martin suggested you might be interested in taking a look at this. It's blocked until we've made a decision about the desired behaviour moving forward. Here's the summary: #7309 (comment)
There's no guarantee that name matches an actually existing layer. It normally does, but there are exceptions. And you have to be very careful going down the route of "fixing" these exceptions, because you must avoid at all costs collisions in layer names.

Please don't set attributes in a Python class that are not explicitly declared in the class itself. It's bad practice and a guarantee for things to fall apart when either (1) the class declares …
This makes no sense to me; how can a function that replaces a da.Array with a np.ndarray alter the da.Array.dask?
Thanks for your comments @crusaderky
Here is a more minimal example showing the reason for the failure in …:

```python
def test_genevieve():
    with dask.config.set(array_plugins=[lambda x: x.compute()]):
        x = da.ones(10, chunks=5)
        y = x + 1
        assert isinstance(y, np.ndarray)
```

Because the array plugin is a function calling `compute()` …

Details:

```
=================================== FAILURES ===================================
________________________________ test_genevieve ________________________________

    def test_genevieve():
        with dask.config.set(array_plugins=[lambda x: x.compute()]):
>           x = da.ones(10, chunks=5)

dask/array/tests/test_array_core.py:3895:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask/array/wrap.py:78: in wrap_func_shape_as_first_arg
    return Array(dsk, name, chunks, dtype=dtype, meta=kwargs.get("meta", None))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'dask.array.core.Array'>
dask = <dask.highlevelgraph.HighLevelGraph object at 0x7fe43c9f7fa0>
name = 'ones-5dd10755422ede42ff13cddda545a499', chunks = ((5, 5),)
dtype = dtype('float64'), meta = array(2.5e-323), shape = None

    def __new__(cls, dask, name, chunks, dtype=None, meta=None, shape=None):
        self = super(Array, cls).__new__(cls)
        assert isinstance(dask, Mapping)
        if not isinstance(dask, HighLevelGraph):
            dask = HighLevelGraph.from_collections(name, dask, dependencies=())
        self.dask = dask
        self.name = str(name)
        meta = meta_from_array(meta, dtype=dtype)

        if (
            isinstance(chunks, str)
            or isinstance(chunks, tuple)
            and chunks
            and any(isinstance(c, str) for c in chunks)
        ):
            dt = meta.dtype
        else:
            dt = None
        self._chunks = normalize_chunks(chunks, shape, dtype=dt)
        if self.chunks is None:
            raise ValueError(CHUNKS_NONE_ERROR_MESSAGE)
        self._meta = meta_from_array(meta, ndim=self.ndim, dtype=dtype)

        for plugin in config.get("array_plugins", ()):
            result = plugin(self)
            if result is not None:
                self = result

>       if name in self.dask.layers:
E       AttributeError: 'numpy.ndarray' object has no attribute 'dask'

dask/array/core.py:1159: AttributeError
```
The most obvious fix is to add a hasattr check: `if hasattr(self, 'dask') and name in self.dask.layers:`
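A minimal runnable sketch of that guard in action. Here `attach_info` is a hypothetical helper standing in for the metadata-writing code at the end of `Array.__new__`; it is not the actual dask source.

```python
import dask.array as da

def attach_info(arr, name, info):
    """Hypothetical helper: only record layer metadata when `arr` is still a
    dask collection with a HighLevelGraph (an array plugin may have replaced
    it with a concrete result such as a np.ndarray)."""
    if hasattr(arr, "dask") and name in arr.dask.layers:
        return {name: info}  # where the metadata would be stored
    return {}                # plugin replaced the collection; nothing to do

x = da.ones(10, chunks=5)
print(attach_info(x, x.name, {"chunks": x.chunks}))  # records the metadata

y = x.compute()  # what an array plugin calling compute() would return
print(attach_info(y, x.name, {}))  # returns {} instead of raising AttributeError
```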
I'm happy with this, unless there's any other feedback.
Missing unit tests. Also, have there been measurements on real-life use cases of storing the chunks information for each intermediate step this way? It can get pretty large pretty fast.

I'd like to hear from others what they think about the design - particularly since this new `info` dict heavily overlaps with the `annotations` dict (but, unlike annotations, it will get lost in transit when moving to the distributed scheduler).
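To get a rough feel for the size concern, here is a quick back-of-the-envelope measurement (my own illustration, not a benchmark from this PR):

```python
import sys

# chunks for a hypothetical 100_000 x 100_000 array in 100 x 100 chunks
chunks = ((100,) * 1000, (100,) * 1000)

# the tuple containers alone are ~8 KB per axis, before counting the ints,
# and this would be stored again for every intermediate layer in the graph
print(sys.getsizeof(chunks[0]) + sys.getsizeof(chunks[1]))
```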
self.dask.layers[name].info["type"] = type(self) | ||
self.dask.layers[name].info["divisions"] = divisions | ||
self.dask.layers[name].info["chunk_type"] = type(meta) | ||
|
- could use the same refactoring as da.Array
- why nothing for Series?
- dd does not have anything equivalent to array_plugins
- could use the same refactoring as da.Array - ok, done
- why nothing for Series? - Good point, I've added `"series_dtypes": {col: meta[col].dtype for col in meta.columns}` (see the sketch after this list)
- dd does not have anything equivalent to array_plugins - This looks like a comment and not a request. Thanks for mentioning it!
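A small illustrative sketch of the dataframe-side metadata being discussed, reusing the keys from the snippet above. The `meta` and `divisions` values here are stand-ins for what a dask DataFrame carries; the exact merged code may differ.

```python
import pandas as pd

# stand-ins for the `meta` and `divisions` a dask DataFrame carries
meta = pd.DataFrame({"a": pd.Series(dtype="int64"), "b": pd.Series(dtype="float64")})
divisions = (0, 5, 10)

info = {
    "divisions": divisions,
    "chunk_type": type(meta),
    "series_dtypes": {col: meta[col].dtype for col in meta.columns},
}
print(info["series_dtypes"])  # {'a': dtype('int64'), 'b': dtype('float64')}
```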
dask/highlevelgraph.py (outdated)

```diff
@@ -59,6 +59,7 @@ class Layer(collections.abc.Mapping):
     annotations: Optional[Mapping[str, Any]]

     def __init__(self, annotations: Mapping[str, Any] = None):
+        self.info = {}
```
- please add type annotations above
- please add documentation
- you'll lose everything on clone(), cull(), and possibly some other methods
- please add type annotations above - added
- please add documentation - added (see the sketch after this list)
- you'll lose everything on clone(), cull(), and possibly some other methods - I think this is a comment, not a request to fix just now? If I've misunderstood please let me know.
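For illustration, a sketch of roughly what the declaration might look like with the requested annotation and documentation added. The attribute was still named `info` at this point in the review; the merged version was later renamed `collection_annotations`, as shown further down the thread. This is not the exact dask source.

```python
from typing import Any, Mapping, Optional
import collections.abc


class Layer(collections.abc.Mapping):
    """High level graph layer (abridged sketch; abstract Mapping
    methods such as __getitem__ are omitted here)."""

    annotations: Optional[Mapping[str, Any]]

    #: Metadata describing the collection this layer was generated from,
    #: e.g. {"type": ..., "chunks": ..., "dtype": ...}. Unlike `annotations`,
    #: it is not transmitted to the distributed scheduler.
    info: Mapping[str, Any]

    def __init__(self, annotations: Mapping[str, Any] = None):
        self.annotations = annotations
        self.info = {}
```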
Co-authored-by: crusaderky <crusaderky@gmail.com>
Are you suggesting putting this information into the …

@GenevieveBuckley That is a possibility, but the potential performance implications of transferring the extra data to the scheduler are non-trivial. I'm suggesting that those who are heavily involved in the design of graph annotations should be involved in this design too.
In a lot of cases, I don't think users actually want this level of information. At least, not all of the time. With arrays, it's common to have uniformly sized chunks, perhaps with some funny-sized ones towards the edges. Can we separate this somehow into (1) a typical chunk size and (2) the full detailed chunk sizes? (A sketch of one way to compute (1) follows below.)

I don't know if it's easy to create an answer to (1), but I think that's the first question people try to answer when troubleshooting, even if you start out by giving them all the detailed chunk sizes.
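A minimal sketch of how a "typical chunk size" summary could be derived from the full chunks tuple. `typical_chunks` is a hypothetical helper, not an existing dask API:

```python
from collections import Counter

def typical_chunks(chunks):
    """Summarize e.g. ((10, 10, 10), (10, 10, 1)) as the most common chunk
    size per axis, plus a flag saying whether any axis is ragged."""
    typical = tuple(Counter(sizes).most_common(1)[0][0] for sizes in chunks)
    ragged = any(len(set(sizes)) > 1 for sizes in chunks)
    return typical, ragged

print(typical_chunks(((10, 10, 10), (10, 10, 1))))  # ((10, 10), True)
```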
This conversation has stalled, mostly due to future-leaning concerns about how this information that we're adding to the layers on the client side could affect downstream submission on the scheduler and the workers. However, these concerns aren't valid yet (we don't currently care about this information on the scheduler or workers, so it would be good if they did not block us). Short term, I recommend that we proceed without using the current annotations machinery, and instead put this information somewhere else, something like …

@sjperkins are you ok with deferring the broader question here and letting this work go on? If so, are you ok with putting this metadata somewhere other than annotations for the short term?
Apologies, I did not realise this issue was blocked. I'm happy for this to proceed and defer the broader question.

Ok, I've tried out the … (If any of the CI tests fail, I'll come back and address those - it's late on Friday here so I should stop working now.)

@sjperkins would you be interested in taking another look through this?
My understanding of this PR is that it exists to provide information to dask visualization routines (dask.visualize?). I think the approach for the Array and Dataframe collections is fine.

What about Delayed and Bag? I believe Delayed objects have a HLG with layers but Bags do not. I'd be happy with attaching annotations to Delayed objects only, perhaps just with a `type` attribute. Are there any other attributes appropriate for Delayed?
Woo
Bag could use high level graphs if anyone spent the time to do it. I think that we're establishing a protocol here that can be extended in the future.

Since this PR supports experimental mucking around with visualizations, I think it makes sense to defer any decisions for Delayed & Bag until after we've had a chance to try it and see if it's a useful thing we want to do.

Thank you for looking over this again @sjperkins (and thank you @crusaderky for your earlier review, too)
Thanks for all your work on this @GenevieveBuckley! I've left a few small final comments, but overall this looks ready to merge
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
… use None instead of {}
Thanks @GenevieveBuckley for your work on this (and @sjperkins @crusaderky for reviewing)! I'm looking forward to seeing how we can use this new information
@freyam this PR might be relevant for your GSOC project later on. You don't have to read everything or understand all of the discussion (obviously things changed as we went, so the first parts might not make much sense). The part that's relevant for you is that we now have a new dictionary called `collection_annotations`:

```python
In [1]: import dask.array as da

In [2]: arr = da.random.random((100,100), chunks=(10,10))

In [3]: arr.dask.layers
Out[3]: {'random_sample-8c39afc91c532043b96497a2f2fa9875': <dask.highlevelgraph.MaterializedLayer at 0x7fe4b4339640>}

In [4]: mylayer = arr.dask.layers['random_sample-8c39afc91c532043b96497a2f2fa9875']

In [5]: mylayer
Out[5]: <dask.highlevelgraph.MaterializedLayer at 0x7fe4b4339640>

In [6]: mylayer.collection_annotations
Out[6]:
{'type': dask.array.core.Array,
 'chunk_type': numpy.ndarray,
 'chunks': ((10, 10, 10, 10, 10, 10, 10, 10, 10, 10),
  (10, 10, 10, 10, 10, 10, 10, 10, 10, 10)),
 'dtype': None}
```
Just a note: I am starting to (slowly) work through a design doc for general column-projection in Dask-Dataframe, and I am getting the sense that `collection_annotations` may be the right place to store Layer-wise column-dependency properties. That is, I am thinking that `optimize_dataframe_getitem` can be extended to work across multiple layers if common Dataframe operations are modified to (optionally) store the required input and output columns for the generated Layers. If all Layers in a HLG include this metadata, and the root Layer is a DataFrameIOLayer, then column projection becomes relatively simple.
Rick, I think that it is also reasonable for you to make a DataframeLayer class that defines a consistent set of attributes. I don't think that you need to restrict yourself to this dictionary and can probably use normal Python attributes. However, I do think that DataframeLayer should probably populate this dict secondarily if someone asks for it, as in the sketch below.
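A rough sketch of that idea. `DataframeLayer`, `input_columns`, and `output_columns` are hypothetical names, not an existing dask API:

```python
from dask.highlevelgraph import MaterializedLayer

class DataframeLayer(MaterializedLayer):
    """Hypothetical layer with real attributes for column dependencies."""

    def __init__(self, mapping, *, input_columns=None, output_columns=None):
        super().__init__(mapping)
        self.input_columns = input_columns    # columns this layer reads
        self.output_columns = output_columns  # columns this layer produces

    def populate_collection_annotations(self):
        # Populate the dict form secondarily, "if someone asks for it"
        self.collection_annotations = {
            "input_columns": self.input_columns,
            "output_columns": self.output_columns,
        }
```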
Right - That is certainly plan "A", but I am starting to consider alternatives as I struggle to decide on a "clean" sub-classing design :)
Maybe share your thoughts in an issue? Perhaps we can think through a good design together.
This PR is to give us a place for discussing implementation strategies for #7141 (comment)
@jrbourbeau - Matt says you'd probably like to have this discussion with me, or will tag in somebody else to handle it.
cc @madsbk, another person who might also be interested.
Steps:

- Add `chunks` & `dtype` to the dask high level graph, for both `Array` and `Dataframe` (see the sketch below)
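As a companion to the array example earlier in the thread, here is a sketch of inspecting the same metadata on the dataframe side. It assumes the `collection_annotations` attribute shown above; the exact keys stored per layer may differ:

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=2)

# each layer carries its own collection_annotations (possibly None)
for name, layer in df.dask.layers.items():
    print(name, layer.collection_annotations)
```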