Exposing chunk array types in dask.array (e.g., sparsed and masked) #2977

Closed
shoyer opened this issue Dec 8, 2017 · 12 comments

@shoyer
Member

commented Dec 8, 2017

As noted in pydata/xarray#1769, dask currently does not expose any way to determine the computed type of a dask array containing MaskedArray or sparse blocks.

This is untenable for building complex codebases using dask: we need to know the type of subarrays so that errors can be raised at graph-construction time instead of compute time. For masked arrays this is merely inconvenient, but for mixed dense/sparse arrays it is a very serious concern, because accidentally densifying a sparse array could entail loading a very large array into memory.

I would suggest a simple hierarchy:

  • BaseArray: abstract base class for all dask array types
  • Array(BaseArray): plain numpy.ndarray elements
  • MaskedArray(BaseArray): numpy.ma.MaskedArray elements
  • SparseArray(BaseArray): sparse array elements.
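
A minimal sketch of what that hierarchy could look like (illustrative only; none of these classes exist in dask today):

class BaseArray:
    """Abstract base class for all dask array types."""

class Array(BaseArray):
    """Chunks are plain numpy.ndarray objects."""

class MaskedArray(BaseArray):
    """Chunks are numpy.ma.MaskedArray objects."""

class SparseArray(BaseArray):
    """Chunks are sparse arrays (e.g. sparse.COO)."""
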
@mrocklin

Member

commented Dec 8, 2017

Another option would be to keep a container class on the array or an example element. These have the advantage that we can avoid having to reproduce all of the logic for when various types turn into other types, which I suspect would be a significant effort to apply to the entire codebase.

Also, another issue, dask.array supports mixing container types.

@shoyer

Member Author

commented Dec 8, 2017

Another option would be to keep a container class on the array or an example element.

Sure, I would be happy with an array_type/chunk_type attribute, analogous to dtype on numpy arrays.

I'd like something that could (at least in principle) be type checked by a system like mypy, and that either subclasses or an array_type attribute could satisfy. Otherwise, it's quite likely that somebody is going to try to compute nansum() on a masked array (e.g., in xarray), with completely unknown results.

One downside of using array_type is that np.ma.MaskedArray has the design mistake of being an np.ndarray subclass. This makes it slightly awkward to distinguish masked from non-masked arrays (e.g., you need to use type() instead of isinstance()), although we already deal with this for NumPy arrays.
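
For example, because np.ma.MaskedArray subclasses np.ndarray, isinstance() cannot tell the two apart, while an exact type() check can (a small illustration, not dask code):

import numpy as np

masked = np.ma.MaskedArray([1.0, 2.0, 3.0], mask=[False, True, False])

isinstance(masked, np.ndarray)       # True -- subclassing hides the distinction
type(masked) is np.ndarray           # False
type(masked) is np.ma.MaskedArray    # True -- exact type check tells them apart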

These have the advantage that we can avoid having to reproduce all of the logic for when various types turn into other types, which I suspect would be a significant effort to apply to the entire codebase.

Yes, it's a lot of work to add logic for extra types. I don't see any way to avoid that if you want to make a system that works reliably. A good starting point is to assume that existing dask operations only work on arrays containing NumPy arrays, and add support for masked/sparse on a by-exception basis.

Also, another issue, dask.array supports mixing container types.

We could add MixedArray(BaseArray) to my list, or simply use BaseArray for mixed use cases.

@mrocklin

Member

commented Dec 8, 2017

To me this doesn't seem like a case where types are the right move. Types are hierarchical, while the container type is only one of many aspects of the array. It feels like a mismatch.

Yes, it's a lot of work to add logic for extra types. I don't see any way to avoid that if you want to make a system that works reliably.

I think that there are many ways to ensure reliability other than types.

@shoyer

Member Author

commented Dec 8, 2017

Types are hierarchical, while the container type is only one of many aspects of the array.

I don't understand. Are you also objecting to the array_type property?

I'm proposing using types (in dask array) to model types (np.ndarray/MaskedArray/SparseArray). It feels pretty natural to me.

I think that there are many ways to ensure reliability other than types.

I think types are one of the very few ways we have to ensure reliable behavior in complex/distributed systems, short of extensive test coverage.

@mrocklin

Member

commented Dec 8, 2017

I'm not objecting to the array_type property. I'm objecting to subclassing da.Array. I see now though that that isn't what you were suggesting.

MixedArray(BaseArray)

I think that it would be more natural to have tuples of types. Also, I wouldn't build new types like SparseArray; I would just use existing container types like np.ndarray.

I think types are one of the very few ways we have to ensure reliable behavior in complex/distributed systems, short of extensive test coverage.

Let's note that we are now at the point of philosophical discussion where there is no correct or incorrect answer. My opinion is that types are quite valuable, but that they (like tests) don't ensure reliable behavior. They are both valuable, as are a number of other mechanisms.

@jakirkham added the array label Dec 8, 2017

@shoyer changed the title from "Use subclasses to indicate dask array sub-types (e.g., sparsed and masked)" to "Exposing chunk array types in dask.array (e.g., sparsed and masked)" Dec 8, 2017

@shoyer referenced this issue Dec 18, 2017

Merged

xarray to and from Iris #1750

@mrocklin

Member

commented Oct 9, 2018

Copying over a comment from duplicate #4070

Sometimes we want to ask questions about the kind of constituent arrays we have within our dask arrays. For example we might want to know if it is sparse or on the GPU.

Currently we are unable to answer these questions, except by doing a small computation.

To resolve this we might consider storing a small empty array rather than the dtype. This is similar to what we do with the _meta attribute on Dask DataFrames.

This would help us to understand more about our arrays, but would also introduce some potential challenges when computing metadata of subsequent arrays.

@shoyer suggested

though it might be enough to store the type of the arrays in each chunk.

I think that for many of the questions we want to ask, storing the type suffices. However, I suspect that it is harder to propagate type information along, and that it is easier to create a closed system if you have a full object.

Your use of "type of the arrays in each chunk" reminds me that Dask arrays can be mixed. Maybe this means that _meta would have to be a tuple of objects, one for each type present?
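
As a rough illustration of the example-element idea (the sparse package and the helper below are assumptions, not dask API), a tiny empty chunk can answer "is it sparse?" without computing anything:

import numpy as np
import sparse  # assumes the sparse package is installed

def chunks_are_sparse(example_chunk):
    # Inspect the stored example element rather than computing a real chunk.
    return isinstance(example_chunk, sparse.SparseArray)

chunks_are_sparse(np.empty((0, 0), dtype="f8"))             # False
chunks_are_sparse(sparse.COO.from_numpy(np.zeros((1, 1))))  # True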

@shoyer

Member Author

commented Oct 9, 2018

I'm happy going either way on types vs objects, whichever is easier.

Unfortunately, I suspect making the closed system with actual objects will be a little tricky, due to edge cases with arrays of size 0 or size 1 (which NumPy sometimes treats like scalars). But it's probably worth a try.

@mrocklin

Member

commented Oct 9, 2018

I think that for the moderate future we would still need to accept dtype inputs from users in functions like map_blocks:

x.map_blocks(func, dtype=float)

In these cases we would probably assume NumPy, and allow ourselves to be wrong (as we currently do with dtypes).

@mrocklin

Member

commented Feb 28, 2019

Thinking about this again, I think that we should add a ._meta attribute to Dask Array, mimicking the current ._meta attribute on dask.dataframe (which does mostly what is proposed above). We would then replace the .dtype attribute with a computed property:

@property
def dtype(self):
    return self._meta.dtype

The Array constructor will have to be changed to accept a meta= keyword. We'll probably also have to continue accepting a dtype= keyword for backwards compatibility; in that case I recommend that we construct an empty NumPy array of that dtype as the meta:

if meta is None and dtype is not None:
    meta = np.empty(shape=(0,) * len(shape), dtype=dtype)

Then we'll need to go through all of the functions and change the dtype-management code to meta-management code. This is where most of the work will be I suspect. Hopefully it's just a lot of straightforward work, but I wouldn't be surprised if there are a couple of tricky cases (I'm more than happy to help out with these).

The plan I've proposed above doesn't support the mixed-array case (where we have a mix of dense and sparse arrays in the same dask array, for example), but I suggest that we start with it anyway. I think that it will be a good and productive first step.
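
A sketch of what a piece of that meta-management code could look like, assuming that applying the operation to the tiny empty metas reproduces the output container type (the helper name here is hypothetical):

import numpy as np

def propagate_meta(op, *input_metas):
    # Meta-management: apply the operation to the empty meta arrays so that
    # the result carries both the output dtype and the output container type.
    return op(*input_metas)

# A masked meta combined with a plain meta yields a masked meta with dtype float64.
m1 = np.ma.masked_array(np.empty((0,), dtype="f8"))
m2 = np.empty((0,), dtype="i4")
out_meta = propagate_meta(np.add, m1, m2)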

@pentschev referenced this issue Mar 5, 2019

Merged

Add Dask Array._meta attribute #4543

@jakirkham

Member

commented Mar 21, 2019

I am writing this in part to record some concerns about this approach that @mrocklin and I discussed offline. Please fill in or correct anything that I have forgotten or misremembered, Matt.

One concern with _meta generally is that it seems very magical. The reason this is worrisome is that magic tends to result in code that requires more work to maintain. This can cause confusion over how it works (for users and developers), result in tricky bugs, and create challenges for building on this code later.

Of course, if _meta is broadly useful for various downstream users, this approach becomes more reasonable. Not because those concerns are eliminated, but because there is more buy-in from others who would be affected by these issues. It also means that there will likely be more people willing to help address these issues in the future. Though it is worth confirming that this is in fact the case (before biting off more than we can chew).

Another concern is that trying out functions on a trivial array can run into some thorny issues. One example that I mentioned here is that np.max errors on an empty array. It is not too difficult to run into thornier issues (e.g. convolving a fixed-size kernel that expects arrays of a certain size), and the resulting errors are usually less than clear.

This could reasonably be addressed by having a clear fallback path: first, try using the _meta array to see the behavior; if this doesn't work, fall back to the old dtype-based strategy; finally, if nothing works, raise an error asking the user to supply the missing meta information (e.g. dtype). As a benefit, this error would now show up at graph-construction time instead of run time, which is friendlier to users.
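
A sketch of that fallback order (the helper name and signature are hypothetical, not an implemented dask API):

import numpy as np

def infer_output_meta(func, input_metas, dtype=None):
    # 1. Preferred: run the operation on the tiny meta arrays.
    try:
        return func(*input_metas)
    except Exception:
        pass
    # 2. Fallback: the old dtype-only behaviour, assuming NumPy chunks.
    if dtype is not None:
        return np.empty((0,), dtype=dtype)
    # 3. Last resort: fail at graph-construction time with a clear request.
    raise ValueError(
        "Could not infer output metadata; please specify meta= or dtype= explicitly."
    )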

@mrocklin

Member

commented Mar 21, 2019

Motivation from dask-ml here: dask/dask-ml#394 (comment)

@pentschev

Member

commented Mar 22, 2019

One concern with _meta generally is that it seems very magical. The reason this is worrisome is that magic tends to result in code that requires more work to maintain. This can cause confusion over how it works (for users and developers), result in tricky bugs, and create challenges for building on this code later.

This was exactly my fear, which I mentioned before. However, after I understood more about Dask's code, that fear lessened. I think the hardest corner cases have already been addressed, and I just got rid of some of the ugliest and most concerning code, which I had probably added not because it was necessary, but because of other issues I was dealing with at the time.

This could reasonably be addressed by having a clear fallback path: first, try using the _meta array to see the behavior; if this doesn't work, fall back to the old dtype-based strategy; finally, if nothing works, raise an error asking the user to supply the missing meta information (e.g. dtype). As a benefit, this error would now show up at graph-construction time instead of run time, which is friendlier to users.

Unless I misunderstand what you're saying, this is handled the way you suggested here. Of course, there may be other paths that need similar handling that I didn't see; if that's the case, we have to address them.
