Docs: Exactly which numpy slicing features does an array-like need to support to be used with Dask.from_array #5281

Open
clbarnes opened this issue Aug 15, 2019 · 4 comments

clbarnes commented Aug 15, 2019

From https://docs.dask.org/en/latest/array-creation.html#create-dask-arrays

"any format that supports NumPy-style slicing"

but the from_array method has the fancy argument, suggesting that there's a way to implement a bare minimum of array slicing in the backend and let dask do the rest of the work.

What is that bare minimum?

dask.array.slicing seems to have utilities to fill out a lot of possibilities; my hope is that the backend for an N-D array would only have to support an N-length tuple of slice objects with >=0 start and stop, step=None.
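To make the hoped-for contract concrete, here is a minimal sketch of a backend that accepts only an N-length tuple of slice objects with non-negative start and stop and no step, and raises on anything else. `SliceOnlyStore` is a hypothetical name, the NumPy array inside it merely stands in for real storage, and the requirement that start and stop are always given explicitly is an extra assumption of this sketch.

```python
import numpy as np


class SliceOnlyStore:
    """Hypothetical array-like that accepts only the restricted indexing above."""

    def __init__(self, data):
        # A NumPy array stands in for the real storage backend (assumption).
        self._data = np.asarray(data)
        self.shape = self._data.shape
        self.dtype = self._data.dtype
        self.ndim = self._data.ndim

    def _check(self, key):
        if not isinstance(key, tuple) or len(key) != self.ndim:
            raise TypeError(f"expected a tuple of {self.ndim} slices, got {key!r}")
        for s in key:
            if not isinstance(s, slice):
                raise TypeError(f"only slice objects are supported, got {s!r}")
            if s.step is not None:
                raise ValueError("step must be None")
            # Assumption of this sketch: start and stop are always explicit.
            if s.start is None or s.stop is None or s.start < 0 or s.stop < 0:
                raise ValueError("start and stop must be explicit and >= 0")

    def __getitem__(self, key):
        self._check(key)
        return self._data[key]


store = SliceOnlyStore(np.arange(12).reshape(3, 4))
block = store[slice(0, 2), slice(1, 3)]  # rows 0-1, columns 1-2
```

To try this with Dask, one would pass the object to `da.from_array(store, chunks=(2, 2), fancy=False)`. Whether Dask then only ever issues keys of this restricted form is the open question here; a backend that raises on anything else at least makes any violation visible.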

Context: h5py_like attempts to provide base classes and utilities for libraries implementing ndarray storage (like pyn5) to make them behave like h5py objects, including some array indexing cases and rudimentary concurrent IO. I'd love to offload both responsibilities to dask's utilities, as it's much more mature, better tested, and has more eyes on it.

Same question for writing with dask.array.store.

@jakirkham (Member)

I think this means point selection (like a[0, 5]), contiguous slicing (like a[2:5]), non-contiguous slicing (like a[2:5:2]), and reversed slicing (like a[5:2:-1]). If there is a place where documentation would help, feel free to submit a PR adding some useful text where you were looking for this info. Would be happy to review 🙂
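For reference, the basic indexing forms listed above look like this on a plain NumPy array. This is only an illustration of NumPy semantics, not a statement of what Dask requires; the variable names are arbitrary. Note that with a negative step the start must be larger than the stop, so `a[2:5:-1]` selects nothing.

```python
import numpy as np

a = np.arange(10)
b = np.arange(12).reshape(2, 6)

point = b[0, 5]          # point selection: a single element
contiguous = a[2:5]      # contiguous slice
strided = a[2:5:2]       # non-contiguous (strided) slice
reversed_ = a[5:2:-1]    # reversed slice: start > stop with a negative step
empty = a[2:5:-1]        # start < stop with a negative step selects nothing
```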

Fancy indexing is more about selecting multiple points in different ways: inner indexing (just some points) and outer indexing (a grid of points with self-chosen spacing), plus things like boolean arrays for selection. I don't know if you use these at all. If not, I wouldn't worry about it.
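The three kinds of fancy indexing mentioned here can be illustrated with NumPy as follows. This is a small sketch of NumPy behaviour only; `np.ix_` is used as one way to express outer indexing, and the array and index lists are made up for the example.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
rows = [0, 2]
cols = [1, 3]

inner = a[rows, cols]          # pointwise: elements (0, 1) and (2, 3)
outer = a[np.ix_(rows, cols)]  # grid: rows 0 and 2 crossed with columns 1 and 3
masked = a[a % 5 == 0]         # boolean mask: elements divisible by 5
```

A backend that only implements basic slicing would not need to handle any of these keys itself.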

It's worth noting that libraries like h5py, Zarr, etc. will perform the first selection and then return a NumPy array, at which point everything works as expected (including fancy indexing). So this really only matters if you are trying to do fancy indexing as part of reading in data.

My guess is this won't matter much. I've yet to be bitten by a case where Dask tries to use fancy indexing on an object that doesn't support it. So it's probably worth just giving things a try and seeing if it works out or not. Happy to follow up if you run into issues.

@jakirkham (Member)

Same thing with da.store. I wouldn't worry about more sophisticated forms of indexing when saving data unless you are performing some more complex transform when writing out the data. If you are just doing a 1-to-1 mapping of Dask chunks to some slice in the stored array, then only basic slicing will be relevant.
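The 1-to-1 case can be sketched without Dask: each chunk is written to the target with a tuple of step-free slices, so the target only needs a `__setitem__` that handles basic slicing. `SliceOnlyTarget` and `chunk_slices` are hypothetical helpers for illustration, and the plain loop stands in for what `da.store` would schedule; it is not Dask's implementation.

```python
import itertools

import numpy as np


class SliceOnlyTarget:
    """Hypothetical writable store accepting only tuples of step-free slices."""

    def __init__(self, shape, dtype):
        # A NumPy array stands in for the real storage backend (assumption).
        self._data = np.zeros(shape, dtype=dtype)
        self.shape = tuple(shape)
        self.dtype = np.dtype(dtype)

    def __setitem__(self, key, value):
        ok = isinstance(key, tuple) and all(
            isinstance(s, slice) and s.step is None for s in key
        )
        if not ok:
            raise TypeError(f"unsupported index: {key!r}")
        self._data[key] = value

    def to_numpy(self):
        return self._data.copy()


def chunk_slices(shape, chunk_shape):
    """Return an iterator with one tuple of slices per chunk."""
    per_axis = [
        [slice(start, min(start + size, dim)) for start in range(0, dim, size)]
        for dim, size in zip(shape, chunk_shape)
    ]
    return itertools.product(*per_axis)


source = np.arange(20).reshape(4, 5)
target = SliceOnlyTarget(source.shape, source.dtype)
for key in chunk_slices(source.shape, (2, 3)):
    target[key] = source[key]  # each chunk maps to exactly one slice of the target
```

With Dask itself, the equivalent call would be `da.store(dask_array, target)`, with the target receiving one such assignment per chunk in the simple case described above.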

@jakirkham (Member)

Does that answer your questions @clbarnes?

@clbarnes (Author)

To an extent. I think it could be specified in the docs exactly what features are needed for each; I'll try to raise a PR to that effect when I have time. One of these days I'll write a test suite for array-likes to ascertain exactly which numpy features any given one supports...
