Skip to content

Flattening Dask Arrays without spreading out chunks #4855

@jakirkham

Description

@jakirkham

Currently functions like ravel and flatten follow the same traversal that one would expect when using NumPy. This makes sense for users coming from NumPy, who would expect this. Also there are some Dask Array functions that rely on this behavior to match their NumPy counterparts. However this means that ravel and flatten must spread out the chunks (rechunk the array), which can have a non-negligible performance cost associated.

That said, not all cases rely on matching the same order that NumPy would provide when flattening out an array. For cases like this, a nice alternative would be to flatten chunks themselves and merely stitch together the flattened chunks into a new 1-D array. This strategy would require no rechunking and would be embarrassingly parallel. Thus it would avoid the performance penalties that ravel and flatten have today.

Assuming this strategy is reasonable, there are a few ways we could go about implementing it.

  1. Allow some additional options to the order parameter to handle this need.
  2. Add a new parameter to toggle NumPy or chunk-based traversal strategies.
  3. Include a config option to enable this behavior for Dask Arrays more generally.
  4. Add a new function entirely for this behavior.
  5. ?

Thoughts on this?

Metadata

Metadata

Assignees

No one assigned

    Labels

    arraygood first issueClearly described and easy to accomplish. Good for beginners to the project.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions