Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue warning when map_blocks() function with axis arguments conflicts with know dask array chunk structure #6810

Closed
mattchurgin opened this issue Nov 5, 2020 · 5 comments
Labels
array good first issue Clearly described and easy to accomplish. Good for beginners to the project.

Comments

@mattchurgin
Copy link

Hello, I recently ran into this issue and wanted to suggest issuing a warning when mapping a function onto a dask array when the mapped function arguments could yield unexpected/undesirable behavior in relation to the known chunk structure of the array. I provide an example below.

Minimal example:
I want to horizontally stack multiple 1-d dask arrays and argsort them along their columns.

import numpy as np
import dask
import dask.array as da

# column vectors
array1 = da.from_array(np.array([5, 9, 1, 0]).reshape((-1, 1)))
array2 = da.from_array(np.array([12, -9, 15, 0]).reshape((-1, 1)))
array3 = da.from_array(np.array([90, -3, 3, 16]).reshape((-1, 1)))

# horizontally stack
combined_array = da.hstack([array1, array2, array3])

# argsort
combined_array.map_blocks(np.argsort, axis=1).compute()

Unexpected/undesired output:

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

This code results in unexpected and undesirable output. The hstacked array remains chunked along the columns, causing the mapped argsort along axis=1 to return all zeros:

hstacked_array_chunks

I resolved the issue by rechunking the stacked array so that each row was part of the same chunk:

combined_array = combined_array.rechunk({1: combined_array.shape[1]}) # ensure row contents are part of the same chunk
combined_array.map_blocks(np.argsort, axis=1).compute()

Desired output:

array([[0, 1, 2],
       [1, 2, 0],
       [0, 2, 1],
       [0, 1, 2]])

Suggestion:
To help prevent unexpected and undesirable results, it may be worth alerting the user if the arguments to their mapping function (axis=1 in this case) conflict with the known chunk structure of their array. What do you think?

@jsignell
Copy link
Member

jsignell commented Nov 6, 2020

Thanks for the detailed write-up! Since this seems like a relatively common scenario, I think it's a great opportunity to raise a warning in advance. Especially since the code doesn't throw a warning, but just gives an unexpected result.

I am trying to think about whether this is a general problem with the axis key word thong. Do other numpy operations that use axis need to have access to the whole array along that axis? If the answer is yes, then the path is relatively straightforward, we'd want to check that the chunks of the axis in question is 1. If the answer is no, there might not be a good way to address this in the code, but you should still feel free to add a note about it in the docs.

Please feel free to open a pull request!

@mattchurgin
Copy link
Author

mattchurgin commented Nov 6, 2020

Hey @jsignell,

It looks like this issue may be general. I only tried one more function, np.sum, but found a similar result. Example below:

Create array whose chunk structure happens to run along columns (each column part of a different chunk):

array1 = da.from_array(np.array([5, 9, 1, 0]).reshape((-1, 1)))
array2 = da.from_array(np.array([12, -9, 15, 0]).reshape((-1, 1)))
array3 = da.from_array(np.array([90, -3, 3, 16]).reshape((-1, 1)))
combined_array = da.hstack([array1, array2, array3])

combined_array.map_blocks(np.sum, axis=1, keepdims=True).compute()

Undesired output when summing across columns, as each column is treated as independent due to chunk structure:

array([[ 5, 12, 90],
       [ 9, -9, -3],
       [ 1, 15,  3],
       [ 0,  0, 16]])

When summing across rows we get the expected output:

combined_array.map_blocks(np.sum, axis=0, keepdims=True).compute()

Output:

array([[ 15,  18, 106]])

Rechunking so that the chunk size aligns with the axis we are trying to reduce over:

combined_array = combined_array.rechunk({1: combined_array.shape[1]}) # ensure 1 chunk per row
combined_array.map_blocks(np.sum, axis=1, keepdims=True).compute()

Results in expected output:

array([[107],
       [ -3],
       [ 19],
       [ 16]])

I am happy to open a PR. Please let me know if you have any more thoughts. Thanks!
-Matt

@jsignell
Copy link
Member

jsignell commented Nov 9, 2020

Yeah that makes sense. It seems reasonable to me to raise a warning for these cases. Go for it! Just ping me if you run into issues.

@jsignell jsignell added array good first issue Clearly described and easy to accomplish. Good for beginners to the project. labels Nov 9, 2020
@mattchurgin mattchurgin mentioned this issue Nov 21, 2020
1 task
@Madhu94
Copy link
Contributor

Madhu94 commented Mar 19, 2021

I think this is good to close since in the linked PR, it was decided this warning may not be needed.

@jrbourbeau
Copy link
Member

Thanks for following up here @Madhu94

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
array good first issue Clearly described and easy to accomplish. Good for beginners to the project.
Projects
None yet
Development

No branches or pull requests

4 participants