Describe the issue:
A function that is supplied to dask.array.map_blocks can be called with multiple Dask arrays as input.
This generally works if the shape of the input arrays is the same.
If however the input arrays have a different shape, unexpected behavior can arise.
The chunks and shapes are often not as expected.
Sometimes calling compute() on the result yields an incorrect array, and sometimes it errors with *** IndexError: tuple index out of range.
Minimal Complete Verifiable Example:
In this example I define a function where the second argument is unused and the first argument is returned unmodified.
The expectation therefore is that the result is unaffected by the second array.
```python
import numpy as np
import dask.array as da

def foo(arr1, arr2):
    return arr1

arr_3d = da.stack(3 * [da.eye(2)])
arr_2d = da.stack(3 * [da.arange(2)])

print("Shape of arr_3d:", arr_3d.shape)
print("Shape of arr_2d:", arr_2d.shape)

result = da.map_blocks(foo, arr_3d, arr_2d)
print("Shape of result:", result.shape)
```
This returns:

```
Shape of arr_3d: (3, 2, 2)
Shape of arr_2d: (3, 2)
Shape of result: (3, 3, 2)
```
Of course I would have expected the shape of the result to be (3, 2, 2) and not (3, 3, 2).
What is possibly even more telling than the shape are the chunks. If I run the code above but print the chunks instead of the shape I get:

```
Chunks of arr_3d: ((1, 1, 1), (2,), (2,))
Chunks of arr_2d: ((1, 1, 1), (2,))
Chunks of result: ((1, 1, 1), (1, 1, 1), (2,))
```
If instead of supplying arr_2d to the function I supply any non-dask object, the function works as expected. In this case I supplied None instead:

```
Chunks of arr_3d: ((1, 1, 1), (2,), (2,))
Chunks of result: ((1, 1, 1), (2,), (2,))
```
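For completeness, the None case can be reproduced with a short runnable snippet (same foo as in the example above):

```python
import dask.array as da

def foo(arr1, arr2):
    # arr2 is unused; arr1 is returned unmodified
    return arr1

arr_3d = da.stack(3 * [da.eye(2)])

# Supplying a non-dask object (None) as the second argument:
# chunking and shape of the result are preserved.
result = da.map_blocks(foo, arr_3d, None)
print("Chunks of result:", result.chunks)  # ((1, 1, 1), (2,), (2,))
```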
So the mere presence of the second array will mess up the chunking and shape of the return array.
Anything else we need to know?:
I think the problem might be originating in the creation of argpairs here: dask/dask/array/core.py, lines 802 to 805 at 461355e.
The variable argpairs will have one index tuple per input array, with as many indices as the array has dimensions, but in reversed order.
In this example, the indices that go with the argpairs are (2, 1, 0) for the first array, whose corresponding chunks are ((1, 1, 1), (2,), (2,)). For the second array we have the indices (1, 0) with the corresponding chunks ((1, 1, 1), (2,)).
What I was assuming when calling the function was that the chunks would be matched in order. Instead, the (1, 1, 1) corresponding to index 1 of the second argument overwrites the (2,) corresponding to index 1 of the first argument.
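To make this concrete, here is a pure-Python sketch of the matching as I understand it (this is my mental model, not Dask's actual code); it reproduces the observed output chunks:

```python
# Chunks and (reversed) blockwise indices as observed in the example above.
chunks_3d = ((1, 1, 1), (2,), (2,))  # indices (2, 1, 0)
chunks_2d = ((1, 1, 1), (2,))        # indices (1, 0)

# Build a mapping from index to chunks; a later array silently
# overwrites the entries written by an earlier one.
dims = {}
for indices, chunks in [((2, 1, 0), chunks_3d), ((1, 0), chunks_2d)]:
    for ind, c in zip(indices, chunks):
        dims[ind] = c

# Index 1 now holds (1, 1, 1) from arr_2d instead of (2,) from arr_3d.
out_chunks = tuple(dims[i] for i in (2, 1, 0))
print(out_chunks)  # ((1, 1, 1), (1, 1, 1), (2,))
```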
The overwrite is occurring here: dask/dask/array/blockwise.py, lines 192 to 195 at 461355e.
I have not tried to fiddle with argpairs yet, as I am not very familiar with Dask's internals, and I want to check here first whether this is even supported functionality.
Workaround
It does work if we create an extra (dummy) dimension on the second array so it matches the number of axes of the first.
The following works as expected: arr_2d = da.expand_dims(arr_2d, 2)
Which gives the chunks:

```
Chunks of arr_3d: ((1, 1, 1), (2,), (2,))
Chunks of arr_2d: ((1, 1, 1), (2,), (1,))
Chunks of result: ((1, 1, 1), (2,), (2,))
```
Expectation of solution:
I am unsure if this feature is meant to be supported.
While I would love for it to be supported, another possible outcome of this issue could of course be the statement that map_blocks is not supported on arrays of different shapes.
In that case an explicit error could be beneficial.
Environment:
Dask version: 2023.3.2 and 2023.5.0
Python version: 3.8.10
Operating System: Ubuntu 20.04 LTS (focal)
Install method (conda, pip, source): pip
While I still encounter problems when using map_blocks with arrays of differing shapes, I have since found out that dask.array.blockwise can be used to address the problem outlined above. For example, the line da.blockwise(foo, "ijk", arr_3d, "ijk", arr_2d, "ij") does result in an array with the expected shape.
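For reference, here is the full blockwise version. Note that da.blockwise expects an explicit dtype; I pass arr_3d.dtype here, since foo returns its first argument unchanged:

```python
import numpy as np
import dask.array as da

def foo(arr1, arr2):
    # arr2 is unused; arr1 is returned unmodified
    return arr1

arr_3d = da.stack(3 * [da.eye(2)])
arr_2d = da.stack(3 * [da.arange(2)])

# Explicit per-array index labels tell blockwise how the axes line up.
result = da.blockwise(foo, "ijk", arr_3d, "ijk", arr_2d, "ij",
                      dtype=arr_3d.dtype)
print("Shape of result:", result.shape)  # (3, 2, 2)
```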