map_blocks with arrays of different sizes #10399

Open
tmillenaar opened this issue Jul 6, 2023 · 1 comment
Labels: needs triage (Needs a response from a contributor)

Comments


tmillenaar commented Jul 6, 2023

Describe the issue:
A function supplied to dask.array.map_blocks can be called with multiple Dask arrays as input.
This generally works when the input arrays have the same shape.
If, however, the input arrays have different shapes, unexpected behavior can arise:
the chunks and shape of the result are often not what you would expect.
Sometimes calling compute() on the result yields an incorrect array, and sometimes it fails with
*** IndexError: tuple index out of range

Minimal Complete Verifiable Example:
In this example I define a function whose second argument is unused and whose first argument is returned unmodified.
The expectation, therefore, is that the result is unaffected by the second array.

import numpy as np
import dask.array as da

def foo(arr1, arr2):
    return arr1

arr_3d = da.stack(3*[da.eye(2)])
arr_2d = da.stack(3*[da.arange(2)])

print("Shape of arr_3d:", arr_3d.shape)
print("Shape of arr_2d:", arr_2d.shape)

result = da.map_blocks(foo, arr_3d, arr_2d)
print("Shape of result:", result.shape)

This returns:

Shape of arr_3d: (3, 2, 2)
Shape of arr_2d: (3, 2)
Shape of result: (3, 3, 2)

Of course I would have expected the shape of the result to be (3, 2, 2), not (3, 3, 2).
Possibly even more telling than the shape are the chunks. If I run the code above but print the chunks instead of the shapes, I get:

Chunks of arr_3d: ((1, 1, 1), (2,), (2,))
Chunks of arr_2d: ((1, 1, 1), (2,))
Chunks of result: ((1, 1, 1), (1, 1, 1), (2,))
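
For completeness, a sketch of that chunk-printing variant (the same foo, arr_3d and arr_2d as in the MCVE above; only the print statements change):

print("Chunks of arr_3d:", arr_3d.chunks)
print("Chunks of arr_2d:", arr_2d.chunks)

result = da.map_blocks(foo, arr_3d, arr_2d)
print("Chunks of result:", result.chunks)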

If, instead of supplying arr_2d to the function, I supply any non-Dask object, the function works as expected. In this case I supplied None instead:

Chunks of arr_3d: ((1, 1, 1), (2,), (2,))
Chunks of result: ((1, 1, 1), (2,), (2,))

So the mere presence of the second array messes up the chunking and shape of the returned array.
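
For reference, a sketch of that None variant (same foo and arr_3d as above; the second Dask array is simply replaced by None in the call):

result = da.map_blocks(foo, arr_3d, None)

print("Chunks of arr_3d:", arr_3d.chunks)
print("Chunks of result:", result.chunks)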

Anything else we need to know?:

I think the problem might originate in the creation of argpairs here:

dask/dask/array/core.py, lines 802 to 805 at commit 461355e:

argpairs = [
    (a, tuple(range(a.ndim))[::-1]) if isinstance(a, Array) else (a, None)
    for a in args
]

For each array, argpairs stores the array together with a tuple of axis indices of length arr.ndim, but in reversed order.
In this example, the indices that go with the first array are (2, 1, 0), with corresponding chunks ((1, 1, 1), (2,), (2,)). For the second array we have the indices (1, 0), with corresponding chunks ((1, 1, 1), (2,)).
When calling the function I assumed that the chunks would be matched up in order. Instead, the (1, 1, 1) corresponding to index 1 of the second argument overwrites the (2,) corresponding to index 1 of the first argument.
The overwrite occurs here:

for arg, ind in arginds:
    for c, i in zip(arg.chunks, ind):
        if i not in chunkss or len(c) > len(chunkss[i]):
            chunkss[i] = c
I have not tried to fiddle with argpairs yet, since I am not very familiar with Dask's internals and I want to check here first whether this is even supported functionality.

Workaround
It does work if we create an extra (dummy) dimension on the second array so that it matches the number of axes of the first.
The following works as expected (a full sketch follows the chunk listing below):
arr_2d = da.expand_dims(arr_2d, 2)

Which gives the chunks:

Chunks of arr_3d: ((1, 1, 1), (2,), (2,))
Chunks of arr_2d: ((1, 1, 1), (2,), (1,))
Chunks of result: ((1, 1, 1), (2,), (2,))
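
Putting the workaround together, a minimal sketch of the full flow with the expand_dims call inserted before map_blocks:

import dask.array as da

def foo(arr1, arr2):
    return arr1

arr_3d = da.stack(3 * [da.eye(2)])     # shape (3, 2, 2)
arr_2d = da.stack(3 * [da.arange(2)])  # shape (3, 2)
arr_2d = da.expand_dims(arr_2d, 2)     # shape (3, 2, 1): dummy trailing axis

result = da.map_blocks(foo, arr_3d, arr_2d)
print("Shape of result:", result.shape)    # (3, 2, 2)
print("Chunks of result:", result.chunks)  # ((1, 1, 1), (2,), (2,))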

Expectation of solution:
I am unsure whether this feature is meant to be supported.
While I would love for it to be supported, another possible outcome of this issue could of course be a statement that map_blocks is not supported on arrays of different sizes.
In that case an explicit error would be beneficial.

Environment:

  • Dask version: 2023.3.2 and 2023.5.0
  • Python version: 3.8.10
  • Operating System: Ubuntu 20.04 LTS (focal)
  • Install method (conda, pip, source): pip
tmillenaar (Author) commented:

Update for others who may encounter this problem

On the Dask forum, user guillaumeeb was so kind as to point out that the inversion of the axes order is likely related to the NumPy broadcasting rules. See the forum post: https://dask.discourse.group/t/why-do-chunks-get-inverted/2131/3

While I still encounter problems when using map_blocks with arrays of differing shapes, I have since found that dask.array.blockwise can be used to address the problem outlined above. For example, the line da.blockwise(foo, "ijk", arr_3d, "ijk", arr_2d, "ij") does produce an array with the expected shape.
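
For completeness, a minimal sketch of that blockwise call with the same setup as in the MCVE above (dtype is left to be inferred here; it could also be passed explicitly via the dtype keyword):

import dask.array as da

def foo(arr1, arr2):
    return arr1

arr_3d = da.stack(3 * [da.eye(2)])     # shape (3, 2, 2)
arr_2d = da.stack(3 * [da.arange(2)])  # shape (3, 2)

# The index strings tell blockwise how the axes of the inputs line up.
result = da.blockwise(foo, "ijk", arr_3d, "ijk", arr_2d, "ij")
print("Shape of result:", result.shape)    # (3, 2, 2)
print("Chunks of result:", result.chunks)  # ((1, 1, 1), (2,), (2,))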
