Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rechunking arrays with empty chunks #5256

Merged
merged 1 commit into from Aug 12, 2019

Conversation

jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Aug 9, 2019

It looks like our current rechunk implementation can fail when there are empty chunks present. For example:

import dask.array as da

# Create array with empty chunks
x = da.zeros((7, 24), chunks=((7,), (10, 0, 0, 9, 0, 5)))
# Rechunk array
y = x.rechunk((-1, -1))
print(f'x.shape, x.compute().shape = {x.shape}, {x.compute().shape}')
print(f'y.shape, y.compute().shape = {y.shape}, {y.compute().shape}')

Rechunking shouldn't impact the output shape of y. However, running this example on the current master branch gives:

x.shape, x.compute().shape = (7, 24), (7, 24)
y.shape, y.compute().shape = (7, 24), (7, 10)

I was able to track down the problem to _intersect_1d (in dask/array/rechunk.py) not properly incrementing the chunk index for the "old" (pre-rechunking) chunking scheme in the case where empty chunks are present. This PR adds logic to ensure the chunk index is always incremented, even when there are empty chunks.

  • Tests added / passed
  • Passes black dask / flake8 dask

@jrbourbeau
Copy link
Member Author

jrbourbeau commented Aug 9, 2019

I found visualizing the task graph for y in the above example to be useful. With the changes in this PR, y includes all the non-empty chunks from x (i.e. the (0,0), (0,3), and (0,5) x chunks):

mydask

On master, rechunk isn't keeping track of the correct chunk index in x which correspond to non-empty chunks. In this case, y doesn't include the (0,0), (0,3), and (0,5) x chunks, but instead includes the first three chunks (two of which have zero size):

mydask

hence why the output shape of y.compute() on master is (7, 10) instead of (7, 24).

@TomAugspurger
Copy link
Member

TomAugspurger commented Aug 12, 2019

thanks @jrbourbeau

@TomAugspurger TomAugspurger merged commit 618f5df into dask:master Aug 12, 2019
2 checks passed
@jrbourbeau jrbourbeau deleted the rechunk-0-size branch Aug 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants