Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rechunking arrays with empty chunks #5256

Merged
merged 1 commit into from Aug 12, 2019

Conversation

@jrbourbeau
Copy link
Member

commented Aug 9, 2019

It looks like our current rechunk implementation can fail when there are empty chunks present. For example:

import dask.array as da

# Create array with empty chunks
x = da.zeros((7, 24), chunks=((7,), (10, 0, 0, 9, 0, 5)))
# Rechunk array
y = x.rechunk((-1, -1))
print(f'x.shape, x.compute().shape = {x.shape}, {x.compute().shape}')
print(f'y.shape, y.compute().shape = {y.shape}, {y.compute().shape}')

Rechunking shouldn't impact the output shape of y. However, running this example on the current master branch gives:

x.shape, x.compute().shape = (7, 24), (7, 24)
y.shape, y.compute().shape = (7, 24), (7, 10)

I was able to track down the problem to _intersect_1d (in dask/array/rechunk.py) not properly incrementing the chunk index for the "old" (pre-rechunking) chunking scheme in the case where empty chunks are present. This PR adds logic to ensure the chunk index is always incremented, even when there are empty chunks.

  • Tests added / passed
  • Passes black dask / flake8 dask
@jrbourbeau

This comment has been minimized.

Copy link
Member Author

commented Aug 9, 2019

I found visualizing the task graph for y in the above example to be useful. With the changes in this PR, y includes all the non-empty chunks from x (i.e. the (0,0), (0,3), and (0,5) x chunks):

mydask

On master, rechunk isn't keeping track of the correct chunk index in x which correspond to non-empty chunks. In this case, y doesn't include the (0,0), (0,3), and (0,5) x chunks, but instead includes the first three chunks (two of which have zero size):

mydask

hence why the output shape of y.compute() on master is (7, 10) instead of (7, 24).

@TomAugspurger TomAugspurger added the array label Aug 12, 2019

@TomAugspurger

This comment has been minimized.

Copy link
Member

commented Aug 12, 2019

thanks @jrbourbeau

@TomAugspurger TomAugspurger merged commit 618f5df into dask:master Aug 12, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@jrbourbeau jrbourbeau deleted the Quansight-Labs:rechunk-0-size branch Aug 12, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.