Skip to content

Add indices#2268

Merged
mrocklin merged 6 commits intodask:masterfrom
jakirkham:add_indices
Apr 29, 2017
Merged

Add indices#2268
mrocklin merged 6 commits intodask:masterfrom
jakirkham:add_indices

Conversation

@jakirkham
Copy link
Copy Markdown
Member

Fixes #2267

Adds an equivalent function to numpy.indices.

@jakirkham jakirkham mentioned this pull request Apr 28, 2017
@jakirkham jakirkham force-pushed the add_indices branch 4 times, most recently from 1903f8c to 36af208 Compare April 28, 2017 00:44
Behaves like NumPy's equivalently named function except with Dask Arrays
instead.
@jakirkham jakirkham force-pushed the add_indices branch 4 times, most recently from 1cbe508 to 115d49c Compare April 28, 2017 01:48
r = r[s]

for j in chain(range(i), range(i + 1, len(dimensions))):
r = r.repeat(dimensions[j], axis=j).rechunk({j: chunks[j + 1]})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repeated rechunking here might be expensive in terms of both graph creation and numpy copies.

Also a slight preference to use itertools.chain rather than chain

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, is it possible to get by with arange and slicing (as you've done above), and multiplying by da.ones(full_shape, chunks=full_chunks)? This might also handle the rechunking automatically. This multiplication is probably more expensive on the workers than is repeat, but I suspect that this isn't that significant and that it's cheaper when constructing the graph

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was initially worried as I saw some chunking by 1 in earlier code, but it seems repeat here actually results in chunks equal to the number of repeats. Once everything is stacked, it does appear to get the right chunking for everything except the first dimension. However, stack does need to be rechunked on the first dimension as that is chunked by 1.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should add it does seem to go a little faster by dropping the chunking after repeat based on some simple testing.

It appears that calling `repeat` here results in a chunk size that is
equal to the number of repetitions. As such these repeated dimensions
will be rechunked to the right chunk size automatically. Namely
rechunking will be done according to the chunk size of all of the
`arange` calls. However when it comes to `stack` rechunking is still
required to return the intended chunk size.
Use `chain` from the module instead of importing the function directly.
grid.append(r)

if grid:
grid = stack(grid).rechunk({0: chunks[0]})
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could actually have chunks just be 1 here. Then the user would only specify chunks for the rest of the array. That way there is no need to rechunk. Thoughts?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counterpoint would be this could be confusing to the user to not specify chunks of the same dimensionality of the expected shape.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we have k arrays of the same shape and we want to stack them along a new axis. The question is if we should chunk those arrays together or leave them as is? If so then leaving them as is seems fine to me.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. That simplifies some code as well. Have pushed a change.

@jakirkham
Copy link
Copy Markdown
Member Author

Have addressed your comments @mrocklin. Also have raised a question of my own above.

This seems to be passing except for one failure due to the test_interrupt issue ( #2192 ).

The `indices` function was chunking the 0th dimension based on the user
supplied `chunks` argument. Following some discussion it was decided
that chunking along this dimension should simply be skipped. After all
the user can easily do this themselves if they want. Plus this
simplifies a bit of code in the process. So now the chunks are only used
for the `arange` function calls. Thus it determines the 1st dimension
onwards. The 0th dimension keeps chunks of length 1.
return Array(dsk, name, chunks, dtype=dtype)


@wraps(np.indices)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically got this from copy-n-pasting something from dask.array.core. Is there any value in having this for Dask beyond getting the docstring? If not, was planning to replace the docstring given our parameters differ.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing with a custom docstring is preferred if you're willing.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, already have something ready to go. 😉

@jakirkham
Copy link
Copy Markdown
Member Author

Well I can't think of anything else needed on my end. Anything else you'd like to see @mrocklin or is this good to go?

@jakirkham
Copy link
Copy Markdown
Member Author

Hmm...not sure why the AppVeyor tests mysteriously stopped, but it seems unrelated Unfortunately no more information is provided by the log.

ref: https://ci.appveyor.com/project/daskdev/dask/build/1.0.1180#L467

@jakirkham
Copy link
Copy Markdown
Member Author

Repushed and it seems everything is green.

@mrocklin mrocklin merged commit ee3b1ed into dask:master Apr 29, 2017
@mrocklin
Copy link
Copy Markdown
Member

Thanks @jakirkham !

@jakirkham jakirkham deleted the add_indices branch April 29, 2017 18:30
@jakirkham
Copy link
Copy Markdown
Member Author

Thanks @mrocklin. Just realized I forgot to add this to the docs. Fixing that with PR ( #2275 ).

@sinhrks sinhrks added this to the 0.14.2 milestone May 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants