
Investigate using da.pad in da.overlap #3641

Open

jakirkham opened this issue Jun 19, 2018 · 27 comments

@jakirkham
Member

jakirkham commented Jun 19, 2018

The da.overlap module was written to handle map_overlap and all of its intricacies. It provides some support for adding padding to edge chunks, but it only supports a few padding modes, not nearly as many as da.pad. Given that da.pad now exists, it makes sense to use it for padding in da.overlap, which would expose these same options to map_overlap. It should also help simplify some of the code in da.overlap.
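
For reference, a minimal sketch (assuming current da.pad behavior, which follows numpy.pad) of padding modes that da.pad exposes but map_overlap's boundary handling does not:

import dask.array as da

x = da.ones((8, 8), chunks=(4, 4))

# map_overlap's boundary handling covers roughly: 'reflect', 'periodic',
# 'nearest', 'none', or a constant. da.pad follows numpy.pad, so modes
# like these would become available as well:
ramped = da.pad(x, 2, mode='linear_ramp', end_values=0)
averaged = da.pad(x, 2, mode='mean')
mirrored = da.pad(x, 2, mode='symmetric')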

@jakirkham
Member Author

As a side note, map_overlap allows handling different boundaries with different modes of padding (e.g. reflection on one axis and a constant on another). I'm not aware of any use cases where this is needed. We may consider deprecating this functionality to make this transition simpler.
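
For context, a rough illustration of the per-axis behavior in question, using dict-valued depth and boundary with the Array.map_overlap method (exact accepted values and signatures may vary by dask version):

import dask.array as da

x = da.ones((8, 8), chunks=(4, 4))

# Different boundary treatment per axis: reflect along axis 0,
# pad with a constant (0) along axis 1.
y = x.map_overlap(
    lambda block: block,  # identity, just to exercise the boundary handling
    depth={0: 1, 1: 1},
    boundary={0: 'reflect', 1: 0},
)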

@jakirkham
Member Author

Any thoughts on this, @mrocklin?

@mrocklin
Member

I suspect that the modes used were built to support scikit-image, but I don't recall exactly.

@jakirkham
Member Author

Looking at da.map_overlap's boundary docs, these all seem to be supported by da.pad, so it should be possible to call da.pad in all of these cases. The custom user function case can be handled as well. That said, I agree it would be good to check.
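
My rough reading of how those boundary values would translate into da.pad arguments; this mapping is an assumption and is exactly the sort of thing worth checking (in particular whether dask's 'reflect' corresponds to numpy's 'reflect' or 'symmetric'):

# Assumed mapping from map_overlap boundary values to da.pad / numpy.pad arguments.
BOUNDARY_TO_PAD = {
    'periodic': dict(mode='wrap'),
    'nearest': dict(mode='edge'),
    'reflect': dict(mode='symmetric'),  # or mode='reflect'; needs checking
}
# A scalar boundary value c would map to dict(mode='constant', constant_values=c),
# and a user-supplied function could be passed as a callable mode to numpy.pad.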

@jakirkham
Member Author

@emmanuelle @jni, am thinking about using da.pad to handle boundary conditions within da.map_overlap. Should provide greater flexibility to the user. Would this be of interest for things like apply_parallel? Are there other issues or concerns that you can think of, which would need to be addressed?

@jni
Contributor

jni commented Jun 27, 2018

da.pad only applies to external boundaries, not the internal ones, right? So is this about not having to pad a NumPy array ahead of time?

@jakirkham
Member Author

da.pad only applies to external boundaries, not the internal ones, right?

Correct.

So is this about not having to pad a NumPy array ahead of time?

Sort of.

da.map_overlap provides some padding options, but they are limited. da.pad provides substantially more options (including those supported by map_overlap currently). So this is about finding the best way to make use of da.pad with da.map_overlap to provide these additional options to the user.

@jni
Contributor

jni commented Jun 27, 2018

Oh. Why does da.map_overlap not call out to da.pad?

@jakirkham
Member Author

Indeed, agree it should. Mainly because da.pad only recently got added. :)

We should also think about how we can pass arguments through da.map_overlap to da.pad.
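
One hypothetical shape this could take (nothing like this exists in dask today, and the parameter names below are made up): forward a pad mode and its keyword arguments to da.pad for the external boundaries, and let the existing machinery handle only the internal overlaps.

import dask.array as da

def map_overlap_with_pad(x, func, depth, pad_mode='reflect', pad_kwargs=None, **kwargs):
    # Hypothetical sketch: external boundaries via da.pad,
    # internal overlaps via the existing map_overlap machinery.
    pad_kwargs = pad_kwargs or {}
    padded = da.pad(x, depth, mode=pad_mode, **pad_kwargs)
    # How trimming at the outer edges should interact with boundary='none'
    # is part of what would need to be worked out here.
    return padded.map_overlap(func, depth=depth, boundary='none', **kwargs)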

@jakirkham
Member Author

As you have been looking into related issues, @hmaarrfk, would be interested to hear your thoughts on this. :)

@hmaarrfk
Contributor

so pad adds fancy ramps on the outside of the array. I don't see any problem with that.

The issue with map_overlap was that it was using concatenate to add virtual boundaries that would then get removed when the boundary was none. For a 2D array, this would result in 3 extra memory copies (vertical boundary, horizontal boundary, corner boundary).

I don't know how pad is implemented but whatever it does, it should ensure that a single memory copy is necessary no matter the dimensionality of the originating array. There might even be more ways to optimize higher-dimensionality arrays.

FYI: since dask is only interested in large arrays, I figured I would cross-ref this post.

Some speedup can be expected in numpy 1.15.2 if the future PR mentioned at the bottom of this issue gets merged in: numpy/numpy#11919
Basically, copying large arrays on Linux was limited by page faulting.

@hmaarrfk
Contributor

Here are a few interesting benchmarks. We consider the case where we want to pad a chunked array.
If we want to have a contiguous array in memory, we must incur the cost of at least 1 memory copy. Numpy gives us an easy way to express this idea:

i7 7700 HQ, Dual channel DDR4 2400MHz

import numpy as np
from dask import array as da

shape = (50, 1024, 1024)
chunks = (50, shape[1]//2, shape[2]//2)
dtype = 'uint8'
padding = 10

padded_chunks = (chunks[0] + padding*2, chunks[1] + padding * 2, chunks[2] + padding*2)
d1 = da.ones(shape=shape, chunks=chunks, dtype=dtype).persist()
n1 = np.ones(shape=shape, dtype=dtype)
%%timeit
constant_value = 0
n00 = np.empty(shape=padded_chunks, dtype=n1.dtype)
n01 = np.empty(shape=padded_chunks, dtype=n1.dtype)
n10 = np.empty(shape=padded_chunks, dtype=n1.dtype)
n11 = np.empty(shape=padded_chunks, dtype=n1.dtype)

# Copy each quadrant into the interior of its padded buffer (one copy each).
n00[padding:-padding, padding:-padding, padding:-padding] = n1[:, :n1.shape[1]//2, :n1.shape[2]//2]
n11[padding:-padding, padding:-padding, padding:-padding] = n1[:, n1.shape[1]//2:, n1.shape[2]//2:]
n10[padding:-padding, padding:-padding, padding:-padding] = n1[:, n1.shape[1]//2:, :n1.shape[2]//2]
n01[padding:-padding, padding:-padding, padding:-padding] = n1[:, :n1.shape[1]//2, n1.shape[2]//2:]

# Fill the pad regions with the constant value.
for this_n in [n00, n01, n10, n11]:
    this_n[:padding] = constant_value
    this_n[-padding:] = constant_value
    this_n[:, :padding] = constant_value
    this_n[:, -padding:] = constant_value
    this_n[:, :, :padding] = constant_value
    this_n[:, :, -padding:] = constant_value

12.1 ms ± 59.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dask doesn't support array assignments, which means we have to use the other primitives that are provided. Note that d1 is persisted, so we don't count the creation of the ones array. Also note that creating zeros isn't the same as initializing the memory to zero; the OS might trick you, and you would still incur page faults.

d_padded = da.pad(d1, padding, mode='constant', constant_values=0)
%timeit _ = d_padded.persist()

43.7 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

d_overlapped = da.overlap.overlap(d1, depth=padding, boundary='nearest')
%timeit _ = d_overlapped.persist()

98.9 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I've seen dask take 30 ms to start up in another benchmark, so I'm not sure what the culprit is.

I tried to write code that would effectively implement the numpy logic I wrote above in dask. Unfortunately, it was difficult without being able to assign to slices of dask arrays. Is this a feature that will be possible in the near future?

For tasks that finish in seconds, this kind of overhead is REALLY high. Is there a benchmarking suite for dask? It would be interesting to add these to it.

@mrocklin
Member

Unfortunately, it was difficult without being able to assign to slices of dask arrays. Is this a feature that will be possible in the near future?

It's certainly in-scope, but it would be challenging. Implementing numpy-style slicing on chunked datasets is hard and requires some attention to detail. We did this work for __getitem__. The same work could be done for __setitem__; it just requires someone who has the focus and time to get it done. https://github.com/dask/dask/blob/master/dask/array/slicing.py

@mrocklin
Member

Is there a benchmarking suite for dask? It would be interesting to add these to it.

There is, though it's not well maintained. No reason not to start now though :)

https://github.com/dask/dask-benchmarks

@hmaarrfk
Contributor

WRT slicing, I think one time you mentioned that dask arrays were immutable.

I think that mutable assignments might make a whole copy of the array? Kinda scary. That would require much thought.

I think you could do it without that ability; just expressing things might be different.

@mrocklin
Member

I think one time you mentioned that dask arrays were immutable.

this is not currently the case

I think that mutable assignments might make a whole copy of the array?

What does a "whole copy of the array" mean in this case? A dask array is a task graph that constructs an array. There isn't necessarily any data actively in memory when you copy it. It's just a plan.
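
To make that concrete, a small illustration (just a sketch) that operations on a dask array only extend the task graph, and nothing is materialized until compute/persist:

import dask.array as da

x = da.ones((1000, 1000), chunks=(500, 500))
y = x + 1  # no chunk data is allocated or copied here

# Both are just plans: mappings from task keys to tasks.
print(len(dict(x.__dask_graph__())), len(dict(y.__dask_graph__())))

print(y.compute().sum())  # data only materializes on compute()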

@jakirkham
Member Author

so pad adds fancy ramps on the outside of the array.

In some cases, yes. In other cases it may alter the original edge chunks. For example, look at pad_edge. It really depends on the kind of padding chosen.

I don't know how pad is implemented but whatever it does, it should ensure that a single memory copy is necessary no matter the dimensionality of the originating array.

For the cases where it alters an original edge chunk, it just calls numpy.pad with the appropriate arguments. In the other cases, where the original edge chunks are not altered, we perform some sort of global operation and attach the results onto the edges and corners of the array, so the original edge chunks are not modified in the padded array. In all cases, interior chunks are untouched.
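
A quick way to see which chunks are affected is to compare .chunks before and after padding. I'm only printing the chunk structure here rather than asserting exact values, since it can differ between padding modes and dask versions:

import dask.array as da

x = da.ones((20, 20), chunks=(10, 10))

# Depending on the mode, the pad region is either folded into the existing
# edge chunks or attached as new blocks around the outside; interior chunks
# stay the same either way.
print(x.chunks)
print(da.pad(x, 2, mode='edge').chunks)
print(da.pad(x, 2, mode='constant').chunks)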

@hmaarrfk
Contributor

@jakirkham I didn't know about the numpy pad function. According to this benchmark it should basically not be used. A 4x slowdown!

%%timeit
n00 = np.pad(n1[:, :chunks[1], :chunks[2]], pad_width=padding, mode='constant', constant_values=0)
n11 = np.pad(n1[:, chunks[1]:, chunks[2]:], pad_width=padding, mode='constant', constant_values=0)
n10 = np.pad(n1[:, chunks[1]:, :chunks[2]], pad_width=padding, mode='constant', constant_values=0)
n01 = np.pad(n1[:, :chunks[1], chunks[2]:], pad_width=padding, mode='constant', constant_values=0)

68 ms ± 787 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

FYI, I may have installed that hugepages patch, so your benchmarks may differ.

The challenge that I think dask users will see (well I definitely saw) is that every operation becomes an out of cache operation. As such, repeated memory copies are unacceptable since you will always hit memory bandwidth limits.

@hmaarrfk
Contributor

Effectively, for a 3D array, np.concatenate is called 3*ndims times to pad an array!!

https://github.com/numpy/numpy/blob/v1.15.1/numpy/lib/arraypad.py#L97

I'm worried that I might be missing something. The comment is very encouraging, so I'm not sure about the exact benchmark.

This commit changed the implementation from something that looks like the fast approach I showed, to a new concatenation algorithm:
numpy/numpy@246c06d#diff-4fcddc9140fbe901655b3e1bd484205cR1331

@jakirkham
Member Author

Think I'm missing some things in your benchmark. What are n1, chunks, and padding in this case? Also what are we comparing to?

@hmaarrfk
Contributor

You probably need the code from #3641 (comment),
and I'm comparing to my naive implementation that makes use of np.empty and slicing.

@jakirkham
Member Author

Need to think about this more, but comparing da.pad with a constant to da.overlap.constant is probably a more appropriate comparison for determining whether da.pad should be used in map_overlap.

If we think numpy.pad's performance is poor, it would be good to address that upstream.
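
A rough sketch of that comparison, reusing d1 and padding from the benchmark above. Note it isn't perfectly apples-to-apples: overlap also creates the internal ghost regions between neighboring chunks, so it does strictly more work than external padding alone.

# Constant padding of the external boundary only.
%timeit _ = da.pad(d1, padding, mode='constant', constant_values=0).persist()

# Constant external boundary via the overlap machinery
# (also adds internal overlaps between neighboring chunks).
%timeit _ = da.overlap.overlap(d1, depth=padding, boundary=0).persist()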

@hmaarrfk
Contributor

Here is the numpy issue that looks at this:

numpy/numpy#11126 (comment)

I've provided my benchmark there, which illustrates the problem in a much more succinct way for np.pad.

I would say that benchmarks need to be used to motivate whatever choice is made. After all, dask is all about large arrays.

  1. I would be hesitant to add all the numpy options because it would be "irreversible". Instead, you can call out to the pad function but keep the overlap functions minimal; that way you can roll back in 6 months if np.pad improves in performance (which it might, since I might be motivated to do it depending on how much time I have).
  2. Improve the documentation of the dask.overlap module to teach people how to break down their operations, so they can choose the implementation that is best for their use case.

@hmaarrfk
Contributor

So here is a PR that should make numpy.pad much faster, especially for dask users.

numpy/numpy#11358

Benchmarking dask with this PR included might be super important if we go that route.

@jakirkham
Member Author

Thanks for working with upstream on this @hmaarrfk. It’s greatly appreciated 🙂

@jakirkham
Member Author

Given it's taking a bit to resolve the numpy.pad issue, took a look at da.pad again to see if we could eliminate numpy.pad in the cases it was used. Think PR ( #4152 ) addresses this concern. Also has the benefit of making the padding behavior a bit more consistent between padding modes.

@jakirkham changed the title from "Investigate using da.pad in da.ghost" to "Investigate using da.pad in da.overlap" on Nov 27, 2018
@jakirkham
Member Author

jakirkham commented Jun 20, 2019

If we are comfortable with a breaking change here, we could make another 1.x release, which warns users that boundaries will no longer be handled by da.map_overlap and that they should call da.pad before calling da.map_overlap to get this functionality. As part of 2.0, we could then remove these options.

Edit: Also worth pointing to issue ( #4252 ), which suggests this.
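
Concretely, user code after such a deprecation would look roughly like the following (the interaction between boundary='none' and trimming at the outer edges is still the open question here):

import dask.array as da

x = da.random.random((100, 100), chunks=(50, 50))
depth = 2

# Explicit external padding first ...
padded = da.pad(x, depth, mode='reflect')
# ... then map_overlap handles only the internal overlaps.
result = padded.map_overlap(lambda b: b, depth=depth, boundary='none')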
