# da.map_blocks with drop_axis stopped working in dask 1.1.4 (#4584)
Thanks for the report, confirmed locally. These are a tad tricky to debug. I didn't come up with anything concrete in 15 minutes or so. Will give it another shot later.
Hi Tom! I was wondering if you already had a chance to further investigate the root cause of the problem? Thank you!
@jcrist @jakirkham if either of you have time, can I ask you to look into this?
Sure, the next step would be to run
I took a brief look at this. This is one of the unfortunate collisions in the complex logic of all of the keyword arguments within `map_blocks`. @pantaray I have a couple of suggestions:

Whoops, misfire. A couple of suggestions:
(Lines 588 to 589 at commit d8a4f3b)
To be clear, I'm definitely acknowledging that this is a bug and that it should be fixed. I'm also being somewhat honest here that, given the complexity and edge-case-ness of this, it's unlikely to be at the top of any of the maintainers' queues any time soon. Honestly, we probably shouldn't be supporting all of the keyword arguments in `map_blocks`.

Hope this helps somewhat. Sorry for the lack of support.
@mrocklin Thank you for the explanation! Just to clarify: our base problem is that we have a dask array holding about 100 GB of data scattered across multiple workers in distributed memory. How can we save that array to disk? That is the only reason we're using this particular `map_blocks` construction in the first place.

Thank you again for your time and effort!
Zarr works reasonably well for parallel writes. I'm not unbiased though, as I do some work on it. That said, I came to it as a Dask user interested in addressing this same problem.
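For context, a minimal sketch of the Zarr route using `dask.array.to_zarr`; the array and the store path `output.zarr` are placeholders, not taken from the thread:

```python
import dask.array as da

# A stand-in for the large array described above; in the real use case
# this would be the ~100 GB array living in distributed worker memory.
a = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))

# Each block is written to the store in parallel by the workers.
da.to_zarr(a, "output.zarr")
```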
Sorry for the delay, we've spent the last few days testing various options for tackling this. @jakirkham: thank you for the suggestion! We ran a bunch of performance tests on our cluster using `zarr`. In our tests

Thank you again for your help!
Certainly virtual HDF5 datasets are an interesting solution to explore here. I wonder if we should include an example in the docs. Thoughts from others here?
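To make that idea concrete, here is a sketch of stitching per-worker HDF5 files together with h5py's virtual dataset API; the file names, dataset name, and shapes are hypothetical:

```python
import h5py

# Hypothetical setup: each worker has written its slice of the result
# to its own file, part0.h5 ... part3.h5, under the dataset "data".
n_parts, rows_per_part, n_cols = 4, 25, 1000

layout = h5py.VirtualLayout(shape=(n_parts * rows_per_part, n_cols), dtype="f8")
for i in range(n_parts):
    source = h5py.VirtualSource(f"part{i}.h5", "data",
                                shape=(rows_per_part, n_cols))
    layout[i * rows_per_part:(i + 1) * rows_per_part] = source

# The virtual dataset presents the per-worker files as one contiguous
# array without copying any data.
with h5py.File("combined.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("data", layout, fillvalue=0)
```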
Have run

Should add that I don't see this error on

```
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])
```
When working on #4712 I do recall looking at the code and thinking that it didn't look like it would handle `drop_axis`.
This should address dask#4584, but it will still need unit tests. The amount of code is actually reduced, because I deleted some code that should never be used: it was trying to cater for arrays in kwargs, but the API says the kwargs must not be arrays. Instead of trying to track axes that appear and disappear, the axis labels that are already computed (for passing to blockwise) are leveraged to simplify the indexing logic. The current block location is represented as a dictionary from axis label to block index, and then the location for the current input array is looked up through that dictionary. This may have a minor performance impact because there is now an extra per-chunk dictionary being thrown around.
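A rough sketch of the indexing idea described above, with made-up axis labels; this is illustrative only, not the patched dask code:

```python
# blockwise assigns a label to each output axis, and the current
# block's location becomes a dict from axis label to block index.
out_axes = ("i", "j")
block_id = (3, 0)                            # output block being built
location = dict(zip(out_axes, block_id))     # {"i": 3, "j": 0}

# An input whose axis "j" was dropped looks up only the labels it still
# has, so appearing/disappearing axes need no special-case bookkeeping.
input_axes = ("i",)
input_location = tuple(location[ax] for ax in input_axes)   # (3,)
```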
@pantaray Give the commit above a try. It seems to work on your minimal example, but that's no guarantee that it'll work in all cases, since the example only accesses one element from `block_info`. I still need to write unit tests for the interaction between various features (`drop_axis` with concatenation, `new_axis`, and broadcasting) before I submit a pull request.
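For readers following along, a minimal example of the `block_info` access pattern under discussion; it is illustrative only, not the reporter's code:

```python
import dask.array as da

def show_location(block, block_info=None):
    # block_info maps each input's position to its metadata;
    # "chunk-location" is the block's index along every axis.
    loc = block_info[0]["chunk-location"]
    print("processing block at", loc)
    return block

x = da.ones((4, 4), chunks=(2, 2))
x.map_blocks(show_location, dtype=x.dtype).compute()
```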
@pantaray, if you haven't already, could you please retry with the current `master`?
@bmerry @jakirkham Thanks for looking into this! With the current master, the `da_dummy2` workaround from the original report still raises:

```
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~/Downloads/dask_fail.py in <module>
     24     return (idx, 1)
     25 result = a.map_blocks(da_dummy2, dtype="int", chunks=(1, 1))
---> 26 res = result.compute()
     27

~/Downloads/dask/dask/base.py in compute(self, **kwargs)
    154     dask.base.compute
    155     """
--> 156     (result,) = compute(self, traverse=False, **kwargs)
    157     return result
    158

~/Downloads/dask/dask/base.py in compute(*args, **kwargs)
    397     postcomputes = [x.__dask_postcompute__() for x in collections]
    398     results = schedule(dsk, keys, **kwargs)
--> 399     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    400
    401

~/Downloads/dask/dask/base.py in <listcomp>(.0)
    397     postcomputes = [x.__dask_postcompute__() for x in collections]
    398     results = schedule(dsk, keys, **kwargs)
--> 399     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    400
    401

~/Downloads/dask/dask/array/core.py in finalize(results)
    824     while isinstance(results2, (tuple, list)):
    825         if len(results2) > 1:
--> 826             return concatenate3(results)
    827         else:
    828             results2 = results2[0]

~/Downloads/dask/dask/array/core.py in concatenate3(arrays)
   3673     if not ndim:
   3674         return arrays
-> 3675     chunks = chunks_from_arrays(arrays)
   3676     shape = tuple(map(sum, chunks))
   3677

~/Downloads/dask/dask/array/core.py in chunks_from_arrays(arrays)
   3503
   3504     while isinstance(arrays, (list, tuple)):
-> 3505         result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
   3506         arrays = arrays[0]
   3507         dim += 1

~/Downloads/dask/dask/array/core.py in <listcomp>(.0)
   3503
   3504     while isinstance(arrays, (list, tuple)):
-> 3505         result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
   3506         arrays = arrays[0]
   3507         dim += 1

IndexError: tuple index out of range
```
As far as I know the
Glad to hear this fixed the issue! 🎉 Thanks for checking @pantaray, and thanks for the fix @bmerry. 😄 That's interesting that
@bmerry @jakirkham Thanks for the quick feedback! My bad, you're of course correct:
That's probably because you're returning an array with shape (2,), whereas your function should return an array with shape (1, 1) to match the `chunks=(1, 1)` passed to `map_blocks`.
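A sketch of what that shape fix could look like; the array shape and the body of `da_dummy2` here are guesses to exercise the pattern, not the reporter's actual code:

```python
import numpy as np
import dask.array as da

def da_dummy2(block, block_info=None):
    # Return a (1, 1) array so each output block matches chunks=(1, 1),
    # instead of a shape-(2,) result.
    idx = block_info[0]["chunk-location"][0]
    return np.array([[idx]])

# Guessed shape: 50 row-chunks, matching the 50-element result above.
a = da.zeros((50, 100), chunks=(1, 100))
res = a.map_blocks(da_dummy2, dtype="int", chunks=(1, 1)).compute()
```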
@bmerry good catch. With
Glad to see we got this issue sorted 😄
Hi!

First and foremost, a big thank you for the amazing work that you're doing with dask! It's a fantastic piece of software!

We're using dask as the parallelization engine for a large-scale data analysis package (still in early development). Our computational routines work on independent chunks of one big source data set and then write to a single file (accessible to all workers). We used `map_blocks` for this in dask 1.0.0 and things worked great, but unfortunately the same code no longer works in dask 1.1.4 (a sketch of the pattern is shown below). In dask 1.0.0 the snippet produces an array `res` with one entry per chunk, but in dask 1.1.4 the same code triggers an `IndexError: index 2 is out of bounds for axis 0 with size 2`.

A (desperate) attempt at somehow working around this was to not use `drop_axis`, instead leaving the chunk size as is and returning a bogus tuple, but no luck either: that gives an `IndexError: tuple index out of range` (the full traceback appears earlier in the thread).

Any help with this is greatly appreciated. Thank you!
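A minimal sketch of the pattern described above, with guessed shapes and a stand-in `da_dummy`; this is a reconstruction, not the original MCVE:

```python
import numpy as np
import dask.array as da

def da_dummy(block, block_info=None):
    # In the real code each worker would write `block` to the shared
    # file here; a single status value is returned per chunk.
    return np.array([0])

# Shapes and chunking are guesses consistent with the 50-element
# result quoted earlier in the thread.
a = da.zeros((50, 100), chunks=(1, 100))
res = a.map_blocks(da_dummy, dtype="int", chunks=(1,), drop_axis=1).compute()
```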