Feature/dask.array.apply_gufunc (Issue #702)#3109
Conversation
|
Awesome! I'm also looking forward to reading through this when I'm back home on Monday. |
Perhaps we should call |
mrocklin
left a comment
There was a problem hiding this comment.
Some initital comments, mostly to help educate myself.
dask/array/gufunc.py
Outdated
| _outs = _parse_args(_out) | ||
| if _outs and (len(_outs) == 1) and ('),' not in _out): | ||
| _outs = _outs[0] | ||
| return _ins, _outs |
There was a problem hiding this comment.
It would ease reviewing and reading by future devs if these functions had relatively simple docstrings with a couple examples of their inputs/outputs
def parse_signature(signature):
""""
Parse gufunc signature into ...
Examples
--------
>>> parse_signature("...")
...
"""| set ``vectorize=True``). | ||
| signature: String | ||
| Specifies what core dimensions are consumed and produced by ``func``. | ||
| According to the specification of numpy.gufunc signature [2]_ |
There was a problem hiding this comment.
Quick question, do numpy gufuncs have signatures attached to them? If so would it make sense to get the signature from the callable?
dask/array/gufunc.py
Outdated
| ... return np.mean(x, axis=-1), np.std(x, axis=-1) | ||
| ... a = da.random.normal(size=(10,20,30), chunks=5) | ||
| ... mean, std = da.apply_gufunc(stats, "(i)->(),()", a, output_dtypes=2*(a.dtype,)) | ||
| ... mean.compute() # doctest: +SKIP |
There was a problem hiding this comment.
Style note, I recommend using >>> on each individual line.
There was a problem hiding this comment.
Except the body of the stats function, of course
dask/array/gufunc.py
Outdated
| vectorize = kwargs.pop("vectorize", None) | ||
| concatenate = kwargs.pop("concatenate", True) | ||
| if output_dtypes is None: | ||
| raise ValueError("Must specify `output_dtypes` of output array(s)") |
There was a problem hiding this comment.
Typically we run the function on a small or empty array to see if that can produce the dtype for us. See dask/array/core.py::apply_infer_dtype
dask/array/gufunc.py
Outdated
| if len(output_dtypes) != nout: | ||
| raise ValueError("Must specify tuple of %d dtypes for `output_dtypes` for function with multiple outputs" % nout) | ||
| if nout is None and (isinstance(output_dtypes, tuple) or isinstance(output_dtypes, list)): | ||
| raise ValueError("Must specify single dtype for `output_dtypes` for function with one output") |
There was a problem hiding this comment.
Just checking, but do we test for this behavior? (I haven't gotten down to that yet)
| tmp = atop(ufunc, loop_output_dims, *arginds, | ||
| dtype=int, # Only dummy dtype, anyone will do | ||
| concatenate=concatenate, | ||
| **kwargs) |
There was a problem hiding this comment.
Given that we don't really want an array output, would it make sense to drop down to top here?
I'm also open to teaching atop how to handle multiple outputs. (I'll write a longer note about this in the main comments)
There was a problem hiding this comment.
I also needed unify_chunks being called, so that I basically use 80% of atop code here. After below's discussion potential refactor can be continued.
dask/array/gufunc.py
Outdated
| return leaf_arrs if nout else leaf_arrs[0] | ||
|
|
||
|
|
||
| @curry |
There was a problem hiding this comment.
Using curry can hide odd sorts of errors. If we know how this is likely to be used then it might be cleaner to make a function that returns another function explicitly.
There was a problem hiding this comment.
I have removed it for now. Check for correctness of passed **kwargs is still postponed until apply_gufunc is called, i.e. very late. Depending on the other discussion about dtype inference, we can finalize the behavior here eventually
dask/array/gufunc.py
Outdated
| ... return np.mean(x, axis=-1), np.std(x, axis=-1) | ||
| ... a = da.random.normal(size=(10,20,30), chunks=5) | ||
| ... mean, std = stats(a) | ||
| ... mean.compute() # doctest: +SKIP |
There was a problem hiding this comment.
Ideally we would define this thing as a numpy gufunc and implement some protocol in dask to handle those when applied to dask arrays. I suspect that this isn't doable today. I'll write more about this in a longer comment.
|
Regarding atop with multiple arguments I'm all in favor of doing this if it's easy to do and doesn't impact the easy growth of atop in the future. My guess is that this will be harder to do in general due to all of the other features that have been added to |
|
The combination of gufuncs and dask.array is a powerful one. In my ideal world we wouldn't define these as @dask.array.gufunc("(i)->(),()", output_dtypes=2*(a.dtype,))
def stats(x):
return np.mean(x, axis=-1), np.std(x, axis=-1)... but rather as numpy.gufuncs. @numpy.gufunc("(i)->(),()", output_dtypes=2*(a.dtype,))
def stats(x):
return np.mean(x, axis=-1), np.std(x, axis=-1)Then we would implement some sort of This would also be a nice touchpoint for other projects that are able to create gufuncs, like numba. cc @seibert |
|
It should be totally doable today to make Dask array directly support NumPy gufuncs. Gufuncs call |
|
Thanks for your comments.
In it's current implementation,
That means, we could automatically map numpy gufuncs. But how about plain python functions, which shall be used as gufunc as shown in this PRs use cases and examples. I.e. in Extending |
shoyer
left a comment
There was a problem hiding this comment.
A few comments on the signature parsing part of the code.
dask/array/gufunc.py
Outdated
| return list(r for r in ret if r is not None) | ||
|
|
||
|
|
||
| def parse_signature(signature): |
There was a problem hiding this comment.
You might just borrow the version of this function that I wrote for numpy.vectorize:
https://github.com/numpy/numpy/blob/99019107cbde06294a356fd7fc859f9b31f87410/numpy/lib/function_base.py#L1638-L1666
dask/array/gufunc.py
Outdated
| >>> parse_signature('(i,j)->(i),(j)') | ||
| ([('i', 'j')], [('i',), ('j',)]) | ||
|
|
||
| >>> parse_signature('->(i),') |
There was a problem hiding this comment.
I don't think this should be a valid gufunc signature. The spec doesn't allow trailing commas.
dask/array/gufunc.py
Outdated
| _ins = [i for i in _ins if i is not None] | ||
| _outs = _parse_args(_out) | ||
| if _outs and (len(_outs) == 1) and ('),' not in _out): | ||
| _outs = _outs[0] |
There was a problem hiding this comment.
These sort of special cases (return a scalar for length 1 output) are usually best avoided. They end up requiring work arounds in code that uses it.
There was a problem hiding this comment.
Thank you for pointing out. It confuses me though: Return arguments in Python can be either a scalar or tuples, even if only length 1. Python function return statements handle this consistently, i.e.
def foo():
return 1,
a, = foo() I don’t understand, why numpy.gufunc signature forbids length 1 return tuples. It doesn’t limit the normal usage of returning scalars.
Shouldn't functions be implemented that they behave consistently even for corner cases? E.g. also the signature "->()" or "->(i)" are currently not valid gufunc signatures. While it seems funny to wrap a non-gufunc without any arguments as gufunc, it should lead to a consistent workflow in the end.
I understand that for now, this should be implemented according to current numpy definition of the signature.
What is your stand on this? Should the signature definition in numpy be extended to cover mentioned corner cases?
There was a problem hiding this comment.
It might be worth bringing this up as a NumPy issue
dask/array/gufunc.py
Outdated
| _flag = len(outs) == 1 | ||
| outs = ",".join(("({0})".format(",".join(dims)) for dims in outs)) | ||
| if _flag: | ||
| outs += "," |
There was a problem hiding this comment.
I'm not quite sure why you need this?
dask/array/gufunc.py
Outdated
| outs = ",".join(("({0})".format(",".join(dims)) for dims in outs)) | ||
| if _flag: | ||
| outs += "," | ||
| else: |
There was a problem hiding this comment.
I think this clause could be dropped if you only support the versions always takes multiple outputs.
I agree, handling non-gufuncs that work like gufuncs is an important use case. It's a pain to write actual gufuncs (they need to be in C and have limited features), so even NumPy has some non-gufunc gufunc-likes like |
I also agree. It would be good to coordinate with Numpy devs so that we come to an interface that they might also be happy with in the future. It would be nice at some point to implement this as follows: if numpy.__version__ >= ...:
gufunc = numpy.gufunc
else:
def gufunc(...) |
|
Checking on the status here. This has been dormant for a few days. Is there anything that's blocking you @magonser ? No time pressure of course, I just want to make sure that you're able to move smoothly if/when you have time. |
|
Alive beat! Sorry for not getting back fast. Two reasons: this is in my spare time (so either way, this will take some time) and I am still wrapping my head around the necessary actions, and My current understanding is
|
|
Checking in again on this. No time pressure, I just want to track things. Are you blocking on anything from other Dask maintainers? Numpy maintainers? |
|
With respect to the above mentioned two numpy issues there haven’t been too many discussions. Especially aligning API has not received any comments. I don’t think it’s blocking anything, but I’ll just be needing more time. Is it better for you to reopen this PR later? |
|
It looks like there are some failures, perhaps based on numpy versions: �[1m if vectorize:�[0m
�[1m> func = np.vectorize(func, signature=signature, otypes=otypes)�[0m
�[1m�[31mE TypeError: __init__() got an unexpected keyword argument 'signature'�[0m |
…ize(..., signature=...)` support
|
@shoyer any further comments here? |
|
None from my side. Let me know, if I need to change more things. At this point: I thank you for your patience with this PR and your time for the review and discussions - I appreciate it. |
|
I'm generally pretty happy with this. Merging tomorrow morning US Eastern time if there are no further comments. |
|
@teoliphant I suspect that you might enjoy this work. |
dask/array/gufunc.py
Outdated
| # Input processing: | ||
| ## Signature | ||
| if not isinstance(signature, str): | ||
| raise ValueError('`signature` has to be of type string') |
| for arg in re.findall(_ARGUMENT, in_txt)] | ||
| outs = [tuple(re.findall(_DIMENSION_NAME, arg)) | ||
| for arg in re.findall(_ARGUMENT, out_txt)] | ||
| outs = outs[0] if ((len(outs) == 1) and (out_txt[-1] != ',')) else outs |
There was a problem hiding this comment.
Why does this have a different return type for a single output? I think that makes your logic below unnecessarily complicated.
| core_input_dimss, core_output_dimss = _parse_gufunc_signature(signature) | ||
|
|
||
| ## Determine nout: nout = None for functions of one direct return; nout = int for return tuples | ||
| nout = None if not isinstance(core_output_dimss, list) else len(core_output_dimss) |
There was a problem hiding this comment.
Why use nout=None instead of the simpler nout=1?
| else: | ||
| if nout is not None: | ||
| raise ValueError("Must specify tuple of dtypes for `output_dtypes` for function with multiple outputs") | ||
| otypes = [output_dtypes] |
There was a problem hiding this comment.
I don't quite understand why all this logic is necessary. Are you trying to be more flexible in allowing a single output_dtype to be provided if there is only one output?
| ### *) Treat direct output | ||
| if nout is None: | ||
| core_output_dimss = [core_output_dimss] | ||
| output_dtypes = [output_dtypes] |
There was a problem hiding this comment.
If you avoided unpacking core_output_dimss and output_dtypes above, I don't think you would need this step.
| def test_Array_numpy_gufunc_call__array_ufunc__01(): | ||
| x = da.random.normal(size=(3, 10, 10), chunks=(2, 10, 10)) | ||
| nx = x.compute() | ||
| ny = np.linalg._umath_linalg.inv(nx) |
There was a problem hiding this comment.
Can you use the public np.linalg.inv with apply_gufunc? It's not a good idea to rely on functions in private modules, because those could change in any release without warning.
I guess we could stick with this (since you document these methods in the docs, and numpy doesn't expose any gufuncs in its public API currently), but it makes me a little nervous. Maybe it would be better to make the skipif condition hasattr(hasattr(np.linalg, '_umath_linalg'), 'inv')`.
There was a problem hiding this comment.
Will do tomorrow
- Make _umath_linalg tests optional
There was a problem hiding this comment.
The problem seems, that the private functions exist already, but do not receive optional kwargs
Using numpy=1.11:
____________________________________ test_Array_numpy_gufunc_call__array_ufunc__01 _____________________________________
@pytest.mark.skipif(hasattr(hasattr(np.linalg, '_umath_linalg'), 'inv'),
reason="NumPy doesn't have `np.linalg._umath_linalg.inv`")
def test_Array_numpy_gufunc_call__array_ufunc__01():
x = da.random.normal(size=(3, 10, 10), chunks=(2, 10, 10))
nx = x.compute()
ny = np.linalg._umath_linalg.inv(nx)
> y = np.linalg._umath_linalg.inv(x, output_dtypes=float)
E TypeError: 'output_dtypes' is an invalid keyword to ufunc 'inv'
dask/array/tests/test_array_core.py:226: TypeError
=============================================== 1 failed in 0.76 seconds ===============================================
There was a problem hiding this comment.
I've added an pytest.mark.xfail to both relevant tests
| # def test_apply_gufunc_one_scalar_output(): | ||
| # def foo(): | ||
| # return 1, | ||
| # x, = apply_gufunc(foo, "->(),", output_dtypes=(int,)) |
There was a problem hiding this comment.
OK, now I understand why you special cased one output. Remind me, did you open an upstream issue in NumPy to discuss this?
My guess is that even though this is arguably poor API (special casing one vs. many return values) there will be little appetite for changing/extending it in NumPy itself.
There was a problem hiding this comment.
Yes, the upstream issue is numpy/numpy#10527
I slowly get to understand that I might need to convince many people for this to change upstream and that the order should be to implement against current API and change later if upstream changes.
Of course all your other code comments vanish, if I'd roll back to the current signature API.
As mentioned before I'd like to be transparent about the current implementation, which means also (focus on output_dtypes):
def foo(x):
return x
a = np.array([1, 2, 3])
apply_gufunc(foo, "()->()", a, output_dtypes=int) # Not: output_dtypes=[int]What do you recommend/favor? Keep or roll back to current signature API?
There was a problem hiding this comment.
I do think it would be slightly better not to try to extend the gufunc signature here (before we have concensus that it makes sense for numpy). There aren't a lot of use-cases for gufuncs returning a tuple of length 1 (beyond consistency) or for accepting no input arguments, and consistency with numpy is virtuous.
That said, given that you already wrote this, I don't think there's a lot of harm in leaving it in as unsupported features...
There was a problem hiding this comment.
Thanks @shoyer.
Ok - so we will keep these parts as they are now.
|
Checking in. What is the status here? Any further concerns @shoyer? If not then would you like to do the honors and merge? |
|
thanks @magonser! I'm really excited about this. |
|
Huzzah!
…On Fri, May 18, 2018 at 2:35 PM, Stephan Hoyer ***@***.***> wrote:
thanks @magonser <https://github.com/magonser>!
I'm really excited about this.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3109 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AASszM2CT8NUb5LlMe7PkFVSxTJFbjpUks5tzxRygaJpZM4RvYaX>
.
|
|
Thank you for doing this @magonser and sticking with it. I'm very happy about this change |
|
Awesome! Thanks also again for your patience! Let's see what will come about it, once people start experimenting with it... |
|
Already done: numba/numba#2979
…On Fri, May 18, 2018 at 3:29 PM, Markus Gonser ***@***.***> wrote:
Awesome! Thanks also again for your patience!
Let's see what will come about it, once people start experimenting with
it...
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3109 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AASszEya-e24E5iEMjYZ59zf1922EciOks5tzyEdgaJpZM4RvYaX>
.
|
|
Thanks for staying with the @magonser. Appreciate that pursuing a change like this is quite hard and it's really great that you have stuck with this and we were able to get in. Looking forward to playing with this in the new Dask version. |
This is an attempt to solve #702. I think the PR is not quite finished yet:
Done:
apply_gufuncand provide a variationgufuncflake8 daskdocs/source/changelog.rstfor all changesand one of the
docs/source/*-api.rstfiles for new APITodo:
apply_gufuncorgufuncNotes:
This implementation basically uses
atop. I found that the following modifications couldimprove this implementation:
toporunify_chunks:topsuch that specific dimensions could be passed to argumentconcatenate. This would be needed to make core-dims to have only one chunk, even for core-dims, which are present in the gufuncs outputunify_chunkssuch, that required chunk amount can be specified. That way it also could be enforced that core-dims have one chunksizeatopwith multiple outputs: Given how little code this implementation is, it should be possible to include multiple outputs also foratop. I've opted for this implementation, as it has been discussed in atop with multiple output arguments #702 and seems sensiblebroadcast_arraysas oulined in atop with multiple output arguments #702, that allows to explicitly broadcast array dimensions against each other by additional specifications.xarray.apply_ufuncimplements such functionality already, but having this in dask could enhance usability independent ofxarray.