Feature/dask.array.apply_gufunc (Issue #702)#3109

Merged
shoyer merged 49 commits into dask:master from magonser:feature/dask.array.apply_gufunc
May 18, 2018

Conversation

@magonser
Contributor

@magonser magonser commented Jan 27, 2018

This is an attempt to solve #702. I think the PR is not quite finished yet:

Done:

  • Implement apply_gufunc and provide a variation gufunc
  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API

Todo:

  • Possibly decide between apply_gufunc or gufunc
  • Maybe make extra doc-page for gufuncs.

Notes:

This implementation basically uses atop. I found that the following modifications could
improve this implementation:

  1. Possible modifications to top or unify_chunks:
    1. Make top accept specific dimensions for the concatenate argument. This would be needed to force core dims to have only one chunk, even for core dims that are present in the gufunc's output.
    2. Alternatively, modify unify_chunks so that the required number of chunks can be specified. That way it could also be enforced that core dims have a single chunk.
  2. atop with multiple outputs: Given how little code this implementation is, it should be possible to support multiple outputs in atop as well. I've opted for this implementation, as it has been discussed in atop with multiple output arguments #702 and seems sensible.
  3. Implement something like broadcast_arrays as outlined in atop with multiple output arguments #702, which allows explicitly broadcasting array dimensions against each other via additional specifications. xarray.apply_ufunc already implements such functionality, but having this in dask could enhance usability independently of xarray.
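For readers unfamiliar with gufunc signatures, the semantics this PR builds on can be sketched with plain NumPy; here np.vectorize (which already accepts a gufunc signature) merely stands in for the proposed apply_gufunc, and the stats function mirrors the PR's own example:

```python
import numpy as np

# "(i)->(),()": one core dimension `i` is consumed per call, and two scalar
# outputs are produced; all leading dimensions are loop dims and broadcast.
stats = np.vectorize(lambda x: (np.mean(x), np.std(x)), signature="(i)->(),()")

a = np.random.normal(size=(10, 20, 30))
mean, std = stats(a)  # core dim of size 30 is consumed; loop dims (10, 20) remain
```

The point of apply_gufunc is to get these same semantics on dask arrays without np.vectorize's per-element Python loop, by handing whole blocks to the function.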

@mrocklin
Member

I'm looking forward to reading through this (though probably not for several hours)

Also cc'ing @shoyer (atop) and @jakevdp (gufuncs)

@shoyer
Member

shoyer commented Jan 27, 2018

Awesome! I'm also looking forward to reading through this when I'm back home on Monday.

@mrocklin
Member

Possible modifications to top or unify_chunks:

  • Make top accept specific dimensions for the concatenate argument. This would be needed to force core dims to have only one chunk, even for core dims that are present in the gufunc's output.
  • Alternatively, modify unify_chunks so that the required number of chunks can be specified. That way it could also be enforced that core dims have a single chunk.

Perhaps we should call rechunk explicitly here? In some cases I can also imagine wanting to reduce the chunk sizes of the non-core dimensions in order to maintain a modest chunksize. This might be too much magic though, I'm not sure.
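To make the rechunking question concrete: a dask array's chunks attribute is a tuple of per-axis block sizes, and "core dims have one chunk" means collapsing those axes to a single block. A hypothetical helper (single_chunk_core is not part of the PR) could compute the target chunking like this:

```python
def single_chunk_core(chunks, core_axes):
    # Collapse each core axis to a single chunk spanning the whole axis,
    # leaving the chunking of loop axes untouched.
    return tuple(
        (sum(axis_chunks),) if axis in core_axes else tuple(axis_chunks)
        for axis, axis_chunks in enumerate(chunks)
    )

# Shape (10, 30) chunked as ((5, 5), (10, 10, 10)), with axis 1 a core dim:
target = single_chunk_core(((5, 5), (10, 10, 10)), core_axes={1})
```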

Member

@mrocklin mrocklin left a comment

Some initial comments, mostly to help educate myself.

_outs = _parse_args(_out)
if _outs and (len(_outs) == 1) and ('),' not in _out):
    _outs = _outs[0]
return _ins, _outs
Member

It would ease reviewing and reading by future devs if these functions had relatively simple docstrings with a couple examples of their inputs/outputs

def parse_signature(signature):
    """
    Parse gufunc signature into ...

    Examples
    --------
    >>> parse_signature("...")
    ...
    """

Contributor Author

Done.

set ``vectorize=True``).
signature: String
    Specifies what core dimensions are consumed and produced by ``func``,
    according to the specification of the numpy gufunc signature [2]_
Member

Quick question, do numpy gufuncs have signatures attached to them? If so would it make sense to get the signature from the callable?

... return np.mean(x, axis=-1), np.std(x, axis=-1)
... a = da.random.normal(size=(10,20,30), chunks=5)
... mean, std = da.apply_gufunc(stats, "(i)->(),()", a, output_dtypes=2*(a.dtype,))
... mean.compute() # doctest: +SKIP
Member

Style note, I recommend using >>> on each individual line.

Member

Except the body of the stats function, of course

Contributor Author

Done

vectorize = kwargs.pop("vectorize", None)
concatenate = kwargs.pop("concatenate", True)
if output_dtypes is None:
    raise ValueError("Must specify `output_dtypes` of output array(s)")
Member

Typically we run the function on a small or empty array to see if that can produce the dtype for us. See dask/array/core.py::apply_infer_dtype
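The idea behind that helper can be sketched as follows; this is a simplified illustration of inferring dtypes by calling the function on zero-size arrays (the function name infer_output_dtypes is made up here, and the real apply_infer_dtype also wraps errors with more context):

```python
import numpy as np

def infer_output_dtypes(func, input_dtypes):
    # Call `func` on zero-size 1-d arrays of the input dtypes and read the
    # dtype(s) off the result; error handling omitted for brevity.
    args = [np.empty((0,), dtype=dt) for dt in input_dtypes]
    out = func(*args)
    if isinstance(out, tuple):
        return tuple(o.dtype for o in out)
    return out.dtype

dt = infer_output_dtypes(lambda x, y: x + y, [np.int32, np.float64])
dts = infer_output_dtypes(lambda x: (x, x.astype('int64')), [np.float32])
```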

if len(output_dtypes) != nout:
    raise ValueError("Must specify tuple of %d dtypes for `output_dtypes` for function with multiple outputs" % nout)
if nout is None and isinstance(output_dtypes, (tuple, list)):
    raise ValueError("Must specify single dtype for `output_dtypes` for function with one output")
Member

Just checking, but do we test for this behavior? (I haven't gotten down to that yet)

tmp = atop(ufunc, loop_output_dims, *arginds,
           dtype=int,  # Only a dummy dtype; any will do
           concatenate=concatenate,
           **kwargs)
Member

Given that we don't really want an array output, would it make sense to drop down to top here?

I'm also open to teaching atop how to handle multiple outputs. (I'll write a longer note about this in the main comments)

Contributor Author

I also needed unify_chunks to be called, so I basically reuse 80% of the atop code here. A potential refactor can follow from the discussion below.

return leaf_arrs if nout else leaf_arrs[0]


@curry
Member

Using curry can hide odd sorts of errors. If we know how this is likely to be used then it might be cleaner to make a function that returns another function explicitly.

Contributor Author

I have removed it for now. Checking the correctness of passed **kwargs is still postponed until apply_gufunc is called, i.e. very late. Depending on the other discussion about dtype inference, we can finalize the behavior here eventually.

... return np.mean(x, axis=-1), np.std(x, axis=-1)
... a = da.random.normal(size=(10,20,30), chunks=5)
... mean, std = stats(a)
... mean.compute() # doctest: +SKIP
Member

Ideally we would define this thing as a numpy gufunc and implement some protocol in dask to handle those when applied to dask arrays. I suspect that this isn't doable today. I'll write more about this in a longer comment.

@mrocklin
Member

Regarding atop with multiple arguments I'm all in favor of doing this if it's easy to do and doesn't impact the easy growth of atop in the future. My guess is that this will be harder to do in general due to all of the other features that have been added to atop in the past. As we add features to atop I think we need to be careful that they are added in such a way that is nicely orthogonal and respectful of the needs of future changes.

@mrocklin
Member

The combination of gufuncs and dask.array is a powerful one. In my ideal world we wouldn't define these as da.gufuncs as we've done in this example ...

@dask.array.gufunc("(i)->(),()", output_dtypes=2*(a.dtype,))
def stats(x):
    return np.mean(x, axis=-1), np.std(x, axis=-1)

... but rather as numpy.gufuncs.

@numpy.gufunc("(i)->(),()", output_dtypes=2*(a.dtype,))
def stats(x):
    return np.mean(x, axis=-1), np.std(x, axis=-1)

Then we would implement some sort of __array_gufunc__ protocol and call your apply_gufunc operation whenever a gufunc was called on a set of arguments that included a dask array. CC @njsmith . My guess is that this isn't doable today, but is in scope for medium-term enhancements. Is this guess correct?

This would also be a nice touchpoint for other projects that are able to create gufuncs, like numba. cc @seibert

@shoyer
Member

shoyer commented Jan 28, 2018

It should be totally doable today to make Dask array directly support NumPy gufuncs. Gufuncs call __array_ufunc__ and expose their signature with the signature attribute (if I remember correctly).
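A minimal sketch of the dispatch mechanism described here (the Wrapped class is hypothetical, not dask's implementation): defining __array_ufunc__ on a class makes NumPy hand every ufunc call involving it to that method, where the ufunc's signature attribute (None for element-wise ufuncs, a string like "(i)->()" for gufuncs) can be inspected before delegating:

```python
import numpy as np

class Wrapped:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Record the core-dim signature (None for element-wise ufuncs),
        # unwrap the inputs, and delegate back to NumPy.
        self.last_signature = ufunc.signature
        arrays = [x.data if isinstance(x, Wrapped) else x for x in inputs]
        return getattr(ufunc, method)(*arrays, **kwargs)

w = Wrapped([1.0, 2.0, 3.0])
out = np.add(w, 1.0)  # routed through Wrapped.__array_ufunc__
```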

@magonser
Contributor Author

Thanks for your comments.

Perhaps we should call rechunk explicitly here? In some cases I can also imagine wanting to reduce the chunk sizes of the non-core dimensions in order to maintain a modest chunksize. This might be too much magic though, I'm not sure.

In its current implementation, apply_gufunc relies on unify_chunks, which is called by atop for rechunking. Maybe there are too many different use cases to find good rules for how to auto-magically rechunk?

It should be totally doable today to make Dask array directly support NumPy gufuncs. Gufuncs call __array_ufunc__ and expose their signature with the signature attribute (if I remember correctly).

That means we could automatically map numpy gufuncs. But what about plain Python functions that shall be used as gufuncs, as shown in this PR's use cases and examples? I.e. in numpy you either don't wrap your function at all, because you expect numpy arrays and use only numpy math, or, if you want to vectorize it, you use numpy.vectorize, which supports the gufunc signature. For dask there is, of course, an added layer in between: blocks. So I want to wrap my function like numpy.vectorize, but expect the blocks being passed as numpy arrays, including the loop dims, for performance reasons. Maybe the chosen name apply_gufunc/gufunc is misleading here - any other suggestions?

Extending atop for many outputs is handy in itself, but I consider keeping the function's signature (and the core dimensions) separate from an added preliminary broadcast stage for the data at hand a sensible thing to do.

Member

@shoyer shoyer left a comment

A few comments on the signature parsing part of the code.

return list(r for r in ret if r is not None)


def parse_signature(signature):
Member

You might just borrow the version of this function that I wrote for numpy.vectorize:
https://github.com/numpy/numpy/blob/99019107cbde06294a356fd7fc859f9b31f87410/numpy/lib/function_base.py#L1638-L1666
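That NumPy helper is regex-based; a condensed sketch of the same approach (not the exact upstream code) looks like this:

```python
import re

# Condensed, regex-based sketch modeled on numpy's _parse_gufunc_signature.
_DIMENSION_NAME = r'\w+'
_CORE_DIMENSION_LIST = '(?:{0:}(?:,{0:})*)?'.format(_DIMENSION_NAME)
_ARGUMENT = r'\({}\)'.format(_CORE_DIMENSION_LIST)
_ARGUMENT_LIST = '{0:}(?:,{0:})*'.format(_ARGUMENT)
_SIGNATURE = '^{0:}->{0:}$'.format(_ARGUMENT_LIST)

def parse_signature(signature):
    """Parse a gufunc signature into (input_dims, output_dims).

    >>> parse_signature('(i,j)->(i),(j)')
    ([('i', 'j')], [('i',), ('j',)])
    """
    if not re.match(_SIGNATURE, signature):
        raise ValueError('not a valid gufunc signature: {}'.format(signature))
    return tuple(
        [tuple(re.findall(_DIMENSION_NAME, arg))
         for arg in re.findall(_ARGUMENT, arg_list)]
        for arg_list in signature.split('->')
    )
```

Note that this version, like NumPy's, rejects trailing commas and empty input lists such as '->(i),'.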

>>> parse_signature('(i,j)->(i),(j)')
([('i', 'j')], [('i',), ('j',)])

>>> parse_signature('->(i),')
Member

I don't think this should be a valid gufunc signature. The spec doesn't allow trailing commas.

_ins = [i for i in _ins if i is not None]
_outs = _parse_args(_out)
if _outs and (len(_outs) == 1) and ('),' not in _out):
    _outs = _outs[0]
Member

These sort of special cases (return a scalar for length 1 output) are usually best avoided. They end up requiring work arounds in code that uses it.

Contributor Author

Thank you for pointing that out. It confuses me though: return values in Python can be either a scalar or a tuple, even of length 1. Python function return statements handle this consistently, i.e.

def foo():
    return 1,

a, = foo() 

I don't understand why the numpy gufunc signature forbids length-1 return tuples. It doesn't limit the normal usage of returning scalars.

Shouldn't functions be implemented so that they behave consistently even for corner cases? E.g. the signatures "->()" or "->(i)" are currently not valid gufunc signatures either. While it seems funny to wrap a non-gufunc without any arguments as a gufunc, it should lead to a consistent workflow in the end.

I understand that for now, this should be implemented according to current numpy definition of the signature.

What is your stand on this? Should the signature definition in numpy be extended to cover mentioned corner cases?

Member

It might be worth bringing this up as a NumPy issue

_flag = len(outs) == 1
outs = ",".join("({0})".format(",".join(dims)) for dims in outs)
if _flag:
    outs += ","
Member

I'm not quite sure why you need this?

outs = ",".join("({0})".format(",".join(dims)) for dims in outs)
if _flag:
    outs += ","
else:
Member

I think this clause could be dropped if you only support the version that always takes multiple outputs.

@shoyer
Member

shoyer commented Jan 29, 2018

That means we could automatically map numpy gufuncs. But what about plain Python functions that shall be used as gufuncs, as shown in this PR's use cases and examples?

I agree, handling non-gufuncs that work like gufuncs is an important use case. It's a pain to write actual gufuncs (they need to be in C and have limited features), so even NumPy has some non-gufunc gufunc-likes like numpy.matmul, np.linalg.inv and np.vectorize.

@mrocklin
Member

That means we could automatically map numpy gufuncs. But what about plain Python functions that shall be used as gufuncs, as shown in this PR's use cases and examples?

I agree, handling non-gufuncs that work like gufuncs is an important use case. It's a pain to write actual gufuncs (they need to be in C and have limited features), so even NumPy has some non-gufunc gufunc-likes like numpy.matmul, np.linalg.inv and np.vectorize.

I also agree. It would be good to coordinate with Numpy devs so that we come to an interface that they might also be happy with in the future. It would be nice at some point to implement this as follows:

if numpy.__version__ >= ...:
    gufunc = numpy.gufunc
else:
    def gufunc(...)

@mrocklin
Member

mrocklin commented Feb 2, 2018

Checking on the status here. This has been dormant for a few days. Is there anything that's blocking you @magonser ? No time pressure of course, I just want to make sure that you're able to move smoothly if/when you have time.

@magonser
Contributor Author

magonser commented Feb 5, 2018

Alive beat! Sorry for not getting back to you faster. Two reasons: this is in my spare time (so either way, this will take some time), and I am still wrapping my head around the necessary actions and numpy internals.

My current understanding is

numpy:

  • Raise issue #10526 for general additional method gufunc for wrapping pyfuncs into gufunc objects, just like numpy.vectorize, but without the "vectorize" part (broadcast of loop dimensions would take place)
  • Raise issue #10527 for extending corner-cases of signature definition

dask (this PR):

  • Make signature consistent with current implementation of np.vectorize
  • Hook gufunc inside dask.array.array.__array_ufunc__ routine
  • Provide our own gufunc wrapper until numpy might provide its own implementation, as suggested by @mrocklin

@mrocklin
Member

Checking in again on this. No time pressure, I just want to track things. Are you blocking on anything from other Dask maintainers? Numpy maintainers?

@magonser
Contributor Author

With respect to the two numpy issues mentioned above, there haven't been many discussions. Especially the API alignment has not received any comments. I don't think it's blocking anything, but I'll just need more time. Is it better for you to reopen this PR later?

@mrocklin
Member

It looks like there are some failures, perhaps based on numpy versions:

        if vectorize:
>           func = np.vectorize(func, signature=signature, otypes=otypes)
E           TypeError: __init__() got an unexpected keyword argument 'signature'

@mrocklin
Member

@shoyer any further comments here?

@magonser
Contributor Author

None from my side. Let me know if I need to change anything else.

At this point: I thank you for your patience with this PR and your time for the review and discussions - I appreciate it.

@mrocklin
Member

I'm generally pretty happy with this. Merging tomorrow morning US Eastern time if there are no further comments.

@mrocklin
Member

# Input processing:
## Signature
if not isinstance(signature, str):
    raise ValueError('`signature` has to be of type string')
Member

nit: this should be TypeError

Contributor Author

Fixed

        for arg in re.findall(_ARGUMENT, in_txt)]
outs = [tuple(re.findall(_DIMENSION_NAME, arg))
        for arg in re.findall(_ARGUMENT, out_txt)]
outs = outs[0] if ((len(outs) == 1) and (out_txt[-1] != ',')) else outs
Member

Why does this have a different return type for a single output? I think that makes your logic below unnecessarily complicated.

core_input_dimss, core_output_dimss = _parse_gufunc_signature(signature)

## Determine nout: nout = None for functions with a single direct return; nout = int for functions returning tuples
nout = None if not isinstance(core_output_dimss, list) else len(core_output_dimss)
Member

Why use nout=None instead of the simpler nout=1?

else:
    if nout is not None:
        raise ValueError("Must specify tuple of dtypes for `output_dtypes` for function with multiple outputs")
    otypes = [output_dtypes]
Member

I don't quite understand why all this logic is necessary. Are you trying to be more flexible in allowing a single output_dtype to be provided if there is only one output?

### *) Treat direct output
if nout is None:
    core_output_dimss = [core_output_dimss]
    output_dtypes = [output_dtypes]
Member

If you avoided unpacking core_output_dimss and output_dtypes above, I don't think you would need this step.

def test_Array_numpy_gufunc_call__array_ufunc__01():
    x = da.random.normal(size=(3, 10, 10), chunks=(2, 10, 10))
    nx = x.compute()
    ny = np.linalg._umath_linalg.inv(nx)
Member

Can you use the public np.linalg.inv with apply_gufunc? It's not a good idea to rely on functions in private modules, because those could change in any release without warning.

I guess we could stick with this (since you document these methods in the docs, and numpy doesn't expose any gufuncs in its public API currently), but it makes me a little nervous. Maybe it would be better to make the skipif condition `not hasattr(getattr(np.linalg, '_umath_linalg', None), 'inv')`.

Contributor Author

@magonser magonser May 16, 2018

Will do tomorrow

  • Make _umath_linalg tests optional

Contributor Author

The problem seems to be that the private functions already exist, but do not accept optional kwargs.

Using numpy=1.11:

____________________________________ test_Array_numpy_gufunc_call__array_ufunc__01 _____________________________________

    @pytest.mark.skipif(hasattr(hasattr(np.linalg, '_umath_linalg'), 'inv'),
                        reason="NumPy doesn't have `np.linalg._umath_linalg.inv`")
    def test_Array_numpy_gufunc_call__array_ufunc__01():
        x = da.random.normal(size=(3, 10, 10), chunks=(2, 10, 10))
        nx = x.compute()
        ny = np.linalg._umath_linalg.inv(nx)
>       y = np.linalg._umath_linalg.inv(x, output_dtypes=float)
E       TypeError: 'output_dtypes' is an invalid keyword to ufunc 'inv'

dask/array/tests/test_array_core.py:226: TypeError
=============================================== 1 failed in 0.76 seconds ===============================================

Contributor Author

I've added a pytest.mark.xfail to both relevant tests.

# def test_apply_gufunc_one_scalar_output():
#     def foo():
#         return 1,
#     x, = apply_gufunc(foo, "->(),", output_dtypes=(int,))
Member

OK, now I understand why you special cased one output. Remind me, did you open an upstream issue in NumPy to discuss this?

My guess is that even though this is arguably poor API (special casing one vs. many return values) there will be little appetite for changing/extending it in NumPy itself.

Contributor Author

@magonser magonser May 16, 2018

Yes, the upstream issue is numpy/numpy#10527

I'm slowly coming to understand that I might need to convince many people for this to change upstream, and that the right order is to implement against the current API and change later if upstream changes.

Of course, all your other code comments become moot if I roll back to the current signature API.

As mentioned before I'd like to be transparent about the current implementation, which means also (focus on output_dtypes):

def foo(x):
    return x
a = np.array([1, 2, 3])
apply_gufunc(foo, "()->()", a, output_dtypes=int)  # Not: output_dtypes=[int]

What do you recommend/favor? Keep or roll back to current signature API?

Member

I do think it would be slightly better not to try to extend the gufunc signature here (before we have consensus that it makes sense for numpy). There aren't a lot of use cases for gufuncs returning a tuple of length 1 (beyond consistency) or for accepting no input arguments, and consistency with numpy is virtuous.

That said, given that you already wrote this, I don't think there's a lot of harm in leaving it in as unsupported features...

Contributor Author

Thanks @shoyer.

Ok - so we will keep these parts as they are now.

@mrocklin
Member

Checking in. What is the status here?

Any further concerns @shoyer? If not then would you like to do the honors and merge?

@shoyer shoyer merged commit bf5cfb0 into dask:master May 18, 2018
@shoyer
Member

shoyer commented May 18, 2018

thanks @magonser!

I'm really excited about this.

@mrocklin
Member

mrocklin commented May 18, 2018 via email

@mrocklin
Member

Thank you for doing this @magonser and sticking with it. I'm very happy about this change

@magonser
Contributor Author

Awesome! Thanks also again for your patience!

Let's see what will come about it, once people start experimenting with it...

@mrocklin
Member

mrocklin commented May 18, 2018 via email

@jakirkham
Member

Thanks for staying with this, @magonser. Pursuing a change like this is quite hard, and it's really great that you stuck with it and we were able to get it in. Looking forward to playing with this in the new Dask version.
