
Add support for __array_function__ protocol #4567

Merged
merged 15 commits into dask:master from pentschev:array_function_support on Mar 15, 2019

Conversation

5 participants
@pentschev (Member) commented Mar 8, 2019

Resolves #4563.

  • Tests added / passed
  • Passes flake8 dask

@pentschev (Member, Author) commented Mar 8, 2019

Ok, I'm not really sure why that's failing only on newer Python/NumPy versions. Does anybody see something obvious I've missed?

@hameerabbasi (Contributor) commented Mar 8, 2019

NumPy 1.13 and 1.11 didn't have this protocol... So what you added was essentially dead code then.

@mrocklin (Member) commented Mar 8, 2019

I'm surprised to see tests with sparse and cupy in this PR. From my perspective this is more about having Dask Arrays respond to NumPy functions, rather than having the blocks of a Dask array respond to NumPy functions.

So I would be more interested in checks like the following:

x = da.ones(...)
y = np.concatenate([x, x, x])
assert isinstance(y, da.Array)

Where we would also want to check things like np.linalg.qr (which you ran into with cupy) and any others that we know are sometimes problematic.
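Expanded into a runnable sketch (assuming NumPy ≥ 1.17, where the protocol is enabled by default; the shapes and chunk sizes here are illustrative):

```python
import numpy as np
import dask.array as da

# Calling the public NumPy function on Dask arrays should dispatch back
# into dask.array and return a lazy Dask array, not a NumPy one.
x = da.ones(6, chunks=3)
y = np.concatenate([x, x, x])
assert isinstance(y, da.Array)
assert y.compute().tolist() == [1.0] * 18

# np.linalg functions go through the same protocol; dask's qr wants
# tall-and-skinny chunking (a single column block).
q, r = np.linalg.qr(da.ones((9, 3), chunks=(3, 3)))
assert isinstance(q, da.Array) and isinstance(r, da.Array)
```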

@pentschev (Member, Author) commented Mar 8, 2019

NumPy 1.13 and 1.11 didn't have this protocol... So what you added was essentially dead code then.

Oh yes, that makes sense. I've been testing with upstream NumPy, that's why it passed here.

@pentschev (Member, Author) commented Mar 8, 2019

@mrocklin it's good that you mention that; I'd totally forgotten about that case. Either way, are you suggesting we drop the Sparse and CuPy tests? I understand it may be a little early to have them properly set up due to different version requirements, but ultimately, isn't that what we want: for Dask to also work with other array types?

@mrocklin (Member) commented Mar 8, 2019

We do want those to work, yes, but I think that effort is somewhat orthogonal to the solution you've added here.

Or, in testing terms: you've got some great integration tests here; it would be good to see some unit tests as well that just exercise Dask Array's support of the __array_function__ protocol with plain NumPy arrays.

@pentschev (Member, Author) commented Mar 8, 2019

Ok, so here's my suggestion: I'll add the tests you suggested (which I had really forgotten about) and skip the tests we currently have when NumPy < 1.17. Does that sound acceptable?


@pentschev (Member, Author) commented Mar 8, 2019

Sure. More tests are always great (assuming they're decently fast)

Agreed, the current ones were tweaked to be fast. I didn't profile them exactly, but all 3 tests certainly ran in under a second.

@pentschev (Member, Author) commented Mar 8, 2019

Ah, now I realize why the tests are failing: before 1.17 (currently upstream), we have to set the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1.

So the question here is: can we enable that in CI for newer versions of Python? Otherwise, I think the only alternative is to skip these tests until v1.17 is released.


@pentschev (Member, Author) commented Mar 8, 2019

Alright, I added some __array_function__ unit tests. It's by no means an extensive list, but probably a good starting point for us.

Regarding the NumPy version, we need at least v1.16 (tests are now skipped for older versions) with the NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1 variable exported, which I've now added. Is it possible for us to also test with NumPy v1.16.x?

Resolved review thread: dask/array/tests/test_array_function.py
Resolved review thread (outdated): dask/array/core.py
@mrocklin (Member) left a comment

Ah, looks like I didn't submit this review. I repeat a bit of what @shoyer said, but there are some other comments as well.

Resolved review thread: continuous_integration/travis/run_tests.sh
if da_func is func:
return NotImplemented
if not all(issubclass(t, Array) for t in types):
return NotImplemented

@mrocklin (Member) commented Mar 9, 2019

So in many cases a dask array function will work with NumPy objects (or presumably anything on which it can effectively call asarray). I suggest a test like np.tensordot(dask_array, numpy_array), and replacing the all above with an any.

@shoyer (Member) commented Mar 9, 2019

The __array_function__ protocol guarantees that it will only get called if dask.array.Array is in types. So if that's all you wanted to check, you could drop this entirely.

The problem is that you want to support operations involving some, but not all, other array types. For example, xarray.DataArray wraps dask arrays, not the other way around. So dask.Array.__array_function__ should return NotImplemented in that case, leaving it up to xarray.DataArray.__array_function__ to define the operation.

We basically need some protocol or registration system to mark arrays as "can be coerced into dask array", e.g., either

sparse.Array.__dask_chunk_compatible__ = True

or

dask.array.register_chunk_type(sparse.Array)

@mrocklin (Member) commented Mar 9, 2019

Hrm, that's an interesting problem. Maybe we try to turn all of the args into dask arrays with da.asarray and, if that works, proceed?

I'm not sure that this would produce the correct behavior for xarray.DataArray though. I genuinely don't know what the right behavior is there.

@hameerabbasi (Contributor) commented Mar 9, 2019

I actually think my idea for mixins can help with this. Specifically, we could check the specific mixins that Dask absolutely requires, I’m assuming that’d be the indexing mixin.

@pentschev (Member, Author) commented Mar 9, 2019

Hrm, that's an interesting problem. Maybe we try to turn all of the args into dask arrays with da.asarray and, if that works, proceed?

I actually like this idea for its simplicity.

If we want to be completely general, we could use hasattr(t, ‘__array_function__’). But this would also include Pandas dataframes at some point I’m assuming...

I also like this idea in general, but then I think it would be best to first agree on all the mixins we expect to have/support. Plus, if I understood correctly, it feels like this will require much more type checking, which may become complicated to handle.

@mrocklin (Member) commented Mar 10, 2019

I think that would only help to address the second point above, and I'm not sure it would solve it. As a worst-case scenario, I suspect that a project like TensorFlow would not inherit from those classes. We could get NumPy, Dask, Sparse, and Xarray on board, but I'm not sure about any of the others out there.

I personally think that duck typing is more likely to work across projects than explicit typing.

Mixins may still be a good idea for the reason you mention (you can easily implement a lot of NumPy-like functionality without too much effort), but I don't think they're a good choice for testing NumPy-ness.

@hameerabbasi (Contributor) commented Mar 10, 2019

I do agree that it's hard to get traction for mixins or add dependencies, but adding a protocol isn't hard.

While you're right that TensorFlow doesn't plan on implementing any of these protocols, it has always historically done its own thing, and it has its own distributed system. I suspect that with Google's historical "rewrite the world and do it better" philosophy, it'll be hard to get them to play along anyway.

In contrast, anyone specifically aiming to be a NumPy duck array won't have a problem implementing protocols, and that is what I'm planning to do: allow mixins to implement methods and so forth based on the implementation of protocols.

@pentschev (Member, Author) commented Mar 11, 2019

For the time being, I've applied @shoyer's and @mrocklin's suggestions, replacing only that last condition in __array_function__ with:

if not all(issubclass(t, Array) for t in types):
    return NotImplemented

It seems to me that the mixin discussion has grown enough that we should move it to a Dask issue, discuss it there, and review the current code later. Does anybody disagree with my proposal?

@pentschev (Member, Author) commented Mar 12, 2019

Since nobody opposed, I moved the discussion to #4583.

@mrocklin (Member) commented Mar 14, 2019

I think that with @shoyer's statements below we can drop this conditional entirely. It should always pass.

The __array_function__ protocol guarantees that it will only get called if dask.array.Array is in types. So if that's all you wanted to check, you could drop this entirely.

Resolved review thread (outdated): dask/array/tests/test_array_function.py
@hameerabbasi (Contributor) commented Mar 9, 2019

Regarding the NumPy version, we need at least v1.16 (now skipped for older versions) and NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1 variable exported, which I now added. Is it possible for us to test also with NumPy v1.16.x?

You need to add that for 1.16, where it was experimental. For 1.17, though, the env var isn't needed currently, but people are proposing that we keep the "experimental" status of the NEP for one more version...

@pentschev (Member, Author) commented Mar 9, 2019

You need to add that for 1.16, where it was experimental. For 1.17, though, the env var isn't needed currently, but people are proposing that we keep the "experimental" status of the NEP for one more version...

While I agree that features should be experimental until they can really be considered stable, I was expecting that v1.17 would introduce this feature as stable. Without that, it's difficult to get other projects to even run unit tests for it, e.g., CuPy. I'm totally willing to help improve whatever we can to get this into v1.17 as stable. Are there many major concerns about its maturity?

@hameerabbasi (Contributor) commented Mar 9, 2019

are there many major concerns about its maturity?

CuPy has been using it for a while (as short as whiles go)... So I think not. Features might be added, but not removed.

Add more __array_function__ tests
Tests added:
* Submodules
* Functionality not implemented in Dask
@hameerabbasi (Contributor) commented Mar 10, 2019

One thing I must mention I like about this implementation is that it specifically avoids a NumPy dependency or mapping.

@pentschev (Member, Author) commented Mar 12, 2019

Oh, there's more in the ufunc tests. It fails specifically for i0, nan_to_num and sinc, which are tested in test_unary_ufunc. However, those 3 functions are defined just after the following comment in ufunc.py: "non-ufunc elementwise functions".

I fixed this issue in my latest commit. If anyone thinks this is wrong, please let me know.

I would say this is ready for another round of review now.
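The distinction is easy to check: true ufuncs are instances of np.ufunc, while those three are plain Python functions built on top of ufuncs, so they dispatch through __array_function__ rather than __array_ufunc__:

```python
import numpy as np

assert isinstance(np.sin, np.ufunc)             # a real elementwise ufunc
assert not isinstance(np.i0, np.ufunc)          # plain function
assert not isinstance(np.nan_to_num, np.ufunc)  # plain function
assert not isinstance(np.sinc, np.ufunc)        # plain function
```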

@jakirkham (Member) commented Mar 12, 2019

I fixed this issue in my latest commit. If anyone thinks this is wrong, please let me know.

That looks correct to me, @pentschev. Those are not really ufuncs, but they happened to more or less work for the purposes of the original tests. So it is more correct to test them separately. Thanks for correcting these.

@jakirkham (Member) left a comment

This seems fine to me. I added a small comment above about adjusting the tests a little, but I don't view it as a blocker.

@pentschev (Member, Author) commented Mar 12, 2019

@jakirkham thanks for the review!

@@ -502,7 +502,7 @@ def test_bincount_with_weights():
     dweights = da.from_array(weights, chunks=2)
     e = da.bincount(d, weights=dweights, minlength=6)
-    assert_eq(e, np.bincount(x, weights=dweights, minlength=6))
+    assert_eq(e, np.bincount(x, weights=dweights.compute(), minlength=6))

@mrocklin (Member) commented Mar 14, 2019

Why was this change needed?

@pentschev (Member, Author) commented Mar 14, 2019

Because in this test the right argument of assert_eq() is used as the baseline, and with x being a NumPy array, it fails here:

    @wraps(np.bincount)
    def bincount(x, weights=None, minlength=None):
        if minlength is None:
            raise TypeError("Must specify minlength argument in da.bincount")
        assert x.ndim == 1
        if weights is not None:
>           assert weights.chunks == x.chunks
E           AttributeError: 'numpy.ndarray' object has no attribute 'chunks'

Therefore, it makes sense to me to make both arguments NumPy arrays and let the call fall back to NumPy's implementation.

@hameerabbasi (Contributor) commented Mar 14, 2019

I think the reverse should be done, with weights being coerced to da.Array with the same chunks as x.

@mrocklin (Member) commented Mar 14, 2019

Ah I see. I think I agree with @pentschev here. The left argument, e, is testing the dask/dask situation while the right argument is testing the numpy/numpy situation.

@pentschev (Member, Author) commented Mar 14, 2019

I think the reverse should be done, with weights being coerced to da.Array with the same chunks as x.

That would cause da.bincount() to be dispatched, and then we would be testing against e, which computes exactly that.
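A small illustration of why the right-hand side should stay pure NumPy (the values here are made up):

```python
import numpy as np

x = np.array([0, 1, 1, 2])
weights = np.array([0.5, 1.0, 2.0, 0.25])

# With both arguments as plain ndarrays there is no __array_function__
# dispatch, so NumPy's own bincount serves as the reference result:
# index 0 gets 0.5, index 1 gets 1.0 + 2.0, index 2 gets 0.25.
baseline = np.bincount(x, weights=weights, minlength=6)
assert baseline.tolist() == [0.5, 3.0, 0.25, 0.0, 0.0, 0.0]
```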


@pentschev (Member, Author) commented Mar 14, 2019

I think that with @shoyer's statements below we can drop this conditional entirely. It should always pass.

The __array_function__ protocol guarantees that it will only get called if dask.array.Array is in types. So if that's all you wanted to check, you could drop this entirely.

I think I left this in for a reason, though I can't remember it for sure. I believe it was for the case of having arrays of different types. Wouldn't it be needed if one array were a Dask array and another a CuPy one, for example?

@mrocklin (Member) commented Mar 14, 2019

Currently the code is

        if not any(issubclass(t, (Array, np.ndarray)) for t in types):
            return NotImplemented

But we know that any(issubclass(t, Array) for t in types) is true. If it wasn't then we wouldn't be in this function.
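That guarantee is easy to demonstrate with a minimal toy class (illustrative only, not dask's implementation):

```python
import numpy as np

def _unwrap(a):
    # Recursively strip Wrapped instances out of arguments (lists and
    # tuples included, since e.g. np.concatenate receives a sequence).
    if isinstance(a, Wrapped):
        return a.data
    if isinstance(a, (list, tuple)):
        return type(a)(_unwrap(v) for v in a)
    return a

class Wrapped:
    """Toy duck array that defers all NumPy functions to its ndarray."""
    def __init__(self, data):
        self.data = np.asarray(data)
    def __array_function__(self, func, types, args, kwargs):
        # The protocol only calls us when Wrapped is among the argument
        # types, so this check can never fail:
        assert any(issubclass(t, Wrapped) for t in types)
        return Wrapped(func(*(_unwrap(a) for a in args), **kwargs))

w = Wrapped([1, 2])
out = np.concatenate([w, w, np.array([3])])  # mixed Wrapped/ndarray
assert isinstance(out, Wrapped)
assert out.data.tolist() == [1, 2, 1, 2, 3]
```

Even in the mixed call above, types contains both Wrapped and np.ndarray, so the any(issubclass(t, Wrapped) ...) check is vacuously true.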

@pentschev (Member, Author) commented Mar 14, 2019

@mrocklin I'm confused: are you saying now that we should get rid of the entire condition, or just drop Array and keep np.ndarray?

@mrocklin (Member) commented Mar 14, 2019

I'm saying that I think we can remove the entire condition. If what @shoyer suggests is true then I don't think it will ever be triggered.

@pentschev (Member, Author) commented Mar 14, 2019

I was thinking of other types, like CuPy and maybe Sparse. I thought I had seen one such case, but maybe I got confused. I'll remove it for now; if need be, we can always add it back later.

@mrocklin (Member) commented Mar 14, 2019

Currently your condition will not trigger if any of the arguments is of dask.array.Array type.

The fact that Array.__array_function__ is triggered means that at least one of our arguments is of dask.array.Array type. It doesn't matter what the other arguments are.

I may not fully understand things though.

@pentschev (Member, Author) commented Mar 14, 2019

@mrocklin I don't either, but that's most likely right: since Dask can operate on different types, it should probably be capable of operating on mixed types too. For now, I've removed this condition.

@mrocklin merged commit 87f3a48 into dask:master on Mar 15, 2019

2 checks passed

continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@mrocklin (Member) commented Mar 15, 2019

Merging this in. Thanks for the effort here @pentschev, and thank you @hameerabbasi, @jakirkham, and @shoyer for the review.

@shoyer (Member) commented Mar 15, 2019

For testing, I suppose this is OK, but I do think it's a really bad idea for dask not to include any type checking inside __array_function__. This is basically guaranteed to result in bugs.

But let's discuss more in #4583.

@shoyer (Member) commented Mar 15, 2019

To be clear, thanks @pentschev and @mrocklin for pushing this forward. I'm really excited about the potential here!

I just want to get the design right. Given that we were involved in writing the NEP, I feel like we should try hard to be an exemplar implementation of __array_function__.

@mrocklin (Member) commented Mar 15, 2019

Ah, my apologies if I jumped the gun here. Happy to revert and reopen if desired.

@shoyer (Member) commented Mar 15, 2019

No need to revert, we can fix this forward.

@pentschev (Member, Author) commented Mar 15, 2019

Thanks for the reviews @mrocklin, @shoyer, @hameerabbasi, @jakirkham.

I absolutely agree with @shoyer that we should get the design right, and I'm happy to help however I can. I'm not sure I can contribute much to the design right now, since I'm not aware of many of the use cases and how they should work, but count on me later for implementation and testing.

@jrbourbeau referenced this pull request on Mar 17, 2019

Merged

Add zarr to CI environment #4604

0 of 2 tasks complete

@pentschev referenced this pull request on Mar 27, 2019

Open

Interop with Dask #22

@pentschev deleted the pentschev:array_function_support branch on Apr 17, 2019

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Add support for __array_function__ protocol (dask#4567)
Add __array_function__ support to Array class