Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return da.Array rather than dd.Series for non-ufunc elementwise functions on dd.Series #8558

Merged
merged 3 commits into from Apr 7, 2022

Conversation

jsignell
Copy link
Member

This seems to work fine, but it does change the output type of ufuncs. I think this is better than it was originally and this code is very old, but I'd appreciate a second set of eyes. Maybe someone from @dask/array

@jsignell
Copy link
Member Author

@jorisvandenbossche or @rjzamora would either of you be able to sanity check this?

@jakirkham
Copy link
Member

Have we tested this on older versions of NumPy (like 1.18)?

@jsignell
Copy link
Member Author

jsignell commented Jan 18, 2022

Have we tested this on older versions of NumPy (like 1.18)?

That's the version of numpy installed on the python 3.7 tests https://github.com/dask/dask/runs/4781480587?check_suite_focus=true for instance

@jorisvandenbossche
Copy link
Member

but it does change the output type of ufuncs.

Do you mean that da.ufunc(dd.Series) now gives a da.Array instead of preserving the dd.Series?
(it seems to me that dd.Series is the "better" result type)

@jakirkham
Copy link
Member

In fact this is how it did work in the past ( #1669 ), which interestingly is the same PR that adds __array_wrap__ and makes it raise NotImplementedError.

Given how old that PR is and how little that code has changed, I wonder if something else is causing issues here.

@jsignell
Copy link
Member Author

but it does change the output type of ufuncs.

Do you mean that da.ufunc(dd.Series) now gives a da.Array instead of preserving the dd.Series? (it seems to me that dd.Series is the "better" result type)

That is what I mean, it seems like with this PR we come a lot closer to matching the numpy + pandas object behavior:

import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

s = pd.Series(np.random.randn(100))
ds = dd.from_pandas(s, 3)

da.sin(ds)  # dd.Series
np.sin(s)  # pd.Series
da.real(ds)  # da.Array
np.real(s)  # np.Array

@jorisvandenbossche
Copy link
Member

The __array_wrap__ method was added back, but deprecated (pandas-dev/pandas#45451). That should give you some more time to get this fixed (i.e. the pandas release tomorrow won't yet break dask directly).

it seems like with this PR we come a lot closer to matching the numpy + pandas object behavior:

Isn't it the other way around, i.e. further from numpy/pandas' behaviour?

With the numpy / pandas combo, calling a numpy ufunc on a pandas Series returns a pandas Series (np.ufunc(pd.Series) -> pd.Series).
And with dask, before this PR, calling a dask.array ufunc on a dask Series also preserved the dask Series (da.ufunc(dd.Series) -> dd.Series).

While you mention that with this PR it will start returning a da.Array instead of dd.Series, so which AFAIU is less consistent with pandas.

@jsignell
Copy link
Member Author

Sorry @jorisvandenbossche I think I'm missing something, can you provide an example that gives surprising behavior? The example that I gave above seems to match the numpy on pandas behavior.

@jorisvandenbossche
Copy link
Member

Ah, sorry, I misinterpreted your code snippet in #8558 (comment). I wasn't looking carefully enough to notice the difference between np.sin return types and np.real return types.
And the cause of my confusion is that np.sin is a ufunc, and np.real is not a ufunc.

So just to be very explicit, the following wasn't actually a correct statement:

but it does change the output type of ufuncs.

Do you mean that da.ufunc(dd.Series) now gives a da.Array instead of preserving the dd.Series? (it seems to me that dd.Series is the "better" result type)

That is what I mean

So for ufuncs the output did not change (according to the code snippet in #8558 (comment), the ufunc da.sin still returns a dd.Series).
But it is for the non-ufunc numpy functions that the output changed. Previously, dask made them preserve the Series class as well, while numpy doesn't actually do that with pandas.Series.

So yes, that indeed seems to bring the behaviour closer to numpy/pandas.

@jsignell
Copy link
Member Author

Ah sorry for the confusion @jorisvandenbossche I don't know why I thought those were all ufuncs 🙃

@bryanwweber
Copy link
Contributor

@jsignell Does this PR need a review? Do you have a suggestion for someone to do that?

@jsignell
Copy link
Member Author

jsignell commented Feb 2, 2022

This PR needs a decision more than it needs review. I'll add a deprecation cycle and then it'll be ready for review.

@jsignell jsignell added the deprecation Something is being removed label Feb 4, 2022
@jsignell
Copy link
Member Author

jsignell commented Feb 4, 2022

grrr it's hard to deprecate this in a way that is at all usable. I'll think some more about it unless anyone else has ideas.

@jsignell
Copy link
Member Author

@ncclementi in case you are looking into "test upstream" failures this resolves 12 or so, but it is kind of hard to implement in a backwards compatible way.

@ncclementi
Copy link
Member

@jsignell Thanks for the heads up, it looks like there are not many left. I think if we find a solution to #8722 (comment) that should solve all upstream problems.

@jsignell
Copy link
Member Author

I talked with @jcrist and we agreed that since the previous behavior was really more of a bug than anything else, we think it's ok to merge this without a deprecation.

@jsignell jsignell added the almost done Work is almost done! label Mar 23, 2022
@jsignell jsignell changed the title Make ufunc not fail because of missing __array_wrap__ Return da.Array rather than dd.Series for non-ufunc elementwise functions on dd.Series Mar 23, 2022
@jsignell
Copy link
Member Author

I'll try kicking off ci later when github is healthier

@jakirkham
Copy link
Member

cc @galipremsagar

@jsignell
Copy link
Member Author

jsignell commented Apr 7, 2022

I'm planning on merging this when tests pass.

@ian-r-rose
Copy link
Collaborator

Thanks @jsignell

@jsignell jsignell merged commit e07c98c into dask:main Apr 7, 2022
@jsignell jsignell deleted the fix-array-wrap branch April 7, 2022 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
almost done Work is almost done! array dataframe deprecation Something is being removed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants