Support arithmetic with row-series #2085
Sometimes a Series is just one row and is intended to be broadcast across a DataFrame rather than applied elementwise.

This creates a convention: a Series with divisions (0, 1) signals that it is a single row and is thus appropriate for broadcasting.

This enables computations like the following (which used to err):

    df = df - df.mean()

However, it also introduces silent failures if an elementwise operation against this series is intended, despite its divisions of (0, 1).

Fixes dask#1759
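For reference, here is a minimal sketch of the semantics being targeted, written against plain pandas rather than dask. The frame contents are made up for illustration; the point is only that the reduction returns a Series indexed by the frame's columns, which is then broadcast down the rows.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the behaviour.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# The reduction yields a Series whose index is the frame's columns.
means = df.mean()

# pandas aligns the Series index with the frame's columns and
# applies the subtraction to every row.
centered = df - means
```

The intent of this PR is for the same expression to behave this way on a dask DataFrame.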
It's not really "1 row"; rather, it's a series whose index equals the columns of the original dataframe. In this case, I'd rather check whether the divisions equal the (first, last) elements of the columns of the frame it's broadcasting against, and set them accordingly in the same place you have here. This also keeps the meaning of divisions consistent.
Good point. I think that I have resolved this in a recent commit.
(Branch force-pushed from 1342f7b to 2987070.)
Any further comments, @jcrist? Merging this afternoon if not.
(Branch force-pushed from d899e68 to a131265.)
    return (isinstance(s, Series) and
            s.npartitions == 1 and
            s.known_divisions and
            any(s.divisions == (min(df.columns), max(df.columns))
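The excerpt above is cut off mid-expression, so here is a self-contained sketch of the kind of check it appears to perform. `FakeSeries`, `FakeFrame` and `is_row_series` are hypothetical stand-ins, not dask's classes or API. The loop over the frames taking part in the operation is an assumption about how the truncated `any(...)` continues.

```python
from collections import namedtuple

# Hypothetical stand-ins for dask's Series and DataFrame; only the
# attributes used by the check are modelled.
FakeSeries = namedtuple("FakeSeries",
                        ["npartitions", "known_divisions", "divisions"])
FakeFrame = namedtuple("FakeFrame", ["columns"])


def is_row_series(frames, s):
    """Return True if `s` looks like a single row to broadcast over a frame.

    Assumption: the truncated `any(...)` in the diff iterates over the
    frames involved in the operation.
    """
    return (isinstance(s, FakeSeries) and
            s.npartitions == 1 and
            s.known_divisions and
            any(s.divisions == (min(f.columns), max(f.columns))
                for f in frames if isinstance(f, FakeFrame)))


frame = FakeFrame(columns=["a", "b", "c"])
# Divisions match the (min, max) of the frame's columns: treated as a row.
row = FakeSeries(npartitions=1, known_divisions=True, divisions=("a", "c"))
# Ordinary single-partition series with divisions (0, 1): not a row here.
column = FakeSeries(npartitions=1, known_divisions=True, divisions=(0, 1))
```

With the column-based comparison, a series with divisions (0, 1) is no longer mistaken for a row unless the frame's columns really span 0 to 1.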
This seems fine to me. Just for posterity, I was initially worried about three things:
- What happens when columns is an unordered `CategoricalIndex`
- What happens when columns is a `MultiIndex`
- Performance of using `min(columns)` instead of `columns.min()`

The first one actually works fine, since the iterator goes over the values in the index, not the categorical codes (makes sense). For a `MultiIndex`, the iterator just yields tuples, so that also works. The third is interesting: the builtin `min` is way faster than the `min` method (not that it matters here, columns is likely to be small). This is true for both `Index` and `MultiIndex`.
In [62]: ind = pd.Index(map(str, range(100000)))
In [63]: %timeit ind.min()
100 loops, best of 3: 10.8 ms per loop
In [64]: %timeit min(ind)
100 loops, best of 3: 3.16 ms per loop
In [65]: ind = pd.Index(['a', 'b', 'c', 'd', 'e']) # Something smaller
In [66]: %timeit ind.min()
The slowest run took 4.64 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 30.2 µs per loop
In [67]: %timeit min(ind)
The slowest run took 43.95 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.62 µs per loop
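The first two points can also be checked directly. This is a small sketch with made-up column labels, assuming a reasonably recent pandas. It only shows that the builtin `min` and `max` iterate over index values (or tuples, for a `MultiIndex`).

```python
import pandas as pd

# Unordered categorical columns; the category order deliberately differs
# from the lexical order of the values.
cat_cols = pd.CategoricalIndex(["b", "a", "c"],
                               categories=["c", "b", "a"],
                               ordered=False)

# MultiIndex columns; iterating yields plain tuples.
multi_cols = pd.MultiIndex.from_tuples([("x", 2), ("x", 1), ("y", 0)])

# The builtins compare the iterated values, not the categorical codes.
cat_bounds = (min(cat_cols), max(cat_cols))
multi_bounds = (min(multi_cols), max(multi_cols))
```

Here `cat_bounds` is compared lexically on the string values and `multi_bounds` by ordinary tuple comparison.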
cc @jcrist for review