Support arithmetic with row-series #2085

Merged
merged 3 commits into from
Mar 15, 2017


mrocklin
Member

Sometimes a Series is just one row and is intended to be broadcast
across a DataFrame rather than applied elementwise.

This creates a convention: a Series with divisions (0, 1) signals
that it is a single row and is therefore appropriate for broadcasting.

This enables computations like the following (which used to err):

    df = df - df.mean()

However, it also introduces silent failures when an elementwise operation
against such a series is intended despite its divisions of (0, 1).

Fixes #1759

cc @jcrist for review
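
For reference, here is a minimal sketch of the broadcasting behaviour in plain pandas, which is presumably what the dask expression above is meant to mirror. It uses pandas only, not dask, and the frame contents are made up for illustration.

```python
import pandas as pd

# Illustrative data; the column names and values are assumptions.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# The reduction returns a Series indexed by the frame's column names.
means = df.mean()

# Subtraction aligns the Series index with the columns and broadcasts
# the values down every row, rather than matching row by row.
centered = df - means

print(means)
print(centered)
```

Each column of `centered` has its own mean removed, which is the result the PR wants dask to produce for the same expression.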

@jcrist
Member

jcrist commented Mar 14, 2017

It's not really "1 row"; rather, it's a series whose index equals the columns of the original dataframe. In this case, I'd rather check whether the divisions are equal to the (first, last) elements of the columns of the frame it's broadcasting against, and set them accordingly in the same place you have here. This also keeps the meaning of divisions consistent.
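
The check suggested here can be sketched in plain Python, with tuples standing in for divisions and a list standing in for the frame's columns. `matches_column_bounds` is a hypothetical helper written for illustration, not a dask function, and it uses min/max over the columns as the bounds.

```python
def matches_column_bounds(divisions, columns):
    """Hypothetical helper: True when `divisions` spans exactly the columns."""
    columns = list(columns)
    return bool(columns) and tuple(divisions) == (min(columns), max(columns))


columns = ["a", "b", "c"]

# A column-wise reduction is indexed by the column names, so under the
# suggested convention its divisions would be ("a", "c").
row_like = matches_column_bounds(("a", "c"), columns)

# The original (0, 1) marker does not match string-labelled columns.
old_marker = matches_column_bounds((0, 1), columns)

# Caveat (an assumption about edge cases): integer column labels 0 and 1
# coincide with ordinary row divisions, so this check alone cannot tell
# the two situations apart.
ambiguous = matches_column_bounds((0, 1), [0, 1])

print(row_like, old_marker, ambiguous)
```

Comparing against the column bounds ties the divisions to the values the index actually holds, rather than to an arbitrary marker.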

@mrocklin
Member Author

Good point. I think that I have resolved this in a recent commit.

@mrocklin mrocklin force-pushed the dataframe-reduction-arithmetic branch from 1342f7b to 2987070 Compare March 14, 2017 22:44
@mrocklin
Member Author

Any further comments, @jcrist? Merging this afternoon if not.

@mrocklin mrocklin force-pushed the dataframe-reduction-arithmetic branch from d899e68 to a131265 Compare March 15, 2017 13:50
    return (isinstance(s, Series) and
            s.npartitions == 1 and
            s.known_divisions and
            any(s.divisions == (min(df.columns), max(df.columns))
Member

This seems fine to me. Just for posterity, I was initially worried about three things:

  • What happens when columns is an unordered CategoricalIndex
  • What happens when columns is a MultiIndex
  • Performance of using min(columns) instead of columns.min()

The first one actually works fine, since the iterator goes over the values in the index, not the categorical codes (makes sense). For a MultiIndex, the iterator just yields tuples, so that also works. The third is interesting: the builtin min is way faster than the .min() method (not that it matters here, since columns is likely to be small). This is true for both Index and MultiIndex.

In [62]: ind = pd.Index(map(str, range(100000)))

In [63]: %timeit ind.min()
100 loops, best of 3: 10.8 ms per loop

In [64]: %timeit min(ind)
100 loops, best of 3: 3.16 ms per loop

In [65]: ind = pd.Index(['a', 'b', 'c', 'd', 'e'])  # Something smaller

In [66]: %timeit ind.min()
The slowest run took 4.64 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 30.2 µs per loop

In [67]: %timeit min(ind)
The slowest run took 43.95 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.62 µs per loop
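
The first two points above can be checked with a short pandas sketch. The index contents are made up for illustration; only the builtin min and max are used, as in the reviewed code.

```python
import pandas as pd

# Unordered categorical columns: iteration yields the values themselves,
# not the integer category codes, so builtin min/max compare the labels.
cat_cols = pd.CategoricalIndex(["b", "a", "c"], ordered=False)
cat_bounds = (min(cat_cols), max(cat_cols))

# MultiIndex columns: iteration yields tuples, which Python compares
# lexicographically.
multi_cols = pd.MultiIndex.from_tuples([("x", 2), ("x", 1), ("w", 9)])
multi_bounds = (min(multi_cols), max(multi_cols))

print(cat_bounds)
print(multi_bounds)
```

In both cases the bounds come from ordinary Python comparison of the iterated elements, so the tuple of bounds can be compared against a Series' divisions.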

@mrocklin mrocklin merged commit 175ae5a into dask:master Mar 15, 2017
@mrocklin mrocklin deleted the dataframe-reduction-arithmetic branch March 15, 2017 19:40
@sinhrks sinhrks added this to the 0.14.1 milestone Mar 30, 2017