Support arithmetic with row-series #2085

mrocklin · 2017-03-14T20:43:27Z

Sometimes a Series is just one row and is intended to be broadcast
across a DataFrame, rather than applied elementwise.

This establishes a convention: a Series with divisions (0, 1) signals
that it is just a single row, and is thus appropriate for broadcasting.

This enables computations like the following (which used to err):

df = df - df.mean()

However, it also introduces silent failures if an elementwise operation
against this series is intended, despite its divisions of (0, 1).
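
For reference, the pandas behavior this change mirrors can be sketched as follows. This is a small illustration in plain pandas rather than dask, and the column names and values are made up for the example. The result of `df.mean()` is indexed by the column labels, so subtraction aligns it with the columns and applies it to every row.

```python
import pandas as pd

# Illustrative data; the column names "a" and "b" are assumptions.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# df.mean() is a Series indexed by the column labels ("a", "b"),
# not by the row labels of df.
means = df.mean()

# pandas aligns the Series index with df's columns and subtracts
# each column's mean from every row (a row-wise broadcast).
centered = df - means
```

An elementwise operation, by contrast, would pair the Series with the rows of a single column, which is why a convention based only on divisions can be ambiguous.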

Fixes #1759

cc @jcrist for review

jcrist · 2017-03-14T20:49:29Z

It's not really "1 row"; rather, it's a series whose index is equal to the columns of the original dataframe. In this case, I'd rather check whether the divisions are equal to the (first, last) elements of the columns of the frame it's broadcasting against, and set them accordingly in the same place you have here. This also keeps the meaning of divisions consistent.
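
A minimal sketch of the check suggested above, written in plain Python. `SeriesStub`, `FrameStub`, and `looks_like_row_series` are hypothetical stand-ins invented for illustration; they are not dask classes or functions, and the real check lives in dask/dataframe/core.py.

```python
from collections import namedtuple

# Hypothetical stand-ins that carry only the attributes the check needs.
SeriesStub = namedtuple("SeriesStub", ["npartitions", "divisions"])
FrameStub = namedtuple("FrameStub", ["columns"])


def looks_like_row_series(s, df):
    """Return True if s's divisions span exactly the columns of df."""
    if s.npartitions != 1:
        return False
    # Unknown divisions are assumed to be represented as None here.
    if any(d is None for d in s.divisions):
        return False
    return s.divisions == (min(df.columns), max(df.columns))


frame = FrameStub(columns=["a", "b", "c"])
row_like = SeriesStub(npartitions=1, divisions=("a", "c"))
zero_one = SeriesStub(npartitions=1, divisions=(0, 1))
```

Under this sketch, `row_like` qualifies for broadcasting against `frame`, while a series with divisions (0, 1) only qualifies when the frame's columns themselves run from 0 to 1.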

mrocklin · 2017-03-14T21:31:34Z

Good point. I think that I have resolved this in a recent commit.

mrocklin · 2017-03-15T12:25:20Z

Any further comments, @jcrist? Merging this afternoon if not.

jcrist · 2017-03-15T19:33:25Z

dask/dataframe/core.py

+    return (isinstance(s, Series) and
+            s.npartitions == 1 and
+            s.known_divisions and
+            any(s.divisions == (min(df.columns), max(df.columns))


This seems fine to me. Just for posterity, I was initially worried about three things:

1. What happens when columns is an unordered CategoricalIndex
2. What happens when columns is a MultiIndex
3. Performance of using min(columns) instead of columns.min()

The first one actually works fine, since the iterator goes over the values in the index, not the categorical codes (makes sense). For a MultiIndex, the iterator just yields tuples, so that also works. The third is interesting: the builtin min is way faster than the min method (not that it matters here, since columns is likely to be small). This is true for both Index and MultiIndex.

    In [62]: ind = pd.Index(map(str, range(100000)))

    In [63]: %timeit ind.min()
    100 loops, best of 3: 10.8 ms per loop

    In [64]: %timeit min(ind)
    100 loops, best of 3: 3.16 ms per loop

    In [65]: ind = pd.Index(['a', 'b', 'c', 'd', 'e'])  # Something smaller

    In [66]: %timeit ind.min()
    The slowest run took 4.64 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 30.2 µs per loop

    In [67]: %timeit min(ind)
    The slowest run took 43.95 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000000 loops, best of 3: 1.62 µs per loop
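
The first two points can be illustrated with a short sketch. The index contents below are made up for the example; the only behavior relied on is that iterating a CategoricalIndex yields its values and iterating a MultiIndex yields tuples, so the builtin `min` and `max` compare those directly.

```python
import pandas as pd

# Illustrative column indexes; the labels are assumptions for this example.
cat_cols = pd.CategoricalIndex(["b", "a", "c"], ordered=False)
multi_cols = pd.MultiIndex.from_tuples([("b", 1), ("a", 2), ("a", 1)])

# The builtins iterate over the index: values for the categorical case,
# tuples (compared lexicographically) for the MultiIndex case.
cat_bounds = (min(cat_cols), max(cat_cols))
multi_bounds = (min(multi_cols), max(multi_cols))
```

Either pair of bounds could then be compared against a series' divisions in the same way as for a plain Index.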

Switch from (0, 1) to min/max of columns

2987070

mrocklin force-pushed the dataframe-reduction-arithmetic branch from 1342f7b to 2987070 Compare March 14, 2017 22:44

Merge branch 'master' into dataframe-reduction-arithmetic

a131265

mrocklin force-pushed the dataframe-reduction-arithmetic branch from d899e68 to a131265 Compare March 15, 2017 13:50

jcrist reviewed Mar 15, 2017

mrocklin merged commit 175ae5a into dask:master Mar 15, 2017

mrocklin deleted the dataframe-reduction-arithmetic branch March 15, 2017 19:40

sinhrks added this to the 0.14.1 milestone Mar 30, 2017

dhirschfeld mentioned this pull request Sep 19, 2018

Supporting alternative RandomState objects #3993

Closed

gjoseph92 mentioned this pull request Nov 3, 2021

map_partitions doesn't broadcast single-partition DataFrames #8338

Closed
