Does ddf.pipe() make sense? #1555

hussainsultan · 2016-09-18T22:18:03Z

Oftentimes, I end up writing a function that takes in a dask.dataframe. pandas implements DataFrame.pipe(func), which I find pretty convenient.

This is pretty easy to implement, but I think pipe may be pretty confusing in the dask.dataframe world, especially if someone tries to do columnar reductions. Thoughts on whether it makes sense to implement?

mrocklin · 2016-09-18T22:23:10Z

cc @TomAugspurger

shoyer · 2016-09-19T01:02:14Z

pipe is just sugar that lets you apply a function like a method. So, it seems pretty obvious and sensible to me.
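For readers less familiar with pipe, here is a minimal pandas-only sketch of the "sugar" point: chaining with pipe gives the same result as nesting the calls. The helpers add_total and keep_large are hypothetical names made up for this illustration; they are not part of pandas or dask.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [1., 2., 3., 4., 5.]})


def add_total(frame):
    # Hypothetical helper: return a copy with an extra 'total' column.
    return frame.assign(total=frame.x + frame.y)


def keep_large(frame, threshold):
    # Hypothetical helper: keep rows whose total exceeds the threshold.
    return frame[frame.total > threshold]


# Nested calls read inside-out ...
nested = keep_large(add_total(df), threshold=5)

# ... while pipe lets the same steps read left to right.
chained = df.pipe(add_total).pipe(keep_large, threshold=5)

assert chained.equals(nested)
```

Extra positional and keyword arguments given to pipe are forwarded to the function, which is how threshold reaches keep_large.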

> This is pretty easy to implement, but I think pipe may be pretty confusing in the dask.dataframe world, especially if someone tries to do columnar reductions.

I'm not sure I understand the concern here. How would this be more confusing for dask.dataframe than for pandas?

TomAugspurger · 2016-09-19T12:40:11Z

I think this makes sense too.

> pipe is just sugar that lets you apply a function like a method

Would dask expect the function to be a delayed function? Or would pipe accept any function that takes a dask.dataframe?

hussainsultan · 2016-09-19T15:14:05Z

Consider this scenario:

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                       'y': [1., 2., 3., 4., 5.]})

    ddf = dd.from_pandas(df, npartitions=2)

    def f(df):
        return df.x.sum()

    ddf.pipe(f).compute()

    0    6
    1    9
    dtype: int64

while pandas pipe returns:

    15

What should be the correct behavior here?

mrocklin · 2016-09-19T15:17:28Z

I think that dask.dataframe should almost always follow pandas semantics.

hussainsultan · 2016-09-19T15:43:00Z

In that case, it makes sense for the input function to be a delayed function. Thoughts?

ddf.pipe(delayed(f)).compute()

shoyer · 2016-09-19T15:50:42Z

I would literally copy the implementation of pipe from pandas, e.g.,

    def pipe(self, func, *args, **kwargs):
        if isinstance(func, tuple):
            func, target = func
            if target in kwargs:
                raise ValueError('%s is both the pipe target and a keyword '
                                 'argument' % target)
            kwargs[target] = self
            return func(*args, **kwargs)
        else:
            return func(self, *args, **kwargs)

df.pipe(func) should not separately map func over partitions, any more than calling func(df) does.
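To make the distinction concrete without depending on dask itself, here is a rough pure-Python sketch. ChunkedColumn, map_chunks, total and scaled_total are hypothetical stand-ins invented for this illustration (they are not dask or pandas APIs); only the body of pipe follows the pandas implementation quoted above.

```python
class ChunkedColumn:
    """Hypothetical stand-in for a partitioned collection (not a dask class)."""

    def __init__(self, chunks):
        self.chunks = chunks  # list of lists, one list per partition

    def sum(self):
        # Reduce across every partition: a whole-collection reduction.
        return sum(sum(chunk) for chunk in self.chunks)

    def map_chunks(self, func):
        # Apply func to each partition separately.
        return [func(chunk) for chunk in self.chunks]

    def pipe(self, func, *args, **kwargs):
        # Same logic as the pandas implementation quoted above.
        if isinstance(func, tuple):
            func, target = func
            if target in kwargs:
                raise ValueError('%s is both the pipe target and a keyword '
                                 'argument' % target)
            kwargs[target] = self
            return func(*args, **kwargs)
        return func(self, *args, **kwargs)


def total(col):
    return col.sum()


def scaled_total(factor, col):
    # Takes the collection as its second argument, named 'col'.
    return factor * col.sum()


col = ChunkedColumn([[1, 2, 3], [4, 5]])

whole = col.pipe(total)                        # 15, same as total(col)
per_chunk = col.map_chunks(sum)                # [6, 9], one value per partition
targeted = col.pipe((scaled_total, 'col'), 2)  # 30, col passed by keyword
```

With this reading, pipe hands the whole collection to the function, so a reduction inside it yields a single value; the per-partition result only appears when the function is explicitly mapped over the chunks. The tuple form is for functions that do not take the collection as their first argument.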

jcrist added a commit to jcrist/dask that referenced this issue Sep 22, 2016

Add pipe method to dask.dataframe

2b4ecf0

Fixes dask#1555

jcrist mentioned this issue Sep 22, 2016

Add pipe method to dask.dataframe #1567

Merged

jcrist closed this as completed in #1567 Sep 23, 2016

jcrist added a commit that referenced this issue Sep 23, 2016

Add pipe method to dask.dataframe (#1567)

32ad1a0

Fixes #1555

sinhrks added this to the 0.11.1 milestone Oct 9, 2016
