Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does ddf.pipe() make sense? #1555

Closed
hussainsultan opened this issue Sep 18, 2016 · 7 comments
Closed

Does ddf.pipe() make sense? #1555

hussainsultan opened this issue Sep 18, 2016 · 7 comments
Milestone

Comments

@hussainsultan
Copy link
Contributor

Often times, i end up writing a function that takes in a dask.dataframe. pandas implements pd.pipe(func) that i find pretty convenient.

This is a pretty easy to implement but i think pipe may be pretty confusing in dask.dataframe world especially if someone tries to do columnar reductions. Thoughts on if it makes sense to implement?

@mrocklin
Copy link
Member

cc @TomAugspurger

@shoyer
Copy link
Member

shoyer commented Sep 19, 2016

pipe is just sugar that lets you apply a function like a method. So, it seems pretty obvious and sensible to me.

This is a pretty easy to implement but i think pipe may be pretty confusing in dask.dataframe world especially if someone tries to do columnar reductions.

I'm not sure I understand the concern here. How would this be more confusing for dask.dataframe than for pandas?

@TomAugspurger
Copy link
Member

I think this makes sense too.

pipe is just sugar that lets you apply a function like a method

Would dask expect the function to be a delayed function? Or would pipe accept any function that takes adask.dataframe?

@hussainsultan
Copy link
Contributor Author

hussainsultan commented Sep 19, 2016

consider this scenario:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                          'y': [1., 2., 3., 4., 5.]})

ddf = dd.from_pandas(df, npartitions=2)
def f(df):
    return df.x.sum()

ddf.pipe(f).compute()
0    6
1    9
dtype: int64

while pandas pipe returns:

15

What should be the correct behavior here?

@mrocklin
Copy link
Member

I think that dask.dataframe should almost always follow pandas semantics

@hussainsultan
Copy link
Contributor Author

in that case, it makes sense for the input function to be a delayed method. Thoughts?

ddf.pipe(delayed(f)).compute()

@shoyer
Copy link
Member

shoyer commented Sep 19, 2016

I would literally copy the implementation of pipe from pandas, e.g.,

    def pipe(self, func, *args, **kwargs):
        if isinstance(func, tuple):
            func, target = func
            if target in kwargs:
                raise ValueError('%s is both the pipe target and a keyword '
                                 'argument' % target)
            kwargs[target] = self
            return func(*args, **kwargs)
        else:
            return func(self, *args, **kwargs)

df.pipe(func) should not separately map func over partitions no more than calling a func(df) does.

jcrist added a commit to jcrist/dask that referenced this issue Sep 22, 2016
jcrist added a commit that referenced this issue Sep 23, 2016
@sinhrks sinhrks added this to the 0.11.1 milestone Oct 9, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants