[SPARK-23060][Python] New feature - apply method to extend rdd's functionality #20258

gianmarcodonetti commented Jan 13, 2018

What changes were proposed in this pull request?

Extend the RDD class with an apply method.
It behaves like a pipe operator attached to the RDD itself: rdd.apply(f) returns f(rdd).
Example:

def foo(rdd): return rdd.map(lambda x: x.split('|')).filter(lambda x: x[0] == 'ERROR')
rdd = sc.parallelize(['ERROR|10', 'ERROR|12', 'WARNING|10', 'INFO|2'])
result = rdd.apply(foo)
result.collect()
[['ERROR', '10'], ['ERROR', '12']]

The idea is to have something like pandas' DataFrame.pipe:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pipe.html

How was this patch tested?

Tested manually. The patch is small.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Member

@HyukjinKwon HyukjinKwon left a comment

Isn't it just a helper function?

def apply(self, func):
    return func(self)

I don't think it's worth adding.
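
For readers without a Spark session at hand, here is a minimal sketch of what that two-line helper does. MiniRDD is a hypothetical stand-in backed by a plain list, not part of PySpark; it only mimics map, filter and collect so the example runs on its own.

```python
class MiniRDD:
    """Hypothetical stand-in for an RDD, backed by a plain Python list."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        return MiniRDD(f(x) for x in self.items)

    def filter(self, pred):
        return MiniRDD(x for x in self.items if pred(x))

    def collect(self):
        return list(self.items)

    def apply(self, func):
        # The proposed method: pass the whole collection to func.
        return func(self)


def foo(rdd):
    # Same transformation as in the PR description.
    return rdd.map(lambda x: x.split('|')).filter(lambda x: x[0] == 'ERROR')


rdd = MiniRDD(['ERROR|10', 'ERROR|12', 'WARNING|10', 'INFO|2'])
result = rdd.apply(foo).collect()
print(result)  # [['ERROR', '10'], ['ERROR', '12']]
```

Note that rdd.apply(foo) and foo(rdd) give the same result, which is the point being made here: the method adds call syntax, not behaviour.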

@gianmarcodonetti
Author

@HyukjinKwon in my opinion, it helps a lot.
My goal is to avoid this case:

final_rdd = func_3(func_2(func_1(initial_rdd)))

And allow this instead:

final_rdd = initial_rdd.apply(func_1).apply(func_2).apply(func_3)

It reads left to right, so it is more functional in style and more readable.
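
The same left-to-right ordering can probably be had without touching the RDD class. The sketch below assumes a free helper, pipe_through (a hypothetical name, not an existing Spark or PySpark function), and uses plain lists in place of RDDs so it runs without Spark; func_1 to func_3 are illustrative placeholders.

```python
from functools import reduce


def pipe_through(value, *funcs):
    """Apply each function in funcs to value, left to right."""
    return reduce(lambda acc, f: f(acc), funcs, value)


# Illustrative stages; with Spark these would take and return RDDs.
def func_1(xs):
    return [x.split('|') for x in xs]


def func_2(xs):
    return [x for x in xs if x[0] == 'ERROR']


def func_3(xs):
    return [int(x[1]) for x in xs]


initial = ['ERROR|10', 'ERROR|12', 'WARNING|10', 'INFO|2']

nested = func_3(func_2(func_1(initial)))
piped = pipe_through(initial, func_1, func_2, func_3)
print(piped)  # [10, 12]
```

Both forms compute the same value; the helper only changes the order in which the stages are written.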

@HyukjinKwon
Member

That resembles pipe, as I pointed out in the JIRA. It's just a small trick, and I don't think it's worth adding a new API for it alone.

BTW, we should also consider the Java / Scala APIs and how this would work with Dataset and DataFrame.

@ueshin
Member

ueshin commented Jan 16, 2018

Isn't this similar to Dataset.transform() in the Java/Scala API? We don't have a similar API for RDDs, though.

@HyukjinKwon
Member

Oh, I see! Yeah, they look much the same.

@srowen
Member

srowen commented Jan 16, 2018

At best, this functionality already exists in some form in the newer API. This should be closed.

@AmplabJenkins

Can one of the admins verify this patch?

@holdenk
Contributor

holdenk commented Feb 26, 2018

I'm +1 to @srowen on this; I don't believe this is a change we're going to make to the API. @gianmarcodonetti please close this PR.
