[WIP][SPARK-28006] User-defined grouped transform pandas_udf for window operations #24896
Still WIP, but the implementation looks pretty straightforward. Need to make WindowExecInPandas handle multiple eval types...
|
Test build #106597 has finished for PR 24896 at commit
e.asInstanceOf[PythonUDF].evalType == PythonEvalType.SQL_GROUPED_XFORM_PANDAS_UDF
}

// This is currently same as GroupedAggPandasUDF, but we might support new types in the future,
Should this comment be changed as well?
|
Test build #109638 has finished for PR 24896 at commit
|
fe80821 to 9bb37ba
9bb37ba to 43eba25
|
Test build #111855 has finished for PR 24896 at commit
|
|
Just want to give a quick status update here. I still think this is a useful addition to the current Pandas UDF and quite an easy one to implement. Internally we have quite a few use cases that could use grouped transform. Currently I am not pushing for this change because of SPARK-28264. @HyukjinKwon @BryanCutler what are your thoughts on getting this into 3.0 as well? I am happy to do the work here.
|
@icexelloss I'm a little confused about a couple things
|
|
@BryanCutler, I think yes for both questions. BTW, I guess the name should be something similar with |
|
I don't feel strongly about this. I will defer to other people here.
|
If the answer is yes to question (2) above, then I think we should hold off on this until SPARK-28264 is sorted out, and it will be good to keep this use case in mind too. |
|
@BryanCutler @HyukjinKwon Thanks both for the feedback. I am hoping we could reach some agreement about the functionality here. The spelling of course will depend on SPARK-28264. Do we think that because grouped_map already exists this functionality is not so useful? |
|
That is roughly my thinking for now, so I don't feel strongly. I or somebody else will probably give SPARK-28264 a try soon.
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Currently, pandas_udf supports a "grouped aggregate" type that can be used with bounded and unbounded windows. There is another set of use cases that could benefit from a "grouped transform" type of pandas_udf.
Grouped transform is defined as an N -> N mapping over a group: the function receives all N values of a group and returns exactly N values, one per input row. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".
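To make the N -> N contract concrete, here is a minimal pure-Python sketch of the semantics. It is not the PR's implementation and does not use Spark; the helper names `grouped_transform`, `zscore`, and `rank` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def grouped_transform(keys, values, fn):
    """Apply fn to each group's values and return results aligned with the input rows.

    Hypothetical helper illustrating the N -> N semantics; not a Spark API.
    """
    # Collect the row positions that belong to each group key.
    positions = defaultdict(list)
    for i, key in enumerate(keys):
        positions[key].append(i)

    out = [None] * len(values)
    for idx in positions.values():
        group = [values[i] for i in idx]
        transformed = fn(group)
        # The defining property: one output value per input row in the group.
        if len(transformed) != len(group):
            raise ValueError("a grouped transform must return one value per input row")
        for i, result in zip(idx, transformed):
            out[i] = result
    return out


def zscore(group):
    # Uses the sample standard deviation of the group.
    m, s = mean(group), stdev(group)
    return [(x - m) / s for x in group]


def rank(group):
    # 1-based rank within the group; ties are broken by input order.
    order = sorted(range(len(group)), key=lambda i: group[i])
    ranks = [0] * len(group)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


keys = ["a", "b", "a", "b", "a"]
values = [1.0, 10.0, 2.0, 30.0, 3.0]
zscores = grouped_transform(keys, values, zscore)
ranks = grouped_transform(keys, values, rank)
```

Both outputs have the same length as `values`, and each element depends only on the rows that share its key. This is what separates a grouped transform from a grouped aggregate, which reduces N values to one.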
Currently, in order to do this, the user needs to use "grouped apply", for example:
This approach has a few downsides:
Specifying the full return schema is complicated for the user although the function only changes one column.
Here we propose a new type of pandas_udf to work with these types of use cases:
This addresses the above downsides.
This is similar to groupby transform in pandas, hence the name "grouped transform".
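For readers unfamiliar with the pandas feature this name comes from, the following sketch shows `groupby(...).transform` in plain pandas, with no Spark involved. The column names `id`, `v`, `v_zscore`, and `v_rank` are assumptions made for the example. The function only sees the one column it operates on, so no full return schema is needed.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]}
)


def zscore(v: pd.Series) -> pd.Series:
    # Receives one group's values and returns a Series of the same length.
    return (v - v.mean()) / v.std()


# transform applies zscore per group and aligns the result with the original rows.
df["v_zscore"] = df.groupby("id")["v"].transform(zscore)

# Ranking within each group is another N -> N operation.
df["v_rank"] = df.groupby("id")["v"].rank()
```

The proposed pandas_udf type would presumably play the same role for a Spark window partition that `zscore` plays here for a pandas group, although the exact spelling of the Spark API was still under discussion in this thread.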
How was this patch tested?
Added new tests in test_pandas_udf_window.