[SPARK-21692][PYSPARK][SQL] Add nullability support to PythonUDF. #18906
Conversation
@ptkool Thank you for working on this!
@ueshin Thanks for commenting. It's unfortunate that users find nullability confusing; anyone coming from the SQL world should be quite familiar with nullability and null values. Nevertheless, Spark has a few issues with nullability, this being one of them, that I believe need to be addressed. The fact that the Catalyst optimizer considers nullability in several optimization rules makes this especially important. As for a use case, consider any platform using Spark where a null value is considered a "real" value, whether valid or invalid in the given context, and where data must conform to a particular schema. When Python UDFs are used, nullability must be specified correctly for conformance to work correctly. Also, a PR was recently merged to address this issue on the Scala side.
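The schema-conformance point above can be sketched in plain Python (a hypothetical illustration, not Spark code): a schema that marks a field non-nullable must reject rows where that field is None, and that is exactly the information a UDF's nullability annotation would supply.

```python
# Hypothetical sketch (not Spark API): why declared nullability matters for
# schema conformance. A field declared non-nullable must reject None values.

def conforms(rows, schema):
    """schema: list of (name, nullable) pairs; rows: list of dicts."""
    for row in rows:
        for name, nullable in schema:
            if row.get(name) is None and not nullable:
                return False
    return True

schema = [("id", False), ("score", True)]
print(conforms([{"id": 1, "score": None}], schema))    # True: score may be null
print(conforms([{"id": None, "score": 2.5}], schema))  # False: id must not be null
```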
@ptkool Have you seen a real use case for this so far? I'm a bit surprised that you'd care about this, since Python UDFs are already pretty slow. Are there other cases you've run into? One thing we can do is a runtime non-null check (insert an AssertNotNull) when the UDF is annotated as not null.
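The runtime check mentioned above could look roughly like this on the Python side (a hypothetical sketch of the idea behind AssertNotNull, not Spark's actual implementation):

```python
# Hypothetical sketch: fail fast at runtime if a UDF that was declared
# non-nullable actually returns None (the Python-side analogue of inserting
# an AssertNotNull on the JVM side).

def assert_not_null(f):
    def wrapped(*args):
        result = f(*args)
        if result is None:
            raise ValueError("non-nullable UDF returned None")
        return result
    return wrapped

safe_len = assert_not_null(lambda s: len(s) if s is not None else None)
print(safe_len("spark"))  # 5
```

Calling `safe_len(None)` raises a ValueError instead of silently propagating a null into a column the user declared non-nullable.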
@rxin We have several large systems with hundreds of Spark jobs implemented in Python and PySpark, and we use Python UDFs due to a lack of equivalent functionality in Spark. I understand what you're saying regarding Python UDFs being slow and using AssertNotNull, but making this kind of change would be a huge effort.
I understand why you are using Python. What I don't understand is why you'd need to annotate nullability, because that is typically done for performance, and Python UDFs are already slow enough that most performance improvements are probably not going to help you much. Can you tell me the case in which you care about this?
This PR isn't about performance at all. I realize Python UDFs do not perform well, and I also realize that annotating Python UDFs with nullability is not going to make any difference performance-wise. This PR is about correctness: giving the Catalyst optimizer accurate nullability information and allowing output data to conform to a required schema.
Jenkins, ok to test.

Test build #83752 has finished for PR 18906 at commit
Test build #83786 has finished for PR 18906 at commit
Test build #83796 has finished for PR 18906 at commit

Test build #83798 has finished for PR 18906 at commit

Test build #83814 has finished for PR 18906 at commit
CC @HyukjinKwon
dev/sparktestsupport/modules.py
This seems like it was included by mistake. If you did this to quickly run selected tests, you can use the --modules option with ./python/run-tests instead.
I am actually not quite clear on the use cases. Providing actual code and elaborating on it would be helpful.
Could you link the equivalent Scala-side API endpoints so we can check API consistency?
Here are the similar changes in the Scala API: #17911
I meant the actual equivalent endpoints and actual code with use cases.
Test build #83846 has finished for PR 18906 at commit
So I think that, with the performance improvements coming into Python UDFs, annotating results as nullable or not could make sense (although I imagine we'd need to do something different for the vectorized UDFs if it isn't already being done). Let's loop in @BryanCutler, but I think the performance improvements could be reasonable to think about in Spark 2.3+.
I believe the equivalent API in Scala is only available in the following form when registering a UDF. Would it be preferable to just stick with a similar API for Python if we are trying to match the behavior?
Regarding performance increases with vectorized UDFs, right now the Java side is implemented to accept only nullable return types, so there wouldn't be any difference. In the future it would be possible to accept either, and that would give a little performance bump.
Thanks for the background, Bryan :) So it sounds like, from an API perspective, it makes sense to support this in the future, possibly on the Pandas UDFs (but the code isn't there on the JVM side). I'd say, if @ptkool has the time, it might make sense to match the Scala API on the current UDFs; it's easier when we want to add this to the Pandas UDFs.
Test build #95624 has finished for PR 18906 at commit

Test build #95677 has finished for PR 18906 at commit

Test build #95712 has finished for PR 18906 at commit

Test build #96765 has finished for PR 18906 at commit

Test build #101127 has finished for PR 18906 at commit
@HyukjinKwon Can I get this reviewed again?
Test build #108464 has finished for PR 18906 at commit

Test build #108477 has finished for PR 18906 at commit

Test build #110287 has finished for PR 18906 at commit
@HyukjinKwon @rxin Any chance of this being merged at some point?
Test build #116975 has finished for PR 18906 at commit

Test build #117006 has finished for PR 18906 at commit
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
When registering a Python UDF, a user may know whether the function can return null values. PythonUDF and all related classes should therefore handle and propagate nullability.
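As an illustration only (the class and parameter names below are assumptions, not the merged API), the extra information PythonUDF would need to carry might look like:

```python
# Hypothetical illustration (not actual PySpark API): a UDF descriptor that
# carries a nullable flag alongside the return type, mirroring what PythonUDF
# and related classes would need to propagate to the optimizer.

class PythonUDFStub:
    def __init__(self, func, return_type, nullable=True):
        self.func = func
        self.return_type = return_type
        self.nullable = nullable  # user-declared; defaults to nullable

    def __call__(self, *args):
        return self.func(*args)

to_upper = PythonUDFStub(lambda s: s.upper(), "string", nullable=False)
print(to_upper.nullable)  # False
print(to_upper("spark"))  # SPARK
```

Defaulting `nullable` to True preserves today's behavior for callers who do not annotate anything.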
How was this patch tested?
Existing tests and a few new tests.