[SPARK-21692][PYSPARK][SQL] Add nullability support to PythonUDF. #18906
Conversation
@ptkool Thank you for working on this!
@ueshin Thanks for commenting. It's unfortunate that users find nullability confusing; anyone coming from the SQL world should be quite familiar with nullability and null values. Nevertheless, Spark has a few issues with nullability, this being one of them, that I believe need to be addressed. The fact that the Catalyst optimizer considers nullability in several optimization rules makes this especially important. As for a use case, consider any platform using Spark where a null value is considered a "real" value, whether valid or invalid in the given context, and where data must conform to a particular schema. When Python UDFs are used, nullability must be specified correctly for conformance to work correctly. Also, a PR was recently merged to address this issue on the Scala side.
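The schema-conformance point above can be sketched in plain Python (a hypothetical illustration, not Spark code): a schema that marks a field non-nullable must reject rows where that field is None, and that is exactly the information a UDF's nullability annotation would supply.

```python
# Hypothetical sketch (not Spark API): why declared nullability matters for
# schema conformance. A field declared non-nullable must reject None values.

def conforms(rows, schema):
    """schema: list of (name, nullable) pairs; rows: list of dicts."""
    for row in rows:
        for name, nullable in schema:
            if row.get(name) is None and not nullable:
                return False
    return True

schema = [("id", False), ("score", True)]
print(conforms([{"id": 1, "score": None}], schema))    # True: score may be null
print(conforms([{"id": None, "score": 2.5}], schema))  # False: id must not be null
```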
@ptkool Have you seen a real use case for this so far? I'm a bit surprised that you'd care about this, since Python UDFs are already pretty slow. Are there other cases you've run into? One thing we can do is a runtime non-null check (insert an AssertNotNull) when the UDF is annotated as not null.
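The runtime check mentioned above could look roughly like this on the Python side (a hypothetical sketch of the idea behind AssertNotNull, not Spark's actual implementation):

```python
# Hypothetical sketch: fail fast at runtime if a UDF that was declared
# non-nullable actually returns None (the Python-side analogue of inserting
# an AssertNotNull on the JVM side).

def assert_not_null(f):
    def wrapped(*args):
        result = f(*args)
        if result is None:
            raise ValueError("non-nullable UDF returned None")
        return result
    return wrapped

safe_len = assert_not_null(lambda s: len(s) if s is not None else None)
print(safe_len("spark"))  # 5
```

Calling `safe_len(None)` raises a ValueError instead of silently propagating a null into a column the user declared non-nullable.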
@rxin We have several large systems with hundreds of Spark jobs implemented in Python and PySpark, and we use Python UDFs due to a lack of equivalent functionality in Spark. I understand what you're saying regarding Python UDFs being slow and using AssertNotNull, but making this kind of change would be a huge effort.
I understand why you are using Python. What I don't understand is why you'd need to annotate nullability, because that is typically done for performance, and Python UDFs are already slow enough that most performance improvements are probably not going to help you much. Can you tell me the case in which you care about this?
This PR isn't about performance at all. I realize Python UDFs do not perform well, and I also realize that annotating Python UDFs with nullability is not going to make any difference performance-wise. This PR is about correctness: giving the Catalyst optimizer accurate nullability information and allowing output data to conform to a required schema.
Jenkins, ok to test.

Test build #83752 has finished for PR 18906 at commit
Test build #83786 has finished for PR 18906 at commit
Test build #83796 has finished for PR 18906 at commit

Test build #83798 has finished for PR 18906 at commit

Test build #83814 has finished for PR 18906 at commit
CC @HyukjinKwon
dev/sparktestsupport/modules.py
This seems like it was included by mistake. If you did this to quickly run selected tests, you can use the --modules option with ./python/run-tests instead.
I am actually not quite clear on the use cases. Providing actual code and elaborating on it would be helpful.
Could you link the equivalent Scala-side API endpoints so we can check API consistency?
Here are the similar changes in the Scala API: #17911
I meant the actual equivalent endpoints and actual code with use cases.
Test build #83846 has finished for PR 18906 at commit
So I think that, with the performance improvements coming into Python UDFs, annotating results as nullable or not could make sense (although I imagine we'd need to do something different for the vectorized UDFs if it isn't already being done). Let's loop in @BryanCutler, but I think the performance improvements could be reasonable to think about in Spark 2.3+.
I believe the equivalent API in Scala is only available in the following form when registering a UDF. Would it be preferable to just stick with a similar API for Python if we are trying to match the behavior?
Regarding performance increases with vectorized UDFs, right now the Java side is implemented to accept only nullable return types, so there wouldn't be any difference. In the future it would be possible to accept either, and that would give a little performance bump.
Thanks for the background, Bryan :) So it sounds like, from an API perspective, it makes sense to support this in the future, possibly on the Pandas UDFs (but the code isn't there on the JVM side). I'd say, if @ptkool has the time, it might make sense to match the Scala API on the current UDFs; it's easier when we want to add this to the Pandas UDFs.
Test build #95624 has finished for PR 18906 at commit

Test build #95677 has finished for PR 18906 at commit

Test build #95712 has finished for PR 18906 at commit

Test build #96765 has finished for PR 18906 at commit

Test build #101127 has finished for PR 18906 at commit
@HyukjinKwon Can I get this reviewed again?
Test build #108464 has finished for PR 18906 at commit

Test build #108477 has finished for PR 18906 at commit

Test build #110287 has finished for PR 18906 at commit
@HyukjinKwon @rxin Any chance of this being merged at some point?
Test build #116975 has finished for PR 18906 at commit

Test build #117006 has finished for PR 18906 at commit
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
When registering a Python UDF, a user may know whether the function can return null values. PythonUDF and all related classes should therefore handle and propagate nullability.
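As an illustration only (the class and parameter names below are assumptions, not the merged API), the extra information PythonUDF would need to carry might look like:

```python
# Hypothetical illustration (not actual PySpark API): a UDF descriptor that
# carries a nullable flag alongside the return type, mirroring what PythonUDF
# and related classes would need to propagate to the optimizer.

class PythonUDFStub:
    def __init__(self, func, return_type, nullable=True):
        self.func = func
        self.return_type = return_type
        self.nullable = nullable  # user-declared; defaults to nullable

    def __call__(self, *args):
        return self.func(*args)

to_upper = PythonUDFStub(lambda s: s.upper(), "string", nullable=False)
print(to_upper.nullable)  # False
print(to_upper("spark"))  # SPARK
```

Defaulting `nullable` to True preserves today's behavior for callers who do not annotate anything.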
How was this patch tested?
Existing tests and a few new tests.