Skip to content

Conversation

@ptkool
Copy link
Contributor

@ptkool ptkool commented Aug 10, 2017

What changes were proposed in this pull request?

When registering a Python UDF, a user may know whether the function can return null values or not. PythonUDF and all related classes should handle nullability.

How was this patch tested?

Existing tests and a few new tests.

@ptkool ptkool changed the title [SPARK-21692] Add nullability support to PythonUDF. [SPARK-21692][PYSPARK][SQL] Add nullability support to PythonUDF. Aug 11, 2017
@ueshin
Copy link
Member

ueshin commented Aug 21, 2017

@ptkool Thank you for working on this!
I'd like to ask what your use-case is. Users have historically been confused about what nullable means, and we don't think we should give them yet another avenue to get it wrong.

@ptkool
Copy link
Contributor Author

ptkool commented Aug 21, 2017

@ueshin Thanks for commenting.

It's unfortunate that users find nullability confusing. If you're coming from a SQL world, you should be quite familiar with nullability and null values. Nevertheless, Spark has a few issues with nullability, this being one of them, that I believe need to be addressed. And given the fact that the Catalyst optimizer considers nullability in several optimization rules makes this extremely important.

As far as a use case, consider any platform using Spark where a null value is considered a "real" value, whether valid or invalid, in the given context, and where data must conform to a particular schema. When Python UDFs are being used, nullability must be specified correctly in order for conformance to work correctly.

Also, a PR was recently merged to address this issue on the Scala side.
https://issues.apache.org/jira/browse/SPARK-20668

@rxin
Copy link
Contributor

rxin commented Aug 22, 2017

@ptkool have you seen a real use case so far that you need this? I'm a bit surprised since Python UDFs are already pretty slow, and you'd care about this. Are there other cases you run into?

One thing we can do is to do a runtime non-null check (insert an AssertNotNull) when this is annotated with not null.

@ptkool
Copy link
Contributor Author

ptkool commented Aug 23, 2017

@rxin We have several large systems with 100s of Spark jobs implemented in Python and PySpark, and use Python UDFs due to lack of equivalent functionality in Spark. I understand what your saying re Python UDFs being slow and using AssertNotNull, but making this kind of change would be a huge effort.

@rxin
Copy link
Contributor

rxin commented Aug 24, 2017

I understand why you are using Python. What I don't understand is why you'd need to annotate nullability, because those are typically annotated for the purpose of performance improvement, but Python UDFs are already slow and most performance improvements are probably not going to help you much there.

Can you tell me the case why you care about this?

@ptkool
Copy link
Contributor Author

ptkool commented Aug 30, 2017

@rxin

This PR isn't about performance at all. I realize Python UDFs do not perform well and I also realize annotating Python UDFs with nullability is not going to make any difference performance-wise.

This PR is about:

  • Providing users an avenue to get nullability right.
  • Satisfying a use case that I described in my initial comments.
  • Consistency between the Python and Scala APIs.

@holdenk
Copy link
Contributor

holdenk commented Nov 12, 2017

Jenkins, ok to test.

@SparkQA
Copy link

SparkQA commented Nov 12, 2017

Test build #83752 has finished for PR 18906 at commit 38dc32d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ptkool ptkool force-pushed the udf_nullability branch 5 times, most recently from 10aa412 to 38dc32d Compare November 13, 2017 11:07
@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83786 has finished for PR 18906 at commit 38dc32d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ptkool ptkool force-pushed the udf_nullability branch 2 times, most recently from 721ca5e to 0bee999 Compare November 13, 2017 17:59
@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83796 has finished for PR 18906 at commit 0bee999.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83798 has finished for PR 18906 at commit 402a814.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 14, 2017

Test build #83814 has finished for PR 18906 at commit feefb3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Nov 14, 2017

CC @HyukjinKwon

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it was included by mistake? If you did this to quickly run selected tests you can use the --module with ./python/run-tests

@HyukjinKwon
Copy link
Member

Satisfying a use case that I described in my initial comments.

I am actually not quite clear of the usecases. Providing actual codes and elaborating it should be helpful.

Consistency between the Python and Scala APIs.

Could you link the equivalent Scala side API endpoints to check the API consistency?

@ptkool
Copy link
Contributor Author

ptkool commented Nov 14, 2017

Here are the similar changes in the Scala API: #17911

@HyukjinKwon
Copy link
Member

I meant the actual equivalent endpoints and actual codes with usecases.

@SparkQA
Copy link

SparkQA commented Nov 14, 2017

Test build #83846 has finished for PR 18906 at commit 9856be6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Nov 18, 2017

So I think with the performance improvements coming into Python UDFs considering annotating results as nullable or not could make sense (although I imagine we'd need to do something differeent for the vectorized UDFs if they aren't already being done).

Let's loop in @BryanCutler , but the I think the performance improvements could be reasonable to be thinking about in Spark 2.3+.

@BryanCutler
Copy link
Member

I believe the equivalent API in Scala would only be in the following form when registering a UDF

spark.udf.register("func", () => { 1 }).asNonNullable()

Would it be preferable to just stick with a similar API for Python if we are trying to match the behavior?

So I think with the performance improvements coming into Python UDFs considering annotating results as nullable or not could make sense (although I imagine we'd need to do something differeent for the vectorized UDFs if they aren't already being done).

Regarding performance increases with vectorized UDFs, right now the Java side is only implemented to accept nullable return types, so there wouldn't be any difference. In the future it would be possible to accept either and that would give a little performance bump.

@holdenk
Copy link
Contributor

holdenk commented Nov 25, 2017

Thanks for the background Bryan :) So it sounds like from an API perspective it makes sense to support this in the future possibly on the Pandas UDFs (but the code isn't there on the JVM side). I'd say if @ptkool has the time it might make sense to match the scala API on the current UDFs its easier when we want to add this to the Panda's UDFs

@SparkQA
Copy link

SparkQA commented Sep 3, 2018

Test build #95624 has finished for PR 18906 at commit 9038520.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 4, 2018

Test build #95677 has finished for PR 18906 at commit dcf3f07.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 5, 2018

Test build #95712 has finished for PR 18906 at commit 97305f5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 28, 2018

Test build #96765 has finished for PR 18906 at commit fac6f1e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 12, 2019

Test build #101127 has finished for PR 18906 at commit eebc18f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ptkool
Copy link
Contributor Author

ptkool commented Jan 12, 2019

@HyukjinKwon Can I get this reviewed again?

@SparkQA
Copy link

SparkQA commented Jul 31, 2019

Test build #108464 has finished for PR 18906 at commit e1d68e8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 31, 2019

Test build #108477 has finished for PR 18906 at commit be5735a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 7, 2019

Test build #110287 has finished for PR 18906 at commit 9cfc5b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ptkool
Copy link
Contributor Author

ptkool commented Sep 10, 2019

@HyukjinKwon @rxin Any chance of this being merged some time?

@ptkool ptkool requested review from HyukjinKwon and ueshin December 14, 2019 18:47
@SparkQA
Copy link

SparkQA commented Jan 18, 2020

Test build #116975 has finished for PR 18906 at commit b074eb1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 19, 2020

Test build #117006 has finished for PR 18906 at commit 516a708.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 25, 2020
@github-actions github-actions bot closed this Jun 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants