Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18777][PYTHON][SQL] Return UDF from udf.register #17831

Closed
wants to merge 4 commits into from

Conversation

zero323
Copy link
Member

@zero323 zero323 commented May 2, 2017

What changes were proposed in this pull request?

  • Move udf wrapping code from functions.udf to functions.UserDefinedFunction.
  • Return wrapped udf from catalog.registerFunction and dependent methods.
  • Update docstrings in catalog.registerFunction and SQLContext.registerFunction.
  • Unit tests.

How was this patch tested?

  • Existing unit tests and docstests.
  • Additional tests covering new feature.

@zero323 zero323 changed the title [SPARK-18777] Return UDF from udf.register [SPARK-18777][PYTHON][SQL] Return UDF from udf.register May 2, 2017
@SparkQA
Copy link

SparkQA commented May 2, 2017

Test build #76380 has finished for PR 17831 at commit 9edc5e9.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 2, 2017

Test build #76381 has finished for PR 17831 at commit ced8814.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 2, 2017

Test build #76382 has finished for PR 17831 at commit e8615c2.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 2, 2017

Test build #76383 has finished for PR 17831 at commit 18c0253.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented May 2, 2017

Thank so for doing this, will let me cleanup some of my PySpark notebooks :)

@holdenk
Copy link
Contributor

holdenk commented May 2, 2017

LGTM, since this is just exposing existing functionality I think it makes sense to merge into 2.2 as well, but I'll leave this for a day incase anyone has any comments about this.

@gatorsmile
Copy link
Member

Python interface udf.register needs to be consistent with the scala one.

@holdenk
Copy link
Contributor

holdenk commented May 2, 2017

I don't think the Scala API is particularly consistent and I'd argue that this does actually match the current API as much as is reasonable. Looking at the udf.register interface defined in UDFRegistration.scala this does return either a UserDefinedAggregateFunction or UserDefinedFunction when passed in UserDefinedAggregateFunction or a Function0 through Function22. depending on which function is passed in most of the time. We do return Unit for the private registerJavaFunction & when registering a UDF1 through UDF22 . (Also @gatorsmile this doesn't change the scala API, its really just plumbing through the existing Scala API to Python).

@gatorsmile
Copy link
Member

In Scala API, we have five different ways to register UDFs. We have to be more careful when changing the external APIs. It makes the users confused if we make the APIs more diverse. I do not suggest to make any further change in Python UDF, until we figure out how to make us easy to add more UDF types.

@holdenk
Copy link
Contributor

holdenk commented May 2, 2017

I agree we need to be careful when making external API changes, but this is a pretty clear win for Python users and it matches the Scala API so I think improving this for Python users shouldn't necessarily block on any unification of the Scala API (is their a plan to improve this or JIRA associated with the planned Scala API changes I could subscribe to notifications on)?

@gatorsmile
Copy link
Member

So far, one of the biggest issues is we always assume UDF is deterministic. It could cause incorrect results after query optimization. I am working on a UDF API change to add more types for UDFs. Will submit a PR recently.

We can revisit this PR later.

@holdenk
Copy link
Contributor

holdenk commented May 2, 2017

I feel like that's an unrelated challenge. I'm happy to see other improvements but I'm worried that we will hold up changes for things which aren't happening soon - is there a JIRA for these changes?

@zero323
Copy link
Member Author

zero323 commented May 2, 2017

We can revisit this PR later.

@gatorsmile This sounds reasonable but I am not sure if I fully understand your concerns.

If anything this brings PySpark closer to the Scala API. At this moment we have

 registerFunction(self, name: str, f: Callable[[T], U], returnType: DataType) -> None: ...

and we would move to:

 registerFunction(self, name: str, f: Callable[[T], U], returnType: DataType) -> Callable[[Column], Column]: ...

and similarly with other We can revisit this PR later.

This, as pointed out by @holdenk, matches register API for Function0 .. Function22.

If you're planning breaking changes in the Scala API, it may render this PR obsolete, but we don't commit here to any particular implementation. The only promise here is that registering udf for SQL applications, returns an object, which can be used with DataFrame API. I believe this sounds like a reasonable requirement for any upcoming API.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, it looks good to me.

>>> from pyspark.sql.types import IntegerType
>>> spark.catalog.registerFunction("stringLengthInt", lambda x: len(x), IntegerType())
>>> strlen = spark.catalog.registerFunction("stringLengthInt", len, IntegerType())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks strlen is not being used. Should we remove this or add an example for this?

Copy link
Member Author

@zero323 zero323 May 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to either capture the output (we could use _ I guess) or use it in the doctests:

>>> spark.catalog.registerFunction("stringLengthInt", len, IntegerType())
<function len>

I am not sure if it makes sense to use it, unless we provide more diverse examples.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think capturing is rather better because it is what this API returns (and as it dose not return None anymore) if we don't have a better idea. Variable that indicates it's private, _, looks also fine to me. No strong opinion. I am fine if we don't have a simpler and cleaner idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use _. It is a common to indicate discarded output and, unlike repr, it is portable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Yea, discarded output is a better term ...)

@HyukjinKwon
Copy link
Member

HyukjinKwon commented May 3, 2017

I want to know the UDF API change in more details and see if that blocks this PR. I would appreciate if there is a JIRA or details I could check.

@SparkQA
Copy link

SparkQA commented May 3, 2017

Test build #76416 has finished for PR 17831 at commit bce80c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

@zero323 Will ping you when my PR is submitted. You can then add the UDF changes in Python.

@gatorsmile
Copy link
Member

@HyukjinKwon @holdenk Spark 2.2 RC has been out. This PR is not a bug fix. Thus, this PR will be merged to the master branch instead of 2.2. I will submit a PR to show the changes soon.

@gatorsmile
Copy link
Member

JIRA: https://issues.apache.org/jira/browse/SPARK-20586 is opened.

@zero323
Copy link
Member Author

zero323 commented May 4, 2017

Thanks @gatorsmile

@HyukjinKwon
Copy link
Member

Yea, thanks for chiming in. It helped me a lot to understand the context.

@zero323
Copy link
Member Author

zero323 commented May 4, 2017

I think there is no conflict between #17848 and this. As of 2.2 we no longer return UserDefinedFunction from udf (and we never documented UserDefinedFunctions) so changes will have to be reflected in the arguments of udf and UserDefinedFunction.__init__.

@HyukjinKwon
Copy link
Member

It sounds orthogonal to me as well. LGTM.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented May 4, 2017

cc @viirya too who I believe is also appropriate to review this.

@holdenk
Copy link
Contributor

holdenk commented May 4, 2017

Thanks for making the #17848 PR with the details @gatorsmile it appears to be orthogonal to this change. Historically we've treated Python API parity fixes as closer to bug fixes rather than new features which is why I was thinking this made sense for 2.2, but I'll defer to your expertise in SQL and we can just target this for master.

@holdenk
Copy link
Contributor

holdenk commented May 6, 2017

@gatorsmile want to know if you're ok with this going into master or if you still have concerns about this if its targeted to 2.3?

@viirya
Copy link
Member

viirya commented May 7, 2017

This change LGTM.

I go to check #17848. It seems to me that the PR simply adds two flags into ScalaUDF. It appears that there is not API change regarding with existing UDF registration. I agreed with @holdenk and @HyukjinKwon that it is orthogonal to this change for now.

@gatorsmile
Copy link
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 63d90e7 May 7, 2017
@zero323
Copy link
Member Author

zero323 commented May 7, 2017

Thanks everyone.

@zero323 zero323 deleted the SPARK-18777 branch May 8, 2017 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants