[SPARK-26216][SQL] Do not use case class as public API (UserDefinedFunction) #23178

cloud-fan · 2018-11-29T13:55:16Z

What changes were proposed in this pull request?

It's a bad idea to use a case class as public API, as it has a very wide surface: the copy method, its fields, the companion object, etc.

A particular case is UserDefinedFunction. It has a private constructor, and I believe we only want users to access a few methods: apply, nullable, asNonNullable, etc.

However, all its fields, the copy method, and the companion object are unexpectedly public. As a result, we have used many tricks to work around binary compatibility issues.

This PR proposes to make only interfaces public and hide the implementations behind a private class. Now UserDefinedFunction is a pure trait, and the concrete implementation is SparkUserDefinedFunction, which is private.
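To make the "wide surface" point concrete, here is a minimal sketch of the two shapes. Every name in it (`LeakyUdf`, `Udf`, `UdfImpl`, `udf`) is hypothetical and the types are simplified to `Int => Int`; it is not Spark's actual code, only an illustration of the pattern described above.

```scala
// Hypothetical names throughout; a sketch of the pattern, not Spark's code.
object CaseClassSurface {
  // As a case class, all of this becomes public API whether intended or not:
  // the fields f and returnType, copy(...), and the companion's apply/unapply.
  case class LeakyUdf(f: Int => Int, returnType: String)

  // Interface-only alternative: users see just the methods listed here.
  trait Udf {
    def apply(x: Int): Int
    def nullable: Boolean
    def asNonNullable(): Udf
  }

  // The implementation is private to the enclosing object, so its
  // constructor and fields are not part of the public surface.
  private class UdfImpl(f: Int => Int, val nullable: Boolean) extends Udf {
    def apply(x: Int): Int = f(x)
    def asNonNullable(): Udf =
      if (nullable) new UdfImpl(f, nullable = false) else this
  }

  // The single public entry point, standing in for a factory such as udf(...).
  def udf(f: Int => Int): Udf = new UdfImpl(f, nullable = true)
}
```

With the case class, a call such as `LeakyUdf(_ + 1, "int").copy(returnType = "string")` compiles in user code, so `copy` and both fields have to be kept compatible forever. With the trait, only `apply`, `nullable`, and `asNonNullable` can be relied on.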

Changing a class to an interface is not binary compatible (but is source compatible), so 3.0 is a good chance to do it.

This is the first PR in this direction. If it's accepted, I'll create an umbrella JIRA and fix all the public case classes.

How was this patch tested?

existing tests.

cloud-fan · 2018-11-29T13:55:52Z

cc @rxin @srowen @gatorsmile @HyukjinKwon @dongjoon-hyun

srowen · 2018-11-29T15:01:07Z

sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala

    if (inputTypes.isDefined) {
      assert(inputTypes.get.length == nullableTypes.get.length)
    }

+    val inputsNullSafe = if (nullableTypes.isEmpty) {


You can use getOrElse here and even inline this into the call below, but I don't really care.
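For reference, the suggestion amounts to something like the sketch below. The diff excerpt only shows the first line of the `if`, so the shape of both branches is an assumption here, and `defaultNullSafety` is a hypothetical stand-in for whatever computes the fallback.

```scala
object GetOrElseSketch {
  // Hypothetical stand-in for the computation that supplies the fallback value.
  def defaultNullSafety(arity: Int): Seq[Boolean] = Seq.fill(arity)(false)

  // Branching on isEmpty, assuming the else branch simply unwraps the Option.
  def viaIfElse(nullableTypes: Option[Seq[Boolean]], arity: Int): Seq[Boolean] =
    if (nullableTypes.isEmpty) defaultNullSafety(arity) else nullableTypes.get

  // Same result with getOrElse. The default is passed by name, so it is
  // evaluated only when the Option is None.
  def viaGetOrElse(nullableTypes: Option[Seq[Boolean]], arity: Int): Seq[Boolean] =
    nullableTypes.getOrElse(defaultNullSafety(arity))
}
```

Because `getOrElse` is a single expression, it can also be inlined directly as an argument to the call that consumes the value.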

srowen · 2018-11-29T15:03:16Z

sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala

-  // This is a `var` instead of in the constructor for backward compatibility of this case class.
-  // TODO: revisit this case class in Spark 3.0, and narrow down the public surface.
-  private[sql] var nullableTypes: Option[Seq[Boolean]] = None
+trait UserDefinedFunction {


Should we make this sealed? I'm not sure. Would any user ever extend this meaningfully? I kind of worry someone will start doing so; maybe they already subclass it in some cases though. Elsewhere it might help the compiler understand in match statements that there is only ever one type of UDF class to match on.

cloud-fan (Nov 29, 2018): Good idea! Though I'm not sure if sealed works for Java.
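A small sketch of what `sealed` buys in match statements. The names are hypothetical, and the second implementation exists only to make the match non-trivial.

```scala
object SealedSketch {
  // sealed: every direct subtype must be declared in this source file.
  sealed trait Udf { def name: String }
  final class ScalaUdf(val name: String) extends Udf
  final class OtherUdf(val name: String) extends Udf // hypothetical second implementation

  // Because the hierarchy is closed, the compiler can check this match for
  // exhaustiveness and warn if one of the cases is dropped.
  def describe(u: Udf): String = u match {
    case s: ScalaUdf => "scala:" + s.name
    case o: OtherUdf => "other:" + o.name
  }
}
```

On the Java question: as far as I understand, in Scala 2 `sealed` is enforced only by the Scala compiler, and the trait is emitted as an ordinary JVM interface, so it would discourage but probably not prevent a Java class from implementing it.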

rxin · 2018-11-29T15:05:30Z

Good idea to have it sealed!


SparkQA · 2018-11-29T17:31:20Z

Test build #99454 has finished for PR 23178 at commit 700334f.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
// [SPARK-26216][SQL] Do not use case class as public API (UserDefinedFunction)
trait UserDefinedFunction

SparkQA · 2018-11-29T19:11:27Z

Test build #4448 has started for PR 23178 at commit 69bc466.

SparkQA · 2018-11-29T19:44:00Z

Test build #99457 has finished for PR 23178 at commit 69bc466.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
sealed trait UserDefinedFunction

HyukjinKwon · 2018-11-30T01:20:52Z

+1 as well

SparkQA · 2018-11-30T12:24:02Z

Test build #99502 has finished for PR 23178 at commit ad6605e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-11-30T21:26:49Z

sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala

@@ -38,114 +38,106 @@ import org.apache.spark.sql.types.DataType
 * @since 1.3.0
 */
 @Stable


I'm +1 for this PR, but I'm wondering whether the @Stable tag together with the @since 1.3.0 tag is valid here.
The previous case class was stable until 2.4.x and the new trait will be stable from 3.0, but the stability is broken once, at 3.0.0. Did I understand correctly?

Would it be better to change it to @Stable with @since 3.0.0?

HyukjinKwon (Dec 1, 2018): Yeah, actually I was wondering about the same thing.

srowen (Dec 1, 2018): I'd go ahead and leave the Since version. The API is essentially unchanged, though there are some marginal breaking compile-time changes. But the same is true of many things we are changing in 3.0. I've tagged the JIRA with release-notes and will add a blurb about the change.

cloud-fan (Dec 2, 2018): It's not a new API anyway; it would be weird to change @since to 3.0.

dongjoon-hyun (Dec 2, 2018): Got it. Thank you, @HyukjinKwon, @srowen, @cloud-fan.

cloud-fan · 2018-12-02T02:46:56Z

thanks for the review, merging to master!

marmbrus · 2018-12-18T20:49:47Z

> Changing class to interface is not binary compatible (but source compatible), so 3.0 is a good chance to do it.

Why not keep it an abstract class? This is going to break every application that uses UDFs, which while allowed at a major version, seems like a pretty big annoyance.
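The difference between the two options can be sketched as follows, with hypothetical names. A Scala trait is emitted as a JVM interface and an abstract class as a JVM class. Bytecode compiled against a class calls its methods with `invokevirtual`, and the JVM rejects that with `IncompatibleClassChangeError` if the target has since become an interface, which is why already-compiled applications would break.

```scala
object BinaryCompatSketch {
  // Emitted as a JVM interface: call sites compiled against a former class
  // of the same name would fail to link.
  trait UdfAsTrait { def apply(x: Long): Long }

  // Emitted as a JVM class: existing call sites keep linking, while the
  // implementation can still be hidden in a private subclass.
  abstract class UdfAsClass { def apply(x: Long): Long }

  private final class Impl(f: Long => Long) extends UdfAsClass {
    def apply(x: Long): Long = f(x)
  }

  // Hypothetical factory, the only way for users to obtain an instance.
  def udf(f: Long => Long): UdfAsClass = new Impl(f)
}
```

Here `classOf[BinaryCompatSketch.UdfAsTrait].isInterface` is true and `classOf[BinaryCompatSketch.UdfAsClass].isInterface` is false, which is the distinction the JVM linker cares about. The abstract class keeps the narrowed API without changing the kind of type that old bytecode expects.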

cloud-fan · 2018-12-19T15:22:10Z

thanks @marmbrus , that's a good idea! I've updated it in #23351

gatorsmile · 2018-12-19T17:50:46Z

project/MimaExcludes.scala

+    },
+
+    // [SPARK-26216][SQL] Do not use case class as public API (UserDefinedFunction)
+    ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.expressions.UserDefinedFunction$"),


Can we get rid of this in #23351?

…UserDefinedFunction

## What changes were proposed in this pull request?

A followup of apache#23178, to keep binary compatibility by using abstract class.

## How was this patch tested?

Manual test. I created a simple app with Spark 2.4

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object TryUDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
    import spark.implicits._
    val f1 = udf((i: Int) => i + 1)
    println(f1.deterministic)
    spark.range(10).select(f1.asNonNullable().apply($"id")).show()
    spark.stop()
  }
}
```

When I run it with current master, it fails with

```
java.lang.IncompatibleClassChangeError: Found interface org.apache.spark.sql.expressions.UserDefinedFunction, but class was expected
```

When I run it with this PR, it works.

Closes apache#23351 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

srowen reviewed Nov 29, 2018

cloud-fan added 3 commits November 30, 2018 16:09

Do not use case class as public API (UserDefinedFunction)

6c0971f

address comments

10138f5

add migration guide

ad6605e

cloud-fan force-pushed the udf branch from 69bc466 to ad6605e Compare November 30, 2018 08:20

dongjoon-hyun reviewed Nov 30, 2018

asfgit closed this in 39617cb Dec 2, 2018

cloud-fan mentioned this pull request Dec 19, 2018

[SPARK-26216][SQL][followup] use abstract class instead of trait for UserDefinedFunction #23351

Closed

gatorsmile reviewed Dec 19, 2018
