Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-26216][SQL] Do not use case class as public API (UserDefinedFunction) #23178

Closed
wants to merge 3 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Nov 29, 2018

What changes were proposed in this pull request?

It's a bad idea to use case class as public API, as it has a very wide surface. For example, the copy method, its fields, the companion object, etc.

For a particular case, UserDefinedFunction. It has a private constructor, and I believe we only want users to access a few methods:apply, nullable, asNonNullable, etc.

However, all its fields, and copy method, and the companion object are public unexpectedly. As a result, we made many tricks to work around the binary compatibility issues.

This PR proposes to only make interfaces public, and hide implementations behind with a private class. Now UserDefinedFunction is a pure trait, and the concrete implementation is SparkUserDefinedFunction, which is private.

Changing class to interface is not binary compatible(but source compatible), so 3.0 is a good chance to do it.

This is the first PR to go with this direction. If it's accepted, I'll create a umbrella JIRA and fix all the public case classes.

How was this patch tested?

existing tests.

@cloud-fan
Copy link
Contributor Author

if (inputTypes.isDefined) {
assert(inputTypes.get.length == nullableTypes.get.length)
}

val inputsNullSafe = if (nullableTypes.isEmpty) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use getOrElse here and even inline this into the call below, but I don't really care.

// This is a `var` instead of in the constructor for backward compatibility of this case class.
// TODO: revisit this case class in Spark 3.0, and narrow down the public surface.
private[sql] var nullableTypes: Option[Seq[Boolean]] = None
trait UserDefinedFunction {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this sealed? I'm not sure. Would any user ever extend this meaningfully? I kind of worry someone will start doing so; maybe they already subclass it in some cases though. Elsewhere it might help the compiler understand in match statements that there is only ever one type of UDF class to match on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea! though I'm not sure if sealed works for Java.

@rxin
Copy link
Contributor

rxin commented Nov 29, 2018 via email

@SparkQA
Copy link

SparkQA commented Nov 29, 2018

Test build #99454 has finished for PR 23178 at commit 700334f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • // [SPARK-26216][SQL] Do not use case class as public API (UserDefinedFunction)
  • trait UserDefinedFunction

@SparkQA
Copy link

SparkQA commented Nov 29, 2018

Test build #4448 has started for PR 23178 at commit 69bc466.

@SparkQA
Copy link

SparkQA commented Nov 29, 2018

Test build #99457 has finished for PR 23178 at commit 69bc466.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • sealed trait UserDefinedFunction

@HyukjinKwon
Copy link
Member

+1 as well

@SparkQA
Copy link

SparkQA commented Nov 30, 2018

Test build #99502 has finished for PR 23178 at commit ad6605e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -38,114 +38,106 @@ import org.apache.spark.sql.types.DataType
* @since 1.3.0
*/
@Stable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 for this PR, but I'm just wondering if this @Stable tag with @since 1.3.0 tag is valid or not here.
Previous case class was stable until 2.4.x and new trait will be stable since 3.0. But, the stability is broken at 3.0.0 once. Did I understand correctly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to change it to @Stable with @since 3.0.0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea actually I was wondering about the same thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go ahead and leave the Since version. The API is essentially unchanged, though there are some marginal breaking compile time changes. But same is true of many things we are changing in 3.0. I've tagged the JIRA with release-notes and will add a blurb about the change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a new API anyway, it will be weird to change since to 3.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thank you, @HyukjinKwon , @srowen , @cloud-fan .

@cloud-fan
Copy link
Contributor Author

thanks for the review, merging to master!

@asfgit asfgit closed this in 39617cb Dec 2, 2018
@marmbrus
Copy link
Contributor

Changing class to interface is not binary compatible(but source compatible), so 3.0 is a good chance to do it.

Why not keep it an abstract class? This is going to break every application that uses UDFs, which while allowed at a major version, seems like a pretty big annoyance.

@cloud-fan
Copy link
Contributor Author

thanks @marmbrus , that's a good idea! I've updated it in #23351

},

// [SPARK-26216][SQL] Do not use case class as public API (UserDefinedFunction)
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.expressions.UserDefinedFunction$"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of this in #23351?

holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
…UserDefinedFunction

## What changes were proposed in this pull request?

A followup of apache#23178 , to keep binary compability by using abstract class.

## How was this patch tested?

Manual test. I created a simple app with Spark 2.4
```
object TryUDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
    import spark.implicits._
    val f1 = udf((i: Int) => i + 1)
    println(f1.deterministic)
    spark.range(10).select(f1.asNonNullable().apply($"id")).show()
    spark.stop()
  }
}
```

When I run it with current master, it fails with
```
java.lang.IncompatibleClassChangeError: Found interface org.apache.spark.sql.expressions.UserDefinedFunction, but class was expected
```

When I run it with this PR, it works

Closes apache#23351 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…nction)

## What changes were proposed in this pull request?

It's a bad idea to use case class as public API, as it has a very wide surface. For example, the `copy` method, its fields, the companion object, etc.

For a particular case, `UserDefinedFunction`. It has a private constructor, and I believe we only want users to access a few methods:`apply`, `nullable`, `asNonNullable`, etc.

However, all its fields, and `copy` method, and the companion object are public unexpectedly. As a result, we made many tricks to work around the binary compatibility issues.

This PR proposes to only make interfaces public, and hide implementations behind with a private class. Now `UserDefinedFunction` is a pure trait, and the concrete implementation is `SparkUserDefinedFunction`, which is private.

Changing class to interface is not binary compatible(but source compatible), so 3.0 is a good chance to do it.

This is the first PR to go with this direction. If it's accepted, I'll create a umbrella JIRA and fix all the public case classes.

## How was this patch tested?

existing tests.

Closes apache#23178 from cloud-fan/udf.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…UserDefinedFunction

## What changes were proposed in this pull request?

A followup of apache#23178 , to keep binary compability by using abstract class.

## How was this patch tested?

Manual test. I created a simple app with Spark 2.4
```
object TryUDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
    import spark.implicits._
    val f1 = udf((i: Int) => i + 1)
    println(f1.deterministic)
    spark.range(10).select(f1.asNonNullable().apply($"id")).show()
    spark.stop()
  }
}
```

When I run it with current master, it fails with
```
java.lang.IncompatibleClassChangeError: Found interface org.apache.spark.sql.expressions.UserDefinedFunction, but class was expected
```

When I run it with this PR, it works

Closes apache#23351 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants