
[SPARK-31010][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default #27488

Closed
wants to merge 13 commits into master from Ngone51:spark_26580_followup

Conversation

@Ngone51 (Member) commented Feb 7, 2020

What changes were proposed in this pull request?

This PR proposes to throw an exception by default when a user uses an untyped UDF (a.k.a. `org.apache.spark.sql.functions.udf(AnyRef, DataType)`).

Users can still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`.

Why are the changes needed?

According to #23498, since Spark 3.0 the untyped UDF returns the default value of the Java type if the input value is null. For example, given `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns 0 in Spark 3.0 but null in Spark 2.4. This behavior change was introduced because Spark 3.0 is built with Scala 2.12 by default.

As a result, this can change data silently and may cause correctness issues if users still expect null in some cases. We'd therefore better encourage users to use typed UDFs to avoid this problem.
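For illustration (a minimal sketch, not code from this PR), the behavior difference and the typed workaround, assuming a DataFrame `df` with a nullable integer column `x`:

```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// Untyped Scala UDF: Spark has no input type information, so under
// Scala 2.12 a null in column `x` reaches the closure as the Java
// default value (0) instead of staying null.
val untyped = udf((x: Int) => x, IntegerType)

// Typed Scala UDF: Spark captures the input type via TypeTag and can
// keep a null input as null in the result.
val typed = udf((x: Int) => x)

// df.select(untyped($"x"))  // 0 for a null `x` in Spark 3.0, null in 2.4
// df.select(typed($"x"))    // null for a null `x`
```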

Does this PR introduce any user-facing change?

Yes. Users will now hit an exception when using an untyped UDF.

How was this patch tested?

Added test and updated some tests.

```
@@ -79,7 +80,7 @@ abstract class Transformer extends PipelineStage {
  * result as a new column.
  */
 @DeveloperApi
-abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]]
+abstract class UnaryTransformer[IN: TypeTag, OUT: TypeTag, T <: UnaryTransformer[IN, OUT, T]]
```
Member Author (@Ngone51):

TypeTag is required for the typed UDF when creating the udf for createTransformFunc.
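For context, a rough sketch of why the `TypeTag` bounds are needed: the typed `udf` overloads take `TypeTag` evidence for the input and output types, so `createTransformFunc` can only be wrapped this way when `IN` and `OUT` carry TypeTags (the helper below is hypothetical, for illustration):

```scala
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.functions.udf

// The typed API is declared roughly as
//   def udf[RT: TypeTag, A1: TypeTag](f: A1 => RT): UserDefinedFunction
// so wrapping a transform function requires both tags to be in scope.
def wrapTransform[IN: TypeTag, OUT: TypeTag](createTransformFunc: IN => OUT) =
  udf(createTransformFunc)
```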

Contributor:

This is a breaking change, but I think it's better than silently changing results.

Contributor:

We can avoid this breaking change if we know that the type parameter won't be primitive types. cc @srowen @zhengruifeng

Member:

I don't disagree, but this is trading a possible error for a definite error. In light of the recent conversations about not-breaking things, is this wise? (I don't object though.)

Yes, let's restrict this to primitive types. I think Spark ML even uses some UDFs that accept AnyRef or something to work with tuples or triples, IIRC.

@cloud-fan (Contributor) commented Feb 17, 2020:

This is a developer API, so I'm wondering whether third-party implementations use primitive types and would hit the silent result change.

I think it's better to ask users to re-compile their Spark applications than to just tell them that they may hit a result change.

@Ngone51 (Member, Author) commented Feb 7, 2020

Ping @cloud-fan @mengxr @WeichenXu123, please help review.

@SparkQA commented Feb 7, 2020

Test build #118024 has finished for PR 27488 at commit 809703e.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -65,6 +65,8 @@ license: |
 
 - In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. Since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
 
+- Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` is not allowed by default. Set `spark.sql.legacy.useUnTypedUdf.enabled` to true to keep use it.
```
Contributor:

Can we merge this migration guide entry with the one above that describes the behavior change?

```
@@ -2006,6 +2006,14 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val LEGACY_USE_UNTYPED_UDF =
+    buildConf("spark.sql.legacy.useUnTypedUdf.enabled")
```
Contributor:

I think spark.sql.legacy.allowUntypedScalaUDF is better.

```
      "and the closure will see the default value of the Java type for the null argument, " +
      "e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. You could use " +
      "other typed udf APIs to avoid this problem, or set " +
      "spark.sql.legacy.useUnTypedUdf.enabled to true to insistently use this."
```
Contributor:

let's not hardcode config names.

@SparkQA commented Feb 7, 2020

Test build #118028 has finished for PR 27488 at commit 23e52ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 10, 2020

Test build #118127 has finished for PR 27488 at commit 93bf652.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

retest this please

@SparkQA commented Feb 10, 2020

Test build #118136 has finished for PR 27488 at commit 93bf652.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 10, 2020

Jenkins, retest this please.

```
@@ -63,8 +63,8 @@ license: |
 
 - Since Spark 3.0, JSON datasource and JSON function `schema_of_json` infer TimestampType from string values if they match to the pattern defined by the JSON option `timestampFormat`. Set JSON option `inferTimestamp` to `false` to disable such type inferring.
 
-- In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. Since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
+- Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` is not allowed by default. Set `spark.sql.legacy.allowUntypedScalaUDF` to true to keep use it. But please note that, in Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. However, since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
```
Contributor:

to keep use it -> to keep using it

```
@@ -4732,6 +4733,15 @@ object functions {
    * @since 2.0.0
    */
   def udf(f: AnyRef, dataType: DataType): UserDefinedFunction = {
+    if (!SQLConf.get.getConf(SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF)) {
+      val errorMsg = "You're using untyped udf, which does not have the input type information. " +
```
Contributor:

untyped Scala UDF

```
@@ -4732,6 +4733,15 @@ object functions {
    * @since 2.0.0
    */
   def udf(f: AnyRef, dataType: DataType): UserDefinedFunction = {
+    if (!SQLConf.get.getConf(SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF)) {
+      val errorMsg = "You're using untyped udf, which does not have the input type information. " +
+        "So, Spark may blindly pass null to the Scala closure with primitive-type argument, " +
```
Contributor:

So, Spark ... -> Spark ...

```
      "and the closure will see the default value of the Java type for the null argument, " +
      "e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. You could use " +
      "other typed udf APIs to avoid this problem, or set " +
      s"${SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF.key} to true to insistently use this."
```
Contributor:

to true and use this API with caution

```
      "e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. You could use " +
      "other typed udf APIs to avoid this problem, or set " +
      s"${SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF.key} to true to insistently use this."
    throw new SparkException(errorMsg)
```
Contributor:

AnalysisException?

@SparkQA commented Feb 10, 2020

Test build #118145 has finished for PR 27488 at commit 93bf652.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 11, 2020

Test build #118198 has finished for PR 27488 at commit 549cdf4.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 11, 2020

Jenkins, retest this please.

@SparkQA commented Feb 11, 2020

Test build #118212 has finished for PR 27488 at commit 549cdf4.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 11, 2020

Jenkins, retest this please.

@SparkQA commented Feb 11, 2020

Test build #118216 has finished for PR 27488 at commit 549cdf4.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 11, 2020

Test build #118246 has finished for PR 27488 at commit 6969596.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 12, 2020

retest this please.

@SparkQA commented Feb 12, 2020

Test build #118280 has finished for PR 27488 at commit 6969596.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 12, 2020

retest this please.

@SparkQA commented Feb 12, 2020

Test build #118287 has finished for PR 27488 at commit 6969596.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 13, 2020

Kindly ping @cloud-fan @mengxr @WeichenXu123

@WeichenXu123 (Contributor):
LGTM. Also ping @srowen @zhengruifeng, could you help double-check?

```
- In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. Since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.

- Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` is not allowed by default. Set `spark.sql.legacy.allowUntypedScalaUDF` to true to keep using it. But please note that, in Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. However, since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
```
Contributor:

nit: remove spaces here?

@SparkQA commented Feb 20, 2020

Test build #118708 has finished for PR 27488 at commit 0d1601b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):
retest this please

@SparkQA commented Feb 20, 2020

Test build #118717 has finished for PR 27488 at commit 0d1601b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):
retest this please

@SparkQA commented Feb 20, 2020

Test build #118721 has finished for PR 27488 at commit 0d1601b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):
retest this please

@SparkQA commented Feb 20, 2020

Test build #118727 has finished for PR 27488 at commit 0d1601b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):
thanks, merging to master/3.0!

@cloud-fan closed this in 82ce475 on Feb 21, 2020
@Ngone51 (Member, Author) commented Feb 21, 2020

thanks all!

cloud-fan pushed a commit that referenced this pull request on Feb 21, 2020

[SPARK-26580][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default

The commit message repeats the PR description above.

Closes #27488 from Ngone51/spark_26580_followup.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 82ce475)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan changed the title from [SPARK-26580][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default to [SPARK-31010][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default on Mar 2, 2020
cloud-fan pushed a commit that referenced this pull request Mar 6, 2020
### What changes were proposed in this pull request?

Use the Scala `@deprecated` annotation to deprecate the untyped Scala UDF API.

### Why are the changes needed?

After #27488, it's odd that the untyped Scala UDF fails by default without having been deprecated first.

### Does this PR introduce any user-facing change?

Yes, users will see the warning:
```
<console>:26: warning: method udf in object functions is deprecated (since 3.0.0): Untyped Scala UDF API is deprecated, please use typed Scala UDF API such as 'def udf[RT: TypeTag](f: Function0[RT]): UserDefinedFunction' instead.
       val myudf = udf(() => Math.random(), DoubleType)
                   ^
```

### How was this patch tested?

Tested manually.

Closes #27794 from Ngone51/deprecate_untyped_scala_udf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request on Mar 6, 2020

The commit message repeats the deprecation commit above.

Closes #27794 from Ngone51/deprecate_untyped_scala_udf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 587266f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
```
  buildConf("spark.sql.legacy.allowUntypedScalaUDF")
    .internal()
    .doc("When set to true, user is allowed to use org.apache.spark.sql.functions." +
      "udf(f: AnyRef, dataType: DataType). Otherwise, exception will be throw.")
```
Member:

`exception will be throw` -> `an exception will be thrown at runtime.`
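As an illustrative usage note (not part of the PR), the legacy flag can be flipped on an existing session, assuming `spark` is an active SparkSession:

```scala
// Opt back in to the untyped Scala UDF API; use with caution, since null
// inputs to primitive-typed closures become Java default values.
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
```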

```
      "information. Spark may blindly pass null to the Scala closure with primitive-type " +
      "argument, and the closure will see the default value of the Java type for the null " +
      "argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. " +
      "You could use other typed Scala UDF APIs to avoid this problem, or set " +
```
Member:

In the error message, we should give an example to show how to use the typed Scala UDF for implementing "udf((x: Int) => x, IntegerType)"

Member Author (@Ngone51):

I see.
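For reference, a sketch of the typed replacement the comment asks for; the typed API derives the input and output schema from the closure's signature, so no `DataType` argument is needed:

```scala
import org.apache.spark.sql.functions.udf

// Untyped form, now disallowed by default:
//   val f = udf((x: Int) => x, IntegerType)
// Typed equivalent:
val f = udf((x: Int) => x)
```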

sjincho pushed a commit to sjincho/spark that referenced this pull request on Apr 15, 2020

The commit message repeats the PR description above (with apache# cross-repository references).

Closes apache#27488 from Ngone51/spark_26580_followup.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request on Apr 15, 2020

The commit message repeats the deprecation commit above (with apache# cross-repository references).

Closes apache#27794 from Ngone51/deprecate_untyped_scala_udf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>