
[SPARK-31010][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default #27488

Closed
wants to merge 13 commits into master from Ngone51:spark_26580_followup

Conversation

@Ngone51 (Member) commented Feb 7, 2020

What changes were proposed in this pull request?

This PR proposes to throw an exception by default when a user uses an untyped UDF (a.k.a. `org.apache.spark.sql.functions.udf(AnyRef, DataType)`).

Users can still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`.

Why are the changes needed?

According to #23498, since Spark 3.0 the untyped UDF returns the default value of the Java type if the input value is null. For example, given `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns 0 in Spark 3.0 but null in Spark 2.4. This behavior change was introduced because Spark 3.0 is built with Scala 2.12 by default.

As a result, this can change data silently and may cause correctness issues if users still expect null in some cases. We'd therefore better encourage users to use typed UDFs to avoid this problem.
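For illustration (a minimal sketch, not code from this PR), the behavior difference and the typed workaround, assuming a DataFrame `df` with a nullable integer column `x`:

```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// Untyped Scala UDF: Spark has no input type information, so under
// Scala 2.12 a null in column `x` reaches the closure as the Java
// default value (0) instead of staying null.
val untyped = udf((x: Int) => x, IntegerType)

// Typed Scala UDF: Spark captures the input type via TypeTag and can
// keep a null input as null in the result.
val typed = udf((x: Int) => x)

// df.select(untyped($"x"))  // 0 for a null `x` in Spark 3.0, null in 2.4
// df.select(typed($"x"))    // null for a null `x`
```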

Does this PR introduce any user-facing change?

Yes. Users will now hit an exception when using an untyped UDF.

How was this patch tested?

Added test and updated some tests.

```
@@ -79,7 +80,7 @@ abstract class Transformer extends PipelineStage {
  * result as a new column.
  */
 @DeveloperApi
-abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]]
+abstract class UnaryTransformer[IN: TypeTag, OUT: TypeTag, T <: UnaryTransformer[IN, OUT, T]]
```
Member Author (@Ngone51):

TypeTag is required for the typed UDF when creating the udf for createTransformFunc.
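For context, a rough sketch of why the `TypeTag` bounds are needed: the typed `udf` overloads take `TypeTag` evidence for the input and output types, so `createTransformFunc` can only be wrapped this way when `IN` and `OUT` carry TypeTags (the helper below is hypothetical, for illustration):

```scala
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.functions.udf

// The typed API is declared roughly as
//   def udf[RT: TypeTag, A1: TypeTag](f: A1 => RT): UserDefinedFunction
// so wrapping a transform function requires both tags to be in scope.
def wrapTransform[IN: TypeTag, OUT: TypeTag](createTransformFunc: IN => OUT) =
  udf(createTransformFunc)
```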

Contributor:

This is a breaking change, but I think it's better than silently changing results.

Contributor:

We can avoid this breaking change if we know that the type parameter won't be primitive types. cc @srowen @zhengruifeng

Member:

I don't disagree, but this is trading a possible error for a definite error. In light of the recent conversations about not-breaking things, is this wise? (I don't object though.)

Yes, let's restrict this to primitive types. I think Spark ML even uses some UDFs that accept AnyRef or something to work with tuples or triples, IIRC.

@cloud-fan (Contributor) commented Feb 17, 2020:

This is a developer API, so I'm wondering whether third-party implementations use primitive types and would hit the silent result change.

I think it's better to ask users to re-compile their Spark applications than to just tell them that they may hit a result change.

@Ngone51 (Member, Author) commented Feb 7, 2020

Ping @cloud-fan @mengxr @WeichenXu123, please help review.

@SparkQA commented Feb 7, 2020

Test build #118024 has finished for PR 27488 at commit 809703e.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -65,6 +65,8 @@ license: |
 
 - In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. Since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
 
+- Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` is not allowed by default. Set `spark.sql.legacy.useUnTypedUdf.enabled` to true to keep use it.
```
Contributor:

Can we merge this migration guide entry with the one above that describes the behavior change?

```
@@ -2006,6 +2006,14 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val LEGACY_USE_UNTYPED_UDF =
+    buildConf("spark.sql.legacy.useUnTypedUdf.enabled")
```
Contributor:

I think spark.sql.legacy.allowUntypedScalaUDF is better.

```
      "and the closure will see the default value of the Java type for the null argument, " +
      "e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. You could use " +
      "other typed udf APIs to avoid this problem, or set " +
      "spark.sql.legacy.useUnTypedUdf.enabled to true to insistently use this."
```
Contributor:

let's not hardcode config names.

@SparkQA commented Feb 7, 2020

Test build #118028 has finished for PR 27488 at commit 23e52ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 10, 2020

Test build #118127 has finished for PR 27488 at commit 93bf652.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

retest this please

@SparkQA commented Feb 10, 2020

Test build #118136 has finished for PR 27488 at commit 93bf652.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 10, 2020

Jenkins, retest this please.

```
@@ -63,8 +63,8 @@ license: |
 
 - Since Spark 3.0, JSON datasource and JSON function `schema_of_json` infer TimestampType from string values if they match to the pattern defined by the JSON option `timestampFormat`. Set JSON option `inferTimestamp` to `false` to disable such type inferring.
 
-- In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. Since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
+- Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` is not allowed by default. Set `spark.sql.legacy.allowUntypedScalaUDF` to true to keep use it. But please note that, in Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. However, since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
```
Contributor:

to keep use it -> to keep using it

```
@@ -4732,6 +4733,15 @@ object functions {
    * @since 2.0.0
    */
   def udf(f: AnyRef, dataType: DataType): UserDefinedFunction = {
+    if (!SQLConf.get.getConf(SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF)) {
+      val errorMsg = "You're using untyped udf, which does not have the input type information. " +
```
Contributor:

untyped Scala UDF

```
@@ -4732,6 +4733,15 @@ object functions {
    * @since 2.0.0
    */
   def udf(f: AnyRef, dataType: DataType): UserDefinedFunction = {
+    if (!SQLConf.get.getConf(SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF)) {
+      val errorMsg = "You're using untyped udf, which does not have the input type information. " +
+        "So, Spark may blindly pass null to the Scala closure with primitive-type argument, " +
```
Contributor:

So, Spark ... -> Spark ...

```
      "and the closure will see the default value of the Java type for the null argument, " +
      "e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. You could use " +
      "other typed udf APIs to avoid this problem, or set " +
      s"${SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF.key} to true to insistently use this."
```
Contributor:

to true and use this API with caution

```
      "e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. You could use " +
      "other typed udf APIs to avoid this problem, or set " +
      s"${SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF.key} to true to insistently use this."
    throw new SparkException(errorMsg)
```
Contributor:

AnalysisException?

@SparkQA commented Feb 10, 2020

Test build #118145 has finished for PR 27488 at commit 93bf652.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 11, 2020

Test build #118198 has finished for PR 27488 at commit 549cdf4.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 11, 2020

Jenkins, retest this please.

@SparkQA commented Feb 11, 2020

Test build #118212 has finished for PR 27488 at commit 549cdf4.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 11, 2020

Jenkins, retest this please.

@SparkQA commented Feb 11, 2020

Test build #118216 has finished for PR 27488 at commit 549cdf4.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 11, 2020

Test build #118246 has finished for PR 27488 at commit 6969596.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 12, 2020

retest this please.

@SparkQA commented Feb 12, 2020

Test build #118280 has finished for PR 27488 at commit 6969596.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 12, 2020

retest this please.

@SparkQA commented Feb 12, 2020

Test build #118287 has finished for PR 27488 at commit 6969596.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Feb 13, 2020

Kindly ping @cloud-fan @mengxr @WeichenXu123

@WeichenXu123 (Contributor):
LGTM. Also ping @srowen @zhengruifeng, could you help double-check?

```
- In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. Since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.

- Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` is not allowed by default. Set `spark.sql.legacy.allowUntypedScalaUDF` to true to keep using it. But please note that, in Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. However, since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
```
Contributor:

nit: remove spaces here?

@SparkQA commented Feb 20, 2020

Test build #118708 has finished for PR 27488 at commit 0d1601b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):
retest this please

@SparkQA commented Feb 20, 2020

Test build #118717 has finished for PR 27488 at commit 0d1601b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):
retest this please

@SparkQA commented Feb 20, 2020

Test build #118721 has finished for PR 27488 at commit 0d1601b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):
retest this please

@SparkQA commented Feb 20, 2020

Test build #118727 has finished for PR 27488 at commit 0d1601b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):
thanks, merging to master/3.0!

@cloud-fan closed this in 82ce475 on Feb 21, 2020
@Ngone51 (Member, Author) commented Feb 21, 2020

thanks all!

cloud-fan pushed a commit that referenced this pull request on Feb 21, 2020

[SPARK-26580][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default

The commit message repeats the PR description above.

Closes #27488 from Ngone51/spark_26580_followup.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 82ce475)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan changed the title from [SPARK-26580][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default to [SPARK-31010][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default on Mar 2, 2020
cloud-fan pushed a commit that referenced this pull request Mar 6, 2020
### What changes were proposed in this pull request?

Use the Scala `@deprecated` annotation to deprecate the untyped Scala UDF API.

### Why are the changes needed?

After #27488, it's odd that the untyped Scala UDF fails by default without having been deprecated first.

### Does this PR introduce any user-facing change?

Yes, users will see the warning:
```
<console>:26: warning: method udf in object functions is deprecated (since 3.0.0): Untyped Scala UDF API is deprecated, please use typed Scala UDF API such as 'def udf[RT: TypeTag](f: Function0[RT]): UserDefinedFunction' instead.
       val myudf = udf(() => Math.random(), DoubleType)
                   ^
```

### How was this patch tested?

Tested manually.

Closes #27794 from Ngone51/deprecate_untyped_scala_udf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request on Mar 6, 2020

The commit message repeats the deprecation commit above.

Closes #27794 from Ngone51/deprecate_untyped_scala_udf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 587266f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
```
  buildConf("spark.sql.legacy.allowUntypedScalaUDF")
    .internal()
    .doc("When set to true, user is allowed to use org.apache.spark.sql.functions." +
      "udf(f: AnyRef, dataType: DataType). Otherwise, exception will be throw.")
```
Member:

`exception will be throw` -> `an exception will be thrown at runtime.`
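As an illustrative usage note (not part of the PR), the legacy flag can be flipped on an existing session, assuming `spark` is an active SparkSession:

```scala
// Opt back in to the untyped Scala UDF API; use with caution, since null
// inputs to primitive-typed closures become Java default values.
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
```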

```
      "information. Spark may blindly pass null to the Scala closure with primitive-type " +
      "argument, and the closure will see the default value of the Java type for the null " +
      "argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. " +
      "You could use other typed Scala UDF APIs to avoid this problem, or set " +
```
Member:

In the error message, we should give an example to show how to use the typed Scala UDF for implementing "udf((x: Int) => x, IntegerType)"

Member Author (@Ngone51):

I see.
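For reference, a sketch of the typed replacement the comment asks for; the typed API derives the input and output schema from the closure's signature, so no `DataType` argument is needed:

```scala
import org.apache.spark.sql.functions.udf

// Untyped form, now disallowed by default:
//   val f = udf((x: Int) => x, IntegerType)
// Typed equivalent:
val f = udf((x: Int) => x)
```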

sjincho pushed a commit to sjincho/spark that referenced this pull request on Apr 15, 2020

The commit message repeats the PR description above (with apache# cross-repository references).

Closes apache#27488 from Ngone51/spark_26580_followup.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request on Apr 15, 2020

The commit message repeats the deprecation commit above (with apache# cross-repository references).

Closes apache#27794 from Ngone51/deprecate_untyped_scala_udf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>