[SPARK-31010][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default #27488
Conversation
@@ -79,7 +80,7 @@ abstract class Transformer extends PipelineStage {
  * result as a new column.
  */
 @DeveloperApi
-abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]]
+abstract class UnaryTransformer[IN: TypeTag, OUT: TypeTag, T <: UnaryTransformer[IN, OUT, T]]
`TypeTag` is required for the typed UDF when creating the udf for `createTransformFunc`.
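For context, a minimal sketch of why the bounds are needed — the typed `udf` overloads demand `TypeTag` evidence for both type parameters (the helper name below is hypothetical, not Spark's actual code):

```scala
import scala.reflect.runtime.universe.TypeTag

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Mirrors how createTransformFunc's result gets wrapped: the typed
// udf[RT: TypeTag, A1: TypeTag] overload needs TypeTags for both IN
// and OUT, hence the new context bounds on UnaryTransformer.
def makeTransformUdf[IN: TypeTag, OUT: TypeTag](f: IN => OUT): UserDefinedFunction =
  udf(f)
```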
This is a breaking change, but I think it's better than silently changing results.
We can avoid this breaking change if we know that the type parameters won't be primitive types. cc @srowen @zhengruifeng
I don't disagree, but this is trading a possible error for a definite error. In light of the recent conversations about not-breaking things, is this wise? (I don't object though.)
Yes, let's restrict this to primitive types. I think Spark ML even uses some UDFs that accept AnyRef or something to work with tuples or triples, IIRC.
This is a developer API, so I'm wondering if third-party implementations would use primitive types and hit the silent result change.
I think it's better to ask users to re-compile their Spark applications than to just tell them that they may hit a result change.
ping @cloud-fan @mengxr @WeichenXu123 please help review.
Test build #118024 has finished for PR 27488 at commit
docs/sql-migration-guide.md
Outdated
@@ -65,6 +65,8 @@ license: |

 - In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. Since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.

 - Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` is not allowed by default. Set `spark.sql.legacy.useUnTypedUdf.enabled` to true to keep use it.
can we merge the migration guide between this one and the one that changes the behavior?
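As a side note, a small sketch of the behavior difference these two migration-guide entries describe (the DataFrame `df` and column `x` are illustrative, and the untyped line runs only where the legacy flag permits it once this PR lands):

```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// Untyped UDF (the API this PR guards): with no input type information,
// a null in column `x` reaches the closure as 0, the Java default for
// Int, under Spark 3.0 / Scala 2.12.
val untyped = udf((x: Int) => x, IntegerType)

// Typed UDF: Spark knows the argument is a primitive Int and returns
// null for null input instead of calling the closure.
val typed = udf((x: Int) => x)

// df.select(untyped($"x"))  // Spark 2.4: null, Spark 3.0: 0
// df.select(typed($"x"))    // null stays null in both versions
```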
@@ -2006,6 +2006,14 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

  val LEGACY_USE_UNTYPED_UDF =
    buildConf("spark.sql.legacy.useUnTypedUdf.enabled")
I think `spark.sql.legacy.allowUntypedScalaUDF` is better.
"and the closure will see the default value of the Java type for the null argument, " + | ||
"e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. You could use " + | ||
"other typed udf APIs to avoid this problem, or set " + | ||
"spark.sql.legacy.useUnTypedUdf.enabled to true to insistently use this." |
let's not hardcode config names.
Test build #118028 has finished for PR 27488 at commit
Test build #118127 has finished for PR 27488 at commit
retest this please
Test build #118136 has finished for PR 27488 at commit
Jenkins, retest this please.
docs/sql-migration-guide.md
Outdated
@@ -63,8 +63,8 @@ license: |

 - Since Spark 3.0, JSON datasource and JSON function `schema_of_json` infer TimestampType from string values if they match to the pattern defined by the JSON option `timestampFormat`. Set JSON option `inferTimestamp` to `false` to disable such type inferring.

 - In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. Since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.

 - Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` is not allowed by default. Set `spark.sql.legacy.allowUntypedScalaUDF` to true to keep use it. But please note that, in Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. However, since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
`to keep use it` -> `to keep using it`
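For completeness, opting back in at runtime would look roughly like this (assuming a `SparkSession` named `spark`; the flag is a regular runtime SQL conf):

```scala
// Re-enable the legacy untyped Scala UDF API for the current session.
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
```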
@@ -4732,6 +4733,15 @@ object functions {
    * @since 2.0.0
    */
  def udf(f: AnyRef, dataType: DataType): UserDefinedFunction = {
    if (!SQLConf.get.getConf(SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF)) {
      val errorMsg = "You're using untyped udf, which does not have the input type information. " +
untyped Scala UDF
@@ -4732,6 +4733,15 @@ object functions {
    * @since 2.0.0
    */
  def udf(f: AnyRef, dataType: DataType): UserDefinedFunction = {
    if (!SQLConf.get.getConf(SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF)) {
      val errorMsg = "You're using untyped udf, which does not have the input type information. " +
        "So, Spark may blindly pass null to the Scala closure with primitive-type argument, " +
`So, Spark ...` -> `Spark ...`
"and the closure will see the default value of the Java type for the null argument, " + | ||
"e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. You could use " + | ||
"other typed udf APIs to avoid this problem, or set " + | ||
s"${SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF.key} to true to insistently use this." |
to true and use this API with caution
"e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. You could use " + | ||
"other typed udf APIs to avoid this problem, or set " + | ||
s"${SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF.key} to true to insistently use this." | ||
throw new SparkException(errorMsg) |
`AnalysisException`?
Test build #118145 has finished for PR 27488 at commit
Test build #118198 has finished for PR 27488 at commit
Jenkins, retest this please.
Test build #118212 has finished for PR 27488 at commit
Jenkins, retest this please.
Test build #118216 has finished for PR 27488 at commit
Test build #118246 has finished for PR 27488 at commit
retest this please.
Test build #118280 has finished for PR 27488 at commit
retest this please.
Test build #118287 has finished for PR 27488 at commit
Kindly ping @cloud-fan @mengxr @WeichenXu123
LGTM. Also ping @srowen @zhengruifeng. Could you help double-check?
docs/sql-migration-guide.md
Outdated
- In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. Since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.

- Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` is not allowed by default. Set `spark.sql.legacy.allowUntypedScalaUDF` to true to keep using it. But please note that, in Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. However, since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
nit: remove spaces here?
Test build #118708 has finished for PR 27488 at commit
retest this please
Test build #118717 has finished for PR 27488 at commit
retest this please
Test build #118721 has finished for PR 27488 at commit
retest this please
Test build #118727 has finished for PR 27488 at commit
thanks, merging to master/3.0!
thanks all!
[SPARK-31010][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default

### What changes were proposed in this pull request?
This PR proposes to throw an exception by default when a user uses the untyped UDF (a.k.a. `org.apache.spark.sql.functions.udf(AnyRef, DataType)`). Users can still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`.

### Why are the changes needed?
According to #23498, since Spark 3.0 the untyped UDF returns the default value of the Java type if the input value is null. For example, with `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns 0 in Spark 3.0 but null in Spark 2.4. The behavior change was introduced because Spark 3.0 is built with Scala 2.12 by default. As a result, this might change data silently and may cause correctness issues if users still expect `null` in some cases. Thus, we'd better encourage users to use typed UDFs to avoid this problem.

### Does this PR introduce any user-facing change?
Yes. Users will now hit an exception when using the untyped UDF.

### How was this patch tested?
Added a test and updated some existing tests.

Closes #27488 from Ngone51/spark_26580_followup.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 82ce475)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use the Scala `deprecated` annotation to deprecate the untyped Scala UDF API.

### Why are the changes needed?
After #27488, it's weird to see the untyped Scala UDF fail by default without a deprecation warning.

### Does this PR introduce any user-facing change?
Yes, users will see the warning:
```
<console>:26: warning: method udf in object functions is deprecated (since 3.0.0): Untyped Scala UDF API is deprecated, please use typed Scala UDF API such as 'def udf[RT: TypeTag](f: Function0[RT]): UserDefinedFunction' instead.
       val myudf = udf(() => Math.random(), DoubleType)
                   ^
```

### How was this patch tested?
Tested manually.

Closes #27794 from Ngone51/deprecate_untyped_scala_udf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 587266f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
buildConf("spark.sql.legacy.allowUntypedScalaUDF") | ||
.internal() | ||
.doc("When set to true, user is allowed to use org.apache.spark.sql.functions." + | ||
"udf(f: AnyRef, dataType: DataType). Otherwise, exception will be throw.") |
`exception will be throw` -> `an exception will be thrown at runtime.`
"information. Spark may blindly pass null to the Scala closure with primitive-type " + | ||
"argument, and the closure will see the default value of the Java type for the null " + | ||
"argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. " + | ||
"You could use other typed Scala UDF APIs to avoid this problem, or set " + |
In the error message, we should give an example to show how to use the typed Scala UDF for implementing `udf((x: Int) => x, IntegerType)`.
I see.
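For reference, the typed replacement the review asks the error message to demonstrate would look roughly like:

```scala
import org.apache.spark.sql.functions.udf

// Typed Scala UDF: input and return types are captured via TypeTags,
// so no explicit DataType is needed and null handling for the
// primitive Int argument is done by Spark rather than the closure.
val f = udf((x: Int) => x)
```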
What changes were proposed in this pull request?

This PR proposes to throw an exception by default when a user uses the untyped UDF (a.k.a. `org.apache.spark.sql.functions.udf(AnyRef, DataType)`). Users can still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`.

Why are the changes needed?

According to #23498, since Spark 3.0 the untyped UDF returns the default value of the Java type if the input value is null. For example, with `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns 0 in Spark 3.0 but null in Spark 2.4. The behavior change was introduced because Spark 3.0 is built with Scala 2.12 by default.

As a result, this might change data silently and may cause correctness issues if users still expect `null` in some cases. Thus, we'd better encourage users to use typed UDFs to avoid this problem.

Does this PR introduce any user-facing change?

Yes. Users will now hit an exception when using the untyped UDF.

How was this patch tested?

Added a test and updated some existing tests.
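A rough sketch of the kind of test described above, written as ScalaTest-style assertions; the exception type and message fragment follow the review thread and are assumptions about the merged code, not quotes from it:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// With spark.sql.legacy.allowUntypedScalaUDF left at its default
// (false), constructing an untyped Scala UDF now fails fast.
val e = intercept[AnalysisException] {
  udf((x: Int) => x, IntegerType)
}
assert(e.getMessage.contains("untyped Scala UDF"))
```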