[SPARK-47646][SQL] Make try_to_number return NULL for malformed input#45771

Closed

HyukjinKwon wants to merge 1 commit into apache:master from HyukjinKwon:SPARK-47646

Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to add a NULL check after parsing the number, so that the output can safely be NULL for the `try_to_number` expression.

```scala
import org.apache.spark.sql.functions._
val df = spark.createDataset(spark.sparkContext.parallelize(Seq("11")))
df.select(try_to_number($"value", lit("$99.99"))).show()
```
```
java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.types.Decimal.toPlainString()" because "<local7>" is null
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:894)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:894)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:368)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:332)
```
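
The failing path boils down to a parse step that can legitimately produce nothing for malformed input. The intended `try_to_number` contract can be sketched in plain Scala (illustrative names and a simplified `$99.99` format only, not Spark's internal code):

```scala
// Sketch of the try_to_number contract for a simplified "$99.99" format:
// malformed input yields None instead of a null Decimal, so downstream
// code never dereferences null. All names here are illustrative.
object TryToNumberSketch {
  // "$99.99" here means: a literal '$', one or two digits, '.', two digits.
  private val pattern = """^\$(\d{1,2})\.(\d{2})$""".r

  def tryToNumber(input: String): Option[BigDecimal] = input match {
    case pattern(intPart, fracPart) => Some(BigDecimal(s"$intPart.$fracPart"))
    case _                          => None // e.g. "11" lacks the '$', so it is malformed
  }

  def main(args: Array[String]): Unit = {
    println(tryToNumber("$12.34")) // Some(12.34)
    println(tryToNumber("11"))     // None, where the unpatched code hit the NPE
  }
}
```

The fix in this PR follows the same idea: after parsing, the result is checked for null before any method is invoked on it.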

Why are the changes needed?

To fix the bug and let `try_to_number` return NULL for malformed input, as designed.

Does this PR introduce any user-facing change?

Yes, it fixes a bug. Previously, `try_to_number` failed with an NPE.

How was this patch tested?

A unit test was added.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Mar 29, 2024
Contributor

@cloud-fan cloud-fan left a comment


good catch!

@HyukjinKwon
Member Author

Merged to master and branch-3.5.

HyukjinKwon added a commit that referenced this pull request Mar 29, 2024

Closes #45771 from HyukjinKwon/SPARK-47646.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit d709e20)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@bersprockets
Contributor

Yikes and thanks!

Will there be a 3.4.3? This happens in 3.4.2 as well, although it takes more work to reproduce:

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.2
      /_/

Using Scala version 2.12.17 (Java HotSpot(TM) 64-Bit Server VM, Java 17.0.7)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("create or replace temp view v1(value) as values ('11')")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("cache table v1")
res1: org.apache.spark.sql.DataFrame = []

scala> val df = sql("select try_to_number(value, '$99.99') as x from v1")
df: org.apache.spark.sql.DataFrame = [x: decimal(4,2)]

scala> df.selectExpr("x + 1").show()
24/03/29 10:27:45 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 2)
java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.types.Decimal.$plus(org.apache.spark.sql.types.Decimal)" because "<local6>" is null
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
```
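
This `x + 1` failure is the same root cause one operator later: the cached `try_to_number` result is a null `Decimal`, and the generated `+` dereferences it. The intended semantics are ordinary SQL null propagation, sketched here in plain Scala (illustrative types, not Spark's `Decimal`):

```scala
// SQL-style null propagation for "x + 1": a NULL operand makes the whole
// expression NULL, instead of calling a method on a null value.
def plusOne(x: Option[BigDecimal]): Option[BigDecimal] = x.map(_ + 1)

// A well-formed value flows through; a malformed one stays NULL.
assert(plusOne(Some(BigDecimal("12.34"))).contains(BigDecimal("13.34")))
assert(plusOne(None).isEmpty)
```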

@HyukjinKwon
Member Author

Yeah, let me backport.

HyukjinKwon added a commit that referenced this pull request Mar 31, 2024

Closes #45771 from HyukjinKwon/SPARK-47646.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon
Member Author

Merged to branch-3.4 too.

dongjoon-hyun pushed a commit that referenced this pull request Mar 31, 2024
…function with TryToNumber

### What changes were proposed in this pull request?

This patch fixes the broken CI by replacing the non-existent `try_to_number` function in branch-3.4.

### Why are the changes needed?

#45771 backported a test to `StringFunctionsSuite` in branch-3.4, but the test uses `try_to_number`, which was only added in Spark 3.5.
So this patch fixes the broken CI: https://github.com/apache/spark/actions/runs/8494692184/job/23270175100

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45785 from viirya/fix.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Member

For the record, this broke branch-3.4 and the following PR fixed it.

@bersprockets
Contributor

bersprockets commented Mar 31, 2024

@HyukjinKwon

I should have mentioned that `try_to_number` exists in 3.4.2 as a SQL function but not as a Scala function in `functions.scala` (that's why my 3.4.2 example had to use `spark.sql`).

@HyukjinKwon
Member Author

Thank you guys!

@gengliangwang
Member

Late LGTM!
