[SPARK-47646][SQL] Make try_to_number return NULL for malformed input#45771

Closed

HyukjinKwon wants to merge 1 commit into apache:master from HyukjinKwon:SPARK-47646

Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to add a NULL check after parsing the number, so that the output can safely be NULL for the `try_to_number` expression.

```scala
import org.apache.spark.sql.functions._
val df = spark.createDataset(spark.sparkContext.parallelize(Seq("11")))
df.select(try_to_number($"value", lit("$99.99"))).show()
```
```
java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.types.Decimal.toPlainString()" because "<local7>" is null
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:894)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:894)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:368)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:332)
```
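
The failing path boils down to a parse step that can legitimately produce nothing for malformed input. The intended `try_to_number` contract can be sketched in plain Scala (illustrative names and a simplified `$99.99` format only, not Spark's internal code):

```scala
// Sketch of the try_to_number contract for a simplified "$99.99" format:
// malformed input yields None instead of a null Decimal, so downstream
// code never dereferences null. All names here are illustrative.
object TryToNumberSketch {
  // "$99.99" here means: a literal '$', one or two digits, '.', two digits.
  private val pattern = """^\$(\d{1,2})\.(\d{2})$""".r

  def tryToNumber(input: String): Option[BigDecimal] = input match {
    case pattern(intPart, fracPart) => Some(BigDecimal(s"$intPart.$fracPart"))
    case _                          => None // e.g. "11" lacks the '$', so it is malformed
  }

  def main(args: Array[String]): Unit = {
    println(tryToNumber("$12.34")) // Some(12.34)
    println(tryToNumber("11"))     // None, where the unpatched code hit the NPE
  }
}
```

The fix in this PR follows the same idea: after parsing, the result is checked for null before any method is invoked on it.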

Why are the changes needed?

To fix the bug and let `try_to_number` return NULL for malformed input, as designed.

Does this PR introduce any user-facing change?

Yes, it fixes a bug. Previously, `try_to_number` failed with an NPE.

How was this patch tested?

A unit test was added.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Mar 29, 2024
Contributor

@cloud-fan cloud-fan left a comment


good catch!

@HyukjinKwon
Member Author

Merged to master and branch-3.5.

HyukjinKwon added a commit that referenced this pull request Mar 29, 2024

Closes #45771 from HyukjinKwon/SPARK-47646.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit d709e20)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@bersprockets
Contributor

Yikes and thanks!

Will there be a 3.4.3? This happens in 3.4.2 as well, although it takes more work to reproduce:

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.2
      /_/

Using Scala version 2.12.17 (Java HotSpot(TM) 64-Bit Server VM, Java 17.0.7)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("create or replace temp view v1(value) as values ('11')")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("cache table v1")
res1: org.apache.spark.sql.DataFrame = []

scala> val df = sql("select try_to_number(value, '$99.99') as x from v1")
df: org.apache.spark.sql.DataFrame = [x: decimal(4,2)]

scala> df.selectExpr("x + 1").show()
24/03/29 10:27:45 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 2)
java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.types.Decimal.$plus(org.apache.spark.sql.types.Decimal)" because "<local6>" is null
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
```
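
This `x + 1` failure is the same root cause one operator later: the cached `try_to_number` result is a null `Decimal`, and the generated `+` dereferences it. The intended semantics are ordinary SQL null propagation, sketched here in plain Scala (illustrative types, not Spark's `Decimal`):

```scala
// SQL-style null propagation for "x + 1": a NULL operand makes the whole
// expression NULL, instead of calling a method on a null value.
def plusOne(x: Option[BigDecimal]): Option[BigDecimal] = x.map(_ + 1)

// A well-formed value flows through; a malformed one stays NULL.
assert(plusOne(Some(BigDecimal("12.34"))).contains(BigDecimal("13.34")))
assert(plusOne(None).isEmpty)
```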

@HyukjinKwon
Member Author

Yeah, let me backport.

HyukjinKwon added a commit that referenced this pull request Mar 31, 2024

Closes #45771 from HyukjinKwon/SPARK-47646.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon
Member Author

Merged to branch-3.4 too.

dongjoon-hyun pushed a commit that referenced this pull request Mar 31, 2024
…function with TryToNumber

### What changes were proposed in this pull request?

This patch fixes the broken CI by replacing the non-existent `try_to_number` function in branch-3.4.

### Why are the changes needed?

#45771 backported a test to `StringFunctionsSuite` in branch-3.4, but the test uses `try_to_number`, which was only added in Spark 3.5.
So this patch fixes the broken CI: https://github.com/apache/spark/actions/runs/8494692184/job/23270175100

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45785 from viirya/fix.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Member

For the record, this broke branch-3.4 and the following PR fixed it.

@bersprockets
Contributor

bersprockets commented Mar 31, 2024

@HyukjinKwon

I should have mentioned that `try_to_number` exists in 3.4.2 as a SQL function but not as a Scala function in `functions.scala` (that's why my 3.4.2 example had to use `spark.sql`).

@HyukjinKwon
Member Author

Thank you guys!

@gengliangwang
Member

Late LGTM!
