[SPARK-39015][SQL] Remove the usage of toSQLValue(v) without an explicit type#36351

Closed
HyukjinKwon wants to merge 1 commit into apache:master from HyukjinKwon:SPARK-39015

Conversation

@HyukjinKwon (Member)

What changes were proposed in this pull request?

This PR proposes to remove the usage of toSQLValue(v) without an explicit type.

Literal(v) is intended to be used by end-users, so it cannot handle our internal types such as UTF8String and ArrayBasedMapData. Using this method can lead to unexpected error messages such as:

```
Caused by: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE] The feature is not supported: literal for 'hair' of class org.apache.spark.unsafe.types.UTF8String.
  at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:241)
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99)
  at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:45)
  ...
```

Since it is impossible to derive the corresponding data type from an internal value alone, as one internal type can map to multiple external types (e.g., Long for Timestamp, TimestampNTZ, and LongType), the removal approach was taken.
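The ambiguity can be sketched in plain Scala. Note that DataType, LongType, and the other names below are simplified stand-ins for illustration, not Spark's actual catalyst classes: given only a bare internal Long, more than one SQL type is a valid candidate, so no unique type can be recovered.

```scala
// Minimal sketch of the type ambiguity; all names are illustrative
// stand-ins, not Spark's real classes.
object TypeAmbiguity {
  sealed trait DataType
  case object LongType extends DataType
  case object TimestampType extends DataType
  case object TimestampNTZType extends DataType

  // All SQL types whose internal (physical) representation is a Long.
  def candidateTypes(v: Any): Seq[DataType] = v match {
    case _: Long => Seq(LongType, TimestampType, TimestampNTZType) // ambiguous
    case _       => Seq.empty
  }

  def main(args: Array[String]): Unit = {
    // A bare internal Long cannot be resolved to a unique SQL type.
    println(candidateTypes(42L).mkString(", "))
  }
}
```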

Why are the changes needed?

To provide the error messages as intended.

Does this PR introduce any user-facing change?

Yes.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.DataTypes

val arrayStructureData = Seq(
  Row(Map("hair" -> "black", "eye" -> "brown")),
  Row(Map("hair" -> "blond", "eye" -> "blue")),
  Row(Map()))

val mapType = DataTypes.createMapType(StringType, StringType)

val arrayStructureSchema = new StructType().add("properties", mapType)

val mapTypeDF = spark.createDataFrame(
  spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)

spark.conf.set("spark.sql.ansi.enabled", true)
mapTypeDF.selectExpr("element_at(properties, 'hair')").show
```

Before:

```
Caused by: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE] The feature is not supported: literal for 'hair' of class org.apache.spark.unsafe.types.UTF8String.
  at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:241)
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99)
  at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:45)
  ...
```

After:

```
Caused by: org.apache.spark.SparkNoSuchElementException: [MAP_KEY_DOES_NOT_EXIST] Key 'hair' does not exist. To return NULL instead, use 'try_element_at'. If necessary set spark.sql.ansi.enabled to false to bypass this error.
== SQL(line 1, position 0) ==
element_at(properties, 'hair')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```

How was this patch tested?

A unit test was added. Otherwise, existing test cases should cover this.

@HyukjinKwon (Member, Author)

cc @gengliangwang @MaxGekk FYI

github-actions bot added the SQL label on Apr 26, 2022
@MaxGekk (Member) commented Apr 26, 2022

Literal(v) is intended to be used from end-users so it cannot handle our internal types such as UTF8String and ArrayBasedMapData.

When I added the two toSQLValue() methods, I intended the following usage:

  • def toSQLValue(v: Any, t: DataType) for internal values
  • def toSQLValue(v: Any): String for external values like String

If you need to pass an internal value somewhere, just use the first method. I don't get why you need to remove the second one.
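The two overloads can be modeled with a toy sketch. All classes below (UTF8String, DataType, StringType) are simplified stand-ins for illustration, not Spark's actual code: the typed overload can render both the external and internal forms, while the untyped one only recognizes external values, which is the failure mode this PR fixes.

```scala
// Toy model of the two toSQLValue overloads; simplified stand-ins only.
object ToSQLValueSketch {
  final case class UTF8String(s: String) // stand-in for the internal string type
  sealed trait DataType
  case object StringType extends DataType

  // Typed overload: given the SQL type, it renders both the external form
  // (String) and the internal form (UTF8String).
  def toSQLValue(v: Any, t: DataType): String = (v, t) match {
    case (s: String, StringType)     => s"'$s'"
    case (u: UTF8String, StringType) => s"'${u.s}'"
    case _ => throw new IllegalArgumentException(s"unsupported literal: $v")
  }

  // Untyped overload: only external values are recognized; an internal
  // UTF8String falls through to the error branch, as in the reported bug.
  def toSQLValue(v: Any): String = v match {
    case s: String => s"'$s'"
    case other => throw new IllegalArgumentException(
      s"literal for $other of class ${other.getClass.getName} unsupported")
  }
}
```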

@HyukjinKwon (Member, Author)

def toSQLValue(v: Any): String w/ external like String

Actually, I think this is error-prone (e.g., see the case reported and fixed here). With def toSQLValue(v: Any, t: DataType), you can use it everywhere without thinking about internal or external types.

@MaxGekk (Member) commented Apr 26, 2022

you can use it everywhere without thinking about internal or external

Now you have to convert String to UTF8String (and maybe other values) everywhere, which is inconvenient.

@MaxGekk (Member) commented Apr 26, 2022

Can't you just use the correct function without removing the other one, and add comments to the functions? Removing the second function seems unrelated to the fix.

@HyukjinKwon (Member, Author) commented Apr 26, 2022

Now, you have to convert String to UTF8String (maybe other values) everywhere which is inconvenient.

We actually don't need to convert. Both Literal.create(a: String, StringType) and Literal.create(a: UTF8String, StringType) work, but Literal(s: UTF8String) doesn't.
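That asymmetry can be sketched with a toy Literal (again, simplified stand-ins, not Spark's real catalyst classes): create(v, t) converts an external value to its internal form or accepts the internal form as-is, while the type-less apply(v) only recognizes external values.

```scala
// Toy sketch of Literal.create vs. Literal.apply; stand-in classes only.
object LiteralSketch {
  final case class UTF8String(s: String) // stand-in internal string type
  sealed trait DataType
  case object StringType extends DataType
  final case class Lit(value: Any, dataType: DataType)

  // With an explicit type, both external String and internal UTF8String
  // work: the external form is converted, the internal form kept as-is.
  def create(v: Any, t: DataType): Lit = (v, t) match {
    case (s: String, StringType)     => Lit(UTF8String(s), StringType)
    case (u: UTF8String, StringType) => Lit(u, StringType)
    case _ => throw new IllegalArgumentException(s"unsupported: $v")
  }

  // Without a type, only external values are recognized; the internal
  // form fails, mirroring Literal(s: UTF8String) in Spark.
  def apply(v: Any): Lit = v match {
    case s: String => Lit(UTF8String(s), StringType)
    case other => throw new RuntimeException(
      s"literal for $other of class ${other.getClass.getName} unsupported")
  }
}
```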

@HyukjinKwon (Member, Author)

This is another example of being error-prone.

@HyukjinKwon force-pushed the SPARK-39015 branch 2 times, most recently from b2de20a to 95b06a2, on April 26, 2022 at 09:19
@HyukjinKwon (Member, Author)

I thought it's easier to have one function to use everywhere versus having two. I can separate the fix if you still think it's arguable.

@MaxGekk (Member) commented Apr 26, 2022

I can separate the fix if you still think it's arguable.

It isn't needed, if it is possible to point out the type in all cases and toSQLValue() can handle "external" types as well.

@HyukjinKwon (Member, Author)

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L155-L164

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L500-L522

Here, all external types are listed; if any are missing, it's a bug. They are converted to Literal(internal value).
If the given value isn't an external type, it will just be Literal(v).

There seem to be two missing cases, Char and collection.mutable.WrappedArray, both of which were added recently. I think they forgot to fix it here together.

@HyukjinKwon (Member, Author)

There's another case this PR fixes.

@HyukjinKwon force-pushed the SPARK-39015 branch 3 times, most recently from 7a78184 to 02a5ea8, on April 26, 2022 at 23:56
@MaxGekk (Member) commented Apr 27, 2022

+1, LGTM. Merging to master.
Thank you, @HyukjinKwon.

@MaxGekk closed this in e49147a on Apr 27, 2022
@MaxGekk (Member) commented Apr 27, 2022

@HyukjinKwon Could you backport the changes to branch-3.3, please.

@HyukjinKwon (Member, Author)

sure

HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Apr 27, 2022
…cit type


Closes apache#36351 from HyukjinKwon/SPARK-39015.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit e49147a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
MaxGekk pushed a commit that referenced this pull request Apr 27, 2022
…explicit type

This PR is a backport of #36351; the commit message repeats the description of that pull request.

Closes #36375 from HyukjinKwon/SPARK-39015-3.3.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@HyukjinKwon HyukjinKwon deleted the SPARK-39015 branch January 15, 2024 00:50