[SPARK-39015][SQL] Remove the usage of toSQLValue(v) without an explicit type#36351

Closed
HyukjinKwon wants to merge 1 commit into apache:master from HyukjinKwon:SPARK-39015

Conversation

@HyukjinKwon (Member)

What changes were proposed in this pull request?

This PR proposes to remove the usage of toSQLValue(v) without an explicit type.

Literal(v) is intended to be used by end-users, so it cannot handle our internal types such as UTF8String and ArrayBasedMapData. Using this method can lead to unexpected error messages such as:

```
Caused by: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE] The feature is not supported: literal for 'hair' of class org.apache.spark.unsafe.types.UTF8String.
  at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:241)
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99)
  at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:45)
  ...
```

Since it is impossible to derive the corresponding data type from an internal value alone, as one internal type can map to multiple external types (e.g., Long for Timestamp, TimestampNTZ, and LongType), the removal approach was taken.
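The ambiguity can be sketched in plain Scala. Note that DataType, LongType, and the other names below are simplified stand-ins for illustration, not Spark's actual catalyst classes: given only a bare internal Long, more than one SQL type is a valid candidate, so no unique type can be recovered.

```scala
// Minimal sketch of the type ambiguity; all names are illustrative
// stand-ins, not Spark's real classes.
object TypeAmbiguity {
  sealed trait DataType
  case object LongType extends DataType
  case object TimestampType extends DataType
  case object TimestampNTZType extends DataType

  // All SQL types whose internal (physical) representation is a Long.
  def candidateTypes(v: Any): Seq[DataType] = v match {
    case _: Long => Seq(LongType, TimestampType, TimestampNTZType) // ambiguous
    case _       => Seq.empty
  }

  def main(args: Array[String]): Unit = {
    // A bare internal Long cannot be resolved to a unique SQL type.
    println(candidateTypes(42L).mkString(", "))
  }
}
```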

Why are the changes needed?

To provide the error messages as intended.

Does this PR introduce any user-facing change?

Yes.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.DataTypes

val arrayStructureData = Seq(
  Row(Map("hair" -> "black", "eye" -> "brown")),
  Row(Map("hair" -> "blond", "eye" -> "blue")),
  Row(Map()))

val mapType = DataTypes.createMapType(StringType, StringType)

val arrayStructureSchema = new StructType().add("properties", mapType)

val mapTypeDF = spark.createDataFrame(
  spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)

spark.conf.set("spark.sql.ansi.enabled", true)
mapTypeDF.selectExpr("element_at(properties, 'hair')").show
```

Before:

```
Caused by: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE] The feature is not supported: literal for 'hair' of class org.apache.spark.unsafe.types.UTF8String.
  at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:241)
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99)
  at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:45)
  ...
```

After:

```
Caused by: org.apache.spark.SparkNoSuchElementException: [MAP_KEY_DOES_NOT_EXIST] Key 'hair' does not exist. To return NULL instead, use 'try_element_at'. If necessary set spark.sql.ansi.enabled to false to bypass this error.
== SQL(line 1, position 0) ==
element_at(properties, 'hair')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```

How was this patch tested?

A unit test was added. Otherwise, existing test cases should cover this.

@HyukjinKwon (Member, Author)

cc @gengliangwang @MaxGekk FYI

github-actions bot added the SQL label on Apr 26, 2022
@MaxGekk (Member) commented Apr 26, 2022

Literal(v) is intended to be used from end-users so it cannot handle our internal types such as UTF8String and ArrayBasedMapData.

When I added the two toSQLValue() methods, I intended the following usage:

  • def toSQLValue(v: Any, t: DataType) for internal values
  • def toSQLValue(v: Any): String for external values like String

If you need to pass an internal value somewhere, just use the first method. I don't get why you need to remove the second one.
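The two overloads can be modeled with a toy sketch. All classes below (UTF8String, DataType, StringType) are simplified stand-ins for illustration, not Spark's actual code: the typed overload can render both the external and internal forms, while the untyped one only recognizes external values, which is the failure mode this PR fixes.

```scala
// Toy model of the two toSQLValue overloads; simplified stand-ins only.
object ToSQLValueSketch {
  final case class UTF8String(s: String) // stand-in for the internal string type
  sealed trait DataType
  case object StringType extends DataType

  // Typed overload: given the SQL type, it renders both the external form
  // (String) and the internal form (UTF8String).
  def toSQLValue(v: Any, t: DataType): String = (v, t) match {
    case (s: String, StringType)     => s"'$s'"
    case (u: UTF8String, StringType) => s"'${u.s}'"
    case _ => throw new IllegalArgumentException(s"unsupported literal: $v")
  }

  // Untyped overload: only external values are recognized; an internal
  // UTF8String falls through to the error branch, as in the reported bug.
  def toSQLValue(v: Any): String = v match {
    case s: String => s"'$s'"
    case other => throw new IllegalArgumentException(
      s"literal for $other of class ${other.getClass.getName} unsupported")
  }
}
```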

@HyukjinKwon (Member, Author)

def toSQLValue(v: Any): String w/ external like String

Actually, I think this is error-prone (e.g., see the case reported and fixed here). With def toSQLValue(v: Any, t: DataType), you can use it everywhere without thinking about internal or external types.

@MaxGekk (Member) commented Apr 26, 2022

you can use it everywhere without thinking about internal or external

Now you have to convert String to UTF8String (and maybe other values) everywhere, which is inconvenient.

@MaxGekk (Member) commented Apr 26, 2022

Can't you just use the correct function without removing the other one, and add comments to the functions? Removing the second function seems unrelated to the fix.

@HyukjinKwon (Member, Author) commented Apr 26, 2022

Now, you have to convert String to UTF8String (maybe other values) everywhere which is inconvenient.

We actually don't need to convert. Both Literal.create(a: String, StringType) and Literal.create(a: UTF8String, StringType) work, but Literal(s: UTF8String) doesn't.
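That asymmetry can be sketched with a toy Literal (again, simplified stand-ins, not Spark's real catalyst classes): create(v, t) converts an external value to its internal form or accepts the internal form as-is, while the type-less apply(v) only recognizes external values.

```scala
// Toy sketch of Literal.create vs. Literal.apply; stand-in classes only.
object LiteralSketch {
  final case class UTF8String(s: String) // stand-in internal string type
  sealed trait DataType
  case object StringType extends DataType
  final case class Lit(value: Any, dataType: DataType)

  // With an explicit type, both external String and internal UTF8String
  // work: the external form is converted, the internal form kept as-is.
  def create(v: Any, t: DataType): Lit = (v, t) match {
    case (s: String, StringType)     => Lit(UTF8String(s), StringType)
    case (u: UTF8String, StringType) => Lit(u, StringType)
    case _ => throw new IllegalArgumentException(s"unsupported: $v")
  }

  // Without a type, only external values are recognized; the internal
  // form fails, mirroring Literal(s: UTF8String) in Spark.
  def apply(v: Any): Lit = v match {
    case s: String => Lit(UTF8String(s), StringType)
    case other => throw new RuntimeException(
      s"literal for $other of class ${other.getClass.getName} unsupported")
  }
}
```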

@HyukjinKwon (Member, Author)

This is another example of being error-prone.

@HyukjinKwon force-pushed the SPARK-39015 branch 2 times, most recently from b2de20a to 95b06a2, on April 26, 2022 at 09:19
@HyukjinKwon (Member, Author)

I thought it's easier to have one function to use everywhere versus having two. I can separate the fix if you still think it's arguable.

@MaxGekk (Member) commented Apr 26, 2022

I can separate the fix if you still think it's arguable.

It isn't needed, if it is possible to point out the type in all cases and toSQLValue() can handle "external" types as well.

@HyukjinKwon (Member, Author)

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L155-L164

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L500-L522

Here, all external types are listed; if any are missing, it's a bug. They are converted to Literal(internal value).
If the given value isn't an external type, it will just be Literal(v).

There seem to be two missing cases, Char and collection.mutable.WrappedArray, both of which were added recently. I think they forgot to fix it here together.

@HyukjinKwon (Member, Author)

There's another case this PR fixes.

@HyukjinKwon force-pushed the SPARK-39015 branch 3 times, most recently from 7a78184 to 02a5ea8, on April 26, 2022 at 23:56
@MaxGekk (Member) commented Apr 27, 2022

+1, LGTM. Merging to master.
Thank you, @HyukjinKwon.

@MaxGekk closed this in e49147a on Apr 27, 2022
@MaxGekk (Member) commented Apr 27, 2022

@HyukjinKwon Could you backport the changes to branch-3.3, please.

@HyukjinKwon (Member, Author)

sure

HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Apr 27, 2022
…cit type


Closes apache#36351 from HyukjinKwon/SPARK-39015.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit e49147a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
MaxGekk pushed a commit that referenced this pull request Apr 27, 2022
…explicit type

This PR is a backport of #36351; the commit message repeats the description of that pull request.

Closes #36375 from HyukjinKwon/SPARK-39015-3.3.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@HyukjinKwon HyukjinKwon deleted the SPARK-39015 branch January 15, 2024 00:50