[SPARK-46187][SQL] Align codegen and non-codegen implementation of `StringDecode` #44094

MaxGekk · 2023-11-30T16:48:51Z

What changes were proposed in this pull request?

In this PR, I propose to change the interpreted-mode implementation of StringDecode, and therefore of the decode function, to make it consistent with codegen. Both implementations now raise the same error, with the error class INVALID_PARAMETER_VALUE.CHARSET.
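A minimal sketch of the idea, not the code from this PR: if both evaluation paths go through one helper that converts the JVM's `UnsupportedEncodingException` into a single error type, they cannot diverge on an invalid charset. `InvalidCharsetError` and `DecodeSketch` are hypothetical names used only for illustration, and the message text is made up; Spark's real error carries the class INVALID_PARAMETER_VALUE.CHARSET.

```scala
import java.io.UnsupportedEncodingException

// Hypothetical stand-in for Spark's error with class INVALID_PARAMETER_VALUE.CHARSET.
final case class InvalidCharsetError(functionName: String, charset: String)
  extends RuntimeException(s"invalid charset '$charset' passed to $functionName")

object DecodeSketch {
  // One shared helper: any caller (interpreted or generated code) fails the same way.
  def decode(bytes: Array[Byte], charset: String): String =
    try {
      // new String(bytes, charsetName) throws UnsupportedEncodingException
      // when the JVM does not recognise the charset name.
      new String(bytes, charset)
    } catch {
      case _: UnsupportedEncodingException =>
        throw InvalidCharsetError("decode", charset)
    }
}
```

With this shape, `DecodeSketch.decode("abc".getBytes("UTF-8"), "UTF-8")` returns the decoded string, while an unknown charset name surfaces as `InvalidCharsetError` rather than a raw JVM exception.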

Why are the changes needed?

To make the codegen and non-codegen implementations of the StringDecode expression consistent, so that users observe the same behaviour in both modes.

Does this PR introduce any user-facing change?

Yes, if user code depends on the error raised by decode().

How was this patch tested?

By running the following test suites:

$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z string-functions.sql"
$ build/sbt "core/testOnly *SparkThrowableSuite"
$ build/sbt "test:testOnly *.StringFunctionsSuite"

Was this patch authored or co-authored using generative AI tooling?

No.

MaxGekk · 2023-11-30T17:52:53Z

@srielau @cloud-fan One more expression where we need to align codegen/non-codegen and restrict encodings.

cloud-fan · 2023-11-30T23:14:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

@@ -2648,18 +2648,26 @@ case class StringDecode(bin: Expression, charset: Expression)

  protected override def nullSafeEval(input1: Any, input2: Any): Any = {
    val fromCharset = input2.asInstanceOf[UTF8String].toString
-    UTF8String.fromString(new String(input1.asInstanceOf[Array[Byte]], fromCharset))
+    try {
+      UTF8String.fromString(new String(input1.asInstanceOf[Array[Byte]], fromCharset))


Since we are here, can we rewrite this expression using StaticInvoke? That would avoid this problem completely.

The same applies to Encode.

Shall we introduce the legacyCharsets and supportedCharsets too?

Yep, let me merge this and introduce the restrictions + tests, update of the migration guide and so on.
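As a rough illustration of what such a restriction could look like (a sketch under assumptions, not the follow-up implementation): the names `supportedCharsets` and `legacyCharsets` come from the question above, but their semantics here are guessed, and the set contents are purely illustrative.

```scala
import java.util.Locale

object CharsetCheckSketch {
  // Illustrative allowlist only; the actual supported set is not stated in this thread.
  val supportedCharsets: Set[String] = Set("US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16")

  // Assumption: `legacyCharsets = true` acts as an opt-out that skips the allowlist,
  // leaving validation to the JVM as before.
  def isAllowed(charset: String, legacyCharsets: Boolean): Boolean =
    legacyCharsets || supportedCharsets.contains(charset.toUpperCase(Locale.ROOT))
}
```

A check like this would run before decoding, so a rejected charset could be reported with the same INVALID_PARAMETER_VALUE.CHARSET error class in both evaluation modes.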

MaxGekk · 2023-12-01T09:41:01Z

Merging to master. Thank you, @cloud-fan and @beliefer, for the review.

beliefer · 2023-12-01T10:01:47Z

Late LGTM.

…tringDecode`

### What changes were proposed in this pull request?
In the PR, I propose to change the implementation of interpretation mode of `StringDecode` and apparently of the `decode` function. And make it consistent to codegen. Both implementation raise the same error with of the error class `INVALID_PARAMETER_VALUE.CHARSET`.

### Why are the changes needed?
To make codegen and non-codegen of the `StringDecode` expression consistent. So, users will observe the same behaviour in both modes.

### Does this PR introduce _any_ user-facing change?
Yes, if user code depends on error from `decode()`.

### How was this patch tested?
By running the following test suites:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z string-functions.sql"
$ build/sbt "core/testOnly *SparkThrowableSuite"
$ build/sbt "test:testOnly *.StringFunctionsSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#44094 from MaxGekk/align-codegen-stringdecode.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

Align codegen and non-codegen implementation of Decode

9ed4272

github-actions bot added the SQL label Nov 30, 2023

Trigger build

dcaa4f6

MaxGekk changed the title ~~[WIP][SQL] Align codegen and non-codegen implementation of StringDecode~~ [SPARK-46187][SQL] Align codegen and non-codegen implementation of StringDecode Nov 30, 2023

MaxGekk marked this pull request as ready for review November 30, 2023 17:45

MaxGekk requested review from beliefer and cloud-fan November 30, 2023 19:52

cloud-fan approved these changes Nov 30, 2023

cloud-fan reviewed Nov 30, 2023

MaxGekk closed this in e93bff6 Dec 1, 2023
