Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46187][SQL] Align codegen and non-codegen implementation of StringDecode #44094

Closed
wants to merge 2 commits into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Nov 30, 2023

What changes were proposed in this pull request?

In the PR, I propose to change the implementation of interpretation mode of StringDecode and apparently of the decode function. And make it consistent to codegen. Both implementation raise the same error with of the error class INVALID_PARAMETER_VALUE.CHARSET.

Why are the changes needed?

To make codegen and non-codegen of the StringDecode expression consistent. So, users will observe the same behaviour in both modes.

Does this PR introduce any user-facing change?

Yes, if user code depends on error from decode().

How was this patch tested?

By running the following test suites:

$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z string-functions.sql"
$ build/sbt "core/testOnly *SparkThrowableSuite"
$ build/sbt "test:testOnly *.StringFunctionsSuite"

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Nov 30, 2023
@MaxGekk MaxGekk changed the title [WIP][SQL] Align codegen and non-codegen implementation of StringDecode [SPARK-46187][SQL] Align codegen and non-codegen implementation of StringDecode Nov 30, 2023
@MaxGekk MaxGekk marked this pull request as ready for review November 30, 2023 17:45
@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 30, 2023

@srielau @cloud-fan One more expression where need align codegen/non-codegen and restrict encodings.

@@ -2648,18 +2648,26 @@ case class StringDecode(bin: Expression, charset: Expression)

protected override def nullSafeEval(input1: Any, input2: Any): Any = {
val fromCharset = input2.asInstanceOf[UTF8String].toString
UTF8String.fromString(new String(input1.asInstanceOf[Array[Byte]], fromCharset))
try {
UTF8String.fromString(new String(input1.asInstanceOf[Array[Byte]], fromCharset))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are here, can we rewrite this expression using StaticInvoke? To avoid this problem completely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same to Encode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we introduce the legacyCharsets and supportedCharsets too?

Copy link
Member Author

@MaxGekk MaxGekk Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we introduce the legacyCharsets and supportedCharsets too?

Yep, let me merge this and introduce the restrictions + tests, update of the migration guide and so on.

@MaxGekk
Copy link
Member Author

MaxGekk commented Dec 1, 2023

Merging to master. Thank you, @cloud-fan and @beliefer for review.

@MaxGekk MaxGekk closed this in e93bff6 Dec 1, 2023
@beliefer
Copy link
Contributor

beliefer commented Dec 1, 2023

Merging to master. Thank you, @cloud-fan and @beliefer for review.

Late LGTM.

asl3 pushed a commit to asl3/spark that referenced this pull request Dec 5, 2023
…tringDecode`

### What changes were proposed in this pull request?
In the PR, I propose to change the implementation of interpretation mode of `StringDecode` and apparently of the `decode` function. And make it consistent to codegen. Both implementation raise the same error with of the error class `INVALID_PARAMETER_VALUE.CHARSET`.

### Why are the changes needed?
To make codegen and non-codegen of the `StringDecode` expression consistent. So, users will observe the same behaviour in both modes.

### Does this PR introduce _any_ user-facing change?
Yes, if user code depends on error from `decode()`.

### How was this patch tested?
By running the following test suites:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z string-functions.sql"
$ build/sbt "core/testOnly *SparkThrowableSuite"
$ build/sbt "test:testOnly *.StringFunctionsSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#44094 from MaxGekk/align-codegen-stringdecode.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants