Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-37047][SQL][FOLLOWUP] Add legacy flag for the breaking change of lpad and rpad for binary type #36103

Closed
wants to merge 6 commits into from

Conversation

anchovYu
Copy link
Contributor

@anchovYu anchovYu commented Apr 7, 2022

What changes were proposed in this pull request?

Add a legacy flag spark.sql.legacy.lpadRpadAlwaysReturnString for the breaking change introduced in #34154.

The flag is disabled by default. When it is enabled, restore the pre-change behavior that there is no special handling on BINARY input types.

Why are the changes needed?

The original commit is a breaking change, and breaking changes should be encouraged to add a flag to turn it off for smooth migration between versions.

Does this PR introduce any user-facing change?

With the default value of the conf, there is no user-facing difference.
If users turn this conf on, they can restore the pre-change behavior.

How was this patch tested?

Through unit tests.

@github-actions github-actions bot added the SQL label Apr 7, 2022
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@anchovYu anchovYu changed the title [WIP][SPARK-37047][SQL][FOLLOWUP] Add legacy flag for the breaking change of lpad and rpad for binary type [SPARK-37047][SQL][FOLLOWUP] Add legacy flag for the breaking change of lpad and rpad for binary type Apr 8, 2022
@anchovYu anchovYu marked this pull request as ready for review April 8, 2022 19:40
@anchovYu
Copy link
Contributor Author

anchovYu commented Apr 8, 2022

Hi @mkaravel @cloud-fan , would you take a look? Thank you!

@@ -3709,6 +3709,17 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val LPAD_RPAD_FOR_BINARY_TYPE =
buildConf("spark.sql.legacy.lpadRpadForBinaryType.enabled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the config name sounds like it goes to the legacy behavior for binary input if it's set it true. Maybe we should use false as its default value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to

val LEGACY_LPAD_RPAD_BINARY_TYPE_AS_STRING =
    buildConf("spark.sql.legacy.lpadRpadBinaryTypeAsString")

with default value false. Sounds better?

@@ -2776,7 +2776,11 @@ object functions {
* @since 3.3.0
*/
def lpad(str: Column, len: Int, pad: Array[Byte]): Column = withExpr {
BinaryPad("lpad", str.expr, lit(len).expr, lit(pad).expr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's just create UnresolvedFunction here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean when the legacy is restored, and pad is binary? Why is it a UnresolvedFunction?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using UnresolveFunction, we will go through the function look up code path and your change in PadExpressionBuilderBase will kick in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thanks, never know that.

@@ -108,6 +111,46 @@ SELECT lpad(x'57', 5, 'abc');
SELECT rpad('abc', 5, x'57');
SELECT rpad(x'57', 5, 'abc');

-- Dsiable the lpad rpad binary breaking change
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's probably an overkill to test the flag like this... I think the change in StringExpressionsSuite is good enough

Seq(Literal("hi"), Literal(5), Literal("somepadding")))

if (SQLConf.get.getConf(SQLConf.LPAD_RPAD_FOR_BINARY_TYPE)) {
assert(lpadExp1.asInstanceOf[BinaryPad] ==
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we don't need .asInstanceOf[BinaryPad]

assert(lpadExp1.asInstanceOf[BinaryPad] ==
BinaryPad("lpad", Literal(Array[Byte]()), Literal(5), Literal(Array[Byte](0))))
assert(lpadExp2.asInstanceOf[StringLPad] ==
StringLPad(Literal("hi"), Literal(5), Literal(" ")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we put the ones that have the same result before the if-else? to highlight what gets changed if we flip the config.

@anchovYu anchovYu requested a review from cloud-fan April 12, 2022 00:13
@@ -3709,6 +3709,18 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val LEGACY_LPAD_RPAD_BINARY_TYPE_AS_STRING =
buildConf("spark.sql.legacy.lpadRpadBinaryTypeAsString")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about spark.sql.legacy.lpadRpadAlwaysReturnString? If true, it's the legacy behavior and the pad function always return string even for binary inputs. If false, the function returns binary if the first argument and the padding pattern parameter are both binary type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! This sounds better. Naming is hard ..

@anchovYu anchovYu requested a review from cloud-fan April 12, 2022 06:40
@@ -74,6 +74,9 @@ SELECT btrim(encode('xxxbarxxx', 'utf-8'), encode('x', 'utf-8'));
SELECT lpad('hi', 'invalid_length');
SELECT rpad('hi', 'invalid_length');

-- Enable the lpad rpad function for binary input type
SET spark.sql.legacy.lpadRpadAlwaysReturnString=false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to change this sql file

@anchovYu
Copy link
Contributor Author

The test job running: https://github.com/anchovYu/spark/actions/runs/2156689301

@cloud-fan
Copy link
Contributor

cloud-fan commented Apr 13, 2022

LGTM, let's update the migration guide to mention this conf

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.3 (last commit only changes migration guide)

@cloud-fan cloud-fan closed this in e2683c2 Apr 13, 2022
cloud-fan pushed a commit that referenced this pull request Apr 13, 2022
…of lpad and rpad for binary type

### What changes were proposed in this pull request?
Add a legacy flag `spark.sql.legacy.lpadRpadForBinaryType.enabled` for the breaking change introduced in #34154.

The flag is enabled by default. When it is disabled, restore the pre-change behavior that there is no special handling on `BINARY` input types.

### Why are the changes needed?
The original commit is a breaking change, and breaking changes should be encouraged to add a flag to turn it off for smooth migration between versions.

### Does this PR introduce _any_ user-facing change?
With the default value of the conf, there is no user-facing difference.
If users turn this conf off, they can restore the pre-change behavior.

### How was this patch tested?
Through unit tests.

Closes #36103 from anchovYu/flags-lpad-rpad-binary.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e2683c2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@github-actions github-actions bot added the DOCS label Apr 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants