Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-37047][SQL] Add lpad and rpad functions for binary strings #34154

Closed
wants to merge 7 commits into from

Conversation

mkaravel
Copy link
Contributor

@mkaravel mkaravel commented Sep 30, 2021

What changes were proposed in this pull request?

This PR overloads the lpad and rpad functions to work correctly with BINARY string inputs.

Why are the changes needed?

The current behavior of the lpad and rpad functions is problematic. BINARY string inputs get converted automatically to UTF8 strings and then padded. The result can be an invalid UTF8 string.

Does this PR introduce any user-facing change?

Yes. We are adding overloads for lpad and rpad for BINARY strings. This PR should be viewed as a breaking change in the sense that the result of lpad and rpad for BINARY string inputs is now BINARY string, as opposed to the previous behavior which was returning a UTF8 string.

How was this patch tested?

Unit tests.

@github-actions github-actions bot added the SQL label Sep 30, 2021
@mkaravel mkaravel changed the title [WIP] Add lpad and rpad functions for binary strings [WIP][SQL] Add lpad and rpad functions for binary strings Sep 30, 2021
@SparkQA
Copy link

SparkQA commented Sep 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48267/

@SparkQA
Copy link

SparkQA commented Sep 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48267/

@SparkQA
Copy link

SparkQA commented Sep 30, 2021

Test build #143756 has finished for PR 34154 at commit 36cf3c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mkaravel mkaravel changed the title [WIP][SQL] Add lpad and rpad functions for binary strings [SPARK-37047][SQL] Add lpad and rpad functions for binary strings Oct 18, 2021
@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48850/

@sigmod
Copy link
Contributor

sigmod commented Oct 18, 2021

@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48850/

@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48852/

@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Test build #144376 has finished for PR 34154 at commit 9324674.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48852/

@SparkQA
Copy link

SparkQA commented Oct 18, 2021

Test build #144378 has finished for PR 34154 at commit 7a52529.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48875/

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48875/

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Test build #144401 has finished for PR 34154 at commit 5a4e114.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case for this? given that we would have to justify a breaking change (of behavior that was probably not intended to work anyway though)

@mkaravel
Copy link
Contributor Author

What's the use case for this? given that we would have to justify a breaking change (of behavior that was probably not intended to work anyway though)

From my point of view, the old behavior is wrong and would have to be fixed anyways. Besides this, the main scenario is one where a user has a BINARY column with different lengths and they want to "align" all values either to the left or to the right. These overloads allow the user to do that.

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48889/

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48889/

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Test build #144415 has finished for PR 34154 at commit 1f34b1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Since it's a breaking change, let's add an item in docs/sql-migration-guide.md

Simplified the way the default padding value for BINARY is defined.
@mkaravel
Copy link
Contributor Author

LGTM. Since it's a breaking change, let's add an item in docs/sql-migration-guide.md

Done.

@github-actions github-actions bot added the DOCS label Oct 20, 2021
@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48938/

@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48938/

@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48940/

@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Test build #144465 has finished for PR 34154 at commit fdc9207.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48940/

@SparkQA
Copy link

SparkQA commented Oct 20, 2021

Test build #144467 has finished for PR 34154 at commit 9027989.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 01bdf84 Oct 21, 2021
HyukjinKwon pushed a commit that referenced this pull request Oct 25, 2021
…nd pad are different types

### What changes were proposed in this pull request?

This is a followup of #34154 . Now lpad/rpad throws class cast exception at runtime if the parameter `str` and `pad` are different types (one is STRING and the other is BINARY). This PR makes it fail during analysis.

### Why are the changes needed?

fail earlier for invalid functions.

### Does this PR introduce _any_ user-facing change?

no, the new lpad/rad change is not released yet.

### How was this patch tested?

new tests

Closes #34370 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
cloud-fan pushed a commit that referenced this pull request Apr 13, 2022
…of lpad and rpad for binary type

### What changes were proposed in this pull request?
Add a legacy flag `spark.sql.legacy.lpadRpadForBinaryType.enabled` for the breaking change introduced in #34154.

The flag is enabled by default. When it is disabled, restore the pre-change behavior that there is no special handling on `BINARY` input types.

### Why are the changes needed?
The original commit is a breaking change, and breaking changes should be encouraged to add a flag to turn it off for smooth migration between versions.

### Does this PR introduce _any_ user-facing change?
With the default value of the conf, there is no user-facing difference.
If users turn this conf off, they can restore the pre-change behavior.

### How was this patch tested?
Through unit tests.

Closes #36103 from anchovYu/flags-lpad-rpad-binary.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Apr 13, 2022
…of lpad and rpad for binary type

### What changes were proposed in this pull request?
Add a legacy flag `spark.sql.legacy.lpadRpadForBinaryType.enabled` for the breaking change introduced in #34154.

The flag is enabled by default. When it is disabled, restore the pre-change behavior that there is no special handling on `BINARY` input types.

### Why are the changes needed?
The original commit is a breaking change, and breaking changes should be encouraged to add a flag to turn it off for smooth migration between versions.

### Does this PR introduce _any_ user-facing change?
With the default value of the conf, there is no user-facing difference.
If users turn this conf off, they can restore the pre-change behavior.

### How was this patch tested?
Through unit tests.

Closes #36103 from anchovYu/flags-lpad-rpad-binary.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e2683c2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants