[SPARK-37047][SQL] Add lpad and rpad functions for binary strings #34154
What's the use case for this, given that we would have to justify a breaking change (albeit to behavior that was probably never intended to work)?
From my point of view, the old behavior is wrong and would have to be fixed anyway. Beyond that, the main scenario is one where a user has a BINARY column whose values have different lengths and wants to "align" all values either to the left or to the right. These overloads allow the user to do that.
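The "align" scenario above can be sketched in plain Java. This is a minimal illustration of byte-wise padding, not the code from `ByteArray.java` in this PR: the class name `BinaryPadSketch`, the method signatures, and the handling of an empty `pad` (truncate only) are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; not the actual Spark implementation.
final class BinaryPadSketch {

    // Returns exactly `len` bytes: `bytes` right-aligned, with the `pad`
    // pattern repeated on the left. If `bytes` is already long enough
    // (or `pad` is empty, an assumption of this sketch), it is only truncated.
    static byte[] lpad(byte[] bytes, int len, byte[] pad) {
        if (len <= 0) {
            return new byte[0];
        }
        if (len <= bytes.length || pad.length == 0) {
            return Arrays.copyOf(bytes, Math.min(len, bytes.length));
        }
        byte[] out = new byte[len];
        int fill = len - bytes.length;
        for (int i = 0; i < fill; i++) {
            out[i] = pad[i % pad.length];
        }
        System.arraycopy(bytes, 0, out, fill, bytes.length);
        return out;
    }

    // Mirror image of lpad: `bytes` left-aligned, `pad` repeated on the right.
    static byte[] rpad(byte[] bytes, int len, byte[] pad) {
        if (len <= 0) {
            return new byte[0];
        }
        if (len <= bytes.length || pad.length == 0) {
            return Arrays.copyOf(bytes, Math.min(len, bytes.length));
        }
        byte[] out = Arrays.copyOf(bytes, len);
        for (int i = bytes.length; i < len; i++) {
            out[i] = pad[(i - bytes.length) % pad.length];
        }
        return out;
    }

    // Renders a byte array as upper-case hex for printing.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] value = {(byte) 0xAB, (byte) 0xCD};
        byte[] zero = {0x00};
        System.out.println(hex(lpad(value, 5, zero))); // 000000ABCD
        System.out.println(hex(rpad(value, 5, zero))); // ABCD000000
        System.out.println(hex(lpad(value, 1, zero))); // AB
    }
}
```

Because the padding works on raw bytes, the result never has to be interpreted as text, which is the point of returning BINARY rather than a string.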
LGTM. Since it's a breaking change, let's add an item in docs/sql-migration-guide.md
Simplified the way the default padding value for BINARY is defined.
Done.
thanks, merging to master!
…nd pad are different types

### What changes were proposed in this pull request?
This is a followup of #34154. Currently, lpad/rpad throws a class cast exception at runtime if the parameters `str` and `pad` are of different types (one is STRING and the other is BINARY). This PR makes it fail during analysis.

### Why are the changes needed?
To fail earlier for invalid function calls.

### Does this PR introduce _any_ user-facing change?
No, the new lpad/rpad change is not released yet.

### How was this patch tested?
New tests.

Closes #34370 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…of lpad and rpad for binary type

### What changes were proposed in this pull request?
Add a legacy flag `spark.sql.legacy.lpadRpadForBinaryType.enabled` for the breaking change introduced in #34154.

The flag is enabled by default. When it is disabled, the pre-change behavior is restored, in which there is no special handling of `BINARY` input types.

### Why are the changes needed?
The original commit is a breaking change, and breaking changes should come with a flag to turn them off, for smooth migration between versions.

### Does this PR introduce _any_ user-facing change?
With the default value of the conf, there is no user-facing difference. If users turn this conf off, they can restore the pre-change behavior.

### How was this patch tested?
Through unit tests.

Closes #36103 from anchovYu/flags-lpad-rpad-binary.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…of lpad and rpad for binary type

### What changes were proposed in this pull request?
Add a legacy flag `spark.sql.legacy.lpadRpadForBinaryType.enabled` for the breaking change introduced in #34154.

The flag is enabled by default. When it is disabled, the pre-change behavior is restored, in which there is no special handling of `BINARY` input types.

### Why are the changes needed?
The original commit is a breaking change, and breaking changes should come with a flag to turn them off, for smooth migration between versions.

### Does this PR introduce _any_ user-facing change?
With the default value of the conf, there is no user-facing difference. If users turn this conf off, they can restore the pre-change behavior.

### How was this patch tested?
Through unit tests.

Closes #36103 from anchovYu/flags-lpad-rpad-binary.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e2683c2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR overloads the `lpad` and `rpad` functions to work correctly with BINARY string inputs.

### Why are the changes needed?
The current behavior of the `lpad` and `rpad` functions is problematic. BINARY string inputs get converted automatically to UTF8 strings and then padded. The result can be an invalid UTF8 string.

### Does this PR introduce any user-facing change?
Yes. We are adding overloads for `lpad` and `rpad` for BINARY strings. This PR should be viewed as a breaking change in the sense that the result of `lpad` and `rpad` for BINARY string inputs is now a BINARY string, as opposed to the previous behavior, which was returning a UTF8 string.

### How was this patch tested?
Unit tests.
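
To illustrate why padding BINARY input as if it were a UTF8 string is a problem, the following stand-alone Java sketch checks whether a padded byte sequence is well-formed UTF-8. It does not use Spark; the class name `Utf8ValiditySketch`, the helper `isValidUtf8`, and the sample bytes are assumptions chosen for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not part of the Spark code base.
final class Utf8ValiditySketch {

    // Returns true only if `bytes` decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Arbitrary binary data: 0xFF and 0xFE never occur in valid UTF-8.
        // Left-padding it with ASCII spaces (0x20), as string padding would,
        // still leaves a sequence that is not a valid UTF-8 string.
        byte[] paddedBinary = {0x20, 0x20, 0x20, (byte) 0xFF, (byte) 0xFE};
        // The same padding applied to genuine text is fine.
        byte[] paddedText = "   hi".getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(paddedBinary)); // false
        System.out.println(isValidUtf8(paddedText));   // true
    }
}
```

Returning BINARY from the binary overloads avoids ever labeling such bytes as a string.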