[SPARK-54731][SQL] Support Reverse expression to handle BinaryType by performing raw byte-level reversal#55051
[SPARK-54731][SQL] Support Reverse expression to handle BinaryType by performing raw byte-level reversal#55051xiaoxuandev wants to merge 2 commits into
Conversation
|
Are there any reference that supports binary with reverse? |
|
@HyukjinKwon For internal consistency within Spark, Concat, Length, and OctetLength already accept both StringType and BinaryType, so Reverse is the odd one out. For external references, PostgreSQL 18 recently added reverse() support for binary type: https://neon.com/postgresql/postgresql-18/array-bytea-improvements |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Looks fine to me. Let's make CI happy. cc @cloud-fan
|
@HyukjinKwon Thanks for the review! CI passed after a retry. |
|
does |
|
I think he provided https://neon.com/postgresql/postgresql-18/array-bytea-improvements as an example. |
cloud-fan
left a comment
There was a problem hiding this comment.
The implementation is clean and follows the established pattern used by Length, BitLength, and Concat for BinaryType support. Both interpreted and codegen paths are correct, and test coverage is good.
One minor documentation gap: the PySpark reverse docstring (python/pyspark/sql/functions/builtin.py) still says "returns a reversed string or an array with elements in reverse order" without mentioning binary. Consider updating it in a follow-up.
| since = "1.5.0", | ||
| note = """ | ||
| Reverse logic for arrays is available since 2.4.0. | ||
| Reverse logic for binary is available since 4.2.0. |
There was a problem hiding this comment.
nit: The usage and note sections mention binary, but the examples section doesn't include a binary example. Length and BitLength both demonstrate binary usage with x'hex' syntax. Consider adding one here, e.g.:
> SELECT _FUNC_(x'CAFE');
FE CA
There was a problem hiding this comment.
Updated, thanks for the review!
…forming raw byte-level reversal ### What changes were proposed in this pull request? When Spark's `reverse` function is called on a BinaryType column, it previously threw a type mismatch error because BinaryType was not in the accepted input types. This patch adds BinaryType support to the Reverse expression. The changes: 1. Add BinaryType to the inputTypes TypeCollection in Reverse. 2. Add a BinaryType match case in the interpreted execution path (doReverse) using Scala's Array.reverse. 3. Add a BinaryType match case in the code generation path (doGenCode) with a dedicated binaryCodeGen method that performs byte-level reversal via a for loop. 4. Update @ExpressionDescription to document binary support (since 4.2.0), fix parameter name from `array` to `expr`. 5. Update data type mismatch error test expectations to include BINARY in the accepted types. 6. Add end-to-end SQL test for reverse on BinaryType in DataFrameFunctionsSuite covering both interpreted and codegen paths. ### Why are the changes needed? Without this fix, calling `reverse` on a BinaryType column fails with a type error. This is inconsistent with other Spark functions like `length` and `concat` that already support BinaryType. ### Does this PR introduce _any_ user-facing change? Yes. The `reverse` function now accepts BinaryType input and returns the byte-reversed binary value. Previously this would throw an AnalysisException. ### How was this patch tested? - Unit tests in CollectionExpressionsSuite: multi-byte reversal, empty byte array, single byte, byte array containing 0x00, null input. - End-to-end test in DataFrameFunctionsSuite: reverse on BinaryType column through both interpreted and codegen (cached) paths. - Updated data type mismatch tests to reflect the new accepted type set. - Scalastyle and ExpressionsSchemaSuite pass. ### Was this patch authored or co-authored using generative AI tooling? Yes, co-authored with Kiro.
|
Merged to master. |
What changes were proposed in this pull request?
When Spark's
reversefunction is called on a BinaryType column, it previously threw a type mismatch error because BinaryType was not in the accepted input types. This patch adds BinaryType support to the Reverse expression.The changes:
Why are the changes needed?
Without this fix, calling
reverseon a BinaryType column fails with a type error. This is inconsistent with other Spark functions likelengthandconcatthat already support BinaryType.Does this PR introduce any user-facing change?
Yes. The
reversefunction now accepts BinaryType input and returns the byte-reversed binary value. Previously this would throw an AnalysisException.How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored with Kiro.