Conversation
@Jefffrey Jefffrey commented Nov 17, 2025

Which issue does this PR close?

Part of #12725

Rationale for this change

Started refactoring the encoding functions (encode, decode) to remove the user-defined signature (per the linked issue). In the process, however, I discovered they were buggy in handling certain inputs. For example, on main we get these errors:

DataFusion CLI v51.0.0
> select encode(arrow_cast(column1, 'LargeUtf8'), 'hex') from values ('a'), ('b');
Internal error: Function 'encode' returned value of type 'Utf8' while the following type was promised at planning time and expected: 'LargeUtf8'.
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
> select encode(arrow_cast(column1, 'LargeBinary'), 'hex') from values ('a'), ('b');
Internal error: Function 'encode' returned value of type 'Utf8' while the following type was promised at planning time and expected: 'LargeUtf8'.
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
> select encode(arrow_cast(column1, 'BinaryView'), 'hex') from values ('a'), ('b');
Error during planning: Execution error: Function 'encode' user-defined coercion failed with "Error during planning: 1st argument should be Utf8 or Binary or Null, got BinaryView" No function matches the given name and argument types 'encode(BinaryView, Utf8)'. You might need to add explicit type casts.
        Candidate functions:
        encode(UserDefined)
  • LargeUtf8/LargeBinary array inputs are broken
  • BinaryView input not supported (but Utf8View input is supported)

So I went about fixing this input handling, as well as doing various refactors to simplify the code.

(I also discovered #18746 in the process of this refactor).

What changes are included in this PR?

Refactor signatures away from user-defined to the signature coercion API. Importantly, we now accept only binary inputs, letting string inputs be coerced by type coercion. This simplifies the internal code of encode/decode to only need to consider binary inputs, instead of duplicating essentially identical code for string inputs (for string inputs we just grabbed the underlying bytes anyway).

Consolidated the inner functions used by encode/decode to simplify/inline where possible.

Are these changes tested?

Added new SLTs.

Are there any user-facing changes?

No.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Nov 17, 2025
@Jefffrey Jefffrey left a comment

Test failure should be fixed by #18750

    pub fn new() -> Self {
        Self {
-           signature: Signature::user_defined(Volatility::Immutable),
+           signature: Signature::coercible(
Signature change here; same as for decode


-fn hex_encode(input: &[u8]) -> String {
-    hex::encode(input)
+fn encode_array(array: &ArrayRef, encoding: Encoding) -> Result<ColumnarValue> {

These encode_scalar, encode_array, decode_scalar, decode_array methods should now be simpler, as we've removed consideration of string types (we expect the signature to coerce them to binary for us).
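The idea that one binary code path suffices can be sketched in plain Rust (std-only; hypothetical helper, not the PR's actual code): hex encoding only ever looks at raw bytes, so a string input coerced to binary encodes identically.

```rust
/// Hex-encode raw bytes. A Utf8 value coerced to Binary yields the same
/// underlying bytes, so a single byte-oriented path covers both (sketch).
fn hex_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len() * 2);
    for byte in input {
        out.push_str(&format!("{byte:02x}"));
    }
    out
}

fn main() {
    // "ab" as a string and b"ab" as binary share the same bytes.
    assert_eq!(hex_encode("ab".as_bytes()), "6162");
    assert_eq!(hex_encode(b"ab"), "6162");
}
```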

Comment on lines +269 to 282
/// Estimate how many bytes are actually represented by the array; in case
/// the array slices its internal buffer, this returns the byte size of that
/// slice but not the byte size of the entire buffer.
///
/// This is an estimation only as it can estimate higher if null slots are non-zero
/// sized.
fn estimate_byte_data_size<O: OffsetSizeTrait>(array: &GenericBinaryArray<O>) -> usize {
let offsets = array.value_offsets();
// Unwraps are safe as the offset buffer should always have at least 1 element
let start = *offsets.first().unwrap();
let end = *offsets.last().unwrap();
let data_size = end - start;
data_size.as_usize()
}

Previously we just took the length of the values buffer; this estimation should be more conservative in the best case, and in the worst case we'd just have the same estimation as before.
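The offset math can be illustrated std-only (hypothetical helper operating on a bare offsets slice rather than a GenericBinaryArray): the bytes a sliced offset-based array actually references are bounded by its first and last offsets, not by the length of the shared values buffer.

```rust
/// Byte size represented by an offset-based binary array, derived from
/// the first and last offsets (sketch; offsets stand in for the array).
fn estimate_byte_data_size(offsets: &[i32]) -> usize {
    // Offset buffers always hold at least one element (len + 1 offsets).
    let start = *offsets.first().unwrap();
    let end = *offsets.last().unwrap();
    (end - start) as usize
}

fn main() {
    // Values "aa", "bbb" -> offsets [0, 2, 5] -> 5 bytes represented.
    assert_eq!(estimate_byte_data_size(&[0, 2, 5]), 5);
    // Slicing off the first value leaves offsets [2, 5] -> 3 bytes,
    // even though the underlying buffer still holds all 5.
    assert_eq!(estimate_byte_data_size(&[2, 5]), 3);
}
```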

Comment on lines +292 to +294
// Don't know if there is a more strict upper bound we can infer
// for view arrays byte data size.
encoding.decode_array::<_, i32>(&array, array.get_buffer_memory_size())

For views I don't know if we can do a better estimation of how many actual bytes are represented by the array

Comment on lines +319 to +322
impl TryFrom<&ColumnarValue> for Encoding {
type Error = DataFusionError;

fn try_from(encoding: &ColumnarValue) -> Result<Self> {

I merged the FromStr into this, since ColumnarValue is pretty much the main thing we care about parsing an Encoding from (no need for a two-step process of extracting a string from the ColumnarValue and then parsing that string; just consolidate into a single TryFrom).
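The consolidation pattern can be sketched std-only (parsing from &str here instead of &ColumnarValue, and a plain String error instead of DataFusionError; the variant names mirror the encodings encode/decode support):

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq)]
enum Encoding {
    Base64,
    Hex,
}

// One TryFrom replaces the FromStr-plus-extraction two-step (sketch).
impl TryFrom<&str> for Encoding {
    type Error = String;

    fn try_from(name: &str) -> Result<Self, Self::Error> {
        match name.to_ascii_lowercase().as_str() {
            "base64" => Ok(Encoding::Base64),
            "hex" => Ok(Encoding::Hex),
            other => Err(format!("unknown encoding: {other}")),
        }
    }
}

fn main() {
    assert_eq!(Encoding::try_from("hex"), Ok(Encoding::Hex));
    assert!(Encoding::try_from("utf-7").is_err());
}
```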

}
// We reserved an upper bound size for the values buffer, but we only use the actual size
values.truncate(total_bytes_decoded);
let binary_array = GenericBinaryArray::<OutputOffset>::try_new(

Previously we always returned BinaryArray, which is what led to the issues for large types; now we consider whether the input was large and return accordingly (views are considered a small type).
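The reserve-then-truncate pattern in the snippet can be sketched std-only (hypothetical helper; a naive hex decode stands in for the real decode step):

```rust
/// Decode hex inputs into one values buffer: reserve an upper bound
/// (one output byte per hex pair), write decoded bytes, then truncate
/// to the bytes actually produced (sketch).
fn hex_decode_all(inputs: &[&str]) -> Vec<u8> {
    let upper_bound: usize = inputs.iter().map(|s| s.len() / 2).sum();
    let mut values = vec![0u8; upper_bound];
    let mut total_bytes_decoded = 0;
    for input in inputs {
        for pair in input.as_bytes().chunks_exact(2) {
            let hi = (pair[0] as char).to_digit(16).unwrap() as u8;
            let lo = (pair[1] as char).to_digit(16).unwrap() as u8;
            values[total_bytes_decoded] = hi * 16 + lo;
            total_bytes_decoded += 1;
        }
    }
    // We reserved an upper-bound size; keep only the actual decoded bytes.
    values.truncate(total_bytes_decoded);
    values
}

fn main() {
    assert_eq!(hex_decode_all(&["6162", "63"]), b"abc".to_vec());
}
```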
