Conversation
@Jefffrey Jefffrey commented Nov 17, 2025

Which issue does this PR close?

Part of #12725

Rationale for this change

Started refactoring the encoding functions (encode, decode) to remove the user-defined signature (per the linked issue). In the process, however, I discovered they were buggy in handling certain inputs. For example, on main we get these errors:

DataFusion CLI v51.0.0
> select encode(arrow_cast(column1, 'LargeUtf8'), 'hex') from values ('a'), ('b');
Internal error: Function 'encode' returned value of type 'Utf8' while the following type was promised at planning time and expected: 'LargeUtf8'.
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
> select encode(arrow_cast(column1, 'LargeBinary'), 'hex') from values ('a'), ('b');
Internal error: Function 'encode' returned value of type 'Utf8' while the following type was promised at planning time and expected: 'LargeUtf8'.
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
> select encode(arrow_cast(column1, 'BinaryView'), 'hex') from values ('a'), ('b');
Error during planning: Execution error: Function 'encode' user-defined coercion failed with "Error during planning: 1st argument should be Utf8 or Binary or Null, got BinaryView" No function matches the given name and argument types 'encode(BinaryView, Utf8)'. You might need to add explicit type casts.
        Candidate functions:
        encode(UserDefined)
  • LargeUtf8/LargeBinary array inputs are broken
  • BinaryView input not supported (but Utf8View input is supported)

So I went about fixing this input handling, as well as doing various refactors to simplify the code.

(I also discovered #18746 in the process of this refactor).

What changes are included in this PR?

Refactor signatures away from user-defined to the signature coercion API. Importantly, we now accept only binary inputs, letting string inputs be coerced by type coercion. This simplifies the internal code of encode/decode to only need to consider binary inputs, instead of duplicating essentially identical code for string inputs (for string inputs we just grabbed the underlying bytes anyway).

Consolidated the inner functions used by encode/decode to simplify/inline where possible.

Are these changes tested?

Added new SLTs.

Are there any user-facing changes?

No.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Nov 17, 2025
@Jefffrey Jefffrey left a comment

Test failure should be fixed by #18750

    pub fn new() -> Self {
        Self {
-           signature: Signature::user_defined(Volatility::Immutable),
+           signature: Signature::coercible(
Signature change here; same as for decode


-fn hex_encode(input: &[u8]) -> String {
-    hex::encode(input)
+fn encode_array(array: &ArrayRef, encoding: Encoding) -> Result<ColumnarValue> {

These encode_scalar, encode_array, decode_scalar, decode_array methods should now be simpler, as we've removed consideration of string types (we expect the signature to coerce them to binary for us).
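The idea that one binary code path suffices can be sketched in plain Rust (std-only; hypothetical helper, not the PR's actual code): hex encoding only ever looks at raw bytes, so a string input coerced to binary encodes identically.

```rust
/// Hex-encode raw bytes. A Utf8 value coerced to Binary yields the same
/// underlying bytes, so a single byte-oriented path covers both (sketch).
fn hex_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len() * 2);
    for byte in input {
        out.push_str(&format!("{byte:02x}"));
    }
    out
}

fn main() {
    // "ab" as a string and b"ab" as binary share the same bytes.
    assert_eq!(hex_encode("ab".as_bytes()), "6162");
    assert_eq!(hex_encode(b"ab"), "6162");
}
```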

Comment on lines +269 to 282
/// Estimate how many bytes are actually represented by the array; in case
/// the array slices its internal buffer, this returns the byte size of that
/// slice but not the byte size of the entire buffer.
///
/// This is an estimation only as it can estimate higher if null slots are non-zero
/// sized.
fn estimate_byte_data_size<O: OffsetSizeTrait>(array: &GenericBinaryArray<O>) -> usize {
let offsets = array.value_offsets();
// Unwraps are safe as the offset buffer should always have at least 1 element
let start = *offsets.first().unwrap();
let end = *offsets.last().unwrap();
let data_size = end - start;
data_size.as_usize()
}

Previously we just took the length of the values buffer; this estimation should be more conservative in the best case, and in the worst case we'd just have the same estimation as before.
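The offset math can be illustrated std-only (hypothetical helper operating on a bare offsets slice rather than a GenericBinaryArray): the bytes a sliced offset-based array actually references are bounded by its first and last offsets, not by the length of the shared values buffer.

```rust
/// Byte size represented by an offset-based binary array, derived from
/// the first and last offsets (sketch; offsets stand in for the array).
fn estimate_byte_data_size(offsets: &[i32]) -> usize {
    // Offset buffers always hold at least one element (len + 1 offsets).
    let start = *offsets.first().unwrap();
    let end = *offsets.last().unwrap();
    (end - start) as usize
}

fn main() {
    // Values "aa", "bbb" -> offsets [0, 2, 5] -> 5 bytes represented.
    assert_eq!(estimate_byte_data_size(&[0, 2, 5]), 5);
    // Slicing off the first value leaves offsets [2, 5] -> 3 bytes,
    // even though the underlying buffer still holds all 5.
    assert_eq!(estimate_byte_data_size(&[2, 5]), 3);
}
```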

Comment on lines +292 to +294
// Don't know if there is a more strict upper bound we can infer
// for view arrays byte data size.
encoding.decode_array::<_, i32>(&array, array.get_buffer_memory_size())

For views I don't know if we can do a better estimation of how many actual bytes are represented by the array

Comment on lines +319 to +322
impl TryFrom<&ColumnarValue> for Encoding {
type Error = DataFusionError;

fn try_from(encoding: &ColumnarValue) -> Result<Self> {

I merged the FromStr into this, since ColumnarValue is pretty much the main thing we care about parsing an Encoding from (no need for a two-step process of extracting a string from the ColumnarValue and then parsing that string; just consolidate into a single TryFrom).
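The consolidation pattern can be sketched std-only (parsing from &str here instead of &ColumnarValue, and a plain String error instead of DataFusionError; the variant names mirror the encodings encode/decode support):

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq)]
enum Encoding {
    Base64,
    Hex,
}

// One TryFrom replaces the FromStr-plus-extraction two-step (sketch).
impl TryFrom<&str> for Encoding {
    type Error = String;

    fn try_from(name: &str) -> Result<Self, Self::Error> {
        match name.to_ascii_lowercase().as_str() {
            "base64" => Ok(Encoding::Base64),
            "hex" => Ok(Encoding::Hex),
            other => Err(format!("unknown encoding: {other}")),
        }
    }
}

fn main() {
    assert_eq!(Encoding::try_from("hex"), Ok(Encoding::Hex));
    assert!(Encoding::try_from("utf-7").is_err());
}
```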

}
// We reserved an upper bound size for the values buffer, but we only use the actual size
values.truncate(total_bytes_decoded);
let binary_array = GenericBinaryArray::<OutputOffset>::try_new(

Previously we always returned BinaryArray, which is what led to the issues for large types; now we consider whether the input was large and return accordingly (views are considered a small type).
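The reserve-then-truncate pattern in the snippet can be sketched std-only (hypothetical helper; a naive hex decode stands in for the real decode step):

```rust
/// Decode hex inputs into one values buffer: reserve an upper bound
/// (one output byte per hex pair), write decoded bytes, then truncate
/// to the bytes actually produced (sketch).
fn hex_decode_all(inputs: &[&str]) -> Vec<u8> {
    let upper_bound: usize = inputs.iter().map(|s| s.len() / 2).sum();
    let mut values = vec![0u8; upper_bound];
    let mut total_bytes_decoded = 0;
    for input in inputs {
        for pair in input.as_bytes().chunks_exact(2) {
            let hi = (pair[0] as char).to_digit(16).unwrap() as u8;
            let lo = (pair[1] as char).to_digit(16).unwrap() as u8;
            values[total_bytes_decoded] = hi * 16 + lo;
            total_bytes_decoded += 1;
        }
    }
    // We reserved an upper-bound size; keep only the actual decoded bytes.
    values.truncate(total_bytes_decoded);
    values
}

fn main() {
    assert_eq!(hex_decode_all(&["6162", "63"]), b"abc".to_vec());
}
```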
