Fix input handling for encoding functions & various refactors #18754
Conversation
Jefffrey left a comment
Test failure should be fixed by #18750
```diff
 pub fn new() -> Self {
     Self {
-        signature: Signature::user_defined(Volatility::Immutable),
+        signature: Signature::coercible(
```
Signature change here; same as for decode
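For illustration, a hedged sketch of what such a coercible signature could look like; the `Coercion` constructor shapes here are assumptions against a recent `datafusion_expr` and may differ from the PR's actual code:

```rust
// A minimal sketch (assumptions, not the PR's exact code): the first
// argument is binary-like, with string inputs implicitly coerced to binary;
// the second is the encoding name.
use datafusion_common::types::{logical_binary, logical_string, NativeType};
use datafusion_expr::{Coercion, Signature, TypeSignatureClass, Volatility};

fn encode_signature() -> Signature {
    Signature::coercible(
        vec![
            // Bytes to encode: accept binary natively, coerce strings to binary.
            // Constructor shape assumed; verify against the crate version in use.
            Coercion::new_implicit(
                TypeSignatureClass::Native(logical_binary()),
                vec![TypeSignatureClass::Native(logical_string())],
                NativeType::Binary,
            ),
            // Encoding name, e.g. 'hex' or 'base64'
            Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
        ],
        Volatility::Immutable,
    )
}
```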
```diff
-fn hex_encode(input: &[u8]) -> String {
-    hex::encode(input)
+fn encode_array(array: &ArrayRef, encoding: Encoding) -> Result<ColumnarValue> {
```
These encode_scalar, encode_array, decode_scalar, decode_array methods should now be simpler, as we've removed consideration of string types (we expect the signature to coerce them to binary for us)
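For illustration, a minimal sketch of the simplified shape this implies, assuming the signature has already coerced strings to binary; `Encoding` here is a hypothetical stand-in for the PR's own enum, and only the hex/Binary arm is spelled out:

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, AsArray, StringArray};
use arrow::datatypes::DataType;
use datafusion_common::{exec_err, Result};
use datafusion_expr::ColumnarValue;

// Hypothetical stand-in for the PR's own Encoding enum.
enum Encoding {
    Hex,
    Base64,
}

fn encode_array(array: &ArrayRef, encoding: Encoding) -> Result<ColumnarValue> {
    // Strings were already coerced to binary by the signature, so only
    // binary variants need handling here.
    match (array.data_type(), encoding) {
        (DataType::Binary, Encoding::Hex) => {
            let encoded: StringArray = array
                .as_binary::<i32>()
                .iter()
                .map(|v| v.map(hex::encode))
                .collect();
            Ok(ColumnarValue::Array(Arc::new(encoded)))
        }
        // LargeBinary, BinaryView, and base64 would be handled analogously...
        (other, _) => exec_err!("encode: unsupported input type {other}"),
    }
}
```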
```rust
/// Estimate how many bytes are actually represented by the array; in case
/// the array slices its internal buffer, this returns the byte size of that
/// slice but not the byte size of the entire buffer.
///
/// This is an estimate only, as it can estimate higher if null slots are
/// non-zero sized.
fn estimate_byte_data_size<O: OffsetSizeTrait>(array: &GenericBinaryArray<O>) -> usize {
    let offsets = array.value_offsets();
    // Unwraps are safe as an offset buffer always has at least one element
    let start = *offsets.first().unwrap();
    let end = *offsets.last().unwrap();
    let data_size = end - start;
    data_size.as_usize()
}
```
Previously we just took the length of the values buffers; this estimation should be more conservative in the best case, and in the worst case we'd just have the same estimation as before
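For illustration, a small example (assumed values, not from the PR) of why the offset-based estimate is tighter for sliced arrays than the raw buffer length, using the `estimate_byte_data_size` shown above:

```rust
use arrow::array::BinaryArray;

fn demo() {
    let array = BinaryArray::from_vec(vec![
        b"aaaa".as_slice(),
        b"bbbb".as_slice(),
        b"cccc".as_slice(),
    ]);
    // The slice logically contains only "bbbb" but shares the 12-byte buffer.
    let sliced = array.slice(1, 1);
    // The slice's offsets span bytes 4..8, so the estimate is 4, not 12.
    assert_eq!(estimate_byte_data_size(&sliced), 4);
}
```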
```rust
// Don't know if there is a stricter upper bound we can infer
// for a view array's byte data size.
encoding.decode_array::<_, i32>(&array, array.get_buffer_memory_size())
```
For views I don't know if we can do a better estimation of how many actual bytes are represented by the array
```rust
impl TryFrom<&ColumnarValue> for Encoding {
    type Error = DataFusionError;

    fn try_from(encoding: &ColumnarValue) -> Result<Self> {
```
I merged the FromStr into this, since ColumnarValue is pretty much the main thing we care about parsing an Encoding from (no need for a two-step of extracting a string from the ColumnarValue and then parsing that string; just consolidate into a single TryFrom)
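A minimal sketch of the consolidated shape (assumed, not the PR's exact code), assuming the encoding name arrives as a Utf8 scalar; the real code may also accept other string scalar variants:

```rust
use datafusion_common::{exec_err, DataFusionError, Result, ScalarValue};
use datafusion_expr::ColumnarValue;

impl TryFrom<&ColumnarValue> for Encoding {
    type Error = DataFusionError;

    fn try_from(encoding: &ColumnarValue) -> Result<Self> {
        match encoding {
            // Parse the name straight out of the scalar argument, with no
            // intermediate string-extraction step.
            ColumnarValue::Scalar(ScalarValue::Utf8(Some(name))) => match name.as_str() {
                "hex" => Ok(Self::Hex),
                "base64" => Ok(Self::Base64),
                other => exec_err!("unknown encoding: {other}"),
            },
            _ => exec_err!("encoding name must be a non-null string literal"),
        }
    }
}
```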
```rust
}
// We reserved an upper-bound size for the values buffer, but we only use the actual size
values.truncate(total_bytes_decoded);
let binary_array = GenericBinaryArray::<OutputOffset>::try_new(
```
Previously we always returned BinaryArray, which is what led to the issues for large types; now we consider whether the input was large and return accordingly (views are treated as a small type)
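A minimal sketch of the dispatch this implies (assumed shape, not the PR's exact code), where `decode_array` is a hypothetical stand-in for the PR's generic decoding helper:

```rust
use arrow::array::ArrayRef;
use arrow::datatypes::DataType;
use datafusion_common::{exec_err, Result};
use datafusion_expr::ColumnarValue;

fn decode_dispatch(array: &ArrayRef, encoding: Encoding) -> Result<ColumnarValue> {
    match array.data_type() {
        // Binary and BinaryView produce i32-offset (small) output...
        DataType::Binary | DataType::BinaryView => decode_array::<i32>(array, encoding),
        // ...while LargeBinary stays large, with i64 offsets.
        DataType::LargeBinary => decode_array::<i64>(array, encoding),
        other => exec_err!("decode: unsupported input type {other}"),
    }
}
```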
Which issue does this PR close?
Part of #12725
Rationale for this change
Started refactoring the encoding functions (encode, decode) to remove the user-defined signature (per the linked issue). However, in the process I discovered that handling of certain inputs was bugged, producing errors on main. So I went about fixing this input handling as well as doing various refactors to try to simplify the code.
(I also discovered #18746 in the process of this refactor).
What changes are included in this PR?
Refactor the signatures away from user-defined to the signature coercion API; importantly, we now accept only binary inputs, letting string inputs be coerced to binary by type coercion (see the usage sketch after this list). This simplifies the internal code of encode/decode to only consider binary inputs, instead of duplicating essentially identical code for string inputs (for string inputs we just grabbed the underlying bytes anyway).
Consolidate the inner functions used by encode/decode, simplifying and inlining where possible.
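A hedged end-to-end sketch of the coercion behavior described above, using standard DataFusion APIs:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // 'hello' arrives as Utf8 and is coerced to Binary for encode's first
    // argument before the function body runs.
    let df = ctx.sql("SELECT encode('hello', 'hex')").await?;
    df.show().await?;
    Ok(())
}
```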
Are these changes tested?
Added new SLTs.
Are there any user-facing changes?
No.