-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Allow base64 encoding of fixedsizebinary arrays #18950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
maxburke
wants to merge
1
commit into
apache:main
Choose a base branch
from
urbanlogiq:base64-fsb
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+44
−1
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -29,7 +29,9 @@ use base64::{ | |
| Engine as _, | ||
| }; | ||
| use datafusion_common::{ | ||
| cast::{as_generic_binary_array, as_generic_string_array}, | ||
| cast::{ | ||
| as_fixed_size_binary_array, as_generic_binary_array, as_generic_string_array, | ||
| }, | ||
| not_impl_err, plan_err, | ||
| utils::take_function_args, | ||
| }; | ||
|
|
@@ -246,6 +248,9 @@ fn encode_process(value: &ColumnarValue, encoding: Encoding) -> Result<ColumnarV | |
| DataType::Utf8View => encoding.encode_utf8_array::<i32>(a.as_ref()), | ||
| DataType::Binary => encoding.encode_binary_array::<i32>(a.as_ref()), | ||
| DataType::LargeBinary => encoding.encode_binary_array::<i64>(a.as_ref()), | ||
| DataType::FixedSizeBinary(_) => { | ||
| encoding.encode_fixed_size_binary_array(a.as_ref()) | ||
| } | ||
| other => exec_err!( | ||
| "Unsupported data type {other:?} for function encode({encoding})" | ||
| ), | ||
|
|
@@ -265,6 +270,9 @@ fn encode_process(value: &ColumnarValue, encoding: Encoding) -> Result<ColumnarV | |
| ), | ||
| ScalarValue::LargeBinary(a) => Ok(encoding | ||
| .encode_large_scalar(a.as_ref().map(|v: &Vec<u8>| v.as_slice()))), | ||
| ScalarValue::FixedSizeBinary(_, a) => Ok( | ||
| encoding.encode_scalar(a.as_ref().map(|v: &Vec<u8>| v.as_slice())) | ||
| ), | ||
| other => exec_err!( | ||
| "Unsupported data type {other:?} for function encode({encoding})" | ||
| ), | ||
|
|
@@ -401,6 +409,15 @@ impl Encoding { | |
| Ok(ColumnarValue::Array(array)) | ||
| } | ||
|
|
||
| fn encode_fixed_size_binary_array(self, value: &dyn Array) -> Result<ColumnarValue> { | ||
| let input_value = as_fixed_size_binary_array(value)?; | ||
| let array: ArrayRef = match self { | ||
| Self::Base64 => encode_to_array!(base64_encode, input_value), | ||
| Self::Hex => encode_to_array!(hex_encode, input_value), | ||
| }; | ||
| Ok(ColumnarValue::Array(array)) | ||
| } | ||
|
|
||
| fn encode_utf8_array<T>(self, value: &dyn Array) -> Result<ColumnarValue> | ||
| where | ||
| T: OffsetSizeTrait, | ||
|
|
@@ -553,3 +570,29 @@ fn decode(args: &[ColumnarValue]) -> Result<ColumnarValue> { | |
| }?; | ||
| decode_process(expression, encoding) | ||
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { | ||
| #[test] | ||
| fn test_encode_fsb() { | ||
| use super::*; | ||
|
|
||
| let value = vec![0u8; 16]; | ||
| let array = arrow::array::FixedSizeBinaryArray::try_from_sparse_iter_with_size( | ||
| vec![Some(value)].into_iter(), | ||
| 16, | ||
| ) | ||
| .unwrap(); | ||
| let value = ColumnarValue::Array(Arc::new(array)); | ||
|
|
||
| let ColumnarValue::Array(result) = | ||
| encode_process(&value, Encoding::Base64).unwrap() | ||
| else { | ||
| panic!("unexpected value"); | ||
| }; | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To cover the missing usage of let encode_func = EncodeFunc::new();
let args = vec![
ColumnarValue::Array(Arc::new(array)),
ColumnarValue::Scalar(ScalarValue::Utf8(Some("base64".to_string()))),
];
// This will test the full path including type checking
let result = encode_func.invoke_with_args(datafusion_expr::ScalarFunctionArgs {
args,
number_rows: 1,
return_type: &DataType::Utf8,
}).unwrap();
let ColumnarValue::Array(result) = result else {
panic!("unexpected value");
};
... |
||
|
|
||
| let string_array = result.as_any().downcast_ref::<StringArray>().unwrap(); | ||
| let result_value = string_array.value(0); | ||
| assert_eq!(result_value, "AAAAAAAAAAAAAAAAAAAAAA"); | ||
| } | ||
| } | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be added as an SLT test instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? SLT tests are quite unapproachable for those who don't work in datafusion regularly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case it is very necessary, because it reveals that the signature of the function doesn't actually accept FixedSizeBinary arrays, something calling the function directly in code like this wouldn't make obvious.
While SLTs can be harder to write, we prefer it because they are less verbose, and they can test functions with a more stable interface (SQL) rather than their internal APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something you would be able to help out with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look at the README and see some other tests for reference:
datafusion/datafusion/sqllogictest/test_files/encoding.slt
Lines 50 to 57 in 769f367
datafusion/datafusion/sqllogictest/test_files/binary.slt
Lines 116 to 117 in 769f367
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I don't have the bandwidth to dig into SLT tests at this time. I have provided this MR, it contains tests, and in the past this has been sufficient for Datafusion. You may take it from here as you wish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell this PR doesn't actually fix anything; have you tested this according to your use case for the issue you raised? (And I don't mean the unit tests you introduced here because those are internal tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How so? Before, calling encode on fixedsizebinary types would have returned an unsupported data type error.
We've been using in production since April when we first made the change: urbanlogiq@9dfd561
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some possible tests: