-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Allow base64 encoding of fixedsizebinary arrays #18950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| #[cfg(test)] | ||
| mod tests { | ||
| #[test] | ||
| fn test_encode_fsb() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be added as an SLT test instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? SLT tests are quite unapproachable for those who don't work in datafusion regularly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case it is very necessary, because it reveals that the signature of the function doesn't actually accept FixedSizeBinary arrays, something calling the function directly in code like this wouldn't make obvious.
While SLTs can be harder to write, we prefer it because they are less verbose, and they can test functions with a more stable interface (SQL) rather than their internal APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something you would be able to help out with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look at the README and see some other tests for reference:
datafusion/datafusion/sqllogictest/test_files/encoding.slt
Lines 50 to 57 in 769f367
| # Arrays tests | |
| query T | |
| SELECT encode(bin_field, 'hex') FROM test ORDER BY num; | |
| ---- | |
| 616263 | |
| 7177657177 | |
| NULL | |
| 8f50d3f60eae370ddbf85c86219c55108a350165 |
datafusion/datafusion/sqllogictest/test_files/binary.slt
Lines 116 to 117 in 769f367
| arrow_cast(column1, 'FixedSizeBinary(3)') as "column1", | |
| arrow_cast(column2, 'FixedSizeBinary(3)') as "column2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I don't have the bandwidth to dig into SLT tests at this time. I have provided this MR, it contains tests, and in the past this has been sufficient for Datafusion. You may take it from here as you wish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell this PR doesn't actually fix anything; have you tested this according to your use case for the issue you raised? (And I don't mean the unit tests you introduced here because those are internal tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell this PR doesn't actually fix anything
How so? Before, calling encode on fixedsizebinary types would have returned an unsupported data type error.
We've been using in production since April when we first made the change: urbanlogiq@9dfd561
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some possible tests:
# test for FixedSizeBinary support for encode
statement ok
CREATE TABLE test_fsb AS
SELECT arrow_cast(X'0123456789ABCDEF', 'FixedSizeBinary(8)') as fsb_col;
query TT
SELECT
encode(fsb_col, 'base64') AS fsb_base64,
encode(fsb_col, 'hex') AS fsb_hex
FROM test_fsb;
----
ASNFZ4mrze8 0123456789abcdef
# Test with NULL
query T
SELECT encode(arrow_cast(NULL, 'FixedSizeBinary(8)'), 'base64');
----
NULL
| LargeUtf8 => LargeUtf8, | ||
| Utf8View => Utf8, | ||
| Binary => Utf8, | ||
| LargeBinary => LargeUtf8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FixedSizeBinary should be added here, no ?
| LargeBinary => LargeUtf8, | |
| FixedSizeBinary(_) => Utf8, |
| } | ||
| DataType::LargeUtf8 => Ok(vec![DataType::LargeUtf8, DataType::Utf8]), | ||
| DataType::Binary => Ok(vec![DataType::Binary, DataType::Utf8]), | ||
| DataType::LargeBinary => Ok(vec![DataType::LargeBinary, DataType::Utf8]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FixedSizeBinary should be added here too
| DataType::LargeBinary => Ok(vec![DataType::LargeBinary, DataType::Utf8]), | |
| DataType::FixedSizeBinary(size) => Ok(vec\![DataType::FixedSizeBinary(*size), DataType::Utf8]), |
| #[cfg(test)] | ||
| mod tests { | ||
| #[test] | ||
| fn test_encode_fsb() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some possible tests:
# test for FixedSizeBinary support for encode
statement ok
CREATE TABLE test_fsb AS
SELECT arrow_cast(X'0123456789ABCDEF', 'FixedSizeBinary(8)') as fsb_col;
query TT
SELECT
encode(fsb_col, 'base64') AS fsb_base64,
encode(fsb_col, 'hex') AS fsb_hex
FROM test_fsb;
----
ASNFZ4mrze8 0123456789abcdef
# Test with NULL
query T
SELECT encode(arrow_cast(NULL, 'FixedSizeBinary(8)'), 'base64');
----
NULL
| encode_process(&value, Encoding::Base64).unwrap() | ||
| else { | ||
| panic!("unexpected value"); | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To cover the missing usage of return_type and coerce_types you need something like:
let encode_func = EncodeFunc::new();
let args = vec![
ColumnarValue::Array(Arc::new(array)),
ColumnarValue::Scalar(ScalarValue::Utf8(Some("base64".to_string()))),
];
// This will test the full path including type checking
let result = encode_func.invoke_with_args(datafusion_expr::ScalarFunctionArgs {
args,
number_rows: 1,
return_type: &DataType::Utf8,
}).unwrap();
let ColumnarValue::Array(result) = result else {
panic!("unexpected value");
};
...
Which issue does this PR close?
Rationale for this change
We need to be able to base64 encode fixedsizebinary types.
Are these changes tested?
Yes
Are there any user-facing changes?
No