-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Improve pruning logic by replacing arrow::compute::cast with cast_column and add tests for struct statistics and error cases
#17637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…dling in statistics
…tatistics_record_batch
…ired columns creation
… validation in tests
…afe casting and sanitization for invalid bytes
…larity in statistics tests
… enhanced functionality
…yViewArray statistics
…8 bytes in BinaryViewArray
… clarify behavior with invalid UTF-8
…ing invalid UTF-8
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kosiew
Can you be more clear what the purpose of this PR is? Specificalyl what usecase are you solving?
When the statistics arrived as binary (e.g. Binary,
LargeBinary, or BinaryView) but the target column type was UTF-8
(e.g. Utf8, LargeUtf8, Utf8View), invalid UTF-8 byte sequences
could produce unchecked/invalid string values or require special
handling. Using a safe cast path ensures invalid UTF-8 is converted to
NULL rather than producing incorrect values or panics.
Is this a case you have seen in practice? I think the existing cast kernel handles invalid ut8f already
Basically, I worry this PR adds code / complexity but doesn't solve a real problem
| DataType::BinaryView => { | ||
| let binary_view = array.as_binary_view(); | ||
|
|
||
| // Check if all bytes are already valid UTF-8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems redundant with calling https://docs.rs/arrow/latest/arrow/compute/fn.cast_with_options.html with CastOptions::safe = true 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the existing cast kernel handles invalid ut8f already
It does not handle BinaryView
#[test]
fn test_arrow_cast_binaryview_to_utf8view_fails_with_invalid_utf8() {
// Arrow's BinaryView -> Utf8View casting fails even with safe=true when
// encountering invalid UTF-8, unlike other binary array types which
// convert invalid UTF-8 to null with safe casting.
use arrow::compute::kernels::cast::{cast_with_options, CastOptions};
// Create BinaryView with invalid UTF-8
let binary_data = vec![
Some("valid".as_bytes()),
Some(&[0xf0, 0x28, 0x8c, 0x28]), // invalid UTF-8 sequence
Some("also_valid".as_bytes()),
];
let binary_view_array: ArrayRef = Arc::new(BinaryViewArray::from(binary_data));
// Try casting with safe=false (should fail)
let cast_options = CastOptions::default(); // safe=false by default
let result =
cast_with_options(&binary_view_array, &DataType::Utf8View, &cast_options);
assert!(
result.is_err(),
"Expected BinaryView->Utf8View cast to fail with safe=false"
);
assert!(
result
.unwrap_err()
.to_string()
.contains("Encountered non-UTF-8 data"),
"Error should mention non-UTF-8 data"
);
// Try casting with safe=true (should still fail for BinaryView!)
let mut safe_cast_options = CastOptions::default();
safe_cast_options.safe = true;
let safe_result = cast_with_options(
&binary_view_array,
&DataType::Utf8View,
&safe_cast_options,
);
assert!(
safe_result.is_err(),
"BinaryView->Utf8View cast fails even with safe=true (unlike other binary types)"
);
assert!(
safe_result
.unwrap_err()
.to_string()
.contains("Encountered non-UTF-8 data"),
"Safe cast error should also mention non-UTF-8 data"
);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that is a great find and I am sorry I did not understand that context.
This sounds like a bug in upstream arrow -- I think we should file a ticket in arrow-rs and then leave a reference to that ticket in this implementation (so we can remove the datafusion code when the corresponding code in arrow-rs is released_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will create the issue.
…_binary_array_for_utf8
Which issue does this PR close?
Partial fix for #16579
This is part of a series smaller PRs to replace #17281
Rationale for this change
The pruning logic previously used
arrow::compute::castdirectly whenconverting statistics arrays into the target field types. This caused
two problems:
When the statistics arrived as binary (e.g.
Binary,LargeBinary, orBinaryView) but the target column type was UTF-8(e.g.
Utf8,LargeUtf8,Utf8View), invalid UTF-8 byte sequencescould produce unchecked/invalid string values or require special
handling. Using a safe cast path ensures invalid UTF-8 is converted to
NULLrather than producing incorrect values or panics.Struct-typed statistics were not always handled robustly. Nested
struct casting and validation required a number of defensive checks
and helpers to ensure the statistics layout matches the requested
struct layout (including useful errors when incompatible).
This change centralizes casting behaviour using
datafusion_common::cast_column, adds special handling forBinaryViewto sanitize invalid UTF-8 bytes ahead of casting, andintroduces a set of unit tests that exercise:
BinaryViewspecial-casesanitization)
These changes make pruning more robust when reading
metadata/statistics from formats such as Parquet and ensure
invalid/inconsistent statistics do not silently propagate incorrect
values.
What changes are included in this PR?
Replace
arrow::compute::cast(&array, data_type)usages in pruningwith
cast_column(&array, stat_field, &cast_options)so thatDataFusion's centralized casting rules are applied. The cast path
respects
CastOptionsand enables safe casting where appropriate.Use
DEFAULT_CAST_OPTIONSand enablesafecasting when the sourcestatistics type is binary (
Binary,LargeBinary,BinaryView)and the requested/target type is a string type (
Utf8,LargeUtf8,Utf8View). This converts invalid UTF-8 byte sequences toNULLinstead of producing invalid string values.
Add
sanitize_binary_array_for_utf8indatafusion/common/src/nested_struct.rsto pre-process aBinaryViewArrayso invalid UTF-8 sequences are explicitly convertedto nulls before casting. For other binary array representations the
existing Arrow safe casts are sufficient and the array is returned
unchanged.
Add logic to
datafusion/pruning/src/pruning_predicate.rstodetermine when
safecasting is required, constructcast_options,and pass them to
cast_column.Add numerous unit tests for pruning:
test_build_statistics_large_binary_to_large_utf8_invalid_bytestest_build_statistics_binary_view_to_utf8_view_invalid_bytestest_build_statistics_binary_view_to_utf8_view_valid_bytestest_build_statistics_struct_columntest_build_statistics_invalid_utf8_safe_casttest_build_statistics_nested_invalid_utf8_safe_castincompatible layouts, non-struct sources, and inconsistent
lengths.
Minor refactors and imports across
nested_struct.rsandpruning_predicate.rs(e.g.AsArray as _,DataTypeusage, smallhelpers such as
make_struct_required_colsin tests).Are these changes tested?
Yes — this PR adds and updates unit tests under
pruningtests tocover both happy-path and error-path behaviours:
to
NULLwhen casting to string types, including special handlingof
BinaryViewArray.correctly to the requested struct layout, and that incompatible
layouts produce clear, descriptive errors.
mismatched statistics lengtherrorscenario.
All tests were added to the existing test modules and exercise the
build_statistics_record_batchlogic used by pruning.Are there any user-facing changes?
No direct user-facing API changes are expected. This work affects
internal pruning/statistics conversion logic and unit tests. The
change improves robustness of pruning when reading file-format
statistics and avoids incorrect string values resulting from invalid
UTF-8 bytes in statistics metadata.
If there are consumers of internal
castbehaviour in custom codepaths, they should prefer
datafusion_common::cast_columnforconsistency with DataFusion's casting semantics.
Notes for reviewers
datafusion/common/src/nested_struct.rsforsanitize_binary_array_for_utf8and theUtf8Viewhandling incast_columnbranching.datafusion/pruning/src/pruning_predicate.rschanges,especially the decision logic that toggles
cast_options.safewhenbinary stats are being cast to string types.
BinaryViewArrayandBinaryArrayto confirm the intendedbehaviour.