Skip to content

Conversation

@kosiew
Copy link
Contributor

@kosiew kosiew commented Sep 18, 2025

Which issue does this PR close?

Partial fix for #16579

This is part of a series smaller PRs to replace #17281

Rationale for this change

The pruning logic previously used arrow::compute::cast directly when
converting statistics arrays into the target field types. This caused
two problems:

  1. When the statistics arrived as binary (e.g. Binary,
    LargeBinary, or BinaryView) but the target column type was UTF-8
    (e.g. Utf8, LargeUtf8, Utf8View), invalid UTF-8 byte sequences
    could produce unchecked/invalid string values or require special
    handling. Using a safe cast path ensures invalid UTF-8 is converted to
    NULL rather than producing incorrect values or panics.

  2. Struct-typed statistics were not always handled robustly. Nested
    struct casting and validation required a number of defensive checks
    and helpers to ensure the statistics layout matches the requested
    struct layout (including useful errors when incompatible).

This change centralizes casting behaviour using
datafusion_common::cast_column, adds special handling for
BinaryView to sanitize invalid UTF-8 bytes ahead of casting, and
introduces a set of unit tests that exercise:

  • Binary → UTF-8 safe casts (including BinaryView special-case
    sanitization)
  • Nested struct statistic handling
  • Several error cases that previously produced unclear failures

These changes make pruning more robust when reading
metadata/statistics from formats such as Parquet and ensure
invalid/inconsistent statistics do not silently propagate incorrect
values.

What changes are included in this PR?

  • Replace arrow::compute::cast(&array, data_type) usages in pruning
    with cast_column(&array, stat_field, &cast_options) so that
    DataFusion's centralized casting rules are applied. The cast path
    respects CastOptions and enables safe casting where appropriate.

  • Use DEFAULT_CAST_OPTIONS and enable safe casting when the source
    statistics type is binary (Binary, LargeBinary, BinaryView)
    and the requested/target type is a string type (Utf8, LargeUtf8,
    Utf8View). This converts invalid UTF-8 byte sequences to NULL
    instead of producing invalid string values.

  • Add sanitize_binary_array_for_utf8 in
    datafusion/common/src/nested_struct.rs to pre-process a
    BinaryViewArray so invalid UTF-8 sequences are explicitly converted
    to nulls before casting. For other binary array representations the
    existing Arrow safe casts are sufficient and the array is returned
    unchanged.

  • Add logic to datafusion/pruning/src/pruning_predicate.rs to
    determine when safe casting is required, construct cast_options,
    and pass them to cast_column.

  • Add numerous unit tests for pruning:

    • test_build_statistics_large_binary_to_large_utf8_invalid_bytes
    • test_build_statistics_binary_view_to_utf8_view_invalid_bytes
    • test_build_statistics_binary_view_to_utf8_view_valid_bytes
    • test_build_statistics_struct_column
    • test_build_statistics_invalid_utf8_safe_cast
    • test_build_statistics_nested_invalid_utf8_safe_cast
    • Struct-related negative tests verifying informative errors for
      incompatible layouts, non-struct sources, and inconsistent
      lengths.
  • Minor refactors and imports across nested_struct.rs and
    pruning_predicate.rs (e.g. AsArray as _, DataType usage, small
    helpers such as make_struct_required_cols in tests).

Are these changes tested?

Yes — this PR adds and updates unit tests under pruning tests to
cover both happy-path and error-path behaviours:

  • Tests ensure invalid UTF-8 bytes in binary statistics are converted
    to NULL when casting to string types, including special handling
    of BinaryViewArray.
  • Tests validate nested struct statistics are preserved and cast
    correctly to the requested struct layout, and that incompatible
    layouts produce clear, descriptive errors.
  • Additional tests cover the mismatched statistics length error
    scenario.

All tests were added to the existing test modules and exercise the
build_statistics_record_batch logic used by pruning.

Are there any user-facing changes?

No direct user-facing API changes are expected. This work affects
internal pruning/statistics conversion logic and unit tests. The
change improves robustness of pruning when reading file-format
statistics and avoids incorrect string values resulting from invalid
UTF-8 bytes in statistics metadata.

If there are consumers of internal cast behaviour in custom code
paths, they should prefer datafusion_common::cast_column for
consistency with DataFusion's casting semantics.


Notes for reviewers

  • Look at datafusion/common/src/nested_struct.rs for
    sanitize_binary_array_for_utf8 and the Utf8View handling in
    cast_column branching.
  • Review datafusion/pruning/src/pruning_predicate.rs changes,
    especially the decision logic that toggles cast_options.safe when
    binary stats are being cast to string types.
  • The new tests show examples of both valid and invalid inputs for
    BinaryViewArray and BinaryArray to confirm the intended
    behaviour.

…afe casting and sanitization for invalid bytes
@github-actions github-actions bot added the common Related to common crate label Sep 18, 2025
@kosiew kosiew marked this pull request as ready for review September 18, 2025 09:07
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kosiew

Can you be more clear what the purpose of this PR is? Specificalyl what usecase are you solving?

When the statistics arrived as binary (e.g. Binary,
LargeBinary, or BinaryView) but the target column type was UTF-8
(e.g. Utf8, LargeUtf8, Utf8View), invalid UTF-8 byte sequences
could produce unchecked/invalid string values or require special
handling. Using a safe cast path ensures invalid UTF-8 is converted to
NULL rather than producing incorrect values or panics.

Is this a case you have seen in practice? I think the existing cast kernel handles invalid ut8f already

Basically, I worry this PR adds code / complexity but doesn't solve a real problem

DataType::BinaryView => {
let binary_view = array.as_binary_view();

// Check if all bytes are already valid UTF-8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems redundant with calling https://docs.rs/arrow/latest/arrow/compute/fn.cast_with_options.html with CastOptions::safe = true 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the existing cast kernel handles invalid ut8f already

It does not handle BinaryView

#[test]
    fn test_arrow_cast_binaryview_to_utf8view_fails_with_invalid_utf8() {
        // Arrow's BinaryView -> Utf8View casting fails even with safe=true when
        // encountering invalid UTF-8, unlike other binary array types which
        // convert invalid UTF-8 to null with safe casting.
        use arrow::compute::kernels::cast::{cast_with_options, CastOptions};

        // Create BinaryView with invalid UTF-8
        let binary_data = vec![
            Some("valid".as_bytes()),
            Some(&[0xf0, 0x28, 0x8c, 0x28]), // invalid UTF-8 sequence
            Some("also_valid".as_bytes()),
        ];
        let binary_view_array: ArrayRef = Arc::new(BinaryViewArray::from(binary_data));

        // Try casting with safe=false (should fail)
        let cast_options = CastOptions::default(); // safe=false by default
        let result =
            cast_with_options(&binary_view_array, &DataType::Utf8View, &cast_options);
        assert!(
            result.is_err(),
            "Expected BinaryView->Utf8View cast to fail with safe=false"
        );
        assert!(
            result
                .unwrap_err()
                .to_string()
                .contains("Encountered non-UTF-8 data"),
            "Error should mention non-UTF-8 data"
        );

        // Try casting with safe=true (should still fail for BinaryView!)
        let mut safe_cast_options = CastOptions::default();
        safe_cast_options.safe = true;
        let safe_result = cast_with_options(
            &binary_view_array,
            &DataType::Utf8View,
            &safe_cast_options,
        );
        assert!(
            safe_result.is_err(),
            "BinaryView->Utf8View cast fails even with safe=true (unlike other binary types)"
        );
        assert!(
            safe_result
                .unwrap_err()
                .to_string()
                .contains("Encountered non-UTF-8 data"),
            "Safe cast error should also mention non-UTF-8 data"
        );

    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that is a great find and I am sorry I did not understand that context.

This sounds like a bug in upstream arrow -- I think we should file a ticket in arrow-rs and then leave a reference to that ticket in this implementation (so we can remove the datafusion code when the corresponding code in arrow-rs is released_

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will create the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants