Skip to content

Skip defensive copy when unpacking dictionary arrays in UnpackOrClone mode #3435

@andygrove

Description

@andygrove

Summary

After #3401 set CometScanWrapper.isFfiSafe = true, there is a small performance regression for dictionary-encoded columns (most notably SmallInt, which almost always stays dictionary-encoded in Parquet due to its low cardinality).

Root Cause

In copy_or_unpack_array (native/core/src/execution/operators/copy.rs), the dictionary unpacking path always performs a defensive copy_array after cast_with_options, regardless of the CopyMode:

DataType::Dictionary(_, value_type) => {
    let options = CastOptions::default();
    // We need to copy the array after `cast` because arrow-rs `take` kernel which is used
    // to unpack dictionary array might reuse the input array's null buffer.
    Ok(copy_array(&cast_with_options(
        array,
        value_type.as_ref(),
        &options,
    )?))
}

In UnpackOrClone mode (used when isFfiSafe=true), the caller has already transferred ownership of the original array's memory to native, so null buffer reuse from the take kernel is safe. The extra deep copy is unnecessary.

Proposed Fix

Make the defensive copy conditional on the mode:

DataType::Dictionary(_, value_type) => {
    let options = CastOptions::default();
    let unpacked = cast_with_options(array, value_type.as_ref(), &options)?;
    if mode == &CopyMode::UnpackOrDeepCopy {
        // We need to copy the array after `cast` because arrow-rs `take` kernel which is
        // used to unpack dictionary array might reuse the input array's null buffer.
        Ok(copy_array(&unpacked))
    } else {
        // In UnpackOrClone mode, the caller owns the original array's memory so
        // null buffer reuse from the take kernel is safe.
        Ok(unpacked)
    }
}

This eliminates the redundant deep copy of the full flat array for dictionary-encoded columns, restoring performance parity with the previous code path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions