-
Notifications
You must be signed in to change notification settings - Fork 283
Description
Summary
After #3401 set CometScanWrapper.isFfiSafe = true, there is a small performance regression for dictionary-encoded columns (most notably SmallInt, which almost always stays dictionary-encoded in Parquet due to its low cardinality).
Root Cause
In copy_or_unpack_array (native/core/src/execution/operators/copy.rs), the dictionary unpacking path always performs a defensive copy_array after cast_with_options, regardless of the CopyMode:
DataType::Dictionary(_, value_type) => {
let options = CastOptions::default();
// We need to copy the array after `cast` because arrow-rs `take` kernel which is used
// to unpack dictionary array might reuse the input array's null buffer.
Ok(copy_array(&cast_with_options(
array,
value_type.as_ref(),
&options,
)?))
}In UnpackOrClone mode (used when isFfiSafe=true), the caller has already transferred ownership of the original array's memory to native, so null buffer reuse from the take kernel is safe. The extra deep copy is unnecessary.
Proposed Fix
Make the defensive copy conditional on the mode:
DataType::Dictionary(_, value_type) => {
let options = CastOptions::default();
let unpacked = cast_with_options(array, value_type.as_ref(), &options)?;
if mode == &CopyMode::UnpackOrDeepCopy {
// We need to copy the array after `cast` because arrow-rs `take` kernel which is
// used to unpack dictionary array might reuse the input array's null buffer.
Ok(copy_array(&unpacked))
} else {
// In UnpackOrClone mode, the caller owns the original array's memory so
// null buffer reuse from the take kernel is safe.
Ok(unpacked)
}
}This eliminates the redundant deep copy of the full flat array for dictionary-encoded columns, restoring performance parity with the previous code path.