Skip to content

Conversation

@tobixdev
Copy link
Contributor

Which issue does this PR close?

We should open a follow-up issue that tracks re-use FixedSizeBinaryArray::new_null after upgrading arrow-rs.

With this PR, we could backport it to #18843 if we wish.

Alternatively, we could also wait for a fix by arrow-rs.

Rationale for this change

As we will have to wait for another arrow-rs update to get the fix for the issue apache/arrow-rs#8901 , we have a temporaray fix that basically calls the same operations from DataFusion's code. Once the fix is in the arrow-rs codebase, we can re-use FixedSizeBinaryArray::new_null.

What changes are included in this PR?

  • A test. (assert_eq!(result.as_fixed_size_binary().values().len(), 10); returned 0 before)
  • Directly allocate the values buffer in DataFusion.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, the buffer will now be allocated, zeroed, and set to the expected length. Same as in the arrow-rs fix.

@tobixdev
Copy link
Contributor Author

Hmm seems this change causes problems with DataType::FixedSizeBinary(0) as it divides by zero in FixedSizeBinaryArray::try_new.

    /// Create a new [`FixedSizeBinaryArray`] from the provided parts, returning an error on failure
    ///
    /// # Errors
    ///
    /// * `size < 0`
    /// * `values.len() / size != nulls.len()`
    pub fn try_new(
        size: i32,
        values: Buffer,
        nulls: Option<NullBuffer>,
    ) -> Result<Self, ArrowError> {
        let data_type = DataType::FixedSizeBinary(size);
        let s = size.to_usize().ok_or_else(|| {
            ArrowError::InvalidArgumentError(format!("Size cannot be negative, got {size}"))
        })?;

        let len = values.len() / s; <--- crash

I don't see any way to create a new FixedSizeBinaryArray with zero length using try_new. Is this on-purpose? The comments says 0 should be fine. Is this a bug in arrow-rs?

We maybe could create a FixedSizeBinaryArray using the From<ArrayData> implementation.

Any preferences on how we should proceed? As I am unsure whether we even want to fix this issue on the DataFusion side (or wait for an arrow-rs update), I'll wait another opinion.

@Jefffrey
Copy link
Contributor

I don't see any way to create a new FixedSizeBinaryArray with zero length using try_new. Is this on-purpose? The comments says 0 should be fine. Is this a bug in arrow-rs?

This does seem like a bug; for reference, pyarrow allows it:

>>> import pyarrow as pa
>>> pa.binary(0)
FixedSizeBinaryType(fixed_size_binary[0])
>>> pa.array([], type=pa.binary(0))
<pyarrow.lib.FixedSizeBinaryArray object at 0x102030b80>
[]
>>> pa.array([], type=pa.binary(0)).type
FixedSizeBinaryType(fixed_size_binary[0])

It would be nice to upstream a proper fix to arrow-rs and then can wait for it here; otherwise if we need a fix for DataFusion now we can do a workaround as you describe:

We maybe could create a FixedSizeBinaryArray using the From<ArrayData> implementation.

@tobixdev
Copy link
Contributor Author

@Jefffrey Thanks for checking against pyarrow!

Here is a PR that fixes the issue in arrow-rs.

apache/arrow-rs#8927

I also found out that we can create zero-sized item arrays using the builder API. I'll give that a try.

…SizeBinaryArray in `ScalarValue::to_array_of_size`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ScalarValue::to_array_of_size Does Not Correctly Allocate a Values Buffer for ScalarValue::FixedSizeBinary

2 participants