Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong size assertion in arrow_buffer::builder::NullBufferBuilder::new_from_buffer #5445

Closed
tijmenr opened this issue Feb 29, 2024 · 5 comments · Fixed by #5489
Closed

Wrong size assertion in arrow_buffer::builder::NullBufferBuilder::new_from_buffer #5445

tijmenr opened this issue Feb 29, 2024 · 5 comments · Fixed by #5489
Labels
arrow Changes to the arrow crate bug good first issue Good for newcomers help wanted

Comments

@tijmenr
Copy link

tijmenr commented Feb 29, 2024

Describe the bug
In:

assert!(len < capacity);

The assertion check is len < capacity, where len is the number of boolean null/non-null bit values, and capacity is buffer.len() * 8, with buffer.len() the size of the buffer containing those bits. If len is a multiple of 8, say 8b, then the buffer used to store it has length (buffer.len()) b, and len == capacity. This valid situation fails the assertion check.

To Reproduce

// Assume pa: PrimitiveArray holding values including nulls, with length a multiple of 8
let b = pa.into_builder().expect("into_builder")

thread 'main' panicked at .../.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-buffer-50.0.0/src/builder/null.rs:57:9:
assertion failed: len < capacity
stack backtrace:
...
2: core::panicking::panic
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:127:5
3: arrow_buffer::builder::null::NullBufferBuilder::new_from_buffer
4: arrow_array::builder::primitive_builder::PrimitiveBuilder::new_from_buffer::{{closure}}
at .../.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-50.0.0/src/builder/primitive_builder.rs:164:27
5: core::option::Option::map
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/option.rs:1072:29
6: arrow_array::builder::primitive_builder::PrimitiveBuilder::new_from_buffer
at .../.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-50.0.0/src/builder/primitive_builder.rs:163:35
7: arrow_array::array::primitive_array::PrimitiveArray::into_builder
at .../.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-50.0.0/src/array/primitive_array.rs:946:46

Expected behavior
into_builder() succeeds.

Additional context
In my use case, I am parsing a text file containing number values (or some value indicating "null"), and store these in a PrimitiveBuilder (for later converting to an Array, and further processing). I do not know in advance what number type a series of values has, they might for example all be integers, but it could also be that the first (few) ones look like integers, but later ones are floats. So, I start out with e.g. a PrimitiveBuilder, parse the next string value into an int, if it parses ok I add the value, if the parse fails I try to parse it as e.g. an f32. If that succeeds, then apparently this series of values is actually a series of floats, not ints. So I want to basically cast/convert the builder with the already collected values to another type, and continue my parsing.

One approach I tried:

use num::cast::{AsPrimitive, NumCast};

trait Caster<U> {
    fn cast(&mut self) -> PrimitiveBuilder<U> where U: ArrowPrimitiveType; 
}

impl<T, U> Caster<U> for PrimitiveBuilder<T>
where T: ArrowPrimitiveType,  U: ArrowPrimitiveType, T::Native: AsPrimitive<U::Native>
{
    fn cast(&mut self) -> PrimitiveBuilder<U> {
        let src_array = self.finish();
        let dst_array = src_array.unary::<_, U>(AsPrimitive::<U::Native>::as_);
        src_array.into_builder().expect("Converting array to builder")
    }
}

Here, if the source contained nulls, the into_builder returns an Err(...), I assume related to how the unary function clones the NullBuffer. It would be nice if into_builder handles this better, by creating a new null buffer if it cannot reuse the existing one.

The next approach I tried:

use num::cast::NumCast;

trait Caster<U> {
    fn _cast(&mut self) -> PrimitiveBuilder<U> where U: ArrowPrimitiveType, U::Native: NumCast;
}

impl<T, U> Caster<U> for PrimitiveBuilder<T>
where T: ArrowPrimitiveType, T::Native: NumCast, U: ArrowPrimitiveType, U::Native: NumCast
{
    fn cast(&mut self) -> PrimitiveBuilder<U> {
        let src_array = self.finish();
        let dst_array = src_array.unary_opt::<_, U>(num::cast::cast::<T::Native, U::Native>);
        src_array.into_builder().expect("Converting array to builder")
    }
}

This panics due to the len < capacity assertion, if the source contains nulls and has a length divisible by 8.

@tijmenr tijmenr added the bug label Feb 29, 2024
@tijmenr
Copy link
Author

tijmenr commented Feb 29, 2024

For reference, I'm using the following workaround now:

trait Caster<U> {
    fn cast(&mut self) -> PrimitiveBuilder<U> where U: ArrowPrimitiveType; 
}

impl<T, U> Caster<U> for PrimitiveBuilder<T>
where
    T: ArrowPrimitiveType,
    U: ArrowPrimitiveType,
    T::Native: AsPrimitive<U::Native>,
{
    fn cast(&mut self) -> PrimitiveBuilder<U>
    {
        let len = self.len();
        let mut builder = PrimitiveBuilder::with_capacity(len);
        builder.extend(
            self.finish().into_iter().map(|v| v.map(AsPrimitive::<U::Native>::as_))
        );
        builder
    }
}

@tustvold
Copy link
Contributor

Are you saying the check should be <=?

@tijmenr
Copy link
Author

tijmenr commented Feb 29, 2024

Yes, indeed, it should be <=.
See also https://arrow.apache.org/rust/src/arrow_buffer/builder/boolean.rs.html#32-39 and onwards, which describes the buffer underlying the NullBuffer.

@tustvold
Copy link
Contributor

Seems like a fairly straightforward fix, fortunately it is overly pessimistic so this is not a soundness issue

@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'arrow'} from #5460

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug good first issue Good for newcomers help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants