Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove null_count from ArrayData::try_new() #911

Closed
alamb opened this issue Nov 4, 2021 · 10 comments · Fixed by #1721
Closed

Remove null_count from ArrayData::try_new() #911

alamb opened this issue Nov 4, 2021 · 10 comments · Fixed by #1721
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Nov 4, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Re #817 ; As @jhorstmann pointed out on #810 at https://github.com/apache/arrow-rs/pull/810/files#r742905920

ArrayData::try_new() function includes an optional null_count argument and a validity buffer. If these numbers differ, wrong results can occur (and maybe also undefined behavior) in some later operation.

Note that as part of #817 we will be validating that the bitmaps are consistent with the declared null_count

Describe the solution you'd like

Since most callers pass None anyway at which point we calculate the number, if we simply removed the option to pass in an inconsistent value we would avoid a class of inconsistencies I would suggest to remove the null_count parameter from try_new to completely avoid of inconsistencies.

Describe alternatives you've considered
N/A

Additional context
Add any other context or screenshots about the feature request here.

@alamb alamb added arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog api-change Changes to the arrow API labels Nov 4, 2021
@HaoYang670
Copy link
Contributor

If we need to implement this, I can have a try!

@alamb
Copy link
Contributor Author

alamb commented May 16, 2022

If we need to implement this, I can have a try!

I think it is still worth trying 👍

@HaoYang670
Copy link
Contributor

HaoYang670 commented May 17, 2022

Something I find is that we have to move the check of null_bit_buffer size from ArrayData::validate to ArrayData::new_unchecked:

    pub unsafe fn new_unchecked(
        data_type: DataType,
        len: usize,
        null_bit_buffer: Option<Buffer>,
        offset: usize,
        buffers: Vec<Buffer>,
        child_data: Vec<ArrayData>,
    ) -> Self {
...
        if let Some(null_bit_map) = null_bitmap.as_ref() {
            let null_bit_buffer = null_bit_map.buffer_ref();
            let needed_len = bit_util::ceil(len + offset, 8);
            if null_bit_buffer.len() < needed_len {
                return Err(ArrowError::InvalidArgumentError(format!(
                    "null_bit_buffer size too small. got {} needed {}",
                    null_bit_buffer.len(),
                    needed_len
                )));
            }
        }
        let null_count = count_nulls(null_bit_buffer.as_ref(), offset, len);
...

Otherwise, we may panic if the null_bit_buffer is too short.

@HaoYang670
Copy link
Contributor

I find the above bug in the test :

    #[test]
    #[should_panic(expected = "null_bit_buffer size too small. got 1 needed 2")]
    fn test_bitmap_too_small() {
        let buffer = make_i32_buffer(9);
        let null_bit_buffer = Buffer::from(vec![0b11111111]);

        ArrayData::try_new(
            DataType::Int32,
            9,
            Some(0),
            Some(null_bit_buffer),
            0,
            vec![buffer],
            vec![],
        )
        .unwrap();
    }

Here we will not calculate the null count in new_unchecked because we have provided a null count Some(0). And when we validate the array data, we will get the error ArrayError(null_bit_buffer size too small. ...).

However, if we set null_count to be None in this test, we will panic at range end index 2 out of range for slice of length 1 when calling count_nulls(null_bit_buffer.as_ref(), offset, len) in new_unchecked.

@HaoYang670
Copy link
Contributor

Personally, my suggestions are:

  1. move the check of the size of null_bit_buffer into new_unchecked.
  2. Change the return type of new_unchecked from Self to Result<Self>.

@alamb
Copy link
Contributor Author

alamb commented May 17, 2022

cc @tustvold

@tustvold
Copy link
Contributor

tustvold commented May 17, 2022

I think I'm missing something, new_unchecked is unsafe because it trusts the caller. Making it fallible / perform validation seems counter to that, and I'm not sure I follow why it is necessary? Isn't the issue in try_new?

@HaoYang670
Copy link
Contributor

HaoYang670 commented May 18, 2022

Isn't the issue in try_new?

Sorry, my mistake. I think you are right.
Maybe we should modify ArrayData::try_new like that:

    pub fn try_new(
        ...
    ) -> Result<Self> {
        check_size_of_null_bit_buffer()?

        let new_self = unsafe {
            Self::new_unchecked(
                ...
            )
        };

        // As the data is not trusted, do a full validation of its contents
        new_self.validate_full()?;
        Ok(new_self)
    }

@HaoYang670
Copy link
Contributor

I will fix this in #1707

@alamb
Copy link
Contributor Author

alamb commented May 23, 2022

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants