Prevent ArrayData validation length overflow #9816

Merged: alamb merged 2 commits into apache:main from alamb:codex/arraydata-offset-validation-overflow on Apr 27, 2026

alamb commented Apr 25, 2026

Which issue does this PR close?

  • None.

Rationale for this change

ArrayData validation used unchecked usize arithmetic when combining array lengths and offsets. In optimized builds, very large lengths could wrap these calculations and allow invalid ArrayData metadata to pass validation.
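
To make the failure mode concrete, here is a minimal sketch (not the arrow-rs code). The function names `unchecked_fits` and `checked_fits` are hypothetical, and `wrapping_add` stands in for what a plain `len + offset` does in a release build, where integer overflow checks are off by default.

```rust
/// Hypothetical bounds check that mimics unchecked `len + offset` in a
/// release build: `wrapping_add` wraps around on overflow instead of panicking.
fn unchecked_fits(len: usize, offset: usize, capacity: usize) -> bool {
    len.wrapping_add(offset) <= capacity
}

/// The same check with checked arithmetic: an overflowing sum is rejected.
fn checked_fits(len: usize, offset: usize, capacity: usize) -> bool {
    match len.checked_add(offset) {
        Some(total) => total <= capacity,
        None => false,
    }
}
```

With `len = usize::MAX`, `offset = 2`, and `capacity = 8`, the wrapping version accepts the input because the sum wraps to 1, while the checked version rejects it.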

What changes are included in this PR?

This adds checked arithmetic for length plus offset calculations in ArrayData validation, including offset-buffer validation and related typed-buffer sizing paths.

Are these changes tested?

Yes. This adds regression coverage for overflowing offset-buffer and typed-buffer length calculations.

Validated with:

cargo test -p arrow-data overflow --release

Are there any user-facing changes?

Invalid ArrayData whose length plus offset overflows usize now returns a validation error consistently across build modes. There are no API changes.

github-actions bot added the arrow (Changes to the arrow crate) label Apr 25, 2026
Comment thread arrow-data/src/data.rs
/// A thread-safe, shared reference to the Arrow array data.
pub type ArrayDataRef = Arc<ArrayData>;

fn checked_len_plus_offset(
alamb (Contributor, Author):

the idea is to validate / error on overflow with a common helper
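
A rough sketch of what such a common helper might look like follows. Only the name `checked_len_plus_offset` and its call shape `(&data_type, len, offset)?` appear in this thread; the body, the `&str` type name, and the `String` error are assumptions made so the example has no dependencies. `null_buffer_bytes` is a hypothetical caller added for illustration.

```rust
/// Hypothetical stand-in for the helper named in the diff. Its actual
/// signature and error type are not shown in this thread; a `&str` type
/// name and a `String` error are assumed here.
fn checked_len_plus_offset(data_type: &str, len: usize, offset: usize) -> Result<usize, String> {
    len.checked_add(offset).ok_or_else(|| {
        format!("Length {len} with offset {offset} overflows usize for {data_type}")
    })
}

/// Hypothetical caller: number of bytes needed for a validity bitmap
/// covering `len + offset` bits.
fn null_buffer_bytes(data_type: &str, len: usize, offset: usize) -> Result<usize, String> {
    let len_plus_offset = checked_len_plus_offset(data_type, len, offset)?;
    Ok(len_plus_offset.div_ceil(8))
}
```

Routing every `len + offset` through one function means each call site reports overflow the same way, and the `?` operator keeps the call sites short.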

scovich (Contributor) left a comment:

LGTM

Comment thread arrow-data/src/data.rs
  if let Some(null_bit_buffer) = null_bit_buffer.as_ref() {
-     let needed_len = bit_util::ceil(len + offset, 8);
+     let len_plus_offset = checked_len_plus_offset(&data_type, len, offset)?;
+     let needed_len = bit_util::ceil(len_plus_offset, 8);
scovich (Contributor):

Just double checking -- this bit_util::ceil cannot overflow because it always performs a division whose output is smaller than the input (so adding one cannot overflow). Or, when dividing by 1, the result is returned unchanged (no need to add one)?

alamb (Contributor, Author):

Right --

pub fn ceil(value: usize, divisor: usize) -> usize {
    value.div_ceil(divisor)
}

It uses div_ceil (https://doc.rust-lang.org/std/primitive.usize.html#method.div_ceil), which is documented as:

Calculates the quotient of self and rhs, rounding the result towards positive infinity.

Though maybe we should just remove bit_util::ceil and call div_ceil directly now that it is in stable Rust 🤔 (as a follow-on PR).
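
As a small illustration of why the ceiling step itself is safe, the sketch below contrasts `div_ceil` with the common `(value + divisor - 1) / divisor` idiom. `naive_ceil` is a hypothetical name, and both functions assume a non-zero divisor.

```rust
/// Naive ceiling division. `value + divisor - 1` can overflow for large
/// `value`, so `checked_add` is used to surface that case as `None`.
/// Assumes `divisor > 0`.
fn naive_ceil(value: usize, divisor: usize) -> Option<usize> {
    value.checked_add(divisor - 1).map(|v| v / divisor)
}

/// Ceiling division via the standard library, mirroring the `ceil` snippet
/// quoted above. The result never exceeds `value`, so it cannot overflow.
fn ceil(value: usize, divisor: usize) -> usize {
    value.div_ceil(divisor)
}
```

For `value = usize::MAX` and `divisor = 8`, the naive form overflows in its intermediate sum, while `div_ceil` returns `usize::MAX / 8 + 1`.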

Comment thread arrow-data/src/data.rs

// This should have been checked as part of `validate()` prior
// to calling `validate_full()` but double check to be sure
assert!(buffer.len() / mem::size_of::<T>() >= required_len);
scovich (Contributor):

aside: Is this a potential panic we should think about turning to an error instead?

alamb (Contributor, Author):

I do think it is a potential panic (though as the code says it "should not be possible")

making it an internal error of some sort would likely be the more defensive strategy

I can do that in a follow on PR
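
One possible shape for that follow-on change is sketched below. It is not taken from arrow-rs: `check_typed_buffer_len` is a hypothetical name, and a `String` stands in for whatever internal error type the crate would use.

```rust
/// Hypothetical error-returning variant of the `assert!` quoted above.
/// Assumes `T` is not zero-sized; otherwise the division would panic.
fn check_typed_buffer_len<T>(buffer_len_bytes: usize, required_len: usize) -> Result<(), String> {
    let value_size = std::mem::size_of::<T>();
    // Number of whole `T` values the buffer can hold.
    let available = buffer_len_bytes / value_size;
    if available < required_len {
        return Err(format!(
            "Internal error: buffer holds {available} values of {value_size} bytes, \
             but {required_len} are required"
        ));
    }
    Ok(())
}
```

A caller would propagate the result with `?`, so a buffer that is too short produces an error rather than a panic.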

alamb (Contributor, Author) left a comment:

Thanks @scovich

alamb merged commit 710e68e into apache:main Apr 27, 2026
26 checks passed
alamb commented Apr 27, 2026

Thank you for the review @scovich

alamb added a commit to alamb/arrow-rs that referenced this pull request May 5, 2026
alamb added a commit that referenced this pull request May 6, 2026
…#9914)

- Part of #9857
- Fixes #9900 in 56.x releases

This PR:
- Backports #9816 from @alamb to
the `56_maintenance` line
alamb added a commit that referenced this pull request May 6, 2026
…#9925)

- Part of #9858
- Fixes #9900 in 57.x releases

This PR:
- Backports #9816 from @alamb to
the `57_maintenance` line