
Validate arguments to ArrayData::new and null bit buffer and buffers #810

Merged: 7 commits into apache:master on Nov 8, 2021

Conversation

alamb (Contributor) commented Sep 30, 2021

Which issue does this PR close?

Part of #817

Rationale for this change

This is a step towards improving the security posture of arrow-rs and resolving RUSTSEC vulnerabilities by validating arguments to ArrayData::try_new(). This particular PR adds basic size and offset sanity checking.

I apologize for the massive PR, but the checks and tests are fairly repetitive.

See discussion on #817 for more details

What changes are included in this PR?

  1. Add basic offset argument validation to ArrayData::try_new() based on the checks in the C++ implementation, kindly pointed out (and contributed) by @pitrou

Planned for follow-on PRs:

Are there any user-facing changes?

  1. ArrayData::try_new() may now return an Err on some (invalid) inputs rather than succeeding
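As a hedged illustration of the behavior change (a simplified stand-in, not the actual arrow-rs signature), construction now checks buffer capacity against the declared length and offset instead of succeeding unconditionally:

```rust
// Simplified sketch, NOT the real ArrayData::try_new signature: a
// constructor that errors when the backing buffer cannot cover the
// declared offset + len, instead of silently succeeding.
fn try_new_i32(len: usize, offset: usize, buffer: &[i32]) -> Result<(), String> {
    if buffer.len() < offset + len {
        return Err(format!(
            "buffer of {} values too small for offset {} + len {}",
            buffer.len(),
            offset,
            len
        ));
    }
    Ok(())
}

fn main() {
    // in-bounds inputs still succeed
    assert!(try_new_i32(4, 0, &[1, 2, 3, 4]).is_ok());
    // previously this kind of input "succeeded" and could read past the end
    assert!(try_new_i32(4, 2, &[1, 2, 3, 4]).is_err());
}
```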

@alamb alamb changed the title Validate arguments to ArrayData::new: null bit buffer and buffers (WIP) Validate arguments to ArrayData::new and null bit buffer and buffers (WIP) Sep 30, 2021
@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 30, 2021
buffers.len()
);
// if min_size is zero, may not have buffers (e.g. NullArray)
assert!(min_size == 0 || buffers[0].len() >= min_size,
alamb (author):
This is the core validation check.
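A hedged sketch of what that assertion enforces (helper name and shape are illustrative, not arrow-rs code): min_size is the minimum number of bytes the layout requires of the first buffer, and a min_size of zero means the type may legitimately have no buffers at all.

```rust
// Illustrative version of the core check: skip when min_size is zero
// (e.g. NullArray has no buffers), otherwise require the first buffer
// to be at least min_size bytes.
fn check_first_buffer(min_size: usize, buffer_lens: &[usize]) -> Result<(), String> {
    if min_size == 0 {
        // e.g. NullArray: no buffer required
        return Ok(());
    }
    match buffer_lens.first() {
        Some(&len) if len >= min_size => Ok(()),
        Some(&len) => Err(format!("first buffer has {} bytes, need {}", len, min_size)),
        None => Err("expected at least one buffer".to_string()),
    }
}

fn main() {
    assert!(check_first_buffer(0, &[]).is_ok()); // NullArray-like case
    assert!(check_first_buffer(16, &[16]).is_ok());
    assert!(check_first_buffer(16, &[8]).is_err());
}
```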

@@ -869,7 +869,7 @@ mod tests {
#[test]
fn test_primitive_array_builder() {
// Test building a primitive array with ArrayData builder and offset
-        let buf = Buffer::from_slice_ref(&[0, 1, 2, 3, 4]);
+        let buf = Buffer::from_slice_ref(&[0i32, 1, 2, 3, 4, 5, 6]);
alamb (author):
These tests needed to be updated because they were creating invalid Buffers (this one, for example, declared a len of 5 and an offset of 2 but passed in an array only 5 elements long).
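The arithmetic behind that fix can be sketched as follows (helper is illustrative only, not arrow-rs code): a primitive array with offset o and length n reads values [o, o + n), so the buffer must hold at least o + n values.

```rust
// Why the test buffer grew from 5 to 7 values: offset 2 + len 5 needs
// 7 i32 slots (28 bytes); the old 5-value (20 byte) buffer was short.
fn required_bytes_i32(offset: usize, len: usize) -> usize {
    (offset + len) * std::mem::size_of::<i32>()
}

fn main() {
    assert_eq!(required_bytes_i32(2, 5), 28);
    // the old buffer only had 5 values = 20 bytes, so it was too small
    assert!(5 * std::mem::size_of::<i32>() < required_bytes_i32(2, 5));
}
```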

alamb (Contributor, Author) commented Sep 30, 2021

@jorgecarleitao / @nevi-me / @jhorstmann do you have any concerns with this approach before I spend more time filling in the details for nested / compound structures (List, Dictionary, etc)?

For some of the structures, the actual data will likely need to be inspected (to ensure, for example, that the values in the offsets buffer of utf8 arrays are valid).

I am not sure about the performance impact of doing these validations. If it turns out to be too great, I can imagine having a "trusted" version of ArrayData::new() (only callable within the arrow crate) or feature-flagging the validation so it can be disabled by those users who choose to do so.
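The data-level check mentioned for utf8 arrays can be sketched like this (function name and shape are illustrative, not the arrow-rs implementation): offsets must be non-negative, monotonically non-decreasing, and the final offset must stay within the values buffer.

```rust
// Hedged sketch of validating the offsets buffer of a utf8 array:
// each offset must not decrease, and the last offset must not point
// past the end of the values buffer.
fn validate_utf8_offsets(offsets: &[i32], values_len: usize) -> Result<(), String> {
    let mut prev = 0i32; // starting at 0 also rejects a negative first offset
    for (i, &off) in offsets.iter().enumerate() {
        if off < prev {
            return Err(format!("offset {} at index {} decreases from {}", off, i, prev));
        }
        prev = off;
    }
    if prev as usize > values_len {
        return Err(format!("last offset {} exceeds values length {}", prev, values_len));
    }
    Ok(())
}

fn main() {
    assert!(validate_utf8_offsets(&[0, 2, 5], 5).is_ok());
    assert!(validate_utf8_offsets(&[0, 6, 5], 5).is_err()); // not monotonic
    assert!(validate_utf8_offsets(&[0, 2, 9], 5).is_err()); // out of bounds
}
```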

codecov-commenter commented Sep 30, 2021

Codecov Report

Merging #810 (92041b0) into master (62934e9) will increase coverage by 0.01%.
The diff coverage is 85.51%.


@@            Coverage Diff             @@
##           master     #810      +/-   ##
==========================================
+ Coverage   82.29%   82.31%   +0.01%     
==========================================
  Files         168      168              
  Lines       48028    48409     +381     
==========================================
+ Hits        39527    39847     +320     
- Misses       8501     8562      +61     
Impacted Files Coverage Δ
arrow/src/compute/kernels/cast.rs 94.81% <ø> (ø)
arrow/src/datatypes/datatype.rs 65.95% <75.00%> (+1.02%) ⬆️
arrow/src/array/data.rs 79.01% <84.11%> (+4.08%) ⬆️
arrow/src/array/array_binary.rs 93.13% <100.00%> (+0.06%) ⬆️
arrow/src/array/array_boolean.rs 94.53% <100.00%> (ø)
arrow/src/array/array_list.rs 95.52% <100.00%> (ø)
arrow/src/array/array_map.rs 81.50% <100.00%> (ø)
arrow/src/array/array_primitive.rs 94.09% <100.00%> (+0.03%) ⬆️
arrow/src/array/array_union.rs 90.97% <100.00%> (+0.04%) ⬆️
arrow/src/compute/util.rs 98.90% <100.00%> (+<0.01%) ⬆️
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

alamb (Contributor, Author) commented Oct 2, 2021

jhorstmann (Contributor):

The C++ validation looks extensive and includes offsets and dictionary keys, so basing the logic on that makes sense.

These offsets and dictionary keys are actually my main performance concern, I think we want at least an ArrayData::new_unchecked for usage in Array::slice and ArrayData::slice. If the original offsets or keys were valid, then the slice can be assumed to also be valid. I'd also like this unsafe method to be public for anyone wanting to implement kernels outside of arrow.
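The checked/unchecked split proposed here can be sketched as follows. The struct and method names are stand-ins modeled on the discussion, not the eventual arrow-rs API: the safe path validates invariants, while the unsafe path skips them for callers (like slice) that can prove they already hold.

```rust
// Minimal sketch of a fallible constructor paired with an unsafe,
// validation-free one for trusted call sites.
struct Data {
    len: usize,
    buf: Vec<u8>,
}

impl Data {
    fn try_new(len: usize, buf: Vec<u8>) -> Result<Self, String> {
        if buf.len() < len {
            return Err("buffer shorter than declared len".to_string());
        }
        Ok(Self { len, buf })
    }

    /// # Safety
    /// The caller must guarantee that `buf.len() >= len`.
    unsafe fn new_unchecked(len: usize, buf: Vec<u8>) -> Self {
        Self { len, buf }
    }
}

fn main() {
    assert!(Data::try_new(4, vec![0; 2]).is_err());
    // a slice of already-validated data can skip re-validation
    let sliced = unsafe { Data::new_unchecked(2, vec![0; 4]) };
    assert_eq!(sliced.len, 2);
    assert_eq!(sliced.buf.len(), 4);
}
```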

alamb (Contributor, Author) commented Oct 4, 2021

Sounds good @jhorstmann. The C++ version includes two validation methods: one that basically checks buffer sizes and one that checks the offsets, etc. I will attempt to mirror the same.

Will ping you when I have this PR ready for review

@alamb alamb force-pushed the alamb/safe_array_data_construction branch from 33a54ee to 73b249d Compare October 6, 2021 14:42
@github-actions github-actions bot added the arrow-flight Changes to the arrow-flight crate label Oct 6, 2021
@nevi-me nevi-me self-requested a review October 7, 2021 00:36
@alamb alamb force-pushed the alamb/safe_array_data_construction branch from 005c173 to 68d9a66 Compare October 8, 2021 18:56
alamb (Contributor, Author) commented Oct 8, 2021

This PR is not ready for review yet; when it is, I will take it out of draft.

@alamb alamb marked this pull request as draft October 8, 2021 18:56
@alamb alamb force-pushed the alamb/safe_array_data_construction branch 2 times, most recently from 1207c89 to b270574 Compare October 14, 2021 22:05
@alamb alamb force-pushed the alamb/safe_array_data_construction branch from add6e56 to a63b01e Compare October 24, 2021 11:43
@alamb alamb force-pushed the alamb/safe_array_data_construction branch from a63b01e to d3afda6 Compare October 24, 2021 12:27
alamb (Contributor, Author) commented Oct 24, 2021

Update: I have all the existing tests now passing

Still TODO:


// Test binary array with offset
let array_data = ArrayData::builder(DataType::Binary)
-    .len(4)
+    .len(2)
alamb (author):
The array data in this case is only 4 elements long, so using an offset of 1 with len 4 is incorrect (it reads off the end of the array data). The same applies to the test below.

.len(5)
.build()
.unwrap();
let data = unsafe {
alamb (author):
build() now fails the earlier validation check, so to keep the test exercising the internal BooleanArray checks it needs to use unsafe.

Note that it might be best to eventually remove all Array-specific checks (which I think will be redundant) in favor of consolidated checks in ArrayData.rs, but I don't want to consider doing that until I have the validation checks completed.

let list_data = ArrayData::builder(list_data_type)
-    .len(3)
+    .len(2)
alamb (author):
This is a similar invalid test that the new checks identified -- value_data has only three items in it, so doing offset = 1 and len = 3 can potentially read off the end. This change corrects the len to be within bounds.

There is a very similar (and thus fixed) test in array_list.rs and one in array_map.rs

.add_child_data(value_data)
.build()
.unwrap();
let list_data = unsafe {
alamb (author):
This test is checking the ListArray-specific error, but since the new ArrayData validation checks now catch the problem, we need to switch to using build_unchecked() to exercise the same code path.

// https://github.com/apache/arrow-rs/issues/814
// https://github.com/apache/arrow-rs/issues/85
if matches!(&self.data_type, DataType::Union(..)) {
return Ok(());
alamb (author):

The "safe" way here would be to fail validation always for Union to signal to the user that the validation checks are not correct.

I think a better story would be to fix #85 and #814. I am torn on what to do in this case.

alamb (author):
I finally settled on leaving UnionArray unvalidated for this PR so that I can backport it to the 6.x release line (it is backwards compatible); in a later, backwards-incompatible PR I will fix up the UnionArray implementation (and validation).

@alamb alamb force-pushed the alamb/safe_array_data_construction branch from 21b9685 to b37c8a1 Compare October 30, 2021 12:20
@@ -137,7 +137,10 @@ impl UnionArray {
}
}

-        Ok(Self::new(type_ids, value_offsets, child_arrays, bitmap))
+        let new_self = Self::new(type_ids, value_offsets, child_arrays, bitmap);
+        new_self.data().validate()?;
alamb (author):
As part of #85 I will clean up the union array validation

alamb (Contributor, Author) commented Oct 30, 2021

@jhorstmann, @paddyhoran @nevi-me or @jorgecarleitao might I trouble one of you for a review of this PR? I know it is large, but I think it is important and fairly mechanical in terms of validating the creation of ArrayData

alamb (Contributor, Author) commented Oct 30, 2021

FYI the MIRI failure is likely the same as #879 -- I'll plan to look at that shortly if no one else gets to it

@alamb alamb force-pushed the alamb/safe_array_data_construction branch from b37c8a1 to 4c5e3be Compare November 2, 2021 20:37
}
}

if self.null_count > self.len {
jhorstmann (Contributor):

I'm wondering what could happen if the null_count actually gives a different number than the validity buffer, and whether this could lead to undefined behavior in some later operation. Most callers pass None anyway, so we calculate a number which is guaranteed to be in range. I would suggest removing the null_count parameter from try_new to completely avoid inconsistencies.

Kernels that want to avoid the overhead of counting bits could use the unsafe new_unchecked method. I see it is currently set by the cast_array_data function. To make that one safe while avoiding bit counting, we could just validate that the from and to layouts are equal.

alamb (author):

> I'm wondering what could happen if the null_count actually gives a different number than the validity buffer and whether this could lead to undefined behavior in some later operation. Most callers pass None anyway, so we calculate a number which is guaranteed to be in range.

I suspect (but cannot prove) that if null_count is inconsistent with the validity buffer then there may be incorrect answers, but not undefined behavior (in the Rust sense).

My plan will be:

  1. In a separate PR (as it will not be backwards compatible) remove the null_count from try_new(): Remove null_count from ArrayData::try_new() #911
  2. in the PR where I add the full on data checking (validate_full()) ensure that the declared null count matches the validity bitmap
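The consistency check described in step 2 can be sketched as follows (function name is illustrative, not the arrow-rs API): recompute the null count from the validity bitmap (LSB-first, one bit per value, 1 = valid) and compare it against the declared count.

```rust
// Hedged sketch: derive the null count from the validity bitmap so a
// declared null_count can be cross-checked against it.
fn computed_null_count(validity: &[u8], len: usize) -> usize {
    let valid = (0..len)
        .filter(|&i| validity[i / 8] & (1 << (i % 8)) != 0)
        .count();
    len - valid // nulls are the unset bits
}

fn main() {
    // bits 0 and 2 set -> 2 valid of 3 -> 1 null
    assert_eq!(computed_null_count(&[0b0000_0101], 3), 1);
    // a declared null_count of 0 here would fail the consistency check
    assert_ne!(computed_null_count(&[0b0000_0101], 3), 0);
}
```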

arrow/src/array/data.rs (outdated review thread, resolved)
@@ -477,6 +477,15 @@ impl DataType {
)
}

/// Returns true if this type is integral: (UInt*, Int*).
pub fn is_integer(t: &DataType) -> bool {
jhorstmann (Contributor):

This seems to be only used for validating dictionary key types, maybe rename to is_dictionary_key_type and link the ArrowDictionaryKeyType trait in the comment.

alamb (author) commented Nov 4, 2021:

Done in 8acd8d2
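The intent of the suggested rename can be sketched like this (the enum is a reduced stand-in for the real arrow DataType, and the predicate mirrors the idea of ArrowDictionaryKeyType rather than the exact implementation): dictionary keys must be one of the integer types.

```rust
// Reduced stand-in enum; the real arrow DataType has many more variants.
enum DataType {
    Int8, Int16, Int32, Int64,
    UInt8, UInt16, UInt32, UInt64,
    Float32, Utf8,
}

// Only integer types are valid dictionary keys.
fn is_dictionary_key_type(t: &DataType) -> bool {
    matches!(
        t,
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64
            | DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64
    )
}

fn main() {
    assert!(is_dictionary_key_type(&DataType::Int32));
    assert!(!is_dictionary_key_type(&DataType::Utf8));
    assert!(!is_dictionary_key_type(&DataType::Float32));
}
```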

jhorstmann (Contributor):

I think this looks good. Might be worthwhile to convert some of the current new_unchecked usages to try_new to increase coverage.

alamb (Contributor, Author) commented Nov 4, 2021

> I think this looks good. Might be worthwhile to convert some of the current new_unchecked usages to try_new to increase coverage.

At the moment, almost all of the tests in arrow use ArrayData::try_new(...).unwrap() so there is pretty good coverage of this code already (I found several bugs this way :) )

The remaining uses of new_unchecked() are either in compute kernels where performance seems to be critical or in tests which are explicitly checking array construction validation.

What I plan as part of implementing validate_full() is to remove all array-specific type checks and update the tests to use ArrayData::try_new().

arrow/src/array/data.rs (outdated review thread, resolved)
alamb (Contributor, Author) commented Nov 5, 2021

Thanks @jhorstmann for the review. I think this PR is now ready to go and will plan to merge it early next week unless anyone objects or would like more time to review.

@alamb alamb added the security label Nov 8, 2021
@alamb alamb merged commit 74b520c into apache:master Nov 8, 2021
@alamb alamb deleted the alamb/safe_array_data_construction branch November 8, 2021 19:11
alamb added a commit that referenced this pull request Nov 9, 2021
…810)

* Validate arguments to ArrayData::new: null bit buffer and buffers

* REname is_int_type to is_dictionary_key_type()

* Correctly handle self.offset in offsets buffer

* Consolidate checks

* Fix test output
@alamb alamb added the cherry-picked PR that was backported to active release (will be included in maintenance release) label Nov 9, 2021
alamb added a commit that referenced this pull request Nov 9, 2021
alamb added a commit that referenced this pull request Nov 9, 2021
…810) (#936)
