Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some unsound unsafe code in arrow and attempt to improve safety docs. #6021

Closed
wants to merge 1 commit into from

Conversation

veluca93
Copy link
Contributor

@veluca93 veluca93 commented Jul 8, 2024

This is a first step towards fixing #6020.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 8, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @veluca93 -- I looked at some of these changes -- some places definitely look like there could problems, though I am not quite sure

I wonder if you would be willing to break this PR into smaller independent PRs for review

for example,

  1. the changes to parquet might go in one PR
  2. Changes for list/map arrays in another
  3. changes for dictionary array in another

This would both be easier to review and let us merge the parts that get consensus while we work through the others

cc @XiangpengHao and @tustvold @jhorstmann for your comments

Comment on lines +318 to +323
let len_bytes = buf.get(self.offset..self.offset + 4);
let len = match len_bytes {
None => {
return Err(ParquetError::EOF("eof decoding byte array".into()));
}
Some(l) => u32::from_le_bytes(l.try_into().unwrap()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this any more sound? it seems to me this is simply does the same check via using slice

BTW I think you can write this kind of code more concisely like

Suggested change
let len_bytes = buf.get(self.offset..self.offset + 4);
let len = match len_bytes {
None => {
return Err(ParquetError::EOF("eof decoding byte array".into()));
}
Some(l) => u32::from_le_bytes(l.try_into().unwrap()),
let Some(len_bytes) = buf.get(self.offset..self.offset + 4) else {
return Err(ParquetError::EOF("eof decoding byte array".into()));
};
let len = u32::from_le_bytes(len_bytes.try_into().unwrap()),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was trying to avoid the version with let .. else because I was not sure on the MSRV policies of arrow-rs :-)

As for being more sound: the original version is unsound if self.offset + 4 overflows (because it ends up calling .get_unchecked with an empty range, which is UB). In general, the new version achieves the same result without unsafe, which is IMO always a good idea (unless there are big performance issues).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think allocation sizes in rust are limited to isize::MAX, so offset+4 should never overflow.

Personally, I find the previous version with the simple if a bit easier to read. The unsafe can probably be removed since the compiler should be able to remove the bounds checks on its own:

let len = u32::from_le_bytes(buf[offset..offset+4].try_into().unwrap())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is also fine IMO - I just wanted to remove the unsafe, I attempted to do so without introducing new potential branches but if we believe the compiler to remove them (or they are not perf-critical) then this is indeed the clearest option.


let start_offset = self.offset + 4;
let end_offset = start_offset + len as usize;
if end_offset > buf.len() {
if end_offset > buf.len() || end_offset < start_offset {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since offset is always increasing (it is formed by start offset + a usize how could it ever be smaller than start offset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It normally cannot, except in case of overflows :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we then rather use checked_add in the previous line and handle and return an error on overflow? Or use wrapping_add and keep this check, so it is clearer to the reader that overflow is being considered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably checked_add is the best choice (as it is more explicit). Commenting that self.offset is always at most isize::MAX, hence there is no overflow, would also be fine IMO.

}

// SAFETY: bounds are checked above, block_id comes from a call to append_block, and
// we are dealing with byte arrays with no alignment requirements.
// FIXME: what about UTF8 arrays if validate_utf8 is false?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validate_utf8 is false only for ByteViewArray (that can have arbitrary bytes) -- for StringViewArray. validate_utf8 is always true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW here are some more obsessive tests showing this is correct; #6023

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this is up to the caller to ensure - i.e. the caller must ensure (i.e. when calling new) that validate_utf8 is true if the data does not come from a StringViewArray. If that is correct, then this leads to potential UB if the caller does not uphold this contract, which means that new should be an unsafe function and this property should ideally be documented as a safety invariant for the struct.
But I may have misunderstood something :-)

@@ -435,6 +435,10 @@ impl BitReader {
/// This function panics if
/// - `num_bits` is larger than the bit-capacity of `T`
///
// FIXME: soundness issue - this method can be used to write arbitrary bytes to any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you provide an example where this leads to unsound behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Safe code could implement, for example, FromBytes on Box. Then constructing a BitReader from a buffer of 0 Bytes, and calling get_batch on a slice of Box would end up creating a Box containing a null pointer, which is UB.

The issue here is that this code assumes (AFAIU) that all bit patterns are valid for T: FromBytes, and this is true for the implementations of FromBytes here - but safe code could easily make it not true anymore. Thus, probably the best solution is to make FromBytes an unsafe trait.

@@ -461,6 +465,9 @@ impl BitReader {
match size_of::<T>() {
1 => {
let ptr = batch.as_mut_ptr() as *mut u8;
// SAFETY: batch is properly aligned and sized. Caller guarantees that T
// can be safely seen as a slice of bytes through FromBytes bound
// (FIXME: not actually true right now)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ptr is derived from &mut [T], so it is properly aligned and sized. Can you give an example to show that this is unsound?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is indeed OK - see my previous comment for the issue here, which is not with constructing the slice but with writing to it.

@veluca93
Copy link
Contributor Author

veluca93 commented Jul 8, 2024

Thanks @veluca93 -- I looked at some of these changes -- some places definitely look like there could problems, though I am not quite sure

I wonder if you would be willing to break this PR into smaller independent PRs for review

for example,

  1. the changes to parquet might go in one PR
  2. Changes for list/map arrays in another
  3. changes for dictionary array in another

This would both be easier to review and let us merge the parts that get consensus while we work through the others

cc @XiangpengHao and @tustvold @jhorstmann for your comments

Happy to split the PR up!

@veluca93
Copy link
Contributor Author

veluca93 commented Jul 8, 2024

I split up the PR into #6024 and #6025, also removing a bunch of involuntary formatting changes. I see that in the meantime there have been a few changes to view_buffer and byte_view_array - perhaps I should leave those changes for later, once the implementation settles down?

@alamb
Copy link
Contributor

alamb commented Jul 9, 2024

Thank you @veluca93 -- I hope to find time tomorrow to review the new PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants