Fix some unsound unsafe code in arrow and attempt to improve safety docs. #6021

veluca93 · 2024-07-08T17:57:28Z

This is a first step towards fixing #6020.

alamb

Thanks @veluca93 -- I looked at some of these changes -- some places definitely look like there could problems, though I am not quite sure

I wonder if you would be willing to break this PR into smaller independent PRs for review

for example,

the changes to parquet might go in one PR
Changes for list/map arrays in another
changes for dictionary array in another

This would both be easier to review and let us merge the parts that get consensus while we work through the others

cc @XiangpengHao and @tustvold @jhorstmann for your comments

alamb · 2024-07-08T18:46:25Z

parquet/src/arrow/array_reader/byte_view_array.rs

+            let len_bytes = buf.get(self.offset..self.offset + 4);
+            let len = match len_bytes {
+                None => {
+                    return Err(ParquetError::EOF("eof decoding byte array".into()));
+                }
+                Some(l) => u32::from_le_bytes(l.try_into().unwrap()),


Is this any more sound? it seems to me this is simply does the same check via using slice

BTW I think you can write this kind of code more concisely like

Suggested change

let len_bytes = buf.get(self.offset..self.offset + 4);

let len = match len_bytes {

None => {

return Err(ParquetError::EOF("eof decoding byte array".into()));

}

Some(l) => u32::from_le_bytes(l.try_into().unwrap()),

let Some(len_bytes) = buf.get(self.offset..self.offset + 4) else {

return Err(ParquetError::EOF("eof decoding byte array".into()));

};

let len = u32::from_le_bytes(len_bytes.try_into().unwrap()),

Yes, I was trying to avoid the version with let .. else because I was not sure on the MSRV policies of arrow-rs :-)

As for being more sound: the original version is unsound if self.offset + 4 overflows (because it ends up calling .get_unchecked with an empty range, which is UB). In general, the new version achieves the same result without unsafe, which is IMO always a good idea (unless there are big performance issues).

I think allocation sizes in rust are limited to isize::MAX, so offset+4 should never overflow.

Personally, I find the previous version with the simple if a bit easier to read. The unsafe can probably be removed since the compiler should be able to remove the bounds checks on its own:

let len = u32::from_le_bytes(buf[offset..offset+4].try_into().unwrap())

This code is also fine IMO - I just wanted to remove the unsafe, I attempted to do so without introducing new potential branches but if we believe the compiler to remove them (or they are not perf-critical) then this is indeed the clearest option.

alamb · 2024-07-08T18:47:02Z

parquet/src/arrow/array_reader/byte_view_array.rs


            let start_offset = self.offset + 4;
            let end_offset = start_offset + len as usize;
-            if end_offset > buf.len() {
+            if end_offset > buf.len() || end_offset < start_offset {


since offset is always increasing (it is formed by start offset + a usize how could it ever be smaller than start offset?

It normally cannot, except in case of overflows :-)

should we then rather use checked_add in the previous line and handle and return an error on overflow? Or use wrapping_add and keep this check, so it is clearer to the reader that overflow is being considered.

Probably checked_add is the best choice (as it is more explicit). Commenting that self.offset is always at most isize::MAX, hence there is no overflow, would also be fine IMO.

alamb · 2024-07-08T18:47:51Z

parquet/src/arrow/array_reader/byte_view_array.rs

            }

+            // SAFETY: bounds are checked above, block_id comes from a call to append_block, and
+            // we are dealing with byte arrays with no alignment requirements.
+            // FIXME: what about UTF8 arrays if validate_utf8 is false?


validate_utf8 is false only for ByteViewArray (that can have arbitrary bytes) -- for StringViewArray. validate_utf8 is always true

BTW here are some more obsessive tests showing this is correct; #6023

If I understand correctly, this is up to the caller to ensure - i.e. the caller must ensure (i.e. when calling new) that validate_utf8 is true if the data does not come from a StringViewArray. If that is correct, then this leads to potential UB if the caller does not uphold this contract, which means that new should be an unsafe function and this property should ideally be documented as a safety invariant for the struct.
But I may have misunderstood something :-)

alamb · 2024-07-08T18:49:57Z

parquet/src/util/bit_util.rs

@@ -435,6 +435,10 @@ impl BitReader {
    /// This function panics if
    /// - `num_bits` is larger than the bit-capacity of `T`
    ///
+    // FIXME: soundness issue - this method can be used to write arbitrary bytes to any


can you provide an example where this leads to unsound behavior?

Safe code could implement, for example, FromBytes on Box. Then constructing a BitReader from a buffer of 0 Bytes, and calling get_batch on a slice of Box would end up creating a Box containing a null pointer, which is UB.

The issue here is that this code assumes (AFAIU) that all bit patterns are valid for T: FromBytes, and this is true for the implementations of FromBytes here - but safe code could easily make it not true anymore. Thus, probably the best solution is to make FromBytes an unsafe trait.

XiangpengHao · 2024-07-08T19:21:29Z

parquet/src/util/bit_util.rs

@@ -461,6 +465,9 @@ impl BitReader {
        match size_of::<T>() {
            1 => {
                let ptr = batch.as_mut_ptr() as *mut u8;
+                // SAFETY: batch is properly aligned and sized. Caller guarantees that T
+                // can be safely seen as a slice of bytes through FromBytes bound
+                // (FIXME: not actually true right now)


ptr is derived from &mut [T], so it is properly aligned and sized. Can you give an example to show that this is unsound?

Yep, this is indeed OK - see my previous comment for the issue here, which is not with constructing the slice but with writing to it.

veluca93 · 2024-07-08T22:21:18Z

Thanks @veluca93 -- I looked at some of these changes -- some places definitely look like there could problems, though I am not quite sure

I wonder if you would be willing to break this PR into smaller independent PRs for review

for example,

the changes to parquet might go in one PR

Changes for list/map arrays in another

changes for dictionary array in another

This would both be easier to review and let us merge the parts that get consensus while we work through the others

cc @XiangpengHao and @tustvold @jhorstmann for your comments

Happy to split the PR up!

veluca93 · 2024-07-08T22:58:20Z

I split up the PR into #6024 and #6025, also removing a bunch of involuntary formatting changes. I see that in the meantime there have been a few changes to view_buffer and byte_view_array - perhaps I should leave those changes for later, once the implementation settles down?

alamb · 2024-07-09T22:04:02Z

Thank you @veluca93 -- I hope to find time tomorrow to review the new PRs

Fix some unsound unsafe code and attempt to improve safety docs.

0c5456f

github-actions bot added the parquet Changes to the parquet crate label Jul 8, 2024

alamb reviewed Jul 8, 2024

View reviewed changes

XiangpengHao reviewed Jul 8, 2024

View reviewed changes

alamb mentioned this pull request Jul 9, 2024

DataFusion weekly project plan (Andrew Lamb) - July 8, 2024 apache/datafusion#11334

Closed

9 tasks

This was referenced Jul 15, 2024

DataFusion weekly project plan (Andrew Lamb) - July 15, 2024 apache/datafusion#11474

Closed

DataFusion weekly project plan (Andrew Lamb) - July 22, 2024 apache/datafusion#11601

Closed

alamb marked this pull request as draft August 28, 2024 11:04

tustvold closed this Oct 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix some unsound unsafe code in arrow and attempt to improve safety docs. #6021

Fix some unsound unsafe code in arrow and attempt to improve safety docs. #6021

veluca93 commented Jul 8, 2024

alamb left a comment

alamb Jul 8, 2024

veluca93 Jul 8, 2024

jhorstmann Jul 9, 2024

veluca93 Jul 9, 2024

alamb Jul 8, 2024

veluca93 Jul 8, 2024

jhorstmann Jul 9, 2024

veluca93 Jul 9, 2024

alamb Jul 8, 2024

alamb Jul 8, 2024

veluca93 Jul 8, 2024

alamb Jul 8, 2024

veluca93 Jul 8, 2024

XiangpengHao Jul 8, 2024

veluca93 Jul 8, 2024

veluca93 commented Jul 8, 2024

veluca93 commented Jul 8, 2024

alamb commented Jul 9, 2024

Fix some unsound unsafe code in arrow and attempt to improve safety docs. #6021

Fix some unsound unsafe code in arrow and attempt to improve safety docs. #6021

Conversation

veluca93 commented Jul 8, 2024

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

veluca93 commented Jul 8, 2024

veluca93 commented Jul 8, 2024

alamb commented Jul 9, 2024