Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10692: [Rust] Removed undefined behavior derived from null pointers #8997

Closed
wants to merge 5 commits into from
Closed

ARROW-10692: [Rust] Removed undefined behavior derived from null pointers #8997

wants to merge 5 commits into from

Conversation

jorgecarleitao
Copy link
Member

Currently, our allocation code is not guaranteeing that the std::mem::alloc was successful, by checking for whether the returned pointer was not null. Passing null pointers to buffers is dangerous, specially given that Buffers currently expose them without any checks.

This PR is a series of modifications that removes the possibility of having null pointers:

  • Made most of our pointers NonNull and panic whenever a null pointer tries to sneak to a buffer (either via FFI or a failed allocation)
  • Guard against overflow of a pointer address during allocations (relevant for 32 bit systems)
  • remove the possibility of a null pointer to be on RawPtrBox, flags RawPtrBox::new as unsafe and documents the invariants necessary to a sound usage of RawPtrBox.
  • Made all methods in memory expect and output a NonNull

All these changes were highly motivated by the code in Rust's std::alloc, and how it deals with these edge cases.

The main consequence of these changes is that our buffers no longer hold null pointers, which allow us to implement Deref<[u8]> (done in this PR), and treat Buffer as very similar to an immutable Vec<u8> (and MutableBuffer closer to Vec<u8>). In this direction, this PR renames a bunch of methods:

  • MutableBuffer::data_mut -> MutableBuffer::as_slice_mut
  • MutableBuffer::data -> MutableBuffer::as_slice
  • Buffer::data -> Buffer::as_slice
  • Buffer::raw_data -> Buffer::as_ptr
  • RawPtrBox::get -> RawPtrBox::as_ptr

The rational for these names come from Vec::as_slice_mut, Vec::as_slice, Vec::as_ptr and NonNull::as_ptr respectively.

@github-actions
Copy link

@codecov-io
Copy link

codecov-io commented Dec 23, 2020

Codecov Report

Merging #8997 (a6531ec) into master (51672b2) will increase coverage by 0.00%.
The diff coverage is 95.51%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #8997   +/-   ##
=======================================
  Coverage   82.61%   82.61%           
=======================================
  Files         202      202           
  Lines       50048    50055    +7     
=======================================
+ Hits        41347    41354    +7     
  Misses       8701     8701           
Impacted Files Coverage Δ
...ion-testing/src/bin/arrow-json-integration-test.rs 0.00% <0.00%> (ø)
rust/arrow/src/array/array_boolean.rs 86.50% <85.71%> (-0.22%) ⬇️
rust/arrow/src/array/array_primitive.rs 91.64% <85.71%> (-0.05%) ⬇️
rust/arrow/src/bytes.rs 58.33% <87.50%> (+4.27%) ⬆️
rust/parquet/src/arrow/record_reader.rs 96.27% <92.30%> (+0.02%) ⬆️
rust/arrow/src/buffer.rs 98.21% <92.72%> (-0.01%) ⬇️
rust/arrow/src/memory.rs 98.14% <96.55%> (-1.86%) ⬇️
rust/arrow/src/array/array_binary.rs 90.61% <100.00%> (ø)
rust/arrow/src/array/array_list.rs 93.10% <100.00%> (-0.02%) ⬇️
rust/arrow/src/array/array_string.rs 90.27% <100.00%> (ø)
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51672b2...a6531ec. Read the comment docs.

@github-actions github-actions bot added needs-rebase A PR that needs to be rebased by the author and removed needs-rebase A PR that needs to be rebased by the author labels Dec 24, 2020
@jorgecarleitao
Copy link
Member Author

@vertexclique , would you mind taking a look at this? You are an expert around these. :)

@github-actions github-actions bot added needs-rebase A PR that needs to be rebased by the author and removed needs-rebase A PR that needs to be rebased by the author labels Dec 24, 2020
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @jorgecarleitao. I can't say I am an expert in this code / area of rust, but I read this PR carefully and I think I understand all the changes and they make sense to me.

In your PR description, you are basically saying this code is bringing MutableBuffer closer to the implementation of Vec... I wonder then, how is the work in this PR related to #8796 (aka actually using Vec<u8> as the underlying memory source)?

/// This function panics if:
/// * `ptr` is null
/// * `ptr` is not aligned to a slice of type `T`. This is guaranteed if it was built from a slice of type `T`.
pub(super) unsafe fn new(ptr: *const u8) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice change -- to move the alignment assertion into the RawPtrBox::new as that now makes the callsites clearer as well as they can't forget to ensure alignment. 👍

rust/arrow/src/buffer.rs Show resolved Hide resolved
@@ -919,24 +962,18 @@ mod tests {

#[test]
fn test_from_raw_parts() {
let buf = unsafe { Buffer::from_raw_parts(null_mut(), 0, 0) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this test removed? Because it is no longer possible to create a buffer from a null pointer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly :)

@jorgecarleitao
Copy link
Member Author

@alamb , thanks a lot for taking the time to review this. This one is challenging to review.

wrt to the vec<u8>, my current hypothesis is that the alignment does improve performance when SIMD is on (as the benches show in that PR). As such, my current direction is to get the best of both worlds: migrate the relevant parts of std::rawbytes::RawBytes and std::alloc to memory.rs (as that code is currently unstable) to still allocate aligned, but use the well studied code from rust's std. This is also related to #9016 .

Basically, we have a performance problem in the reallocation code. The following is the result of 4 runs:

mutable                 time:   [929.26 us 931.88 us 935.42 us]                    
mutable prepared        time:   [1.0682 ms 1.0693 ms 1.0709 ms]                              
from_slice              time:   [4.4857 ms 4.5043 ms 4.5247 ms]                        
from_slice prepared     time:   [1.4358 ms 1.4406 ms 1.4467 ms]                                 
  1. start with an empty mutable and grow it.
  2. start with a mutable with the correct capacity and grows (i.e. extend without re-allocation).
  3. do the same with a vec<u8> and at the end of all use Buffer::from
  4. same as 2 and at the end of all use Buffer::from

The fact that there is no difference between 1 and 2 but a 3.5x difference between 3 and 4 shows that we are doing something wrong.

@vertexclique
Copy link
Contributor

I am going to check out this code and have a look today @jorgecarleitao

@alamb
Copy link
Contributor

alamb commented Dec 31, 2020

The full set of Rust CI tests did not run on this PR :(

Can you please rebase this PR against apache/master to pick up the changes in #9056 so that they do?

I apologize for the inconvenience.

@jorgecarleitao jorgecarleitao deleted the clean_buffer branch January 1, 2021 09:17
@jorgecarleitao
Copy link
Member Author

@Dandandan @alamb @nevi-me @jhorstmann : this was merged and had some changes to the names of public methods of the Buffer and MutableBuffer (they are now similar to the ones in std::Vec. Just fyi, as you have been working with them :)

// * The pointers are non-null by construction
// * alignment asserted above
// Unsoundness
// * There is no guarantee that the memory regions do are non-overalling, but `memcpy` requires this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since buffer is freshly allocated it should not be possible for the memory regions to overlap

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This was copied from a previous version and it slipped. :/

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…ters

Currently, our allocation code is not guaranteeing that the `std::mem::alloc` was successful, by checking for whether the returned pointer was not null. Passing null pointers to buffers is dangerous, specially given that Buffers currently expose them without any checks.

This PR is a series of modifications that removes the possibility of having null pointers:

* Made most of our pointers `NonNull` and panic whenever a null pointer tries to sneak to a buffer (either via FFI or a failed allocation)
* Guard against overflow of a pointer address during allocations (relevant for 32 bit systems)
* remove the possibility of a null pointer to be on `RawPtrBox`, flags `RawPtrBox::new` as `unsafe` and documents the invariants necessary to a sound usage of `RawPtrBox`.
* Made all methods in `memory` expect and output a `NonNull`

All these changes were highly motivated by the code in Rust's `std::alloc`, and how it deals with these edge cases.

The main consequence of these changes is that our buffers no longer hold null pointers, which allow us to implement `Deref<[u8]>` (done in this PR), and treat `Buffer` as very similar to an immutable `Vec<u8>` (and `MutableBuffer` closer to `Vec<u8>`). In this direction, this PR renames a bunch of methods:

* `MutableBuffer::data_mut -> MutableBuffer::as_slice_mut`
* `MutableBuffer::data -> MutableBuffer::as_slice`
* `Buffer::data -> Buffer::as_slice`
* `Buffer::raw_data -> Buffer::as_ptr`
* `RawPtrBox::get -> RawPtrBox::as_ptr`

The rational for these names come from `Vec::as_slice_mut`, `Vec::as_slice`, `Vec::as_ptr` and `NonNull::as_ptr` respectively.

Closes apache#8997 from jorgecarleitao/clean_buffer

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants