Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace ArrayData::new() with ArrayData::try_new() and unsafe ArrayData::new_unchecked #822

Merged
merged 3 commits into from
Oct 13, 2021

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Oct 9, 2021

Which issue does this PR close?

Part of #817

Rationale for this change

This PR is a step towards making arrow-rs Rust safe and resolving open RUSTSEC issues.

  • ArrayData::new() is fundamentally unsafe (in the Rust sense) as it relies on the user to pass in valid data or else allows undefined behavior. The API is easy to misuse and should be marked as unsafe to reflect this. See Validate arguments to ArrayData::try_new() #817 for more background.

Builds on @jhorstmann 's work in #813

What changes are included in this PR?

  • Removes ArrayData::new()
  • Creates unsafe ArrayData::new_unchecked() and ArrayData::try_new()
  • Makes ArrayDataBuilder::build() fallible
  • Adds unsafe ArrayDataBuilder::build_unchecked()

Note:

  • This PR contains only the proposed API changes. It does not contain any validation. I plan to add the extra validation as a follow on PR (I have it partly implemented in Validate arguments to ArrayData::new and null bit buffer and buffers #810)
    ** Splitting the changes into several PRs I think will help with reviews
    ** I would like to ensure the API changes are included it arrow-rs 6.0 (planning to make a release candidate in the next week or so). We can then add additional validation in 6.1, 6.2, etc as they will be non breaking API changes.
  • Other than the API changes noted above, the PR is entirely mechanical to use the new API

Are there any user-facing changes?

Yes -- the APIs for creating ArrayData are different. This should not affect any users who create Arrays directly, only those using the lower level APIs.

@alamb alamb added arrow Changes to the arrow crate api-change Changes to the arrow API labels Oct 9, 2021
@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 9, 2021
/// contents of the buffers (e.g. that string offsets for UTF8 arrays
/// are within the length of the buffer).
pub fn validate(&self) -> Result<()> {
// will be filled in a subsequent PR
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#810 has an initial set of checks

@@ -264,6 +278,53 @@ impl ArrayData {
}
}

/// Create a new ArrayData, validating that the provided buffers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API changes in data.rs are the core changes in this PR -- everything else is a mechanical changes to use the new APIs.

@alamb alamb marked this pull request as draft October 9, 2021 15:31
@codecov-commenter
Copy link

codecov-commenter commented Oct 9, 2021

Codecov Report

Merging #822 (9718f64) into master (cc9e285) will increase coverage by 0.02%.
The diff coverage is 90.06%.

❗ Current head 9718f64 differs from pull request most recent head 92e4789. Consider uploading reports for the commit 92e4789 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #822      +/-   ##
==========================================
+ Coverage   82.54%   82.56%   +0.02%     
==========================================
  Files         168      168              
  Lines       47910    47988      +78     
==========================================
+ Hits        39545    39622      +77     
- Misses       8365     8366       +1     
Impacted Files Coverage Δ
arrow/src/array/array_map.rs 81.50% <ø> (ø)
arrow/src/array/equal/mod.rs 93.45% <ø> (ø)
arrow/src/array/equal/utils.rs 74.00% <ø> (ø)
arrow/src/compute/kernels/limit.rs 100.00% <ø> (ø)
arrow/src/json/writer.rs 91.92% <ø> (ø)
arrow/src/record_batch.rs 92.65% <ø> (ø)
arrow/src/util/integration_util.rs 70.10% <ø> (ø)
integration-testing/src/lib.rs 0.00% <0.00%> (ø)
parquet/src/arrow/arrow_writer.rs 98.06% <ø> (ø)
parquet/src/arrow/levels.rs 83.56% <ø> (ø)
... and 31 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc9e285...92e4789. Read the comment docs.

@alamb alamb marked this pull request as ready for review October 9, 2021 16:47
@alamb
Copy link
Contributor Author

alamb commented Oct 9, 2021

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the changes look good, thanks a lot for investing the time in this @alamb

I think we should have some notes around the usages of unsafe: why is it sound in this place?

let arr_data = ArrayDataBuilder::new(DataType::Int32)
.add_buffer(Buffer::from(v))
.build();
let arr_data = unsafe {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a safety note here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be helpful to note that this is (very old) benchmark crate seemingly last modified by @andygrove and @sunchao https://github.com/apache/arrow-rs/blame/master/arrow/benches/array_from_vec.rs#L29-L38

To be honest it does not look safe to me. I will try and rewrite it

arrow/src/array/array.rs Show resolved Hide resolved
@alamb
Copy link
Contributor Author

alamb commented Oct 9, 2021

I think we should have some notes around the usages of unsafe: why is it sound in this place?

It is a good idea @Dandandan -- I will attempt to do so. To be honest I am not sure why all the references are legitimate uses of unsafe.

I don't think this PR has made the code any more or less safe / unsafe than it was before. However, it is now clearer where assumptions are made (they are all annotated with unsafe). I sort of imagine over time people can chip away at removing the unsafe possible

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me as well and I agree with @Dandandan that we should add safety notes.

@alamb
Copy link
Contributor Author

alamb commented Oct 10, 2021

Changes look good to me as well and I agree with @Dandandan that we should add safety notes.

@Dandandan and @houqp --

First I want to emphasize that this PR does not change the safety of the arrow-rs implementation -- the code is as safe/unsafe before this PR as it is after this PR.

I agree that all unsafe annotations in the arrow codebase should be evaluated for soundness eventually.

However, I propose not requiring such annotations for this PR because:

  1. My goal with this PR is not to make arrow-rs totally safe, it is to set the API up so that we can incrementally make it safer going forward (among other things resolving the RUSTSEC security advisories against arrow-rs)
  2. All code newly marked as unsafe was previously reviewed in other PRs
  3. Manually verifying all locations both requires codebase expertise which is quite rare, and is subject to human fallibility -- I would rather have a compiler or runtime code check rather than rely on additional human review.
  4. Existing MIRI test coverage, while not perfect, does offer a measure of protection against undefined behavior.
  5. I think there is some sense of urgency to get this change in so it can be included in arrow 6.0,

Thus, I propose a multi-pronged approach:

  1. When adding runtime validation checks to ArrayData::try_new() -- also add a CI check (or maybe always in debug mode) a check that does ArrayData validation even in ArrayData::new_unchecked
  2. Perhaps crowd source (somehow) the existing tickets for adding safety annotations (see list and when someone works on those, the unsafe uses of ArrayData::new_unchecked can be included.
  3. Re-evaluate if it is time to bring in arrow2 ideas to the main arrow-rs repo.

I will admit that part of my reason for not wanting to try and annotate all uses of unsafe is that doing so is a large / thankless a task, and I suspect the prospect of doing so might have contributed to @jorgecarleitao's choice to focus his time working on arrow2 instead). I think my own time is more productively spent adding runtime verification and/or helping integrate arrow2 (or its approach or whatever) rather than manually evaluating the existing arrow-rs code base.

@houqp
Copy link
Member

houqp commented Oct 10, 2021

I agree with you @alamb 👍

arrow/src/array/array.rs Show resolved Hide resolved
arrow/src/array/array_boolean.rs Show resolved Hide resolved
arrow/src/array/array_primitive.rs Show resolved Hide resolved
arrow/src/compute/kernels/arithmetic.rs Show resolved Hide resolved
@houqp
Copy link
Member

houqp commented Oct 10, 2021

Tried adding some safety notes, I have to agree that the task is quite intense :D I also suggest we focus on filling gaps in arrow2 instead of retrofitting arrow-rs, it's a huge undertaking that @jorgecarleitao has already tried and decided that the effort is not worth it.

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan -- are you OK with merging this PR given the rationale listed in #822 (comment) (you have marked the PR as changes requested)

@jhorstmann are you OK with this PR?

I would like to merge this in and then create an arrow 6.0.0 release candidate (and hopefully unblock the next downstream release of DataFusion)

I plan to make ArrayData::try_new() safer with additional validation (released as part of 6.1.0)

@Dandandan
Copy link
Contributor

@alamb

I agree with your points. The PR already improves on the current state (::new leads to UB when used improperly and should be marked unsafe).

@Dandandan
Copy link
Contributor

@Dandandan -- are you OK with merging this PR given the rationale listed in #822 (comment) (you have marked the PR as changes requested)

Hehe 😃 I was just looking at it. Yeah merging as is would be great.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also looked at it, and agree with merging as is

@alamb
Copy link
Contributor Author

alamb commented Oct 13, 2021

Given the feedback I plan to rebase this and merge it in

@alamb alamb merged commit 058da05 into apache:master Oct 13, 2021
@alamb alamb deleted the alamb/arraydata_unsafe branch October 13, 2021 17:54
@jhorstmann
Copy link
Contributor

Thanks @alamb! I was away from a computer for a few days, this looks good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants