Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Union Array to add UnionMode, match latest Arrow Spec, and rename new -> unsafe new_unchecked() #885

Merged
merged 4 commits into from
Jan 2, 2022

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Oct 30, 2021

(Built on #810, so to see the changes proposed in this PR just look at the last commit)

Which issue does this PR close?

Closes #814
Closes #85
Related to #817

Rationale for this change

Follow Arrow spec for UnionArray, make it possible to validate with generic code added in #810, and conform to Rust safety conventions.

See https://arrow.apache.org/docs/format/Columnar.html#union-layout for a description of the Union layout

What changes are included in this PR?

  1. Rename UnionArray::new to unsafe UnionArray::new_unchecked() to follow standard Rust safety conventions
  2. Add UnionMode to DataType::Union so the type carries information on expected format
  3. Update null values offset handling to match https://arrow.apache.org/docs/format/Columnar.html#union-layout
  4. Add validation in ArrayData::validate

Are there any user-facing changes?

  1. UnionArray::new has been renamed to unsafe UnionArray::new_unchecked()
  2. Must specify Dense or Sparse when creating a DataType::Union
  3. Accepted UnionArray format is different

@alamb alamb added the api-change Changes to the arrow API label Oct 30, 2021
@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Oct 30, 2021
@@ -65,7 +65,7 @@ impl UnionArray {
/// In both of the cases above we are accepting `Buffer`'s which are assumed to be representing
/// `i8` and `i32` values respectively. `Buffer` objects are untyped and no attempt is made
/// to ensure that the data provided is valid.
pub fn new(
pub unsafe fn new_unchecked(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the API change

pub fn try_new(
type_ids: Buffer,
value_offsets: Option<Buffer>,
child_arrays: Vec<(Field, ArrayRef)>,
bitmap: Option<Buffer>,
) -> Result<Self> {
if let Some(b) = &value_offsets {
let nulls = count_nulls(bitmap.as_ref(), 0, type_ids.len());
if ((type_ids.len() - nulls) * 4) != b.len() {
if ((type_ids.len()) * 4) != b.len() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the core format change -- the length of type_ids must match the length of the values (rather than skipping nulls)

// Note sparse unions only have one buffer (u8) type_ids,
// and dense unions have 2 (type_ids as well as offsets).
// https://github.com/apache/arrow-rs/issues/85
DataType::Union(_, mode) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can now validate the buffer layout based on DataType!

@codecov-commenter
Copy link

codecov-commenter commented Oct 30, 2021

Codecov Report

Merging #885 (c8737a1) into master (fe106e0) will increase coverage by 0.02%.
The diff coverage is 93.69%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #885      +/-   ##
==========================================
+ Coverage   82.33%   82.35%   +0.02%     
==========================================
  Files         169      169              
  Lines       49719    49788      +69     
==========================================
+ Hits        40936    41003      +67     
- Misses       8783     8785       +2     
Impacted Files Coverage Δ
arrow/src/array/equal/mod.rs 93.13% <ø> (ø)
arrow/src/array/equal/utils.rs 74.00% <ø> (ø)
arrow/src/datatypes/datatype.rs 66.38% <0.00%> (ø)
parquet/src/arrow/arrow_writer.rs 98.07% <ø> (ø)
parquet/src/arrow/levels.rs 82.87% <0.00%> (ø)
parquet/src/arrow/schema.rs 85.56% <ø> (ø)
arrow/src/array/builder.rs 86.49% <80.00%> (-0.05%) ⬇️
arrow/src/array/array_union.rs 90.76% <86.66%> (-0.22%) ⬇️
arrow/src/array/data.rs 82.28% <96.96%> (+1.14%) ⬆️
arrow/src/array/array.rs 85.25% <100.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe106e0...c8737a1. Read the comment docs.

@github-actions github-actions bot added the arrow-flight Changes to the arrow-flight crate label Nov 8, 2021
@alamb alamb changed the title Update Union Array to match Arrow Spec, rename new -> unsafe new_unchecked() Update Union Array to add UnionMode, match latest Arrow Spec, and rename new -> unsafe new_unchecked() Nov 8, 2021
@alamb alamb added the stale-pr label Dec 20, 2021
@alamb
Copy link
Contributor Author

alamb commented Dec 20, 2021

Marking PRs that are over a month old as stale -- please let us know if there is additional work planned or if we should close them.

@alamb
Copy link
Contributor Author

alamb commented Dec 20, 2021

(I do have some idea in my head I can finish this one up)

arrow/src/array/array_union.rs Outdated Show resolved Hide resolved
@alamb alamb marked this pull request as ready for review December 20, 2021 17:05
@alamb
Copy link
Contributor Author

alamb commented Dec 20, 2021

Ok, I got the yak shaving bug and cleaned up this PR and marked it ready for review. 🙏

@alamb alamb removed the stale-pr label Dec 21, 2021
@alamb
Copy link
Contributor Author

alamb commented Dec 29, 2021

@jimexist @paddyhoran @nevi-me -- might one of you have time to review this PR? It revamps how UnionArray is supported to conform to the modern Arrow spec, and adds additional validation. I would like to get it in for the 7.0.0 release in a few weeks time

@paddyhoran
Copy link
Contributor

Sorry I've been so quiet on Arrow. I'll find some time to review in the coming days.

Copy link
Contributor

@paddyhoran paddyhoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @alamb

///
/// The `type_ids` `Buffer` should contain `i8` values. These values should be greater than
/// zero and must be less than the number of children provided in `child_arrays`. These values
/// are used to index into the `child_arrays`.
///
/// The `value_offsets` `Buffer` is only provided in the case of a dense union, sparse unions
/// should use `None`. If provided the `value_offsets` `Buffer` should contain `i32` values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you removed the period at the end of this line by mistake, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. Good catch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c8737a1

@alamb alamb merged commit 89d5273 into apache:master Jan 2, 2022
@alamb alamb deleted the alamb/union_fix branch January 2, 2022 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Dense/Sparse annotation to DataType::Union Implement Union validity bitmap changes from ARROW-9222
3 participants