Update Union Array to add `UnionMode`, match latest Arrow Spec, and rename `new` -> `unsafe new_unchecked()` #885
@@ -65,7 +65,7 @@ impl UnionArray {
     /// In both of the cases above we are accepting `Buffer`'s which are assumed to be representing
     /// `i8` and `i32` values respectively. `Buffer` objects are untyped and no attempt is made
     /// to ensure that the data provided is valid.
-    pub fn new(
+    pub unsafe fn new_unchecked(
here is the API change
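A minimal sketch of the convention this rename follows, assuming a hypothetical `TinyUnion` type rather than the real `UnionArray`: the checked constructor validates its input and returns a `Result`, while the unchecked one is marked `unsafe` and documents what the caller must guarantee. The validity rule used here (each type id lies in `0..num_children`) is an assumption made for illustration.

```rust
/// Hypothetical stand-in for a union array; NOT the arrow-rs type.
#[derive(Debug, PartialEq)]
pub struct TinyUnion {
    type_ids: Vec<i8>,
    num_children: usize,
}

impl TinyUnion {
    /// Safe constructor: checks every type id before building the value.
    pub fn try_new(type_ids: Vec<i8>, num_children: usize) -> Result<Self, String> {
        for (slot, &id) in type_ids.iter().enumerate() {
            // Assumed rule for this sketch: ids must lie in 0..num_children.
            if id < 0 || (id as usize) >= num_children {
                return Err(format!("type id {} at slot {} is out of range", id, slot));
            }
        }
        // SAFETY: every type id was validated by the loop above.
        Ok(unsafe { Self::new_unchecked(type_ids, num_children) })
    }

    /// Unchecked constructor: performs no validation at all.
    ///
    /// # Safety
    /// The caller must guarantee that each type id is in `0..num_children`.
    pub unsafe fn new_unchecked(type_ids: Vec<i8>, num_children: usize) -> Self {
        Self { type_ids, num_children }
    }

    /// Number of slots in the union.
    pub fn len(&self) -> usize {
        self.type_ids.len()
    }

    /// Number of child arrays the type ids may refer to.
    pub fn num_children(&self) -> usize {
        self.num_children
    }
}
```

With this split, callers that cannot vouch for their buffers go through `try_new`, and the `unsafe` keyword makes it visible at the call site when validation is being skipped.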
     pub fn try_new(
         type_ids: Buffer,
         value_offsets: Option<Buffer>,
         child_arrays: Vec<(Field, ArrayRef)>,
         bitmap: Option<Buffer>,
     ) -> Result<Self> {
         if let Some(b) = &value_offsets {
             let nulls = count_nulls(bitmap.as_ref(), 0, type_ids.len());
-            if ((type_ids.len() - nulls) * 4) != b.len() {
+            if ((type_ids.len()) * 4) != b.len() {
here is the core format change -- the length of `type_ids` must match the length of the values (rather than skipping nulls)
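As a rough illustration of the new check, assuming a hypothetical helper `check_offsets_len` (not part of arrow-rs) that works on plain lengths instead of `Buffer` values: the offsets buffer must hold one `i32` per slot, nulls included, so its byte length is four times the number of type ids.

```rust
/// Size in bytes of one `i32` offset entry.
const OFFSET_WIDTH: usize = std::mem::size_of::<i32>();

/// Hypothetical helper (an assumption for this sketch): checks that a dense
/// union's offsets buffer has exactly one `i32` entry per slot. Null slots
/// are counted too, matching the updated condition in the diff.
pub fn check_offsets_len(type_ids_len: usize, offsets_byte_len: usize) -> Result<(), String> {
    let expected = type_ids_len * OFFSET_WIDTH;
    if offsets_byte_len != expected {
        return Err(format!(
            "expected {} bytes of offsets for {} slots, got {}",
            expected, type_ids_len, offsets_byte_len
        ));
    }
    Ok(())
}
```

For five slots the helper accepts a 20-byte offsets buffer and rejects a 16-byte one, regardless of how many of those slots are null.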
         // Note sparse unions only have one buffer (u8) type_ids,
         // and dense unions have 2 (type_ids as well as offsets).
         // https://github.com/apache/arrow-rs/issues/85
         DataType::Union(_, mode) => {
we can now validate the buffer layout based on `DataType`!
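A small sketch of what carrying the mode in the type makes possible, using a local stand-in enum and hypothetical helper names (`expected_buffer_count`, `validate_buffer_count`) rather than the real arrow-rs validation code: once the mode is known, the expected number of buffers follows directly from the code comment above (one for sparse, two for dense).

```rust
/// Local stand-in for the union mode carried by the data type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnionMode {
    Sparse,
    Dense,
}

/// Hypothetical helper: sparse unions carry only the type ids buffer,
/// dense unions carry type ids plus offsets.
pub fn expected_buffer_count(mode: UnionMode) -> usize {
    match mode {
        UnionMode::Sparse => 1,
        UnionMode::Dense => 2,
    }
}

/// Hypothetical helper: rejects array data whose buffer count does not
/// match the layout implied by the mode.
pub fn validate_buffer_count(mode: UnionMode, actual: usize) -> Result<(), String> {
    let expected = expected_buffer_count(mode);
    if actual != expected {
        return Err(format!(
            "{:?} union expects {} buffer(s), got {}",
            mode, expected, actual
        ));
    }
    Ok(())
}
```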
Codecov Report
@@            Coverage Diff             @@
##           master     #885      +/-   ##
==========================================
+ Coverage   82.33%   82.35%   +0.02%
==========================================
  Files         169      169
  Lines       49719    49788      +69
==========================================
+ Hits        40936    41003      +67
- Misses       8783     8785       +2
Marking PRs that are over a month old as stale -- please let us know if there is additional work planned or if we should close them.
(I do have some idea in my head I can finish this one up)
Ok, I got the yak shaving bug and cleaned up this PR and marked it ready for review. 🙏
@jimexist @paddyhoran @nevi-me -- might one of you have time to review this PR? It revamps how `UnionArray` is supported to conform to the modern Arrow spec, and adds additional validation. I would like to get it in for the 7.0.0 release in a few weeks' time.
Sorry I've been so quiet on Arrow. I'll find some time to review in the coming days.
LGTM, thanks @alamb
     ///
     /// The `type_ids` `Buffer` should contain `i8` values. These values should be greater than
     /// zero and must be less than the number of children provided in `child_arrays`. These values
     /// are used to index into the `child_arrays`.
     ///
     /// The `value_offsets` `Buffer` is only provided in the case of a dense union, sparse unions
     /// should use `None`. If provided the `value_offsets` `Buffer` should contain `i32` values.
I think you removed the period at the end of this line by mistake, right?
Yes, I think so. Good catch.
Fixed in c8737a1
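To make the roles of `type_ids` and `value_offsets` in the doc comment under review concrete, here is a sketch that uses plain slices instead of `Buffer` and a hypothetical function `resolve_slot` (not an arrow-rs API). It follows the Arrow union layout: the type id selects the child, and the position within that child is the stored offset for a dense union or the slot index itself for a sparse union.

```rust
/// Hypothetical helper: maps a logical slot to `(child index, index within child)`.
/// `value_offsets` is `Some` for a dense union and `None` for a sparse one.
/// Returns `None` if the slot is out of bounds or a stored value is negative.
pub fn resolve_slot(
    type_ids: &[i8],
    value_offsets: Option<&[i32]>,
    slot: usize,
) -> Option<(usize, usize)> {
    let type_id = *type_ids.get(slot)?;
    if type_id < 0 {
        return None;
    }
    let index = match value_offsets {
        // Dense: the offsets buffer says where the value lives in the child.
        Some(offsets) => {
            let offset = *offsets.get(slot)?;
            if offset < 0 {
                return None;
            }
            offset as usize
        }
        // Sparse: every child is as long as the union, so the slot is the index.
        None => slot,
    };
    Some((type_id as usize, index))
}
```

For example, with `type_ids = [0, 1, 0, 1]` and dense offsets `[0, 0, 1, 1]`, slot 2 resolves to the second value of child 0; without offsets (sparse), the same slot resolves to index 2 of child 0.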
(Built on #810, so to see the changes proposed in this PR just look at the last commit)

Which issue does this PR close?
Closes #814
Closes #85
Related to #817

Rationale for this change
Follow the Arrow spec for `UnionArray`, make it possible to validate with generic code added in #810, and conform to Rust safety conventions.
See https://arrow.apache.org/docs/format/Columnar.html#union-layout for a description of the Union layout.

What changes are included in this PR?
- Rename `UnionArray::new` to `unsafe UnionArray::new_unchecked()` to follow standard Rust safety conventions
- Add `UnionMode` to `DataType::Union` so the type carries information on expected format

Are there any user-facing changes?
- `UnionArray::new` has been renamed to `unsafe UnionArray::new_unchecked()`
- `Dense` or `Sparse` must be specified when creating a `DataType::Union`
- The `UnionArray` format is different