Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Null Mask Handling in ArrayData, UnionArray, and MapArray #1589

Merged
merged 5 commits into from
Apr 28, 2022

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Apr 19, 2022

Which issue does this PR close?

Closes #1599
Closes #1598
Closes #1596
Closes #1590
Closes #1591
Closes #626

Rationale for this change

See tickets, unfortunately these changes all ended up interacting with each other and so I've lumped them into a single PR.

What changes are included in this PR?

ArrayData Equality

The currently ArrayData equality code attempts to construct a logical bitmask, that propogates the nullability of the parent down to its children. Certain arrays, however, may have a different number of children than their parent, e.g. dense union array or list arrays. This requires recomputing the mask to factor this in. In some cases, this was computing a mask that took account of the offset of the child, but in others it was not. I also don't think any of the codepaths took account for if the child also contained an offset.

Taking a step back, none of the codepaths are actually comparing runs of nulls, instead they fall back to iterating per value if the array contains nulls. For example,

 // get a ref of the null buffer bytes, to use in testing for nullness
let lhs_null_bytes = lhs.null_buffer().as_ref().unwrap().as_slice();
let rhs_null_bytes = rhs.null_buffer().as_ref().unwrap().as_slice();
(0..len).all(|i| {
    let lhs_pos = lhs_start + i;
    let rhs_pos = rhs_start + i;

    let lhs_is_null = !get_bit(lhs_null_bytes, lhs_pos + lhs.offset());
    let rhs_is_null = !get_bit(rhs_null_bytes, rhs_pos + rhs.offset());

    lhs_is_null
        || (lhs_is_null == rhs_is_null)
            && equal_range(
                lhs_values,
                rhs_values,
                lhs_keys[lhs_pos].to_usize().unwrap(),
                rhs_keys[rhs_pos].to_usize().unwrap(),
                1,
            )
})

As a result we don't actually need this complexity, and by removing it we avoid ambiguity about what offset the null mask has, what the null mask is, etc...

UnionArray Nullability

Whilst working on this I realised that UnionArray's don't actually have a parent validity mask, as this change would have implications for the equality logic I lumped it into this PR.

Are there any user-facing changes?

Array equality should be more correct, and UnionArrays slightly less broken

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 19, 2022

// Calculate a list child's logical bitmap/buffer
#[inline]
fn logical_list_bitmap<OffsetSize: OffsetSizeTrait>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer need this code, as we simply skip over null slices

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you mean we skip over null slices?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only call equal_ranges on the non null ranges

@@ -282,8 +282,7 @@ fn equal_range(
rhs_start: usize,
len: usize,
) -> bool {
utils::base_equal(lhs, rhs)
Copy link
Contributor Author

@tustvold tustvold Apr 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utils::base_equal does two things:

  1. Compares that the data types are equal
  2. Compares that the lengths are equal

The first is not necessarily correct, as MapArrays are agnostic to field names. The second is also not necessarily correct in the presence of non-empty null slices. I therefore decided to just remove this and simplify the code.

Ultimately it isn't clear to me what this check could catch that wouldn't represent some violation of a variant somewhere else. If a nested type is inconsistent with the length, type, etc... of its parent then you've got UB anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually equal already does utils::base_equal. I guess it is okay to remove it from here, although I'm not pretty sure.

@@ -47,8 +47,13 @@ pub(super) fn equal_nulls(
rhs_start: usize,
len: usize,
) -> bool {
let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len);
let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len);
let lhs_null_count = count_nulls(lhs_nulls, lhs_start + lhs.offset(), len);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equal_bits a few lines is correctly taking into account the offset, but this line wasn't => excitement

@@ -137,16 +142,6 @@ pub(super) fn child_logical_null_buffer(
Bitmap::from(Buffer::from(vec![0b11111111; ceil]))
});
match parent_data.data_type() {
DataType::List(_) | DataType::Map(_, _) => Some(logical_list_bitmap::<i32>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed, as null slices are skipped over in parent

@tustvold tustvold marked this pull request as draft April 19, 2022 21:49
@tustvold
Copy link
Contributor Author

tustvold commented Apr 19, 2022

More stuff is broken 😭

Various UnionArray fixes (apache#1598) (apache#1596) (apache#1591) (apache#1590)

Fix handling of null masks in ArrayData equality (apache#1599)
@tustvold tustvold changed the title Fix ListArray and StructArray equality (#626) Fix Null Mask Handling in ArrayData Equality And UnionArray Apr 20, 2022
@tustvold tustvold changed the title Fix Null Mask Handling in ArrayData Equality And UnionArray Fix Null Mask Handling in ArrayData And UnionArray Apr 20, 2022
@tustvold tustvold marked this pull request as ready for review April 20, 2022 10:54
@tustvold
Copy link
Contributor Author

I will continue to add tests, and file bugs for the various things this fixes, but I think the core is ready for review now

@codecov-commenter
Copy link

codecov-commenter commented Apr 20, 2022

Codecov Report

Merging #1589 (f67b650) into master (786792c) will increase coverage by 0.16%.
The diff coverage is 88.65%.

@@            Coverage Diff             @@
##           master    #1589      +/-   ##
==========================================
+ Coverage   82.88%   83.04%   +0.16%     
==========================================
  Files         193      193              
  Lines       55307    55276      -31     
==========================================
+ Hits        45839    45902      +63     
+ Misses       9468     9374      -94     
Impacted Files Coverage Δ
arrow/src/ipc/writer.rs 80.56% <ø> (-1.25%) ⬇️
arrow/src/array/data.rs 83.05% <50.00%> (-0.20%) ⬇️
arrow/src/array/equal/mod.rs 96.09% <70.76%> (+2.10%) ⬆️
arrow/src/array/transform/union.rs 83.33% <78.94%> (+8.33%) ⬆️
arrow/src/array/equal/structure.rs 95.00% <90.00%> (-5.00%) ⬇️
arrow/src/array/builder.rs 86.79% <94.11%> (+0.10%) ⬆️
arrow/src/array/equal/list.rs 95.77% <95.55%> (+16.39%) ⬆️
arrow/src/array/array_union.rs 90.71% <96.55%> (ø)
arrow/src/array/equal/boolean.rs 97.14% <100.00%> (ø)
arrow/src/array/equal/decimal.rs 100.00% <100.00%> (ø)
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 786792c...f67b650. Read the comment docs.

@viirya
Copy link
Member

viirya commented Apr 20, 2022

Thank you @tustvold for fixing these. I will begin to review this today.

@@ -522,29 +516,29 @@ mod tests {
match i {
0 => {
let slot = slot.as_any().downcast_ref::<Int32Array>().unwrap();
assert!(!union.is_null(i));
assert!(!slot.is_null(0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is good to check null in the current slot, but does union.is_null(index) not work anymore? Can we check both?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nullability is only a property of the slot, and not the Union itself, at least that is my interpretation of the specification

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://arrow.apache.org/docs/format/Columnar.html#union-layout

Says

Unlike other data types, unions do not have their own validity bitmap. Instead, the nullness of each slot is determined exclusively by the child arrays which are composed to create the union.

Which seems consistent 👍

However, it seems like UnionArray doesn't override is_null to take this into account. Filed #1625 to track

let mut builder = UnionBuilder::new_sparse(2);
builder.append::<Float32Type>("a", 1.0).unwrap();
let err = builder.append::<Int32Type>("a", 1).unwrap_err().to_string();
assert!(err.contains("Attempt to write col \"a\" with type Int32 doesn't match existing type Float32"), "{}", err);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 1897 to 1898
/// A builder for the bitmap if required (for Sparse Unions)
bitmap_builder: Option<BooleanBufferBuilder>,
bitmap_builder: BooleanBufferBuilder,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment seems out-of-date now?

Comment on lines 2156 to 2157
.null_bit_buffer(bitmap_builder.finish());
// .build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// .build(); seems redundant and can be removed?

Comment on lines -2188 to -2193
match bitmap_builder {
Some(mut bb) => arr_data_builder
.null_bit_buffer(bb.finish())
.build_unchecked(),
None => arr_data_builder.build_unchecked(),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more clear now.

Comment on lines +1324 to +1325
/// Can contain a null bitmask
pub can_contain_null_mask: bool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the C++ structure definition doesn't have this, I think it is fine to have it here.

Comment on lines +33 to +34
let lhs_null_count = count_nulls(lhs.null_buffer(), lhs_start + lhs.offset(), len);
let rhs_null_count = count_nulls(rhs.null_buffer(), rhs_start + rhs.offset(), len);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, do we assume offset is zero previously?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in lots of places 😅

@@ -282,8 +282,7 @@ fn equal_range(
rhs_start: usize,
len: usize,
) -> bool {
utils::base_equal(lhs, rhs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually equal already does utils::base_equal. I guess it is okay to remove it from here, although I'm not pretty sure.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me and makes some places more clear and simpler. It is better to have more eyes on the changes too.

@alamb
Copy link
Contributor

alamb commented Apr 21, 2022

I will try and review this more carefully tomorrow. cc @jhorstmann who might also be interested

@jhorstmann
Copy link
Contributor

Looks good. This is some really gnarly code and last time I tried to improve it I gave up since it was never clear whether the start variable already include offsets or where the offsets have to be applied.

I have two tests in #1499 where assert_eq was not working, the one comparing list arrays works with these changes, I'm currently looking into the other one comparing two struct arrays.

@jhorstmann
Copy link
Contributor

After the suggestion above, my remaining test failure was because of the nullable flag on StructArray fields and not directly related to the equality comparison. Whether that flag needs to be equal for two struct or map arrays to be considered equal could be discussed, but it seems also some kernels like nullif would need to update the flag to be consistent.

@tustvold
Copy link
Contributor Author

tustvold commented Apr 23, 2022

Whether that flag needs to be equal for two struct or map arrays to be considered equal could be discussed

Yes I have found the handling of this field to be very inconsistent, imo it shouldn't be possible to associate a null bitmask with a field that says it isn't nullable. Would you be able to create a ticket?

I have two tests in #1499 where assert_eq was not working

I presume you mean that the previous behaviour was wrong, and is now fixed, and not that this has broken behaviour that was previously correct?

Whether that flag needs to be equal for two struct or map arrays to be considered equal could be discussed

I think the logic here will consider them not equal, are you suggesting they should be considered equal?

@jhorstmann
Copy link
Contributor

Yes I have found the handling of this field to be very inconsistent, imo it shouldn't be possible to associate a null bitmask with a field that says it isn't nullable. Would you be able to create a ticket?

I created #1611 for validating the nullable flag on StructArray. Not sure how good my description there is, feel free to add more information or other ideas.

I presume you mean that the previous behaviour was wrong, and is now fixed, and not that this has broken behaviour that was previously correct?

Sorry, my description probably was not very clear. In one test I had to compare individual elements of a ListArray because the previous equality implementation did not work with different offsets. With the changes from this PR that test can now use a simple assert_eq.

The other test with StructArray now also works with these changes and assert_eq, the other missing piece was explicitly creating the array with nullable fields.

Whether we should change the comparision semantic to exclude the nullable flag is probably out of scope for this PR, I don't have a strong opinion either way.

A big "Thank You" for diving into this code and fixing it!

@alamb alamb added the api-change Changes to the arrow API label Apr 28, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this code a thorough review -- I am not the biggest expert in the low level details of what is going on internally but the changes made sense to me.

I also carefully checked all test changes and they looked good, so assuming the tests have good coverage I feel confident in this change.

Of course, our work to work to clean up nullness is likely not done with this PR but it I think things are better.

My only concerns are:

  1. the change to append_null that requires a field name (as that is a public facing API). However, we can always change it in the future I suppose.

  2. UnionArray::is_null incorrect #1625 seems non bad -- I think we could remove several changes in this PR (so it could call union.is_null() rather than slot.is_null() if we fixed that

Thank you @tustvold @viirya and @jhorstmann for driving this

cc @bjchambers who I think also has had much fun with offsets and @nevi-me who spent a bunch of time with this code I think

@@ -522,29 +516,29 @@ mod tests {
match i {
0 => {
let slot = slot.as_any().downcast_ref::<Int32Array>().unwrap();
assert!(!union.is_null(i));
assert!(!slot.is_null(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://arrow.apache.org/docs/format/Columnar.html#union-layout

Says

Unlike other data types, unions do not have their own validity bitmap. Instead, the nullness of each slot is determined exclusively by the child arrays which are composed to create the union.

Which seems consistent 👍

However, it seems like UnionArray doesn't override is_null to take this into account. Filed #1625 to track

@@ -1894,23 +1894,19 @@ struct FieldData {
values_buffer: Option<MutableBuffer>,
/// The number of array slots represented by the buffer
slots: usize,
/// A builder for the bitmap if required (for Sparse Unions)
bitmap_builder: Option<BooleanBufferBuilder>,
/// A builder for the null bitmap
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

};
self.len += 1;
Ok(())
pub fn append_null<T: ArrowPrimitiveType>(&mut self, type_name: &str) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize from an implementation perspective why you chose to require appending a null by field name, but it seems kind of a messy API -- what do you think about potentially appending a null to the first child or something by default?

Perhaps something to think about for a follow on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short answer is that UnionArray nulls are typed and so two arrays are not equal if they have nulls in different child arrays.

There is also the case of appending a null to an empty UnionArray...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense -- tried to encode that insight in comments as part of #1628

@@ -560,7 +554,7 @@ mod tests {
builder.append::<Int32Type>("a", 1).unwrap();
builder.append::<Int64Type>("c", 3).unwrap();
builder.append::<Int32Type>("a", 10).unwrap();
builder.append_null().unwrap();
builder.append_null::<Int32Type>("a").unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is strange to require a field name to append nulls -- I left a comment on how to improve things perhaps in the future

@@ -621,6 +621,13 @@ impl ArrayData {
// Check that the data layout conforms to the spec
let layout = layout(&self.data_type);

if !layout.can_contain_null_mask && self.null_bitmap.is_some() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -39,8 +36,8 @@ pub(super) fn fixed_binary_equal(
let lhs_values = &lhs.buffers()[0].as_slice()[lhs.offset() * size..];
let rhs_values = &rhs.buffers()[0].as_slice()[rhs.offset() * size..];

let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len);
let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len);
let lhs_null_count = count_nulls(lhs.null_buffer(), lhs_start + lhs.offset(), len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it makes a lot more sense to look at the buffer's nulls directly

builder.append::<Int32Type>("A", 34).unwrap();
let array = builder.build().unwrap();

let filter_array = BooleanArray::from(vec![true, false, true, false]);
let c = filter(&array, &filter_array).unwrap();
let filtered = c.as_any().downcast_ref::<UnionArray>().unwrap();

let mut builder = UnionBuilder::new_dense(1);
let mut builder = UnionBuilder::new_sparse(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was changed because the name of the test is test_filter_union_array_sparse_with_nulls but we were using a dense array previously

@alamb
Copy link
Contributor

alamb commented Apr 28, 2022

After some thought, I would like to merge this PR to handle all the undefined behavior bugs

While there are more potential follow on things to do to improve union and null handling, this seems like a significant improvement.

Unless there are any objections, I will plan to merge it later today

@alamb
Copy link
Contributor

alamb commented Apr 28, 2022

🚀

@alamb alamb merged commit 37085d2 into apache:master Apr 28, 2022
@alamb alamb changed the title Fix Null Mask Handling in ArrayData And UnionArray Fix Null Mask Handling in ArrayData, UnionArray, and MapArray Apr 28, 2022
alamb added a commit to alamb/arrow-rs that referenced this pull request Apr 28, 2022
alamb added a commit that referenced this pull request Apr 29, 2022
* Update version to 13.0.0

* Draft changelog

* updates

* moar

* cleanup

* remove duplicated entry, update for #1589

* Final updates
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
5 participants