Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ArrayEqual for UnionArray #1469

Merged
merged 8 commits into from Mar 31, 2022
Merged

Implement ArrayEqual for UnionArray #1469

merged 8 commits into from Mar 31, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented Mar 21, 2022

Which issue does this PR close?

Closes #67.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 21, 2022
@viirya viirya changed the title Issue 67 Implement ArrayEqual for UnionArray Mar 21, 2022
@codecov-commenter
Copy link

codecov-commenter commented Mar 21, 2022

Codecov Report

Merging #1469 (c4ee09a) into master (c5442cf) will increase coverage by 0.06%.
The diff coverage is 92.95%.

@@            Coverage Diff             @@
##           master    #1469      +/-   ##
==========================================
+ Coverage   82.68%   82.75%   +0.06%     
==========================================
  Files         188      190       +2     
  Lines       54361    54702     +341     
==========================================
+ Hits        44951    45270     +319     
- Misses       9410     9432      +22     
Impacted Files Coverage Δ
arrow/src/array/builder.rs 86.68% <50.00%> (-0.05%) ⬇️
arrow/src/array/equal/utils.rs 74.45% <50.00%> (-3.24%) ⬇️
arrow/src/array/equal/mod.rs 93.98% <98.78%> (+0.65%) ⬆️
arrow/src/array/equal/union.rs 100.00% <100.00%> (ø)
arrow/src/datatypes/datatype.rs 66.40% <0.00%> (-0.40%) ⬇️
arrow/src/datatypes/field.rs 53.79% <0.00%> (-0.31%) ⬇️
arrow/src/array/transform/mod.rs 86.46% <0.00%> (-0.12%) ⬇️
arrow/src/compute/kernels/filter.rs 88.00% <0.00%> (ø)
arrow/src/compute/kernels/length.rs 100.00% <0.00%> (ø)
arrow/src/ffi_stream.rs 79.03% <0.00%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c5442cf...c4ee09a. Read the comment docs.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good, I've left some comments on how I think it could be made a lot faster which you can take or leave

};

// Checks if corresponding slots in two UnionArrays are same data types
fn equal_types(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to be equal the two arrays must have the same schema and therefore lhs_fields == rhs_fields I think therefore this could simply just compare the type_id arrays for equality directly?

Copy link
Member Author

@viirya viirya Mar 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, even type_ids are different, isn't it still possible that the union arrays are equal?

For example, the fields are int, int, float, string. If lhs type_ids is 1, but rhs type_id is 0, they are both int. We don't know if the actual element at child arrays are equal or not.

So here I compare if the type_ids are the same type or not, not its values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And two union arrays could be equal even if two fields/type_ids are different. For example, we can change the field position and type_id, but the resulting union array is still the same.

Copy link
Contributor

@tustvold tustvold Mar 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of two fields with the same type, I believe these must still have distinct field names and therefore there is a logical difference between values stored in one vs the other. I'm not sure about this though, possibly worth an upstream clarification 🤔

On the case of different field ordering the arrays can never compare equal as they have different data types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equal_types was removed now.

}

#[allow(clippy::too_many_arguments)]
fn equal_sparse(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the time we are calling this method we have established that

  • the types are equal
  • the null buffers are equal
  • the type ids arrays are equal

As such I'm surprised this can't just do a standard equality comparison of the child data, which will take into null buffers of the children

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to above question. As field and type_id can be changed in position, but the resulting union array is still the same, we cannot directly compare them, but can only compare child array corresponding to type_id.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it really does come down to "what does it mean for two UnionArrays to be equal"?

Can it ever be true that UnionArray(Int, Utf8) be equal to UnionArray(Utf8, Int)?

Given my reading of the cpp ArrayEquals code in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc it seems to me that two arrays are only ever "equal" if their datatypes are (exactly) equal.

Maybe we could follow the cpp implementation -- both for consistency as well as for simplicity?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, if it is the case, we can simplify the comparison here. What I have in mind and implemented here can be thought as logically equal. That means corresponding slots of two union arrays are equal, but their physical layout may be different -- type_ids and fields can be different separately.

If we only want to compare the physical layout, then fields and type_ids are required to be exactly equal.

Thanks for pointing out the cpp implementation. We should follow it, I agree.

I will revise this change for that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equal_sparse was simplified to compare corresponding child_data at given positions.

lhs_slots_types.zip(rhs_slots_types).all(|(l, r)| l == r)
}

fn equal_dense(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much like equal_sparse below, by the time we are calling this method we have already established equality of the type_ids, I therefore wonder if we can simply compute equality of the offsets arrays, followed by computing equality of the underlying child arrays.

This appears to be similar to what is done for list_equal 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even type_ids equality is established, lhs and rhs start positions might be different. So we cannot simply compare offsets arrays and child arrays for dense case.

lhs.data_type() == rhs.data_type() && lhs.len() == rhs.len()
let equal_type = match (lhs.data_type(), rhs.data_type()) {
(DataType::Union(l_fields, l_mode), DataType::Union(r_fields, r_mode)) => {
// Defer field datatype check to `union_equal`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because even the lhs fields and rhs fields are different, the two union arrays are still possible equal. We cannot use fields to decide if two union arrays are equal or not. Fields and type_ids are co-related.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec describes a union as an ordered sequence of types with names. My interpretation of this is that the fields are order-sensitive and must be identical for the types to be the same...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised to compare lhs and rhs fields here.

DataType::Union(_, mode) => {
match mode {
UnionMode::Sparse => {
// See the logic of `DataType::Struct` in `child_logical_null_buffer`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this could be broken out into a function? I also wonder if it might be possible to implement more efficiently by truncating the longer buffer 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it as a function now.

@alamb
Copy link
Contributor

alamb commented Mar 27, 2022

I will try and review this PR more carefully tomorrow

@viirya
Copy link
Member Author

viirya commented Mar 27, 2022

Seems I missed some comments, I will address them soon. Thanks.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice step forward to me. Thank you @viirya

@alamb alamb merged commit 0575a94 into apache:master Mar 31, 2022
@viirya
Copy link
Member Author

viirya commented Mar 31, 2022

Thanks @alamb @tustvold

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement ArrayEqual for UnionArray
4 participants