[C++] Incorrect hash value for Scalars with sliced child data (ignores offset) #35360
Comments
Thanks for the report! I can confirm this on the last development version as well, and looking into our Lines 154 to 165 in 05a61d6
First, this will just loop through the child data and hash those. But I think this doesn't take into account the correct length / offset in case your array is sliced (in which case the child data corresponds to the full un-sliced data). For example, for a StructArray, getting a field does not just access the child_data, but slices the child data: arrow/cpp/src/arrow/array/array_nested.cc Lines 584 to 598 in 05a61d6
Now, in addition you can also see from the first snippet we actually only check the length and null count of an array, and not the values inside it ... So that means we actually ignore the content and give the same hash for different scalars:
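Both problems can be illustrated with a toy model in plain Python. This is only a sketch, not Arrow's code: `ToyArray`, `naive_hash`, `shallow_hash` and `logical_hash` are hypothetical names, and a Python list stands in for the shared child buffer.

```python
from dataclasses import dataclass


@dataclass
class ToyArray:
    """Toy stand-in for an Arrow array: an (offset, length) view over shared values."""
    values: list      # full, un-sliced backing data
    offset: int = 0
    length: int = -1  # -1 means "everything from offset to the end"

    def __post_init__(self):
        if self.length < 0:
            self.length = len(self.values) - self.offset

    def logical(self):
        # The values this array actually represents.
        return self.values[self.offset:self.offset + self.length]


def naive_hash(arr):
    # Ignores offset/length: hashes the whole backing storage.
    return hash(tuple(arr.values))


def shallow_hash(arr):
    # Looks only at length and null count, never at the values themselves.
    return hash((arr.length, arr.logical().count(None)))


def logical_hash(arr):
    # Hashes exactly the logical (sliced) contents.
    return hash(tuple(arr.logical()))


sliced = ToyArray([5, 6, 7, None], offset=2, length=2)  # view of [7, None]
fresh = ToyArray([7, None])                             # same logical values
other = ToyArray([8, None])                             # different values

print(sliced.logical() == fresh.logical())          # True: the scalars are equal
print(naive_hash(sliced) == naive_hash(fresh))      # False: offset was ignored
print(shallow_hash(fresh) == shallow_hash(other))   # True: contents were ignored
print(logical_hash(sliced) == logical_hash(fresh))  # True
```

Only the first failure breaks the hash contract; the second one is legal but makes the hash a poor discriminator.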
@felipecrv @benibus Does one of you want to take a look here?
Currently investigating this. Fix won't be trivial. For instance, hashing the validity bitmap while considering the array offset requires some ingenuity to be made efficient as the offset doesn't always point to a byte-aligned bit.
Is it fair to say this can only be made efficient by having some kind of Rolling Hash on the stream of bits?
The bug description is convoluted so here is a simpler reproducer:
>>> a = pa.array([[{'a': 5}, {'a': 6}], [{'a': 7}, None]])
>>> b = pa.array([[{'a': 7}, None]])
>>> a[1]
<pyarrow.ListScalar: [{'a': 7}, None]>
>>> b[0]
<pyarrow.ListScalar: [{'a': 7}, None]>
>>> a[1] == b[0]
True
>>> hash(a[1]) == hash(b[0])
False
@felipecrv Let's not overdo this. We don't need to hash everything, and the null bitmap can be ignored if it makes things simpler.
Hashing anything other than the validity bitmap buffer can be super tricky, as the null values can be anything within the value buffers. It would also require inference of bit widths for each type, so I kept the hashing simple by hashing only validity buffers, as it does now. The part where I might have overdone things a bit is that I figured out a way to make a hash function for bitmaps that can consider offsets. It didn't have to be a rolling hash; it only involved shifts and rotations before data is fed into the hash mixing, plus careful handling of leading/trailing bits. I wrote very comprehensive tests for it.
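To make the idea of an offset-aware bitmap hash concrete, here is a minimal Python sketch. It is not the C++ implementation from the fix: it takes the simplest route of shifting the requested bit range down to bit 0 and masking the trailing bits before hashing. The helpers `get_bit`, `realign` and `hash_bitmap` are hypothetical names, `blake2b` is just a stand-in mixer, and the caller is assumed to pass a range that lies inside the bitmap.

```python
import hashlib


def get_bit(bitmap: bytes, i: int) -> bool:
    # Arrow validity bitmaps are LSB-first within each byte.
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


def realign(bitmap: bytes, offset: int, length: int) -> bytes:
    """Copy `length` bits starting at bit `offset` into a fresh bitmap at bit 0."""
    start_byte, shift = divmod(offset, 8)
    nbytes = (length + 7) // 8
    out = bytearray(nbytes)
    for k in range(nbytes):
        lo = bitmap[start_byte + k] >> shift
        # Borrow the missing high bits from the next byte when unaligned.
        nxt = start_byte + k + 1
        hi = (bitmap[nxt] << (8 - shift)) & 0xFF if shift and nxt < len(bitmap) else 0
        out[k] = lo | hi
    # Clear trailing bits past `length` so garbage there cannot affect the hash.
    tail = length % 8
    if tail:
        out[-1] &= (1 << tail) - 1
    return bytes(out)


def hash_bitmap(bitmap: bytes, offset: int, length: int) -> int:
    h = hashlib.blake2b(digest_size=8)
    h.update(length.to_bytes(8, "little"))  # distinguish e.g. 2 bits from 3 bits
    h.update(realign(bitmap, offset, length))
    return int.from_bytes(h.digest(), "little")


# Validity of [valid, valid, valid, null], viewed from bit 2 for 2 bits,
# versus a fresh bitmap for [valid, null].
full = bytes([0b00000111])
fresh = bytes([0b00000001])
print(hash_bitmap(full, 2, 2) == hash_bitmap(fresh, 0, 2))  # True
```

This version allocates a temporary copy of the bitmap; feeding the shifted words straight into the hash mixing, as the comment above describes, avoids that copy.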
…() (#35814)

### Rationale for this change

A fix for #35360.

### What changes are included in this PR?

- [x] A hash function that can hash bitmaps
- [x] The fix for hashes of equal scalars sometimes not being equal because of offsets

### Are these changes tested?

Yes. By unit tests.

### Are there any user-facing changes?

No.

* Closes: #35360

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Summary
When a pyarrow ListArray or FixedSizeListArray has a struct type, it is possible to run into a condition where two equal scalars have different hash values. This violates the contract for the Python hash function, which states: "The only required property is that objects which compare equal have the same hash value" (https://docs.python.org/3/reference/datamodel.html#object.__hash__).
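The practical consequence of breaking that contract is that sets and dict keys stop treating equal objects as the same entry. A small self-contained sketch, using a hypothetical `BrokenScalar` class rather than pyarrow itself:

```python
class BrokenScalar:
    """Hypothetical value type whose __hash__ disagrees with __eq__."""

    def __init__(self, value, storage_id):
        self.value = value            # what equality compares
        self.storage_id = storage_id  # irrelevant detail that leaks into the hash

    def __eq__(self, other):
        return isinstance(other, BrokenScalar) and self.value == other.value

    def __hash__(self):
        return hash((self.value, self.storage_id))


x = BrokenScalar(7, storage_id=1)
y = BrokenScalar(7, storage_id=2)

print(x == y)              # True
print(hash(x) == hash(y))  # False
print(y in {x})            # False, although x == y
print(len({x, y}))         # 2, although the two objects are equal
```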
Below is the smallest reproducible example that demonstrates this issue. This example is for FixedSizeListArray but it affects ListArray too.
Environment
Windows 10
python=3.11.2
pyarrow=11.0.0
Details
Now we have two equal arrays, where the first element is valid (not-null).
Equality check for the first element passes
But their hash values are different
Component(s)
Python