
[C++] Incorrect hash value for Scalars with sliced child data (ignores offset) #35360

Closed
mosalx opened this issue Apr 27, 2023 · 7 comments · Fixed by #35814
Comments

mosalx commented Apr 27, 2023

Summary

When a pyarrow ListArray or FixedSizeListArray has a struct value type, two equal scalars can end up with different hash values. This violates the contract for Python's __hash__, which states: "The only required property is that objects which compare equal have the same hash value"
https://docs.python.org/3/reference/datamodel.html#object.__hash__

Below is the smallest reproducible example that demonstrates this issue. The example uses FixedSizeListArray, but ListArray is affected too.

Environment

Windows 10
python=3.11.2
pyarrow=11.0.0

Details

import pyarrow as pa

# initial array
_type = pa.list_(pa.struct([('a', pa.int32())]), list_size=1)
array = pa.array([[{'a': 1}], [{'a': 1}], [{'a': 1}], None], type=_type)

# make a deep copy of the last two elements. This involves copying all array buffers
# and truncating the unused bytes implied by the array offset. For simplicity, not all
# buffers are copied here (the `field` array is reused as-is) to keep the example small
chunk = array[2:]
child = chunk.values[chunk.offset:]  # StructArray
field = child.field('a')  # Int32Array

# create a copy of `child`. Validity buffer could be set to None, it would not change the outcome
validity_buffer_child = pa.array([True, False, False, False, False, False, False, False]).buffers()[1]
child_copy = type(child).from_buffers(child.type, length=len(child), 
                                      buffers=[validity_buffer_child], 
                                      children=[field])
assert child_copy.equals(child)

# create a copy of `chunk` using `child_copy` made above
validity_buffer_chunk = pa.array([True, False, False, False, False, False, False, False]).buffers()[1]
chunk_copy = pa.FixedSizeListArray.from_buffers(type=chunk.type, length=len(chunk), 
                                                buffers=[validity_buffer_chunk], 
                                                children=[child_copy])
assert chunk_copy.equals(chunk)

Now we have two equal arrays whose first element is valid (non-null).
The equality check for the first element passes:

assert chunk_copy[0] == chunk[0]  # Ok

But their hash values are different

assert hash(chunk_copy[0]) == hash(chunk[0])  # AssertionError

Component(s)

Python

jorisvandenbossche changed the title from "Equal scalars may have different hash values" to "[C++] Incorrect hash value for Scalars with sliced child data (ignores offset)" on Apr 28, 2023
jorisvandenbossche (Member) commented:

Thanks for the report!

I can confirm this on the last development version as well, and looking into our Scalar::Hash implementation, it seems that when the scalar is backed by an array (as is the case for a ListArray), our hashing implementation is a bit too simple:

Status ArrayHash(const ArrayData& a) {
  RETURN_NOT_OK(StdHash(a.length) & StdHash(a.GetNullCount()));
  if (a.buffers[0] != nullptr) {
    // We can't visit values without unboxing the whole array, so only hash
    // the null bitmap for now.
    RETURN_NOT_OK(BufferHash(*a.buffers[0]));
  }
  for (const auto& child : a.child_data) {
    RETURN_NOT_OK(ArrayHash(*child));
  }
  return Status::OK();
}

First, this just loops through the child data and hashes those. But I think this doesn't take the correct length / offset into account when your array is sliced (in which case the child data correspond to the full, un-sliced data). For example, for a StructArray, getting a field does not just access the child_data, but slices it:

const std::shared_ptr<Array>& StructArray::field(int i) const {
  std::shared_ptr<Array> result = std::atomic_load(&boxed_fields_[i]);
  if (!result) {
    std::shared_ptr<ArrayData> field_data;
    if (data_->offset != 0 || data_->child_data[i]->length != data_->length) {
      field_data = data_->child_data[i]->Slice(data_->offset, data_->length);
    } else {
      field_data = data_->child_data[i];
    }
    std::shared_ptr<Array> result = MakeArray(field_data);
    std::atomic_store(&boxed_fields_[i], result);
    return boxed_fields_[i];
  }
  return boxed_fields_[i];
}

In addition, you can see from the first snippet that we only hash the length, null count, and null bitmap of an array, and not the values inside it. So we actually ignore the content and give the same hash for different scalars:

In [69]: hash(pa.scalar([{'a': 1}]))
Out[69]: -285312971393311483

In [70]: hash(pa.scalar([{'a': 2}]))
Out[70]: -285312971393311483
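The two flaws above (offsets ignored, values ignored) can be made concrete with a rough Python transliteration of ArrayHash. Everything here — FakeArrayData, array_hash, the xor-combining scheme — is my own illustrative stand-in, not Arrow's actual code:

```python
from typing import List, Optional


class FakeArrayData:
    """Toy stand-in for Arrow's ArrayData (illustrative only)."""

    def __init__(self, length: int, null_count: int,
                 validity: Optional[bytes], offset: int = 0,
                 child_data: Optional[List["FakeArrayData"]] = None):
        self.length = length
        self.null_count = null_count
        self.validity = validity          # raw, un-sliced buffer
        self.offset = offset              # never consulted by array_hash below!
        self.child_data = child_data or []


def array_hash(a: FakeArrayData) -> int:
    # Mirrors the flawed C++ ArrayHash: hash length, null count, and the
    # raw validity buffer, then recurse into children -- the offset (and
    # any value buffer) is ignored entirely.
    h = hash((a.length, a.null_count))
    if a.validity is not None:
        h ^= hash(a.validity)
    for child in a.child_data:
        h ^= array_hash(child)
    return h


# Two arrays representing the same logical values; one child buffer carries
# an extra leading byte hidden behind an offset of 8 bits:
full = FakeArrayData(2, 0, None,
                     child_data=[FakeArrayData(2, 0, b"\x03")])
sliced = FakeArrayData(2, 0, None,
                       child_data=[FakeArrayData(2, 0, b"\xff\x03", offset=8)])
assert array_hash(full) != array_hash(sliced)  # equal data, different hashes
```

The xor of the raw child buffers differs whenever the leading (logically invisible) bytes differ, which is exactly the reported bug.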

pitrou (Member) commented May 24, 2023

@felipecrv @benibus Does one of you want to take a look here?

felipecrv self-assigned this on May 24, 2023
felipecrv (Contributor) commented:

Currently investigating this. Fix won't be trivial.

For instance, hashing the validity bitmap while honoring the array offset requires some ingenuity to be made efficient, since the offset doesn't always point to a byte-aligned bit.

felipecrv (Contributor) commented:

Is it fair to say this can only be made efficient by having some kind of Rolling Hash on the stream of bits?

https://en.wikipedia.org/wiki/Rolling_hash

pitrou (Member) commented May 25, 2023

The bug description is convoluted, so here is a simpler reproducer:

>>> a = pa.array([[{'a': 5}, {'a': 6}], [{'a': 7}, None]])
>>> b = pa.array([[{'a': 7}, None]])
>>> a[1]
<pyarrow.ListScalar: [{'a': 7}, None]>
>>> b[0]
<pyarrow.ListScalar: [{'a': 7}, None]>
>>> a[1] == b[0]
True
>>> hash(a[1]) == hash(b[0])
False

pitrou (Member) commented May 25, 2023

@felipecrv Let's not overdo this. We don't need to hash everything, and the null bitmap can be ignored if it makes things simpler.

felipecrv (Contributor) commented May 30, 2023

Hashing anything other than the validity bitmap buffer can be very tricky, since null slots may contain arbitrary bytes within the value buffers. It would also require inferring bit widths for each type, so I kept the hashing simple: only validity buffers are hashed, as before.

The part where I may have overdone things a bit is that I worked out a hash function for bitmaps that accounts for offsets. It didn't require a rolling hash — only shifts and rotations before the data is fed into the hash mixing, plus careful handling of leading/trailing bits. I wrote very comprehensive tests for it.
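The invariant behind that fix can be sketched in Python: normalize the bit range to a zero-offset representation before hashing, so equal logical bit ranges hash equally no matter where they start in the buffer. This per-bit version is only a naive illustration (the function name is mine); the actual C++ implementation works byte-at-a-time using the shifts and rotations described above:

```python
def hash_bitmap(buf: bytes, bit_offset: int, length: int) -> int:
    """Hash `length` bits of `buf` starting at `bit_offset` (LSB-first,
    as in Arrow validity bitmaps), independent of byte alignment."""
    out = bytearray((length + 7) // 8)
    for i in range(length):
        # extract logical bit i, then repack it as if the offset were zero
        bit = (buf[(bit_offset + i) // 8] >> ((bit_offset + i) % 8)) & 1
        out[i // 8] |= bit << (i % 8)
    return hash((length, bytes(out)))


# bits [2, 6) of 0b10110100 are 1,0,1,1 -- the same as bits [0, 4) of 0b1101,
# so the two calls must agree even though the buffers and offsets differ
assert hash_bitmap(b"\xb4", 2, 4) == hash_bitmap(b"\x0d", 0, 4)
```

With this normalization, the hashes of `chunk[0]` and `chunk_copy[0]` in the original report would agree, because both reduce to the same zero-offset bit sequence before mixing.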

pitrou added a commit that referenced this issue Jun 1, 2023
…() (#35814)

### Rationale for this change

A fix for #35360.

### What changes are included in this PR?

 - [x] A hash function that can hash bitmaps
 - [x] The fix for hashes of equal scalars sometimes not being equal because of offsets

### Are these changes tested?

Yes. By unit tests.

### Are there any user-facing changes?

No.
* Closes: #35360

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added this to the 13.0.0 milestone on Jun 1, 2023