
[C++] Incorrect hash value for Scalars with sliced child data (ignores offset) #35360

Closed
mosalx opened this issue Apr 27, 2023 · 7 comments · Fixed by #35814
Comments

mosalx commented Apr 27, 2023

Summary

When a pyarrow ListArray or FixedSizeListArray has a struct value type, two equal scalars can end up with different hash values. This violates the contract for Python's __hash__, which states: "The only required property is that objects which compare equal have the same hash value"
https://docs.python.org/3/reference/datamodel.html#object.__hash__

Below is the smallest reproducible example that demonstrates this issue. The example uses FixedSizeListArray, but ListArray is affected too.

Environment

Windows 10
python=3.11.2
pyarrow=11.0.0

Details

import pyarrow as pa

# initial array
_type = pa.list_(pa.struct([('a', pa.int32())]), list_size=1)
array = pa.array([[{'a': 1}], [{'a': 1}], [{'a': 1}], None], type=_type)

# make a deep copy of the last two elements. This involves copying all array buffers
# and truncating the unused bytes implied by the array offset. For simplicity, not all
# buffers are copied here (the `field` array is reused as-is) to keep the example small
chunk = array[2:]
child = chunk.values[chunk.offset:]  # StructArray
field = child.field('a')  # Int32Array

# create a copy of `child`. Validity buffer could be set to None, it would not change the outcome
validity_buffer_child = pa.array([True, False, False, False, False, False, False, False]).buffers()[1]
child_copy = type(child).from_buffers(child.type, length=len(child), 
                                      buffers=[validity_buffer_child], 
                                      children=[field])
assert child_copy.equals(child)

# create a copy of `chunk` using `child_copy` made above
validity_buffer_chunk = pa.array([True, False, False, False, False, False, False, False]).buffers()[1]
chunk_copy = pa.FixedSizeListArray.from_buffers(type=chunk.type, length=len(chunk), 
                                                buffers=[validity_buffer_chunk], 
                                                children=[child_copy])
assert chunk_copy.equals(chunk)

Now we have two equal arrays whose first element is valid (non-null).
The equality check for the first element passes:

assert chunk_copy[0] == chunk[0]  # Ok

But their hash values are different

assert hash(chunk_copy[0]) == hash(chunk[0])  # AssertionError

Component(s)

Python

jorisvandenbossche changed the title from "Equal scalars may have different hash values" to "[C++] Incorrect hash value for Scalars with sliced child data (ignores offset)" on Apr 28, 2023
jorisvandenbossche (Member) commented:

Thanks for the report!

I can confirm this on the last development version as well, and looking into our Scalar::Hash implementation, it seems that when the scalar is backed by an array (as is the case for a ListArray), our hashing implementation is a bit too simple:

Status ArrayHash(const ArrayData& a) {
  RETURN_NOT_OK(StdHash(a.length) & StdHash(a.GetNullCount()));
  if (a.buffers[0] != nullptr) {
    // We can't visit values without unboxing the whole array, so only hash
    // the null bitmap for now.
    RETURN_NOT_OK(BufferHash(*a.buffers[0]));
  }
  for (const auto& child : a.child_data) {
    RETURN_NOT_OK(ArrayHash(*child));
  }
  return Status::OK();
}

First, this just loops through the child data and hashes those. But I think this doesn't take the correct length / offset into account when your array is sliced (in which case the child data correspond to the full, un-sliced data). For example, for a StructArray, getting a field does not just access the child_data, but slices it:

const std::shared_ptr<Array>& StructArray::field(int i) const {
  std::shared_ptr<Array> result = std::atomic_load(&boxed_fields_[i]);
  if (!result) {
    std::shared_ptr<ArrayData> field_data;
    if (data_->offset != 0 || data_->child_data[i]->length != data_->length) {
      field_data = data_->child_data[i]->Slice(data_->offset, data_->length);
    } else {
      field_data = data_->child_data[i];
    }
    std::shared_ptr<Array> result = MakeArray(field_data);
    std::atomic_store(&boxed_fields_[i], result);
    return boxed_fields_[i];
  }
  return boxed_fields_[i];
}

In addition, you can see from the first snippet that we only hash the length, null count, and null bitmap of an array, and not the values inside it. So we actually ignore the content and give the same hash for different scalars:

In [69]: hash(pa.scalar([{'a': 1}]))
Out[69]: -285312971393311483

In [70]: hash(pa.scalar([{'a': 2}]))
Out[70]: -285312971393311483
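The two flaws above (offsets ignored, values ignored) can be made concrete with a rough Python transliteration of ArrayHash. Everything here — FakeArrayData, array_hash, the xor-combining scheme — is my own illustrative stand-in, not Arrow's actual code:

```python
from typing import List, Optional


class FakeArrayData:
    """Toy stand-in for Arrow's ArrayData (illustrative only)."""

    def __init__(self, length: int, null_count: int,
                 validity: Optional[bytes], offset: int = 0,
                 child_data: Optional[List["FakeArrayData"]] = None):
        self.length = length
        self.null_count = null_count
        self.validity = validity          # raw, un-sliced buffer
        self.offset = offset              # never consulted by array_hash below!
        self.child_data = child_data or []


def array_hash(a: FakeArrayData) -> int:
    # Mirrors the flawed C++ ArrayHash: hash length, null count, and the
    # raw validity buffer, then recurse into children -- the offset (and
    # any value buffer) is ignored entirely.
    h = hash((a.length, a.null_count))
    if a.validity is not None:
        h ^= hash(a.validity)
    for child in a.child_data:
        h ^= array_hash(child)
    return h


# Two arrays representing the same logical values; one child buffer carries
# an extra leading byte hidden behind an offset of 8 bits:
full = FakeArrayData(2, 0, None,
                     child_data=[FakeArrayData(2, 0, b"\x03")])
sliced = FakeArrayData(2, 0, None,
                       child_data=[FakeArrayData(2, 0, b"\xff\x03", offset=8)])
assert array_hash(full) != array_hash(sliced)  # equal data, different hashes
```

The xor of the raw child buffers differs whenever the leading (logically invisible) bytes differ, which is exactly the reported bug.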

pitrou (Member) commented May 24, 2023

@felipecrv @benibus Does one of you want to take a look here?

felipecrv self-assigned this on May 24, 2023
felipecrv (Contributor) commented:

Currently investigating this. Fix won't be trivial.

For instance, hashing the validity bitmap while honoring the array offset requires some ingenuity to be made efficient, since the offset doesn't always point to a byte-aligned bit.

felipecrv (Contributor) commented:

Is it fair to say this can only be made efficient by having some kind of Rolling Hash on the stream of bits?

https://en.wikipedia.org/wiki/Rolling_hash

pitrou (Member) commented May 25, 2023

The bug description is convoluted, so here is a simpler reproducer:

>>> a = pa.array([[{'a': 5}, {'a': 6}], [{'a': 7}, None]])
>>> b = pa.array([[{'a': 7}, None]])
>>> a[1]
<pyarrow.ListScalar: [{'a': 7}, None]>
>>> b[0]
<pyarrow.ListScalar: [{'a': 7}, None]>
>>> a[1] == b[0]
True
>>> hash(a[1]) == hash(b[0])
False

pitrou (Member) commented May 25, 2023

@felipecrv Let's not overdo this. We don't need to hash everything, and the null bitmap can be ignored if it makes things simpler.

felipecrv (Contributor) commented May 30, 2023

Hashing anything other than the validity bitmap buffer can be very tricky, since null slots may contain arbitrary bytes within the value buffers. It would also require inferring bit widths for each type, so I kept the hashing simple: only validity buffers are hashed, as before.

The part where I may have overdone things a bit is that I worked out a hash function for bitmaps that accounts for offsets. It didn't require a rolling hash — only shifts and rotations before the data is fed into the hash mixing, plus careful handling of leading/trailing bits. I wrote very comprehensive tests for it.
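The invariant behind that fix can be sketched in Python: normalize the bit range to a zero-offset representation before hashing, so equal logical bit ranges hash equally no matter where they start in the buffer. This per-bit version is only a naive illustration (the function name is mine); the actual C++ implementation works byte-at-a-time using the shifts and rotations described above:

```python
def hash_bitmap(buf: bytes, bit_offset: int, length: int) -> int:
    """Hash `length` bits of `buf` starting at `bit_offset` (LSB-first,
    as in Arrow validity bitmaps), independent of byte alignment."""
    out = bytearray((length + 7) // 8)
    for i in range(length):
        # extract logical bit i, then repack it as if the offset were zero
        bit = (buf[(bit_offset + i) // 8] >> ((bit_offset + i) % 8)) & 1
        out[i // 8] |= bit << (i % 8)
    return hash((length, bytes(out)))


# bits [2, 6) of 0b10110100 are 1,0,1,1 -- the same as bits [0, 4) of 0b1101,
# so the two calls must agree even though the buffers and offsets differ
assert hash_bitmap(b"\xb4", 2, 4) == hash_bitmap(b"\x0d", 0, 4)
```

With this normalization, the hashes of `chunk[0]` and `chunk_copy[0]` in the original report would agree, because both reduce to the same zero-offset bit sequence before mixing.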

pitrou added a commit that referenced this issue Jun 1, 2023
…() (#35814)

### Rationale for this change

A fix for #35360.

### What changes are included in this PR?

 - [x] A hash function that can hash bitmaps
 - [x] The fix for hashes of equal scalars sometimes not being equal because of offsets

### Are these changes tested?

Yes. By unit tests.

### Are there any user-facing changes?

No.
* Closes: #35360

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added this to the 13.0.0 milestone on Jun 1, 2023