Add Py_hash_t Py_HashBytes(const void*, Py_ssize_t) #13

Closed
pitrou opened this issue Feb 6, 2024 · 14 comments
@pitrou

pitrou commented Feb 6, 2024

CPython has an internal API Py_hash_t _Py_HashBytes(const void*, Py_ssize_t) that implements hashing of a buffer of bytes, consistently with the hash output of the bytes object. It was added (by me) in python/cpython@ce4a9da

It is currently used internally for hashing bytes objects (of course), but also str objects, memoryview objects, some datetime objects, and a couple of other duties:

>>> hash(b"xxx")
-7933225254679225263
>>> hash(memoryview(b"xxx"))
-7933225254679225263
>>> import io
>>> bio = io.BytesIO()
>>> bio.write(b"xxx")
3
>>> bio.getbuffer()
<memory at 0x7f288ed29a80>
>>> hash(bio.getbuffer().toreadonly())
-7933225254679225263

Third-party libraries may want to define buffer-like objects and ensure that they are hashable in a way that's compatible with built-in bytes objects. Currently this would mean relying on the aforementioned internal API. An example I'm familiar with is the Buffer object in PyArrow.

>>> import pyarrow as pa
>>> buf = pa.py_buffer(b"xxx")
>>> buf
<pyarrow.Buffer address=0x7f28e05bdd00 size=3 is_cpu=True is_mutable=False>
>>> buf == b"xxx"
True
>>> hash(buf)
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'pyarrow.lib.Buffer'

I simply propose making the API public and renaming it to Py_HashBytes, such that third-party libraries have access to the same facility.

@encukou
Collaborator

encukou commented Feb 15, 2024

This is necessary to implement a hashable object that compares equal to a bytes (or memoryview) with the same contents.
Going a bit further: if we encourage allowing such comparisons, like PyArrow.buffer(b'xxx') == b'xxx' == MyCustomBuffer(b'xxx'), then we should encourage that PyArrow.buffer(b'xxx') == MyCustomBuffer(b'xxx'). (And of course the hashes need to match too.)
This means that comparison methods of buffer classes should, when given an other argument of an unknown type, ask other for a buffer and compare its contents to what self would export. And if they do that, they should use _Py_HashBytes for hashing.
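The advice above can be sketched at the Python level. In this hypothetical example (FrozenBuffer is illustrative only, not part of PyArrow or CPython), `__eq__` asks the other object for a buffer and compares contents, and `__hash__` hashes exactly the bytes the class would export, mirroring what a C extension would do with bf_getbuffer and the proposed hashing function:

```python
# Hypothetical sketch (FrozenBuffer is illustrative, not a real API) of the
# advice above: __eq__ asks `other` for a buffer and compares contents, and
# __hash__ hashes exactly the bytes we would export.
class FrozenBuffer:
    def __init__(self, data):
        self._data = bytes(data)

    def __buffer__(self, flags):
        # What bf_getbuffer would export (honored by Python 3.12+).
        return memoryview(self._data)

    def __eq__(self, other):
        try:
            other_view = memoryview(other)  # ask `other` for a buffer
        except TypeError:
            return NotImplemented
        return memoryview(self._data) == other_view

    def __hash__(self):
        # Analogue of _Py_HashBytes: hash() of a bytes object hashes its
        # raw contents.
        return hash(self._data)

buf = FrozenBuffer(b"xxx")
assert buf == b"xxx" and buf == memoryview(b"xxx")
assert hash(buf) == hash(b"xxx") == hash(memoryview(b"xxx"))
assert {b"xxx": 1}[buf] == 1  # interchangeable as a dict key
```

The last assertion is the payoff: because hashes and equality agree, the custom buffer and a bytes object are interchangeable as dict and set keys.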

I'd name the public function Py_HashBuffer, even though it doesn't take the entire Py_buffer struct, to emphasize that it's meant to hash data you'd export using the buffer protocol.

@vstinner

What's the behavior for a negative length? Currently, it seems like a negative length is cast to an unsigned length and so has funny behavior. I suggest changing the length type to the unsigned size_t.

I would be fine with a Py_hash_t Py_HashBytes(const void *data, size_t size) API. FYI it returns 0 if size is equal to zero, but I would prefer not to put that in the API description.

Other examples of size_t parameters:

  • PyAPI_FUNC(PyObject*) PyLong_FromNativeBytes(const void* buffer, size_t n_bytes, int endianness);: recent function
  • static inline size_t _PyObject_SIZE(PyTypeObject *type): I converted it to a static inline function recently. The implementation is way easier if it returns an unsigned size.
  • void* PyObject_Malloc(size_t size)
  • PyAPI_FUNC(wchar_t *) Py_DecodeLocale(const char *arg, size_t *size);
  • PyAPI_FUNC(int) PyOS_snprintf(char *str, size_t size, const char *format, ...)

In the C library, size_t is common. Examples:

  • void *memcpy(void *dest, const void *src, size_t n);
  • ssize_t write(int fd, const void *buf, size_t count);: return type is interesting here :-)
  • int pthread_attr_setstacksize(pthread_attr_t *attr, size_t stacksize);
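The "funny behavior" mentioned above can be observed from Python with ctypes, which here only serves to mimic the C-style cast: a negative Py_ssize_t reinterpreted as size_t wraps around to a huge unsigned value, which is why an unchecked negative length is dangerous.

```python
# Sketch of the wraparound: casting -5 to size_t, as C would do implicitly
# when a signed length is passed to an unsigned parameter.
import ctypes

bits = 8 * ctypes.sizeof(ctypes.c_size_t)  # 64 on common platforms
wrapped = ctypes.c_size_t(-5).value        # C-style cast of -5 to size_t
assert wrapped == 2**bits - 5              # a huge positive length
```

A hash function handed that length would read far past the end of the buffer, hence the "random crash" concern below.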

@pitrou
Author

pitrou commented Feb 17, 2024

Currently, it seems like a negative length is cast to an unsigned length and so has funny behavior. I suggest changing the length type to the unsigned size_t.

All object lengths in Python are signed, including the Py_buffer::len member.

I don't really care either way, but insisting on size_t seems a bit gratuitous to me :-)

@vstinner

I don't really care either way, but insisting on size_t seems a bit gratuitous to me :-)

If you want to keep Py_ssize_t, I would ask for the behavior for a negative length to be defined, and different from "random crash" if possible.

@pitrou
Author

pitrou commented Feb 17, 2024

Negative length could for example be clamped to 0, and perhaps coupled with a debug-mode assertion?

@vstinner

I'm fine with Py_HashBytes(data, -5) returning 0 (rather than crashing). We can document that the length must be non-negative :-) Sure, an assertion is welcome.
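The semantics sketched in the last two comments (assert in debug builds, clamp otherwise) can be illustrated with a hypothetical Python stand-in; hash_buffer is not the real C function, just an analogue:

```python
# Hypothetical stand-in for the agreed semantics: assert on negative
# lengths in debug runs, clamp to 0 otherwise.
def hash_buffer(data, length):
    assert length >= 0, "length must be non-negative"  # debug-mode assertion
    length = max(length, 0)  # clamp: a negative length hashes like empty
    return hash(bytes(data)[:length])

assert hash_buffer(b"xxx", 3) == hash(b"xxx")
assert hash_buffer(b"xxx", 0) == hash(b"")  # CPython hashes empty bytes to 0
```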

@vstinner

I'd name the public function Py_HashBuffer, even though it doesn't take the entire Py_Buffer struct, to emphasize that it's meant to hash data you'd export using the buffer protocol.

If the parameter type is not Py_buffer, IMO Py_HashBuffer() name is misleading and I prefer proposed Py_HashBytes() name.

@encukou
Collaborator

encukou commented Feb 28, 2024

It's not a PyBytes object either. The reasoning behind Py_HashBuffer is that you should use the same data that you expose with bf_getbuffer, otherwise you get surprising behaviour.
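The surprising behaviour can be demonstrated with a hypothetical counter-example (BadBuffer is illustrative only): a class that exports one set of bytes via the buffer protocol but hashes different bytes makes equal objects hash unequally, silently breaking dicts and sets.

```python
# Hypothetical counter-example: hashing different bytes than the ones
# exported/compared violates the rule that equal objects must hash equal.
class BadBuffer:
    def __init__(self, data):
        self._data = bytes(data)

    def __buffer__(self, flags):
        return memoryview(self._data)  # exports all of self._data ...

    def __eq__(self, other):
        try:
            return memoryview(self._data) == memoryview(other)
        except TypeError:
            return NotImplemented

    def __hash__(self):
        return hash(self._data[:1])    # ... but hashes only the first byte

bad = BadBuffer(b"xxx")
assert bad == b"xxx"            # compares equal to the bytes object ...
assert hash(bad) == hash(b"x")  # ... yet hashes as b"x", not as b"xxx"
```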

@pitrou, I see you've added a thumbs-up to my comment. IMO, the docs will be fairly important in this case; I'd really like to frame it as “implementing equality & hashes for (immutable) buffer objects” rather than “here's a function you can use”.
Do you want to draft it, or should I try my hand?

Negatives aren't the only case of invalid sizes. IMO it's OK if Py_HashBuffer(data, -1) has undefined behaviour, like e.g. Py_HashBuffer(data, SIZE_MAX/4) would (on common systems). A debug assertion is fine.

@pitrou
Author

pitrou commented Mar 2, 2024

Perhaps something like:

.. c:function:: Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len)

   Compute and return the hash value of a buffer of *len* bytes
   starting at address *ptr*. This hash value is guaranteed to be
   equal to the hash value of a :class:`bytes` object with the same
   contents.

   This function is meant to ease implementation of hashing for
   immutable objects providing the :ref:`buffer protocol <bufferobjects>`.

@encukou
Collaborator

encukou commented Mar 6, 2024

I meant something like:

.. c:function:: Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len)

   Compute and return the hash value of a buffer of *len* bytes
   starting at address *ptr*. The hash is guaranteed to match that of
   :class:`bytes`, :class:`memoryview`, and other built-in objects
   that implement the :ref:`buffer protocol <bufferobjects>`.

   Use this function to implement hashing for immutable
   objects whose `tp_richcompare` function compares
   to another object's buffer.

But the details can be hashed out in review.
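The guarantee both doc drafts describe can already be observed from Python for the built-in buffer providers: bytes and read-only memoryviews hash their contents identically.

```python
# Built-in buffer providers already hash consistently with bytes.
data = b"spam"
assert hash(data) == hash(memoryview(data))
assert hash(data) == hash(memoryview(data).toreadonly())
```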

@encukou
Collaborator

encukou commented Mar 8, 2024

Well, I think this is overdue for a vote.

@vstinner

Ping @iritkatriel who didn't vote.

@iritkatriel
Member

Sorry.

@vstinner

@pitrou: You can go ahead with Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len) API (and proposed documentation), it's approved by the C API working group. I close the issue.
