Add Py_hash_t Py_HashBytes(const void*, Py_ssize_t) #13

Closed
pitrou opened this issue Feb 6, 2024 · 14 comments
@pitrou

pitrou commented Feb 6, 2024

CPython has an internal API Py_hash_t _Py_HashBytes(const void*, Py_ssize_t) that implements hashing of a buffer of bytes, consistently with the hash output of the bytes object. It was added (by me) in python/cpython@ce4a9da

It is currently used internally for hashing bytes objects (of course), but also str objects, memoryview objects, some datetime objects, and a couple of other duties:

>>> hash(b"xxx")
-7933225254679225263
>>> hash(memoryview(b"xxx"))
-7933225254679225263
>>> import io
>>> bio = io.BytesIO()
>>> bio.write(b"xxx")
3
>>> bio.getbuffer()
<memory at 0x7f288ed29a80>
>>> hash(bio.getbuffer().toreadonly())
-7933225254679225263

Third-party libraries may want to define buffer-like objects and ensure that they are hashable in a way that's compatible with built-in bytes objects. Currently this would mean relying on the aforementioned internal API. An example I'm familiar with is the Buffer object in PyArrow.

>>> import pyarrow as pa
>>> buf = pa.py_buffer(b"xxx")
>>> buf
<pyarrow.Buffer address=0x7f28e05bdd00 size=3 is_cpu=True is_mutable=False>
>>> buf == b"xxx"
True
>>> hash(buf)
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'pyarrow.lib.Buffer'

I simply propose making the API public and renaming it to Py_HashBytes, such that third-party libraries have access to the same facility.

@encukou
Collaborator

encukou commented Feb 15, 2024

This is necessary to implement a hashable object that compares equal to a bytes (or memoryview) with the same contents.
Going a bit further: if we encourage allowing such comparisons, like PyArrow.buffer(b'xxx') == b'xxx' == MyCustomBuffer(b'xxx'), then we should encourage that PyArrow.buffer(b'xxx') == MyCustomBuffer(b'xxx'). (And of course the hashes need to match too.)
This means that comparison methods of buffer classes should, when given an other argument of an unknown type, ask other for a buffer and compare its contents to what self would export. And if they do that, they should use _Py_HashBytes for hashing.
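The advice above can be sketched at the Python level. In this hypothetical example (FrozenBuffer is illustrative only, not part of PyArrow or CPython), `__eq__` asks the other object for a buffer and compares contents, and `__hash__` hashes exactly the bytes the class would export, mirroring what a C extension would do with bf_getbuffer and the proposed hashing function:

```python
# Hypothetical sketch (FrozenBuffer is illustrative, not a real API) of the
# advice above: __eq__ asks `other` for a buffer and compares contents, and
# __hash__ hashes exactly the bytes we would export.
class FrozenBuffer:
    def __init__(self, data):
        self._data = bytes(data)

    def __buffer__(self, flags):
        # What bf_getbuffer would export (honored by Python 3.12+).
        return memoryview(self._data)

    def __eq__(self, other):
        try:
            other_view = memoryview(other)  # ask `other` for a buffer
        except TypeError:
            return NotImplemented
        return memoryview(self._data) == other_view

    def __hash__(self):
        # Analogue of _Py_HashBytes: hash() of a bytes object hashes its
        # raw contents.
        return hash(self._data)

buf = FrozenBuffer(b"xxx")
assert buf == b"xxx" and buf == memoryview(b"xxx")
assert hash(buf) == hash(b"xxx") == hash(memoryview(b"xxx"))
assert {b"xxx": 1}[buf] == 1  # interchangeable as a dict key
```

The last assertion is the payoff: because hashes and equality agree, the custom buffer and a bytes object are interchangeable as dict and set keys.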

I'd name the public function Py_HashBuffer, even though it doesn't take the entire Py_buffer struct, to emphasize that it's meant to hash data you'd export using the buffer protocol.

@vstinner

What's the behavior for a negative length? Currently, it seems like a negative length is cast to an unsigned length and so has funny behavior. I suggest changing the length type to the unsigned size_t.

I would be fine with a Py_hash_t Py_HashBytes(const void *data, size_t size) API. FYI it returns 0 if size is equal to zero, but I would prefer not to put that in the API description.

Other examples of size_t parameters:

  • PyAPI_FUNC(PyObject*) PyLong_FromNativeBytes(const void* buffer, size_t n_bytes, int endianness);: recent function
  • static inline size_t _PyObject_SIZE(PyTypeObject *type): I converted it to a static inline function recently. The implementation is way easier if it returns an unsigned size.
  • void* PyObject_Malloc(size_t size)
  • PyAPI_FUNC(wchar_t *) Py_DecodeLocale(const char *arg, size_t *size);
  • PyAPI_FUNC(int) PyOS_snprintf(char *str, size_t size, const char *format, ...)

In the C library, size_t is common. Examples:

  • void *memcpy(void *dest, const void *src, size_t n);
  • ssize_t write(int fd, const void *buf, size_t count);: return type is interesting here :-)
  • int pthread_attr_setstacksize(pthread_attr_t *attr, size_t stacksize);
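The "funny behavior" mentioned above can be observed from Python with ctypes, which here only serves to mimic the C-style cast: a negative Py_ssize_t reinterpreted as size_t wraps around to a huge unsigned value, which is why an unchecked negative length is dangerous.

```python
# Sketch of the wraparound: casting -5 to size_t, as C would do implicitly
# when a signed length is passed to an unsigned parameter.
import ctypes

bits = 8 * ctypes.sizeof(ctypes.c_size_t)  # 64 on common platforms
wrapped = ctypes.c_size_t(-5).value        # C-style cast of -5 to size_t
assert wrapped == 2**bits - 5              # a huge positive length
```

A hash function handed that length would read far past the end of the buffer, hence the "random crash" concern below.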

@pitrou
Author

pitrou commented Feb 17, 2024

Currently, it seems like a negative length is cast to an unsigned length and so has funny behavior. I suggest changing the length type to the unsigned size_t.

All object lengths in Python are signed, including the Py_buffer::len member.

I don't really care either way, but insisting on size_t seems a bit gratuitous to me :-)

@vstinner

I don't really care either way, but insisting on size_t seems a bit gratuitous to me :-)

If you want to keep Py_ssize_t, I would ask for the behavior for a negative length to be defined, and different from "random crash" if possible.

@pitrou
Author

pitrou commented Feb 17, 2024

Negative length could for example be clamped to 0, and perhaps coupled with a debug-mode assertion?

@vstinner

I'm fine with Py_HashBytes(data, -5) returning 0 (rather than crashing). We can document that the length must be non-negative :-) Sure, an assertion is welcome.
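The semantics sketched in the last two comments (assert in debug builds, clamp otherwise) can be illustrated with a hypothetical Python stand-in; hash_buffer is not the real C function, just an analogue:

```python
# Hypothetical stand-in for the agreed semantics: assert on negative
# lengths in debug runs, clamp to 0 otherwise.
def hash_buffer(data, length):
    assert length >= 0, "length must be non-negative"  # debug-mode assertion
    length = max(length, 0)  # clamp: a negative length hashes like empty
    return hash(bytes(data)[:length])

assert hash_buffer(b"xxx", 3) == hash(b"xxx")
assert hash_buffer(b"xxx", 0) == hash(b"")  # CPython hashes empty bytes to 0
```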

@vstinner

I'd name the public function Py_HashBuffer, even though it doesn't take the entire Py_Buffer struct, to emphasize that it's meant to hash data you'd export using the buffer protocol.

If the parameter type is not Py_buffer, IMO Py_HashBuffer() name is misleading and I prefer proposed Py_HashBytes() name.

@encukou
Collaborator

encukou commented Feb 28, 2024

It's not a PyBytes object either. The reasoning behind Py_HashBuffer is that you should use the same data that you expose with bf_getbuffer, otherwise you get surprising behaviour.
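The surprising behaviour can be demonstrated with a hypothetical counter-example (BadBuffer is illustrative only): a class that exports one set of bytes via the buffer protocol but hashes different bytes makes equal objects hash unequally, silently breaking dicts and sets.

```python
# Hypothetical counter-example: hashing different bytes than the ones
# exported/compared violates the rule that equal objects must hash equal.
class BadBuffer:
    def __init__(self, data):
        self._data = bytes(data)

    def __buffer__(self, flags):
        return memoryview(self._data)  # exports all of self._data ...

    def __eq__(self, other):
        try:
            return memoryview(self._data) == memoryview(other)
        except TypeError:
            return NotImplemented

    def __hash__(self):
        return hash(self._data[:1])    # ... but hashes only the first byte

bad = BadBuffer(b"xxx")
assert bad == b"xxx"            # compares equal to the bytes object ...
assert hash(bad) == hash(b"x")  # ... yet hashes as b"x", not as b"xxx"
```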

@pitrou, I see you've added a thumbs-up to my comment. IMO, the docs will be fairly important in this case; I'd really like to frame it as “implementing equality & hashes for (immutable) buffer objects” rather than “here's a function you can use”.
Do you want to draft it, or should I try my hand?

Negatives aren't the only case of invalid sizes. IMO it's OK if Py_HashBuffer(data, -1) has undefined behaviour, like e.g. Py_HashBuffer(data, SIZE_MAX/4) would (on common systems). A debug assertion is fine.

@pitrou
Author

pitrou commented Mar 2, 2024

Perhaps something like:

.. c:function:: Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len)

   Compute and return the hash value of a buffer of *len* bytes
   starting at address *ptr*. This hash value is guaranteed to be
   equal to the hash value of a :class:`bytes` object with the same
   contents.

   This function is meant to ease implementation of hashing for
   immutable objects providing the :ref:`buffer protocol <bufferobjects>`.

@encukou
Collaborator

encukou commented Mar 6, 2024

I meant something like:

.. c:function:: Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len)

   Compute and return the hash value of a buffer of *len* bytes
   starting at address *ptr*. The hash is guaranteed to match that of
   :class:`bytes`, :class:`memoryview`, and other built-in objects
   that implement the :ref:`buffer protocol <bufferobjects>`.

   Use this function to implement hashing for immutable
   objects whose `tp_richcompare` function compares
   to another object's buffer.

But the details can be hashed out in review.
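The guarantee both doc drafts describe can already be observed from Python for the built-in buffer providers: bytes and read-only memoryviews hash their contents identically.

```python
# Built-in buffer providers already hash consistently with bytes.
data = b"spam"
assert hash(data) == hash(memoryview(data))
assert hash(data) == hash(memoryview(data).toreadonly())
```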

@encukou
Collaborator

encukou commented Mar 8, 2024

Well, I think this is overdue for a vote.

@vstinner

Ping @iritkatriel who didn't vote.

@iritkatriel
Member

Sorry.

@vstinner

@pitrou: You can go ahead with Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len) API (and proposed documentation), it's approved by the C API working group. I close the issue.
