
CPython project has the lead on the C API, other Python implementations have to follow #64

Open
vstinner opened this issue Jul 24, 2023 · 12 comments


@vstinner

Other Python implementations have to provide a C API that is as complete and as compatible as possible. The problem is that the C API leaks many implementation details: memory layout (structures like PyTupleObject or PyDictObject), memory allocation (objects must be allocated on the heap), reference counting, etc. The C API doesn't fit well with a moving GC, for example.
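
To illustrate, here is a minimal sketch (not from the issue itself) of the kind of code countless extensions contain; it compiles against CPython's headers and reads directly out of object internals that another implementation may simply not have:

```c
#include <Python.h>

/* PyTuple_GET_ITEM is a macro that reads straight out of the
 * PyTupleObject struct, so the tuple's exact memory layout becomes part
 * of the de facto API; Py_INCREF bakes reference counting in as well. */
static PyObject *
first_item(PyObject *tuple)
{
    PyObject *item = PyTuple_GET_ITEM(tuple, 0);  /* borrowed, no checks */
    Py_INCREF(item);
    return item;
}
```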

While some core devs try to reach out to consumers of this API and to developers of other Python implementations, in practice CPython still has the lead on the C API and dictates how other Python implementations must provide it.

Maybe the HPy project will change that, since HPy gives a Python implementation great freedom in how it supports HPy. In the meantime, there is a long list of C extensions that access the C API directly.

Another option is to migrate existing C extensions to a higher-level API like Cython, HPy, cffi, pybind11 or anything else, so that they no longer access the C API directly.
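
For comparison, a rough sketch of what HPy's handle-based style looks like; the HPyDef_METH macro and function signatures have shifted between HPy releases, so treat this as illustrative rather than exact:

```c
#include "hpy.h"

/* HPy handles are opaque: behind them, an implementation is free to use
 * a moving GC, tagged pointers, or no reference counts at all. */
HPyDef_METH(double_it, "double_it", HPyFunc_O)
static HPy double_it_impl(HPyContext *ctx, HPy self, HPy arg)
{
    return HPy_Add(ctx, arg, arg);  /* like PyNumber_Add, but via handles */
}
```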

@malemburg

@vstinner: You are forgetting that tools such as Cython hide away the necessary interface logic, but if you want to work with the Python interpreter and its objects, you will still use the Python C API in the functions you define in those tools.

They are not a more abstract way of accessing Python, just a more convenient one, which lets you focus more on the task at hand rather than on how to define, e.g., functions or types in C for use in Python.

More abstract would be the higher-level parts of the Python C API, i.e. the abstract API or the very high level API, but these lose big on performance and so are not always what C extension writers are looking for.
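
Concretely, the trade-off looks something like this (a small sketch; both calls are documented CPython C API, the surrounding function is just an example):

```c
#include <Python.h>

/* The abstract API works for any sequence type but pays for type
 * dispatch and returns a new reference; the concrete macro is a direct
 * struct read returning a borrowed reference. */
static PyObject *
get_first(PyObject *seq)
{
    if (PyTuple_CheckExact(seq)) {
        PyObject *item = PyTuple_GET_ITEM(seq, 0);  /* fast, tuple-only */
        Py_INCREF(item);
        return item;
    }
    return PySequence_GetItem(seq, 0);  /* slower, works for any sequence */
}
```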

BTW: A better way to provide the Python C API in other Python implementations is to simply embed CPython in those implementations and provide interfaces from those implementations to the embedded CPython. Antonio's SPy will be taking this approach.
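
From the host side, embedding uses the standard CPython embedding API; a minimal sketch (how the host implementation bridges its own objects across is the open question discussed below):

```c
#include <Python.h>

int main(void)
{
    Py_Initialize();                 /* start the embedded CPython */
    PyRun_SimpleString("import sys\n"
                       "print(sys.version)");
    if (Py_FinalizeEx() < 0) {       /* flush and shut down */
        return 120;
    }
    return 0;
}
```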

@mattip

mattip commented Aug 7, 2023

> A better way to provide the Python C API in other Python implementations is to simply embed CPython in those implementations

I would like to understand what this looks like. How would the communication between the two interpreters work? Calls into the C API need a PyObject*. The slow part of the PyPy emulation layer (cpyext) is exactly that: converting the internal PyPy representation of an object into a CPython-compatible struct.
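
For readers unfamiliar with cpyext, the expensive step is conceptually something like the following; PyPyObject, materialize_struct and link_proxy are made-up names for illustration, not PyPy's real internals:

```c
#include <Python.h>

typedef struct PyPyObject PyPyObject;           /* hypothetical host object */
PyObject *materialize_struct(PyPyObject *obj);  /* hypothetical: build proxy */
void link_proxy(PyPyObject *obj, PyObject *p);  /* hypothetical: keep in sync */

/* What an emulation layer must do on every boundary crossing: the host
 * object has no CPython-compatible header, so a proxy PyObject with the
 * expected memory layout is allocated and kept in sync with the original. */
PyObject *to_cpython(PyPyObject *host_obj)
{
    PyObject *proxy = materialize_struct(host_obj);
    link_proxy(host_obj, proxy);
    return proxy;
}
```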

@malemburg

> A better way to provide the Python C API in other Python implementations is to simply embed CPython in those implementations

> I would like to understand what this looks like.

I don't know how Antonio wants to implement this, but I'd use an in-memory data transfer approach to bridge between the two worlds, ideally with zero copy. Apache Arrow, for example, provides tooling for this.
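
For reference, the Arrow C Data Interface is designed for exactly this kind of zero-copy hand-off: producer and consumer share two small C structs, and the data buffers are read in place, never copied. Abridged from the Arrow spec:

```c
#include <stdint.h>

/* Arrow C Data Interface (abridged): a producer fills this struct (plus a
 * matching ArrowSchema describing the types); the consumer reads the
 * buffers in place and calls release() when done. No per-element
 * conversion to PyObject, no copying of the buffers themselves. */
struct ArrowArray {
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void **buffers;            /* e.g. validity bitmap + data */
    struct ArrowArray **children;
    struct ArrowArray *dictionary;
    void (*release)(struct ArrowArray *);
    void *private_data;
};
```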

Trying to do this at the PyObject level will always cause problems with how the CPython internals work (or change going forward) and create too much overhead going back and forth.

This will not work for low-level bridging, e.g. using Python C extensions in tight loops written in the other Python implementation, but it works well if you clearly separate code that needs the Python C API from code that runs in the other implementation.

@mattip

mattip commented Aug 8, 2023

Makes sense. Apache Arrow's underlying representation is a (somewhat large?) data structure not related to PyObject. I think interacting with PyUnicodeObject via this kind of transfer would not be very performant.

@timfel

timfel commented Aug 8, 2023

> This will not work for low-level bridging, e.g. using Python C extensions in tight loops written in the other Python implementation, but it works well if you clearly separate code that needs the Python C API from code that runs in the other implementation.

Just as a data point, though: this means you may again lose much of the performance an alternative implementation can offer. In GraalPy, we've found that even with the overhead of transforming our objects to PyObject* and back, we can JIT the Python parts of NumPy and beat CPython that way sometimes (but only for very specific workloads!). If we ran all of NumPy on CPython, we would be slower for these workloads. Other workloads, where the boundary dominates more, are not so lucky, and it's often hard to tell which case applies just by looking at the code. That is even harder to explain to users, so I feel this approach would deny the best choice of implementation to the many users who don't have the time or expertise to dive into these details.

@malemburg

> I think interacting with PyUnicodeObject via this kind of transfer would not be very performant.

You get the most benefit out of Arrow if you keep the data maintained by Arrow and only extract results into Python objects (ideally, after having finished calculations). The data type support in Arrow is pretty extensive and also includes Unicode strings (stored as UTF-8 in Arrow).

Arrow is already on the rise for data science and processing workloads, so the added complexity pays off: it's likely you're going to need Arrow as part of the application you're running anyway.

For other workloads, it may be better to go with a serialization solution over shared memory, e.g. using MessagePack. You don't get zero copy, but you also don't have to install an additional ~100MB worth of Arrow libs.
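
A minimal sketch of that flavor of bridging, assuming msgpack-c and POSIX shared memory (the segment name /bridge is arbitrary and error handling is omitted):

```c
#include <fcntl.h>
#include <msgpack.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Serialize a value with msgpack-c, then publish the bytes in a POSIX
 * shared memory segment for the other interpreter to map and decode.
 * One copy into the segment, but no struct-level emulation. */
int publish(int value)
{
    msgpack_sbuffer sbuf;
    msgpack_sbuffer_init(&sbuf);
    msgpack_packer pk;
    msgpack_packer_init(&pk, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_int(&pk, value);

    int fd = shm_open("/bridge", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, (off_t)sbuf.size);
    void *mem = mmap(NULL, sbuf.size, PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(mem, sbuf.data, sbuf.size);

    munmap(mem, sbuf.size);
    close(fd);
    msgpack_sbuffer_destroy(&sbuf);
    return 0;
}
```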

@malemburg

> even with the overhead of transforming our objects to PyObject* and back, we can JIT the Python parts of NumPy and beat CPython that way sometimes

True, for UDFs written in Python or for code that manages low-level data structures you are stuck with CPython. But then again: there are other tools that work well with CPython and help gain more performance for these parts, e.g. numba (for NumPy) or Cython.

Using the embedding approach will certainly require changes to applications which don't provide clear boundaries between compute blocks, but IMO this is a price worth paying if you'd like to benefit from other Python implementations while still using the Python C API. There's no free lunch ;-)

@steve-s

steve-s commented Aug 8, 2023

> Antonio's SPy will be taking this approach.

Antonio's project is not meant to be a Python-compatible language; that is a big difference.

> You get the most benefit out of Arrow if you keep the data maintained by Arrow and only extract results into Python objects.

If existing Python code bases were structured like this, we wouldn't need to solve the problems with the C API. Consider a Python code base that works with NumPy arrays and mixes pure Python code with calls into the NumPy extension, even in loops: there is no clear separation between the "native" and Python parts. On top of that, replacing NumPy arrays with Arrow would be a much bigger change than migrating to some better C API, which would conceptually still be similar to what people used before.

@malemburg

> On top of that, replacing NumPy arrays with Arrow would be a much bigger change than migrating to some better C API

This is already slowly happening; see e.g. polars and pandas 2 (currently only optionally) using Arrow as the data management backend.

But we're digressing here :-)

What I'm trying to say is that in order to support the Python C API in other language implementations, you don't necessarily need to emulate it: simply use it directly via an embedded CPython and ask your users to code accordingly.

That'll save a lot of headaches, and you can then focus on making pure Python code as fast as possible.

@steve-s

steve-s commented Aug 8, 2023

> ask your users to code accordingly

We want to run existing Python code, unmodified; that's the whole point of being compatible. There are lots of quirks in Python semantics that we work hard to emulate, because one little Python package that happens to be a dependency of a few others happens to rely on them. We could have just asked them to change their code instead.

@malemburg

> We want to run existing Python code, unmodified; that's the whole point of being compatible.

That's a worthwhile goal, but only for pure Python code.

For code using C extensions, I don't think it's worthwhile trying to emulate the Python C API, but rather just embed CPython and provide it in a complete and unchanged way, with a bridge into the alternative pure Python implementation.

YMMV, of course.

@hodgestar

@malemburg How does embedding CPython help an alternative implementation that doesn't lay out its objects in memory the same way CPython does? In that case it seems one still has to pay the cost of creating a new object every time one wants to call a C extension.
