CPython project has the lead on the C API, other Python implementations have to follow #64
Comments
@vstinner: You are forgetting that tools such as Cython hide away the necessary interface logic, but if you want to work with the Python interpreter and its objects, you will still use the Python C API in the functions you define in those tools. They are not a more abstract way of accessing Python, just a more convenient one, which lets you focus more on the task at hand rather than on how to define e.g. functions or types in C for use in Python. More abstract would be the higher level parts of the Python API, i.e. the abstract API or the very high level API, but they lose big on the performance side of things and so are not always what C extension writers are looking for.

BTW: A better way to provide the Python C API in other Python implementations is to simply embed CPython in those implementations and provide interfaces from those implementations to the embedded CPython. Antonio's SPy will be taking this approach.
I would like to understand what this looks like. How would the communication between the two interpreters work? Calls into the C API need a
I don't know how Antonio wants to implement this, but I'd use an in-memory data transfer approach to bridge between the two worlds, ideally with zero copy. E.g. Apache Arrow provides tooling for this. Trying to do this at the PyObject level will always cause problems with how the CPython internals work (or change going forward) and create too much overhead going back and forth. This will not work for low level bridging, e.g. using Python C extensions in tight loops written in the other Python implementation, but it works well if you clearly separate code that needs the Python C API from other code which runs in the separate implementation.
Makes sense. Apache Arrow's underlying structure is a (somewhat large?) data structure not related to
Just as a data point though, this means you may lose a lot of the performance some alternative implementation can offer again. In GraalPy, we've found that even with the overhead of transforming our objects to
You get the most benefit out of Arrow if you keep the data maintained by Arrow and only extract results into Python objects (ideally, after having finished calculations). The data type support in Arrow is pretty extensive and also includes Unicode string support (stored as UTF-8 in Arrow). Since Arrow is already on the rise for data science and processing workloads, the added complexity pays off, since it's likely you're going to need Arrow as part of the application you're running anyway. For other workloads, it may be better to go with a serialization solution and shared memory, e.g. using MessagePack. You don't get zero copy, but you also don't have to install an additional ~100MB worth of Arrow libs.
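A rough stdlib-only sketch of the "serialize and share memory" bridge suggested above; the `struct` module stands in for MessagePack here (a real bridge would call `msgpack.packb()`/`unpackb()` on the same buffer), and the two "interpreters" are just a writer and a reader attaching to the same named shared-memory block:

```python
# Sketch of a serialize-and-share-memory bridge between two runtimes.
# struct stands in for MessagePack; the format is ours, not a standard.
import struct
from multiprocessing import shared_memory

def send(values):
    """Pack a list of floats into a named shared-memory block (writer side)."""
    payload = struct.pack(f"<Q{len(values)}d", len(values), *values)
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm  # the other interpreter attaches via shm.name

def receive(name):
    """Attach to the block by name and unpack the values (reader side)."""
    shm = shared_memory.SharedMemory(name=name)
    (count,) = struct.unpack_from("<Q", shm.buf, 0)
    values = list(struct.unpack_from(f"<{count}d", shm.buf, 8))
    shm.close()
    return values

writer = send([1.0, 2.5, 4.0])
print(receive(writer.name))   # → [1.0, 2.5, 4.0]
writer.close()
writer.unlink()
```

As the comment thread notes, this copies the bytes on unpack (no zero copy), but the dependency footprint is zero compared to shipping the Arrow libraries.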
True, for UDFs written in Python or code that manages low level data structures you are stuck with CPython. But then again: there are other tools to help with gaining more performance for these parts which work well with CPython, e.g. numba (for NumPy) or Cython. Using the embedding approach will certainly require changes to applications which don't provide clear boundaries between compute blocks, but IMO this is a price worth paying if you'd like to benefit from other Python implementations while still using the Python C API. There's no free lunch ;-)
Antonio's project is not meant to be a Python-compatible language; that is a big difference.
If existing Python code-bases were structured like this, we wouldn't need to solve the problems with the C API. What about a Python code-base that works with NumPy arrays and mixes pure Python code with calls into the NumPy extension, even in loops? There's no clear separation between the "native" and Python parts. On top of that, replacing NumPy arrays with Arrow would be a much bigger change than migrating to some better C API, which would conceptually still be similar to what people used before.
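The mixing described above can be sketched in a few lines (assuming NumPy is installed): the first function crosses the Python/extension boundary on every iteration, so there is no clean block to hand off to an embedded CPython, while the second keeps the whole computation on the native side in one call.

```python
# Illustration of mixed vs. separated Python/extension code, as
# discussed above. Not benchmark code, just the structural contrast.
import math
import numpy as np

data = np.arange(10_000, dtype=np.float64)

def mixed_sum_of_squares(arr):
    """Pure Python loop calling into the NumPy extension per element."""
    total = 0.0
    for i in range(len(arr)):
        total += float(arr[i]) ** 2   # each arr[i] boxes a scalar
    return total

def separated_sum_of_squares(arr):
    """One call; the whole computation stays inside native code."""
    return float(np.dot(arr, arr))

assert math.isclose(mixed_sum_of_squares(data),
                    separated_sum_of_squares(data), rel_tol=1e-9)
```

Only the second style has the "clear boundary between compute blocks" that the embedding approach relies on; code-bases written in the first style would need restructuring.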
This is already slowly happening, see e.g. polars and Pandas 2 (currently only optionally) using Arrow as the data management backend. But we're digressing here :-) What I'm trying to say is that in order to support the Python C API in other language implementations, you don't necessarily need to emulate it - simply use it directly via an embedded CPython and ask your users to code accordingly. That'll save a lot of headaches and you can then focus on making pure Python code as fast as possible.
We want to run existing Python code, unmodified; that's the whole point of being compatible. There are lots of quirks in Python semantics that we work hard to emulate, because one little Python package that happens to be a dependency of a few others happens to rely on them. We could have asked them to just change their code.
That's a worthwhile goal, but only for pure Python code. For code using C extensions, I don't think it's worthwhile trying to emulate the Python C API, but rather just embed CPython and provide it in a complete and unchanged way, with a bridge into the alternative pure Python implementation. YMMV, of course. |
@malemburg How does embedding CPython help an alternative implementation that doesn't lay out its objects in memory the same way CPython does? In that case it seems one still has to pay the cost of creating a new object every time one wants to call a C extension.
Other Python implementations have to provide a C API that is as complete and compatible as possible. The problem is that the C API leaks many implementation details like memory layout (structures like PyTupleObject or PyDictObject), memory allocation (objects must be allocated on the heap), reference counting, etc. The C API doesn't fit well with a moving GC, for example.
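The memory-layout leak described above can be observed even from pure Python: `ctypes` can read the `ob_refcnt` field that CPython's `PyObject` header places at the start of every object. A sketch, valid only for a standard (non-debug, non-free-threaded) CPython build; on any other implementation, or under a moving GC, there is simply no such field at a stable address:

```python
# CPython-specific: peek at the PyObject header via ctypes.
# On other implementations/builds this offset means something else.
import ctypes
import sys

obj = object()        # fresh object, refcount entirely under our control
addr = id(obj)        # CPython detail: id() is the object's memory address

# PyObject starts with Py_ssize_t ob_refcnt, then PyTypeObject *ob_type.
refcnt = ctypes.c_ssize_t.from_address(addr).value
print(refcnt, sys.getrefcount(obj))

assert refcnt >= 1
```

Every C extension that touches `Py_INCREF` or `PyTuple_GET_ITEM` bakes in exactly this kind of layout assumption, which is why other implementations cannot simply relocate or restructure their objects.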
While some core devs try to reach out to consumers of this API and to developers of other Python implementations, in practice CPython still has the lead on the C API and dictates how other Python implementations must provide it.
Maybe the HPy project will change that, since HPy gives a Python implementation great freedom in how it supports HPy. In the meantime, there is a long list of C extensions accessing the C API directly.
Another option is to migrate existing C extensions to a higher-level API like Cython, HPy, cffi, pybind11 or anything else, so that they no longer access the C API directly.
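In the spirit of that list, the stdlib `ctypes` module (a cousin of cffi's ABI mode) shows what "no direct C API access" looks like: Python calls into a plain C library through declared signatures, and the binding itself carries no CPython layout assumptions. A small sketch calling `sqrt` from the C math library:

```python
# Calling C without writing any C-API code: the binding only knows
# the C function's signature, not CPython's object internals.
import ctypes
import ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))  # the C math library
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(9.0))   # → 3.0
```

Since the interface is described in terms of C types rather than PyObject structures, an alternative implementation only has to provide the same FFI surface, not CPython's internals.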