
Proposal: Reduce dependency on file descriptors #311

Open · 1 of 3 tasks
LasseBlaauwbroek opened this issue Apr 3, 2023 · 12 comments

@LasseBlaauwbroek
Contributor

LasseBlaauwbroek commented Apr 3, 2023

Pycapnp mostly works by passing file descriptors into the capnproto C++ library. This causes some problems.

To remedy this, I propose that instead of passing file descriptors to capnproto, we create wrappers around Python's IO facilities.

  • For synchronous file operations, we write a subclass of kj::InputStream and kj::OutputStream that wraps a Python file object.
    • For synchronous socket operations, one can obtain a file-like object by calling socket.makefile. (This also works with ssl.SSLSocket; see the sketch after this list.)
  • For asyncio socket operations, we write a subclass of kj::AsyncInputStream and kj::AsyncOutputStream that wraps a StreamReader and StreamWriter.
  • For asyncio file operations, it might be possible to use aiofile or aiofiles with the same interface as the previous point. To be investigated.
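
For illustration, a minimal sketch of obtaining blocking file-like objects that such kj::InputStream/kj::OutputStream wrappers could drive (host name and port are placeholders):

```python
import socket
import ssl

# Plain sockets and SSL-wrapped sockets both expose makefile(), which
# returns a buffered file-like object whose read()/write() methods a
# kj::InputStream / kj::OutputStream wrapper could call into.
sock = socket.create_connection(("example.org", 443))
ctx = ssl.create_default_context()
tls = ctx.wrap_socket(sock, server_hostname="example.org")
stream = tls.makefile("rwb")  # file-like view over the encrypted transport
```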

With these facilities, all classes like TwoPartyClient and TwoPartyServer will be modified to receive one of these wrapper classes. That should resolve all of the aforementioned problems.

A potential downside is that everything might slow down, because read and write operations have to be routed through the Python interpreter. I haven't measured the impact of this. If it turns out to be a problem, we could also keep the file descriptor API (with its known deficiencies). The downside is that the file descriptor API requires a fairly large amount of code to work (see asyncProvider.cpp in #310).

@kentonv
Member

kentonv commented Apr 3, 2023

This is @haata's call but FWIW this plan makes a lot of sense to me.

@haata
Collaborator

haata commented Apr 3, 2023

Yeah, I think this makes sense (it definitely would make things a lot more straightforward).

A potential downside is that everything might slow down, because read and write operations have to be routed through the Python interpreter. I haven't measured the impact of this. If it turns out to be a problem, we could also keep the file descriptor API (with its known deficiencies). The downside is that the file descriptor API requires a fairly large amount of code to work (see asyncProvider.cpp in #310).

I haven't investigated these APIs specifically, but it's usually possible to call the CPython APIs directly. If done the right way, you shouldn't have to pay the Python interpreter tax (or we may need to add a Python dependency with a C version of that specific API).

Before you spend too much effort, it might be worth measuring the performance of read/write operations with some small POC code.
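
A rough sketch of such a POC, assuming a scratch file path; it only compares the per-call overhead of raw os.read against unbuffered file-object reads from within Python, so the numbers are indicative at best:

```python
import os
import time

PATH = "/tmp/bench.bin"  # hypothetical scratch file
with open(PATH, "wb") as f:
    f.write(os.urandom(16 * 1024 * 1024))  # 16 MiB of test data

def bench_raw_fd():
    fd = os.open(PATH, os.O_RDONLY)
    start = time.perf_counter()
    while os.read(fd, 65536):  # 64 KiB chunks straight from the fd
        pass
    os.close(fd)
    return time.perf_counter() - start

def bench_file_object():
    with open(PATH, "rb", buffering=0) as f:
        start = time.perf_counter()
        while f.read(65536):  # same chunk size, via the io layer
            pass
        return time.perf_counter() - start

print(f"raw fd: {bench_raw_fd():.3f}s, file object: {bench_file_object():.3f}s")
```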

@kentonv
Member

kentonv commented Apr 3, 2023

Surely you can have an optimization where you detect if the stream is backed by a raw FD and, if so, use it raw, but if not, then use the streams that call back into Python?

@LasseBlaauwbroek
Contributor Author

LasseBlaauwbroek commented Apr 3, 2023

I haven't investigated these APIs specifically, but it's usually possible to call the CPython APIs directly.

I'm not aware of any C-level or Cython APIs for either file objects or sockets. Everything goes through io and socket for sync communication, and transports and protocols or streams for async communication. None of those seem to expose a C-level API. And there is certainly no low-level API for the gzip and SSL wrappers.

Surely you can have an optimization where you detect if the stream is backed by a raw FD and, if so, use it raw, but if not, then use the streams that call back into Python?

It is certainly possible to just pass a raw fd (int) into the library and use it if you need maximum performance. Whether we can auto-detect that a stream is backed by a raw FD is tricky. For example, gzip.open returns a file object with a fileno() function, but it returns the file descriptor of the underlying compressed file, which is unusable. So one would have to use a pretty conservative heuristic for such auto-detection. Even then, this would still leave the issues in synchronous mode, where KJ file reads cannot be interrupted, as well as the cross-platform issues.
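
A short demonstration of the pitfall (the file name is a placeholder):

```python
import gzip

# GzipFile delegates fileno() to the underlying *compressed* file, so
# reading from that descriptor yields gzip-framed bytes, not the payload.
with open("data.bin.gz", "rb") as raw:
    gz = gzip.open(raw)
    assert gz.fileno() == raw.fileno()  # same descriptor as the raw file
    print(gz.read(4))  # only gz.read() sees the decompressed payload
```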

Personally, I would be inclined to forgo such heuristics unless the resulting performance is absolutely abysmal. Python is a notoriously slow language, and most Python developers prefer convenience over speed (to my dismay). People who want maximum speed can still pass in a raw fd int.
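
For what it's worth, a sketch of what such a conservative heuristic could look like (illustrative only, not a proposed API):

```python
import io
import socket

def try_raw_fd(obj):
    # Only trust fileno() on types we positively know are fd-backed with
    # no transformation layer (unlike gzip or SSL wrappers) in between.
    if isinstance(obj, (io.FileIO, socket.socket)):
        return obj.fileno()
    return None  # caller falls back to routing reads/writes through Python
```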

@kentonv
Member

kentonv commented Apr 5, 2023

For example, gzip.open returns a file object with a fileno() function, but it returns the file descriptor of the underlying compressed file, which is unusable.

Oh man. Yeah that makes fileno() essentially unusable. IMO that's an egregious bug in the gzip package but I suppose there's nothing anyone can do about that now.

@LasseBlaauwbroek
Contributor Author

Now that #312 exists, I'd like to take this proposal one step further. I propose to entirely remove the ability to run capnproto RPC without Python's asyncio loop. In my opinion, the existing functionality does not fit capnproto's philosophy, which is fundamentally based on a well-integrated event loop, and it leads to many bugs.

Take for example thread_server.py and thread_client.py:

  • Instead of just waiting for the server to complete in blocking style, one has to poll in order to let Ctrl+C come through. See capnproto#1542 (How to handle interrupts in KJ_SYSCALL).
  • The client has to start two threads, with two separate connections to the server. This is just kind of silly...
  • The client regularly segfaults with FATAL: exception not rethrown, which is difficult, if not impossible, to fix AFAIK.
  • The need to chain promises with then() and lambdas is awkward at best.

When we remove this functionality, lots of other code can also be removed. So, I propose the following:

  • Only allow starting a KJ event loop when a Python asyncio loop is already running.
  • Remove _RemotePromise.wait() and related methods. Use await promise instead.
  • Require all capability implementation methods to be async.
  • Remove _RemotePromise.then(). The new functionality from #312 (Allow capability implementation methods to be async) can be used instead.
  • Remove the KJ timer integration. Python's asyncio timers can be used.
  • Probably remove the _Promise and _VoidPromise wrappers entirely, as well as _PromiseFulfillerPair, join_promises, and friends. They are no longer needed. The only reason to keep _RemotePromise around is promise pipelining.
  • TwoPartyClient and TwoPartyServer will only accept an _AsyncIoStream as their stream. No socket or raw fd. (See the sketch after this list for what client code might look like.)
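
To make the proposal concrete, a hypothetical sketch of what client code could look like afterwards; the AsyncIoStream constructor name and the my_schema module are illustrative, not a settled API:

```python
import asyncio
import capnp
import my_schema  # hypothetical compiled schema module

async def main():
    # The KJ loop only ever runs on top of an already-running asyncio loop.
    stream = await capnp.AsyncIoStream.create_connection("localhost", 60000)
    client = capnp.TwoPartyClient(stream)  # accepts only the stream wrapper
    cap = client.bootstrap().cast_as(my_schema.MyInterface)
    result = await cap.myMethod()  # plain await; no .wait() or .then()

asyncio.run(main())
```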

This would obviously be a large breaking change. But #310 is already breaking and I think we might as well double down. I don't personally have the cycles to ensure a lengthy deprecation phase. And I suspect nobody does...

Thoughts?

@kentonv
Member

kentonv commented Apr 24, 2023

I don't know anything about Python asyncio but abstractly speaking this proposal sounds like the right thing.

@haata
Collaborator

haata commented Apr 24, 2023

Yeah, this is a breaking change. But I agree with it. Asyncio was the conclusion I reached after lots of fighting with libcapnproto file descriptors.

If we're going to be incrementing to v2.0.0 anyways, this is the time to do it. Making pycapnp more pythonic helps everyone in the long run.

The simplification is also a huge pro IMO.

@LasseBlaauwbroek
Contributor Author

In that case, I'd suggest that a v2.0.0 feature branch is made and that, after proper review, my open PRs are merged there.

@LasseBlaauwbroek
Contributor Author

With #315, I believe the only thing remaining in this issue is to clean up the _MessageReader and MessageWriter class hierarchy, which is involved in reading and writing messages from disk. Those classes currently still rely on naked file descriptors and need to be rewritten to use Python's own file objects.
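
Until that rewrite lands, one way to read a message through an arbitrary Python stream (gzip here; the schema module and file name are placeholders) is to materialize the bytes and parse them with from_bytes, which needs no file descriptor:

```python
import gzip
import capnp
import addressbook_capnp  # hypothetical compiled schema module

# Decompress through Python's own I/O stack, then parse from memory.
with gzip.open("addressbook.bin.gz", "rb") as f:
    data = f.read()
with addressbook_capnp.AddressBook.from_bytes(data) as book:
    print(book.people[0].name)
```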

@fungs

fungs commented Feb 14, 2024

For example, gzip.open returns a file object with a fileno() function, but it returns the file descriptor of the underlying compressed file, which is unusable.

Oh man. Yeah that makes fileno() essentially unusable. IMO that's an egregious bug in the gzip package but I suppose there's nothing anyone can do about that now.

There is a bug for this in the Python issue tracker: python/cpython#100066. I hope it will be fixed, but a similar bug has existed since 2015.

@fungs

fungs commented Feb 16, 2024

I'm also looking forward to having a simple way to use (sync) file object wrappers to read and write messages. This is important for integration with other Python functionality for compression, encryption, databases, etc. My current, ugly workaround is to spawn a separate process that reads the source data stream and feeds it into a pipe or socket, which is then consumed in the parent process. UNIX pipes and sockets are low-level OS tools that provide a working fileno() method in Python. There seem to be issues with this approach in a thread instead of a process (#354).
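
A sketch of that workaround, assuming a fork-based multiprocessing start method (the Linux default) so the pipe's write end is inherited by the child; the file name is a placeholder:

```python
import gzip
import os
from multiprocessing import Process

def feed(write_fd, path):
    # Child process: decompress and push the payload into the pipe.
    with gzip.open(path, "rb") as src, os.fdopen(write_fd, "wb") as sink:
        while chunk := src.read(65536):
            sink.write(chunk)

r, w = os.pipe()
Process(target=feed, args=(w, "message.bin.gz"), daemon=True).start()
os.close(w)  # parent keeps only the read end, so EOF propagates
with os.fdopen(r, "rb") as pipe_in:
    payload = pipe_in.read()  # pipe_in has a real, usable fileno()
```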
