
Socket listeners do not get close notifications #14

Closed
agronholm opened this issue Oct 15, 2018 · 30 comments
Labels: bug (Something isn't working)
Milestone: 1.0.0

@agronholm (Owner)

If one task is waiting for a socket to become ready for reading or writing and another task closes it, the former task is left waiting forever. Trio has a native mechanism to handle this gracefully, and for the other two we could whip up a solution fairly easily.
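A minimal sketch of the hang, assuming curio's async socket wrappers (curio.socket and the async recvfrom()/close() methods) and curio.TaskGroup; the names and timing are illustrative only. On the trio backend the reader would instead be woken with an error:

import curio
from curio import socket

async def reader(sock):
    # Blocks in curio's read-wait trap and is never woken after close()
    data, addr = await sock.recvfrom(1024)

async def closer(sock):
    await curio.sleep(0.1)   # give the reader time to start waiting
    await sock.close()       # the reader is not notified and waits forever

async def main():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('localhost', 0))
    async with curio.TaskGroup() as g:
        await g.spawn(reader, sock)
        await g.spawn(closer, sock)

curio.run(main)    # hangs: the TaskGroup never finishes waiting for reader()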

@agronholm added the bug label on Oct 15, 2018
@agronholm added this to the 1.0.0 milestone on Oct 15, 2018
@agronholm (Owner, Author)

Curio has proven to be a bit more problematic in this regard. We use its native I/O waiting traps but there seems to be no way to abort this operation short of cancelling the entire task.

@dabeaz commented Oct 16, 2018

Curio only provides an async wrapper layer on top of native sockets. As such, the solution to this problem is likely going to be the same as it would be for normal synchronous code involving threads and sockets. A bit of research on this topic indicates that calling close() on a socket from a different thread than the one reading it is indeed problematic; one problem is that close() makes the underlying file descriptor go away and become invalid.

After a bit of searching around, the suggested solution to this issue is to use shutdown(SHUT_RDWR) instead of close(). I just tried this in Curio, and it works perfectly for the problem described in this bug report. Instead of hanging forever, tasks waiting to receive are instead awakened and notified of a proper EOF condition. So, I'd use that.
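The same effect can be seen with nothing but the standard library, since curio's sockets are thin wrappers over regular ones: a recv() blocked in one thread is woken with a clean EOF when another thread calls shutdown(SHUT_RDWR). (A rough illustration, not taken from curio itself.)

import socket
import threading

a, b = socket.socketpair()

def blocked_reader():
    data = a.recv(1024)          # blocks until data arrives or EOF
    print('reader got:', data)   # prints b'' -- a clean EOF, not a hang

t = threading.Thread(target=blocked_reader)
t.start()

a.shutdown(socket.SHUT_RDWR)     # wakes the blocked reader with EOF
t.join()
a.close()
b.close()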

@agronholm (Owner, Author)

Ooh, cool. Thanks, I'll try that!

@agronholm (Owner, Author)

@dabeaz did you try with only TCP sockets? I'm not having much luck with UDP sockets:

OSError: [Errno 107] Transport endpoint is not connected

@agronholm (Owner, Author)

The shutdown(2) man page says that this only works for connected sockets, so it's back to square one.
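A quick standard-library check of that limitation; the errno and message vary by platform, but on Linux an unconnected UDP socket gives ENOTCONN:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('localhost', 0))
try:
    sock.shutdown(socket.SHUT_RDWR)   # only valid on connected sockets
except OSError as exc:
    print(exc)    # e.g. [Errno 107] Transport endpoint is not connected
finally:
    sock.close()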

@dabeaz commented Oct 17, 2018

Stepping back for a moment, is this addressing an actual problem in real-world application code, or a theoretical issue in someone's head? The reason I ask is that virtually every example of network programming I've ever seen follows a rough pattern like this:

sock = get_socket_somehow()
try:
    use_socket(sock)
    ...
finally:
    sock.close()

Having an outside task come in from afar and close() a socket underneath another task/thread that's using it is weird, and frankly a bit scary (well, maybe not as scary as having a task free the memory of an object in another task, but it's not that dissimilar either). What kind of behavior is expected when this happens? For example, if you do this with threads, you'll suddenly start getting OSError exceptions related to bad file descriptors. A lot of existing network code is probably not expecting that, and it will potentially cause complexity up and down the whole I/O stack. There's also an issue related to the re-use of file descriptors after they are closed (a variant of the so-called ABA problem).

If the goal is to get a task to stop reading a socket, the proper solution (to me) seems to be to cancel it, not to pull the rug from under it by closing the socket from somewhere else.

Honestly, my inclination for Curio is to address this with a note in the documentation to close() that simply tells the user "Don't do that."
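For comparison, the cancellation-based approach described above might look roughly like this on curio (curio.spawn and Task.cancel are real curio APIs; the rest is illustrative):

import curio
from curio import socket

async def reader(sock):
    try:
        data, addr = await sock.recvfrom(1024)
    except curio.TaskCancelled:
        print('reader cancelled cleanly')
        raise

async def main():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('localhost', 0))
    task = await curio.spawn(reader, sock)
    await curio.sleep(0.1)
    await task.cancel()    # cancel the waiting task first...
    await sock.close()     # ...then it is safe to close the socket

curio.run(main)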

@agronholm (Owner, Author)

For now this is a theoretical issue, but I did encounter it while writing a trivial UDP test where the listener loop launches a new task that closes the socket (and thus finishes the async generator).

Beyond that, trio also has notify_fd_close() for this; if it didn't, I would've considered this purely overengineering on my part.

For the record, I have a working fix for this with tests added.

@Fuyukai (Contributor) commented Oct 17, 2018

I've hit this in practice in a websocket library, where a websocket connection was waiting for events off a socket but then had the socket closed under it, leaving that WS connection dead forever. What kind of API can I expose to fix that? I can't cancel the reader, since the reading task isn't under my control. I can send a close frame and hope the server closes the connection, but that's no guarantee.

That leaves only two options: either internally spawn a task for every single read operation and funnel results through a queue, cancelling that task as appropriate (which is barely even async/await at that point; it's more callback-style), or have curio actually fix this in the core.

@dabeaz commented Oct 17, 2018

Who/what was responsible for closing the underlying socket in this case? Did you do that? Was it in third-party code? Was this in Curio? Interested in more details about the exact situation.

(Note: I would think that using sock.shutdown() instead of sock.close() would probably handle this case more gracefully.)

@Fuyukai (Contributor) commented Oct 17, 2018

It was me, in curio.

It was a Websocket class which internally wrapped a curio socket. An application could iterate over it with async for message in websocket or await websocket.next_event(), both of which read from the socket, and could close the websocket with websocket.close(), which closed the socket. Closing the socket would often cause the reader loop to hang forever.

@njsmith (Collaborator) commented Oct 18, 2018

Here's the Trio PR where we added it, if you want to get context: python-trio/trio#460

@Fuyukai is one of three different people who got bitten by this and reported issues... it's definitely a footgun. The kernel already knows who's blocked on any given fd at any given moment, so it's easy for it to error those out if the fd is about to go away. And even if you think that it's a mistake for one task to close an fd that another task is waiting on, if it does happen then it's not very convenient to have the waiting task hang forever, and then trigger weird errors later after the fd number is reassigned to a new file.
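For reference, a sketch of the Trio mechanism being referred to. At the time of this thread the hook lived in trio.hazmat as notify_fd_close(); in current releases it appears as trio.lowlevel.notify_closed(), used roughly like this:

import socket
import trio

async def waiter(sock):
    try:
        await trio.lowlevel.wait_readable(sock)
    except trio.ClosedResourceError:
        print('waiter was told the socket is going away')

async def main():
    a, b = socket.socketpair()
    async with trio.open_nursery() as nursery:
        nursery.start_soon(waiter, a)
        await trio.sleep(0.1)              # let the waiter block on the fd
        trio.lowlevel.notify_closed(a)     # wake anyone waiting on this fd
        a.close()
        b.close()

trio.run(main)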

@smurfix (Collaborator) commented Oct 18, 2018

@dabeaz sock.shutdown works perfectly, but only for connected sockets. You can't do that for UDP. I plan to adapt a KNX library that looks like it will have this problem …

@dabeaz commented Oct 18, 2018

On UDP: If you're in a position where a task can call sock.close() on a UDP socket being used by another task, why wouldn't you also be in a position where you could simply cancel that task instead? What's so special about using close()? Just cancel it.

@Fuyukai (Contributor) commented Oct 18, 2018

The problem with "just cancel it" is that every single wrapper library then needs to either perform weird internal task-spawning gymnastics to work around curio's inability to close a socket safely, or make the user track the tasks instead. Either way, one of the two ends up building an entire supervision layer for tasks that need to be cancelled, until a whole task supervision library has been re-invented on top of curio. And that applies every single time a socket is interacted with and potentially closed (i.e. every single time any library implements a protocol over a socket).

Or, curio could add a mechanism to safely close the socket, and then libraries and users wouldn't constantly have to dodge this footgun in every piece of code that touches a socket that might ever be closed at some layer of the abstraction stack.

@dabeaz commented Oct 18, 2018

I'm sorry, but I'm not getting this. The whole point of having cancellation in a library like Curio is so that you don't have to build some kind of hacked-up, half-broken cancellation scheme based on close(), sentinel values, or some other madness. If you want something to stop, cancel it.

I'm also not getting why it's now suddenly a good idea to go around haphazardly closing resources that are being used by other tasks/threads. Is there now a "Lucy" design pattern based on the football handling in the Peanuts cartoon?

I don't mean to be overly dense here, but closing resources and handling tasks in this way seems like a really bad idea.

@agronholm (Owner, Author)

There are clearly two distinct schools of thought at work here. Me? I'm just munching popcorn until consensus is reached. That said, I think we can all agree that leaving a task waiting forever on a socket that has been closed is bad. Another approach would be to detect that another task is waiting on the socket and raise an exception in the task that attempts to close it. The same thing would probably be worth doing if a task tries to read from or write to the socket while another task is waiting on it.

@dabeaz commented Oct 18, 2018

Curio already raises an error for the case of a task trying to read/write on a socket while another task is waiting to perform the same operation on the same socket. Extending close() to have similar behavior might not be too hard.

@Fuyukai (Contributor) commented Oct 18, 2018

If you're writing a library that implements a protocol on top of a socket, how are you meant to cancel the thing reading from your protocol?

Say, given this ABC:

class SomeProto(object):
    def __init__(self, socket):
        ...

    async def read_event(self) -> SomeEvent:
        ...

    async def send_event(self, evt: SomeEvent):
        ...

    async def close(self):
        ...   # does cleanup and then closes the socket

Somebody else's code calls read_event, but then they also call close in response to something else. read_event now has a chance of hanging forever. How is the implementation of the ABC meant to cancel the reader task now?

@dabeaz commented Oct 18, 2018

I guess a more general design question is whether or not the implementation of a protocol should be combined together with aspects of task management. For example, is closing a protocol effectively the same as cancelling a task using it?

@smurfix (Collaborator) commented Oct 18, 2018

You might not know which task is using the protocol, and you might want to have the reading (or writing) procedure raise a "real" exception instead of a cancellation.

For instance, a common pattern for multiplexed protocols is that the client grabs a lock, writes some data, adds itself to the response-handling queue, and then waits for the response to arrive. If there is none, it wants to get an exception. If the write locks up, a keepalive handler will close the socket and then send an error to all queue entries.

I do not want to be forced to create yet another queue and a separate "writer" task just because closing the socket results in a hung-up writer. That's an annoying rewrite when converting asyncio and/or trio code to anyio.
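A rough sketch of that multiplexed request/response pattern (all names invented for illustration, written against curio's Lock/Event primitives). The crux is the last method: the keepalive handler closes the transport and fails every pending request, which only works cleanly if the task blocked on the socket is actually woken by the close:

import curio

class Multiplexer:
    def __init__(self, sock):
        self.sock = sock
        self.lock = curio.Lock()
        self.pending = {}                  # request id -> waiting entry

    async def request(self, req_id, payload):
        entry = {'event': curio.Event(), 'result': None, 'error': None}
        async with self.lock:              # serialize writers
            self.pending[req_id] = entry
            await self.sock.sendall(payload)   # may block if the peer stalls
        await entry['event'].wait()        # woken by the reader task or keepalive
        if entry['error'] is not None:
            raise entry['error']
        return entry['result']

    async def connection_lost(self):
        # Called by the keepalive handler when the connection looks dead:
        # close the socket, then fail every queued request with a real error.
        await self.sock.close()
        for entry in self.pending.values():
            entry['error'] = ConnectionError('connection lost')
            await entry['event'].set()
        self.pending.clear()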

@dabeaz commented Oct 18, 2018

But would using sock.shutdown() not work in this case? I don't mean to keep coming back to that (and I get that it doesn't work with UDP), but this whole idea of calling close() on a socket from wherever and whenever you want seems fraught with problems. Is it the I/O equivalent of calling free() on some block of memory without any regard to what might be using it at the time? I don't know. It feels that way, though.

@smurfix (Collaborator) commented Oct 18, 2018

> But would using sock.shutdown() not work in this case?

There are request/response protocols that don't use TCP.

We're not talking about arbitrary file descriptor integers here, we're talking about sockets, or some other Python-level data structure. If you try to use the thing after it's been closed, you get an error – presumably because the act of closing sets the file descriptor to -1, sets a _closed flag, or whatever. All we want here is to also get an error if the thing is being waited on when it gets closed.

It's not as if people habitually call fileno() on things and then use that for their own nefarious purposes…

@dabeaz commented Oct 18, 2018

I get the "not everything is TCP" bit. However, I'm still not buying the argument that letting arbitrary tasks walk around and call close() on sockets whenever they want is a graceful way to properly coordinate the shutdown of a protocol or the tasks that are using it. I wouldn't teach people to use sockets that way. Is some book/tutorial somewhere telling people to code in this manner?

I'm not trying to be overly difficult here, but you have these nice async libraries with cancellation. Yet instead of using that, we're doing this? That makes no sense.

@Fuyukai (Contributor) commented Oct 18, 2018

In order to leverage this particular nice async library with cancellation, you have to set your wrapper up like this:

async def read_event(self) -> SomeEvent:
    task = await curio.spawn(self._real_read_event)
    self.tasks.append(task)
    return await task.join()

async def _real_read_event(self):
    # read and feed state machine, etc
    return some_event

async def close(self):
    for task in self.tasks:
        await task.cancel()

Or you have to mark close as "doesn't actually work" and make the user track all their tasks manually instead, which sort of defeats half the point of using this nice async library.

@dabeaz commented Oct 18, 2018

Well, that's actually not far from what I might have in mind. Proper task management. Maybe it's formulated slightly differently, but basically that's the gist of it.

async def read_event(self):
    current = await curio.current_task()
    self.tasks.add(current)
    try:
        # Whatever it is
        return some_event
    finally:
        self.tasks.remove(current)

Task cancellation. Not a bunch of code worrying about closed sockets.
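The matching close() for that registration pattern would presumably cancel whatever is in self.tasks before closing the socket (a hypothetical completion of the sketch above; self.sock is assumed to hold the wrapped socket):

async def close(self):
    # Cancel every task currently blocked inside read_event(), then close.
    for task in list(self.tasks):
        await task.cancel()
    await self.sock.close()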

@smurfix (Collaborator) commented Oct 19, 2018

Sorry for being a bit blunt here, but this is exactly the sort of housekeeping dance-around that we (meaning, people who have been using asyncio) are all too familiar with.

So instead of the cognitive load of adding ten lines of hopefully-correct, not-obviously-composable, and definitely-not-well-testable boilerplate to every piece of code that accesses a socket, with Trio I can simply close the socket and be sure that all users get what they deserve: an exception stating that the socket is now closed, instead of, say, a cancellation that can mean anything.

I don't want my library to force any task management on me; it's difficult enough to structure tasks in a Real Application even without such constraints. Instead, I want my library to do basic socket management for me – that's the reason it has proper socket and file objects in the first place, instead of a random and reused file descriptor integer.

In Trio, coordination happens when I close the thing. I might add some exception handlers for ClosedResourceError, but that's all; I'm done. Problem solved.

Thus, we ask you to please solve this problem in curio, instead of forcing every user to solve it for themselves. Thank you.

@agronholm (Owner, Author) commented Oct 19, 2018

> Thus, we ask you to please solve this problem in curio, instead of forcing every user to solve it for themselves. Thank you.

More accurately, I would like curio to at least give me the primitives I need to implement such semantics on anyio. I'm not asking curio to change its philosophy.

For the record, this is how I currently handle socket close notifications on curio:

import curio

class Socket(BaseSocket):
    _reader_tasks = {}
    _writer_tasks = {}

    async def _wait_readable(self):
        # Register the current task so close() can find and cancel it
        task = await curio.current_task()
        self._reader_tasks[self._raw_socket] = task
        try:
            await curio.traps._read_wait(self._raw_socket)
        except curio.TaskCancelled:
            # If the socket was closed out from under us, translate the
            # cancellation into anyio's ClosedResourceError
            if self._raw_socket.fileno() == -1:
                raise ClosedResourceError from None
            else:
                raise
        finally:
            del self._reader_tasks[self._raw_socket]

    async def _notify_close(self) -> None:
        task = Socket._reader_tasks.get(self._raw_socket)
        if task:
            await task.cancel(blocking=False)

        task = Socket._writer_tasks.get(self._raw_socket)
        if task:
            await task.cancel(blocking=False)

The close() coroutine method in BaseSocket calls _notify_close() before closing the socket.
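That close() might look something like the following (a hypothetical sketch; the actual BaseSocket implementation in anyio may differ):

async def close(self) -> None:
    await self._notify_close()    # cancel tasks blocked in _wait_readable/_wait_writable
    self._raw_socket.close()      # nobody is left waiting on the fd now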

@dabeaz commented Oct 19, 2018

From the Curio FAQ: "Curio remains committed to not doing much of anything the best it can."

When I wrote that, I was mainly thinking about HTTP. However, I think it also applies to bolting some kind of muddled mess of code related to the semantics of close/cancel onto the side of sockets. I don't really want to do it. Sorry.

That said, it's important to emphasize that Curio is really more of a low-level library for building your own framework or your own I/O primitives. There's nothing particularly sacred about the Curio socket implementation. In fact, the whole point of Curio "not doing anything" was for it to stay out of the way of you doing things. Build a different socket class if it makes sense. Or use inheritance. I look at the Socket class above and it makes perfect sense to me. If that's the behavior you want, then do that.

I've got some ideas on how I might make it easier to get some useful information out of the Curio kernel that can help with the above implementation, but I'm not inclined to expand Curio sockets with a bunch of new behavior beyond that.

@dabeaz commented Oct 26, 2018

A somewhat random aside: a tricky thread cancellation problem came up in a class I was teaching this week. Perhaps I was thinking of this thread, but in trying to solve it, I suggested that maybe you could just close the associated file and hope that the thread would error out of whatever it was waiting for. Well, that definitely worked, except that the "error out" was actually a segmentation fault of the entire interpreter. So, I think I still stand by my original thoughts on this issue. For what it's worth, the thread did, in fact, die.

@njsmith (Collaborator) commented Oct 26, 2018

@dabeaz haha wut. I've never managed to get closing a file to affect a waiting thread at all. What on Earth was this thread doing?
