
Add WebSocket time-out handling using the ping mechanism #113

Merged: 8 commits merged into master from robbert-vdh/missing-pong-timeout on May 31, 2023

Conversation

robbert-vdh (Member):

As mentioned in #104, while the websockets library does have a function that spawns a thread to periodically ping clients to keep the connection alive and to make sure the client hasn't timed out, it doesn't actually implement the latter part on its own (see jaspervdj/websockets#159).

This implementation works by detecting timeouts and then cancelling the websocket app action when that happens. The timeout detection works by checking, whenever a new ping is sent, that the last pong was received within a certain time frame of the last sent ping. When it wasn't, the ping handler thread exits, which causes the app action to exit as well thanks to the use of race_ in withInteruptablePingThread. The original library-provided withPingThread runs the ping handler on its own thread with no clean way for the ping handler to affect the thread the app action runs on, because it uses withAsync without ever using the resulting async value.

The last piece of the puzzle is the pong handler, which is set through the ConnectionOptions. These options were previously shared by all connections, so a small refactor was needed to allow different options for each connection.
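To make the approach above concrete, here is a minimal sketch (illustrative only, not the actual Icepeak code: the names are made up, GHC.Clock stands in for whatever clock the real code uses, and the configurable pong timeout is omitted):

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (race_)
import Control.Monad (when)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import qualified Data.Text as T
import GHC.Clock (getMonotonicTime)
import qualified Network.WebSockets as WS

-- Pong handler: record the time of the last received pong. In the real code
-- this is hooked up per connection through the 'connectionOnPong' field of
-- 'WS.ConnectionOptions', which is why the options had to become
-- per-connection.
pongHandler :: IORef Double -> IO ()
pongHandler lastPongRef = getMonotonicTime >>= writeIORef lastPongRef

-- Run 'app' while periodically pinging the client. Before sending a new
-- ping, check that the previous ping has been answered with a pong; if it
-- hasn't, the ping loop returns and 'race_' cancels 'app' along with it.
withInterruptiblePingThread :: WS.Connection -> Int -> IO () -> IO ()
withInterruptiblePingThread conn pingInterval app = do
  startTime   <- getMonotonicTime
  lastPongRef <- newIORef startTime
  -- (The pong handler above would be registered for this connection here.)
  race_ (pingLoop lastPongRef startTime (0 :: Int)) app
  where
    pingLoop lastPongRef lastPingTime i = do
      threadDelay (pingInterval * 1000 * 1000)
      lastPong <- readIORef lastPongRef
      -- Returning here (instead of looping) is what terminates the app
      -- action, because of the surrounding 'race_'.
      when (lastPong >= lastPingTime) $ do
        now <- getMonotonicTime
        WS.sendPing conn (T.pack (show i))
        pingLoop lastPongRef now (i + 1)
```

The key point of the design is that simply returning from the ping loop is enough to tear down the whole connection handler, because race_ cancels whichever action is still running.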

Let me know if there are any questions or if anyone has suggestions on how to improve this. I found the easiest way to test this is to launch the server with short timeouts (e.g. stack run -- --websocket-ping-interval=1 --websocket-pong-timeout=1), connect to the server using websocat (websocat ws://localhost:3000 -vvvv --print-ping-rtts), and then send SIGSTOP (press Ctrl+Z) to simulate a timeout.

Fixes #104.

There is no clean or built-in way to do this, see
jaspervdj/websockets#159. This implementation
works by keeping track of the last received pong, and then checking
whether the previous ping has been answered with a pong when sending a
new ping. If that isn't the case, then the connection is terminated
early through the use of `withInteruptablePingThread`.
@robbert-vdh requested a review from diegodiv on May 11, 2023 12:59
diegodiv (Contributor) left a comment:

The code looks nice! Suggestions are mostly about typos.

As discussed offline, we can add some toggling mechanism, in case you don't need these extra threads.

(Resolved review threads on server/src/Icepeak/Server/WebsocketServer.hs and server/src/Icepeak/Server/Config.hs)
As mentioned in
<#113 (review)>.
robbert-vdh (Member, Author) commented on May 12, 2023:

> As discussed offline, we can add some toggling mechanism, in case you don't need these extra threads.

I just checked, and right now connections are actually leaked in production. If you connect to the production icepeak server that's reverse proxied through nginx and then just stop responding with pongs without closing the connection, the connection will remain active and leak resources. I tested this by connecting with websocat, suspending the process (SIGSTOP/Ctrl+Z), waiting a couple of minutes, then disconnecting from the internet, hibernating my machine, resuming from hibernation, reconnecting to the internet, and then resuming the websocat process. The connection was still alive, and it was as if nothing had happened. So I'm thinking it's probably a good idea to enable this ping-pong timeout mechanism in production, but maybe we can set the timeout to something like five or ten minutes instead of 30 seconds. Then the odds of it terminating connections unexpectedly are lower, but it will still prevent resource leaks. I'd need to check whether feedscrubber automatically reconnects closed websocket connections, though.

EDIT: Apparently our nginx timeout is also five minutes. I have not yet tested disconnecting from the internet for 10 minutes to make sure that works.

robbert-vdh (Member, Author) commented on May 15, 2023:

Thinking about this some more: if you change the pong timeout to 5 minutes or more, the current behavior shouldn't change, but it will still terminate the connections if the nginx TCP ping timeout mechanism somehow flukes out.

fatho (Member) left a comment:

Nice doc comments! Two remarks from a cursory glance:

go :: Int -> IO ()
go i = do
  threadDelay (pingInterval * 1000 * 1000)
  WS.sendPing conn (T.pack $ show i)
fatho (Member):

Couldn't we now theoretically get stuck on this one (due to the lack of timeouts) and thus never reach the place where it is supposed to exit?
Concretely: if the send buffer is already full because the other side is not responding, then this would likely block (though I didn't check) - and the default kernel timeouts for TCP connections are rather long IIRC.

IIRC one of the designs I saw used an external thread with a timer that is reset when receiving pongs and that kills the handler thread when it times out; that would not be prone to this.
(Though it also depends on how sendPing is implemented.)

robbert-vdh (Member, Author):

I'd need to check whether it uses a blocking send() or not; I also have no idea. A quick glance indicates that it uses this function, which sounds like it would block when the send buffer is full: https://hackage.haskell.org/package/network-3.1.2.9/docs/Network-Socket-ByteString.html

At worst it would behave the same as before, since the race_ used to spawn the action should prevent the ping thread from outliving the socket: the app action will terminate when the socket is closed, causing the other action to be cancelled. But I'll do some digging to figure out what would happen. If the send buffer does fill up (though with pings being only a couple of bytes long, that would take forever), we could add another MVar for the last ping time and then send the pings from a third thread, so the current ping handler thread only has to compare the two MVars.

robbert-vdh (Member, Author):

I just checked: with a default Ubuntu install it can send 342855 pings before filling the send buffer and blocking. This number will decrease slightly as time goes on (because the ping message becomes larger; it sends a sequence number as a string), but not in a significant way.

This would be fine in practice, but moving the ping sending to another thread and then comparing the values from the MVars in the interruptiblePingThread thread does not have this limitation. What do you suggest: go with the third thread just in case, or accept that with a 30 second ping interval a 20 year timeout window is probably sufficiently large?

diegodiv (Contributor):

I would go with less over-engineering, since we should never run into these limitations.

fatho (Member):

> I just checked: with a default Ubuntu install it can send 342855 pings before filling the send buffer and blocking. This number will decrease slightly as time goes on (because the ping message becomes larger; it sends a sequence number as a string), but not in a significant way.

Yes, but the same send buffer is also used for regular payloads sent by broadcasting, which are presumably much larger than ping messages, so your mileage may vary here.

diegodiv (Contributor):

Yeah, that was the issue, because we didn't decouple it enough. But I wonder if something like that could actually happen here. Maybe playing it safe is the way to go, then.

robbert-vdh (Member, Author):

I thought about this some more, and just adding a third thread may not be entirely correct, since it would need to use its own delay/timer routine, which may slowly drift out of phase compared to the thread that actually sends the pings. Using an actual timer like you suggested would work, but we don't already depend on a package that can do this, like timers. I checked the implementation, and I don't think it's that much less expensive to restart a timer there than it would be to just slap a timeout onto the WS.sendPing, which works as expected. Any objections to just doing that instead? That prevents the ping from blocking indefinitely when the send buffer is full, and it interacts nicely with the existing timeout checking function.

fatho (Member):

> I thought about this some more, and just adding a third thread may not be entirely correct, since it would need to use its own delay/timer routine, which may slowly drift out of phase compared to the thread that actually sends the pings

Doesn't the current implementation also run out of phase, because the ping thread runs in intervals of "ping interval" while the pongs are expected after the "pong timeout"?

> I checked the implementation, and I don't think it's that much less expensive to restart a timer there than it would be to just slap a timeout onto the WS.sendPing

Less expensive in terms of what? timeout also uses a timer internally. It's certainly less complex though - but what would be the timeout for that operation then?


Alternatively, the ping thread could, upon sending a ping, start a green thread that waits for a pong to be received, with a timeout.
If that timeout expires, it stops the handler thread with an async exception; otherwise, it just exits.
And if there already is a green thread waiting for a pong when a subsequent ping is sent, we just do nothing (since pongs, not pings, should reset the timeout).
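A rough sketch of this alternative (this is not what was merged; all names are hypothetical, and the bookkeeping for not starting a second watchdog is omitted):

```haskell
import Control.Concurrent (ThreadId, forkIO, killThread)
import Control.Concurrent.MVar (MVar, takeMVar)
import Control.Monad (void)
import System.Timeout (timeout)

-- After sending a ping, spawn a watchdog green thread that waits for the
-- pong handler to fill 'pongReceived'. If no pong arrives within the
-- timeout, the handler thread is stopped with an async exception
-- ('killThread' throws 'ThreadKilled'); otherwise the watchdog just exits.
startPongWatchdog :: MVar () -> ThreadId -> Int -> IO ()
startPongWatchdog pongReceived handlerThread pongTimeout =
  void $ forkIO $ do
    result <- timeout (pongTimeout * 1000 * 1000) (takeMVar pongReceived)
    case result of
      Just () -> pure ()                   -- pong arrived in time; just exit
      Nothing -> killThread handlerThread  -- no pong: cancel the handler
```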

robbert-vdh (Member, Author):

> Doesn't the current implementation also run out of phase, because the ping thread runs in intervals of "ping interval" while the pongs are expected after the "pong timeout"?

There's a bit of play between pingAction being evaluated and the on-pong handler being evaluated, but that should follow a normal distribution and not really increase or decrease over time. The phase drifting I was talking about would happen if you used another thread to check the IORef TimeSpecs written by the ping and pong handlers instead of using an actual resettable timer. But I think the timeout here is an even nicer solution since it doesn't introduce any more moving parts.

> Less expensive in terms of what? timeout also uses a timer internally. It's certainly less complex though - but what would be the timeout for that operation then?

I set the timeout to the ping interval: if the socket blocks for that long, the pingHandler will also consider the connection to have timed out, so it works out nicely without any other changes.
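A rough sketch of that change, assuming a ping loop like the one sketched earlier (the function name and return-value plumbing are illustrative, not the actual code):

```haskell
import Data.Maybe (isJust)
import qualified Data.Text as T
import qualified Network.WebSockets as WS
import System.Timeout (timeout)

-- Send a ping, but give up if the write blocks for longer than the ping
-- interval (e.g. because the TCP send buffer is full). If this returns
-- False, the timeout check in the ping loop will treat the connection as
-- timed out on the next iteration anyway.
sendPingWithTimeout :: WS.Connection -> Int -> Int -> IO Bool
sendPingWithTimeout conn pingInterval i =
  isJust <$> timeout (pingInterval * 1000 * 1000)
                     (WS.sendPing conn (T.pack (show i)))
```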

fatho (Member):

Sounds like a plan!

(Resolved review thread on server/src/Icepeak/Server/WebsocketServer.hs)
This is different from the implementation in the websockets library, but it makes more sense, and other than not spawning unnecessary threads it doesn't change the behavior.
This already happened in the old version, but it was never documented.
robbert-vdh (Member, Author):

I copied and adapted the ping thread structure from the websockets library to avoid changing behavior too much, but as @DiegoDiverio pointed out, even when the timeout detection is disabled by setting the ping interval to a negative or zero value, a (lightweight) thread is still spawned and then immediately exited. I made a small change to avoid spawning the thread when pings are disabled, and I also documented that a ping interval of 0 or less causes Icepeak to stop sending pings.
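A minimal sketch of that toggle (illustrative names only; 'withPing' stands for whatever function wraps the app action in the ping thread):

```haskell
-- When the configured ping interval is zero or negative, skip the ping
-- machinery entirely instead of spawning a thread that exits immediately.
withOptionalPingThread :: Int -> (IO () -> IO ()) -> IO () -> IO ()
withOptionalPingThread pingInterval withPing app
  | pingInterval <= 0 = app           -- pings disabled: no thread is spawned
  | otherwise         = withPing app  -- otherwise wrap the app in the ping thread
```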

The locking is not needed since partial updates aren't possible anyways.
This prevents a situation mentioned in
#113 (comment)
where the `pingAction` that checks for timeouts no longer fires when the
TCP socket's send buffer is full.
robbert-vdh (Member, Author):

I'll merge this, and then we can set the pong timeout to match the TCP keepalive timeout when deploying. Should be Fine:tm:.

@OpsBotPrime merge

OpsBotPrime (Contributor):

Pull request approved for merge by @robbert-vdh, rebasing now.

Approved-by: robbert-vdh
Auto-deploy: false
OpsBotPrime (Contributor):

Rebased as b831c1b, waiting for CI …

OpsBotPrime (Contributor):

CI job 🟡 started.

@OpsBotPrime merged commit b831c1b into master on May 31, 2023
@OpsBotPrime deleted the robbert-vdh/missing-pong-timeout branch on May 31, 2023 14:16