
Add WebSocket time-out handling using the ping mechanism #113

Merged: 8 commits merged into master from robbert-vdh/missing-pong-timeout on May 31, 2023

Conversation

robbert-vdh (Member):

As mentioned in #104, while the websockets library does have a function that spawns a thread to periodically ping clients to keep the connection alive and to make sure the client hasn't timed out, it doesn't actually implement the latter part on its own (see jaspervdj/websockets#159).

This implementation works by detecting timeouts and then cancelling the websocket app action when that happens. The timeout detection works by checking, whenever a new ping is sent, that the last pong was received within a certain time frame of the last sent ping. When it wasn't, the ping handler thread exits, which causes the app action to exit as well thanks to the use of race_ in withInteruptablePingThread. The original library-provided withPingThread runs the ping handler on its own thread with no clean way for the ping handler to affect the thread the app action runs on, because it uses withAsync without ever using the resulting async value.

The last piece of the puzzle is the pong handler, which is set through the ConnectionOptions. These options were previously shared by all connections, so a small refactor was needed to allow different options for each connection.
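To make the approach above concrete, here is a minimal sketch (illustrative only, not the actual Icepeak code: the names are made up, GHC.Clock stands in for whatever clock the real code uses, and the configurable pong timeout is omitted):

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (race_)
import Control.Monad (when)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import qualified Data.Text as T
import GHC.Clock (getMonotonicTime)
import qualified Network.WebSockets as WS

-- Pong handler: record the time of the last received pong. In the real code
-- this is hooked up per connection through the 'connectionOnPong' field of
-- 'WS.ConnectionOptions', which is why the options had to become
-- per-connection.
pongHandler :: IORef Double -> IO ()
pongHandler lastPongRef = getMonotonicTime >>= writeIORef lastPongRef

-- Run 'app' while periodically pinging the client. Before sending a new
-- ping, check that the previous ping has been answered with a pong; if it
-- hasn't, the ping loop returns and 'race_' cancels 'app' along with it.
withInterruptiblePingThread :: WS.Connection -> Int -> IO () -> IO ()
withInterruptiblePingThread conn pingInterval app = do
  startTime   <- getMonotonicTime
  lastPongRef <- newIORef startTime
  -- (The pong handler above would be registered for this connection here.)
  race_ (pingLoop lastPongRef startTime (0 :: Int)) app
  where
    pingLoop lastPongRef lastPingTime i = do
      threadDelay (pingInterval * 1000 * 1000)
      lastPong <- readIORef lastPongRef
      -- Returning here (instead of looping) is what terminates the app
      -- action, because of the surrounding 'race_'.
      when (lastPong >= lastPingTime) $ do
        now <- getMonotonicTime
        WS.sendPing conn (T.pack (show i))
        pingLoop lastPongRef now (i + 1)
```

The key point of the design is that simply returning from the ping loop is enough to tear down the whole connection handler, because race_ cancels whichever action is still running.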

Let me know if there are any questions or if anyone has suggestions on how to improve this. I found the easiest way to test this is to launch the server with short timeouts (e.g. stack run -- --websocket-ping-interval=1 --websocket-pong-timeout=1), connect to the server using websocat (websocat ws://localhost:3000 -vvvv --print-ping-rtts), and then send SIGSTOP (press Ctrl+Z) to simulate a timeout.

Fixes #104.

There is no clean or built-in way to do this, see
jaspervdj/websockets#159. This implementation
works by keeping track of the last received pong, and then checking
whether the previous ping has been answered with a pong when sending a
new ping. If that isn't the case, then the connection is terminated
early through the use of `withInteruptablePingThread`.
@robbert-vdh requested a review from diegodiv on May 11, 2023 12:59
diegodiv (Contributor) left a comment:

The code looks nice! Suggestions are mostly about typos.

As discussed offline, we can add some toggling mechanism, in case you don't need these extra threads.

(Resolved review threads on server/src/Icepeak/Server/WebsocketServer.hs and server/src/Icepeak/Server/Config.hs)
As mentioned in
<#113 (review)>.
robbert-vdh (Member, Author) commented on May 12, 2023:

> As discussed offline, we can add some toggling mechanism, in case you don't need these extra threads.

I just checked, and right now connections are actually leaked in production. If you connect to the production icepeak server that's reverse proxied through nginx and then just stop responding with pongs without closing the connection, the connection will remain active and leak resources. I tested this by connecting with websocat, suspending the process (SIGSTOP/Ctrl+Z), waiting a couple of minutes, then disconnecting from the internet, hibernating my machine, resuming from hibernation, reconnecting to the internet, and then resuming the websocat process. The connection was still alive, and it was as if nothing had happened. So I'm thinking it's probably a good idea to enable this ping-pong timeout mechanism in production, but maybe we can set the timeout to something like five or ten minutes instead of 30 seconds. Then the odds of it terminating connections unexpectedly are lower, but it will still prevent resource leaks. I'd need to check whether feedscrubber automatically reconnects closed websocket connections, though.

EDIT: Apparently our nginx timeout is also five minutes. I have not yet tested disconnecting from the internet for 10 minutes to make sure that works.

robbert-vdh (Member, Author) commented on May 15, 2023:

Thinking about this some more: if you change the pong timeout to 5 minutes or more, the current behavior shouldn't change, but it will still terminate the connections if the nginx TCP ping timeout mechanism somehow flukes out.

fatho (Member) left a comment:

Nice doc comments! Two remarks from a cursory glance:

go :: Int -> IO ()
go i = do
  threadDelay (pingInterval * 1000 * 1000)
  WS.sendPing conn (T.pack $ show i)
fatho (Member):

Couldn't we now theoretically get stuck on this one (due to the lack of timeouts) and thus never reach the place where it is supposed to exit?
Concretely: if the send buffer is already full because the other side is not responding, then this would likely block (though I didn't check) - and the default kernel timeouts for TCP connections are rather long IIRC.

IIRC one of the designs I saw used an external thread with a timer that is reset when receiving pongs and that kills the handler thread when it times out; that would not be prone to this.
(Though it also depends on how sendPing is implemented.)

robbert-vdh (Member, Author):

I'd need to check whether it uses a blocking send() or not; I also have no idea. A quick glance indicates that it uses this function, which sounds like it would block when the send buffer is full: https://hackage.haskell.org/package/network-3.1.2.9/docs/Network-Socket-ByteString.html

At worst it would behave the same as before, since the race_ used to spawn the action should prevent the ping thread from outliving the socket: the app action will terminate when the socket is closed, causing the other action to be cancelled. But I'll do some digging to figure out what would happen. If the send buffer does fill up (though with pings being only a couple of bytes long, that would take forever), we could add another MVar for the last ping time and then send the pings from a third thread, so the current ping handler thread only has to compare the two MVars.

robbert-vdh (Member, Author):

I just checked: with a default Ubuntu install it can send 342855 pings before filling the send buffer and blocking. This number will decrease slightly as time goes on (because the ping message becomes larger; it sends a sequence number as a string), but not in a significant way.

This would be fine in practice, but moving the ping sending to another thread and then comparing the values from the MVars in the interruptiblePingThread thread does not have this limitation. What do you suggest: go with the third thread just in case, or accept that with a 30 second ping interval a 20 year timeout window is probably sufficiently large?

diegodiv (Contributor):

I would go with less over-engineering, since we should never run into these limitations.

fatho (Member):

> I just checked: with a default Ubuntu install it can send 342855 pings before filling the send buffer and blocking. This number will decrease slightly as time goes on (because the ping message becomes larger; it sends a sequence number as a string), but not in a significant way.

Yes, but the same send buffer is also used for regular payloads sent by broadcasting, which are presumably much larger than ping messages, so your mileage may vary here.

diegodiv (Contributor):

Yeah, that was the issue, because we didn't decouple it enough. But I wonder if something like that could actually happen here. Maybe playing it safe is the way to go, then.

robbert-vdh (Member, Author):

I thought about this some more, and just adding a third thread may not be entirely correct, since it would need to use its own delay/timer routine, which may slowly drift out of phase compared to the thread that actually sends the pings. Using an actual timer like you suggested would work, but we don't already depend on a package that can do this, like timers. I checked the implementation, and I don't think it's that much less expensive to restart a timer there than it would be to just slap a timeout onto the WS.sendPing, which works as expected. Any objections to just doing that instead? That prevents the ping from blocking indefinitely when the send buffer is full, and it interacts nicely with the existing timeout checking function.

fatho (Member):

> I thought about this some more, and just adding a third thread may not be entirely correct, since it would need to use its own delay/timer routine, which may slowly drift out of phase compared to the thread that actually sends the pings

Doesn't the current implementation also run out of phase, because the ping thread runs in intervals of "ping interval" while the pongs are expected after the "pong timeout"?

> I checked the implementation, and I don't think it's that much less expensive to restart a timer there than it would be to just slap a timeout onto the WS.sendPing

Less expensive in terms of what? timeout also uses a timer internally. It's certainly less complex though - but what would be the timeout for that operation then?


Alternatively, the ping thread could, upon sending a ping, start a green thread that waits for a pong to be received, with a timeout.
If that timeout expires, it stops the handler thread with an async exception; otherwise, it just exits.
And if there already is a green thread waiting for a pong when a subsequent ping is sent, we just do nothing (since pongs, not pings, should reset the timeout).
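A rough sketch of this alternative (this is not what was merged; all names are hypothetical, and the bookkeeping for not starting a second watchdog is omitted):

```haskell
import Control.Concurrent (ThreadId, forkIO, killThread)
import Control.Concurrent.MVar (MVar, takeMVar)
import Control.Monad (void)
import System.Timeout (timeout)

-- After sending a ping, spawn a watchdog green thread that waits for the
-- pong handler to fill 'pongReceived'. If no pong arrives within the
-- timeout, the handler thread is stopped with an async exception
-- ('killThread' throws 'ThreadKilled'); otherwise the watchdog just exits.
startPongWatchdog :: MVar () -> ThreadId -> Int -> IO ()
startPongWatchdog pongReceived handlerThread pongTimeout =
  void $ forkIO $ do
    result <- timeout (pongTimeout * 1000 * 1000) (takeMVar pongReceived)
    case result of
      Just () -> pure ()                   -- pong arrived in time; just exit
      Nothing -> killThread handlerThread  -- no pong: cancel the handler
```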

robbert-vdh (Member, Author):

> Doesn't the current implementation also run out of phase, because the ping thread runs in intervals of "ping interval" while the pongs are expected after the "pong timeout"?

There's a bit of play between pingAction being evaluated and the on-pong handler being evaluated, but that should follow a normal distribution and not really increase or decrease over time. The phase drifting I was talking about would happen if you used another thread to check the IORef TimeSpecs written by the ping and pong handlers instead of using an actual resettable timer. But I think the timeout here is an even nicer solution since it doesn't introduce any more moving parts.

> Less expensive in terms of what? timeout also uses a timer internally. It's certainly less complex though - but what would be the timeout for that operation then?

I set the timeout to the ping interval: if the socket blocks for that long, the pingHandler will also consider the connection to have timed out, so it works out nicely without any other changes.
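A rough sketch of that change, assuming a ping loop like the one sketched earlier (the function name and return-value plumbing are illustrative, not the actual code):

```haskell
import Data.Maybe (isJust)
import qualified Data.Text as T
import qualified Network.WebSockets as WS
import System.Timeout (timeout)

-- Send a ping, but give up if the write blocks for longer than the ping
-- interval (e.g. because the TCP send buffer is full). If this returns
-- False, the timeout check in the ping loop will treat the connection as
-- timed out on the next iteration anyway.
sendPingWithTimeout :: WS.Connection -> Int -> Int -> IO Bool
sendPingWithTimeout conn pingInterval i =
  isJust <$> timeout (pingInterval * 1000 * 1000)
                     (WS.sendPing conn (T.pack (show i)))
```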

fatho (Member):

Sounds like a plan!

(Resolved review thread on server/src/Icepeak/Server/WebsocketServer.hs)
This is different from the implementation in the websockets library, but it makes more sense, and other than not spawning unnecessary threads it doesn't change the behavior.
This already happened in the old version, but it was never documented.
robbert-vdh (Member, Author):

I copied and adapted the ping thread structure from the websockets library to avoid changing behavior too much, but as @DiegoDiverio pointed out, even when the timeout detection is disabled by setting the ping interval to a negative or zero value, a (lightweight) thread is still spawned and then immediately exited. I made a small change to avoid spawning the thread when pings are disabled, and I also documented that a ping interval of 0 or less causes Icepeak to stop sending pings.
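A minimal sketch of that toggle (illustrative names only; 'withPing' stands for whatever function wraps the app action in the ping thread):

```haskell
-- When the configured ping interval is zero or negative, skip the ping
-- machinery entirely instead of spawning a thread that exits immediately.
withOptionalPingThread :: Int -> (IO () -> IO ()) -> IO () -> IO ()
withOptionalPingThread pingInterval withPing app
  | pingInterval <= 0 = app           -- pings disabled: no thread is spawned
  | otherwise         = withPing app  -- otherwise wrap the app in the ping thread
```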

The locking is not needed since partial updates aren't possible anyways.
This prevents a situation mentioned in
#113 (comment)
where the `pingAction` that checks for timeouts no longer fires when the
TCP socket's send buffer is full.
robbert-vdh (Member, Author):

I'll merge this, and then we can set the pong timeout to match the TCP keepalive timeout when deploying. Should be Fine:tm:.

@OpsBotPrime merge

OpsBotPrime (Contributor):

Pull request approved for merge by @robbert-vdh, rebasing now.

Approved-by: robbert-vdh
Auto-deploy: false
OpsBotPrime (Contributor):

Rebased as b831c1b, waiting for CI …

OpsBotPrime (Contributor):

CI job 🟡 started.

@OpsBotPrime merged commit b831c1b into master on May 31, 2023
@OpsBotPrime deleted the robbert-vdh/missing-pong-timeout branch on May 31, 2023 14:16