Proposal: Handle websocket closure while performing initial handshake #70

yutotakano · 2021-06-12T08:37:26Z

Heya Karl, thanks as always for the simple yet amazing library. This proposal is to implement a design decision to handle websocket closures during the gateway handshake stage.

Proposal

While waiting for Opcode Hello and Opcode Dispatch "Ready", discord-haskell should handle websocket close codes properly, similar to how it is done in eventStream:
https://github.com/aquarial/discord-haskell/blob/7a33ecf756c6b404f7201e229a71bfdbd408e873/src/Discord/Internal/Gateway/EventLoop.hs#L175-L183

Existing Code

Currently, the library closes with ConnClosed on lines 74-76 (while waiting for Hello):
https://github.com/aquarial/discord-haskell/blob/7a33ecf756c6b404f7201e229a71bfdbd408e873/src/Discord/Internal/Gateway/EventLoop.hs#L74-L76
and on lines 67-69 (while waiting for Ready):
https://github.com/aquarial/discord-haskell/blob/7a33ecf756c6b404f7201e229a71bfdbd408e873/src/Discord/Internal/Gateway/EventLoop.hs#L67-L69
The library only properly filters and handle the close codes in eventStream, which is only called after Hello and Ready was received as expected.

Impact

Our bot experiences periodic disconnection from Discord (which itself is fine, since it reconnects), but once in a while, when a disconnection happens during the handshake phase, the bot goes down for good and never retries. Specifically, we get a 4000 Close Request "Unknown Error" as the Response to Identify. This is not a fatal error, and should reconnect (docs say 4000 is worth reconnecting). This proposal would fix it.

A problem I might see is that the bot would keep rapidly retrying indefinitely if there is no internet connection (i think), resulting in perhaps high computation power, so a delay may be necessary.

Other libraries

discord.py
- Uses a single poll_event function everywhere to get everything from Hello, Heartbeat, Dispatch, etc. This function contains a try/catch, in which it handles close codes even during the Hello/Ready exchange, and reconnects as necessary: https://github.com/Rapptz/discord.py/blob/3b555737778f718a4106fdc87a47e0a097bab37c/discord/gateway.py#L571-L576
discord.js
- Upon opening a websocket, it immediately registers a .on("CLOSE") callback handler, that does the checking against a list of UNRESUMABLE_CLOSE_CODES, and resumes as necessary (which means even during Hello/Ready).
serenity (rust)
- Similar to python, uses a single handle_event function everywhere to get everything from Hello, Heartbeat, Dispatch, etc. If the websocket has closed, it calls handle_close_event which logs and checks as necessary, then reconnects (L474). This means it also properly reconnects upon an error during handshake.
discordgo
- Does not handle close codes during Hello/Ready/Identify (it returns err, which is caught in the defer func, and then just exits without reconnecting). It treats any and ALL connection errors there as a fatal error (similar to the current discord-haskell).

Overall, all major libraries reconnect as appropriate. The golang library was interesting, but either they may not have had reports on this (since they are not as major as py/js), or they made a conscious design decision not to reconnect.

The text was updated successfully, but these errors were encountered:

aquarial · 2021-06-13T06:03:38Z

Great idea, I appreciate the prior-art references too. Always detailed!

Also I like that you're using this library to run a bot. I started this library in 2018 trying to write a discord chat exporter, but discord.hs library was defunct. Reading through that code, it's filled with monad instances and complicated type stuff.

Very good proposal, having a busy day so I'll work on this tomorrow I think. Maybe extract a handleCloseRequest func.

Specifically, we get a 4000 Close Request "Unknown Error" as the Response to Identify. This is not a fatal error, and should reconnect (docs say 4000 is worth reconnecting)

the bot would keep rapidly retrying indefinitely if there is no internet connection, so a delay may be necessary

Nice narrow issue space. Retrying indefinably with a delay is a natural idea. I think the library shouldn't give up.

aquarial · 2021-06-14T20:59:43Z

Error codes like 4000, 1000, etc are specifically sent by the server. If the wifi cuts then the eventloop notices there's no heartbeat and tries to reconnect.

Reading the discord socket error codes, most of these handle a library error, or tell to reconnect. Seems safe to reconnect on any gateway close event code.

I pushed a commit that adds reconnecting. I'm going to test it out as well before I push a new hackage version.

extra-deps:
- github: aquarial/discord-haskell
  commit: 83cbc89e006ad91648a5f6f6d04c73ea758d028b
- emoji-0.1.0.2

yutotakano · 2021-06-18T16:34:39Z

~~Hmmm, I don't think this necessarily addresses the issue. I'll elaborate and edit this message in a few hours.~~

When the Discord docs say "Reconnect and start a new session.", I think they mean "close the websocket and start anew", which is different from "try resuming again". So, when a CloseRequest is given within ConnReconnect (L106), my theory is that you have to do a ConnStart.

https://github.com/aquarial/discord-haskell/blob/83cbc89e006ad91648a5f6f6d04c73ea758d028b/src/Discord/Internal/Gateway/EventLoop.hs#L89-L112

yutotakano · 2021-06-22T23:52:36Z

We also got this out of nowhere today (after days of running on your new commit). I'm not too sure what caused it to run the loop with ConnStart (which is the only place this error is printed) and not a ConnReconnect.

aquarial · 2021-06-23T16:14:53Z

Hm. Looks like Eventloop.hs needs a remold. Handle all the events in one large event catcher instead of separating out the eventloop. Handle the stored status edge case as well.

aquarial · 2021-11-14T05:59:01Z

I was afk in a way for a while, wanted to clean this up.

When the Discord docs say "Reconnect and start a new session.", I think they mean "close the websocket and start anew", which is different from "try resuming again". So, when a CloseRequest is given within ConnReconnect (L106), my theory is that you have to do a ConnStart

Oh right. Depends on the error code though. I've fixed up EventLoop.hs so all the CloseRequest handling is in one place now.

We got this out of nowhere today. Response to Identify must be Ready

The reconnect must have failed in a way that the library decieded to do a clean loop restart, which then failed. I've updated the EventLoop so the library will fully close the connection before restarting. Perhaps that would help?

rewriting loop

pushed with 1.9.1. been a while

yutotakano · 2021-12-15T02:01:47Z

Closing this as it seems to have been fixed, both experimentally and in code. Thank you very much!

aquarial mentioned this issue Aug 18, 2021

GuildMemberAdd and GuildMemberRemove is not working #76

Closed

yutotakano closed this as completed Dec 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Handle websocket closure while performing initial handshake #70

Proposal: Handle websocket closure while performing initial handshake #70

yutotakano commented Jun 12, 2021 •

edited

Loading

aquarial commented Jun 13, 2021

aquarial commented Jun 14, 2021

yutotakano commented Jun 18, 2021 •

edited

Loading

yutotakano commented Jun 22, 2021

aquarial commented Jun 23, 2021

aquarial commented Nov 14, 2021

yutotakano commented Dec 15, 2021

Proposal: Handle websocket closure while performing initial handshake #70

Proposal: Handle websocket closure while performing initial handshake #70

Comments

yutotakano commented Jun 12, 2021 • edited Loading

Proposal

Existing Code

Impact

Other libraries

aquarial commented Jun 13, 2021

aquarial commented Jun 14, 2021

yutotakano commented Jun 18, 2021 • edited Loading

yutotakano commented Jun 22, 2021

aquarial commented Jun 23, 2021

aquarial commented Nov 14, 2021

yutotakano commented Dec 15, 2021

yutotakano commented Jun 12, 2021 •

edited

Loading

yutotakano commented Jun 18, 2021 •

edited

Loading