Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Handle websocket closure while performing initial handshake #70

Closed
yutotakano opened this issue Jun 12, 2021 · 7 comments
Closed

Comments

@yutotakano
Copy link
Member

yutotakano commented Jun 12, 2021

Heya Karl, thanks as always for the simple yet amazing library. This proposal is to implement a design decision to handle websocket closures during the gateway handshake stage.

Proposal

While waiting for Opcode Hello and Opcode Dispatch "Ready", discord-haskell should handle websocket close codes properly, similar to how it is done in eventStream:
https://github.com/aquarial/discord-haskell/blob/7a33ecf756c6b404f7201e229a71bfdbd408e873/src/Discord/Internal/Gateway/EventLoop.hs#L175-L183

Existing Code

Currently, the library closes with ConnClosed on lines 74-76 (while waiting for Hello):
https://github.com/aquarial/discord-haskell/blob/7a33ecf756c6b404f7201e229a71bfdbd408e873/src/Discord/Internal/Gateway/EventLoop.hs#L74-L76
and on lines 67-69 (while waiting for Ready):
https://github.com/aquarial/discord-haskell/blob/7a33ecf756c6b404f7201e229a71bfdbd408e873/src/Discord/Internal/Gateway/EventLoop.hs#L67-L69
The library only properly filters and handle the close codes in eventStream, which is only called after Hello and Ready was received as expected.

Impact

Our bot experiences periodic disconnection from Discord (which itself is fine, since it reconnects), but once in a while, when a disconnection happens during the handshake phase, the bot goes down for good and never retries. Specifically, we get a 4000 Close Request "Unknown Error" as the Response to Identify. This is not a fatal error, and should reconnect (docs say 4000 is worth reconnecting). This proposal would fix it.

A problem I might see is that the bot would keep rapidly retrying indefinitely if there is no internet connection (i think), resulting in perhaps high computation power, so a delay may be necessary.

Other libraries

  • discord.py
  • discord.js
    • Upon opening a websocket, it immediately registers a .on("CLOSE") callback handler, that does the checking against a list of UNRESUMABLE_CLOSE_CODES, and resumes as necessary (which means even during Hello/Ready).
  • serenity (rust)
    • Similar to python, uses a single handle_event function everywhere to get everything from Hello, Heartbeat, Dispatch, etc. If the websocket has closed, it calls handle_close_event which logs and checks as necessary, then reconnects (L474). This means it also properly reconnects upon an error during handshake.
  • discordgo
    • Does not handle close codes during Hello/Ready/Identify (it returns err, which is caught in the defer func, and then just exits without reconnecting). It treats any and ALL connection errors there as a fatal error (similar to the current discord-haskell).

Overall, all major libraries reconnect as appropriate. The golang library was interesting, but either they may not have had reports on this (since they are not as major as py/js), or they made a conscious design decision not to reconnect.

@aquarial
Copy link
Collaborator

Great idea, I appreciate the prior-art references too. Always detailed!

Also I like that you're using this library to run a bot. I started this library in 2018 trying to write a discord chat exporter, but discord.hs library was defunct. Reading through that code, it's filled with monad instances and complicated type stuff.

Very good proposal, having a busy day so I'll work on this tomorrow I think. Maybe extract a handleCloseRequest func.

Specifically, we get a 4000 Close Request "Unknown Error" as the Response to Identify. This is not a fatal error, and should reconnect (docs say 4000 is worth reconnecting)

the bot would keep rapidly retrying indefinitely if there is no internet connection, so a delay may be necessary

Nice narrow issue space. Retrying indefinably with a delay is a natural idea. I think the library shouldn't give up.

@aquarial
Copy link
Collaborator

Error codes like 4000, 1000, etc are specifically sent by the server. If the wifi cuts then the eventloop notices there's no heartbeat and tries to reconnect.

Reading the discord socket error codes, most of these handle a library error, or tell to reconnect. Seems safe to reconnect on any gateway close event code.

I pushed a commit that adds reconnecting. I'm going to test it out as well before I push a new hackage version.

extra-deps:
- github: aquarial/discord-haskell
  commit: 83cbc89e006ad91648a5f6f6d04c73ea758d028b
- emoji-0.1.0.2

@yutotakano
Copy link
Member Author

yutotakano commented Jun 18, 2021

Hmmm, I don't think this necessarily addresses the issue. I'll elaborate and edit this message in a few hours.

When the Discord docs say "Reconnect and start a new session.", I think they mean "close the websocket and start anew", which is different from "try resuming again". So, when a CloseRequest is given within ConnReconnect (L106), my theory is that you have to do a ConnStart.

https://github.com/aquarial/discord-haskell/blob/83cbc89e006ad91648a5f6f6d04c73ea758d028b/src/Discord/Internal/Gateway/EventLoop.hs#L89-L112

@yutotakano
Copy link
Member Author

We also got this out of nowhere today (after days of running on your new commit). I'm not too sure what caused it to run the loop with ConnStart (which is the only place this error is printed) and not a ConnReconnect.

image

@aquarial
Copy link
Collaborator

Hm. Looks like Eventloop.hs needs a remold. Handle all the events in one large event catcher instead of separating out the eventloop. Handle the stored status edge case as well.

@aquarial
Copy link
Collaborator

I was afk in a way for a while, wanted to clean this up.

When the Discord docs say "Reconnect and start a new session.", I think they mean "close the websocket and start anew", which is different from "try resuming again". So, when a CloseRequest is given within ConnReconnect (L106), my theory is that you have to do a ConnStart

Oh right. Depends on the error code though. I've fixed up EventLoop.hs so all the CloseRequest handling is in one place now.

We got this out of nowhere today. Response to Identify must be Ready

The reconnect must have failed in a way that the library decieded to do a clean loop restart, which then failed. I've updated the EventLoop so the library will fully close the connection before restarting. Perhaps that would help?

rewriting loop

pushed with 1.9.1. been a while

@yutotakano
Copy link
Member Author

Closing this as it seems to have been fixed, both experimentally and in code. Thank you very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants