Heartbeat dropping crashes the bot until manually restarted #430

Closed
dxf opened this issue Jan 13, 2022 · 12 comments
Labels
bug Something isn't working

Comments

@dxf

dxf commented Jan 13, 2022

This has been discussed in the Discord server, but I thought I'd file an issue for the sake of completeness.

When a bot can't get a heartbeat, there's no reconnection logic to bring the bot back afterwards. This means that bots, even on stable connections (I'm using a Contabo VPS), can have trouble staying online for any extended period of time.

For now, as a bodge, I'm testing GNU timeout to force-restart my bot as a pre-emptive measure. If anyone wants the shell script, let me know.
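
For anyone after a similar stopgap, the forced-restart idea can be sketched in Python. This is not the shell script mentioned above (which isn't shown in this thread), just a minimal illustration under stated assumptions: `supervise` and `bot.py` are hypothetical names, and the six-hour limit is an arbitrary example value.

```python
import subprocess
import sys
import time


def supervise(cmd, max_runtime, max_restarts, pause=0.0):
    """Run cmd repeatedly, force-stopping it after max_runtime seconds.

    Returns a list describing how each run ended, so the caller can log it.
    """
    outcomes = []
    for _ in range(max_restarts):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it is still running after `timeout` seconds.
            completed = subprocess.run(cmd, timeout=max_runtime)
            outcomes.append(("exited", completed.returncode))
        except subprocess.TimeoutExpired:
            outcomes.append(("timed out", None))
        time.sleep(pause)  # brief pause before the next start
    return outcomes


# Hypothetical usage: restart bot.py at least every 6 hours, up to 100 times.
# supervise([sys.executable, "bot.py"], max_runtime=6 * 60 * 60, max_restarts=100, pause=5)
```

Like the timeout bodge, this only masks the problem: the bot is restarted whether or not the heartbeat actually failed.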

@FayeDel added the bug label Jan 15, 2022
@CBradell

I am experiencing the same issue. My bot stops working as soon as it can't get a heartbeat, and it currently needs to be restarted about once an hour. :(

@CBradell

Also, I am using an Azure VM; should be a pretty stable connection.

@i0bs
Contributor

i0bs commented Jan 19, 2022

I believe I know the cause of this issue, and it's a simple fix. However, we're planning on refactoring the gateway, which would eliminate this problem anyhow. (Or at least in theory, it would.) There's a current experiment you can try running to see if this produces the issue or not:

  • Go to gateway.py
  • Find the Heartbeat class
  • Remove the - random() arithmetic math.

What I believe is happening is that for every HEARTBEAT packet the Gateway recognises (while the websocket connection remains valid) we run the threading Event's main protocol all over again, and each run calls random() again and gets a completely new value. Now, this call produces a float anywhere between 0 and 1, so in theory the difference is very small. However, a heartbeat is typically sent once every 43-45 seconds. If you do the math, and let's assume the average of random() is 0.10, it would only take ~45 minutes before the heartbeat is completely out of alignment. This is because we have a wait() call happening on the thread, and then it adds an additional micro value. Because of this, the offset accumulates over time and eventually leads to a zombified connection.
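
To make the arithmetic above concrete, here is a small sketch of how a per-heartbeat offset would add up. The 45-second interval and the 0.10 average are the figures assumed in this comment; the function names are made up for illustration and are not taken from gateway.py. One caveat: Python's random.random() is uniform on [0, 1), so its long-run average is 0.5 rather than 0.10, and the second function simulates that case.

```python
import random

HEARTBEAT_INTERVAL = 45.0  # seconds, upper end of the 43-45 s range quoted above


def accumulated_offset(beats, per_beat_offset):
    """Total offset if every heartbeat is shifted by the same amount."""
    return beats * per_beat_offset


def simulated_offset(beats, seed=0):
    """Total offset if every heartbeat is shifted by a fresh random() value."""
    rng = random.Random(seed)  # seeded so repeated runs give the same figure
    return sum(rng.random() for _ in range(beats))


# Number of heartbeats sent in 45 minutes at one beat per 45 seconds.
beats_in_45_min = int(45 * 60 / HEARTBEAT_INTERVAL)

print(beats_in_45_min)                            # 60 heartbeats
print(accumulated_offset(beats_in_45_min, 0.10))  # about 6 seconds in total
print(simulated_offset(beats_in_45_min))          # expected value is 30 seconds
```

This only illustrates the accumulation hypothesis; later comments in this thread report that removing the subtraction did not stop the crashes.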

This is honestly an oversight on my part, and we can probably do away entirely with the subtraction of the random value. We did this, however, because it worked well for most cases that needed us to check for jitter. The Gateway does handle network latency between you (the client) and the server, but we didn't know just how precise it was. The Developers documentation also mentioned that it was okay for us to do this, so we bit the bullet and did it anyhow.

@CBradell

There's a current experiment you can try running to see if this produces the issue or not:

  • Go to gateway.py
  • Find the Heartbeat class
  • Remove the - random() arithmetic math.

Do you mean this will introduce the error? Or that it should fix it? I am available to try this and let you know.

@CBradell

Just FYI, I removed the - random() arithmetic math from the Heartbeat class and am still having the same issue.

@i0bs
Contributor

i0bs commented Jan 19, 2022

Does the bot crash more often, less often, or about as often as before?

@CBradell

I left my bot running overnight and it wasn't working when I woke up. I'll restart it, monitor, and let you know. I can record some average times until it crashes for you.

@Chrisae9

Chrisae9 commented Jan 19, 2022

Same issue after the fix:
01/18/2022 07:46:37 PM INFO: [] Starting bot.
01/18/2022 09:42:19 PM ERROR: [run] The client was unable to send a heartbeat, closing the connection.

01/19/2022 12:35:26 AM INFO: [] Starting bot.
01/19/2022 02:45:32 AM ERROR: [run] The client was unable to send a heartbeat, closing the connection.

Seems to be crashing every 2 hours.
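
For reference, the uptime of each run can be worked out from the log timestamps above. This is a minimal sketch; the format string is an assumption inferred from the two log excerpts, and `uptime` is a hypothetical helper name.

```python
from datetime import datetime

# Timestamp layout inferred from the log lines, e.g. "01/18/2022 07:46:37 PM".
LOG_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def uptime(started, stopped):
    """Time between a 'Starting bot.' line and the following heartbeat error."""
    return datetime.strptime(stopped, LOG_FORMAT) - datetime.strptime(started, LOG_FORMAT)


runs = [
    ("01/18/2022 07:46:37 PM", "01/18/2022 09:42:19 PM"),
    ("01/19/2022 12:35:26 AM", "01/19/2022 02:45:32 AM"),
]
durations = [uptime(started, stopped) for started, stopped in runs]

for duration in durations:
    print(duration)  # 1:55:42, then 2:10:06
```

Both runs lasted roughly two hours, which matches the estimate above.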

@CBradell

I also timed my first crash at ~2 hours just now.

@i0bs
Contributor

i0bs commented Jan 20, 2022

I've identified the cause of the bug during my testing, and it will be addressed in the next release.

@CBradell

CBradell commented Jan 20, 2022

Is this still a quick fix, something I can go ahead and change to get the bot working? Or will we need to wait for the release?

Thanks for checking into this and identifying the issue!

@i0bs
Contributor

i0bs commented Jan 25, 2022

We've initiated a PR that will address this. See issue #452, which goes into the specific details of why this is happening. Because another issue has been set up for this, we will be closing this one. I highly recommend taking any further conversation from here to there.

@i0bs closed this as completed Jan 25, 2022

5 participants