Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Websocket gets stuck and doesn't resume #10174

Closed
HRK44 opened this issue Mar 13, 2024 · 10 comments
Closed

Websocket gets stuck and doesn't resume #10174

HRK44 opened this issue Mar 13, 2024 · 10 comments

Comments

@HRK44
Copy link

HRK44 commented Mar 13, 2024

Which package is this bug report for?

discord.js

Issue description

I have a function that runs continuously in a loop that checks for server members and roles.
Sometimes the shard is being destroyed but I can see in the logs it gets resumed, and the events are being replayed (function resumes as if nothing happened):

[83307] 2024-03-13T19:50:47.423Z debug: [WS => Shard 0] The gateway closed with an unexpected code 1006, attempting to resume.
[83307] 2024-03-13T19:50:47.424Z debug: [WS => Shard 0] Destroying shard
        Reason: none
        Code: 1006
        Recover: Resume
[83307] 2024-03-13T19:50:47.424Z debug: [WS => Shard 0] Connection status during destroy
        Needs closing: false
        Ready state: 3
[83307] 2024-03-13T19:50:47.931Z debug: [WS => Shard 0] Connecting to wss://gateway-us-east1-b.discord.gg?v=10&encoding=json
[83307] 2024-03-13T19:50:47.934Z debug: [WS => Shard 0] Waiting for event hello for 60000ms
[83307] 2024-03-13T19:50:48.112Z debug: [WS => Shard 0] Preparing first heartbeat of the connection with a jitter of 0.3208934412049065; waiting 13236ms
[83307] 2024-03-13T19:50:48.114Z debug: [WS => Shard 0] Resuming session
        resume url: wss://gateway-us-east1-b.discord.gg
        sequence: 182254
        shard id: 0
[83307] 2024-03-13T19:50:49.999Z debug: [WS => Shard 0] Resumed and replayed 73 events

But some other times, my function call doesn't get resumed and all I see in the logs are the heartbeats until I kill the process and restart it manually:

[74589] 2024-03-13T19:38:39.823Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 81ms.
[74589] 2024-03-13T19:39:21.068Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 75ms.
[74589] 2024-03-13T19:40:02.318Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 75ms.
[74589] 2024-03-13T19:40:43.578Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 84ms.

I'm not using any kind of custom retry mechanism (all handled by discordjs), open to any idea on why that could happen or what I can do/add to give better details, thanks.

Code sample

No response

Versions

  • discord.js: 14.14.1
  • typescript: 5.3.3
  • node: 20.10.0

Issue priority

Medium (should be fixed soon)

Which partials do you have configured?

Not applicable

Which gateway intents are you subscribing to?

GuildMembers

I have tested this issue on a development release

No response

@didinele
Copy link
Member

There's not enough relevant information here. What do you mean your function call isn't resumed?

@HRK44
Copy link
Author

HRK44 commented Mar 13, 2024

There's not enough relevant information here. What do you mean your function call isn't resumed?

I'll try to give more context:

Let's say my bot is in 5 guilds, and it's checking for some members/roles, here's some pseudo-code:

async checkUserRole() {
    const guilds = getGuilds();
    
    for (const g of guilds) {
          const members = await g.getMembers();
          //...check if member has a certain role
    }
    
    //logging here once it's done
    console.log('finished parsing 5 guilds');
}

while (true) {
    await checkUserRole();
    sleep(30s);
}

So my code will keep running this (while true), and I correctly see the logs, with sometimes some heartbeats in between (which is normal):

finished parsing 5 guilds (wait 30s)
finished parsing 5 guilds (wait 30s)
[74589] 2024-03-13T19:39:21.068Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 75ms.
finished parsing 5 guilds (wait 30s)
finished parsing 5 guilds (wait 30s)
...

Then sometimes, the shard gets disconnected, which I guess is fine, could be some network issues, then I see this:

finished parsing 5 guilds (wait 30s)
finished parsing 5 guilds (wait 30s)
finished parsing 5 guilds (wait 30s)
finished parsing 5 guilds (wait 30s)
[83307] 2024-03-13T19:50:47.423Z debug: [WS => Shard 0] The gateway closed with an unexpected code 1006, attempting to resume.
[83307] 2024-03-13T19:50:47.424Z debug: [WS => Shard 0] Destroying shard
        Reason: none
        Code: 1006
        Recover: Resume
[83307] 2024-03-13T19:50:47.424Z debug: [WS => Shard 0] Connection status during destroy
        Needs closing: false
        Ready state: 3
[83307] 2024-03-13T19:50:47.931Z debug: [WS => Shard 0] Connecting to wss://gateway-us-east1-b.discord.gg?v=10&encoding=json
[83307] 2024-03-13T19:50:47.934Z debug: [WS => Shard 0] Waiting for event hello for 60000ms
[83307] 2024-03-13T19:50:48.112Z debug: [WS => Shard 0] Preparing first heartbeat of the connection with a jitter of 0.3208934412049065; waiting 13236ms
[83307] 2024-03-13T19:50:48.114Z debug: [WS => Shard 0] Resuming session
        resume url: wss://gateway-us-east1-b.discord.gg
        sequence: 182254
        shard id: 0
[83307] 2024-03-13T19:50:49.999Z debug: [WS => Shard 0] Resumed and replayed 73 events
finished parsing 5 guilds (wait 30s)
finished parsing 5 guilds (wait 30s)
finished parsing 5 guilds (wait 30s)
finished parsing 5 guilds (wait 30s)

So it resumes the calls after reconnecting the shard.

The issue is that sometimes, it somehow gets stuck and I only see the heartbeats:

finished parsing 5 guilds (wait 30s)
finished parsing 5 guilds (wait 30s)
finished parsing 5 guilds (wait 30s)
finished parsing 5 guilds (wait 30s)
[74589] 2024-03-13T19:39:21.068Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 75ms.
[74589] 2024-03-13T19:39:21.068Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 75ms.
[74589] 2024-03-13T19:39:21.068Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 75ms.
[74589] 2024-03-13T19:39:21.068Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 75ms.
[74589] 2024-03-13T19:39:21.068Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 75ms.
[74589] 2024-03-13T19:39:21.068Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 75ms.
//indefinitely until I kill/restart process...

@HRK44
Copy link
Author

HRK44 commented Mar 13, 2024

Just happened again:

[83307] 2024-03-13T20:36:23.971Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 79ms.
[83307] 2024-03-13T20:37:05.215Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 73ms.
[83307] 2024-03-13T20:37:46.470Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 77ms.
[83307] 2024-03-13T20:38:27.721Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 78ms.
[83307] 2024-03-13T20:39:08.968Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 74ms.
[83307] 2024-03-13T20:39:50.221Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 76ms.
[83307] 2024-03-13T20:40:31.472Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 73ms.
[83307] 2024-03-13T20:41:12.727Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 77ms.
[83307] 2024-03-13T20:41:53.983Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 81ms.
[83307] 2024-03-13T20:42:35.224Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 73ms.
[83307] 2024-03-13T20:43:16.493Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 91ms.
[83307] 2024-03-13T20:43:57.731Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 79ms.
[83307] 2024-03-13T20:44:38.981Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 78ms.
[83307] 2024-03-13T20:45:20.230Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 76ms.
[83307] 2024-03-13T20:46:01.480Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 76ms.
[83307] 2024-03-13T20:46:42.731Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 76ms.
[83307] 2024-03-13T20:47:23.984Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 78ms.
[83307] 2024-03-13T20:47:55.165Z debug: [WS => Shard 0] Destroying shard
        Reason: Told to reconnect by Discord
        Code: 4200
        Recover: Resume
[83307] 2024-03-13T20:47:55.165Z debug: [WS => Shard 0] Connection status during destroy
        Needs closing: true
        Ready state: 1
[83307] 2024-03-13T20:47:55.748Z debug: [WS => Shard 0] Connecting to wss://gateway-us-east1-b.discord.gg?v=10&encoding=json
[83307] 2024-03-13T20:47:55.752Z debug: [WS => Shard 0] Waiting for event hello for 60000ms
[83307] 2024-03-13T20:47:55.971Z debug: [WS => Shard 0] Preparing first heartbeat of the connection with a jitter of 0.3551804231492781; waiting 14651ms
[83307] 2024-03-13T20:47:55.973Z debug: [WS => Shard 0] Resuming session
        resume url: wss://gateway-us-east1-b.discord.gg
        sequence: 237673
        shard id: 0
[83307] 2024-03-13T20:47:56.185Z debug: [WS => Shard 0] Resumed and replayed 43 events
[83307] 2024-03-13T20:48:10.622Z debug: [WS => Shard 0] First heartbeat sent, starting to beat every 41250ms
[83307] 2024-03-13T20:48:10.700Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 78ms.
[83307] 2024-03-13T20:48:51.947Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 74ms.
[83307] 2024-03-13T20:49:33.198Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 75ms.
[83307] 2024-03-13T20:50:14.455Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 82ms.
[83307] 2024-03-13T20:50:55.702Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 79ms.
[83307] 2024-03-13T20:51:36.961Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 87ms.
[83307] 2024-03-13T20:52:18.203Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 77ms.
[83307] 2024-03-13T20:52:59.455Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 77ms.
[83307] 2024-03-13T20:53:40.707Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 80ms.
[83307] 2024-03-13T20:54:21.957Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 77ms.
[83307] 2024-03-13T20:55:03.213Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 83ms.
[83307] 2024-03-13T20:55:44.467Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 79ms.
[83307] 2024-03-13T20:56:25.722Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 83ms.
[83307] 2024-03-13T20:57:06.974Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 84ms.
[83307] 2024-03-13T20:57:48.213Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 71ms.
[83307] 2024-03-13T20:58:29.481Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 89ms.
[83307] 2024-03-13T20:59:10.716Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 73ms.
[83307] 2024-03-13T20:59:51.971Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 76ms.
[83307] 2024-03-13T21:00:33.227Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 83ms.
[83307] 2024-03-13T21:01:14.475Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 80ms.
[83307] 2024-03-13T21:01:55.717Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 72ms.
[83307] 2024-03-13T21:02:36.972Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 76ms.
[83307] 2024-03-13T21:03:18.228Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 81ms.
[83307] 2024-03-13T21:03:59.476Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 80ms.
[83307] 2024-03-13T21:04:40.727Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 78ms.
[83307] 2024-03-13T21:05:21.978Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 80ms.
[83307] 2024-03-13T21:06:03.230Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 81ms.
[83307] 2024-03-13T21:06:44.481Z debug: [WS => Shard 0] Heartbeat acknowledged, latency of 81ms.

My pod doesn't "crash" so it's hard to detect and restart it because it stays in this "zombie" mode, seems to be connected but doesn't run my function.

@monbrey
Copy link
Member

monbrey commented Mar 13, 2024

Pseudocode is a little hard to call a minimum reproducible sample.

I assume getMembers() is going to correspond to guild.members.fetch(). This call gets all the members of the guild via the websocket - it sends a socket payload and then waits to receive sufficient GUILD_MEMBERS_CHUNK events back, until the cache is full and it's size equals the guild.memberCount.

If I had to make a guess based on the logs you've provided, Discord is kicking your shard off and telling it to resume it's session in the middle of waiting for these chunks - so they probably don't all return. It might hang waiting for the rest of those chunks.

On that though - I don't understand why you're doing this every 30 seconds - at least the fetching part. Once populated, GuildMemberUpdate events will keep the cache up to date.

@HRK44
Copy link
Author

HRK44 commented Mar 13, 2024

Pseudocode is a little hard to call a minimum reproducible sample.

I assume getMembers() is going to correspond to guild.members.fetch(). This call gets all the members of the guild via the websocket - it sends a socket payload and then waits to receive sufficient GUILD_MEMBERS_CHUNK events back, until the cache is full and it's size equals the guild.memberCount.

If I had to make a guess based on the logs you've provided, Discord is kicking your shard off and telling it to resume it's session in the middle of waiting for these chunks - so they probably don't all return. It might hang waiting for the rest of those chunks.

On that though - I don't understand why you're doing this every 30 seconds - at least the fetching part. Once populated, GuildMemberUpdate events will keep the cache up to date.

Correct, here's the relevant part of my code:

        const guild = await this.client.guilds.fetch(guildId);
        await guild.members.fetch();

        for (const data of dataarray) {
             const discRole = await guild.roles.fetch(roleId);
             //do custom logic, add/remove roles to users
        ....

I'm running it with a 30s pause because I make a lot of calls and don't want to spam too much (this is run by multiple pods on thousands of guilds)

@HRK44
Copy link
Author

HRK44 commented Mar 15, 2024

Based on your comments about the guild member chunk, if this resolves to false for some reason

if (members.size < 1_000 || (limit && fetchedMembers.size >= limit) || i === chunk.count) {
could this get stuck there since it won't resolve/clear listeners? I've also noticed OOM after multiple days/hours (depending on how many guilds I pull, average 100) on my pods but that's less of an issue because I can restart them.

@monbrey
Copy link
Member

monbrey commented Mar 15, 2024

I think you've missed my point though - there is no reason to re-fetch entire guilds of members every 30 seconds. None at all.

After populating the guild member caches once and once only, events received by subscribing to the GuildMembers intent will ensure the cache is kept up to date.

You're welcome to run the rest of your functionality every 30 seconds, I'm not commenting on that.

@HRK44
Copy link
Author

HRK44 commented Mar 15, 2024

I understand I'll optimize the code to pull from cache instead of refetching everytime (I guess using membercount === member.length), but I think there's still some underlying issue with the websocket getting stuck in that heartbeat only mode.

@Qjuh
Copy link
Contributor

Qjuh commented Mar 15, 2024

Provide a reproducible sample that causes that to happen without your spam of unnecessary member fetches and we‘ll happily take a look.

@HRK44
Copy link
Author

HRK44 commented Apr 18, 2024

I identified the issue, it was from my own code logic (slow queries clogging the process). Closing the ticket, thanks for help!

@HRK44 HRK44 closed this as completed Apr 18, 2024
@Jiralite Jiralite closed this as not planned Won't fix, can't repro, duplicate, stale Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants