Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarifying how sharding works #6853

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

GitMeep
Copy link
Contributor

@GitMeep GitMeep commented May 10, 2024

I and at least one other user in this Forum Channel thread on the Discord Developers server were confused about how sharding worked with different num_shards for different sessions. I have rewritten some of the sharding explanation to hopefully be a bit more clear about how it actually works, based on discussion with @Zoddo in the aforementioned thread. Please correct anything I may have gotten wrong.

Here's a screenshot of the original discussion with a transcript:

discord com_channels_613425648685547541_1208012213491736576 (1)

Jovan OP — 02/16/2024 12:29 PM
It says that you can start multiple shards with the same shard_id or start shards with different num_shards.
Well which num_shards is Discord using in the formula?
What happens when you have multiple shards with the same shard_id, how is that handled?

https://discord.com/developers/docs/topics/gateway#sharding-sharding-formula

---

meep — 05/10/2024 1:50 PM
I am wondering about this too. On the whole, I am having a hard time understanding this entire paragraph in the documentation:

Note that num_shards does not relate to (or limit) the total number of potential sessions. It is only used for routing traffic. As such, sessions do not have to be identified in an evenly-distributed manner when sharding. You can establish multiple sessions with the same [shard_id, num_shards], or sessions with different num_shards values. This allows you to create sessions that will handle more or less traffic for more fine-tuned load balancing, or to orchestrate "zero-downtime" scaling/updating by handing off traffic to a new deployment of sessions with a higher or lower num_shards count that are prepared in parallel.

Let's say 3 Gateway sessions exist that have identified with shard values of [0,3], [1,3] and [2,3], repspectively. Then, 5 new Gateway sessions are started with [0,5], [1,5], [2,5], [3,5] and [4,5].
When an event happens, how is it then decided which session it is sent to?

Let's say the right side of the formula below is simply evaluated to decide which shard_id to send the event to.
shard_id = (guild_id >> 22) % num_shards

We then have the problem of deciding which num_shards to use. Are all unique values from the currently active sessions used, resulting in potentially multiple different shard_id's, or is the maximum value used, or perhaps the most recent? Also, would the event be sent to every session with a shard_id matching the calculated value(s) or just the most recent one?

Perhaps the strategy is different, and the sharding formula should be treated as a comparison, in programming terms:
shard_id == (guild_id >> 22) % num_shards

This would then be run for every active session to see whether the pair of [shard_id, num_shards] makes the comparison true. We then still have the same problem of deciding whether to send the event to all session that match or just the most recent session.

Those are just the possibilities I see. It would be great if someone could clarify exactly how it is decided which session or sessions that an event is sent to. :advaith_anim: nudge nudge :advaith_anim: 

Thank you

---

Zoddo — 05/10/2024 2:51 PM
Let's say 3 Gateway sessions exist that have identified with shard values of [0,3], [1,3] and [2,3], repspectively. Then, 5 new Gateway sessions are started with [0,5], [1,5], [2,5], [3,5] and [4,5].
When an event happens, how is it then decided which session it is sent to?
events are sent to all matching shards.
So in your case, events from this server (613425648685547541) will be sent to both [2,3] and [4,5}
We then have the problem of deciding which num_shards to use. Are all unique values from the currently active sessions used, resulting in potentially multiple different shard_id's, or is the maximum value used, or perhaps the most recent? Also, would the event be sent to every session with a shard_id matching the calculated value(s) or just the most recent one?
Actually, that's not an issue because it doesn't work that way. It's not "well, I have an event to send to this bot... to which shard I should send it?".
Instead, when you IDENTIFY over the gateway, it gets the list of guilds your bot is in and that match the provided [shard_id, num_shards]. Then your gateway session will subscribe to these guilds' events.
So, when an event happens in a guild, the event just fans out to all subscribed sessions... 

Zoddo — 05/10/2024 2:58 PM
It's an implementation detail, but that's explain why you can mix multiple shards with different num_shards, and why you can also have multiple shards all receiving events for a single guild

---

meep — 05/10/2024 3:14 PM
Okay, that makes sense. Thank you for explaining.

I'll just sum it up to check my understanding:
When a new Gateway session is started by sending an Identify event, the backend goes through every server that the bot is in and subscribes the session to events from those where the comparison shard_id == (guild_id >> 22) % num_shards is true using the corresponding guild_id and the provided shard_id and num_shards. Presumably, this also happens when the bot joins a new server.
Then, whenever an event happens in a server, the backend checks which sessions are subscribed to it (and have the corresponding intent) and then sends the event to every session that matches. It is then the bot developers responsibility to handle multiple shards potentially receiving the same event.
If I understand correctly, this also means that it is possible to start just a single shard with [0,2], thus causing the bot to be offline in (approximately) half of all servers. 

---

Zoddo — 05/10/2024 3:16 PM
yep, you got it 🙂
If I understand correctly, this also means that it is possible to start just a single shard with [0,2], thus causing the bot to be offline in (approximately) half of all servers.
Yeah, you can observe that when large bots (that have thousands of shards) are recovering from outages. The bot will start to appear online in some servers, but will still be offline in others.
btw, this tend to cause confusion among users, when they don't understand why the bot is offline in their server, but online on the bot's support server, for example.

---

meep — 05/10/2024 3:20 PM
Yeah I can imagine 😅
I think I'll submit a pull request to the api docs to clarify the sharding section on how events are sent to shards

---

Zoddo — 05/10/2024 3:21 PM
yeah, I think it can be improved 🙂

---

meep — 05/10/2024 3:33 PM
I suspect that all shards with shard_id = 0 will be subscribed to DM's as well?

---

Zoddo — 05/10/2024 3:34 PM
yeah
they will receive all events not related to a guild (except USER_UPDATE which is always dispatched to all shards) + ephemeral messages events

---

meep — 05/10/2024 3:38 PM
I see, thanks!

@GitMeep GitMeep changed the title Clarified how sharding works and removed some dangling spaces Clarifying how sharding works May 10, 2024
docs/topics/Gateway.md Outdated Show resolved Hide resolved
docs/topics/Gateway.md Outdated Show resolved Hide resolved
docs/topics/Gateway.md Outdated Show resolved Hide resolved
Thank you for looking it through and fixing my mistakes :)

Co-authored-by: Zoddo <github@zoddo.fr>
docs/topics/Gateway.md Outdated Show resolved Hide resolved
docs/topics/Gateway.md Outdated Show resolved Hide resolved
Word choice and formula disambiguation

Co-authored-by: Oliver Wilkes <oliverwilkes2006@icloud.com>
Co-authored-by: Zoddo <github@zoddo.fr>

###### Sharding Formula

```python
shard_id = (guild_id >> 22) % num_shards
(guild_id >> 22) % num_shards == shard_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure why this change is necessary - both say the same thing, and I think the prior version says it a bit more clearly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean whether shard_id is first or last, or whether it's an assignment = or comparison ==?

In the first case, it is to less ambiguously show that it is, in fact, intended to be a comparison and not an assignment.

The whole point of this PR is to rewrite the documentation in terms of which of the guilds that a bot is in to which a gateway "session" with a certain shard_array will be subscribed to events, as opposed to deciding which "shard_id a certain event will be sent to" as this is very confusing in the context of multiple "sessions" with the same shard_id and different num_shards. See the original discussion that led to this PR and Zoddos reply in Lulalaby's review above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that a = b + c is much more clear than b + c == a

Copy link
Contributor Author

@GitMeep GitMeep May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this PR, the formula is written in the context of - when a shard is connecting to the gateway - iterating through all guilds that a bot is part of and deciding whether the shard should receive events from that guild based on the guilds guild_id and the shard array with shard_id and num_shards that the shard provided.
Sure, if you see it as a mathematical equation, a single = would suffice, but this is documentation that programmers will be reading to understand how sharding behaves, and when programmers see an expression like that with things that are very clearly variable names they would assume that = is an assignment, which this formula is not.
I agree that writing the single variable on the left-hand side looks cleaner, but as discussed in the thread I mentioned, that would create some ambiguity of whether it was actually meant as a comparison or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this not an assignment? The original wording is much clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I appreciate the discussion, I would still prefer the original form to remain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can keep it as an assignment. It still has to be in the context of a single shard though, as each session has its own (potentially different) value of num_shards. Then we can just write with words afterwards that "The session will receive events from every guild that, using the formula above, evaluates to the sessions shard_id using the corresponding num_shards".
I don't see how adding this level of indirection makes it any clearer though?

@@ -561,27 +561,33 @@ When connecting to the gateway as a bot user, guilds that the bot is a part of w

## Sharding

As apps grow and are added to an increasing number of guilds, some developers may find it necessary to divide portions of their app's operations across multiple processes. As such, the Gateway implements a method of user-controlled guild sharding which allows apps to split events across a number of Gateway connections. Guild sharding is entirely controlled by an app, and requires no state-sharing between separate connections to operate. While all apps *can* enable sharding, it's not necessary for apps in a smaller number of guilds.
As apps grow and are added to an increasing number of guilds, some developers may find it necessary to divide portions of their app's operations across multiple processes. As such, the Gateway implements a method of user-controlled guild sharding which allows apps to split events across a number of Gateway sessions. Guild sharding is entirely controlled by an app, and requires no state-sharing between separate sessions to operate. While all apps *can* enable sharding, it's not necessary for apps in a smaller number of guilds.
Copy link
Member

@jhgg jhgg May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really have something called a "gateway session" but I understand the intention here.

To elaborate on the terminology, a gateway connection refers to a connection to our websocket gateway at gateway.discord.gg, and a gateway connection then spawns a session, or re-establishes a connection to an existing session. The session outlives the gateway connection, since you can re-connect to the gateway when you're disconnected, and RESUME to re-establish the gateway socket's connection to a given session.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it should just be "session" instead of "Gateway session"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that would be fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is session more accurate than connection here? IMO connections are more intuitive than sessions, so rewriting this section in terms of sessions makes it harder to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that session is more accurate yes.
As Jake wrote, a session can be RESUMED in a new connection to the gateway. And, as per the documentation, a connection will be sent all missed events from a session once it resumes it.
Thus, events being sent to a session which "forwards" them over the active connection or stores them if a connection is not currently active is a more accurate way of thinking of it (unless I completely misunderstand how it works).

As an example, if you wanted to split the connection between three shards, you'd use the following values for `shard` for each connection: `[0, 3]`, `[1, 3]`, and `[2, 3]`. Note that only the first shard (`[0, 3]`) would receive DMs.
Every session with `shard_id = 0` will be subscribed to DMs and other non-guild related events.

As an example, if you wanted to split events equally between three shards, you'd use the following values for `shard` for each session: `[0, 3]`, `[1, 3]`, and `[2, 3]`. DMs would only be sent to the `[0, 3]` shard.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word "evenly" is not necessarily true, since the volume of events dispatched to a given session is entirely dependent on the guilds on that shard.

E.g. in a pathological case, if your bot is in 2 guilds, and you have one of the guilds has 1 member, and the other 500,000 members, you definitely won't see an "even" distribution of events.

Copy link
Contributor Author

@GitMeep GitMeep May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. "As an example, if you wanted to split guilds equally between three shards" would probably be better wording, along with a remark that this of course won't necessarily split the events evenly if some guilds produce more events than others.
I'll include this in a batch once all your comments have been resolved.

Copy link
Member

@jhgg jhgg May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no "split equally" either - for example, if you are running 3 shards, and you leave all guilds on shard 0, then your guilds are not and will no longer be split equally, and re-connecting would of course not repair this bias.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire basis of this system is that it takes a probabilistic approach to distributing guilds between shards, based on the millisecond that the guild was created in our system. With enough guilds and shards, it should balance out, but there definitely is no guarantee of an even or equal split one way or another.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand that very well. There is an implicit assumption in that sentence that the bot is part of many guilds (as it would most likely be when sharding becomes necessary). I could definitely make that assumption explicit.


As an example, if you wanted to split events equally between three shards, you'd use the following values for `shard` for each session: `[0, 3]`, `[1, 3]`, and `[2, 3]`. DMs would only be sent to the `[0, 3]` shard.

Note that `num_shards` does not relate to (or limit) the total number of potential sessions, and can be different between multiple sessions existing at the same time. It is only used to decide whether an event will be sent to the associated session using the [Sharding Formula](#DOCS_TOPICS_GATEWAY/sharding-sharding-formula) above. In the simple case like the example above, where every session has the same `num_shards` and the sessions respective `shard_id`'s cover every value from `0` to `num_shards - 1`, the events will be split evenly between the sessions. This is probably how most bots will operate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the word "evenly" here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, I agree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants