implement clustering for horizontal scalability #1532
Comments
I like the sound of most of that proposal. So, what sort of info usually differs for each server on the network, in traditional IRC nets?
Anything else? Ideally we wouldn't send an IRC-based line protocol on the wire and would do something more efficient instead; if we do need to, that's fine. Ideally we'll be able to use the same fairly simple command handlers we use now to parse incoming client messages.
I'm not super familiar with how scalability is affected by different factors, but is there any reason why a leaf that has e.g. 20 clients on it would be more stable if it gets the full 50k-client traffic piped to it, versus getting a smaller amount of traffic if the hub has some logic to work out which leaves should get which traffic? Or is it more that reducing the burden on the hub (not needing to keep track of which leaves have which clients, not needing to spend time directing messages) provides more benefit to the network as a whole than reducing the burden on specific leaf connections?

Just to make sure that we're on the same page: if Oragono is in hub mode it'll accept connections from oragono servers but not clients, and if it's in leaf mode it'll accept connections from clients only, yeah? The standard IRC paradigm has this feature where users can connect to hubs, which is sometimes useful for opers sitting on those hubs to e.g. fix the network, and sometimes you'll see servers that both accept clients (sitting there with a few thousand normal clients connected) and also act as the hub for 2-3 other leaf servers. It probably makes sense for us to just have a clean split: either hub or leaf, meaning either accept servers or accept clients, done.

Thoughts on the 'server name' stuff: again, this'd probably make sense as part of the connection authentication that leaves do to auth to the hub, e.g. something along the lines of this in the hub config:
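For illustration only (the real hub config would presumably be YAML, and none of these names exist in the codebase): a Go sketch of what such hub-side link configuration might deserialize to, with each permitted leaf identified by a server name and a link password.

```go
// Hypothetical sketch: hub-side link configuration mapping each leaf's
// server name to the credentials it must present when connecting.
// These types and fields are illustrative, not actual Oragono config.
package hubconfig

type LeafLink struct {
	Name     string // server name the leaf announces, e.g. "leaf1.example.net"
	Password string // link password (ideally stored as a bcrypt hash)
}

type HubConfig struct {
	ListenForServers string     // address the hub listens on for leaf connections
	Leaves           []LeafLink // leaves permitted to link to this hub
}
```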
Or are we gonna go the no-passwords-to-connect route and tell everyone to set up internal VPNs and such to expose the hub to the leaves instead?
Something I hadn't considered adequately before: this design requires the leaf nodes to have an up-to-date view of channel memberships. This shouldn't be too difficult, but it introduces a potential class of desync bugs.
In the MVP of this, all password checking will happen on the hub, but later on we'll probably want to have a mechanism to defer it to the leaf (an in-memory cache of accounts and hashed passwords?)
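A minimal sketch of what that deferral could look like on the leaf, assuming the hub pushes bcrypt hashes down into a local cache (the package, types, and function names here are assumptions, not anything that exists in the codebase): the leaf answers password checks locally when it has the account's hash, and falls back to the hub on a miss.

```go
// Hypothetical sketch only: a leaf-side cache of account credentials,
// so that password checks can be deferred from the hub to the leaf.
package leafauth

import (
	"sync"

	"golang.org/x/crypto/bcrypt"
)

type credentialCache struct {
	mu     sync.RWMutex
	hashes map[string][]byte // account name -> bcrypt hash of the password
}

// checkPassword verifies a password locally if the account is cached;
// the second return value reports whether the leaf could answer at all
// (on a miss, the caller would forward the check to the hub).
func (c *credentialCache) checkPassword(account, password string) (ok, cached bool) {
	c.mu.RLock()
	hash, found := c.hashes[account]
	c.mu.RUnlock()
	if !found {
		return false, false
	}
	return bcrypt.CompareHashAndPassword(hash, []byte(password)) == nil, true
}

// store records a hash pushed down from the hub (e.g. after a successful
// hub-side check, or as part of an account-update broadcast).
func (c *credentialCache) store(account string, hash []byte) {
	c.mu.Lock()
	if c.hashes == nil {
		c.hashes = make(map[string][]byte)
	}
	c.hashes[account] = hash
	c.mu.Unlock()
}
```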
Hi! This all sounds very interesting, and we can't wait to test it out. Did I understand correctly that all IRC clients will connect exclusively to leaf nodes, and none will connect to the root directly? If yes, will it be feasible to have one leaf on the same server as the root?
Yes, but what would this accomplish?
Oh --- I didn't mean to suggest that the old single-process mode of operation will no longer be supported.
I assume they'd probably want this for the same reason I would: if users can't directly connect to the hub when it's running in a hub+spoke configuration, it might be useful to have a single consistent leaf where opers and such can connect.
tl;dr if it's possible to let users connect to the hub then we'll do that, and if they can't then you should definitely be able to do this.
I was thinking more that one would ideally have at least two leaves in such a configuration (otherwise one could just run the single-process mode of operation). If the resources of the root server are good enough, could someone just run one of their leaves on it, or are there any drawbacks to doing so?
@DanielOaks' point is a good one too, though, assuming your root server is for some reason more reliable than the others.
My intuition is that in this setup, the leaf will compete with the root for resources and you might be better off running a single node. But I'm not sure.
Answering some other questions received:
If I understand correctly this means that if the root node goes down (e.g. for a reboot) or becomes unreachable (e.g. network issues), then the whole IRC network goes down with it? The original IRC federation design was not (only) to be able to scale horizontally, but also to be resilient to such infrastructure outages (and we've seen over the years that even the biggest and most experienced internet companies aren't immune to such calamities…).
As discussed in the issue description, this is in fact the status quo for conventional IRC networks: the services node is a SPOF for the entire network.
Motivation
The original model of IRC as a distributed system was as an open federation of symmetrical, equally privileged peer servers. This model failed almost immediately with the 1990 EFnet split. Modern IRC networks are under common management: they require all the server operators to agree on administration and policy issues. Similarly, the model of IRC as an AP system (available and partition-tolerant) failed with the introduction of services frameworks. In modern IRC networks, the services framework is a SPOF: if it fails, the network remains available but is dangerously degraded (in particular, its security properties have been silently weakened).
The current Oragono architecture is a single process. A single Oragono instance can scale comfortably to 10,000 clients and 2,000 clients per channel: you can push those limits if you have bigger hardware, but ultimately the single instance is a serious bottleneck. The largest IRC network of all time was 2004-era Quakenet, with 240,000 concurrent clients. Biella Coleman reports that the largest IRC channel of all time had 7,000 participants (#operationpayback on AnonOps in late 2010). This gives us our initial scalability targets: 250,000 concurrent clients with 10,000 clients per channel.
Oragono's single-process architecture offers compelling advantages in terms of flexibility and pace of development; it's significantly easier to prototype new features without having to worry about distributed systems issues. The architecture that balances all of these considerations --- acknowledging the need for centralized management, acknowledging the indispensability of user accounts, providing the horizontal scalability we need, and minimizing implementation complexity --- is a hub-and-spoke design.
Design
The oragono executable will accept two modes of operation: "root" and "leaf". The root mode of operation will be much like the present single-process mode. In leaf mode, the process will not have direct access to a config file or a buntdb database: it will take the IP (typically a virtual IP of some kind, as in Kubernetes) of the root node, then connect to the root node and receive a serialized copy of the root node's authoritative configuration over the network.
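A minimal sketch of what leaf startup could look like under this design, assuming a plain TCP link and gob-encoded configuration purely for illustration (the actual wire format, flags, and config fields are not specified in this issue):

```go
// Purely illustrative: how a leaf process might fetch its configuration
// from the root node at startup. The wire format and flag names are
// assumptions, not part of the actual design.
package main

import (
	"encoding/gob"
	"flag"
	"log"
	"net"
)

// Config stands in for the root node's authoritative configuration;
// the real struct would mirror the root's parsed config file.
type Config struct {
	ServerName string
	Limits     map[string]int
}

func main() {
	rootAddr := flag.String("root", "10.0.0.1:7001", "virtual IP and port of the root node")
	flag.Parse()

	conn, err := net.Dial("tcp", *rootAddr)
	if err != nil {
		log.Fatalf("could not reach root node: %v", err)
	}
	defer conn.Close()

	// The leaf has no config file or buntdb of its own: it receives a
	// serialized copy of the root's configuration over the link.
	var conf Config
	if err := gob.NewDecoder(conn).Decode(&conf); err != nil {
		log.Fatalf("could not decode config from root: %v", err)
	}
	log.Printf("received config for %q; now accepting client connections", conf.ServerName)
}
```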
Clients will then connect to the leaf nodes. Most commands received by the leaf nodes will be serialized and passed through to the root node, which will process them and return an answer. A few commands, like capability negotiation, will be handled locally. (This corresponds loosely to the `Session` vs. `Client` distinction in the current codebase: the leaf node will own the `Session` and the root node will own the `Client`. Anything affecting global server state is processed by the root; anything affecting only the client's connection is processed by the leaf.)

The root node will unconditionally forward copies of all IRC messages (PRIVMSG, NOTICE, KICK, etc.) to each leaf node, which will then determine which sessions are eligible to receive them. This provides the crucial fan-out that generates the horizontal scalability: the traffic (and TLS) burden is divided evenly among the n leaf nodes.
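Something like the following leaf-side logic could implement that filtering (a sketch only; the types and fields are hypothetical, not the actual `Session`/`Client` structures in the codebase): the root broadcasts every message to every leaf, and each leaf delivers it only to its locally attached sessions that are members of the target.

```go
// Illustrative sketch of leaf-side fan-out: the root forwards every message
// to every leaf, and the leaf decides which of its local sessions should
// actually receive it. Types and fields here are hypothetical.
package leaf

type Message struct {
	Command string // e.g. "PRIVMSG"
	Target  string // channel name or nick
	Line    []byte // the fully assembled IRC line to send
}

type Session struct {
	nick     string
	channels map[string]bool // channels this session has joined
	send     chan []byte
}

type Leaf struct {
	sessions []*Session // only the sessions connected to this leaf
}

// Deliver is called once per message forwarded by the root; only sessions
// that are members of the target channel (or are the target nick) get a copy.
func (l *Leaf) Deliver(msg Message) {
	for _, s := range l.sessions {
		if s.channels[msg.Target] || s.nick == msg.Target {
			select {
			case s.send <- msg.Line:
			default:
				// slow consumer: drop or disconnect; policy not specified here
			}
		}
	}
}
```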
Unresolved questions
The intended deployment strategy for this system is Kubernetes. However, I don't currently have a complete picture of which Kubernetes primitives will be used. The key assumption of the design is that the network can be virtualized such that the leaf nodes only need to know a single, consistent IP address for the root node.
I'm not sure how to do history. The simplest architecture is for `HISTORY` and `CHATHISTORY` requests to be forwarded to the root. This seems like it will result in a scalability bottleneck. In a deployment that uses MySQL, the leaf nodes can connect directly to MySQL; this reduces the problem to deploying a highly available and scalable MySQL, which is nontrivial. The logical next step would seemingly be to abstract away the `historyDB` interface, then provide a Cassandra implementation --- the problem with this is that we seem to be going in the direction of expecting read-after-write consistency from the history store (see this comment on #393 in particular).
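For concreteness, such a pluggable history-store abstraction might look roughly like this (a sketch under the assumption of a simple append/query API; it is not the actual interface in the codebase):

```go
// Hypothetical sketch of a pluggable history-store interface that a MySQL
// or Cassandra backend could implement; this is not the actual Oragono API.
package history

import (
	"context"
	"time"
)

type Item struct {
	Target  string    // channel or nick the message was addressed to
	Sender  string    // nickmask of the sender
	Message string    // message text
	Time    time.Time // server-assigned timestamp
}

type Store interface {
	// Add appends an item; with Cassandra this write is eventually
	// consistent unless tuned otherwise, which is where the
	// read-after-write concern comes in.
	Add(ctx context.Context, item Item) error

	// Between returns items for a target within a time window, newest
	// first, up to limit items (roughly what CHATHISTORY queries need).
	Between(ctx context.Context, target string, start, end time.Time, limit int) ([]Item, error)
}
```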
See also
This supersedes #343, #1000, and #1265; it's related to #747.