Skip to content

High Availability

Andrew J. Gillis edited this page Aug 9, 2018 · 2 revisions

High-Availability Overview

High-availability (HA) needs both distribution and redundancy to:

  1. Distribute work over multiple nodes to avoid the resource limitations of any single node, and
  2. Have redundant nodes to continue operations in the event of node failure

I propose that both of these aspects of HA can be accomplished by building HA functionality on top of the existing nexus router and client. A HA-capable WAMP service implementation would use the existing nexus router library and would include as part of the service a special client (federation agent) that has logic for federation of separate HA WAMP service instances (ha-routers). This would provide scalability by distributing clients among multiple ha-routers, while allowing clients on one router to deliver messages to clients on other routers.

A HA-capable WAMP client, that can fail-over to and use multiple other routers, and has some additional message handling logic, would enable redundancy by not relying on a single router for messaging.

The ha-router would be compatible with not non-ha clients, and ha-clients would be compatible with non-ha routers.

Design Concept

Distribution for Scalability

With a large enough number of clients, at some scale no single node will be able to support the workload and maintain reasonable performance. This means that WAMP clients need to be distributed over all the nodes running a WAMP router. Events and RPC calls received by one router must be forwarded to the other router(s) that have clients consuming those messages.

Clients must connect to the routers, in a pool of routers, so that each router handles a equal share of the clients. This can be accomplished by a load balancer, round-robin DNS, etc., or less efficiently by a ha-client connecting to any single router, retrieving a list of routers in the pool and then choosing one to connect to.

The message forwarding between routers can be accomplished by a specialized client (federation agent), that is running locally to each router (probably as an in-process client in the case of nexus). That federation agent is capable of:

  • Discovering other routers (discovery)
  • Subscribing to an event with all other routers, for each subscription (a subscription has one or more subscribers) on the host router. (event delegation)
  • Registering a procedure with all other routers, for each registration (a shared registration can have multiple callees) on the host router. (procedure delegation)
  • Mapping a response from another router into a response to the original requesting client (response translation)

Pub/Sub Delegation

When a client subscribes to a ha-router and creates a new subscription, the router's federation agent detects this as a wamp.subscription.on_create meta event. This triggers the federation agent to subscribe to the same URI, with the same match policy, on all other remote ha-routers. No matter which ha-router an event is published to, the event is routed directly or via delegate to the subscriber. When the last subscriber to a topic unsubscribes or leaves, the federation agent detects that as a wamp.subscription.on_delete and unsubscribes from all other remote ha-routers.

This method of delegation keeps the routing table limited to the number of subscribing clients attached directly to the host router plus the number of other remote routers that have any subscribing clients. In other words, a remote router's federation acts as a single client subscribing to an event no matter how many subscriber sessions the remote router is managing.

RPC Delegation

RPC registration is handled similarly, with the local federation agent registering the a procedure with all other remote ha-routers when a registration is created, and unregistering when a registration is deleted. The federation agent registers the delegated procedure with remote routers as "shared", so that if the same procedure is provided by a client at another router, the routers without a direct callee for that procedure can distribute calls to all routers that have a callee for it. If the original registration is shared then the same invocation policy (roundrobin, random, first, etc.) is used. Otherwise, the configured default is used. The use of shared registrations is a way to provide client scalability where the RPC workload is divided among multiple clients.

Router Redundancy

Router redundancy could mean different things.

  • A client only knows of other routers so that it can connect to an alternate choice in case its current router becomes unavailable.
  • A client maintains an active connection to multiple routers, and has redundant subscriptions and registrations at each router. If a router becomes unavailable, the client will still receive events and RPC calls through other redundant router(s).

Assuming the latter is required. In that case ha-clients will need to establish sessions with N separate routers, where N is the router redundancy factor, where N <= the number of routers in the pool (if N == number of routers in pool, then there is no distribution; if N == 1 there is no redundancy). The ha-clients duplicate any subscriptions and registrations across all N redundant sessions. The ha-client must also discard duplicate events from separate routers/sessions. This means that some additional unique ID may need to be encoded into messages to identify duplicates (needs design). The ha-clients also have retry logic so that if the router handling an RPC request fails, then the client can try its other redundant ha-router(s), before returning an error or result.

Externally (from the point-of-view of the application code) a ha-client appears to connect to a single router, and manages its connections to redundant ha-routers internally.

The ha-client initiates sending messages to only one of its N redundant ha-routers, since delegation routes the message to suitable destination(s). If a router became unavailable, the the ha-client initiates sending to a different router in its redundancy set. If there are no available routers in the redundancy set, the ha-client will not attempt to connect to any other ha-routers, since that would have the effect of exceeding the redundancy factor and could overload a router and impact the availability to other clients.

The ha-client can request a list of all available ha-routers by making a RPC request, using a known URI, to a procedure that is supplied by each ha-router's federation agent. Every federation agent knows all ha-routers. From the list of ha-routers, the ha-client chooses N-1 additional routers to maintain redundant sessions with.

Discovery

When the first router is started, there are no other routers for its federation agent to connect to. The federation agent subscribes to the ha.announce.alive event on its local router. This event is published when a remote federation connects to the local router, to announce the presence of the remote router.

When another router is brought online, it is added to the federation by by first connecting its federation to any one of the already existing routers. The new router's federation agent can be told explicitly the address of another router. If using RR DNS or load-balancer it will retry if it gets a connection to its local router. Once the new router's federation agent connects to any existing remote ha-router, the agent requests the ha-router list from the remote router.

After receiving the ha-router list, the federation connects as a client to all additional remote routers in the list (it is already a client of its local router and the first remote router). After connecting to each router in the list, the new agent announces its presence by publishing the ha.announce.alive to each remote router. All federation agents automatically subscribe to ha.announce.alive. The event contains the address of the sending agent's router. When a federation agent receives this event, it automatically connects as a client to the router specified by the event, if not already connected. This causes all federation agents to be a client to every ha-router.

Reconnecting Federated Routers

If a router loses its connection a federated peer router, that peer remains in the list of federated routers, and periodic attempts to reconnect are made. If the disconnect was due to a network partition, then both routers will be trying to reconnect to each other. When one router connects, it sends a ha.announce.alive message to tell the other router to connect back, if not already connected. It is ok for both routers connect to each other and both send the alive event. In addition to carrying the router's address, the alive event also has the unique ID of the router's federation agent. If a router's address changed, this ID allows other routers to cancel any retry attempts to connect to the old address, and update the address of the ha-router associated with the ID.

Multiple Realms

HA capabilities are offered on a per-realm basis, meaning that a federation agent will need a separate set of sessions for federation of separate realms, since a session is bound to a single realm. The same is true for redundant ha-client sessions.

Stand-alone HA Add-on for Multiple Router Implementations

Since the proposed HA functionality is provided using a federation agent which is implemented as a client that accompanies a WAMP router to provide the HA functionality, an HA WAMP service could be constructed using any sufficiently capable WAMP router and a federation agent. The ha-client is just a WAMP client that has the ability to use features provided by a federation agent if present, and to maintain redundant sessions with ha-routers. Such a client could be constructed using any number of WAMP client libraries since the logic is above the ability to communicate over WAMP. Therefore, it seems possible that a WAMP HA solution could be offered as an add-on that could be used with any number of WAMP implementations.

Router Requirements

  • Subscription meta events
  • Registration meta events
  • Shared Registration