# orchestrator/raft, consensus cluster
orchestrator/raft is a deployment setup where several orchestrator nodes communicate with each other via the raft consensus protocol.
orchestrator/raft deployments solve both high availability for orchestrator itself and issues of network isolation, in particular cross-data-center network partitioning/fencing.
## Very brief overview of traits of raft
By using a consensus protocol, the orchestrator nodes are able to pick a leader that has quorum, implying it is not isolated. For example, consider a 3-node orchestrator/raft setup. Normally the three nodes will chat with each other and one will be a stable elected leader. However, in the face of network partitioning, say node `n1` is partitioned away from nodes `n2` and `n3`, it is guaranteed that the leader will be either `n2` or `n3`, if any. `n1` would not be able to lead because it does not have a quorum (in a 3-node setup the quorum size is 2; in a 5-node setup the quorum size is 3).
This proves useful in cross data center (DC) setups. Assume you set up three orchestrator nodes, each in its own DC. If one DC gets network isolated, it is guaranteed that the active orchestrator node will be one that has quorum, i.e. one that operates from outside the isolated DC.
## orchestrator/raft setup technical details
You will set up 3 or 5 orchestrator nodes (the recommended raft node counts). Other numbers are also legitimate, but you will want at least 3.
At this time orchestrator nodes do not join dynamically into the cluster. The list of nodes is preconfigured, as in:
"RaftEnabled": true, "RaftDataDir": "/var/lib/orchestrator", "RaftBind": "<ip.or.fqdn.of.this.orchestrator.node>", "DefaultRaftPort": 10008, "RaftNodes": [ "<ip.or.fqdn.of.orchestrator.node1>", "<ip.or.fqdn.of.orchestrator.node2>", "<ip.or.fqdn.of.orchestrator.node3>" ],
Each orchestrator node has its own, dedicated backend database server. This would be either:
- A MySQL backend DB (no replication setup required, but OK if this server has replicas).
  - As a deployment suggestion, this MySQL server can run on the same host as the orchestrator node it serves.
- A SQLite backend DB. Use:

  ```json
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db",
  ```

  orchestrator is bundled with sqlite; there is no need to install an external dependency.
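Either backend is local to its node. As an optional sanity check on a node configured with the SQLite backend, you could inspect the data file with the sqlite3 command line client (having that client installed is an assumption here; orchestrator itself does not require it):

```shell
# List a few tables in this node's local backend DB.
# The path matches the SQLite3DataFile setting shown above.
sqlite3 /var/lib/orchestrator/orchestrator.db ".tables" | head
```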
### Proxy: leader

Only the leader is allowed to make changes. The simplest setup is to route traffic only to the leader, by setting up an HTTP proxy (e.g. HAProxy) on top of the orchestrator nodes. (See the orchestrator-client section below for an alternative approach.)
- Use `/api/leader-check` as the HTTP health check. At any given time, at most one orchestrator node will reply with `HTTP 200/OK` to this check; the others will reply with `HTTP 404/Not found`.
  - Hint: you may use, for example, `/api/leader-check/503` if you explicitly wish to get a `503` response code, or similarly any other code; see the curl example below.
- Only direct traffic to the node that passes this test
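For illustration, the health check behavior can be observed directly with curl; the hostname and port below are placeholders matching the HAProxy example that follows:

```shell
# Prints 200 on the current leader, 404 on all other nodes.
curl -s -o /dev/null -w "%{http_code}\n" "http://orchestrator-node-0.fqdn.com:3000/api/leader-check"

# Same check, but non-leaders answer 503 instead of 404.
curl -s -o /dev/null -w "%{http_code}\n" "http://orchestrator-node-0.fqdn.com:3000/api/leader-check/503"
```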
As an example, this would be a valid HAProxy setup:

```
listen orchestrator
  bind  0.0.0.0:80 process 1
  bind  0.0.0.0:80 process 2
  bind  0.0.0.0:80 process 3
  bind  0.0.0.0:80 process 4
  mode tcp
  option httpchk GET /api/leader-check
  maxconn 20000
  balance first
  retries 1
  timeout connect 1000
  timeout check 300
  timeout server 30s
  timeout client 30s
  default-server port 3000 fall 1 inter 1000 rise 1 downinter 1000 on-marked-down shutdown-sessions weight 10
  server orchestrator-node-0 orchestrator-node-0.fqdn.com:3000 check
  server orchestrator-node-1 orchestrator-node-1.fqdn.com:3000 check
  server orchestrator-node-2 orchestrator-node-2.fqdn.com:3000 check
```
### Proxy: healthy raft nodes
A relaxation of the above constraint.
Healthy raft nodes will reverse proxy your requests to the leader. You may choose (and this happens to be desirable for
kubernetes setups) to talk to any healthy raft member.
You must not access unhealthy raft members, i.e. nodes that are isolated from the quorum.
Use `/api/raft-health` to identify that a node is part of a healthy raft group. An `HTTP 200/OK` response identifies the node as part of a healthy group, and you may direct traffic to that node. `HTTP 500/Internal Server Error` indicates the node is not part of a healthy group. Note that immediately following startup, and until a leader is elected, you may expect some time during which all nodes report as unhealthy. Likewise, upon leader re-election you may observe a brief period where all nodes report as unhealthy.
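As a sketch, a per-node health check built on this endpoint (for example, a proxy check script or a kubernetes probe command) could look like the following; the local port is an assumption:

```shell
#!/bin/sh
# Succeeds (exit 0) only if this node is part of a healthy raft group;
# curl -f makes the call fail on the HTTP 500 returned by unhealthy nodes.
curl -sf "http://127.0.0.1:3000/api/raft-health" > /dev/null
```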
### orchestrator-client

An alternative to the proxy approach is to use orchestrator-client. orchestrator-client is a wrapper script that accesses the orchestrator service via the HTTP API, and provides a command line interface to the user.
It is possible to provide
orchestrator-client with the full listing of all orchestrator API endpoints. In such case,
orchestrator-client will figure out which of the endpoints is the leader, and direct requests at that endpoint.
As an example, we can set:

```shell
export ORCHESTRATOR_API="https://orchestrator.host1:3000/api https://orchestrator.host2:3000/api https://orchestrator.host3:3000/api"
```
A call to orchestrator-client will first check which of the listed endpoints is the leader, and will then issue the request against that endpoint.
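For illustration, a regular orchestrator-client invocation then works unchanged, for example:

```shell
# orchestrator-client locates the leader among the ORCHESTRATOR_API endpoints on its own.
orchestrator-client -c clusters
```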
Otherwise, if you already have a proxy, it's also possible for
orchestrator-client to work with the proxy, e.g.:
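For example (the proxy address below is a placeholder, not part of this setup):

```shell
export ORCHESTRATOR_API="https://your.orchestrator.proxy:80/api"
```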
## Behavior and implications of orchestrator/raft setup
Each orchestrator node independently runs discoveries of all servers. This means that in a three node setup, each of your MySQL topology servers will be independently visited by three different orchestrator nodes. In normal times, the three nodes will see a more-or-less identical picture of the topologies, but each will have its own independent analysis.
Each orchestrator node writes to its own dedicated backend DB server (whether MySQL or SQLite).
Communication between orchestrator nodes is minimal. They do not share discovery info (since they each discover independently). Instead, the leader shares with the other nodes the user instructions it intercepts. The leader will also educate its followers about ongoing failovers. This communication between orchestrator nodes does not correlate to transactional database commits, and is sparse.
All user changes must go through the leader, and in particular via the
HTTP API. You must not manipulate the backend database directly, since such a change will not be published to the other nodes.
As a result, on an orchestrator/raft setup one may not use the orchestrator executable in command line mode: an attempt to run the orchestrator cli will refuse to run when raft mode is enabled. Work is ongoing to allow some commands to run via the cli.
A utility script, orchestrator-client, is available; it provides a similar interface to the command line orchestrator, and it uses & manipulates the HTTP API.
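As a sketch, a command one might otherwise have run via the orchestrator cli can be issued through orchestrator-client instead; the instance address below is a placeholder:

```shell
# Print the topology of the cluster that contains the given instance, via the HTTP API.
orchestrator-client -c topology -i some.instance.in.cluster.com:3306
```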
You will only install the orchestrator service on the orchestrator nodes themselves, and nowhere else. The orchestrator-client script can be installed wherever you wish.
A failure of a single orchestrator node will not affect orchestrator's availability. On a 3-node setup, at most one server may fail. On a 5-node setup, 2 nodes may fail.
An orchestrator node cannot run without its backend DB. With the sqlite backend this is trivial, since sqlite runs embedded within orchestrator. The orchestrator service will bail out if it is unable to connect to its backend DB over a period of time.
An orchestrator node may be down, then come back. It will rejoin the raft group and receive whatever events it missed while it was away. This is possible as long as there is enough raft log to replay; in most environments there should be enough log for a few hours. An orchestrator service will bail out if it cannot join the raft group.
To join an orchestrator node that was down or away for longer than what log retention permits, or a node whose database is completely empty, you will need to clone the backend DB from another, active node.
For sqlite, for example:

```shell
active-node$ sqlite3 /var/lib/orchestrator/orchestrator.db .dump > /tmp/orchestrator-dump.sql
active-node$ scp /tmp/orchestrator-dump.sql new-node:/tmp/
new-node$ sqlite3 /var/lib/orchestrator/orchestrator.db < /tmp/orchestrator-dump.sql
```
For MySQL, use your favorite backup/restore method.
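One possible sketch, assuming a backend schema named orchestrator and working MySQL credentials on both hosts (both are assumptions, not part of this setup):

```shell
# Dump the backend schema on an active node and load it on the joining node.
active-node$ mysqldump --single-transaction orchestrator > /tmp/orchestrator-dump.sql
active-node$ scp /tmp/orchestrator-dump.sql new-node:/tmp/
new-node$ mysql orchestrator < /tmp/orchestrator-dump.sql
```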
See also: Master discovery with Key Value stores.
## Main advantages of orchestrator/raft
- Highly available
- Consensus: failovers are made by a leader node that is a member of the quorum (not isolated)
- SQLite (embedded) backend; no need for a MySQL backend (though MySQL is supported)
- Little cross-node communication; fit for high latency cross-DC networks
## DC fencing example
Consider this example of three data centers, DC1, DC2 and DC3. We run orchestrator/raft with three nodes, one in each data center.
What happens when DC2 gets network isolated? The orchestrator nodes in DC1 and DC3 still form a quorum, so the leader is (or becomes) one of them; the node in DC2 loses quorum and cannot lead. Detection and failover decisions are therefore made from outside the isolated data center.
## Still ongoing and TODO
- Failure detection to require quorum agreement (i.e. a `DeadMaster` needs to be analyzed by multiple orchestrator nodes) in order to kick off failover/recovery.
- Support sharing of probing (mutually exclusive with the above): the leader will divide the list of servers to probe between all nodes, potentially by data center. This will reduce probing load (each MySQL server will be probed by a single node rather than by all nodes). All orchestrator nodes will then see the same picture, as opposed to independent views.