WIP: orchestrator/raft[/sqlite] #183

Merged: 234 commits, Aug 3, 2017

Changes from 198 commits

Commits
ca9828f
Auto-merged master into sqlite on deployment
Mar 15, 2017
b7887c6
Merge branch 'master' into sqlite
Mar 15, 2017
eddd044
Merge branch 'master' into sqlite
Apr 21, 2017
7bcdf62
Merge branch 'master' into sqlite
May 8, 2017
e9fd330
Merge branch 'master' into sqlite
May 11, 2017
364ca00
Merge branch 'master' into sqlite
May 11, 2017
89c8ffd
Merge branch 'master' into sqlite
May 21, 2017
9723606
orchestrator/raft[/sqlite]
May 22, 2017
30a1c77
added dependencies
May 22, 2017
d1062ca
updated .gotignore
May 23, 2017
6327198
adding and supporting RaftEnabled config
May 23, 2017
be36d0a
trying to fight 'database is locked' scenario
May 23, 2017
bb2c5c6
refactored IsLeader()
May 23, 2017
bbe9a3d
trying to fight 'database is locked' scenario
May 23, 2017
3f84a4d
adding /leader-check API; 200 when leader, 404 when not leader
May 23, 2017
6ab50e7
raft nodes get to run cleanups and everything; just not recoveries
May 23, 2017
61cb18b
unauthorizing API for non-leader raft members
May 23, 2017
2be3e0d
trying to fight 'database is locked' scenario
May 23, 2017
7257b97
trying to fight 'database is locked' scenario
May 23, 2017
5f00ea5
trying to fight 'database is locked' scenario
May 23, 2017
1547a71
experiment: no hostname resolve on sqlite
May 23, 2017
e163616
trying to fight 'database is locked' scenario
May 23, 2017
ed27c1a
driving away hostname resolve DB concurrency issues
May 23, 2017
3daa060
election is meaningless when raft enabled; health checks consider raft
May 23, 2017
82730e4
minor cleanup
May 23, 2017
0959398
refactoring: split health_dao into health, health_dao
May 23, 2017
8a74760
simplifying Once operations
May 24, 2017
13a4949
Merge branch 'master' into raft-sqlite
May 28, 2017
939107c
more structured NodeHealth registration
May 28, 2017
d3885ab
gofmt
May 28, 2017
f0f391f
sending health command to group
May 28, 2017
9ae907b
adding appliers
May 28, 2017
2aa0afb
attempt to allow writing commands from followers
May 28, 2017
19d574c
reverting follower updates
May 28, 2017
6a354b8
api's /discover to apply to group
May 28, 2017
399b3d1
simplified message distribution
May 28, 2017
214eff6
refactoring, simplifying raft commands
May 28, 2017
d74c59d
Merge branch 'master' into raft-sqlite
May 28, 2017
216c1ad
fixed node health hostanme
May 28, 2017
2fe95f1
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
May 28, 2017
3064873
improved nodHealth.Update()
May 28, 2017
79487a7
begin/end downtime via raft
May 29, 2017
d4636c9
begin/end downtime via raft
May 29, 2017
7f18f6a
health doesn't go through raft
May 29, 2017
fdadb1a
candidate instance / promotion rules via raft
May 29, 2017
c78b608
unused code cleanup
May 29, 2017
acc8f59
acknowledge, hostname unresolve: in raft
May 29, 2017
cad04eb
the typo is there, so I have to keep it
May 29, 2017
f2a066d
experimenting with low snapshot threshold
May 29, 2017
809f04c
fixed register-hostname-unresolve
May 29, 2017
c29f9ae
not applying expired hostname unresolve
May 29, 2017
3550b6e
more aggressive maintenance purging
May 29, 2017
6ac786e
more snapshot debug info
May 29, 2017
8c76b03
aggressive snapshots
May 29, 2017
3a667be
not checking response
May 29, 2017
100d913
submit-pool-instances via raft
May 29, 2017
0cbfd6a
support for tx.Prepare with sqlite3 dialect; pool_dao to use this
May 29, 2017
9546bf3
failure detection via raft
Jun 4, 2017
99f01cb
Merge branch 'master' into raft-sqlite
Jun 4, 2017
18942ad
fixed compilation errors
Jun 4, 2017
183d02d
Auto-merged master into raft-sqlite on deployment
Jun 5, 2017
dda394b
Merge branch 'master' into raft-sqlite
Jun 5, 2017
b319d44
Merge branch 'master' into raft-sqlite
Jun 5, 2017
247cbdd
writing TopologyRecoveryStep via raft
Jun 5, 2017
c25233e
Merge branch 'master' into raft-sqlite
Jun 11, 2017
595f8c8
merged detection refactoring
Jun 11, 2017
fe02b58
raft: write-recovery command. Writing recoveries and recovery steps u…
Jun 12, 2017
2fb7aef
improved PrettyUniqueToken()
Jun 13, 2017
3f8c585
forget-instances cache
Jun 14, 2017
d557fc7
aborting instance discovery if is forgotten
Jun 14, 2017
70550f2
more aggressive forgetting
Jun 14, 2017
d09efb8
adding /api/instances
Jun 14, 2017
ae2f213
debug messages for skipping discovery on forgotten instances
Jun 14, 2017
e795909
since we rely on relational backend, and do not actually generate sna…
Jun 14, 2017
9979b43
DiscardSnapshotStore doesn't seem to work as we expect
Jun 14, 2017
6514162
creating relational snapshots (no state, only meta)
Jun 15, 2017
e806adb
Merge branch 'master' into raft-sqlite
Jun 15, 2017
b31bc75
working rel-snapshot (with way too much debug info, will rip)
Jun 15, 2017
c5777fd
ripped debug messages
Jun 15, 2017
48f39d1
more frequent snapshots
Jun 15, 2017
34aeeee
JS var safety
Jun 15, 2017
f435b37
JS var safety
Jun 15, 2017
26f7eff
JS var safety
Jun 15, 2017
c343199
JS var safety
Jun 15, 2017
bb6d15d
Merge branch 'api-replication-analysis' into raft-sqlite
Jun 15, 2017
222e5b5
disabling Reelect on raft
Jun 15, 2017
6e02046
working towards analysis concensus
Jun 17, 2017
0c5f875
Merge branch 'master' into raft-sqlite
Jun 18, 2017
b8347b2
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
Jun 18, 2017
17e9598
marking snapshot's created_at
Jun 18, 2017
d7e85a6
more snapshot history
Jun 21, 2017
bc646c3
removed 'heartbeat' message that was mostly used for testing
Jun 21, 2017
f269d48
raft leader StepDown()
Jun 21, 2017
fd79c46
restoring app-level heartbeat, just to avoid complete desolation of e…
Jun 22, 2017
8bd8607
resolved merge conflict
Jun 22, 2017
bf2c3f9
Merge branch 'master' into raft-sqlite
Jun 22, 2017
9accb43
gofmt
Jun 22, 2017
1e83ee2
Merge branch 'master' into raft-sqlite
Jun 24, 2017
e630fd3
Merge branch 'master' into raft-sqlite
Jun 25, 2017
b194b25
Merge branch 'master' into raft-sqlite
Jun 27, 2017
47e24a2
merging with master, resolving conflicts
Jun 27, 2017
5114002
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
Jun 27, 2017
1485251
merge master
Jun 27, 2017
394850b
Merge branch 'master' into raft-sqlite
Jun 29, 2017
62a805e
Merge branch 'master' into raft-sqlite
Jun 29, 2017
31311c4
Merge branch 'master' into raft-sqlite
Jun 29, 2017
c512925
Merge branch 'master' into raft-sqlite
Jul 2, 2017
f6a72cb
preparation for raft-yield
Jul 2, 2017
5075d93
more preparation for 'yield', also 'raft-peers'
Jul 2, 2017
48c1289
yield
Jul 2, 2017
79d0b27
more yield logging
Jul 2, 2017
da54632
more yield logging
Jul 2, 2017
8c7f56d
fixed yield logic flow
Jul 2, 2017
76a5ca5
mroe time for yielded-to peer to become leader
Jul 2, 2017
5a2c98b
Merge branch 'master' into raft-sqlite
Jul 3, 2017
87f8641
gofmt
Jul 3, 2017
2f33a77
API fixes
Jul 3, 2017
3ea4cd7
Merge branch 'master' into raft-sqlite
Jul 5, 2017
b483f30
all-instances to use SearchInstances()
Jul 5, 2017
4dadfe5
removed use of rlike
Jul 5, 2017
3c0cd3d
Merge branch 'master' into raft-sqlite
Jul 9, 2017
52567ba
normalize yield host
Jul 9, 2017
9835e28
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
Jul 9, 2017
da56f89
normalize yield host
Jul 9, 2017
efde4c9
support /api/raft-state
Jul 9, 2017
4818b45
Merge branch 'master' into raft-sqlite
Jul 10, 2017
c065383
no need to authorize raft-state and raft-peers
Jul 10, 2017
a77c63a
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
Jul 10, 2017
5e16c9d
raft state as string
Jul 10, 2017
4954d3e
standard 'which' invocation in orchestrator-client
Jul 10, 2017
61eb056
Auto-merged master into raft-sqlite on deployment
Jul 10, 2017
d3c97a3
Merge branch 'master' into raft-sqlite
Jul 10, 2017
b5aceb4
Merge branch 'master' into raft-sqlite
Jul 11, 2017
abfce5a
in preparation for failure detection concensus analysis
Jul 11, 2017
348e25d
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
Jul 11, 2017
28833df
experimenting with nodes joining the group without logs
Jul 11, 2017
37a3e67
experiment done
Jul 11, 2017
52dbee8
forget-cluster
Jul 11, 2017
8dda655
more aggressive filtering our of forgotten instances
Jul 11, 2017
1eb07d8
even more aggressive period forgetting of an instance
Jul 11, 2017
b8d1eff
raft non-leader skips detection hooks
Jul 12, 2017
6684e51
yield by hint
Jul 12, 2017
c5c7605
Auto-merged master into raft-sqlite on deployment
Jul 13, 2017
6c5db36
raft leader control in orchestrator-client
Jul 16, 2017
dd59aee
fixed jq .
Jul 16, 2017
3b8dcca
orchestrator-client error reported to stderr
Jul 16, 2017
89d2245
CLI forbidden when RaftEnabled is true
Jul 16, 2017
0a06730
initial work to document orchestrator/raft
Jul 17, 2017
e158840
updated to ha doc
Jul 17, 2017
21e4c29
updated to ha doc
Jul 17, 2017
852c518
preparing documentation for shared backend vs. raft deployments
Jul 17, 2017
538b717
smaller images
Jul 18, 2017
3628a96
temporarily removing images
Jul 18, 2017
0b499de
returning images
Jul 18, 2017
0ea0041
metrics are independent of graphite. raft repots is_elected
Jul 18, 2017
c05cfbe
Merge branch 'master' into raft-sqlite
Jul 19, 2017
8175f04
Merge branch 'master' into raft-sqlite
Jul 23, 2017
f57e4e6
Merge branch 'master' into raft-sqlite
Jul 23, 2017
0264ed7
investigating installSnapshot
Jul 23, 2017
86af39d
investigating installSnapshot
Jul 23, 2017
7fc59b3
investigating installSnapshot
Jul 23, 2017
daffe85
investigating installSnapshot
Jul 23, 2017
63c860a
investigating installSnapshot
Jul 23, 2017
04e734e
investigating installSnapshot
Jul 23, 2017
ccc2590
investigating installSnapshot
Jul 23, 2017
c8d01da
investigating installSnapshot
Jul 23, 2017
9d7d39b
investigating installSnapshot
Jul 23, 2017
1868780
investigating installSnapshot
Jul 23, 2017
46fee40
investigating installSnapshot
Jul 23, 2017
ef4971f
investigating installSnapshot
Jul 23, 2017
1b1e134
investigating installSnapshot
Jul 23, 2017
d2063b2
investigating installSnapshot
Jul 23, 2017
f0f55f1
investigation cleanup
Jul 23, 2017
fa2511e
investigation cleanup
Jul 23, 2017
cb16704
investigation cleanup
Jul 23, 2017
e5c86bd
investigation cleanup
Jul 23, 2017
015d3e0
fatal on continuous health check fails
Jul 24, 2017
de1dd4d
updated images for high availability page
Jul 25, 2017
edfe3de
updated images for high availability page
Jul 25, 2017
4f9260f
updated images for high availability page
Jul 25, 2017
4f7b317
raft documentation
Jul 26, 2017
457bb05
raft documentation
Jul 26, 2017
2794b9d
api cheatsheet
Jul 26, 2017
28d98f6
cli noop is allowed in raft mode
Jul 26, 2017
51d99b3
doc updates
Jul 26, 2017
f574382
doc updates
Jul 26, 2017
074a573
doc updates
Jul 26, 2017
f2ac76d
doc updates
Jul 26, 2017
2a1e707
doc updates
Jul 26, 2017
d667525
raft vs synchronous replication
Jul 26, 2017
5f383ba
raft vs synchronous replication
Jul 27, 2017
9e23ba2
raft vs synchronous replication
Jul 27, 2017
2e816ed
raft vs synchronous replication
Jul 27, 2017
1eaa2aa
raft vs synchronous replication
Jul 27, 2017
66d3af3
raft vs synchronous replication
Jul 27, 2017
20133e0
raft vs synchronous replication
Jul 27, 2017
48552e5
raft vs synchronous replication
Jul 27, 2017
49fae62
doc updates
Jul 27, 2017
83ef4f5
fixing response code doc
Jul 27, 2017
f21e2f3
doc updates
Jul 30, 2017
7b4c16a
accept /api/leader-check/:errorStatusCode
Jul 30, 2017
5cc13de
updated doc with /api/leader-check/:errorStatusCode
Jul 30, 2017
328db3d
ack-recovery supports cluster hint
Jul 31, 2017
3b86bf2
orchestrator-client supports '-c api' generic command
Jul 31, 2017
2985b58
api, orchestrator-client both support 'raft=leader'
Jul 31, 2017
274dee4
orchestrator-client supports multiple endpoints, then auto-figures le…
Jul 31, 2017
9e3605a
Merge branch 'master' into raft-sqlite
Aug 1, 2017
4c5ddc3
Merge branch 'master' into raft-sqlite
Aug 1, 2017
d605b0c
gofmt
Aug 1, 2017
938be29
supporting /api/master/:clusterHint
Aug 1, 2017
2b4ac14
raft support for enable/disable global recoveries
Aug 1, 2017
395ba08
AcknowledgeClusterRecoveries: also acknowledging by alias
Aug 1, 2017
567b188
orchestrator-client: support -c which-api
Aug 1, 2017
8ca48db
document orchestrator-client multi-urls
Aug 1, 2017
987cacc
renamed orchestrator-client doc page
Aug 1, 2017
cb68852
orchestrator-client documentation
Aug 1, 2017
5f6807b
orchestrator-client documentation
Aug 1, 2017
f7f7b89
no-proxy options
Aug 1, 2017
efdb7bd
no-proxy options
Aug 1, 2017
8b61d0d
no-proxy options
Aug 1, 2017
a927624
config: raft
Aug 1, 2017
0bcc379
config: sqlite
Aug 1, 2017
fc4d61f
config: sqlite
Aug 1, 2017
07370e1
config: raft
Aug 1, 2017
76b9ca2
config: raft
Aug 1, 2017
84d6146
config: raft
Aug 1, 2017
0e7cda6
orchestrator-client -c which-cluster-master
Aug 2, 2017
28772a1
non empty test
Aug 2, 2017
4bc014d
PeerAPI
Aug 2, 2017
d3fa491
refactored raft-http
Aug 2, 2017
6af2f81
refactored raft-http
Aug 2, 2017
d63b68c
tracking and exporting count pending recoveries
Aug 2, 2017
a5da401
experimenting observation
Aug 2, 2017
7307957
observer experiment completed
Aug 2, 2017
1 change: 1 addition & 0 deletions .gitignore
@@ -10,4 +10,5 @@ vagrant/db4-post-install.sh
 vagrant/vagrant-ssh-key
 vagrant/vagrant-ssh-key.pub
 Godeps/_workspace
+.gopath/
 main
126 changes: 85 additions & 41 deletions docs/high-availability.md
@@ -1,57 +1,101 @@
# Orchestrator High Availability

-`Orchestrator` makes your MySQL topologies available, but what makes `orchestrator` highly available?
+`orchestrator` runs as a highly available service. This document lists the various ways of achieving HA for `orchestrator`, as well as less/not highly available setups.

-Before drilling down into the details, we should first observe that orchestrator is a service that runs with a MySQL backend. Thus, we need to substantiate the HA of both these components, as well as the continued correctness in the failover process of either of the two or of both.
+### TL;DR ways to get HA

-### High Availability of the Orchestrator service
+HA is achieved by choosing either:

-`Orchestrator` runs as a service and its configuration needs to
-reference a MySQL backend. You can quite easily add more orchestrator
-applications probably running on different hosts to provide redundancy.
-These servers would have an identical configuration. `Orchestrator`
-uses the database to record the different applications which are
-running and through it will allow an election process to choose one
-of the processes to be the active node. If that process fails the
-remaining processes will notice and shortly afterwards choose a new
-active node. The active node is the one which periodically checks
-each of the MySQL servers being monitored to determine if they are
-healthy. If it detects a failure it will recover the topology if
-so configured.
+- `orchestrator/raft` setup, where `orchestrator` nodes communicate by raft consensus. Each `orchestrator` node [has a private database backend](#ha-via-raft), either `MySQL` or `sqlite`. See also the [orchestrator/raft documentation](raft.md).
+- [Shared backend](#ha-via-shared-backend) setup. Multiple `orchestrator` nodes all talk to the same backend, which may be a Galera/XtraDB Cluster/InnoDB Cluster/NDB Cluster. Synchronization is done at the database level.

-If you use the web interface to look at the topology information
-or to relocate replicas within a cluster and you have multiple
-orchestrator processes running then you need to use a load balancer
-to provide the redundant web service through a common URL. Requests
-through the load balancer may not hit the active node but that is
-not an issue as any of the running processes can serve the web
-requests.
+See also: [orchestrator/raft vs. synchronous replication setup](raft-vs-sync-repl.md)

-### High Availability of the Orchestrator backend database
+### Availability types

-At this time `Orchestrator` relies on a MySQL backend. The state of the clusters is persisted to tables and is queried via SQL. It is worth considering the following:
+You may choose different availability types, based on your requirements.

-- The backend database is very small, and is linear with your number of servers. For most setups it's a matter of a few MB and well below 1GB, depending on history configuration, rate of polling etc. This easily allows for a fully in-memory database even on simplest machines.
+- No high availability: easiest, simplest setup, good for testing or dev setups. Can use `MySQL` or `sqlite`.
+- Semi HA: backend is based on normal MySQL replication. `orchestrator` does not eat its own dog food and cannot fail over its own backend.
+- HA: as depicted above; support for no single point of failure. Different solutions have different tradeoffs in terms of resource utilization, supported software, and type of client access.

-- Write rate is dependent on the frequency the MySQL hosts are polled and the number of servers involved. For most orchestrator installations the write rate is low.
+All options are discussed below.

-To that extent you may use one of the following solutions in order to make the backend database highly available:
+### No high availability

-- 2-node MySQL Cluster
-This is a synchronous solution; anything you write on one node is guaranteed to exist on the second. Data is available and up to date even in the face of a death of one server.
-Suggestion: abstract via HAProxy with `first` load-balancing algorithm.
-> NOTE: right now table creation explicitly creates tables using InnoDB engine; you may `ALTER TABLE ... ENGINE=NDB`
+![orchestrator no HA](images/orchestrator-ha--no-ha.png)

-- 3-node Galera/XtraDB cluster
-This is a synchronous solution; anything you write on one node is guaranteed to exist on both other servers.
-Galera is eventually consistent.
-Data is available and up to date even in the face of a death of one server.
-Suggestion: abstract via HAProxy with `first` load-balancing algorithm.
+This setup is good for CI testing, for local dev machines or other experiments. It is a single-`orchestrator` node with a single DB backend.

-- MySQL Group Replication
-This is similar to the MySQL Cluster (but uses the InnoDB engine) or the Galera/XtraDB cluster and is available in MySQL 5.7.17 (December 2016) and later.
-Similar considerations apply as for the previous two options.
+The DB backend may be a `MySQL` server or it may be a `sqlite` DB, bundled with `orchestrator` (no dependencies, no additional software required).

-- 2-node active-passive master-master configuration
+### Semi HA

-> NOTE: there has been an initial discussion on supporting Consul/etcd as backend datastore; there is no pending work on that at this time.
![orchestrator semi HA](images/orchestrator-ha--semi-ha.png)

This setup provides semi HA for `orchestrator`. Two variations are available:

- Multiple `orchestrator` nodes talk to the same backend database. HA of the `orchestrator` services is achieved. However, HA of the backend database is not achieved. The backend database may be a `master` with replicas, but `orchestrator` is unable to eat its own dog food and fail over its very own backend DB.

If the backend `master` dies, it takes someone or something else to fail over the `orchestrator` service onto a promoted replica.

- Multiple `orchestrator` services all talk to a proxy server, which load balances an active-active `MySQL` master-master setup with `STATEMENT` based replication.

- The proxy always directs to the same server (e.g. `first` algorithm for `HAProxy`) unless that server is dead.
- Death of the active master causes `orchestrator` to talk to the other master, which may be somewhat behind. `orchestrator` will typically self-reapply the missing changes by nature of its continuous discovery.
- `orchestrator` queries guarantee `STATEMENT` based replication will not cause duplicate errors, and the master-master setup will always achieve consistency.
- `orchestrator` will be able to recover from the death of a backend master even in the middle of running a recovery (the recovery will re-initiate on the alternate master)
- **Split brain is possible**. Depending on your setup, physical locations and type of proxy, different `orchestrator` service nodes may end up speaking to different backend `MySQL` servers. This can lead to two `orchestrator` services which both consider themselves "active", each running failovers independently, which would lead to topology corruption.

To access your `orchestrator` service you may speak to any healthy node.

Both these setups are well known to run in production for very large environments.
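
For illustration only, here is a minimal `HAProxy` sketch of the proxy described above. The `db1`/`db2` active-active master-master pair and the `haproxy_check` MySQL user are hypothetical assumptions, not part of this PR:

```
# Sketch: front the orchestrator backend's master-master pair.
# 'balance first' pins all traffic to the first available server.
listen orchestrator_backend_db
  bind 0.0.0.0:3306
  mode tcp
  balance first
  option mysql-check user haproxy_check  # assumed MySQL user for health checks
  server db1 db1.example.com:3306 check
  server db2 db2.example.com:3306 check
```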

### HA via shared backend

![orchestrator HA via shared backend](images/orchestrator-ha--shared-backend.png)

HA is achieved by highly available shared backend. Existing solutions are:

- Galera
- XtraDB Cluster
- InnoDB Cluster
- NDB Cluster

In all of the above the MySQL nodes run synchronous replication (using the common terminology).

Two variations exist:

- Your Galera/XtraDB Cluster/InnoDB Cluster runs with a single-writer node. Multiple `orchestrator` nodes will speak to the single writer DB, probably via proxy. If the writer DB fails, the backend cluster promotes a different DB as writer; it is up to your proxy to identify that and direct `orchestrator`'s traffic to the promoted server.

- Your Galera/XtraDB Cluster/InnoDB Cluster runs in multiple writers mode. A nice setup would couple each `orchestrator` node with a DB server (possibly on the very same box). Since replication is synchronous there is no split brain. Only one `orchestrator` node can ever be the leader, and that leader will only speak with a consensus of the DB nodes.

In this setup there could be a substantial amount of traffic between the MySQL nodes. In cross-DC setups this may imply larger commit latencies (each commit may need to travel cross DC).

To access your `orchestrator` service you may speak to any healthy node. It is advisable you speak only to the leader via proxy (use `/api/leader-check` as HTTP health check for your proxy).

The latter setup is known to run in production in a very large environment, on a `3`- or `5`-node setup.
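
As a sketch (hostnames are hypothetical; `orchestrator`'s default HTTP port `3000` is assumed), an `HAProxy` frontend that only routes to the leader, using `/api/leader-check` (which, per this PR, returns `200` on the leader and `404` otherwise):

```
# Sketch: route HTTP traffic to the orchestrator leader only.
listen orchestrator_http
  bind 0.0.0.0:80
  mode http
  option httpchk GET /api/leader-check
  balance first
  server orch1 orch1.example.com:3000 check
  server orch2 orch2.example.com:3000 check
  server orch3 orch3.example.com:3000 check
```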

### HA via raft

![orchestrator HA via raft](images/orchestrator-ha--raft.png)

`orchestrator` nodes will directly communicate via `raft` consensus algorithm. Each `orchestrator` node has its own private backend database. This can be `MySQL` or `sqlite`.
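
As a hedged sketch of what a raft node's configuration might look like: `RaftEnabled` is the flag introduced by this PR (commit 6327198); the remaining keys and all values are illustrative assumptions rather than authoritative documentation:

```json
{
  "RaftEnabled": true,
  "RaftBind": "10.0.0.1",
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftNodes": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db"
}
```

Each node lists all peers in `RaftNodes` and points at its own private backend via `BackendDB` (assumed semantics).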

Only one `orchestrator` node assumes leadership, and is always a part of a consensus. However all other nodes are independently active and are polling your topologies.

In this setup there is:
- No communication between the DB nodes.
- Minimal communication between the `orchestrator` nodes.
- `*n` communication to `MySQL` topology nodes. A `3` node setup means each topology `MySQL` server is probed by `3` different `orchestrator` nodes, independently.

It is recommended to run a `3`-node or a `5`-node setup.

`sqlite` is embedded within `orchestrator` and does not require an external dependency. `MySQL` outperforms `sqlite` on busy setups.

To access your `orchestrator` service you may **only** speak to the leader node. Use `/api/leader-check` as HTTP health check for your proxy.
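
For example (a sketch; node hostnames are hypothetical), the leader can be located by probing each node, and, per this PR, `/api/leader-check/:errorStatusCode` lets a proxy request a custom non-leader response code:

```shell
# Sketch: find the leader among three raft nodes.
for node in orch1 orch2 orch3; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://$node:3000/api/leader-check")
  [ "$code" = "200" ] && echo "leader: $node"
done

# Ask for 503 (rather than the default 404) when the node is not the leader:
curl -s -o /dev/null -w '%{http_code}\n' "http://orch1:3000/api/leader-check/503"
```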

![orchestrator HA via raft](images/orchestrator-ha--raft-proxy.png)


`orchestrator/raft` is a newer development, and is being tested in production at this time. Please read the [orchestrator/raft documentation](raft.md) for all implications.
Binary file added docs/images/orchestrator-ha--no-ha.png
Binary file added docs/images/orchestrator-ha--raft-proxy.png
Binary file added docs/images/orchestrator-ha--raft.png
Binary file added docs/images/orchestrator-ha--semi-ha.png
Binary file added docs/images/orchestrator-ha--shared-backend.png
Binary file added docs/images/orchestrator-ha-raft-vs-sync-repl.png
47 changes: 47 additions & 0 deletions docs/raft-vs-sync-repl.md
@@ -0,0 +1,47 @@
# orchestrator/raft vs. synchronous replication setup

This compares the deployment, behavior, limitations and benefits of two high availability deployment approaches: `orchestrator/raft` vs. `orchestrator/[galera|xtradb cluster|innodb cluster]`

We will assume and compare:

- `3` data-center setup (an _availability zone_ may count as a data-center)
- `3` node `orchestrator/raft` setup
- `3` node `orchestrator` on multi-writer `galera|xtradb cluster|innodb cluster` (each MySQL in cluster may accept writes)
- A proxy able to run `HTTP` or `mysql` health checks
- `MySQL`, `MariaDB` and `Percona Server` are all considered under the term `MySQL`.

![orchestrator HA via raft](images/orchestrator-ha-raft-vs-sync-repl.png)

| Compare | orchestrator/raft | synchronous replication backend |
| --- | --- | --- |
General wiring | Each `orchestrator` node has a private backend DB; `orchestrator` nodes communicate by `raft` protocol | Each `orchestrator` node connects to a different `MySQL` member in a synchronous replication group. `orchestrator` nodes do not communicate with each other.
Backend DB | `MySQL` or `SQLite` | `MySQL`
Backend DB dependency | Service panics if cannot access its own private backend DB | Service _unhealthy_ if cannot access the shared backend DB
DB data | Independent across DB backends. May vary, but on a stable system converges to same overall picture | Single dataset, synchronously replicated across DB backends.
DB access | Never write directly. Only `raft` nodes access the backend DB while coordinating/cooperating. Or else inconsistencies can be introduced. Reads are OK. | Possible to access & write directly; all `orchestrator` nodes/clients see exact same picture.
Leader and actions | Single leader. Only the leader runs recoveries. All nodes run discoveries (probing) and self-analysis | Single leader. Only the leader runs discoveries (probing), analysis and recoveries.
HTTP Access | Must only access the leader (should be enforced by proxy) | May access any healthy node (should be enforced by proxy). For read consistency always best to speak to leader only (can be enforced by proxy)
Command line | HTTP/API access (e.g. `curl`, `jq`) or `orchestrator-client` script which wraps common HTTP/API calls with familiar command line interface | HTTP/API, and/or `orchestrator-client` script, or `orchestrator ...` command line invocation.
Install | `orchestrator` service on service nodes only. `orchestrator-client` script anywhere (requires access to HTTP/API). | `orchestrator` service on service nodes. `orchestrator-client` script anywhere (requires access to HTTP/API). `orchestrator` client anywhere (requires access to backend DBs)
Proxy | HTTP. Must only direct traffic to the leader (`/api/leader-check`) | HTTP. Must only direct traffic to healthy nodes (`/api/status`); best to only direct traffic to leader node (`/api/leader-check`)
Cross DC | Each `orchestrator` node (along with private backend) can run on a different DC. Nodes do not communicate much, low traffic. | Each `orchestrator` node (along with associated backend) can run on a different DC. `orchestrator` nodes do not communicate directly. `MySQL` group replication is chatty. Amount of traffic mostly linear by size of topologies and by polling rate. Write latencies.
Probing | Each topology server probed by all `orchestrator` nodes | Each topology server probed by the single active node
Failure analysis | Performed independently by all nodes | Performed by leader only (DB is shared so all nodes see exact same picture anyhow)
Failover | Performed by leader node only | Performed by leader node only
Resiliency to failure | `1` node may go down (`2` on a `5` node cluster) | `1` node may go down (`2` on a `5` node cluster)
Node back from short failure | Node rejoins cluster, gets updated with changes. | DB node rejoins cluster, gets updated with changes.
Node back from long outage | DB must be cloned from healthy node. | Depends on your MySQL backend implementation. Potentially SST/restore from backup.
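
To illustrate the command line and proxy rows above (a sketch; URLs are hypothetical): `orchestrator-client` accepts multiple API endpoints and auto-detects the leader among them:

```shell
# Sketch: point orchestrator-client at all raft nodes; it auto-detects the leader.
export ORCHESTRATOR_API="http://orch1:3000/api http://orch2:3000/api http://orch3:3000/api"
orchestrator-client -c which-api  # prints the endpoint (the leader's) it selected
orchestrator-client -c clusters   # subsequent commands are routed to that leader
```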

### Considerations

Here are considerations for choosing between the two approaches:

- You only have a single data center (DC): pick shared DB or even a [simpler setup](high-availability.md)
- You are comfortable with Galera/XtraDB Cluster/InnoDB Cluster and have the automation to set them up and maintain them: pick shared DB backend.
- You have high-latency cross DC network: choose `orchestrator/raft`.
- You don't want to allocate MySQL servers for the `orchestrator` backend: choose `orchestrator/raft` with `SQLite` backend
- You have thousands of MySQL boxes: choose either, but choose the `MySQL` backend, which is more write-performant than `SQLite`.

### Notes

- Another synchronous replication setup is that of a single writer. This would require an additional proxy between the `orchestrator` nodes and the underlying cluster, and is not considered above.