WIP: orchestrator/raft[/sqlite] #183

Merged: 234 commits, Aug 3, 2017

Changes from 198 commits

Commits
ca9828f
Auto-merged master into sqlite on deployment
Mar 15, 2017
b7887c6
Merge branch 'master' into sqlite
Mar 15, 2017
eddd044
Merge branch 'master' into sqlite
Apr 21, 2017
7bcdf62
Merge branch 'master' into sqlite
May 8, 2017
e9fd330
Merge branch 'master' into sqlite
May 11, 2017
364ca00
Merge branch 'master' into sqlite
May 11, 2017
89c8ffd
Merge branch 'master' into sqlite
May 21, 2017
9723606
orchestrator/raft[/sqlite]
May 22, 2017
30a1c77
added dependencies
May 22, 2017
d1062ca
updated .gotignore
May 23, 2017
6327198
adding and supporting RaftEnabled config
May 23, 2017
be36d0a
trying to fight 'database is locked' scenario
May 23, 2017
bb2c5c6
refactored IsLeader()
May 23, 2017
bbe9a3d
trying to fight 'database is locked' scenario
May 23, 2017
3f84a4d
adding /leader-check API; 200 when leader, 404 when not leader
May 23, 2017
6ab50e7
raft nodes get to run cleanups and everything; just not recoveries
May 23, 2017
61cb18b
unauthorizing API for non-leader raft members
May 23, 2017
2be3e0d
trying to fight 'database is locked' scenario
May 23, 2017
7257b97
trying to fight 'database is locked' scenario
May 23, 2017
5f00ea5
trying to fight 'database is locked' scenario
May 23, 2017
1547a71
experiment: no hostname resolve on sqlite
May 23, 2017
e163616
trying to fight 'database is locked' scenario
May 23, 2017
ed27c1a
driving away hostname resolve DB concurrency issues
May 23, 2017
3daa060
election is meaningless when raft enabled; health checks consider raft
May 23, 2017
82730e4
minor cleanup
May 23, 2017
0959398
refactoring: split health_dao into health, health_dao
May 23, 2017
8a74760
simplifying Once operations
May 24, 2017
13a4949
Merge branch 'master' into raft-sqlite
May 28, 2017
939107c
more structured NodeHealth registration
May 28, 2017
d3885ab
gofmt
May 28, 2017
f0f391f
sending health command to group
May 28, 2017
9ae907b
adding appliers
May 28, 2017
2aa0afb
attempt to allow writing commands from followers
May 28, 2017
19d574c
reverting follower updates
May 28, 2017
6a354b8
api's /discover to apply to group
May 28, 2017
399b3d1
simplified message distribution
May 28, 2017
214eff6
refactoring, simplifying raft commands
May 28, 2017
d74c59d
Merge branch 'master' into raft-sqlite
May 28, 2017
216c1ad
fixed node health hostanme
May 28, 2017
2fe95f1
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
May 28, 2017
3064873
improved nodHealth.Update()
May 28, 2017
79487a7
begin/end downtime via raft
May 29, 2017
d4636c9
begin/end downtime via raft
May 29, 2017
7f18f6a
health doesn't go through raft
May 29, 2017
fdadb1a
candidate instance / promotion rules via raft
May 29, 2017
c78b608
unused code cleanup
May 29, 2017
acc8f59
acknowledge, hostname unresolve: in raft
May 29, 2017
cad04eb
the typo is there, so I have to keep it
May 29, 2017
f2a066d
experimenting with low snapshot threshold
May 29, 2017
809f04c
fixed register-hostname-unresolve
May 29, 2017
c29f9ae
not applying expired hostname unresolve
May 29, 2017
3550b6e
more aggressive maintenance purging
May 29, 2017
6ac786e
more snapshot debug info
May 29, 2017
8c76b03
aggressive snapshots
May 29, 2017
3a667be
not checking response
May 29, 2017
100d913
submit-pool-instances via raft
May 29, 2017
0cbfd6a
support for tx.Prepare with sqlite3 dialect; pool_dao to use this
May 29, 2017
9546bf3
failure detection via raft
Jun 4, 2017
99f01cb
Merge branch 'master' into raft-sqlite
Jun 4, 2017
18942ad
fixed compilation errors
Jun 4, 2017
183d02d
Auto-merged master into raft-sqlite on deployment
Jun 5, 2017
dda394b
Merge branch 'master' into raft-sqlite
Jun 5, 2017
b319d44
Merge branch 'master' into raft-sqlite
Jun 5, 2017
247cbdd
writing TopologyRecoveryStep via raft
Jun 5, 2017
c25233e
Merge branch 'master' into raft-sqlite
Jun 11, 2017
595f8c8
merged detection refactoring
Jun 11, 2017
fe02b58
raft: write-recovery command. Writing recoveries and recovery steps u…
Jun 12, 2017
2fb7aef
improved PrettyUniqueToken()
Jun 13, 2017
3f8c585
forget-instances cache
Jun 14, 2017
d557fc7
aborting instance discovery if is forgotten
Jun 14, 2017
70550f2
more aggressive forgetting
Jun 14, 2017
d09efb8
adding /api/instances
Jun 14, 2017
ae2f213
debug messages for skipping discovery on forgotten instances
Jun 14, 2017
e795909
since we rely on relational backend, and do not actually generate sna…
Jun 14, 2017
9979b43
DiscardSnapshotStore doesn't seem to work as we expect
Jun 14, 2017
6514162
creating relational snapshots (no state, only meta)
Jun 15, 2017
e806adb
Merge branch 'master' into raft-sqlite
Jun 15, 2017
b31bc75
working rel-snapshot (with way too much debug info, will rip)
Jun 15, 2017
c5777fd
ripped debug messages
Jun 15, 2017
48f39d1
more frequent snapshots
Jun 15, 2017
34aeeee
JS var safety
Jun 15, 2017
f435b37
JS var safety
Jun 15, 2017
26f7eff
JS var safety
Jun 15, 2017
c343199
JS var safety
Jun 15, 2017
bb6d15d
Merge branch 'api-replication-analysis' into raft-sqlite
Jun 15, 2017
222e5b5
disabling Reelect on raft
Jun 15, 2017
6e02046
working towards analysis concensus
Jun 17, 2017
0c5f875
Merge branch 'master' into raft-sqlite
Jun 18, 2017
b8347b2
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
Jun 18, 2017
17e9598
marking snapshot's created_at
Jun 18, 2017
d7e85a6
more snapshot history
Jun 21, 2017
bc646c3
removed 'heartbeat' message that was mostly used for testing
Jun 21, 2017
f269d48
raft leader StepDown()
Jun 21, 2017
fd79c46
restoring app-level heartbeat, just to avoid complete desolation of e…
Jun 22, 2017
8bd8607
resolved merge conflict
Jun 22, 2017
bf2c3f9
Merge branch 'master' into raft-sqlite
Jun 22, 2017
9accb43
gofmt
Jun 22, 2017
1e83ee2
Merge branch 'master' into raft-sqlite
Jun 24, 2017
e630fd3
Merge branch 'master' into raft-sqlite
Jun 25, 2017
b194b25
Merge branch 'master' into raft-sqlite
Jun 27, 2017
47e24a2
merging with master, resolving conflicts
Jun 27, 2017
5114002
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
Jun 27, 2017
1485251
merge master
Jun 27, 2017
394850b
Merge branch 'master' into raft-sqlite
Jun 29, 2017
62a805e
Merge branch 'master' into raft-sqlite
Jun 29, 2017
31311c4
Merge branch 'master' into raft-sqlite
Jun 29, 2017
c512925
Merge branch 'master' into raft-sqlite
Jul 2, 2017
f6a72cb
preparation for raft-yield
Jul 2, 2017
5075d93
more preparation for 'yield', also 'raft-peers'
Jul 2, 2017
48c1289
yield
Jul 2, 2017
79d0b27
more yield logging
Jul 2, 2017
da54632
more yield logging
Jul 2, 2017
8c7f56d
fixed yield logic flow
Jul 2, 2017
76a5ca5
mroe time for yielded-to peer to become leader
Jul 2, 2017
5a2c98b
Merge branch 'master' into raft-sqlite
Jul 3, 2017
87f8641
gofmt
Jul 3, 2017
2f33a77
API fixes
Jul 3, 2017
3ea4cd7
Merge branch 'master' into raft-sqlite
Jul 5, 2017
b483f30
all-instances to use SearchInstances()
Jul 5, 2017
4dadfe5
removed use of rlike
Jul 5, 2017
3c0cd3d
Merge branch 'master' into raft-sqlite
Jul 9, 2017
52567ba
normalize yield host
Jul 9, 2017
9835e28
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
Jul 9, 2017
da56f89
normalize yield host
Jul 9, 2017
efde4c9
support /api/raft-state
Jul 9, 2017
4818b45
Merge branch 'master' into raft-sqlite
Jul 10, 2017
c065383
no need to authorize raft-state and raft-peers
Jul 10, 2017
a77c63a
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
Jul 10, 2017
5e16c9d
raft state as string
Jul 10, 2017
4954d3e
standard 'which' invocation in orchestrator-client
Jul 10, 2017
61eb056
Auto-merged master into raft-sqlite on deployment
Jul 10, 2017
d3c97a3
Merge branch 'master' into raft-sqlite
Jul 10, 2017
b5aceb4
Merge branch 'master' into raft-sqlite
Jul 11, 2017
abfce5a
in preparation for failure detection concensus analysis
Jul 11, 2017
348e25d
Merge branch 'raft-sqlite' of github.com:github/orchestrator into raf…
Jul 11, 2017
28833df
experimenting with nodes joining the group without logs
Jul 11, 2017
37a3e67
experiment done
Jul 11, 2017
52dbee8
forget-cluster
Jul 11, 2017
8dda655
more aggressive filtering our of forgotten instances
Jul 11, 2017
1eb07d8
even more aggressive period forgetting of an instance
Jul 11, 2017
b8d1eff
raft non-leader skips detection hooks
Jul 12, 2017
6684e51
yield by hint
Jul 12, 2017
c5c7605
Auto-merged master into raft-sqlite on deployment
Jul 13, 2017
6c5db36
raft leader control in orchestrator-client
Jul 16, 2017
dd59aee
fixed jq .
Jul 16, 2017
3b8dcca
orchestrator-client error reported to stderr
Jul 16, 2017
89d2245
CLI forbidden when RaftEnabled is true
Jul 16, 2017
0a06730
initial work to document orchestrator/raft
Jul 17, 2017
e158840
updated to ha doc
Jul 17, 2017
21e4c29
updated to ha doc
Jul 17, 2017
852c518
preparing documentation for shared backend vs. raft deployments
Jul 17, 2017
538b717
smaller images
Jul 18, 2017
3628a96
temporarily removing images
Jul 18, 2017
0b499de
returning images
Jul 18, 2017
0ea0041
metrics are independent of graphite. raft repots is_elected
Jul 18, 2017
c05cfbe
Merge branch 'master' into raft-sqlite
Jul 19, 2017
8175f04
Merge branch 'master' into raft-sqlite
Jul 23, 2017
f57e4e6
Merge branch 'master' into raft-sqlite
Jul 23, 2017
0264ed7
investigating installSnapshot
Jul 23, 2017
86af39d
investigating installSnapshot
Jul 23, 2017
7fc59b3
investigating installSnapshot
Jul 23, 2017
daffe85
investigating installSnapshot
Jul 23, 2017
63c860a
investigating installSnapshot
Jul 23, 2017
04e734e
investigating installSnapshot
Jul 23, 2017
ccc2590
investigating installSnapshot
Jul 23, 2017
c8d01da
investigating installSnapshot
Jul 23, 2017
9d7d39b
investigating installSnapshot
Jul 23, 2017
1868780
investigating installSnapshot
Jul 23, 2017
46fee40
investigating installSnapshot
Jul 23, 2017
ef4971f
investigating installSnapshot
Jul 23, 2017
1b1e134
investigating installSnapshot
Jul 23, 2017
d2063b2
investigating installSnapshot
Jul 23, 2017
f0f55f1
investigation cleanup
Jul 23, 2017
fa2511e
investigation cleanup
Jul 23, 2017
cb16704
investigation cleanup
Jul 23, 2017
e5c86bd
investigation cleanup
Jul 23, 2017
015d3e0
fatal on continuous health check fails
Jul 24, 2017
de1dd4d
updated images for high availability page
Jul 25, 2017
edfe3de
updated images for high availability page
Jul 25, 2017
4f9260f
updated images for high availability page
Jul 25, 2017
4f7b317
raft documentation
Jul 26, 2017
457bb05
raft documentation
Jul 26, 2017
2794b9d
api cheatsheet
Jul 26, 2017
28d98f6
cli noop is allowed in raft mode
Jul 26, 2017
51d99b3
doc updates
Jul 26, 2017
f574382
doc updates
Jul 26, 2017
074a573
doc updates
Jul 26, 2017
f2ac76d
doc updates
Jul 26, 2017
2a1e707
doc updates
Jul 26, 2017
d667525
raft vs synchronous replication
Jul 26, 2017
5f383ba
raft vs synchronous replication
Jul 27, 2017
9e23ba2
raft vs synchronous replication
Jul 27, 2017
2e816ed
raft vs synchronous replication
Jul 27, 2017
1eaa2aa
raft vs synchronous replication
Jul 27, 2017
66d3af3
raft vs synchronous replication
Jul 27, 2017
20133e0
raft vs synchronous replication
Jul 27, 2017
48552e5
raft vs synchronous replication
Jul 27, 2017
49fae62
doc updates
Jul 27, 2017
83ef4f5
fixing response code doc
Jul 27, 2017
f21e2f3
doc updates
Jul 30, 2017
7b4c16a
accept /api/leader-check/:errorStatusCode
Jul 30, 2017
5cc13de
updated doc with /api/leader-check/:errorStatusCode
Jul 30, 2017
328db3d
ack-recovery supports cluster hint
Jul 31, 2017
3b86bf2
orchestrator-client supports '-c api' generic command
Jul 31, 2017
2985b58
api, orchestrator-client both support 'raft=leader'
Jul 31, 2017
274dee4
orchestrator-client supports multiple endpoints, then auto-figures le…
Jul 31, 2017
9e3605a
Merge branch 'master' into raft-sqlite
Aug 1, 2017
4c5ddc3
Merge branch 'master' into raft-sqlite
Aug 1, 2017
d605b0c
gofmt
Aug 1, 2017
938be29
supporting /api/master/:clusterHint
Aug 1, 2017
2b4ac14
raft support for enable/disable global recoveries
Aug 1, 2017
395ba08
AcknowledgeClusterRecoveries: also acknowledging by alias
Aug 1, 2017
567b188
orchestrator-client: support -c which-api
Aug 1, 2017
8ca48db
document orchestrator-client multi-urls
Aug 1, 2017
987cacc
renamed orchestrator-client doc page
Aug 1, 2017
cb68852
orchestrator-client documentation
Aug 1, 2017
5f6807b
orchestrator-client documentation
Aug 1, 2017
f7f7b89
no-proxy options
Aug 1, 2017
efdb7bd
no-proxy options
Aug 1, 2017
8b61d0d
no-proxy options
Aug 1, 2017
a927624
config: raft
Aug 1, 2017
0bcc379
config: sqlite
Aug 1, 2017
fc4d61f
config: sqlite
Aug 1, 2017
07370e1
config: raft
Aug 1, 2017
76b9ca2
config: raft
Aug 1, 2017
84d6146
config: raft
Aug 1, 2017
0e7cda6
orchestrator-client -c which-cluster-master
Aug 2, 2017
28772a1
non empty test
Aug 2, 2017
4bc014d
PeerAPI
Aug 2, 2017
d3fa491
refactored raft-http
Aug 2, 2017
6af2f81
refactored raft-http
Aug 2, 2017
d63b68c
tracking and exporting count pending recoveries
Aug 2, 2017
a5da401
experimenting observation
Aug 2, 2017
7307957
observer experiment completed
Aug 2, 2017
1 change: 1 addition & 0 deletions .gitignore
@@ -10,4 +10,5 @@ vagrant/db4-post-install.sh
 vagrant/vagrant-ssh-key
 vagrant/vagrant-ssh-key.pub
 Godeps/_workspace
+.gopath/
 main
126 changes: 85 additions & 41 deletions docs/high-availability.md
@@ -1,57 +1,101 @@
# Orchestrator High Availability

-`Orchestrator` makes your MySQL topologies available, but what makes `orchestrator` highly available?
+`orchestrator` runs as a highly available service. This document lists the various ways of achieving HA for `orchestrator`, as well as less/not highly available setups.

-Before drilling down into the details, we should first observe that orchestrator is a service that runs with a MySQL backend. Thus, we need to substantiate the HA of both these components, as well as the continued correctness in the failover process of either of the two or of both.
+### TL;DR ways to get HA

-### High Availability of the Orchestrator service
+HA is achieved by choosing either:

-`Orchestrator` runs as a service and its configuration needs to
-reference a MySQL backend. You can quite easily add more orchestrator
-applications probably running on different hosts to provide redundancy.
-These servers would have an identical configuration. `Orchestrator`
-uses the database to record the different applications which are
-running and through it will allow an election process to choose one
-of the processes to be the active node. If that process fails the
-remaining processes will notice and shortly afterwards choose a new
-active node. The active node is the one which periodically checks
-each of the MySQL servers being monitored to determine if they are
-healthy. If it detects a failure it will recover the topology if
-so configured.
+- `orchestrator/raft` setup, where `orchestrator` nodes communicate by raft consensus. Each `orchestrator` node [has a private database backend](#ha-via-raft), either `MySQL` or `sqlite`. See also the [orchestrator/raft documentation](raft.md).
+- [Shared backend](#ha-via-shared-backend) setup. Multiple `orchestrator` nodes all talk to the same backend, which may be a Galera/XtraDB Cluster/InnoDB Cluster/NDB Cluster. Synchronization is done at the database level.

-If you use the web interface to look at the topology information
-or to relocate replicas within a cluster and you have multiple
-orchestrator processes running then you need to use a load balancer
-to provide the redundant web service through a common URL. Requests
-through the load balancer may not hit the active node but that is
-not an issue as any of the running processes can serve the web
-requests.
+See also: [orchestrator/raft vs. synchronous replication setup](raft-vs-sync-repl.md)

-### High Availability of the Orchestrator backend database
+### Availability types

-At this time `Orchestrator` relies on a MySQL backend. The state of the clusters is persisted to tables and is queried via SQL. It is worth considering the following:
+You may choose different availability types, based on your requirements.

-- The backend database is very small, and is linear with your number of servers. For most setups it's a matter of a few MB and well below 1GB, depending on history configuration, rate of polling etc. This easily allows for a fully in-memory database even on simplest machines.
+- No high availability: easiest, simplest setup, good for testing or dev setups. Can use `MySQL` or `sqlite`.
+- Semi HA: backend is based on normal MySQL replication. `orchestrator` does not eat its own dog food and cannot fail over its own backend.
+- HA: as depicted above; support for no single point of failure. Different solutions have different tradeoffs in terms of resource utilization, supported software, and type of client access.

-- Write rate is dependent on the frequency the MySQL hosts are polled and the number of servers involved. For most orchestrator installations the write rate is low.
+All options are discussed below.

-To that extent you may use one of the following solutions in order to make the backend database highly available:
+### No high availability

-- 2-node MySQL Cluster
-This is a synchronous solution; anything you write on one node is guaranteed to exist on the second. Data is available and up to date even in the face of a death of one server.
-Suggestion: abstract via HAProxy with `first` load-balancing algorithm.
-> NOTE: right now table creation explicitly creates tables using InnoDB engine; you may `ALTER TABLE ... ENGINE=NDB`
+![orchestrator no HA](images/orchestrator-ha--no-ha.png)

-- 3-node Galera/XtraDB cluster
-This is a synchronous solution; anything you write on one node is guaranteed to exist on both other servers.
-Galera is eventually consistent.
-Data is available and up to date even in the face of a death of one server.
-Suggestion: abstract via HAProxy with `first` load-balancing algorithm.
+This setup is good for CI testing, for local dev machines or other experiments. It is a single-`orchestrator` node with a single DB backend.

-- MySQL Group Replication
-This is similar to the MySQL Cluster (but uses the InnoDB engine) or the Galera/XtraDB cluster and is available in MySQL 5.7.17 (December 2016) and later.
-Similar considerations apply as for the previous two options.
+The DB backend may be a `MySQL` server or it may be a `sqlite` DB, bundled with `orchestrator` (no dependencies, no additional software required).

-- 2-node active-passive master-master configuration
+### Semi HA

-> NOTE: there has been an initial discussion on supporting Consul/etcd as backend datastore; there is no pending work on that at this time.
![orchestrator semi HA](images/orchestrator-ha--semi-ha.png)

This setup provides semi HA for `orchestrator`. Two variations are available:

- Multiple `orchestrator` nodes talk to the same backend database. HA of the `orchestrator` services is achieved. However, HA of the backend database is not achieved. The backend database may be a `master` with replicas, but `orchestrator` is unable to eat its own dog food and fail over its very own backend DB.

If the backend `master` dies, it takes someone or something else to fail over the `orchestrator` service onto a promoted replica.

- Multiple `orchestrator` services all talk to a proxy server, which load balances an active-active `MySQL` master-master setup with `STATEMENT` based replication.

- The proxy always directs to the same server (e.g. `first` algorithm for `HAProxy`) unless that server is dead.
- Death of the active master causes `orchestrator` to talk to the other master, which may be somewhat behind. `orchestrator` will typically self-reapply the missing changes by nature of its continuous discovery.
- `orchestrator` queries guarantee `STATEMENT` based replication will not cause duplicate errors, and the master-master setup will always achieve consistency.
- `orchestrator` will be able to recover from the death of a backend master even in the middle of running a recovery (the recovery will re-initiate on the alternate master)
- **Split brain is possible**. Depending on your setup, physical locations and type of proxy, different `orchestrator` service nodes may end up speaking to different backend `MySQL` servers. This can lead to two `orchestrator` services which both consider themselves "active", each running failovers independently, which would lead to topology corruption.

To access your `orchestrator` service you may speak to any healthy node.

Both these setups are well known to run in production for very large environments.
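
For illustration only, here is a minimal `HAProxy` sketch of the proxy described above. The `db1`/`db2` active-active master-master pair and the `haproxy_check` MySQL user are hypothetical assumptions, not part of this PR:

```
# Sketch: front the orchestrator backend's master-master pair.
# 'balance first' pins all traffic to the first available server.
listen orchestrator_backend_db
  bind 0.0.0.0:3306
  mode tcp
  balance first
  option mysql-check user haproxy_check  # assumed MySQL user for health checks
  server db1 db1.example.com:3306 check
  server db2 db2.example.com:3306 check
```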

### HA via shared backend

![orchestrator HA via shared backend](images/orchestrator-ha--shared-backend.png)

HA is achieved by highly available shared backend. Existing solutions are:

- Galera
- XtraDB Cluster
- InnoDB Cluster
- NDB Cluster

In all of the above the MySQL nodes run synchronous replication (using the common terminology).

Two variations exist:

- Your Galera/XtraDB Cluster/InnoDB Cluster runs with a single-writer node. Multiple `orchestrator` nodes will speak to the single writer DB, probably via proxy. If the writer DB fails, the backend cluster promotes a different DB as writer; it is up to your proxy to identify that and direct `orchestrator`'s traffic to the promoted server.

- Your Galera/XtraDB Cluster/InnoDB Cluster runs in multiple writers mode. A nice setup would couple each `orchestrator` node with a DB server (possibly on the very same box). Since replication is synchronous there is no split brain. Only one `orchestrator` node can ever be the leader, and that leader will only speak with a consensus of the DB nodes.

In this setup there could be a substantial amount of traffic between the MySQL nodes. In cross-DC setups this may imply larger commit latencies (each commit may need to travel cross DC).

To access your `orchestrator` service you may speak to any healthy node. It is advisable you speak only to the leader via proxy (use `/api/leader-check` as HTTP health check for your proxy).

The latter setup is known to run in production in a very large environment, on a `3`- or `5`-node setup.
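
As a sketch (hostnames are hypothetical; `orchestrator`'s default HTTP port `3000` is assumed), an `HAProxy` frontend that only routes to the leader, using `/api/leader-check` (which, per this PR, returns `200` on the leader and `404` otherwise):

```
# Sketch: route HTTP traffic to the orchestrator leader only.
listen orchestrator_http
  bind 0.0.0.0:80
  mode http
  option httpchk GET /api/leader-check
  balance first
  server orch1 orch1.example.com:3000 check
  server orch2 orch2.example.com:3000 check
  server orch3 orch3.example.com:3000 check
```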

### HA via raft

![orchestrator HA via raft](images/orchestrator-ha--raft.png)

`orchestrator` nodes will directly communicate via `raft` consensus algorithm. Each `orchestrator` node has its own private backend database. This can be `MySQL` or `sqlite`.
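
As a hedged sketch of what a raft node's configuration might look like: `RaftEnabled` is the flag introduced by this PR (commit 6327198); the remaining keys and all values are illustrative assumptions rather than authoritative documentation:

```json
{
  "RaftEnabled": true,
  "RaftBind": "10.0.0.1",
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftNodes": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db"
}
```

Each node lists all peers in `RaftNodes` and points at its own private backend via `BackendDB` (assumed semantics).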

Only one `orchestrator` node assumes leadership, and is always a part of a consensus. However all other nodes are independently active and are polling your topologies.

In this setup there is:
- No communication between the DB nodes.
- Minimal communication between the `orchestrator` nodes.
- `*n` communication to `MySQL` topology nodes. A `3` node setup means each topology `MySQL` server is probed by `3` different `orchestrator` nodes, independently.

It is recommended to run a `3`-node or a `5`-node setup.

`sqlite` is embedded within `orchestrator` and does not require an external dependency. `MySQL` outperforms `sqlite` on busy setups.

To access your `orchestrator` service you may **only** speak to the leader node. Use `/api/leader-check` as HTTP health check for your proxy.
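
For example (a sketch; node hostnames are hypothetical), the leader can be located by probing each node, and, per this PR, `/api/leader-check/:errorStatusCode` lets a proxy request a custom non-leader response code:

```shell
# Sketch: find the leader among three raft nodes.
for node in orch1 orch2 orch3; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://$node:3000/api/leader-check")
  [ "$code" = "200" ] && echo "leader: $node"
done

# Ask for 503 (rather than the default 404) when the node is not the leader:
curl -s -o /dev/null -w '%{http_code}\n' "http://orch1:3000/api/leader-check/503"
```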

![orchestrator HA via raft](images/orchestrator-ha--raft-proxy.png)


`orchestrator/raft` is a newer development, and is being tested in production at this time. Please read the [orchestrator/raft documentation](raft.md) for all implications.
Binary file added docs/images/orchestrator-ha--no-ha.png
Binary file added docs/images/orchestrator-ha--raft-proxy.png
Binary file added docs/images/orchestrator-ha--raft.png
Binary file added docs/images/orchestrator-ha--semi-ha.png
Binary file added docs/images/orchestrator-ha--shared-backend.png
Binary file added docs/images/orchestrator-ha-raft-vs-sync-repl.png
47 changes: 47 additions & 0 deletions docs/raft-vs-sync-repl.md
@@ -0,0 +1,47 @@
# orchestrator/raft vs. synchronous replication setup

This compares the deployment, behavior, limitations and benefits of two high availability deployment approaches: `orchestrator/raft` vs. `orchestrator/[galera|xtradb cluster|innodb cluster]`

We will assume and compare:

- `3` data-center setup (an _availability zone_ may count as a data-center)
- `3` node `orchestrator/raft` setup
- `3` node `orchestrator` on multi-writer `galera|xtradb cluster|innodb cluster` (each MySQL in cluster may accept writes)
- A proxy able to run `HTTP` or `mysql` health checks
- `MySQL`, `MariaDB` and `Percona Server` are all considered under the term `MySQL`.

![orchestrator HA via raft](images/orchestrator-ha-raft-vs-sync-repl.png)

| Compare | orchestrator/raft | synchronous replication backend |
| --- | --- | --- |
General wiring | Each `orchestrator` node has a private backend DB; `orchestrator` nodes communicate by `raft` protocol | Each `orchestrator` node connects to a different `MySQL` member in a synchronous replication group. `orchestrator` nodes do not communicate with each other.
Backend DB | `MySQL` or `SQLite` | `MySQL`
Backend DB dependency | Service panics if cannot access its own private backend DB | Service _unhealthy_ if cannot access the shared backend DB
DB data | Independent across DB backends. May vary, but on a stable system converges to same overall picture | Single dataset, synchronously replicated across DB backends.
DB access | Never write directly. Only `raft` nodes access the backend DB while coordinating/cooperating. Or else inconsistencies can be introduced. Reads are OK. | Possible to access & write directly; all `orchestrator` nodes/clients see exact same picture.
Leader and actions | Single leader. Only the leader runs recoveries. All nodes run discoveries (probing) and self-analysis | Single leader. Only the leader runs discoveries (probing), analysis and recoveries.
HTTP Access | Must only access the leader (should be enforced by proxy) | May access any healthy node (should be enforced by proxy). For read consistency always best to speak to leader only (can be enforced by proxy)
Command line | HTTP/API access (e.g. `curl`, `jq`) or `orchestrator-client` script which wraps common HTTP/API calls with familiar command line interface | HTTP/API, and/or `orchestrator-client` script, or `orchestrator ...` command line invocation.
Install | `orchestrator` service on service nodes only. `orchestrator-client` script anywhere (requires access to HTTP/API). | `orchestrator` service on service nodes. `orchestrator-client` script anywhere (requires access to HTTP/API). `orchestrator` client anywhere (requires access to backend DBs)
Proxy | HTTP. Must only direct traffic to the leader (`/api/leader-check`) | HTTP. Must only direct traffic to healthy nodes (`/api/status`); best to only direct traffic to leader node (`/api/leader-check`)
Cross DC | Each `orchestrator` node (along with private backend) can run on a different DC. Nodes do not communicate much, low traffic. | Each `orchestrator` node (along with associated backend) can run on a different DC. `orchestrator` nodes do not communicate directly. `MySQL` group replication is chatty. Amount of traffic mostly linear by size of topologies and by polling rate. Write latencies.
Probing | Each topology server probed by all `orchestrator` nodes | Each topology server probed by the single active node
Failure analysis | Performed independently by all nodes | Performed by leader only (DB is shared so all nodes see exact same picture anyhow)
Failover | Performed by leader node only | Performed by leader node only
Resiliency to failure | `1` node may go down (`2` on a `5` node cluster) | `1` node may go down (`2` on a `5` node cluster)
Node back from short failure | Node rejoins cluster, gets updated with changes. | DB node rejoins cluster, gets updated with changes.
Node back from long outage | DB must be cloned from healthy node. | Depends on your MySQL backend implementation. Potentially SST/restore from backup.
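
To illustrate the command line and proxy rows above (a sketch; URLs are hypothetical): `orchestrator-client` accepts multiple API endpoints and auto-detects the leader among them:

```shell
# Sketch: point orchestrator-client at all raft nodes; it auto-detects the leader.
export ORCHESTRATOR_API="http://orch1:3000/api http://orch2:3000/api http://orch3:3000/api"
orchestrator-client -c which-api  # prints the endpoint (the leader's) it selected
orchestrator-client -c clusters   # subsequent commands are routed to that leader
```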

### Considerations

Here are considerations for choosing between the two approaches:

- You only have a single data center (DC): pick shared DB or even a [simpler setup](high-availability.md)
- You are comfortable with Galera/XtraDB Cluster/InnoDB Cluster and have the automation to set them up and maintain them: pick shared DB backend.
- You have high-latency cross DC network: choose `orchestrator/raft`.
- You don't want to allocate MySQL servers for the `orchestrator` backend: choose `orchestrator/raft` with `SQLite` backend
- You have thousands of MySQL boxes: choose either, but choose the `MySQL` backend, which is more write-performant than `SQLite`.

### Notes

- Another synchronous replication setup is that of a single writer. This would require an additional proxy between the `orchestrator` nodes and the underlying cluster, and is not considered above.