diff --git a/docs/Manual/Deployment/Kubernetes/Drain.md b/docs/Manual/Deployment/Kubernetes/Drain.md
new file mode 100644
index 000000000..6e5a277fb
--- /dev/null
+++ b/docs/Manual/Deployment/Kubernetes/Drain.md
@@ -0,0 +1,440 @@
# Draining Kubernetes nodes

{% hint 'danger' %}
If Kubernetes nodes with ArangoDB pods on them are drained without care,
data loss can occur! The recommended procedure is described below.
{% endhint %}

For maintenance work in k8s it is sometimes necessary to drain a k8s node,
which means removing all pods from it. Kubernetes offers a standard API
for this, and our operator supports it to the best of its ability.

Draining nodes is easy enough for stateless services, which can simply be
re-launched on any other node. For a stateful service, however, this
operation is more difficult and therefore more costly, and it carries
certain risks if it is not done carefully enough. Put simply, the operator
must first move all the data stored on the node (which could be on a
locally attached disk) to another machine before it can shut down the pod
gracefully. Moving data takes time, and even after the move, the
distributed system ArangoDB has to recover from this change, for example
by ensuring data synchronicity between the replicas in their new location.

Therefore, a systematic drain of all k8s nodes in sequence has to follow a
careful procedure, in particular to ensure that ArangoDB is ready to move
on to the next step. This is necessary to avoid catastrophic data loss and
is simply the price one pays for running a stateful service.

## Anatomy of a drain procedure in k8s: the grace period

When a `kubectl drain` operation is triggered for a node, k8s first checks
if there are any pods with local data on disk. Our ArangoDB pods have this
property (the _Coordinators_ use `EmptyDir` volumes, and _Agents_ and
_DBServers_ could have persistent volumes which are actually stored on a
locally attached disk), so one has to override this check with the
`--delete-local-data=true` option.

Furthermore, quite often the node will contain pods which are managed by a
`DaemonSet` (which is not the case for ArangoDB), which makes it necessary
to override this check with the `--ignore-daemonsets=true` option.

Finally, k8s checks if the node has any pods which are not managed by
anything, either by k8s itself (`ReplicationController`, `ReplicaSet`,
`Job`, `DaemonSet` or `StatefulSet`) or by an operator. If this is the
case, the drain operation is refused unless one uses the option
`--force=true`. Since the ArangoDB operator manages our pods, we do not
have to use this option for ArangoDB, but you might have to use it for
other pods.

If all these checks have been passed, k8s proceeds as follows: all pods
are notified about this event and are put into a `Terminating` state.
During this time, they have a chance to take action, or indeed the
operator managing them does. In particular, although the pods get
termination notices, they can keep running until the operator has removed
all _finalizers_. This gives the operator a chance to sort things out, in
our case to move data away from the pod.

However, there is a limit to this tolerance by k8s, and that is the grace
period. If the grace period has passed but the pod has not actually
terminated, it is killed the hard way. If this happens, the operator has
no choice but to remove the pod and drop its persistent volume claim and
persistent volume. This will obviously lead to a failure incident in
ArangoDB and must be handled by its fail-over management. Therefore,
**this event should be avoided**.
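
As described above, the ArangoDB pods keep running after the termination
notice until the operator has removed their _finalizers_. If you want to
see this mechanism at work, you can inspect the finalizers of one of the
pods. The following is only an illustration; the pod name and namespace
are examples (the pod name is the one used later in this document) and
have to be adjusted to your own deployment:

```bash
# Show the finalizers the operator has set on an ArangoDB pod
# (pod name and namespace are examples only, adjust to your deployment):
kubectl get pod my-arangodb-cluster-prmr-wbsq47rz-5676ed \
  --namespace default \
  -o jsonpath='{.metadata.finalizers}'
```

As long as this list is not empty, the pod is not removed, as explained
above, unless the grace period runs out first.
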
## Things to check in ArangoDB before a node drain

There are basically two things one should check in an ArangoDB cluster
before a node drain operation can be started:

 1. All cluster nodes are up and running and healthy.
 2. For all collections and shards, all configured replicas are in sync.

{% hint 'warning' %}
If any cluster node is unhealthy, there is an increased risk that the
system does not have enough resources to cope with a failure situation.

If any shard replicas are not currently in sync, then there is a serious
risk that the cluster is currently not as resilient as expected.
{% endhint %}

One way to verify these two things is via the ArangoDB web interface.
Node health can be monitored in the _Overview_ tab under _NODES_:

![Cluster Health Screen](./HealthyCluster.png)

**Check that all nodes are green** and that there is **no node error** in the
top right corner.

As for the shards being in sync, see the _Shards_ tab under _NODES_:

![Shard Screen](./ShardsInSync.png)

**Check that all collections have a green check mark** on the right side.
If any collection does not have such a check mark, you can click on the
collection and see the details about its shards. Please keep in mind that
this has to be done **for each database** separately!

Obviously, this can be tedious and calls for automation. Therefore, there
are APIs for this. The first one is [Cluster Health](../../../HTTP/Cluster/Health.html):

```
GET /_admin/cluster/health
```

… which returns a JSON document looking like this:

```JSON
{
  "Health": {
    "CRDN-rxtu5pku": {
      "Endpoint": "ssl://my-arangodb-cluster-coordinator-rxtu5pku.my-arangodb-cluster-int.default.svc:8529",
      "LastAckedTime": "2019-02-20T08:09:22Z",
      "SyncTime": "2019-02-20T08:09:21Z",
      "Version": "3.4.2-1",
      "Engine": "rocksdb",
      "ShortName": "Coordinator0002",
      "Timestamp": "2019-02-20T08:09:22Z",
      "Status": "GOOD",
      "SyncStatus": "SERVING",
      "Host": "my-arangodb-cluster-coordinator-rxtu5pku.my-arangodb-cluster-int.default.svc",
      "Role": "Coordinator",
      "CanBeDeleted": false
    },
    "PRMR-wbsq47rz": {
      "LastAckedTime": "2019-02-21T09:14:24Z",
      "Endpoint": "ssl://my-arangodb-cluster-dbserver-wbsq47rz.my-arangodb-cluster-int.default.svc:8529",
      "SyncTime": "2019-02-21T09:14:24Z",
      "Version": "3.4.2-1",
      "Host": "my-arangodb-cluster-dbserver-wbsq47rz.my-arangodb-cluster-int.default.svc",
      "Timestamp": "2019-02-21T09:14:24Z",
      "Status": "GOOD",
      "SyncStatus": "SERVING",
      "Engine": "rocksdb",
      "ShortName": "DBServer0006",
      "Role": "DBServer",
      "CanBeDeleted": false
    },
    "AGNT-wrqmwpuw": {
      "Endpoint": "ssl://my-arangodb-cluster-agent-wrqmwpuw.my-arangodb-cluster-int.default.svc:8529",
      "Role": "Agent",
      "CanBeDeleted": false,
      "Version": "3.4.2-1",
      "Engine": "rocksdb",
      "Leader": "AGNT-oqohp3od",
      "Status": "GOOD",
      "LastAckedTime": 0.312
    },
    ... [some more entries, one for each instance]
  },
  "ClusterId": "210a0536-fd28-46de-b77f-e8882d6d7078",
  "error": false,
  "code": 200
}
```

Check that each instance has a `Status` field with the value `"GOOD"`.
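
If this check should run unattended, for example as a gate in a
maintenance script, the same verification can be expressed as a pass/fail
test. The following is only a sketch: the endpoint URL and the `root`
user with an empty password are the placeholders used by the examples in
this document, and the `jq` tool (linked below) is assumed to be
installed:

```bash
# Fail if any instance reports a Status other than "GOOD"
# (example URL and credentials, adjust to your deployment):
if curl -sk https://arangodb.9hoeffer.de:8529/_admin/cluster/health --user root: \
     | jq -e '[.Health[].Status] | all(. == "GOOD")' > /dev/null
then
  echo "All instances report GOOD"
else
  echo "At least one instance is not GOOD, do not start the drain" >&2
  exit 1
fi
```

The `-e` option makes the exit code of `jq` reflect the boolean result,
so the snippet can be used directly as a guard in a larger script.
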
For a quick interactive check, here is a shell command which makes this
easy, using the [`jq` JSON pretty printer](https://stedolan.github.io/jq/);
it prints every `Status` entry that is not `"GOOD"`, so no output means
all instances are healthy:

```bash
curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/health --user root: | jq . | grep '"Status"' | grep -v '"GOOD"'
```

For the shards being in sync there is the
[Cluster Inventory](../../../HTTP/Replications/ReplicationDump.html#return-cluster-inventory-of-collections-and-indexes)
API call:

```
GET /_db/_system/_api/replication/clusterInventory
```

… which returns a JSON body like this:

```JSON
{
  "collections": [
    {
      "parameters": {
        "cacheEnabled": false,
        "deleted": false,
        "globallyUniqueId": "c2010061/",
        "id": "2010061",
        "isSmart": false,
        "isSystem": false,
        "keyOptions": {
          "allowUserKeys": true,
          "type": "traditional"
        },
        "name": "c",
        "numberOfShards": 6,
        "planId": "2010061",
        "replicationFactor": 2,
        "shardKeys": [
          "_key"
        ],
        "shardingStrategy": "hash",
        "shards": {
          "s2010066": [
            "PRMR-vzeebvwf",
            "PRMR-e6hbjob1"
          ],
          "s2010062": [
            "PRMR-e6hbjob1",
            "PRMR-vzeebvwf"
          ],
          "s2010065": [
            "PRMR-e6hbjob1",
            "PRMR-vzeebvwf"
          ],
          "s2010067": [
            "PRMR-vzeebvwf",
            "PRMR-e6hbjob1"
          ],
          "s2010064": [
            "PRMR-vzeebvwf",
            "PRMR-e6hbjob1"
          ],
          "s2010063": [
            "PRMR-e6hbjob1",
            "PRMR-vzeebvwf"
          ]
        },
        "status": 3,
        "type": 2,
        "waitForSync": false
      },
      "indexes": [],
      "planVersion": 132,
      "isReady": true,
      "allInSync": true
    },
    ... [more collections following]
  ],
  "views": [],
  "tick": "38139421",
  "state": "unused"
}
```

Check that for all collections the attribute `"allInSync"` has the value
`true`. Note that this has to be done for every database; a scripted way
to loop over all databases is sketched at the end of this section.

Here is a shell command which makes this check easy for a single
database:

```bash
curl -k https://arangodb.9hoeffer.de:8529/_db/_system/_api/replication/clusterInventory --user root: | jq . | grep '"allInSync"' | sort | uniq -c
```

If all these checks have been performed and are okay, the cluster is
ready for a risk-free drain operation.

{% hint 'danger' %}
If there are collections with `replicationFactor` set to 1, the system is
not resilient and cannot tolerate the failure of even a single server!
One can still perform a drain operation in this case, but if anything
goes wrong, in particular if the grace period is chosen too short and a
pod is killed the hard way, data loss can happen.
{% endhint %}

If the `replicationFactor` of every collection is at least 2, the system
can tolerate the failure of a single _DBServer_. If you have set the
`Environment` to `Production` in the specs of the ArangoDB deployment,
you will only ever have one _DBServer_ on each k8s node, and therefore
the drain operation is relatively safe, even if the grace period is
chosen too small.

Furthermore, we recommend having one more k8s node than there are
_DBServers_ in your cluster, such that the deployment of a replacement
_DBServer_ can happen quickly and not only after the maintenance work on
the drained node has been completed. However, with the necessary care
described below, the procedure should also work without this.

Finally, one should **not run a rolling upgrade or restart operation**
at the time of a node drain.
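
As noted above, the in-sync check has to be repeated for every database.
A small loop along the following lines could automate that. This is only
a sketch: the URL and credentials are the placeholders used in the
examples above, the list of databases is fetched via the standard
`/_api/database` endpoint, and `jq` is assumed to be installed:

```bash
# Repeat the "allInSync" check for every database, not only _system
# (example URL and credentials, adjust to your deployment):
for db in $(curl -sk https://arangodb.9hoeffer.de:8529/_api/database --user root: | jq -r '.result[]'); do
  echo "== database: $db =="
  curl -sk "https://arangodb.9hoeffer.de:8529/_db/$db/_api/replication/clusterInventory" --user root: \
    | jq '[.collections[].allInSync] | all'
done
```

Only if this prints `true` for every database are all shards in sync.
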
## Clean out a DBServer manually (optional)

In this step we clean out a _DBServer_ manually, before even issuing the
`kubectl drain` command. This step is optional, but it can speed things
up considerably. Here is why:

If this step is not performed, we must choose the grace period long
enough to avoid any risk, as explained in the previous section. However,
this has a disadvantage which has nothing to do with ArangoDB: we have
observed that some k8s-internal services like `fluentd` and some DNS
services will always wait for the full grace period to finish a node
drain. Therefore, the node drain operation will always take as long as
the grace period. Since we have to choose this grace period long enough
for ArangoDB to move all data on the _DBServer_ pod away to some other
node, this can take a considerable amount of time, depending on the
amount of data you keep in ArangoDB.

Therefore, it is more time-efficient to perform the clean-out operation
beforehand. One can observe its completion, and as soon as it has
completed successfully, one can issue the drain command with a relatively
small grace period and still have a nearly risk-free procedure.

To clean out a _DBServer_ manually, we have to use this API:

```
POST /_admin/cluster/cleanOutServer
```

… and send as body a JSON document like this:

```JSON
{"server":"DBServer0006"}
```

(compare the above output of the `/_admin/cluster/health` API). The value
of the `"server"` attribute should be the name of the _DBServer_ which is
on the pod which shall be drained next. This is the UI short name;
alternatively, one can use the internal name, which corresponds to the
pod name. In our example, the pod name is:

```
my-arangodb-cluster-prmr-wbsq47rz-5676ed
```

… where `my-arangodb-cluster` is the ArangoDB deployment name. The
internal name of the _DBServer_ is therefore `PRMR-wbsq47rz`. Note that
`PRMR` must be written in capital letters here, even though pod names are
always all lower case. So, we could use the body:

```JSON
{"server":"PRMR-wbsq47rz"}
```

The API call will return immediately with a body like this:

```JSON
{"error":false,"id":"38029195","code":202}
```

The given `id` in this response can be used to query the outcome or
completion status of the clean-out job with this API:

```
GET /_admin/cluster/queryAgencyJob?id=38029195
```

… which will return a body like this:

```JSON
{
  "error": false,
  "id": "38029195",
  "status": "Pending",
  "job": {
    "timeCreated": "2019-02-21T10:42:14.727Z",
    "server": "PRMR-wbsq47rz",
    "timeStarted": "2019-02-21T10:42:15Z",
    "type": "cleanOutServer",
    "creator": "CRDN-rxtu5pku",
    "jobId": "38029195"
  },
  "code": 200
}
```

This indicates that the job is still ongoing (`"Pending"`). As soon as
the job has completed, the answer will be:

```JSON
{
  "error": false,
  "id": "38029195",
  "status": "Finished",
  "job": {
    "timeCreated": "2019-02-21T10:42:14.727Z",
    "server": "PRMR-e6hbjob1",
    "jobId": "38029195",
    "timeStarted": "2019-02-21T10:42:15Z",
    "timeFinished": "2019-02-21T10:45:39Z",
    "type": "cleanOutServer",
    "creator": "CRDN-rxtu5pku"
  },
  "code": 200
}
```

From this moment on, the _DBServer_ can no longer be used as a target for
moving shards to. At the same time, it no longer holds any data of the
cluster.

Now the drain operation involving a node with this pod on it is
completely risk-free, even with a small grace period.
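
To avoid doing the clean-out and the subsequent polling by hand, the two
API calls can be combined into a small script. This is only a sketch: the
URL, the credentials and the server ID are the placeholders from the
examples above, and `jq` is assumed to be installed:

```bash
# Start the clean-out for one DBServer and poll the agency job until it is finished
# (example URL, credentials and server ID, adjust to your deployment):
JOBID=$(curl -sk https://arangodb.9hoeffer.de:8529/_admin/cluster/cleanOutServer \
             --user root: -d '{"server":"PRMR-wbsq47rz"}' | jq -r '.id')
echo "cleanOutServer job id: $JOBID"
while true; do
  STATUS=$(curl -sk "https://arangodb.9hoeffer.de:8529/_admin/cluster/queryAgencyJob?id=$JOBID" \
                --user root: | jq -r '.status')
  echo "job $JOBID: $STATUS"
  if [ "$STATUS" = "Finished" ]; then
    break
  fi
  sleep 10
done
```

If the job never reaches `Finished`, or reports an unexpected status,
abort the procedure and investigate before draining the node.
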
## Performing the drain

After all of the above [checks before a node drain](#things-to-check-in-arangodb-before-a-node-drain)
have been completed successfully, it is safe to perform the drain
operation with a command similar to this:

```bash
kubectl drain gke-draintest-default-pool-394fe601-glts --delete-local-data --ignore-daemonsets --grace-period=300
```

As described above, the options `--delete-local-data` for ArangoDB and
`--ignore-daemonsets` for other services have been added. A
`--grace-period` of 300 seconds has been chosen because for this example
we are confident that all the data on our _DBServer_ pod can be moved to
a different server within 5 minutes. Note that this is **not saying**
that 300 seconds will always be enough. How long the move takes depends
on the amount of data stored in the pod; your mileage may vary, and
moving a terabyte of data can take considerably longer!

If the optional step of
[cleaning out a DBServer manually](#clean-out-a-dbserver-manually-optional)
has been performed beforehand, the grace period can easily be reduced to
60 seconds, at least from the perspective of ArangoDB: since the server
is already cleaned out, it can be dropped readily and there is still no
risk.

At the same time, this now guarantees that the drain completes within
approximately a minute.

## Things to check after a node drain

After a node has been drained, one of the _DBServers_ will usually be
gone from the cluster. As a replacement, another _DBServer_ has been
deployed on a different node, if a different node is available. If not,
the replacement can only be deployed when the maintenance work on the
drained node has been completed and it has been uncordoned again. In this
latter case, one should wait until the node is back up and the
replacement pod has been deployed there.

After that, one should perform the same checks as described in
[things to check before a node drain](#things-to-check-in-arangodb-before-a-node-drain)
above.

Finally, it is likely that the shard distribution in the "new" cluster is
not balanced out. In particular, the new _DBServer_ is not automatically
used to store shards. We recommend
[re-balancing](../../Administration/Cluster/#movingrebalancing-shards)
the shard distribution, either manually by moving shards or by using the
_Rebalance Shards_ button in the _Shards_ tab under _NODES_ in the web
UI. This redistribution can again take some time, and its progress can be
monitored in the UI.

After all this has been done, **another round of checks should be done**
before proceeding to drain the next node.

diff --git a/docs/Manual/Deployment/Kubernetes/HealthyCluster.png b/docs/Manual/Deployment/Kubernetes/HealthyCluster.png
new file mode 100644
index 000000000..2670e5bd2
Binary files /dev/null and b/docs/Manual/Deployment/Kubernetes/HealthyCluster.png differ
diff --git a/docs/Manual/Deployment/Kubernetes/ShardsInSync.png b/docs/Manual/Deployment/Kubernetes/ShardsInSync.png
new file mode 100644
index 000000000..f013ba30a
Binary files /dev/null and b/docs/Manual/Deployment/Kubernetes/ShardsInSync.png differ