This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

Added rts documentation
afritzler committed Oct 29, 2018
1 parent 183103d commit a7a4b71
Showing 3 changed files with 139 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -2,3 +2,4 @@
gen/
tmp/
*.sw[opq]
.vscode/
25 changes: 25 additions & 0 deletions docs/index.rst
@@ -0,0 +1,25 @@
# Kubify

.. image:: kubify.png
   :alt: Kubify logo
   :align: center

Kubify is a [Terraform](https://www.terraform.io/)-based provisioning project for setting up production-ready [Kubernetes](https://kubernetes.io/) clusters on public and private cloud infrastructures. Kubify currently supports:

* OpenStack
* AWS
* Azure

Key features of Kubify are:

* Multi master node setup
* Etcd backup and restore
* Supports rolling updates

Contents
========

.. toctree::
   :maxdepth: 2

   troubleshoot.rst
113 changes: 113 additions & 0 deletions docs/troubleshoot.rst
@@ -0,0 +1,113 @@
Kubify Troubleshooting Guide
============================

Before you start, make sure that your Terraform version is `v0.11.3`.
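
The version check can be scripted. The following is a minimal sketch, assuming the first line of `terraform version` has the form `Terraform v0.11.3`; `check_terraform_version` is a hypothetical helper name and not part of Kubify.

```shell
# Hypothetical helper: compare a `terraform version` output against
# the version this guide expects.
check_terraform_version() {
  expected="v0.11.3"
  # Only the first line matters; it looks like "Terraform v0.11.3".
  actual=$(printf '%s\n' "$1" | head -n 1 | awk '{print $2}')
  if [ "$actual" = "$expected" ]; then
    echo "ok: $actual"
  else
    echo "mismatch: found '$actual', expected $expected" >&2
    return 1
  fi
}

# Usage: check_terraform_version "$(terraform version)"
```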

SSH into Machines
-----------------

If `k8s/bin/{master X/worker X}` isn't working, check the following:

* The bastion host might be offline; reboot or start the machine.
* The SSH daemon on the target machine may have stopped working. If it is a worker machine, just __restart it__. If it is a master node, first make sure that all 3 `kube-etcd-{000X}` pods are working; if they are, restart the affected master node. If ETCD is missing a member, make sure that the missing member is the one on the master you are about to restart.
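
The ETCD check before restarting a master can be sketched as a small filter over the pod listing. This assumes the default `get pods` column layout (NAME, READY, STATUS, ...) and pod names of the form `kube-etcd-0000`; `running_etcd_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: read a `get pods` listing from stdin and print
# how many kube-etcd-NNNN pods report STATUS "Running".
running_etcd_pods() {
  awk '$1 ~ /^kube-etcd-[0-9]+$/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Usage: k8s/bin/ks get pods | running_etcd_pods
```

A result of 3 suggests that all members are up; anything lower means you should identify the missing member before restarting a master.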

Fix Control plane by using the Bootstrap Controlplane
-----------------------------------------------------

In case `kube-{apiserver,controller-manager,scheduler}` isn't working, you can use the bootstrap control plane on `master-0` to recover the Kubernetes control plane. To do that, you need to copy the `bootstrap-secrets` into the `/etc/kubernetes` folder:

```
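# SSH into the master node (the bootstrap control plane lives on master-0)
k8s/bin/master
# Copy the bootstrap TLS assets into /etc/kubernetes/bootstrap-secrets
sudo cp -r /opt/bootkube/assets/tls/ /etc/kubernetes/bootstrap-secrets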
```

Then copy the corresponding bootstrap manifests into the Kubernetes manifest folder:

```
# Kube API Server
sudo cp /opt/bootkube/assets/bootstrap-manifests/bootstrap-apiserver.yaml /etc/kubernetes/manifests/
# Kube Controller Manager
sudo cp /opt/bootkube/assets/bootstrap-manifests/bootstrap-controller-manager.yaml /etc/kubernetes/manifests/
# Kube Scheduler
sudo cp /opt/bootkube/assets/bootstrap-manifests/bootstrap-scheduler.yaml /etc/kubernetes/manifests/
```

Once the Bootstrap components have reconciled the broken Kubernetes control plane, their corresponding manifests can be safely removed again.

```
sudo rm /etc/kubernetes/manifests/bootstrap-apiserver.yaml
sudo rm /etc/kubernetes/manifests/bootstrap-controller-manager.yaml
sudo rm /etc/kubernetes/manifests/bootstrap-scheduler.yaml
```
The bootstrap secrets are no longer needed either and can be removed.
```
sudo rm -rf /etc/kubernetes/bootstrap-secrets
```

Defrag ETCD and free up Disk Space
----------------------------------

Startup ETCD client pod

Spin up a Pod that has the ETCD certificates mounted:
```
k8s/bin/ks apply -f https://raw.githubusercontent.com/gardener/kubify/master/docs/manifests/etcdctl.yaml
k8s/bin/ks exec -it etcdctl sh
```
Show current memory consumption

Run `k8s/bin/ks get pods | grep kube-etcd` to find out which ETCD pods are currently running, and adapt the `kube-etcd-XXXX` numbers in the command below. The etcd-operator might have created pods with different numbers (due to an outage or recreation).
```
export ETCDCTL_API=3 && /usr/local/bin/etcdctl --cert /tls/etcd-client.crt --key /tls/etcd-client.key --cacert /tls/etcd-client-ca.crt --endpoints https://kube-etcd-0000.kube-etcd.kube-system.svc:2379,https://kube-etcd-0001.kube-etcd.kube-system.svc:2379,https://kube-etcd-0002.kube-etcd.kube-system.svc:2379 endpoint status -w table
```
This should output something like this:
```
+-------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                       ENDPOINT                        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://kube-etcd-0000.kube-etcd.kube-system.svc:2379 | 4fa3b433fd377be7 | 3.1.8   | 201 MB  | false     |        45 |  156474994 |
| https://kube-etcd-0001.kube-etcd.kube-system.svc:2379 | 97432066a63c6f3a | 3.1.8   | 201 MB  | true      |        45 |  156474994 |
| https://kube-etcd-0002.kube-etcd.kube-system.svc:2379 | f419c478000bdea4 | 3.1.8   | 201 MB  | false     |        45 |  156474995 |
+-------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
```
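
Since the pod numbers can change, the `--endpoints` value can be derived from the pod listing instead of being edited by hand. The following is a sketch under the assumption that every member is reachable as `https://<pod name>.kube-etcd.kube-system.svc:2379`, as in the command above; `etcd_endpoints` is a hypothetical helper name.

```shell
# Hypothetical helper: read a `get pods` listing from stdin and print a
# comma-separated endpoint list for all kube-etcd-NNNN pods.
etcd_endpoints() {
  awk '$1 ~ /^kube-etcd-[0-9]+$/ {
         printf "%shttps://%s.kube-etcd.kube-system.svc:2379", sep, $1
         sep = ","
       }
       END { print "" }'
}

# Usage (outside the etcdctl pod): k8s/bin/ks get pods | etcd_endpoints
```

The printed list can then be pasted after `--endpoints` inside the `etcdctl` pod.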

Defrag ETCD storage

```
export ETCDCTL_API=3 && /usr/local/bin/etcdctl --cert /tls/etcd-client.crt --key /tls/etcd-client.key --cacert /tls/etcd-client-ca.crt --endpoints https://kube-etcd-0000.kube-etcd.kube-system.svc:2379,https://kube-etcd-0001.kube-etcd.kube-system.svc:2379,https://kube-etcd-0002.kube-etcd.kube-system.svc:2379 defrag
```
Once the `defrag` has finished, validate the storage consumption by re-running the `endpoint status` command above.
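
To compare sizes before and after the defrag, the `DB SIZE` column can be extracted from the table output. This is a sketch that assumes the table layout shown earlier (endpoint in the first column, `DB SIZE` in the fourth); `db_sizes` is a hypothetical helper name.

```shell
# Hypothetical helper: read `endpoint status -w table` output from stdin
# and print "<endpoint> <db size>" for each member row.
db_sizes() {
  awk -F'|' 'index($2, "://") > 0 {
    gsub(/^ +| +$/, "", $2)   # trim the ENDPOINT column
    gsub(/^ +| +$/, "", $5)   # trim the DB SIZE column
    print $2, $5
  }'
}

# Usage: save the table to a file before and after the defrag, then
#   db_sizes < before.txt
#   db_sizes < after.txt
```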

Restore Cluster from Backup
---------------------------

Download the latest snapshot of your ETCD backup from S3. The snapshot file name typically looks something like `3.1.8_0000000000676d59_etcd.backup`.

Add these two lines to your `terraform.tfvars` file:
```
recover_cluster = true
etcd_backup_file = "PATH_TO_ETCD_SNAPSHOT"
```

Before you run the restore operation, first run a plan operation:
```
k8s/bin/plan -s
```
You should see that the compute nodes `storege` (which are the master nodes) are marked with `-/+`, which means that they will be destroyed and recreated.

To start the restore, run the apply operation:
```
k8s/bin/apply -y
```
If you run into any errors, re-run the apply operation until the master nodes have been replaced. On OpenStack in particular, hypervisor placement (anti-affinity) sometimes fails because of an OpenStack race condition; re-running is the safest bet. In theory, once the first master [0] has been created, commenting out the two options in the `terraform.tfvars` file and re-running `k8s/bin/apply -y` will recreate the missing machines.
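
Re-running the apply by hand can be replaced with a small retry loop. The following is a minimal sketch; `retry` and the `RETRY_DELAY` variable are hypothetical names, and a successful exit code only tells you that the apply finished, so you should still confirm that the master nodes were actually replaced.

```shell
# Hypothetical helper: re-run a command until it succeeds or the
# attempt limit is reached. RETRY_DELAY (seconds) defaults to 30.
retry() {
  max_attempts=$1
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-30}"
  done
}

# Usage: retry 5 k8s/bin/apply -y
```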

__IMPORTANT__: Remove or comment out the `recover_cluster` and `etcd_backup_file` settings in the `terraform.tfvars` file after the restore has completed. Otherwise the cluster will be recovered every time you run Terraform again.
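
Commenting out the two settings can be scripted so that the step is not forgotten. This is a sketch that assumes the settings are named `recover_cluster` and `etcd_backup_file`, as in the snippet added to `terraform.tfvars` earlier; `disable_recovery` is a hypothetical helper name.

```shell
# Hypothetical helper: comment out the two restore settings in a tfvars
# file and keep a .bak copy of the original next to it.
disable_recovery() {
  tfvars=${1:-terraform.tfvars}
  cp "$tfvars" "$tfvars.bak"
  # Prefix matching assignments with "# "; all other lines pass through.
  sed -E 's/^[[:space:]]*(recover_cluster|etcd_backup_file)[[:space:]]*=/# &/' \
    "$tfvars.bak" > "$tfvars"
}

# Usage: disable_recovery terraform.tfvars
```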

Post Restore Check

After the cluster is restored, make sure that the worker nodes have successfully rejoined the cluster. A restart of the worker nodes might be necessary:

```
k8s/bin/worker {0,1,2,...}
sudo reboot
```
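
To find out which workers need a reboot, the node listing can be filtered for nodes that are not ready. This sketch assumes a `kubectl` that is configured for the restored cluster and the default `get nodes` column layout (NAME, STATUS, ...); `not_ready_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: read a `kubectl get nodes` listing from stdin and
# print the names of nodes whose STATUS is not exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are reported as well.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage: kubectl get nodes | not_ready_nodes
```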
