
k8s-home-ops

Flux management of my home cluster

👋 Overview

This repo contains my Flux2-based GitOps workflow for maintaining my Talos home cluster. Storage for stateful containers utilises Rook-Ceph, and NFS is used for accessing data stored on my white-box, Ubuntu-powered, ZFS-backed NAS.

🤖 Renovate and GitHub Actions are used to automatically open pull requests, produce diffs, and run security checks for updated charts, images, and other resources.

💻 Hardware

The three-node cluster comprises three identical Lenovo ThinkCentre M720q machines, each with an Intel Core i5-9500T, 16GB DDR4, a 240GB SATA SSD (OS), and a 500GB NVMe SSD (data).

Other hardware includes an aging self-built NAS (Celeron-based, 16GB DDR3, in a neat U-NAS 800 case) and, currently, a Raspberry Pi 4B router running OpenWRT.

The Pi replaces an older Atom-based self-built router whose external PSU went 💥 after many years of service. Given that router pulled nearly 20W, and with rising energy costs in mind, I thought I'd try the Pi 4. It works surprisingly well - and sips power.

Total power use (inc. switch, zigbee transceivers, modem, etc.) varies from around 66-80W (with disks spun down and the cluster idling), through 105-115W with Plex direct playing or (hardware) transcoding, to 130-150W when the cluster/NAS are particularly busy (scrubbing, etc.). I would say it averages out over 24 hours to probably 90-110W.

🤔 Before we start

The repo has lots of encrypted data that is tied to my GPG private key. You cannot simply clone this repo, follow this walkthrough, and have a functioning cluster. You will need to use your own key (GPG, age, Azure Key Vault, etc.) and update the .sops.yaml file in the root of the repo, replacing my public GPG key(s) with your own.

Then you will need to re-create all the individual XXXXX.sops.yaml files in the repo (under infrastructure/* and cluster/*) with your own data and encrypt them with your own GPG, age, Azure, or other key.
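
For illustration only, a minimal .sops.yaml might look something like the following - the fingerprint and regexes are placeholders, so adapt them to your own key and repo layout:

# illustrative sketch - replace the fingerprint with your own key's
creation_rules:
  - path_regex: .*\.sops\.ya?ml$
    encrypted_regex: ^(data|stringData)$
    pgp: "REPLACE-WITH-YOUR-GPG-FINGERPRINT"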

Creating your own key is "out of scope" for this readme; a quick Google search will get you sorted!

💾 Installing the cluster

Clone the repo, change to the new folder and run direnv allow to enable the loading of certain environment variables:

git clone https://github.com/drae/k8s-home-ops.git && cd k8s-home-ops && direnv allow
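
What direnv loads depends on the repo's .envrc; as an illustrative sketch (not necessarily the actual file), it might export something along these lines:

# .envrc (illustrative) - point the CLIs at the in-repo configs
export KUBECONFIG="$PWD/cluster/kubeconfig"
export TALOSCONFIG="$PWD/infrastructure/talos/clusterconfig/talosconfig"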

I use the stable version of Talos as the operating system for my home cluster. I do not use PXE booting or anything fancy; I burn the .iso to a USB stick and install directly.

talhelper, a great tool by budimanjojo, simplifies creation of the necessary configuration. From scratch, run:

cd infrastructure/talos
talhelper gensecret --patch-configfile > talenv.sops.yaml
sops -e -i talenv.sops.yaml
talhelper genconfig
cd ../..

Unless the secrets are updated, future updates of the configuration only require running talhelper genconfig.

All the necessary node and talosconfig files are created in the infrastructure/talos/clusterconfig folder. Note that I add additional parameters to the talenv.sops.yaml file; see talconfig.yaml for more info (look for variables of the form ${<VAR NAME>} and replicate any missing ones in the talenv file with relevant values).
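
As a rough sketch, the decrypted talenv.sops.yaml is just flat key/value pairs that talhelper substitutes into talconfig.yaml - the variable names below are hypothetical, so use whichever ${...} names your talconfig.yaml actually references:

# illustrative values, shown decrypted - encrypt with sops before committing
clusterName: home-cluster
endpoint: https://192.168.1.10:6443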

I then apply the configuration to each prepared node (i.e. booted with the talos usb flash drive):

talosctl -n <NODE IP> apply-config infrastructure/talos/clusterconfig/<NODE CONFIG>.yaml --insecure

Then sit back and watch the node prepare itself; rinse and repeat for all the nodes. With the nodes prepared, it is time to issue the bootstrap command to just one of them:

talosctl -n <NODE IP> bootstrap

This initiates the installation of Kubernetes cluster-wide. Once bootstrapped, I can download the kubeconfig:

talosctl kubeconfig > cluster/kubeconfig

and apply a "temporary" CNI configuration (I use Cilium as my CNI) and the CSR auto-approver. This is just enough configuration to enable networking (the full configuration is managed by Flux):

kubectl kustomize infrastructure/talos --enable-helm | kubectl apply -f -

Doing it this way ensures all the relevant Helm annotations are included in the manifests; without these, Flux will fail to take over management of the installation. With this complete, that should be it - the cluster is ready 🎉🎉🎉
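
Concretely, the metadata in question is Helm's standard ownership bookkeeping; roughly, each rendered resource needs to carry something like the following for Flux's helm-controller to adopt it later (the release name and namespace here are illustrative):

# sketch of the Helm ownership metadata on a rendered resource
metadata:
  labels:
    app.kubernetes.io/managed-by: Helm
  annotations:
    meta.helm.sh/release-name: cilium
    meta.helm.sh/release-namespace: kube-system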

✳️ See the following folder for more details on the Talos configuration: /infrastructure/talos. Note that sops is used to encrypt some of the more sensitive information!

✳️ I use HAProxy as the load balancer for both the Talos and Kubernetes control planes. Previously I used the shared layer 2 VIP method, but it can sometimes throw a fit that is difficult or even impossible to recover from (probably due to my lack of knowledge and pushing its capabilities). An example HAProxy configuration is sketched below.
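
A minimal, illustrative pass-through for the Kubernetes API might look like this (IPs and names are placeholders; a similar tcp block on port 50000 would front the Talos API):

# sketch only - adapt addresses, names, and health checks to your network
frontend kubernetes_api
    bind :6443
    mode tcp
    default_backend controlplane_nodes

backend controlplane_nodes
    mode tcp
    option tcp-check
    server node1 192.168.1.21:6443 check
    server node2 192.168.1.22:6443 check
    server node3 192.168.1.23:6443 check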

🥾 Bootstrapping the cluster

Create the flux-system namespace:

kubectl create namespace flux-system

Export the SOPS secret key from GPG and import it into the cluster:

gpg --export-secret-keys --armor "<GPG>" | kubectl create secret \
  generic sops-gpg --namespace=flux-system --from-file=sops.asc=/dev/stdin

The above commands can also be invoked using task:

task cluster:bootstrap-sops KEY=<GPG>

where <GPG> is the relevant key from your GPG keyring. If you do not specify a key, mine will be used (and will fail because you do not have my private key ... I hope!).

If this is a new installation, or no pre-existing GitHub token is available, run the following (replacing <REPO URL> as appropriate):

flux create source git flux-system --url=<REPO URL> --branch=main -n flux-system

Now apply the flux manifests using kustomize:

kubectl apply -k cluster/base/flux-system

This will need to be run twice due to race conditions: the CRDs applied on the first pass must exist before the custom resources that use them can be created. Sit back and watch the cluster install itself, magic 🪄.
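
To watch the takeover, the standard flux CLI status commands are handy:

# check that kustomizations and helm releases reconcile
flux get kustomizations --all-namespaces
flux get helmreleases --all-namespaces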

💽 Backup and recovery

Previously I have tried all the main/usual backup and recovery solutions for K8s; every one of them has had some kind of issue.

Fortunately, the k8s-at-home peeps (specifically onedr0p) have devised a really simple, yet incredibly effective "Poor Man's Backup" (PMB) solution. It uses a Kyverno-deployed CronJob to directly back up (using kopia) specifically labelled PVCs to a user-defined location (in my case, my NAS). Recovery is incredibly simple: a task routine is called with the name of the app to be recovered and, tada, recovery. I've tried this out cough many times now and it has worked successfully every single time. A++ would recommend 👍
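
As a hypothetical sketch of the opt-in mechanism (the actual label key is defined in the Kyverno policy referenced below, so check there for the real one):

# illustrative PVC opted in to backup via a label the cronjob policy matches on
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-config
  labels:
    snapshot.home.arpa/enabled: "true" # hypothetical label key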

✳️ See the following files for more information: snapshot-cronjob-controller.yaml and SnapshotTasks.yml

🤝 Thanks

There exists a great community of small (and not so small!) scale enthusiasts (and professionals) running K8s privately. My repo is based largely on many things I have learnt or borrowed from these peeps. Check out the k8s@home Discord, awesome-home-kubernetes, and other repos for more information.