diff --git a/infrastructure/cicd/overview.mdx b/infrastructure/cicd/overview.mdx
index 750e2f3..817b48d 100644
--- a/infrastructure/cicd/overview.mdx
+++ b/infrastructure/cicd/overview.mdx
@@ -23,6 +23,18 @@ Pick by what the workload actually needs:
 
 The decision tree is workload-first: a macOS build picks the Mac tier; an IaC apply picks RunsOn; a public-repo lint picks GitHub-hosted; a sensitive-credential job picks the locked-down self-hosted runner. The cost ordering is "free → very cheap → host-cost → host-cost", but the cost is rarely what drives the choice.
 
+## Self-hosted runner reliability
+
+The two self-hosted tiers (Mac and locked-down) are the only ones the org physically operates. Each runner is a single point of failure for any E2E gate that targets it. Every self-hosted runner MUST satisfy all five requirements:
+
+1. **GitHub App auth, not personal access token.** The runner image authenticates via `APP_ID` + `APP_PRIVATE_KEY` and mints registration tokens from installation tokens internally. Installation tokens are short-lived (each expires after one hour), but the runner mints a fresh one on demand, so the credential does not lapse while the App stays installed. PATs are forbidden: fine-grained PATs cap at one year and the expiry is invisible upstream.
+2. **Digest-pinned runner image or VM template.** No floating tags (`:latest`, `:ubuntu-jammy` alone). Use `image@sha256:...` with Renovate's docker-compose / docker-image manager tracking the digest, or pin the VM build artifact and bump deliberately.
+3. **Process-level healthcheck** (Docker `healthcheck:`, systemd `WatchdogSec`, or equivalent) that probes the runner's actual ability to do its job (reach `api.github.com`, talk to the cluster, etc.). A failed healthcheck surfaces in standard inspection tools (`docker compose ps`, `systemctl status`).
+4. **Dead-man's-switch heartbeat** to healthchecks.io or equivalent, pinged only when the runner is healthy. healthchecks.io fires the on-call page on missed beats.
+5. **Pre-flight secret check** that asserts required secrets (App key, kubeconfig, age key) are non-empty in the injected env before launching the runner process. Fail loudly with an actionable error.
+
+Reference implementation: [`orbstack-kubernetes/docker/actions-runner/`](https://github.com/JacobPEvans/orbstack-kubernetes/tree/main/docker/actions-runner) (`docker-compose.yml`, `Makefile` `runner-*` targets, `docs/TESTING.md`).
+
 ## The shape of every IaC pipeline
 
 | Stage | Trigger | Where it runs | What it does |