chore(runner): pin image digest, add healthcheck, preflight token check #237

Merged
JacobPEvans merged 1 commit into main from chore/runner-hardening on May 25, 2026

Conversation

@JacobPEvans
Collaborator

Summary

Three additive hardening changes to the self-hosted runner, plus one removed setting, so that the recurring token-refresh / silent-failure loop that blocked #234 has fewer ways to recur. This follows the request to make the runner reliable through best practices and current versions rather than by piling on validation logic.

  • Image digest pin. The image is now referenced as myoung34/github-runner:ubuntu-jammy@sha256:0d48…, so pulls are deterministic and Renovate's docker-compose manager auto-bumps the digest when the upstream ubuntu-jammy build moves.
  • Drop DISABLE_AUTO_UPDATE=1. This lets the actions/runner binary self-update on registration; combined with the Renovate-managed image digest, both layers stay current without manual docker pull cycles.
  • Docker healthcheck. The container runs curl https://api.github.com/zen every 60s. The result surfaces in docker compose ps and make runner-doctor-container, so DNS/reachability regressions fail loudly instead of silently.
  • Preflight token check. runner-preflight is now a prerequisite of runner-foreground and runner-start. It asserts that GH_PAT_RUNNER_TOKEN is non-empty in the Doppler-injected env before docker compose up, so the LaunchAgent's silent 30s retry loop becomes a single actionable line in ~/Library/Logs/orbstack-runner/stderr.log.
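
For readers who want to see the shape of the preflight idea, here is a minimal shell sketch. It is not the Makefile recipe from this PR: the function name `runner_preflight` and the message wording are assumptions made for illustration. Only the variable name GH_PAT_RUNNER_TOKEN comes from the description above.

```sh
# Hypothetical preflight helper (illustrative, not the PR's actual recipe).
# Fails fast with one actionable line when the runner token is missing.
runner_preflight() {
  if [ -z "${GH_PAT_RUNNER_TOKEN:-}" ]; then
    # A single line on stderr replaces a silent retry loop.
    echo "runner-preflight: GH_PAT_RUNNER_TOKEN is empty; check the injected secrets before starting the runner" >&2
    return 1
  fi
  return 0
}
```

A wrapper could then gate startup on it, for example `runner_preflight && docker compose up -d`, so the container is never started with an empty token.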

Out of scope (separate follow-up)

Migration from fine-grained PAT to GitHub App auth. The myoung34 image natively supports APP_ID + APP_PRIVATE_KEY and mints installation tokens internally (auto-refreshing, no rotation), but it requires creating a GitHub App in the org and updating Doppler — captured as its own issue. This PR keeps the existing PAT path intact.

Test plan

  • pre-commit run --files Makefile docker/actions-runner/docker-compose.yml — passes
  • make -n runner-foreground — dependency chain renders correctly (kubeconfig → preflight → compose up)
  • make runner-preflight against live Doppler — exits 0
  • make runner-doctor-container against the currently-running legacy container — passes (the {{if .State.Health}}…{{else}}healthy{{end}} template gracefully handles the no-healthcheck container during the transition)
  • After merge: respawn runner via launchctl kickstart -k gui/$(id -u)/$(make -s runner-print-label); confirm make runner-doctor reports health: healthy on the new container
  • E2E gate passes on this PR (this PR itself exercises the runner end-to-end)
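
The doctor check in the test plan above comes down to turning a health status string into a pass/fail result. The sketch below shows one way that mapping could look. The function name `report_health` and the choice of exit code 2 for a container that is still starting are assumptions, not taken from the Makefile. The status values "starting", "healthy", and "unhealthy" are the ones Docker reports for containers with a healthcheck.

```sh
# Hypothetical helper: print the status and map it to an exit code.
# $1 is the status string produced by the inspect template; per the test
# plan, a container without a healthcheck is rendered as "healthy".
report_health() {
  echo "health: $1"
  case "$1" in
    healthy)  return 0 ;;  # probe is passing (or no healthcheck defined)
    starting) return 2 ;;  # assumption: distinguish "not ready yet"
    *)        return 1 ;;  # unhealthy or any unexpected value
  esac
}
```

Returning a distinct code for "starting" is a design choice that lets a caller retry briefly after a respawn instead of failing on the first poll.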

Assisted-by: Claude noreply@anthropic.com

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a preflight check to verify the presence of the GitHub runner token and adds container health checks to both the Docker Compose configuration and the Makefile's diagnostic tools. Additionally, the runner image is now pinned by digest, and automatic updates for the runner agent have been enabled. Feedback was provided regarding an inconsistency in the Makefile's health check validation logic, where the default status value should be unified to ensure accurate reporting and validation.

Comment thread Makefile Outdated
JacobPEvans force-pushed the chore/runner-hardening branch from fb2be64 to 8ec28ed May 24, 2026 15:43
JacobPEvans added a commit to dryvist/ai-assistant-instructions that referenced this pull request May 24, 2026
The existing Runner Choice rule covered RunsOn vs github-hosted but said
nothing about on-prem self-hosted runners. Two repos use them
(orbstack-kubernetes, ansible-proxmox-apps), and the recurring
token-refresh failure in orbstack-kubernetes (PR #234, dryvist/orbstack-kubernetes#237)
shows what happens when the on-prem path has no documented requirements.

- Reframe the lead paragraph so RunsOn is the default and on-prem is an
  explicit exception for hardware-bound jobs, not a routine choice.
- Add an on-prem row to the decision table with the actual in-use labels.
- Add an "On-prem runner requirements" subsection listing the five
  non-negotiables for any on-prem runner: GitHub App auth (not PAT),
  digest-pinned image, healthcheck, dead-man's-switch heartbeat,
  pre-flight secret check.

Reference implementation called out as the orbstack-kubernetes runner
once dryvist/orbstack-kubernetes#237 and #238 land.

Assisted-by: Claude <noreply@anthropic.com>
JacobPEvans added a commit that referenced this pull request May 24, 2026
The ignorePatterns entry pointed at `kubernetes-monitoring` — the
repo's previous name before c21d9ca renamed it to `orbstack-kubernetes`.
The pattern hasn't matched any actual self-references since that rename,
which means every github.com self-ref in CHANGELOG.md (release-please
compare links), docs, and READMEs has been getting hit by markdown-link-check
and randomly failing pre-commit with transient GitHub 502s. PRs #234,
#237, #240, #241 have all hit this flake in the last 24 hours.

Fix: update the pattern to the current repo name. Self-references are
safe to skip because the CHANGELOG compare links and issue links are
machine-generated by release-please and can't have typos; cross-repo
links are still validated.
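
The effect of the stale pattern can be reproduced with a small shell sketch. Both patterns and the sample compare URL are assumptions written for illustration; the real ignorePatterns entry is not shown in this commit message. markdown-link-check evaluates its patterns as JavaScript regular expressions, so `grep -E` is only an approximation here.

```sh
# Hypothetical patterns: the repo names come from the commit message,
# the exact regex text does not.
OLD_PATTERN='^https://github\.com/[^/]+/kubernetes-monitoring(/|$)'
NEW_PATTERN='^https://github\.com/[^/]+/orbstack-kubernetes(/|$)'

# Illustrative release-please style compare link (version tags are made up).
SELF_REF='https://github.com/dryvist/orbstack-kubernetes/compare/v1.0.0...v1.1.0'

# Succeeds when the URL in $2 matches the extended regex in $1.
link_is_ignored() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}
```

With these definitions, `link_is_ignored "$OLD_PATTERN" "$SELF_REF"` fails while `link_is_ignored "$NEW_PATTERN" "$SELF_REF"` succeeds, which matches the described behaviour: the old entry let every self-reference through to the link checker.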

Assisted-by: Claude <noreply@anthropic.com>
JacobPEvans added a commit that referenced this pull request May 24, 2026
The configmap template for Cribl Stream wraps the HEC URL in double
quotes (k8s/monitoring/cribl-stream-standalone/configmap-cribl-config.yaml
line 12), so after sed substitution the deployed outputs.yml renders as:

    url: "https://10.0.1.200:8088/services/collector"

The helper's regex required strictly unquoted values, giving a false
negative on every run and failing test_splunk_hec_url_matches_secret.
This has blocked PR #234 and #237 on E2E for two days.

Fix: allow an optional matching pair of single or double quotes around
the URL value. Tested against six representative cases (double-quoted,
single-quoted, unquoted, mismatched-quote, unsubstituted placeholder,
different URL) — all behave correctly.
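
The matching rule described above (accept the URL bare, or wrapped in one matching pair of quotes) can be sketched in shell as follows. This is not the helper's actual regex, which is not shown here. The function name `hec_url_matches` and its argument order are assumptions for illustration.

```sh
# Hypothetical check: does a rendered "url:" line carry the expected URL?
# $1: expected URL, $2: one line from the rendered outputs.yml.
hec_url_matches() {
  # Extract the value after "url:", trimming surrounding whitespace.
  hec_value=$(printf '%s\n' "$2" |
    sed -n 's/^[[:space:]]*url:[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p')
  # Strip quotes only when the opening and closing quote are the same kind.
  case "$hec_value" in
    \"*\") hec_value=${hec_value#\"}; hec_value=${hec_value%\"} ;;
    \'*\') hec_value=${hec_value#\'}; hec_value=${hec_value%\'} ;;
  esac
  [ "$hec_value" = "$1" ]
}
```

A mismatched pair is left untouched by the `case` statement, so the comparison fails for it, as it does for an unsubstituted placeholder or a different URL.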

Assisted-by: Claude <noreply@anthropic.com>
Three additive hardening changes to the self-hosted runner; addresses the
recurring token-refresh failure loop that blocked #234 at merge.

- Pin myoung34/github-runner to the current ubuntu-jammy multi-arch
  manifest digest (sha256:0d48...) so pulls are deterministic and
  Renovate's docker-compose manager tracks future builds.
- Drop DISABLE_AUTO_UPDATE=1 so the actions/runner binary self-updates
  on registration; combined with the Renovate-managed image digest this
  keeps both layers current without manual docker pull cycles.
- Add a Docker healthcheck (curl https://api.github.com/zen) and surface
  it through runner-doctor-container so reachability regressions show up
  in `docker compose ps` and the doctor target instead of failing silently.
- Add runner-preflight as a dependency of runner-foreground and
  runner-start; doppler-injected GH_PAT_RUNNER_TOKEN is asserted non-empty
  before docker compose up so the LaunchAgent stops the silent 30s retry
  loop and surfaces the actionable error in stderr.log.
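
As a companion to the digest pin, a reviewer might want a quick way to confirm that an image reference really is pinned. The sketch below is a hypothetical helper, not part of this change. It only checks the shape of the reference (an `@sha256:` suffix followed by 64 lowercase hex characters) and does not verify that the digest exists upstream.

```sh
# Hypothetical helper: succeed only for references pinned by a full digest.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}
```

A tag-only reference such as `myoung34/github-runner:ubuntu-jammy` fails this check, as does a reference with an abbreviated digest.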

Follow-up: migrate from fine-grained PAT to GitHub App auth
(myoung34/github-runner natively supports APP_ID/APP_PRIVATE_KEY and
mints installation tokens internally, refreshing them automatically
with no manual rotation). Captured as a separate issue because it
requires creating a GitHub App in the org and updating Doppler.

Assisted-by: Claude <noreply@anthropic.com>
JacobPEvans force-pushed the chore/runner-hardening branch from 8ec28ed to 219fcaf May 25, 2026 00:19
JacobPEvans merged commit 7621f8e into main May 25, 2026
17 checks passed
JacobPEvans deleted the chore/runner-hardening branch May 25, 2026 16:50