Plan: Launchplane-owned runner host hygiene #474

@cbusillo

Objective

Make runner-host hygiene a Launchplane-owned operational workflow instead of scattered repo-local Docker pruning.

Finish Line

chris-testing hygiene is globally scheduled and auditable via Launchplane

Current Status

State: Phase-one mutating hygiene is complete; typed Docker reclaimable evidence and structured timeout handling are implemented and merged. The implemented pieces are: the runner-host hygiene report, dry-run apply planning, adapter-boundary planning, Launchplane-owned audit storage, service audit evidence ingress, response summary fields, a dedicated self-hosted ops executor lane, typed report counters, and command-timeout normalization. PR #886 added typed observation counters to RunnerHostHygieneReport, parses live docker system df --format reclaimable values into docker_reclaimable_bytes, and fails closed if the Docker summary evidence cannot produce reclaimable bytes. PR #887 converts local runner command timeouts into structured RemoteCommandResult(returncode=124) failures with captured stdout/stderr instead of uncaught exceptions. The late auto-review finding that the active-build probe matched itself was stale for current main: PR #885 had already fixed the probe with bracketed patterns, and mutate run 26366919592 proved the lane can execute.
Next action: Add the next read-only inventory layer for per-resource image/volume facts: image repository/tag/id/age/size/dangling/in-use hints, volume name/driver/labels/mountpoint/container references/size where feasible, and explicit no-touch classification for warm builders and runner/bootstrap state. Separately draft the chris-testing replacement/runbook.
Blocked by: No native issue blocker.
Waiting for: Operator decision after read-only inventory shows concrete image/volume candidates; do not approve phase-two mutation from aggregate reclaimable totals alone.
Last verified: 2026-05-24 after PR #887 merge commit 8f20676205b4a5ca660ca37a2566b7df46231c4c; main CI, Security, and CodeQL passed, and live health returned status: ok with storage_backend: postgres. The most recent live hygiene dry-run proof remains Runner Host Hygiene run 26367597023 from PR #886, which wrote typed pre-apply evidence with free_disk_bytes=331590270976, docker_reclaimable_bytes=148520000000, runner_workdir_bytes=0, orphan_buildkit_containers=0, orphan_buildkit_volumes=0, and preserved warm builders odoo-docker:verify-devtools and odoo-docker:verify-runtime.
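
For readers who have not looked at PR #886 and PR #887, the two behaviors described in State can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged Launchplane code: the function names, the fields on RemoteCommandResult, the tab-separated input (as a template like {{.Type}}\t{{.Reclaimable}} would produce), and the choice to sum across resource types are all hypothetical. It also assumes Docker prints decimal (base-1000) size units.

```python
import re
import subprocess
from dataclasses import dataclass

# Assumption: Docker's human-readable sizes use decimal (base-1000) units.
_UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
_SIZE_RE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([kMGT]?B)\b")


def parse_reclaimable_bytes(value: str) -> int:
    """Parse a value such as '1.5GB (50%)' into bytes; raise if unparseable."""
    match = _SIZE_RE.match(value)
    if match is None:
        raise ValueError(f"unparseable reclaimable value: {value!r}")
    number, unit = match.groups()
    return round(float(number) * _UNITS[unit])


def total_reclaimable_bytes(df_output: str) -> int:
    """Sum the last tab-separated column of each row; fail closed on empty input."""
    rows = [line.split("\t") for line in df_output.splitlines() if line.strip()]
    if not rows:
        raise ValueError("Docker summary produced no rows")
    return sum(parse_reclaimable_bytes(row[-1]) for row in rows)


@dataclass(frozen=True)
class RemoteCommandResult:  # hypothetical field set, for illustration only
    returncode: int
    stdout: str
    stderr: str


def _text(data) -> str:
    # TimeoutExpired may carry None or bytes depending on how far the run got.
    if data is None:
        return ""
    return data.decode(errors="replace") if isinstance(data, bytes) else data


def run_local(argv: list[str], timeout_seconds: float) -> RemoteCommandResult:
    """Run a command and turn a timeout into a structured returncode=124 result."""
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired as exc:
        return RemoteCommandResult(124, _text(exc.stdout), _text(exc.stderr))
    return RemoteCommandResult(done.returncode, done.stdout, done.stderr)
```

Raising on unparseable or empty input is what "fails closed" means here: a report without a trustworthy reclaimable figure is treated as a failure rather than as zero bytes.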

Scope

  • Add a Launchplane-owned global maintenance model for self-hosted runner hosts such as chris-testing.
  • Preserve a narrow host-side executor/script for privileged Docker operations.
  • Record before/after disk and Docker evidence, cleanup mode, retained builder budgets, and any skipped legacy state.
  • Move shared host Docker pruning away from product repos, especially the existing verireel runner Docker prune workflow.
  • Keep product-specific GHCR retention and preview lifecycle cleanup in their product/Launchplane domains.
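
As a rough picture of the evidence item in the scope list above, a per-run record might look like the sketch below. The class name, field names, and JSON shape are assumptions made for illustration; the actual Launchplane audit schema is not shown in this issue.

```python
import json
import shutil
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class HygieneRunEvidence:  # hypothetical record shape
    host: str
    caller: str
    mode: str  # cleanup mode, for example "report" or "apply"
    free_disk_bytes_before: int
    free_disk_bytes_after: Optional[int] = None
    docker_reclaimable_bytes_before: Optional[int] = None
    retained_builders: List[str] = field(default_factory=list)
    skipped_legacy_state: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)

    def reclaimed_estimate_bytes(self) -> Optional[int]:
        """Free-space delta; None until an after-snapshot has been recorded."""
        if self.free_disk_bytes_after is None:
            return None
        return self.free_disk_bytes_after - self.free_disk_bytes_before

    def to_json(self) -> str:
        payload = asdict(self)
        payload["reclaimed_estimate_bytes"] = self.reclaimed_estimate_bytes()
        return json.dumps(payload, sort_keys=True)


def free_disk_bytes(path: str = "/") -> int:
    """Snapshot free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free
```

A report-mode run would fill only the "before" fields, so reclaimed_estimate_bytes stays None and cannot be mistaken for a measured result.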

Acceptance Criteria

  • There is one canonical scheduled owner for chris-testing Docker/BuildKit hygiene.
  • Routine hygiene is bounded and preserves known warm builders: odoo-docker-chris-testing and odoo-enterprise-chris-testing.
  • The first mutating pass is treated as phase-one bounded BuildKit pruning; post-run evidence decides whether to push forward into broader cleanup.
  • Legacy/orphan BuildKit containers, images, and volumes are reported by default and removed only through an explicit reviewed retirement mode.
  • Launchplane stores or exposes durable evidence for each run: host, caller, mode, before/after df, Docker summary, builder volumes, reclaimed estimate, and failures.
  • Product repos no longer run broad shared-host Docker prune jobs.
  • Docs describe when to use Launchplane hygiene versus repo-specific cleanup.
  • The plan includes a chris-testing fragility/replacement runbook: what roles the host performs, what labels/service users/config it needs, what caches are intentionally warm, and how to stand up a replacement or parallel runner if the host fails.
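
The warm-builder and legacy/orphan criteria above amount to a small decision rule, sketched here. The Resource fields and the retirement_mode flag are hypothetical names for this illustration; only the two warm builder names come from this issue.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Warm builders named in the acceptance criteria; these are never touched.
WARM_BUILDERS = frozenset(
    {"odoo-docker-chris-testing", "odoo-enterprise-chris-testing"}
)


class Action(Enum):
    KEEP = "keep"
    REPORT_ONLY = "report_only"
    REMOVE = "remove"


@dataclass(frozen=True)
class Resource:  # hypothetical inventory row
    name: str
    kind: str  # "container", "image", or "volume"
    builder: Optional[str]  # owning BuildKit builder, if known
    orphan: bool


def plan_action(resource: Resource, *, retirement_mode: bool) -> Action:
    """Decide what a hygiene run may do with one resource."""
    if resource.builder in WARM_BUILDERS:
        return Action.KEEP  # no-touch, regardless of mode
    if resource.orphan:
        # Orphans are only reported unless the reviewed retirement mode is on.
        return Action.REMOVE if retirement_mode else Action.REPORT_ONLY
    return Action.KEEP
```

Checking the warm-builder rule first means a misclassified orphan flag cannot cause a warm builder to be removed, even in retirement mode.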

Relationships

Validation

  • Run report mode against chris-testing without mutation.
  • Run apply mode on schedule or manual dispatch and verify bounded cleanup only.
  • Verify Odoo warm builders remain after cleanup and warm publish stays fast.
  • Verify Launchplane evidence/audit records are written and visible.
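
The "bounded cleanup only" and "warm builders remain" checks above could be automated by comparing resource names before and after a run. This is a minimal sketch: the function name and the idea of passing plain name sets are assumptions, and how those names are collected from the host is left out.

```python
from typing import List, Set


def verify_bounded_cleanup(
    before: Set[str],
    after: Set[str],
    planned_removals: Set[str],
    warm_builders: Set[str],
) -> List[str]:
    """Return human-readable violations; an empty list means the run was bounded."""
    violations = []
    # Every warm builder must still be present after the run.
    for name in sorted(warm_builders - after):
        violations.append(f"warm builder missing after cleanup: {name}")
    # Anything that disappeared must have been in the reviewed plan.
    unplanned = (before - after) - planned_removals - warm_builders
    for name in sorted(unplanned):
        violations.append(f"unplanned removal: {name}")
    return violations
```

For example, with before = {"a", "b", "warm"}, after = {"a", "warm"}, planned_removals = {"b"}, and warm_builders = {"warm"}, the function returns an empty list.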

Decisions

  • Prefer Launchplane as the global control plane for runner-host hygiene.
  • Keep privileged host mutation narrow and explicit; Launchplane should own intent, authorization, schedule, and evidence.
  • Do not turn this into a repo-by-repo cleanup.

Open Questions

  • After the first mutate=true bounded BuildKit prune, does post-run evidence show enough reclaimed disk, or should Launchplane add an explicitly reviewed second mode for orphan image/volume cleanup?
  • What retention budgets should be encoded after the Odoo consolidation: images, generic BuildKit, Odoo builders, action runner _work, and logs?
  • How fragile is chris-testing now that it has grown from a basic runner into a multi-role host for Odoo verification, warm builders, and Launchplane hygiene operations?
  • What is the target recovery design: rebuild chris-testing from documented steps, maintain a warm standby, or split responsibilities across dedicated runner hosts?
  • What minimum replacement runbook is required before relying on the host for scheduled hygiene: OS/packages, Docker/BuildKit setup, GitHub runner registration, labels, service users, Launchplane repo variables, OIDC grants, warm builder seeding, and dry-run validation?

Metadata

Assignees

No one assigned

Labels

plan (Durable planning issue), plan:waiting (Plan is waiting on non-issue evidence or decision)
