Objective
Make runner-host hygiene a Launchplane-owned operational workflow instead of scattered repo-local Docker pruning.
Finish Line
chris-testing hygiene is globally scheduled and auditable via Launchplane.
Current Status
State: Phase-one mutating hygiene completed; typed Docker reclaimable evidence and structured timeout handling are implemented and merged. Runner-host hygiene report, dry-run apply planning, adapter-boundary planning, Launchplane-owned audit storage, service audit evidence ingress, response summary fields, dedicated self-hosted ops executor lane, typed report counters, and command-timeout normalization are implemented. PR #886 added typed observation counters to RunnerHostHygieneReport, parses live docker system df --format reclaimable values into docker_reclaimable_bytes, and fails closed if Docker summary evidence cannot produce reclaimable bytes. PR #887 converts local runner command timeouts into structured RemoteCommandResult(returncode=124) failures with captured stdout/stderr instead of uncaught exceptions. The late auto-review's active-build self-match finding was stale for current main: PR #885 already fixed the probe with bracketed patterns, and mutate run 26366919592 proved the lane can execute.
Next action: Add the next read-only inventory layer for per-resource image/volume facts: image repository/tag/id/age/size/dangling/in-use hints, volume name/driver/labels/mountpoint/container references/size where feasible, and explicit no-touch classification for warm builders and runner/bootstrap state. Separately draft the chris-testing replacement/runbook.
Blocked by: No native issue blocker.
Waiting for: Operator decision after read-only inventory shows concrete image/volume candidates; do not approve phase-two mutation from aggregate reclaimable totals alone.
Last verified: 2026-05-24 after PR #887 merge commit 8f20676205b4a5ca660ca37a2566b7df46231c4c; main CI, Security, and CodeQL passed, and live health returned status: ok with storage_backend: postgres. The most recent live hygiene dry-run proof remains Runner Host Hygiene run 26367597023 from PR #886, which wrote typed pre-apply evidence with free_disk_bytes=331590270976, docker_reclaimable_bytes=148520000000, runner_workdir_bytes=0, orphan_buildkit_containers=0, and orphan_buildkit_volumes=0, and which preserved warm builders odoo-docker:verify-devtools and odoo-docker:verify-runtime.
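The two behaviors described for PR #886 and PR #887 can be sketched as follows. This is a minimal illustration, not the merged Launchplane code: the function names parse_reclaimable_bytes and run_local are hypothetical, the stdout/stderr fields on RemoteCommandResult are assumed (only the class name and returncode=124 appear above), and the input strings are assumed to look like Docker's human-readable reclaimable values, for example "148.52GB (72%)".

```python
import re
import subprocess
from dataclasses import dataclass
from decimal import Decimal

# Decimal units as Docker prints them in human-readable sizes.
_UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
_SIZE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*(B|kB|MB|GB|TB)\b")


def parse_reclaimable_bytes(values):
    """Sum reclaimable sizes; raise instead of guessing (fail closed)."""
    if not values:
        raise ValueError("no reclaimable values in Docker summary evidence")
    total = 0
    for value in values:
        match = _SIZE.match(value)
        if match is None:
            raise ValueError(f"unparseable reclaimable value: {value!r}")
        number, unit = match.groups()
        # Decimal avoids float rounding on values such as 148.52.
        total += int(Decimal(number) * _UNITS[unit])
    return total


@dataclass(frozen=True)
class RemoteCommandResult:
    # Hypothetical stand-in for the real type; field names are assumptions.
    returncode: int
    stdout: str
    stderr: str


def _text(stream):
    """Normalize captured output, which may be None, bytes, or str."""
    if stream is None:
        return ""
    if isinstance(stream, bytes):
        return stream.decode(errors="replace")
    return stream


def run_local(argv, timeout_seconds):
    """Run a command; report a timeout as returncode 124, not an exception."""
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired as exc:
        # Keep whatever output was captured before the timeout fired.
        return RemoteCommandResult(124, _text(exc.stdout), _text(exc.stderr))
    return RemoteCommandResult(done.returncode, done.stdout, done.stderr)
```

Raising on an empty or unparseable summary keeps a missing measurement from being recorded as zero reclaimable bytes. Return code 124 matches the exit status that GNU timeout uses for a timed-out command, so callers can treat both paths the same way.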
Scope
- Add a Launchplane-owned global maintenance model for self-hosted runner hosts such as chris-testing.
- Preserve a narrow host-side executor/script for privileged Docker operations.
- Record before/after disk and Docker evidence, cleanup mode, retained builder budgets, and any skipped legacy state.
- Move shared host Docker pruning away from product repos, especially the existing verireel runner Docker prune workflow.
- Keep product-specific GHCR retention and preview lifecycle cleanup in their product/Launchplane domains.
Acceptance Criteria
- There is one canonical scheduled owner for chris-testing Docker/BuildKit hygiene.
- Routine hygiene is bounded and preserves known warm builders: odoo-docker-chris-testing and odoo-enterprise-chris-testing.
- The first mutating pass is treated as phase-one bounded BuildKit pruning; post-run evidence decides whether to push forward into broader cleanup.
- Legacy/orphan BuildKit containers, images, and volumes are reported by default and removed only through an explicit reviewed retirement mode.
- Launchplane stores or exposes durable evidence for each run: host, caller, mode, before/after df, Docker summary, builder volumes, reclaimed estimate, and failures.
- Product repos no longer run broad shared-host Docker prune jobs.
- Docs describe when to use Launchplane hygiene versus repo-specific cleanup.
- The plan includes a chris-testing fragility/replacement runbook: what roles the host performs, what labels/service users/config it needs, what caches are intentionally warm, and how to stand up a replacement or parallel runner if the host fails.
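The evidence and retirement criteria above can be sketched as a small data model. This is an illustration under stated assumptions, not the Launchplane schema: the names Mode, HygieneRunEvidence, and plan_orphans are hypothetical, and the reclaimed estimate is assumed here to be the difference in free disk bytes before and after the run.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    # Hypothetical mode names for illustration only.
    REPORT = "report"   # read-only
    APPLY = "apply"     # bounded BuildKit prune
    RETIRE = "retire"   # explicit, reviewed removal of legacy/orphan state


@dataclass(frozen=True)
class HygieneRunEvidence:
    """One durable record per run, mirroring the fields listed above."""
    host: str
    caller: str
    mode: Mode
    free_bytes_before: int
    free_bytes_after: int
    docker_summary: str
    builder_volumes: tuple
    failures: tuple = ()

    @property
    def reclaimed_estimate_bytes(self):
        # Assumption: estimate reclaimed space from the change in free disk.
        return max(0, self.free_bytes_after - self.free_bytes_before)


def plan_orphans(mode, orphans, warm_builders):
    """Return (to_remove, to_report); removal needs the explicit RETIRE mode."""
    candidates = [name for name in orphans if name not in warm_builders]
    if mode is Mode.RETIRE:
        return candidates, []
    # Default behavior: report only, never remove.
    return [], candidates
```

Keeping removal behind a single explicit mode means a scheduled report or apply run cannot delete legacy state by accident, and known warm builders are filtered out before any mode is considered.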
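Relationships
- cbusillo/claude-local-machine runner-cache docs and scripts/chris-testing-docker-hygiene.sh.
- cbusillo/verireel .github/workflows/runner-docker-prune.yml, which currently performs shared chris-testing Docker pruning from a product repo. The Launchplane-owned hygiene path should replace this workflow before broad scheduled pruning is enabled globally.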
Validation
- Run report mode against chris-testing without mutation.
- Run apply mode on schedule or manual dispatch and verify bounded cleanup only.
- Verify Odoo warm builders remain after cleanup and warm publish stays fast.
- Verify Launchplane evidence/audit records are written and visible.
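The bounded-cleanup and warm-builder checks above could be automated with a comparison of builder inventories taken before and after an apply run. The sketch below is hypothetical: check_post_apply and its parameters are illustrative names, and the inputs are assumed to be plain lists of builder names collected by whatever inventory step the workflow already runs.

```python
# Warm builder names taken from the acceptance criteria.
WARM_BUILDERS = ("odoo-docker-chris-testing", "odoo-enterprise-chris-testing")


def check_post_apply(builders_before, builders_after, allowed_removals,
                     warm_builders=WARM_BUILDERS):
    """Return a list of problems; an empty list means the run stayed bounded."""
    before = set(builders_before)
    after = set(builders_after)
    warm = set(warm_builders)
    problems = []
    # Every warm builder must still exist after cleanup.
    for name in sorted(warm - after):
        problems.append(f"warm builder missing after cleanup: {name}")
    # Anything else that disappeared must have been in the approved plan.
    for name in sorted((before - after) - set(allowed_removals) - warm):
        problems.append(f"unexpected removal outside bounded plan: {name}")
    return problems
```

For example, check_post_apply(before, after, allowed_removals=["tmp-builder"]) returns an empty list when only tmp-builder was removed and both warm builders are still present.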
Decisions
- Prefer Launchplane as the global control plane for runner-host hygiene.
- Keep privileged host mutation narrow and explicit; Launchplane should own intent, authorization, schedule, and evidence.
- Do not make this repo-by-repo cleanup.
Open Questions
- After the first mutate=true bounded BuildKit prune, does post-run evidence show enough reclaimed disk, or should Launchplane add an explicitly reviewed second mode for orphan image/volume cleanup?
- What retention budgets should be encoded after the Odoo consolidation: images, generic BuildKit, Odoo builders, action runner _work, and logs?
- How fragile is chris-testing now that it has grown from a basic runner into a multi-role host for Odoo verification, warm builders, and Launchplane hygiene operations?
- What is the target recovery design: rebuild chris-testing from documented steps, maintain a warm standby, or split responsibilities across dedicated runner hosts?
- What minimum replacement runbook is required before relying on the host for scheduled hygiene: OS/packages, Docker/BuildKit setup, GitHub runner registration, labels, service users, Launchplane repo variables, OIDC grants, warm builder seeding, and dry-run validation?