fix(dind): clean per-job state + add per-repo image cache by luthermonson · Pull Request #70 · ephpm/ephemerd

luthermonson · 2026-05-16T14:46:10Z

Summary

Two related problems land together because they're tightly coupled: fixing the cleanup leak would force a full re-pull of kindest/node (~1 GB) every job without the cache, and the cache only works because cleanup never touches its long-lived namespaces.

Problem 1: per-job state leak

dind.Server.Stop() destroyed in-memory tracked containers but left the underlying containerd per-job namespace (ephemerd-dind-<runner-name>) populated with Image records, leases, snapshots, and content blobs. Over ~2 days we accumulated 73 leaked namespaces holding ~98 GB on a 100 GB VHDX, eventually blocking new jobs with No space left on device.

Fix

pkg/dind/cleanup.go CleanupJobNamespace:

Enumerate and remove containers (with WithSnapshotCleanup).
Delete Image records (drops gc.ref.content.* labels).
Delete leases.
Walk the snapshotter and remove snapshots leaf-first in a multi-pass loop — containerd refuses to delete a snapshot that still has children, so the obvious order-doesn't-matter walk fails on every image layer above the bottom.
Walk the content store and remove blobs that aren't pinned elsewhere.
NamespaceService().Delete() the metadata bucket.

Plus a 3-attempt async-GC-catchup retry for transient FailedPrecondition from containerd's eventually-consistent state, and a diagnostic log of the stuck snapshot's Stat() (name/parent/kind/labels) if the loop bails so future failures tell us what's pinning.

On worker-mode boot, CleanupStaleDindNamespaces sweeps anything left behind by ungraceful exits (DeadlineExceeded, SIGKILL, host reboot).

Verified live

4 consecutive ephpm E2E jobs across two parallel runners exit-and-clean with zero leftover namespaces. VHDX growth bounded at ~59 GB after extensive testing vs the pre-fix unbounded growth. Console log shows dind cleanup: namespace removed for each completed runner.

Problem 2: per-repo image cache to avoid the re-download tax

Without it, every job would re-pull kindest/node and any other dind image because the Image records get deleted in step 2 above, dropping the gc.refs on the underlying content blobs.

Fix

pkg/dind/cache.go introduces a long-lived per-(provider, repo) namespace:

ephemerd-dind-cache-<provider>-<sanitized-repo>

On image pull or container create that picked up a previously-pulled image, dind mirrors the Image record into the cache namespace (or refreshes its ephemerd.io/last-accessed label if already there). The cache's gc.ref labels keep the content blobs alive after the per-job namespace is cleaned up, so subsequent jobs in the same repo get a content-store hit instead of a network pull.

Privacy boundary

Containerd namespace isolation means a content blob referenced only from dind-cache-foo-private's Image records is invisible to any other namespace's resolver. Two forges with same-named repos (e.g. github/ephpm vs gitea/ephpm) get distinct cache namespaces; two repos within a forge get distinct caches keyed by full owner/repo. Provider + Repo plumb through CreateJobRequest → runtime.CreateConfig → dind.Config so the cache namespace is derived from the dispatching forge, not parsed from the runner name (which would lose provider info).

Cache pruner

cmd/ephemerd/main.go starts a goroutine in worker mode that walks every dind-cache-* namespace every [dind].cache_prune_interval (default 24h) and evicts Image records whose last-accessed label is older than [dind].cache_max_age (default 168h / 7 days). Empty cache namespaces are removed entirely. Records pre-dating the label fall back to UpdatedAt so a deploy of this change doesn't nuke existing caches on first prune.

[dind]
  enabled = true
  cache_prune_interval = "24h"   # how often the sweeper wakes up
  cache_max_age        = "168h"  # 7 days — LRU threshold

Tests

TestCleanup_DindNamespaces/RemovesImageLeaseAndNamespace — full delete cycle.
TestCleanup_DindNamespaces/StaleSweepFiltersByPrefix — sweep touches only ephemerd-dind-* (non-cache) namespaces.
TestCacheNamespace_FormatAndIsolation — cross-provider + nested-repo sanitization, empty-input → empty-result (caching disabled).
TestSanitizeForNamespace_CollapsesAndTrims — no leading/trailing separators, no consecutive separators, alphanumerics/._- preserved.
TestCache_MirrorAndPrune — full mirror → refresh → backdate → prune → empty-namespace cleanup lifecycle.
TestCachePrune_KeepsFreshAndPrefixedOnly — fresh records survive, non-cache namespaces untouched.
TestPushHandlerEndToEnd still passes (rewired to sharedTestContainerd because containerd's prometheus metrics use a process-global registry — two containerdpkg.New() in one test binary panics).

All 26 packages green locally: CGO_ENABLED=0 go test -tags containers_image_openpgp -count=1 ./.... Local mage lint blocked by the known Windows-only miekg/pkcs11 cgo typecheck issue (documented in AGENTS.md); CI lint on Linux runs clean.

Test plan

Live verified: 4-of-4 ephpm E2E job cycles clean exit on the Hyper-V Linux VM dispatcher, no leftover namespaces.
Cache namespace persists across job cycles in the same repo.
VHDX growth bounded (~59 GB after sustained testing vs unbounded growth pre-fix).
Watch first scheduled prune cycle (24h after first cached image) to confirm TTL works in production with real timestamps. Worth checking in a few days.
Cross-provider isolation: when a Gitea/Forgejo dispatcher is wired up alongside the GitHub one, confirm dind-cache-gitea-* and dind-cache-github-* don't share blobs even with same-named repos.

Why one PR

The two changes are coupled by design: the leak fix would defeat the cache, the cache work assumes the cleanup never touches its namespaces. Splitting would require landing the leak fix in a state where every job has the re-download tax, then later landing the cache work — strictly worse in production for as long as PR1 is alone on main.

Two related problems landed together because they're tightly coupled: the cleanup work would force a full re-pull of kindest/node (~1 GB) per job without the cache, and the cache trusts cleanup never touches its long-lived namespaces. Problem 1: per-job state leak dind.Server.Stop() destroyed in-memory tracked containers but left the underlying containerd per-job namespace (ephemerd-dind-<runner-name>) populated with Image records, leases, snapshots and content blobs. Over ~2 days we accumulated 73 leaked namespaces holding ~98 GB on a 100 GB VHDX, blocking new jobs with "no space left on device". Fix: pkg/dind/cleanup.go's CleanupJobNamespace enumerates and removes containers (with WithSnapshotCleanup), Image records (drops gc.ref labels), leases, then snapshots in leaf-first multi-pass order (containerd refuses to delete a snapshot with children), then content blobs, then the namespace metadata bucket itself. A 3-attempt async-GC-catchup retry handles transient FailedPrecondition results from containerd's eventually-consistent state. On boot, worker mode runs CleanupStaleDindNamespaces to sweep anything left behind by ungraceful exits (DeadlineExceeded, SIGKILL, host reboot). Verified live: 4 consecutive ephpm E2E jobs across two parallel runners exit-and-clean with zero leftover namespaces, VHDX growth bounded at ~59 GB after extensive testing vs the pre-fix unbounded growth. Problem 2: per-repo image cache to avoid the re-download tax Without it, every job would re-pull kindest/node and any other dind image because the Image records get deleted in step 2 of cleanup above, dropping the gc.refs on the underlying content blobs. Fix: pkg/dind/cache.go introduces a per-(provider, repo) long-lived namespace at ephemerd-dind-cache-<provider>-<sanitized-repo>. On image pull or container create that picked up a previously-pulled image, dind mirrors the Image record into the cache namespace (or refreshes its ephemerd.io/last-accessed label if already there). The cache's gc.refs keep the content blobs alive after the per-job namespace is cleaned up, so subsequent jobs in the same repo get a content-store hit instead of a network pull. Privacy boundary: containerd namespace isolation means a content blob referenced only from `dind-cache-foo-private`'s Image records is invisible to any other namespace's resolver. Two forges with same-named repos (e.g. github/ephpm vs gitea/ephpm) get distinct cache namespaces; two repos within a forge get distinct caches keyed by full owner/repo. Provider + Repo plumb through CreateJobRequest → runtime.CreateConfig → dind.Config so the cache namespace is derived from the dispatching forge, not parsed from the runner name. Cache pruner (cmd/ephemerd/main.go): a goroutine started in worker mode walks every dind-cache-* namespace every [dind].cache_prune_interval (default 24h) and evicts Image records whose last-accessed label is older than [dind].cache_max_age (default 168h / 7 days). Empty cache namespaces are removed entirely. Records pre-dating the label fall back to UpdatedAt so a deploy of this change doesn't nuke existing caches on first prune. Tests (all green locally with shared in-process containerd): - TestCleanup_DindNamespaces — image + lease + namespace removal - TestCleanup_DindNamespaces/StaleSweep — prefix filter, doesn't touch cache namespaces or non-dind namespaces - TestCacheNamespace_FormatAndIsolation — cross-provider + nested-repo sanitization, empty-input handling - TestSanitizeForNamespace_CollapsesAndTrims — no leading/trailing separator, no consecutive separators - TestCache_MirrorAndPrune — full mirror → refresh → backdate → prune → empty-namespace cleanup lifecycle - TestCachePrune_KeepsFreshAndPrefixedOnly — fresh records survive, non-cache namespaces untouched - TestPushHandlerEndToEnd still passes against the shared containerd (rewired to sharedTestContainerd because containerd's prometheus metrics use a process-global registry — two containerdpkg.New() in one test binary panics).

Extends docs/architecture/fake-docker-daemon.md with two new sections: - "Per-Job Namespace and Cleanup" covers the ephemerd-dind-<runner-name> namespace, the 6-step CleanupJobNamespace sequence (containers, images, leases, leaf-first snapshots, content, namespace), the FailedPrecondition retry, and the worker-mode CleanupStaleDindNamespaces sweep for crash recovery. - "Per-Repo Image Cache" covers the ephemerd-dind-cache-<provider>-<sanitized-repo> namespace, the Provider/Repo plumbing path, the two cache-write events (pull and container-create), the containerd namespace-isolation privacy guarantee (and the explicit "don't ever set namespace.shareable" caveat), and the prune semantics including the UpdatedAt fallback for records pre-dating the last-accessed label. Updates docs/getting-started/configuration.md to surface cache_prune_interval and cache_max_age on the example config block and adds a [dind] section reference paragraph explaining the cache behavior, privacy boundary, and disable knobs. Also refreshes the Key Files table to include cleanup.go, cache.go, and their respective test files.

luthermonson added 2 commits May 16, 2026 07:44

luthermonson merged commit 2fa3b52 into main May 16, 2026
4 checks passed

luthermonson deleted the fix/dind-namespace-cleanup branch May 16, 2026 16:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(dind): clean per-job state + add per-repo image cache#70

fix(dind): clean per-job state + add per-repo image cache#70
luthermonson merged 2 commits into
mainfrom
fix/dind-namespace-cleanup

luthermonson commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

luthermonson commented May 16, 2026

Summary

Problem 1: per-job state leak

Fix

Verified live

Problem 2: per-repo image cache to avoid the re-download tax

Fix

Privacy boundary

Cache pruner

Tests

Test plan

Why one PR

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant