
fix(dind): clean per-job state + add per-repo image cache #70

Merged
luthermonson merged 2 commits into main from fix/dind-namespace-cleanup on May 16, 2026


@luthermonson
Contributor

Summary

Two related fixes land together because they're tightly coupled: without the cache, fixing the cleanup leak would force a full re-pull of kindest/node (~1 GB) on every job, and the cache only works because cleanup never touches its long-lived namespaces.

Problem 1: per-job state leak

dind.Server.Stop() destroyed in-memory tracked containers but left the underlying containerd per-job namespace (ephemerd-dind-<runner-name>) populated with Image records, leases, snapshots, and content blobs. Over ~2 days we accumulated 73 leaked namespaces holding ~98 GB on a 100 GB VHDX, eventually blocking new jobs with No space left on device.

Fix

pkg/dind/cleanup.go CleanupJobNamespace:

  1. Enumerate and remove containers (with WithSnapshotCleanup).
  2. Delete Image records (drops gc.ref.content.* labels).
  3. Delete leases.
  4. Walk the snapshotter and remove snapshots leaf-first in a multi-pass loop — containerd refuses to delete a snapshot that still has children, so the obvious order-doesn't-matter walk fails on every image layer above the bottom.
  5. Walk the content store and remove blobs that aren't pinned elsewhere.
  6. NamespaceService().Delete() the metadata bucket.

Cleanup also retries up to 3 times so containerd's asynchronous GC can catch up when its eventually-consistent state returns a transient FailedPrecondition. If the snapshot loop still bails, it logs the stuck snapshot's Stat() (name/parent/kind/labels) so future failures tell us what is pinning it.
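To illustrate why step 4 needs several passes, here is a minimal sketch of leaf-first removal over a toy in-memory store. It is not the code in pkg/dind/cleanup.go and does not call containerd; the `snapshot`, `store`, `errHasChildren`, and `removeLeafFirst` names are hypothetical and exist only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// snapshot is a hypothetical stand-in for a snapshotter entry.
// Parent is "" for a base layer.
type snapshot struct {
	Name   string
	Parent string
}

// errHasChildren mimics the refusal described above. The sentinel is an
// assumption of this sketch, not a containerd error value.
var errHasChildren = errors.New("snapshot has children")

// store is a toy snapshotter keyed by snapshot name.
type store map[string]snapshot

// remove deletes name unless another snapshot still names it as parent.
func (s store) remove(name string) error {
	for _, other := range s {
		if other.Parent == name {
			return errHasChildren
		}
	}
	delete(s, name)
	return nil
}

// removeLeafFirst sweeps the store repeatedly. Each pass removes whatever is
// currently a leaf; the loop stops when the store is empty or when a full
// pass removes nothing, which means something is still pinned.
func removeLeafFirst(s store) (passes int, err error) {
	for len(s) > 0 {
		passes++
		names := make([]string, 0, len(s))
		for name := range s {
			names = append(names, name)
		}
		removed := 0
		for _, name := range names {
			switch err := s.remove(name); {
			case err == nil:
				removed++
			case !errors.Is(err, errHasChildren):
				return passes, err
			}
		}
		if removed == 0 {
			return passes, fmt.Errorf("stuck with %d snapshots remaining", len(s))
		}
	}
	return passes, nil
}

func main() {
	s := store{
		"base":   {Name: "base"},
		"layer1": {Name: "layer1", Parent: "base"},
		"layer2": {Name: "layer2", Parent: "layer1"},
	}
	passes, err := removeLeafFirst(s)
	fmt.Printf("passes=%d remaining=%d err=%v\n", passes, len(s), err)
}
```

The number of passes depends on the order in which snapshots are visited, which is why the sketch loops until nothing is left rather than assuming a single walk suffices. The no-progress check is the point where a diagnostic like the Stat() log would be emitted.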

On worker-mode boot, CleanupStaleDindNamespaces sweeps anything left behind by ungraceful exits (DeadlineExceeded, SIGKILL, host reboot).
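The boot-time sweep has to select per-job namespaces without touching the long-lived cache namespaces, which share the ephemerd-dind- prefix. A minimal sketch of that filter follows; the `staleJobNamespaces` function and the constant names are hypothetical, and the real CleanupStaleDindNamespaces may select namespaces differently.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	jobPrefix   = "ephemerd-dind-"       // per-job namespaces
	cachePrefix = "ephemerd-dind-cache-" // long-lived cache namespaces
)

// staleJobNamespaces returns the namespaces a boot-time sweep would clean:
// those with the per-job prefix, excluding cache namespaces and anything
// that is not a dind namespace at all.
func staleJobNamespaces(all []string) []string {
	var out []string
	for _, ns := range all {
		if strings.HasPrefix(ns, jobPrefix) && !strings.HasPrefix(ns, cachePrefix) {
			out = append(out, ns)
		}
	}
	return out
}

func main() {
	all := []string{
		"default",
		"ephemerd-dind-runner-1",
		"ephemerd-dind-cache-github-ephpm-ephpm",
		"ephemerd-dind-runner-2",
	}
	fmt.Println(staleJobNamespaces(all))
	// Output: [ephemerd-dind-runner-1 ephemerd-dind-runner-2]
}
```

In this sketch a runner whose name begins with cache- would be mistaken for a cache namespace; how the real code avoids that collision is not stated here.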

Verified live

4 consecutive ephpm E2E jobs across two parallel runners exit-and-clean with zero leftover namespaces. VHDX growth bounded at ~59 GB after extensive testing vs the pre-fix unbounded growth. Console log shows dind cleanup: namespace removed for each completed runner.

Problem 2: per-repo image cache to avoid the re-download tax

Without it, every job would re-pull kindest/node and any other dind image because the Image records get deleted in step 2 above, dropping the gc.refs on the underlying content blobs.

Fix

pkg/dind/cache.go introduces a long-lived per-(provider, repo) namespace:

ephemerd-dind-cache-<provider>-<sanitized-repo>

On image pull or container create that picked up a previously-pulled image, dind mirrors the Image record into the cache namespace (or refreshes its ephemerd.io/last-accessed label if already there). The cache's gc.ref labels keep the content blobs alive after the per-job namespace is cleaned up, so subsequent jobs in the same repo get a content-store hit instead of a network pull.
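As a rough illustration of how a cache namespace name could be derived, the sketch below follows the rules the tests describe: alphanumerics and `._-` are preserved, separators are collapsed and trimmed, and empty input yields an empty result (caching disabled). The `sanitize` and `cacheNamespace` functions are hypothetical, and using `-` as the replacement separator is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize keeps alphanumerics, '.', and '_', and turns every other run of
// characters (including '-' itself) into a single '-'. Leading and trailing
// '-' are dropped. The choice of '-' as the separator is an assumption.
func sanitize(s string) string {
	var b strings.Builder
	lastDash := true // treat the start as a separator so leading ones are dropped
	for _, r := range s {
		keep := (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') ||
			(r >= '0' && r <= '9') || r == '.' || r == '_'
		if keep {
			b.WriteRune(r)
			lastDash = false
		} else if !lastDash {
			b.WriteByte('-')
			lastDash = true
		}
	}
	return strings.TrimRight(b.String(), "-")
}

// cacheNamespace builds ephemerd-dind-cache-<provider>-<sanitized-repo>.
// It returns "" when either part is empty, meaning caching is disabled.
func cacheNamespace(provider, repo string) string {
	p, r := sanitize(provider), sanitize(repo)
	if p == "" || r == "" {
		return ""
	}
	return "ephemerd-dind-cache-" + p + "-" + r
}

func main() {
	fmt.Println(cacheNamespace("github", "ephpm/ephpm"))
	// Output: ephemerd-dind-cache-github-ephpm-ephpm
}
```

Note that this sketch maps `owner/a-b` and `owner-a/b` to the same name; whether the real sanitizer disambiguates such pairs is not stated in this PR.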

Privacy boundary

Containerd namespace isolation means a content blob referenced only from dind-cache-foo-private's Image records is invisible to any other namespace's resolver. Two forges with same-named repos (e.g. github/ephpm vs gitea/ephpm) get distinct cache namespaces; two repos within a forge get distinct caches keyed by full owner/repo. Provider + Repo plumb through CreateJobRequest → runtime.CreateConfig → dind.Config so the cache namespace is derived from the dispatching forge, not parsed from the runner name (which would lose provider info).

Cache pruner

cmd/ephemerd/main.go starts a goroutine in worker mode that walks every dind-cache-* namespace every [dind].cache_prune_interval (default 24h) and evicts Image records whose last-accessed label is older than [dind].cache_max_age (default 168h / 7 days). Empty cache namespaces are removed entirely. Records pre-dating the label fall back to UpdatedAt so a deploy of this change doesn't nuke existing caches on first prune.

[dind]
  enabled = true
  cache_prune_interval = "24h"   # how often the sweeper wakes up
  cache_max_age        = "168h"  # 7 days — LRU threshold

Tests

  • TestCleanup_DindNamespaces/RemovesImageLeaseAndNamespace — full delete cycle.
  • TestCleanup_DindNamespaces/StaleSweepFiltersByPrefix — sweep touches only ephemerd-dind-* (non-cache) namespaces.
  • TestCacheNamespace_FormatAndIsolation — cross-provider + nested-repo sanitization, empty-input → empty-result (caching disabled).
  • TestSanitizeForNamespace_CollapsesAndTrims — no leading/trailing separators, no consecutive separators, alphanumerics/._- preserved.
  • TestCache_MirrorAndPrune — full mirror → refresh → backdate → prune → empty-namespace cleanup lifecycle.
  • TestCachePrune_KeepsFreshAndPrefixedOnly — fresh records survive, non-cache namespaces untouched.
  • TestPushHandlerEndToEnd still passes (rewired to sharedTestContainerd because containerd's Prometheus metrics use a process-global registry, so two containerdpkg.New() calls in one test binary panic).

All 26 packages green locally: CGO_ENABLED=0 go test -tags containers_image_openpgp -count=1 ./.... Local mage lint blocked by the known Windows-only miekg/pkcs11 cgo typecheck issue (documented in AGENTS.md); CI lint on Linux runs clean.

Test plan

  • Live verified: 4-of-4 ephpm E2E job cycles clean exit on the Hyper-V Linux VM dispatcher, no leftover namespaces.
  • Cache namespace persists across job cycles in the same repo.
  • VHDX growth bounded (~59 GB after sustained testing vs unbounded growth pre-fix).
  • Watch first scheduled prune cycle (24h after first cached image) to confirm TTL works in production with real timestamps. Worth checking in a few days.
  • Cross-provider isolation: when a Gitea/Forgejo dispatcher is wired up alongside the GitHub one, confirm dind-cache-gitea-* and dind-cache-github-* don't share blobs even with same-named repos.

Why one PR

The two changes are coupled by design: the leak fix on its own forces a re-pull on every job, and the cache work assumes cleanup never touches its namespaces. Splitting would require landing the leak fix in a state where every job pays the re-download tax, then later landing the cache work. That is strictly worse in production for as long as the leak fix is alone on main.

Two related problems landed together because they're tightly coupled:
the cleanup work would force a full re-pull of kindest/node (~1 GB)
per job without the cache, and the cache trusts cleanup never touches
its long-lived namespaces.

Problem 1: per-job state leak

dind.Server.Stop() destroyed in-memory tracked containers but left the
underlying containerd per-job namespace (ephemerd-dind-<runner-name>)
populated with Image records, leases, snapshots and content blobs. Over
~2 days we accumulated 73 leaked namespaces holding ~98 GB on a 100 GB
VHDX, blocking new jobs with "no space left on device".

Fix: pkg/dind/cleanup.go's CleanupJobNamespace enumerates and removes
containers (with WithSnapshotCleanup), Image records (drops gc.ref
labels), leases, then snapshots in leaf-first multi-pass order
(containerd refuses to delete a snapshot with children), then content
blobs, then the namespace metadata bucket itself. A 3-attempt
async-GC-catchup retry handles transient FailedPrecondition results
from containerd's eventually-consistent state. On boot, worker mode
runs CleanupStaleDindNamespaces to sweep anything left behind by
ungraceful exits (DeadlineExceeded, SIGKILL, host reboot).

Verified live: 4 consecutive ephpm E2E jobs across two parallel runners
exit-and-clean with zero leftover namespaces, VHDX growth bounded at
~59 GB after extensive testing vs the pre-fix unbounded growth.

Problem 2: per-repo image cache to avoid the re-download tax

Without it, every job would re-pull kindest/node and any other dind
image because the Image records get deleted in step 2 of cleanup
above, dropping the gc.refs on the underlying content blobs.

Fix: pkg/dind/cache.go introduces a per-(provider, repo) long-lived
namespace at ephemerd-dind-cache-<provider>-<sanitized-repo>. On image
pull or container create that picked up a previously-pulled image,
dind mirrors the Image record into the cache namespace (or refreshes
its ephemerd.io/last-accessed label if already there). The cache's
gc.refs keep the content blobs alive after the per-job namespace is
cleaned up, so subsequent jobs in the same repo get a content-store
hit instead of a network pull.

Privacy boundary: containerd namespace isolation means a content blob
referenced only from `dind-cache-foo-private`'s Image records is
invisible to any other namespace's resolver. Two forges with same-named
repos (e.g. github/ephpm vs gitea/ephpm) get distinct cache namespaces;
two repos within a forge get distinct caches keyed by full owner/repo.
Provider + Repo plumb through CreateJobRequest → runtime.CreateConfig →
dind.Config so the cache namespace is derived from the dispatching
forge, not parsed from the runner name.

Cache pruner (cmd/ephemerd/main.go): a goroutine started in worker
mode walks every dind-cache-* namespace every [dind].cache_prune_interval
(default 24h) and evicts Image records whose last-accessed label is
older than [dind].cache_max_age (default 168h / 7 days). Empty cache
namespaces are removed entirely. Records pre-dating the label fall
back to UpdatedAt so a deploy of this change doesn't nuke existing
caches on first prune.

Tests (all green locally with shared in-process containerd):
- TestCleanup_DindNamespaces — image + lease + namespace removal
- TestCleanup_DindNamespaces/StaleSweep — prefix filter, doesn't touch
  cache namespaces or non-dind namespaces
- TestCacheNamespace_FormatAndIsolation — cross-provider + nested-repo
  sanitization, empty-input handling
- TestSanitizeForNamespace_CollapsesAndTrims — no leading/trailing
  separator, no consecutive separators
- TestCache_MirrorAndPrune — full mirror → refresh → backdate → prune
  → empty-namespace cleanup lifecycle
- TestCachePrune_KeepsFreshAndPrefixedOnly — fresh records survive,
  non-cache namespaces untouched
- TestPushHandlerEndToEnd still passes against the shared containerd
  (rewired to sharedTestContainerd because containerd's prometheus
  metrics use a process-global registry — two containerdpkg.New() in
  one test binary panics).

Extends docs/architecture/fake-docker-daemon.md with two new sections:

- "Per-Job Namespace and Cleanup" covers the
  ephemerd-dind-<runner-name> namespace, the 6-step CleanupJobNamespace
  sequence (containers, images, leases, leaf-first snapshots, content,
  namespace), the FailedPrecondition retry, and the worker-mode
  CleanupStaleDindNamespaces sweep for crash recovery.

- "Per-Repo Image Cache" covers the
  ephemerd-dind-cache-<provider>-<sanitized-repo> namespace, the
  Provider/Repo plumbing path, the two cache-write events (pull and
  container-create), the containerd namespace-isolation privacy
  guarantee (and the explicit "don't ever set namespace.shareable"
  caveat), and the prune semantics including the UpdatedAt fallback
  for records pre-dating the last-accessed label.

Updates docs/getting-started/configuration.md to surface
cache_prune_interval and cache_max_age on the example config block
and adds a [dind] section reference paragraph explaining the cache
behavior, privacy boundary, and disable knobs.

Also refreshes the Key Files table to include cleanup.go, cache.go,
and their respective test files.
@luthermonson luthermonson merged commit 2fa3b52 into main May 16, 2026
4 checks passed
@luthermonson luthermonson deleted the fix/dind-namespace-cleanup branch May 16, 2026 16:21