fix(agents): reclaim zombie sandbox containers and create them lazily by msfstef · Pull Request #4513 · electric-sql/electric

msfstef · 2026-06-04T17:56:49Z

What & why

Opening the desktop app could leave 15+ leftover electric-sbx-* Docker containers running that the user never started. They piled up because, in several places, a sandbox container could outlive the work it was created for. This PR closes each of those gaps so a container only exists while something is actually using it.

Why the containers piled up

Sandbox containers are meant to be short-lived: created for an agent's work, then torn down. Three gaps let them survive instead:

We created containers we didn't need. Every time an agent woke up — even for trivial bookkeeping (a scheduled tick with nothing to do, a drained message queue) — it started a container before knowing whether it would run anything. When the app reconnected and replayed a backlog of these wakes, each one spun up its own container at once.
We didn't clean up on quit. Idle containers are torn down by a short (~2-minute) delay timer. On quit, the app exits before those timers fire — so every quit left a container behind.
We didn't clean up leftovers at the next startup either. The startup cleanup deliberately skipped running containers (so it wouldn't kill one another process is actively using). But a container left by a crash or quit is still running — its main process is an idle loop that never exits on its own — so the cleanup never touched the very leftovers it was meant to reclaim.

Two smaller bugs made it worse: a container could be stranded if its one-time setup step failed, and a Docker error code was occasionally misread so we'd "reconnect" to a container that had already been removed.

How we fix it

The approach attacks all three gaps so a container can't outlive its purpose:

Create only when actually used (lazy creation). A container is now started the first time an agent really uses it — runs a command, or reads/writes a file. Trivial wakes create nothing. (Edge cases preserved: an agent finishing for good still cleans up any workspace earlier runs created, and a child agent that "inherits" its parent's sandbox still makes sure that sandbox exists first.)
Clean up on quit. On shutdown we now run the pending teardowns immediately instead of letting the delay timers die with the process. This covers every exit path (desktop quit, restarts, Ctrl-C / kill) and is time-bounded so a stuck Docker daemon can't hang the quit.
Reclaim true leftovers at startup. Every container is tagged with the process that created it. At startup we check whether that process is still alive: if it's gone, the container is a genuine leftover and we reclaim it — deleting throwaway ones and stopping reusable ones (so their files survive for later reuse). Containers a live process is still using are left untouched, so we never disturb a peer.

As a quality-of-life touch, every container is also labelled so Docker Desktop groups them under one collapsible electric-sandboxes entry you can stop or delete together (docker compose -p electric-sandboxes down).

Implementation details (for reviewers)

Zombie reclamation

Containers carry a com.electric.sandbox.owner-pid label; the boot sweep probes it (kill(pid, 0)) and reclaims running orphans whose owner is dead — remove ephemeral, stop persistent (writable layer survives for reattach by key). The sweep probes + reclaims leftovers concurrently so boot latency stays flat as they accumulate.
Labels are immutable, so reattaching a dead creator's container records adoption in an in-container marker (/tmp/.electric-sbx-owner-pid, tmpfs ⇒ wiped on stop); the sweep probes it before reclaiming, so a live sibling's adopted container is never swept. The sweep is now awaited at boot so it can't race the first wake's reattach.
New shutdownAllDockerSandboxes() flushes pending debounced teardowns on runtime shutdown, wired through AgentHandlerResult.shutdownSandboxes → BuiltinAgentsServer.stop() (covers desktop quit, runtime restarts, CLI SIGINT/SIGTERM; bounded 5s). Live-leased entries are left to their own dispose (a sibling runtime in the same process may own them) — if the process dies first, the pid-sweep reclaims them at next boot.
Post-start init verifies the mkdir exit code and force-removes the container on any failure; isNameConflict() now matches the daemon's actual name-conflict message instead of bare 409.

Lazy sandbox creation

New lazySandbox() wrapper (agents-runtime/sandbox/lazy.ts) defers the provider factory until the sandbox is actually used (exec/fs/fetch). The bootstrap docker profile returns it, so trivial wakes never create a container. Materialization is single-flight and retried on failure.
Terminal reclaim still works without use: reclaimDockerSandboxByKey() wipes an earlier wake's persistent workspace by key without creating a container just to delete it (owner leases only; defers to live sibling leases — the last one draining wipes it).
Spawn-inherit force-materializes the owner's workspace (ensureSandboxMaterialized) before spawning, so a child can attach even when the parent never ran a tool.
Concurrent container creations are capped at 4 process-wide (withCreationSlot) — bursts queue against the daemon instead of stampeding it; reattaches/execs are unlimited, total creations unbounded.

Grouping

All sandboxes carry com.docker.compose.project=electric-sandboxes (+ com.docker.compose.service=<entity-type>).

Testing

13 new docker integration tests (running-orphan reclaim, adoption sparing, legacy-label safety, init-failure cleanup, shutdown flush, lazy composition, reclaim-by-key) and 13 unit tests for lazySandbox — written first, confirmed red, then green.
Full suites: agents-runtime 803 tests / 63 files, agents 55 / 11; tsc and eslint clean.

Out of scope (follow-ups)

No cap on concurrent wakes (they're whole agent runs; capping would queue user-visible messages behind backlog replay — needs a product call).
The e2b remote profile stays eager (working directory not statically known at profile-build time); the same wrapper applies once it is.
Server-side orphaned-claim recovery (dispatchRecoveryIntervalMs is defined but unused) — expired runner leases still queue wakes forever.

🤖 Generated with Claude Code

github-actions · 2026-06-04T17:57:32Z

Electric Agents Desktop Builds

Build artifacts for commit 9cc6805.

Platform	Status	Artifact
macOS Apple Silicon	Passed	DMG
macOS Intel	Passed	DMG
Windows x64	Passed	Installer
Linux x64	Passed	AppImage / deb

Workflow run

codecov · 2026-06-04T17:58:44Z

Codecov Report

❌ Patch coverage is 81.90789% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.69%. Comparing base (7892079) to head (9cc6805).
⚠️ Report is 8 commits behind head on main.

Files with missing lines	Patch %	Lines
packages/agents-runtime/src/sandbox/docker.ts	86.22%	23 Missing ⚠️
packages/agents-runtime/src/sandbox/lazy.ts	80.37%	21 Missing ⚠️
packages/agents/src/bootstrap.ts	65.00%	7 Missing ⚠️
packages/agents-runtime/src/process-wake.ts	60.00%	2 Missing ⚠️
packages/agents/src/server.ts	60.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #4513       +/-   ##
===========================================
+ Coverage   32.48%   56.69%   +24.20%     
===========================================
  Files         216      359      +143     
  Lines       18368    39329    +20961     
  Branches     6478    11052     +4574     
===========================================
+ Hits         5967    22296    +16329     
- Misses      12369    16962     +4593     
- Partials       32       71       +39

Flag	Coverage Δ
packages/agents	`70.53% <64.00%> (?)`
packages/agents-mcp	`77.54% <ø> (?)`
packages/agents-mobile	`66.92% <ø> (ø)`
packages/agents-runtime	`80.12% <83.51%> (?)`
packages/agents-server	`74.19% <ø> (+0.29%)`	⬆️
packages/agents-server-ui	`6.21% <ø> (ø)`
packages/electric-ax	`46.42% <ø> (?)`
packages/experimental	`87.73% <ø> (?)`
packages/react-hooks	`86.48% <ø> (?)`
packages/start	`82.83% <ø> (?)`
packages/typescript-client	`91.83% <ø> (?)`
packages/y-electric	`56.05% <ø> (?)`
typescript	`56.69% <81.90%> (+24.20%)`	⬆️
unit-tests	`56.69% <81.90%> (+24.20%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

netlify · 2026-06-04T17:58:47Z

✅ Deploy Preview for electric-next ready!

Name	Link
🔨 Latest commit	`8011831`
🔍 Latest deploy log	https://app.netlify.com/projects/electric-next/deploys/6a26ce9fab3ee20007fec96a
😎 Deploy Preview	https://deploy-preview-4513--electric-next.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

github-actions · 2026-06-04T18:01:40Z

Electric Agents Mobile Build

Local mobile checks ran for commit 9cc6805.

The EAS Android preview build was skipped because the mobile-eas-build label is not present.
Add the mobile-eas-build label to this PR to produce an installable preview build.

Workflow run

Users opening the desktop app found 15+ electric-sbx-* containers running that they never asked for. Several compounding bugs: - The boot sweep only removed *exited ephemeral* leftovers, but crash/ quit leftovers are RUNNING (PID 1 is an infinite sleep loop), so it never reclaimed anything real. Containers now carry an owner-pid label; the sweep reclaims running orphans whose owner is dead (remove ephemeral / stop persistent), consults an in-container adoption marker so a live process that reattached a dead creator's container is never swept, and is awaited at boot so it can't race reattaches. - The debounced idle teardowns are unref'd timers that died with the process: every graceful quit leaked the recently-active containers as running zombies. Runtime shutdown now flushes pending teardowns (BuiltinAgentsServer.stop), leaving live-leased entries to their own dispose or the next boot's sweep. - A failed post-start init (mkdir exec) left a started container that was never registered - invisible to dispose and, while running, to the sweep. Creation now verifies the init exit code and removes the container on any failure. Fixing this exposed that isNameConflict() treated any HTTP 409 as a lost create race (exec on an exited container is also 409), silently "reattaching" to a removed container; it now matches the daemon's name-conflict message. - Sandbox creation was eager on every claimed wake, so a reconnect backlog of trivial wakes (cron ticks, bookkeeping) stampeded the daemon with containers. The docker profile now returns a lazySandbox wrapper that defers the provider factory to first actual use. Terminal reclaim without use goes through reclaimDockerSandboxByKey (no create-to-delete), spawn-inherit force-materializes the owner's workspace before the child can attach, and concurrent creations are capped at 4 to smooth real bursts. - All sandbox containers now carry compose project/service labels (com.docker.compose.project=electric-sandboxes) so Docker Desktop groups them under one entry and they can be stopped/deleted together (docker compose -p electric-sandboxes down). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Rewrite the changeset as a user-facing release note: lead with the symptom (leftover containers piling up) and the three-part fix (create-only-when-used, clean-up-on-quit, reclaim-at-startup), dropping implementation jargon. Matches the rewritten PR description; no code change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude · 2026-06-08T14:25:06Z

Claude Code Review

Summary

This PR closes three gaps that let electric-sbx-* Docker sandbox containers outlive their work (eager creation on trivial wakes, no teardown on quit, a boot sweep that skipped running leftovers) via lazy materialization (lazySandbox), owner-pid-tagged zombie reclamation at boot, and an immediate shutdown flush. Since my last review the author has added an image-pull optimization and consolidated the test-container reaping into a shared helper. The change remains high-quality and ready to merge.

What's Working Well

ensureImage now honors its own contract. pullIfMissing is documented as "pulls the image when it's not present locally," but the code previously called docker.pull on every create — a registry round-trip even for a fully cached digest-pinned image, making creation slow and hostage to transient registry reachability. The new getImage(image).inspect() local-only check, pulling only on a miss, matches the documented semantics exactly (the old always-pull was the deviation, not the new behavior). The catch-and-fall-through-to-pull is a safe conservative fallback when inspect fails for a non-404 reason.
Test reaping is now centralized and correct. installDockerSandboxTestCleanup() replaces three near-identical beforeAll/afterEach/afterAll blocks across the docker test files. The helper reaps strictly by the test-only label ${TEST_LABEL}=1 (never the production com.electric.sandbox label), so a developer's real local sandboxes are never touched, and it's self-gating + fully swallowed so a missing/slow daemon can't redden a run. The rationale comments (why afterEach is needed given dispose() only schedules a ~2-min unref'd teardown) are excellent.
Conformance flakiness fix is well-reasoned. Binding testTimeout: 60_000 at collection time for the docker provider — and the comment explaining that each test's timeout is fixed at registration, so a hook would be too late — correctly addresses container-create latency exceeding the 5s default while keeping the host-fs/in-memory providers snappy.
Coverage is faithful. The new sandbox-docker-ensure-image.test.ts unit test (fake daemon, no Docker required) covers all three branches: present ⇒ skip pull, absent ⇒ pull, pullIfMissing: false ⇒ neither. Written against the documented contract.

Issues Found

Critical (Must Fix)

None found.

Important (Should Fix)

None.

Suggestions (Nice to Have)

Minor fragility in the conformance timeout (non-blocking): vi.setConfig({ testTimeout: 60_000 }) relies on the docker provider being last in the providers array (the comment says as much), since vi.setConfig mutates global config and only it()s registered after it inherit the new value. Docker is indeed last today, so this is correct — but a future reorder of providers would silently give the longer timeout to providers registered after docker (harmless: a slow test just won't fail prematurely) or, if docker moved earlier, leave docker tests on the 5s default again (reintroducing the flake). A describe-scoped timeout (e.g. passing the timeout per-it, or a local constant) would be more robust, but the current form is documented and works.
The prior forward-looking note still stands (the boot-sweep reclaim fan-out is unbounded, unlike the creation cap — fine at the motivating "15+ leftovers" scale, only worth revisiting at hundreds of leftovers). Non-blocking.

Issue Conformance

No linked issue is referenced — a process note rather than a blocker, given the unusually thorough PR description (problem statement, root-cause breakdown, fix rationale, explicit out-of-scope/follow-ups). The new image-pull commit is well-scoped and includes its own changeset (sandbox-image-pull-if-missing.md); the existing zombie-sandbox-containers-fix.md changeset covers the core change — monorepo changeset requirements satisfied.

Previous Review Status

Two new commits since iteration 3 (b57b6fc), both clean:

b0feb1f5e (test): extracts sweepTestContainers / installDockerSandboxTestCleanup into test/helpers/docker-sandbox-cleanup.ts and adopts it in sandbox-docker.test.ts, sandbox-docker-smoke.test.ts, and the docker branch of sandbox-conformance.test.ts. Pure consolidation of previously-inlined logic plus the conformance timeout fix — no production code touched. The extracted sweep is behaviorally identical to the three inlined copies it replaces.
9cc6805d1 (fix): the inspect-before-pull ensureImage change above, with getImage added to the minimal Dockerode loader interface and the new unit test.

Both previously-open inline threads (kevin-dp) are resolved: the creationSlots-vs-creationQueue semaphore discussion was closed by mutual agreement to keep the explicit two-variable form ("basically a counting semaphore… fine keeping it like this for now"), and the writeOwnerMarker compaction was addressed in iteration 3. No regressions, no remaining open feedback. All earlier nice-to-haves remain addressed.

Review iteration: 4 | 2026-06-09

From the Claude bot review (all non-blocking nice-to-haves): - Boot sweep now probes + reclaims leftovers concurrently (Promise.all) instead of one awaited exec round-trip per orphan — keeps boot latency flat when many leftovers have accumulated (the sweep is awaited before profiles build). - Document that reclaimDockerSandboxByKey / shutdownAllDockerSandboxes are best-effort: an unreachable daemon is swallowed and left to the next boot sweep; note why the shutdown flush runs after the wake drain (it only reclaims idle, lease-free containers). - Add a lazySandbox unit test for dispose({reclaim}) racing an in-flight factory that then fails — the reclaim callback must still run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

From kevin-dp's review: replace the try/catch wrapper around the best-effort owner-marker write with `.catch(() => {})`. Kept `await` (not a bare `return runOneOff(...).catch(...)`) so the non-void RunOneOffResult doesn't leak into the Promise<void> return type. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…akiness The docker sandbox test files spin up real containers, but only sandbox-docker.test.ts cleaned up after itself. The smoke and conformance suites relied solely on `sandbox.dispose()`, which doesn't remove a container synchronously — it schedules teardown ~2 minutes out via an unref'd timer that dies with the vitest process. So running those suites left RUNNING containers behind that accumulated across runs (3 smoke runs ⇒ 2 live zombies; one conformance run ⇒ 18). Lift the label-scoped sweep into a shared `installDockerSandboxTestCleanup()` helper (beforeAll + afterEach + afterAll) and call it from every describe that creates containers. The sweep filters only on the `electric-test-sandbox` test label, never the production `com.electric.sandbox` label, so a developer's real local sandboxes are untouched; it's self-gating and swallows daemon errors. The main file's duplicated sweep and six inline hooks collapse into one call. Also give the docker conformance provider a 60s timeout (matching the dedicated docker test files) — real container creation outgrows vitest's 5s default, which was flaking those tests independently of the leak. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`pullIfMissing` is documented as pulling "when it's not present locally", but `ensureImage` never checked presence — it called `docker.pull()` on every container create. `docker.pull` round-trips to the registry even for a fully cached digest-pinned image, so creation was needlessly slow and failed whenever the registry was briefly unreachable (an intermittent TLS-handshake timeout to registry-1.docker.io surfaced this). Inspect the image first (a local-only check) and pull only when it's actually absent, matching the documented semantics. Adds `getImage` to the minimal `Dockerode` interface and a unit test (fake daemon) covering present ⇒ skip, absent ⇒ pull, and `pullIfMissing: false` ⇒ neither. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

msfstef added the claude label Jun 8, 2026

msfstef and others added 2 commits June 8, 2026 17:15

msfstef force-pushed the worktree-zombie-sandbox-containers-fix branch from f409434 to 8011831 Compare June 8, 2026 14:15

msfstef marked this pull request as ready for review June 8, 2026 14:38

msfstef requested review from kevin-dp and samwillis June 8, 2026 15:05

kevin-dp approved these changes Jun 9, 2026

View reviewed changes

Comment thread packages/agents-runtime/src/sandbox/docker.ts

Comment thread packages/agents-runtime/src/sandbox/docker.ts Outdated

msfstef and others added 3 commits June 9, 2026 13:02

msfstef merged commit 1099366 into main Jun 9, 2026
66 checks passed

msfstef deleted the worktree-zombie-sandbox-containers-fix branch June 9, 2026 13:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agents): reclaim zombie sandbox containers and create them lazily#4513

fix(agents): reclaim zombie sandbox containers and create them lazily#4513
msfstef merged 6 commits into
mainfrom
worktree-zombie-sandbox-containers-fix

msfstef commented Jun 4, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

netlify Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

msfstef commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What & why

Why the containers piled up

How we fix it

Zombie reclamation

Lazy sandbox creation

Grouping

Testing

Out of scope (follow-ups)

Uh oh!

github-actions Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Electric Agents Desktop Builds

Uh oh!

codecov Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

netlify Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for electric-next ready!

Uh oh!

github-actions Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Electric Agents Mobile Build

Uh oh!

claude Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review

Summary

What's Working Well

Issues Found

Critical (Must Fix)

Important (Should Fix)

Suggestions (Nice to Have)

Issue Conformance

Previous Review Status

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

msfstef commented Jun 4, 2026 •

edited

Loading

github-actions Bot commented Jun 4, 2026 •

edited

Loading

codecov Bot commented Jun 4, 2026 •

edited

Loading

netlify Bot commented Jun 4, 2026 •

edited

Loading

github-actions Bot commented Jun 4, 2026 •

edited

Loading

claude Bot commented Jun 8, 2026 •

edited

Loading