Compose shellout: missing --no-recreate flag destroys container state across docker daemon restarts #71

@bilby91

Summary

The shellout compose backend (runtime/docker/compose.go, which runs docker compose up -d) does not pass --no-recreate, so docker compose destroys and recreates primary-service containers whenever it detects config drift, even when the caller passed Recreate=false and an existing container is present.

The container's writable layer (and anything in $HOME inside the container, e.g. ~/.claude/projects/<encoded>/<id>.jsonl) is lost as a result.

Upstream devcontainers/cli gates --no-recreate on whether a container already exists:

const args = ['--project-name', projectName, ...composeGlobalArgs];
args.push('up', '-d');
if (container || params.expectExistingContainer) {
    args.push('--no-recreate');
}

Our shellout path (runtime/docker/compose.go:75) builds only up -d <services> — never --no-recreate. The lib's outer code (up.go:189-206) already knows whether existing != nil, but never threads that signal into the compose argv.

How we hit it

DAP workspaces are k8s pods with the docker data-root persisted on a PVC (/workspace/docker). When a session's pod is destroyed (idle timeout, deploy, eviction) and a new pod boots from the same PVC, dockerd restores the prior containers from the PVC and the runtime calls Engine.Up with Recreate=false (session resume). Expected behavior: the existing app container restarts and ~/.claude/projects/... survives so the Claude SDK can resume the conversation.

Observed: the primary service container is recreated, its writable layer is gone, and the Claude SDK fails with "No conversation found with session ID: <id>".

Sidecar services with a restart policy (e.g. mailcatcher) keep the same container ID across pod restarts because dockerd auto-starts them — so by the time the orchestrator inspects them they're already Running and config-hash drift doesn't trigger a recreate. The primary service has no restart policy, sits Exited, and gets caught by docker compose's default recreate-on-drift behavior.

Reproduction (in DAP context, but mechanism is generic)

  1. Cold-start a compose-based devcontainer workspace in a k8s pod with /var/lib/docker (or equivalent) on a PVC.
  2. Touch a file in the container, e.g. docker exec <primary> sh -c 'echo hi > /home/<user>/marker'.
  3. Kill the pod. Wait for a new pod to come up against the same PVC.
  4. Call Engine.Up again with Recreate=false.
  5. Observe: primary service container has a new ID; /home/<user>/marker is gone.

Root cause

runtime/docker/compose.go:75-80:

func buildUpArgs(spec runtime.ComposeUpSpec) []string {
    args := composeArgs(spec.ProjectName, spec.Files)
    args = append(args, "up", "-d")
    args = append(args, spec.Services...)
    return args
}

Without --no-recreate, compose recreates on any drift in:

  • generated dc-run.yaml override content (any pod-scoped env in ExtraEnvironment)
  • the resolved image digest stamped on the existing container vs the newly resolved one
  • normalized project hash differences

Even when the user's intent (opts.Recreate=false) is unambiguous, the lib can't communicate it to compose.
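To illustrate why pod-scoped environment alone is enough to count as drift, here is a minimal sketch of a content hash over a service definition. It is not compose's actual hashing code: the serviceConfig type, the configHash helper, and the POD_NAME variable are all hypothetical, and the real config hash covers far more fields. The point is only that any change in the rendered override changes the hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// serviceConfig is a hypothetical, heavily reduced service definition.
type serviceConfig struct {
	Image       string            `json:"image"`
	Environment map[string]string `json:"environment"`
}

// configHash hashes the JSON encoding of the config. encoding/json sorts
// map keys, so equal content always produces the same hash.
func configHash(c serviceConfig) string {
	b, err := json.Marshal(c)
	if err != nil {
		panic(err)
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	// Same image, but the env value is scoped to the pod that rendered it.
	oldPod := serviceConfig{Image: "app:latest", Environment: map[string]string{"POD_NAME": "ws-abc"}}
	newPod := serviceConfig{Image: "app:latest", Environment: map[string]string{"POD_NAME": "ws-def"}}

	fmt.Println(configHash(oldPod) == configHash(newPod)) // false: treated as drift
}
```

Under this model a fresh pod produces a different hash even though nothing the user cares about changed, which is why the caller's intent has to reach compose explicitly.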

Fix

Mirror upstream:

  1. Add NoRecreate bool to runtime.ComposeUpSpec.
  2. buildUpArgs appends --no-recreate when spec.NoRecreate is set.
  3. upComposeShellout (up.go:597) sets NoRecreate: existing != nil, where existing is the value already computed at up.go:167. Requires threading existing (or a bool derived from it) into upComposeShellout.

Same class of bug on the native backend (not exercised yet, but worth fixing together)

compose/orchestrator.go:460-470 decides reuse on three conditions:

if details.Labels[LabelConfigHash] == hash &&
    details.Labels[LabelImageDigest] == imageDigest &&
    c.State == runtime.StateRunning {
    return c.ID, nil
}
// Different config or not running — recreate.

The c.State == runtime.StateRunning check fails after a daemon restart (containers are restored in Exited state) and falls through to stop+remove+create. A config-matched stopped container should be started, not recreated. Same root cause as the shellout flag gap; should be fixed in the same PR so the bug doesn't follow us when we flip the backend.
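One possible shape for the start-if-stopped branch is to split the decision into three outcomes instead of two. The sketch below is standalone and uses hypothetical names (State, Action, decide); the real orchestrator works with container details and labels, and states such as paused or dead may need their own handling.

```go
package main

import "fmt"

// State is a stand-in for the runtime's container state type.
type State string

const (
	StateRunning State = "running"
	StateExited  State = "exited"
)

// Action is what the orchestrator should do with an existing container.
type Action int

const (
	ActionReuse    Action = iota // config matches and already running
	ActionStart                  // config matches but stopped: start it
	ActionRecreate               // config or image drifted: replace it
)

func (a Action) String() string {
	switch a {
	case ActionReuse:
		return "reuse"
	case ActionStart:
		return "start"
	default:
		return "recreate"
	}
}

// decide compares the labels stamped on the container with the freshly
// computed values and only recreates when they actually differ.
func decide(labelHash, wantHash, labelDigest, wantDigest string, state State) Action {
	if labelHash != wantHash || labelDigest != wantDigest {
		return ActionRecreate
	}
	if state == StateRunning {
		return ActionReuse
	}
	return ActionStart
}

func main() {
	// After a daemon restart: labels match, container is Exited.
	fmt.Println(decide("h1", "h1", "sha256:aa", "sha256:aa", StateExited)) // start
	// Genuine drift still recreates.
	fmt.Println(decide("h1", "h2", "sha256:aa", "sha256:aa", StateExited)) // recreate
}
```

Keeping the state check separate from the drift check means a stopped but config-matched container keeps its ID and writable layer.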

Scope

  • runtime/runtime.go — add NoRecreate to ComposeUpSpec.
  • runtime/docker/compose.go — append --no-recreate when set; update buildUpArgs test.
  • up.go — set NoRecreate: existing != nil in upComposeShellout.
  • compose/orchestrator.go — replace the State == Running gate with a start-if-stopped branch.
  • Integration test: cold-start compose, write a marker into the container, simulate daemon-restart-like recreation conditions, second Engine.Up with Recreate=false, assert the marker survives and the container ID is preserved.

Labels: bug