startup hangs indefinitely when docker-credential-desktop wedges #2972

@aheritier

Summary

docker-agent run … can hang forever before the TUI ever appears, with no error and no log output, when the configured Docker credential helper (docker-credential-desktop on macOS) gets stuck. Ctrl+C does not interrupt the agent (it has not reached its main loop yet), and the process consumes no CPU. Multiple invocations in parallel terminals all wedge in the same way.

Environment

  • macOS 26.5.1 (arm64), Apple Silicon
  • Docker Desktop 4.78.0 running and otherwise functional (docker info works)
  • ~/.docker/config.json has "credsStore": "desktop"
  • docker-agent built from current main
  • Config used: a multi-agent YAML with mcps: entries that reference Docker-hosted MCPs (e.g. ref: docker:context7) and sub-agents pulled from registries (docker/gordon:latest, docker/grafana-agent).

Reproduction

  1. Have ~/.docker/config.json with "credsStore": "desktop".
  2. Get docker-credential-desktop into a stuck state. In my case it happened spontaneously. The stuck state can be checked by running:
    echo '{"ServerURL":"https://index.docker.io/v1/"}' | docker-credential-desktop get
    
    and observing that the helper never returns. (Possibly a stale lock or a dismissed auth prompt inside Docker Desktop: the helper's open() syscall blocks indefinitely while Docker Desktop itself answers /ping and docker info normally.)
  3. Run docker-agent run my-config.yaml. It hangs forever before any UI is drawn.

Observed behaviour

Two stuck instances captured live:

PID   PPID  ELAPSED  COMMAND
97057 93702 4m40s    docker-agent run …/jarvis.yaml
98100 97057 3m39s    └─ docker-credential-desktop get        ← child of docker-agent
98545 98351 2m32s    docker-agent run …/jarvis.yaml
98557 98545 3m22s    └─ docker-credential-desktop get        ← child of docker-agent

docker-agent itself is parked in Go runtime cond-waits with FDs 76/77 connected to the helper child's stdin/stdout — it is synchronously waiting on the helper's reply via a pipe.

sample of docker-credential-desktop:

Thread_…   DispatchQueue_1: com.apple.main-thread (serial)
  open  (in libsystem_kernel.dylib) + 64
    __open  (in libsystem_kernel.dylib) + 8

The helper has only a KQUEUE and a netsrc systm fd open: no Unix socket to Docker Desktop, no progress.

docker info returns immediately. docker-credential-osxkeychain returns immediately. So Docker Desktop's backend is healthy; only the credential-helper IPC channel is wedged.

Root cause hypothesis

docker-agent uses github.com/google/go-containerregistry (crane.Pull / crane.Digest) in pkg/remote/pull.go when resolving registry references. That library's default keychain shells out to whatever helper ~/.docker/config.json declares — here docker-credential-desktop. The helper invocation has no timeout and no cancellation path: cmd.Run() blocks the goroutine, which blocks startup, which blocks the TUI.

The two relevant blast-radius paths I see:

  • pkg/remote/pull.go: crane.Digest / crane.Pull for OCI agent / sub-agent references.
  • pkg/environment/credential_helper.go (runCommand in pkg/environment/cmd_provider.go) — same shape, cmd.Run() with only the caller's context as a deadline. If startup passes a context.Background(), it never returns.

runCommand in pkg/environment/cmd_provider.go:

cmd := exec.CommandContext(ctx, name, args...)
…
if err := cmd.Run(); err != nil { … }

…will only kill the helper if the caller's context is cancelled. Nothing in the startup path appears to apply a deadline.

Impact

  • One unresponsive credential helper = one totally unusable docker-agent on that machine, with no error message and no way to know what's wrong without ps / sample.
  • Affects every user with credsStore: desktop (the default on Docker Desktop installations) any time Desktop's helper IPC misbehaves — which seems to happen occasionally without visible cause.

Suggested fix

Bound every credential-helper invocation with a short, aggressive deadline (5–10s feels right) and surface a clear error/log line on timeout, so the agent can either:

  1. fall back to no-credentials / anonymous pull for public artifacts, and/or
  2. start the TUI anyway and let the user see what's happening.

Concretely:

  • Wrap runCommand (and the equivalent path inside crane's keychain) with context.WithTimeout independent of the caller's context.
  • On timeout, log WARN (credential helper %s timed out after %s, falling back to anonymous) and return ("", false) instead of blocking.
  • Make sure the exec.Cmd is killed (cmd.Cancel / process-group kill) so we don't leak stuck docker-credential-* processes as observed above.

Optionally, gate any registry-pull on Docker Desktop being responsive (desktop.IsDockerDesktopRunning is already used in pkg/remote/transport.go) before consulting the Desktop keychain at all.

Workaround

Switch ~/.docker/config.json to "credsStore": "osxkeychain" (or quit Docker Desktop fully and reopen it).

Diagnostic snippets

lsof -p <docker-agent-pid> showing the pipe to the helper child:

docker-ag … 76     PIPE 0x9ef929cc65a95036 16384 ->0x179fdf520fadd15d
docker-cr … 1      PIPE 0x179fdf520fadd15d 16384 ->0x9ef929cc65a95036

sample <docker-agent-pid> (truncated): all goroutines in __psynch_cvwait, no progress.

Metadata

Assignees

Labels

  • area/cli: CLI commands, flags, output formatting
  • area/distribution: Agent registry, packaging, distribution, sharing
  • area/security: Authentication, authorization, secrets, vulnerabilities
  • area/tui: For features/issues/fixes related to the TUI
  • status/needs-triage: For issues that need to be triaged

Type

No fields configured for Bug.
