Skip to content

fix(docker): use ContainerInspect polling instead of ContainerWait to avoid hangs#8265

Merged
rfay merged 5 commits into
ddev:mainfrom
stasadev:20260330_stasadev_container_wait_hang
Apr 1, 2026
Merged

fix(docker): use ContainerInspect polling instead of ContainerWait to avoid hangs#8265
rfay merged 5 commits into
ddev:mainfrom
stasadev:20260330_stasadev_container_wait_hang

Conversation

@stasadev
Copy link
Copy Markdown
Member

@stasadev stasadev commented Mar 30, 2026

The Issue

ContainerWait could hang indefinitely because its streaming HTTP response never delivered data on proxied Docker sockets. The select had no ctx.Done() case and ctx was context.Background(), so there was no escape.

Seen in https://buildkite.com/ddev/macos-colima-vz/builds/6133#019d3669-d045-4ac4-a7a3-ea6752273a90/L3753

panic: test timed out after 4h0m0s
	running tests:
		TestWriteConfig (3h56m53s)
goroutine 916 [running]:
testing.(*M).startAlarm.func1()
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/testing/testing.go:2802 +0x2c4
created by time.goFunc
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/time/sleep.go:215 +0x38
goroutine 1 [chan receive, 236 minutes]:
testing.(*T).Run(0x22fa36c44b48, {0x1030eeab1?, 0x22fa36b39698?}, 0x103f5c2b0)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/testing/testing.go:2109 +0x3bc
testing.runTests.func1(0x22fa36c44b48)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/testing/testing.go:2585 +0x38
testing.tRunner(0x22fa36c44b48, 0x22fa36b397c8)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/testing/testing.go:2036 +0xc4
testing.runTests({0x1030f71f3, 0x14}, {0x10310e2d9, 0x20}, 0x22fa36ed8090, {0x104093e00, 0xa8, 0xa8}, {0x22fa36b962a0?, 0x10cd30740?, ...})
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/testing/testing.go:2583 +0x3f0
testing.(*M).Run(0x22fa36f0c0a0)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/testing/testing.go:2443 +0x578
github.com/ddev/ddev/pkg/ddevapp_test.TestMain(0x22fa36f0c0a0)
	/Users/testbot/tmp/buildkite-agent/builds/tb-macos-arm64-5/ddev/macos-colima-vz/pkg/ddevapp/ddevapp_test.go:506 +0xa00
main.main()
	_testmain.go:384 +0x88
goroutine 35 [select, 236 minutes]:
net/http.(*persistConn).readLoop(0x22fa36d08000)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/http/transport.go:2450 +0xa44
created by net/http.(*Transport).dialConn in goroutine 10
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/http/transport.go:1994 +0x109c
goroutine 36 [select, 236 minutes]:
net/http.(*persistConn).writeLoop(0x22fa36d08000)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/http/transport.go:2652 +0x98
created by net/http.(*Transport).dialConn in goroutine 10
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/http/transport.go:1995 +0x10e4
goroutine 889 [IO wait, 236 minutes]:
internal/poll.runtime_pollWait(0x10cd22a00, 0x72)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/runtime/netpoll.go:351 +0xa0
internal/poll.(*pollDesc).wait(0x22fa36d2e000?, 0x22fa36f0f000?, 0x0)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/internal/poll/fd_poll_runtime.go:84 +0x28
internal/poll.(*pollDesc).waitRead(...)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0x22fa36d2e000, {0x22fa36f0f000, 0x1000, 0x1000})
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/internal/poll/fd_unix.go:165 +0x234
net.(*netFD).Read(0x22fa36d2e000, {0x22fa36f0f000?, 0x22fa36db79c8?, 0x1026846cc?})
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/fd_posix.go:68 +0x28
net.(*conn).Read(0x22fa371a6000, {0x22fa36f0f000?, 0x22fa36db79e8?, 0x1027803ec?})
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/net.go:196 +0x34
net/http.(*persistConn).Read(0x22fa36d08000, {0x22fa36f0f000?, 0x103d0fe40?, 0x22fa36d154d8?})
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/http/transport.go:2174 +0x48
bufio.(*Reader).fill(0x22fa37120000)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/bufio/bufio.go:113 +0xe0
bufio.(*Reader).ReadSlice(0x22fa37120000, 0xa)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/bufio/bufio.go:377 +0x30
net/http/internal.readChunkLine(0x22fa36db7b90?)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/http/internal/chunked.go:156 +0x20
net/http/internal.(*chunkedReader).beginChunk(0x22fa372826f0)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/http/internal/chunked.go:49 +0x24
net/http/internal.(*chunkedReader).Read(0x22fa372826f0, {0x22fa37024800?, 0x200?, 0x200?})
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/http/internal/chunked.go:125 +0x120
net/http.(*body).readLocked(0x22fa372b0540, {0x22fa37024800?, 0x22fa36f4fc28?, 0x1026cb600?})
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/http/transfer.go:845 +0x34
net/http.(*body).Read(0x22fa36f4fc88?, {0x22fa37024800?, 0x22fa36f4fca8?, 0x102622b68?})
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/http/transfer.go:837 +0x104
net/http.(*bodyEOFSignal).Read(0x22fa372b0580, {0x22fa37024800, 0x200, 0x200})
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/net/http/transport.go:3075 +0x14c
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*wrappedBody).Read(0x22fa372b05c0, {0x22fa37024800?, 0x22fa36f4fe08?, 0x102689720?})
	/Users/testbot/tmp/buildkite-agent/builds/tb-macos-arm64-5/ddev/macos-colima-vz/vendor/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp/transport.go:238 +0x30
io.(*teeReader).Read(0x22fa36c17080, {0x22fa37024800, 0x22fa36f4fe98?, 0x200})
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/io/io.go:628 +0x30
encoding/json.(*Decoder).refill(0x22fa36bca8c0)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/encoding/json/stream.go:167 +0x13c
encoding/json.(*Decoder).readValue(0x22fa36bca8c0)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/encoding/json/stream.go:142 +0x70
encoding/json.(*Decoder).Decode(0x22fa36bca8c0, {0x103cd0140, 0x22fa373067d0})
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/encoding/json/stream.go:65 +0x5c
github.com/moby/moby/client.(*Client).ContainerWait.func1()
	/Users/testbot/tmp/buildkite-agent/builds/tb-macos-arm64-5/ddev/macos-colima-vz/vendor/github.com/moby/moby/client/container_wait.go:70 +0x144
created by github.com/moby/moby/client.(*Client).ContainerWait in goroutine 853
	/Users/testbot/tmp/buildkite-agent/builds/tb-macos-arm64-5/ddev/macos-colima-vz/vendor/github.com/moby/moby/client/container_wait.go:63 +0x258
goroutine 853 [select, 236 minutes]:
github.com/ddev/ddev/pkg/dockerutil.RunSimpleContainerExtended({0x22fa36e50360, 0x23}, 0x22fa36c2ea20, 0x22fa36d48488, 0x1, 0x0)
	/Users/testbot/tmp/buildkite-agent/builds/tb-macos-arm64-5/ddev/macos-colima-vz/pkg/dockerutil/containers.go:765 +0xe08
github.com/ddev/ddev/pkg/dockerutil.RunSimpleContainer({0x22fa36c02400, 0x35}, {0x22fa36e50360, 0x23}, {0x22fa372823f0, 0x3, 0x3}, {0x1040c7d20, 0x0, 0x0}, ...)
	/Users/testbot/tmp/buildkite-agent/builds/tb-macos-arm64-5/ddev/macos-colima-vz/pkg/dockerutil/containers.go:595 +0x170
github.com/ddev/ddev/pkg/ddevapp.(*DdevApp).Stop(0x22fa36b08008, 0x1, 0x0)
	/Users/testbot/tmp/buildkite-agent/builds/tb-macos-arm64-5/ddev/macos-colima-vz/pkg/ddevapp/ddevapp.go:3288 +0x614
github.com/ddev/ddev/pkg/ddevapp_test.TestWriteConfig.func1()
	/Users/testbot/tmp/buildkite-agent/builds/tb-macos-arm64-5/ddev/macos-colima-vz/pkg/ddevapp/config_test.go:834 +0x50
testing.(*common).Cleanup.func1()
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/testing/testing.go:1317 +0xdc
testing.(*common).runCleanup(0x22fa36e46b48, 0x22fa37102130?)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/testing/testing.go:1667 +0xd4
testing.tRunner.func2()
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/testing/testing.go:2030 +0x2c
testing.tRunner(0x22fa36e46b48, 0x103f5c2b0)
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/testing/testing.go:2042 +0xf0
created by testing.(*T).Run in goroutine 1
	/opt/homebrew/Cellar/go/1.26.1/libexec/src/testing/testing.go:2101 +0x3a8
FAIL	github.com/ddev/ddev/pkg/ddevapp	14421.262s
FAIL
make: *** [testddevapp] Error 1

How This PR Solves The Issue

Replace apiClient.ContainerWait (streaming HTTP) with ContainerInspect polling in RunSimpleContainerExtended. The streaming API hangs indefinitely on proxied Docker sockets because the proxy keeps the long-lived HTTP connection open but never delivers data. ContainerInspect uses short request-response calls that are not affected by this.

A context.WithTimeout is used for the entire poll loop and passed to each ContainerInspect call, so an in-flight inspect is also cancelled when the deadline expires — not just the select case.

RunSimpleContainerExtended now takes a timeout time.Duration instead of a detach bool0 means don't wait (detach), any positive value sets the polling timeout. RunSimpleContainer defaults to 30 minutes (it may be adjusted in the future, if we decide to refactor the logic). auth-ssh uses 60 seconds (plenty enough for ssh-add), addons uses 30 minutes (I don't know what people can run in their add-ons).

Manual Testing Instructions

Run tests on macOS with Colima vz backend. Previously TestWriteConfig and TestDdevStartMultipleHostnames would hang for hours; now they complete correctly.

Automated Testing Overview

Added TestRunSimpleContainerExtended covering:

  • timeout=0 returns immediately with the container still running
  • A short timeout expires with the expected error message

Release/Deployment Notes

No impact on normal operation. The timeout only triggers when Docker daemon genuinely stalls.

## The Issue

Seen in https://buildkite.com/ddev/macos-colima-vz/builds/6133#019d3669-d045-4ac4-a7a3-ea6752273a90/L3753

ContainerWait could hang indefinitely if the container exited before the
wait was registered with the Docker daemon — the daemon had already
published the exit event and would never publish it again, leaving the
HTTP chunked stream open forever. The select had no ctx.Done() case and
ctx was context.Background(), so there was no escape.

## How This PR Solves The Issue

Two fixes:
1. Register ContainerWait before ContainerStart so the exit event cannot
   be missed regardless of how fast the container runs.
2. Add a 1-hour timeout with ctx.Done() as a safety net for genuine Docker
   daemon failures (broken connection, VM freeze, etc.).

## Manual Testing Instructions

Run tests on macOS with Colima vz backend. Previously TestWriteConfig
would hang for hours; now it completes or fails with a clear timeout error.

## Automated Testing Overview

No new tests added; this is a race condition fix that is difficult to
reproduce reliably in automated tests.

## Release/Deployment Notes

No impact on normal operation. The timeout only triggers when Docker
daemon genuinely stalls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@stasadev stasadev requested a review from a team as a code owner March 30, 2026 12:39
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 30, 2026

This comment was marked as outdated.

@stasadev stasadev changed the title fix: avoid ContainerWait hanging indefinitely fix: use ContainerInspect polling instead of ContainerWait to avoid hangs Mar 30, 2026
@stasadev stasadev requested a review from Copilot March 30, 2026 17:01

This comment was marked as outdated.

@ddev ddev deleted a comment from Copilot AI Mar 30, 2026
@stasadev stasadev requested a review from Copilot March 30, 2026 17:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread pkg/dockerutil/containers_test.go
Comment thread pkg/dockerutil/containers.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@stasadev stasadev changed the title fix: use ContainerInspect polling instead of ContainerWait to avoid hangs fix(docker): use ContainerInspect polling instead of ContainerWait to avoid hangs Mar 30, 2026
@rfay
Copy link
Copy Markdown
Member

rfay commented Mar 31, 2026

Reasonable test times, no restarts for colima/lima/orbstack. Yay!

Copy link
Copy Markdown
Member

@rfay rfay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, and I ran it by Claude as well. Thanks for tracking it down.

The 30-minute timeout is outrageous of course, but shouldn't actually matter.

I'd like to get this in ASAP, as it should prevent us from having to look at bogus timeouts.

Comment thread pkg/ddevapp/addons.go
}

_, out, err := dockerutil.RunSimpleContainerExtended("php-action-"+util.RandString(6), config, hostConfig, true, false)
_, out, err := dockerutil.RunSimpleContainerExtended("php-action-"+util.RandString(6), config, hostConfig, true, 30*time.Minute)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, this only affects one add-on (ddev-upsun) anyway. But it should never take more than a few seconds. This is fine, but probably could be 1*time.Minute (or even less) but it shouldn't matter anyway. It would be a very unusual thing.

@rfay rfay merged commit ee5f436 into ddev:main Apr 1, 2026
44 checks passed
@rfay rfay deleted the 20260330_stasadev_container_wait_hang branch April 1, 2026 13:47
rfay added a commit to rfay/ddev that referenced this pull request Apr 1, 2026
…eContainer 30m hangs

## The Issue

- Related to ddev#8265 - ContainerInspect polling still produces 30m hangs on Lima/Colima-VZ

On macOS with Lima (both lima-VZ and colima-VZ), `RunSimpleContainer` calls for
trivial commands (cat a file, ls, push traefik config) are timing out at exactly
30 minutes with: "timed out after 30m0s waiting for container X to stop"

PR ddev#8265 replaced `ContainerWait` (which hung indefinitely) with `ContainerInspect`
polling and a 30-minute context deadline. The 30m timeout now fires where previously
we saw 4h hangs, but the root cause is not yet resolved.

## How This PR Solves The Issue

Adds `util.Debug` logging around the `ContainerInspect` polling loop to distinguish
between two failure modes:

**Mode A**: `ContainerInspect` blocks on the socket proxy (Lima/Colima) and doesn't
return until the 30-minute context deadline fires. Each individual call blocks.
Symptom: "attempt #1" logged, "returned after" never logged until ~30m later.

**Mode B**: `ContainerInspect` returns quickly but reports `Running=true` for 30
minutes because the Docker daemon on Lima has stale/incorrect container state.
Symptom: rapid "returned after Xms" messages all showing Running=true.

With `DDEV_DEBUG=true` the logs will show which mode is occurring, enabling a
targeted fix.

## Candidate Fixes (to be applied once root cause is confirmed)

### If Mode A (ContainerInspect blocks):

The fix is a per-call short timeout using goroutines, since Go context cancellation
may not unblock a stuck OS-level socket read on Lima's proxy:

```go
const perCallTimeout = 10 * time.Second
for {
    ch := make(chan inspectResult, 1)
    go func() {
        callCtx, cancel := context.WithTimeout(context.Background(), perCallTimeout)
        defer cancel()
        info, err := apiClient.ContainerInspect(callCtx, c.ID, ...)
        ch <- inspectResult{info, err}
    }()
    select {
    case <-waitCtx.Done(): return timeout error
    case res := <-ch:
        if res.err == nil && !res.info.State.Running { break }
        // err or still running: fall through to tick
    }
    select {
    case <-waitCtx.Done(): return timeout error
    case <-tickChan.C:
    }
}
```

Goroutine leak is bounded (max timeout/perCallTimeout per call) and acceptable.
If the container exits and ContainerInspect subsequently hangs once, the goroutine
for that call leaks but the next call returns quickly and we proceed.

### If Mode B (stale Running=true):

The Docker daemon on Lima isn't getting container exit events. Options:
- Use `docker` CLI via `exec.CommandContext` to check state (fresh socket connection each call)
- Force-kill the container after a shorter threshold (e.g. 60s) if it's still "Running"
  but was started for a trivial command
- Investigate Lima's Docker daemon event propagation

### Other considerations

- Both failures have been seen on lima-VZ and colima-VZ builds, not on other platforms
- The commands involved are trivial: read a file, list directory contents, push traefik config
- A container running `cat file && exit` should complete in <100ms

## Manual Testing Instructions

1. On a Lima or Colima-VZ Mac: `DDEV_DEBUG=true ddev start` for a project that triggers `GetExistingDBType` or Traefik config push
2. Look for `RunSimpleContainer: ContainerInspect attempt #1` in output
3. Check if "returned after" appears immediately or only after 30m
4. Report which mode is occurring

## Automated Testing Overview

No new tests - this is a diagnostic-only change to gather information for the fix.

## Release/Deployment Notes

Debug-only logging - no behavior change. Logs only appear with `DDEV_DEBUG=true`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rfay added a commit to rfay/ddev that referenced this pull request Apr 1, 2026
…eContainer 30m hangs

## The Issue

- Related to ddev#8265 - ContainerInspect polling still produces 30m hangs on Lima/Colima-VZ

On macOS with Lima (both lima-VZ and colima-VZ), `RunSimpleContainer` calls for
trivial commands (cat a file, ls, push traefik config) are timing out at exactly
30 minutes with: "timed out after 30m0s waiting for container X to stop"

PR ddev#8265 replaced `ContainerWait` (which hung indefinitely) with `ContainerInspect`
polling and a 30-minute context deadline. The 30m timeout now fires where previously
we saw 4h hangs, but the root cause is not yet resolved.

## How This PR Solves The Issue

Adds `util.Debug` logging around the `ContainerInspect` polling loop to distinguish
between two failure modes:

**Mode A**: `ContainerInspect` blocks on the socket proxy (Lima/Colima) and doesn't
return until the 30-minute context deadline fires. Each individual call blocks.
Symptom: "attempt #1" logged, "returned after" never logged until ~30m later.

**Mode B**: `ContainerInspect` returns quickly but reports `Running=true` for 30
minutes because the Docker daemon on Lima has stale/incorrect container state.
Symptom: rapid "returned after Xms" messages all showing Running=true.

With `DDEV_DEBUG=true` the logs will show which mode is occurring, enabling a
targeted fix.

## Candidate Fixes (to be applied once root cause is confirmed)

### If Mode A (ContainerInspect blocks):

The fix is a per-call short timeout using goroutines, since Go context cancellation
may not unblock a stuck OS-level socket read on Lima's proxy:

```go
const perCallTimeout = 10 * time.Second
for {
    ch := make(chan inspectResult, 1)
    go func() {
        callCtx, cancel := context.WithTimeout(context.Background(), perCallTimeout)
        defer cancel()
        info, err := apiClient.ContainerInspect(callCtx, c.ID, ...)
        ch <- inspectResult{info, err}
    }()
    select {
    case <-waitCtx.Done(): return timeout error
    case res := <-ch:
        if res.err == nil && !res.info.State.Running { break }
        // err or still running: fall through to tick
    }
    select {
    case <-waitCtx.Done(): return timeout error
    case <-tickChan.C:
    }
}
```

Goroutine leak is bounded (max timeout/perCallTimeout per call) and acceptable.
If the container exits and ContainerInspect subsequently hangs once, the goroutine
for that call leaks but the next call returns quickly and we proceed.

### If Mode B (stale Running=true):

The Docker daemon on Lima isn't getting container exit events. Options:
- Use `docker` CLI via `exec.CommandContext` to check state (fresh socket connection each call)
- Force-kill the container after a shorter threshold (e.g. 60s) if it's still "Running"
  but was started for a trivial command
- Investigate Lima's Docker daemon event propagation

### Other considerations

- Both failures have been seen on lima-VZ and colima-VZ builds, not on other platforms
- The commands involved are trivial: read a file, list directory contents, push traefik config
- A container running `cat file && exit` should complete in <100ms

## Manual Testing Instructions

1. On a Lima or Colima-VZ Mac: `DDEV_DEBUG=true ddev start` for a project that triggers `GetExistingDBType` or Traefik config push
2. Look for `RunSimpleContainer: ContainerInspect attempt #1` in output
3. Check if "returned after" appears immediately or only after 30m
4. Report which mode is occurring

## Automated Testing Overview

No new tests - this is a diagnostic-only change to gather information for the fix.

## Release/Deployment Notes

Debug-only logging - no behavior change. Logs only appear with `DDEV_DEBUG=true`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rfay added a commit to rfay/ddev that referenced this pull request Apr 2, 2026
…eContainer 30m hangs

## The Issue

- Related to ddev#8265 - ContainerInspect polling still produces 30m hangs on Lima/Colima-VZ

On macOS with Lima (both lima-VZ and colima-VZ), `RunSimpleContainer` calls for
trivial commands (cat a file, ls, push traefik config) are timing out at exactly
30 minutes with: "timed out after 30m0s waiting for container X to stop"

PR ddev#8265 replaced `ContainerWait` (which hung indefinitely) with `ContainerInspect`
polling and a 30-minute context deadline. The 30m timeout now fires where previously
we saw 4h hangs, but the root cause is not yet resolved.

## How This PR Solves The Issue

Adds `util.Debug` logging around the `ContainerInspect` polling loop to distinguish
between two failure modes:

**Mode A**: `ContainerInspect` blocks on the socket proxy (Lima/Colima) and doesn't
return until the 30-minute context deadline fires. Each individual call blocks.
Symptom: "attempt #1" logged, "returned after" never logged until ~30m later.

**Mode B**: `ContainerInspect` returns quickly but reports `Running=true` for 30
minutes because the Docker daemon on Lima has stale/incorrect container state.
Symptom: rapid "returned after Xms" messages all showing Running=true.

With `DDEV_DEBUG=true` the logs will show which mode is occurring, enabling a
targeted fix.

## Candidate Fixes (to be applied once root cause is confirmed)

### If Mode A (ContainerInspect blocks):

The fix is a per-call short timeout using goroutines, since Go context cancellation
may not unblock a stuck OS-level socket read on Lima's proxy:

```go
const perCallTimeout = 10 * time.Second
for {
    ch := make(chan inspectResult, 1)
    go func() {
        callCtx, cancel := context.WithTimeout(context.Background(), perCallTimeout)
        defer cancel()
        info, err := apiClient.ContainerInspect(callCtx, c.ID, ...)
        ch <- inspectResult{info, err}
    }()
    select {
    case <-waitCtx.Done(): return timeout error
    case res := <-ch:
        if res.err == nil && !res.info.State.Running { break }
        // err or still running: fall through to tick
    }
    select {
    case <-waitCtx.Done(): return timeout error
    case <-tickChan.C:
    }
}
```

Goroutine leak is bounded (max timeout/perCallTimeout per call) and acceptable.
If the container exits and ContainerInspect subsequently hangs once, the goroutine
for that call leaks but the next call returns quickly and we proceed.

### If Mode B (stale Running=true):

The Docker daemon on Lima isn't getting container exit events. Options:
- Use `docker` CLI via `exec.CommandContext` to check state (fresh socket connection each call)
- Force-kill the container after a shorter threshold (e.g. 60s) if it's still "Running"
  but was started for a trivial command
- Investigate Lima's Docker daemon event propagation

### Other considerations

- Both failures have been seen on lima-VZ and colima-VZ builds, not on other platforms
- The commands involved are trivial: read a file, list directory contents, push traefik config
- A container running `cat file && exit` should complete in <100ms

## Manual Testing Instructions

1. On a Lima or Colima-VZ Mac: `DDEV_DEBUG=true ddev start` for a project that triggers `GetExistingDBType` or Traefik config push
2. Look for `RunSimpleContainer: ContainerInspect attempt #1` in output
3. Check if "returned after" appears immediately or only after 30m
4. Report which mode is occurring

## Automated Testing Overview

No new tests - this is a diagnostic-only change to gather information for the fix.

## Release/Deployment Notes

Debug-only logging - no behavior change. Logs only appear with `DDEV_DEBUG=true`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants