fix(docker): use ContainerInspect polling instead of ContainerWait to avoid hangs #8265
Merged
## The Issue

Seen in https://buildkite.com/ddev/macos-colima-vz/builds/6133#019d3669-d045-4ac4-a7a3-ea6752273a90/L3753

ContainerWait could hang indefinitely if the container exited before the wait was registered with the Docker daemon. The daemon had already published the exit event and would never publish it again, leaving the HTTP chunked stream open forever. The select had no ctx.Done() case and ctx was context.Background(), so there was no escape.

## How This PR Solves The Issue

Two fixes:

1. Register ContainerWait before ContainerStart so the exit event cannot be missed regardless of how fast the container runs.
2. Add a 1-hour timeout with ctx.Done() as a safety net for genuine Docker daemon failures (broken connection, VM freeze, etc.).

## Manual Testing Instructions

Run tests on macOS with Colima vz backend. Previously TestWriteConfig would hang for hours; now it completes or fails with a clear timeout error.

## Automated Testing Overview

No new tests added; this is a race condition fix that is difficult to reproduce reliably in automated tests.

## Release/Deployment Notes

No impact on normal operation. The timeout only triggers when Docker daemon genuinely stalls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member
Reasonable test times, no restarts for colima/lima/orbstack. Yay!
rfay
approved these changes
Apr 1, 2026
Member
rfay
left a comment
This is great, and I ran it by Claude as well. Thanks for tracking it down.
The 30-minute timeout is outrageous of course, but shouldn't actually matter.
I'd like to get this in ASAP, as it should prevent us from having to look at bogus timeouts.
      }

    - _, out, err := dockerutil.RunSimpleContainerExtended("php-action-"+util.RandString(6), config, hostConfig, true, false)
    + _, out, err := dockerutil.RunSimpleContainerExtended("php-action-"+util.RandString(6), config, hostConfig, true, 30*time.Minute)
Member
If I'm not mistaken, this only affects one add-on (ddev-upsun) anyway. But it should never take more than a few seconds. This is fine, but probably could be 1*time.Minute (or even less) but it shouldn't matter anyway. It would be a very unusual thing.
rfay
added a commit
to rfay/ddev
that referenced
this pull request
Apr 1, 2026
…eContainer 30m hangs

## The Issue

- Related to ddev#8265
- ContainerInspect polling still produces 30m hangs on Lima/Colima-VZ

On macOS with Lima (both lima-VZ and colima-VZ), `RunSimpleContainer` calls for trivial commands (cat a file, ls, push traefik config) are timing out at exactly 30 minutes with:

"timed out after 30m0s waiting for container X to stop"

PR ddev#8265 replaced `ContainerWait` (which hung indefinitely) with `ContainerInspect` polling and a 30-minute context deadline. The 30m timeout now fires where previously we saw 4h hangs, but the root cause is not yet resolved.

## How This PR Solves The Issue

Adds `util.Debug` logging around the `ContainerInspect` polling loop to distinguish between two failure modes:

**Mode A**: `ContainerInspect` blocks on the socket proxy (Lima/Colima) and doesn't return until the 30-minute context deadline fires. Each individual call blocks. Symptom: "attempt #1" logged, "returned after" never logged until ~30m later.

**Mode B**: `ContainerInspect` returns quickly but reports `Running=true` for 30 minutes because the Docker daemon on Lima has stale/incorrect container state. Symptom: rapid "returned after Xms" messages all showing Running=true.

With `DDEV_DEBUG=true` the logs will show which mode is occurring, enabling a targeted fix.

## Candidate Fixes (to be applied once root cause is confirmed)

### If Mode A (ContainerInspect blocks):

The fix is a per-call short timeout using goroutines, since Go context cancellation may not unblock a stuck OS-level socket read on Lima's proxy:

```go
const perCallTimeout = 10 * time.Second

poll:
for {
	// Buffered so the goroutine can always send and exit, even if nobody reads.
	ch := make(chan inspectResult, 1)
	go func() {
		callCtx, cancel := context.WithTimeout(context.Background(), perCallTimeout)
		defer cancel()
		info, err := apiClient.ContainerInspect(callCtx, c.ID, ...)
		ch <- inspectResult{info, err}
	}()
	select {
	case <-waitCtx.Done():
		return errTimeout // placeholder: the timeout error to report
	case res := <-ch:
		if res.err == nil && !res.info.State.Running {
			break poll // labeled break: a bare break would only leave the select
		}
		// err or still running: fall through to tick
	}
	select {
	case <-waitCtx.Done():
		return errTimeout // placeholder: the timeout error to report
	case <-tickChan.C:
	}
}
```

Goroutine leak is bounded (max timeout/perCallTimeout per call) and acceptable. If the container exits and ContainerInspect subsequently hangs once, the goroutine for that call leaks but the next call returns quickly and we proceed.

### If Mode B (stale Running=true):

The Docker daemon on Lima isn't getting container exit events. Options:

- Use `docker` CLI via `exec.CommandContext` to check state (fresh socket connection each call)
- Force-kill the container after a shorter threshold (e.g. 60s) if it's still "Running" but was started for a trivial command
- Investigate Lima's Docker daemon event propagation

### Other considerations

- Both failures have been seen on lima-VZ and colima-VZ builds, not on other platforms
- The commands involved are trivial: read a file, list directory contents, push traefik config
- A container running `cat file && exit` should complete in <100ms

## Manual Testing Instructions

1. On a Lima or Colima-VZ Mac: `DDEV_DEBUG=true ddev start` for a project that triggers `GetExistingDBType` or Traefik config push
2. Look for `RunSimpleContainer: ContainerInspect attempt #1` in output
3. Check if "returned after" appears immediately or only after 30m
4. Report which mode is occurring

## Automated Testing Overview

No new tests - this is a diagnostic-only change to gather information for the fix.

## Release/Deployment Notes

Debug-only logging - no behavior change. Logs only appear with `DDEV_DEBUG=true`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
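For the first Mode B option (checking state through the `docker` CLI with `exec.CommandContext`), a rough sketch might look like the following. It assumes `docker` is on `PATH` and relies on `docker inspect --format` with the `{{.State.Running}}` template; the helper names `containerRunningViaCLI` and `parseRunning` are hypothetical and not part of ddev.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// parseRunning (hypothetical helper) interprets the output of
// `docker inspect --format '{{.State.Running}}' <id>`.
func parseRunning(out string) (bool, error) {
	switch strings.TrimSpace(out) {
	case "true":
		return true, nil
	case "false":
		return false, nil
	default:
		return false, fmt.Errorf("unexpected docker inspect output %q", out)
	}
}

// containerRunningViaCLI (hypothetical helper) spawns a fresh docker CLI
// process for each check, bounded by perCallTimeout.
func containerRunningViaCLI(ctx context.Context, containerID string, perCallTimeout time.Duration) (bool, error) {
	callCtx, cancel := context.WithTimeout(ctx, perCallTimeout)
	defer cancel()

	// CommandContext kills the child process if callCtx expires first.
	cmd := exec.CommandContext(callCtx, "docker", "inspect", "--format", "{{.State.Running}}", containerID)
	out, err := cmd.Output()
	if err != nil {
		return false, fmt.Errorf("docker inspect %s: %w", containerID, err)
	}
	return parseRunning(string(out))
}
```

Whether a fresh CLI process actually sees fresher state than the API client on Lima is exactly what the debug logging is meant to establish, so this is only worth trying if Mode B is confirmed.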
rfay
added a commit
to rfay/ddev
that referenced
this pull request
Apr 2, 2026
…eContainer 30m hangs
## The Issue

`ContainerWait` could hang indefinitely because its streaming HTTP response never delivered data on proxied Docker sockets. The `select` had no `ctx.Done()` case and `ctx` was `context.Background()`, so there was no escape.

Seen in https://buildkite.com/ddev/macos-colima-vz/builds/6133#019d3669-d045-4ac4-a7a3-ea6752273a90/L3753

## How This PR Solves The Issue

Replace `apiClient.ContainerWait` (streaming HTTP) with `ContainerInspect` polling in `RunSimpleContainerExtended`. The streaming API hangs indefinitely on proxied Docker sockets because the proxy keeps the long-lived HTTP connection open but never delivers data. `ContainerInspect` uses short request-response calls that are not affected by this.

A `context.WithTimeout` is used for the entire poll loop and passed to each `ContainerInspect` call, so an in-flight inspect is also cancelled when the deadline expires, not just the select case.

`RunSimpleContainerExtended` now takes a `timeout time.Duration` instead of a `detach bool`: `0` means don't wait (detach), any positive value sets the polling timeout. `RunSimpleContainer` defaults to 30 minutes (it may be adjusted in the future, if we decide to refactor the logic). `auth-ssh` uses 60 seconds (plenty enough for `ssh-add`), `addons` uses 30 minutes (I don't know what people can run in their add-ons).

## Manual Testing Instructions

Run tests on macOS with Colima vz backend. Previously `TestWriteConfig` and `TestDdevStartMultipleHostnames` would hang for hours; now they complete correctly.

## Automated Testing Overview

Added `TestRunSimpleContainerExtended` covering:

- `timeout=0` returns immediately with the container still running

## Release/Deployment Notes

No impact on normal operation. The timeout only triggers when Docker daemon genuinely stalls.