fix(deploy): use signal-based readiness and drain for zero-downtime swaps by bnema · Pull Request #66 · bnema/gordon

bnema · 2026-02-07T20:21:44Z

Summary

- Switch deploy cutover from timer-first behavior to signal-based strategies with safe fallbacks.
- Add readiness strategies (deploy.readiness_mode: auto|docker-health|delay) with deploy.health_timeout; strict docker-health mode fails when no healthcheck is detected.
- Add drain strategies (deploy.drain_mode: auto|inflight|delay) with deploy.drain_timeout, plus per-container in-flight request tracking in the proxy, checked before the old container is stopped.
- Extend the runtime contract with a container health status lookup and implement it in the Docker adapter.
- Wire the new deploy settings and defaults into the app config, and document all new keys, defaults, and tuning guidance in the config docs.
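
The readiness strategies in the summary could be resolved along these lines. This is a minimal sketch, not the PR's code: it assumes `auto` prefers Docker health when the container defines a healthcheck and falls back to `delay` otherwise, which the description implies ("safe fallbacks") but does not spell out. `resolveReadiness` and `ErrNoHealthcheck` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoHealthcheck stands in for the strict docker-health failure
// mentioned in the summary (hypothetical name).
var ErrNoHealthcheck = errors.New("no healthcheck detected")

// resolveReadiness picks the concrete strategy for one deploy.
// hasHealthcheck reports whether the new container defines a healthcheck.
// Assumption: "auto" (and an empty mode) fall back to "delay".
func resolveReadiness(mode string, hasHealthcheck bool) (string, error) {
	switch mode {
	case "", "auto":
		if hasHealthcheck {
			return "docker-health", nil
		}
		return "delay", nil
	case "docker-health":
		if !hasHealthcheck {
			return "", ErrNoHealthcheck // strict: no silent fallback
		}
		return "docker-health", nil
	case "delay":
		return "delay", nil
	default:
		return "", fmt.Errorf("unknown readiness mode %q", mode)
	}
}

func main() {
	for _, mode := range []string{"auto", "docker-health", "delay"} {
		strategy, err := resolveReadiness(mode, false)
		fmt.Printf("mode=%s strategy=%q err=%v\n", mode, strategy, err)
	}
}
```

Keeping the strict mode separate from `auto` lets operators opt into a hard failure when an image ships without a healthcheck, while the default stays forgiving.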

Test Plan

make mocks
make lint
make test

Summary by CodeRabbit

New Features
- Zero-downtime deploys: coordinated proxy cache invalidation and optional in-flight request draining
- New readiness and drain strategies (delay, docker-health, auto) with configurable timeouts and delays
- Proxy now tracks in-flight requests and exposes a way to wait for them to drain
- Docker container health status exposed for readiness checks
Documentation
- Updated deploy configuration docs with new options, defaults, and restart/hot-reload notes

…ng zero-downtime deploy

The proxy cache invalidation was happening asynchronously via the event bus while the old container was stopped synchronously. Requests between the old container stop and the event bus processing hit a dead target, causing 503s.

Add a ProxyCacheInvalidator boundary interface and call it synchronously in Deploy() before the drain grace period and old container cleanup. This ensures the proxy routes to the new container before the old one is killed.
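
The ordering described in this commit message can be illustrated with a small sketch. Only the names `ProxyCacheInvalidator` and `InvalidateTarget` appear in this PR; the method signature, the `recorder` test double, and the `swap` helper are assumptions made for illustration.

```go
package main

import "fmt"

// ProxyCacheInvalidator is the boundary interface named in the commit
// message. The method signature here is an assumption.
type ProxyCacheInvalidator interface {
	InvalidateTarget(domain string)
}

// recorder is a hypothetical test double that records step order.
type recorder struct{ steps []string }

func (r *recorder) InvalidateTarget(domain string) {
	r.steps = append(r.steps, "invalidate:"+domain)
}

// swap shows the intended order: invalidate the proxy cache
// synchronously, then stop the old container.
func swap(inv ProxyCacheInvalidator, domain string, stopOld func()) {
	if inv != nil {
		// After this returns, the proxy resolves the new container.
		inv.InvalidateTarget(domain)
	}
	// A real deploy would apply the drain grace period here.
	stopOld()
}

func main() {
	rec := &recorder{}
	swap(rec, "app.example.com", func() {
		rec.steps = append(rec.steps, "stop-old")
	})
	fmt.Println(rec.steps)
}
```

Because the invalidation is a direct call rather than an event, there is no window in which the old container is already stopped while the proxy still routes to it.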

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal/usecase/container/service.go (1)
20-40: ⚠️ Potential issue | 🟡 Minor

Wire DrainDelay from config to avoid always falling back to 2s.

Config.DrainDelay is added but not populated in createContainerService, so user settings are ignored and the service always defaults to 2 seconds. Add the wiring to Viper:
Suggested changes

In internal/app/run.go at createContainerService:
 	containerConfig := container.Config{
 		...
 		ReadinessDelay:           v.GetDuration("deploy.readiness_delay"),
+		DrainDelay:               v.GetDuration("deploy.drain_delay"),
 	}
In internal/app/run.go at loadConfig:
 	v.SetDefault("deploy.readiness_delay", "5s")
+	v.SetDefault("deploy.drain_delay", "2s")

🤖 Fix all issues with AI agents

In `@internal/app/run.go`:
- Around line 463-466: The code wires synchronous invalidation via
svc.containerSvc.SetProxyCacheInvalidator(svc.proxySvc) while a
container.deployed handler later also triggers invalidation, risking duplicate
invalidations; fix by choosing one path or making the wiring conditional: either
remove this SetProxyCacheInvalidator call and rely on the container.deployed
handler, or add a guard before calling SetProxyCacheInvalidator (e.g., only set
if no existing invalidator or a config flag enables synchronous wiring), or make
the proxy invalidator implementation (svc.proxySvc.InvalidateTarget) explicitly
idempotent/cheap so duplicate calls are safe; update the logic around
SetProxyCacheInvalidator and the container.deployed handler to ensure only one
active invalidation path.
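
One of the options above is to make invalidation idempotent so that a duplicate call from the `container.deployed` handler is harmless. A minimal sketch of that idea follows; `targetCache`, its fields, and the boolean return value are hypothetical and do not reflect the actual proxy service.

```go
package main

import (
	"fmt"
	"sync"
)

// targetCache is a hypothetical stand-in for the proxy's route cache.
type targetCache struct {
	mu      sync.Mutex
	targets map[string]string // domain -> container address
}

// InvalidateTarget drops the cached target for a domain and reports
// whether an entry was actually removed. Deleting a missing key is a
// no-op, so calling this twice for the same domain is safe.
func (c *targetCache) InvalidateTarget(domain string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.targets[domain]
	delete(c.targets, domain)
	return ok
}

func main() {
	c := &targetCache{targets: map[string]string{
		"app.example.com": "10.0.0.2:8080",
	}}
	fmt.Println(c.InvalidateTarget("app.example.com")) // synchronous path
	fmt.Println(c.InvalidateTarget("app.example.com")) // later event-bus path
}
```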

internal/app/run.go

…date docs

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@internal/usecase/container/service.go`:
- Around line 373-384: The code treats s.config.DrainDelay == 0 as "unset" and
forces a 2s default, preventing an explicit "0s" from disabling the delay;
change the config and logic to distinguish unset from explicit zero (e.g., make
DrainDelay *time.Duration in the config struct or use a sentinel like -1) and
update the invalidation branch in service.go (the block referencing
s.config.DrainDelay and drainDelay) to: read the pointer/sentinel under s.mu, if
nil/unset use 2s default, if pointer/value == 0 use instant switchover, and if
sentinel (-1) treat as disabled/instant as intended (or document behavior if you
choose sentinel).

internal/usecase/container/service.go

…eouts

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@docs/config/index.md`:
- Line 64: Update the inline comment for the drain_delay config entry to use a
grammatically complete and clearer phrase; replace "Wait after proxy
invalidation before old stop" with something like "Wait after proxy invalidation
before stopping the old container" (refer to the drain_delay configuration line
shown in the diff) so the intent matches other comments in the section.

In `@internal/usecase/container/service.go`:
- Around line 327-360: In createStartedContainer, if
s.runtime.InspectContainer(ctx, newContainer.ID) returns an error you must stop
and remove the newly started container before returning; call the same cleanup
path used for readiness failures (e.g., s.cleanupFailedContainer(ctx,
newContainer.ID) or s.runtime.RemoveContainer(ctx, newContainer.ID, true) to
ensure the running container is not left behind) and then return the wrapped
error from InspectContainer so the failed deploy does not leave an untracked
container.
- Around line 409-416: The call to waiter.WaitForNoInFlight in the
shouldUseInFlight block ignores its bool result; capture the returned bool and
handle the timeout case by logging a warning that draining timed out and the old
container (use oldContainerID) may still have in-flight traffic, including the
timeout value (cfg.DrainTimeout or the default 30s) to aid operators; keep the
existing flow (return after the wait) but ensure you log when the function
returns false.

docs/config/index.md

internal/usecase/container/service.go

…drain timeout

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@internal/usecase/container/service.go`:
- Around line 407-410: The default drainMode is set to "delay" but docs state it
should be "auto"; update the logic where drainMode is initialized (variable
drainMode and cfg.DrainMode in internal/usecase/container/service.go) to set
drainMode = "auto" when cfg.DrainMode is empty so runtime behavior matches
documentation.
- Around line 1900-1903: The code sets readinessMode from cfg.ReadinessMode and
defaults to "delay" when empty, but docs say the default should be "auto";
change the fallback so that when cfg.ReadinessMode is empty you assign "auto"
instead of "delay" (update the assignment where readinessMode is computed and
any related comments or constants referencing readiness defaults to reflect
"auto"); ensure the symbol readinessMode and the use of cfg.ReadinessMode are
the only places changed so behavior matches docs.

coderabbitai · 2026-02-07T21:27:20Z

internal/usecase/container/service.go

+	drainMode := cfg.DrainMode
+	if drainMode == "" {
+		drainMode = "delay"
+	}


⚠️ Potential issue | 🟡 Minor

Default drainMode is inconsistent with documentation.

The code defaults drainMode to "delay" when empty, but the documentation specifies the default as "auto". This should be aligned with the documented behavior.

🐛 Proposed fix

 	drainMode := cfg.DrainMode
 	if drainMode == "" {
-		drainMode = "delay"
+		drainMode = "auto"
 	}

coderabbitai · 2026-02-07T21:27:20Z

internal/usecase/container/service.go

+	readinessMode := cfg.ReadinessMode
+	if readinessMode == "" {
+		readinessMode = "delay"
+	}


⚠️ Potential issue | 🟡 Minor

Default readinessMode is inconsistent with documentation.

The code defaults readinessMode to "delay" when empty, but the documentation in docs/config/index.md specifies the default as "auto". This inconsistency could cause unexpected behavior for users relying on the documented defaults.

🐛 Proposed fix

 	readinessMode := cfg.ReadinessMode
 	if readinessMode == "" {
-		readinessMode = "delay"
+		readinessMode = "auto"
 	}

bnema added 2 commits February 7, 2026 20:59

fix(container): extract deploy helpers to satisfy gocyclo

777a0a7

Copilot AI review requested due to automatic review settings February 7, 2026 20:21

Copilot started reviewing on behalf of bnema February 7, 2026 20:22

fix(container): skip old cleanup on canceled deploy context

2632e63

coderabbitai bot reviewed Feb 7, 2026

fix(deploy): wire drain delay, prevent duplicate invalidation, and update docs

4fba121

coderabbitai bot reviewed Feb 7, 2026

fix(deploy): add signal-based readiness and drain with documented timeouts

7e326be

bnema changed the title ~~fix(deploy): make proxy switch safer during container replacement~~ fix(deploy): use signal-based readiness and drain for zero-downtime swaps Feb 7, 2026

coderabbitai bot reviewed Feb 7, 2026

bnema added 2 commits February 7, 2026 22:23

fix(deploy): cleanup failed containers on inspect errors and warn on drain timeout

b0d058e

fix(deploy): honor explicit zero drain_delay and document instant switch

99c9ecb

coderabbitai bot reviewed Feb 7, 2026

bnema merged commit 5e11a00 into main Feb 7, 2026
2 of 3 checks passed

bnema deleted the fix/proxy-cache-race-condition-clean branch February 7, 2026 21:27

This was referenced Feb 8, 2026

Improve deploy readiness handling and push UX #68

Merged

refactor(cli): local control plane parity + security hardening #69

Merged

coderabbitai bot mentioned this pull request Mar 3, 2026

feat: strict zero-downtime deploy with readiness probes and rollback #91

Open

bnema merged 7 commits into main from fix/proxy-cache-race-condition-clean
