Skip to content

fix(deploy): use signal-based readiness and drain for zero-downtime swaps#66

Merged
bnema merged 7 commits intomainfrom
fix/proxy-cache-race-condition-clean
Feb 7, 2026
Merged

fix(deploy): use signal-based readiness and drain for zero-downtime swaps#66
bnema merged 7 commits intomainfrom
fix/proxy-cache-race-condition-clean

Conversation

@bnema
Copy link
Owner

@bnema bnema commented Feb 7, 2026

Summary

  • switch deploy cutover from timer-first behavior to signal-based strategies with safe fallbacks
  • add readiness strategies (deploy.readiness_mode: auto|docker-health|delay) with deploy.health_timeout and strict docker-health failure no healthcheck detected
  • add drain strategies (deploy.drain_mode: auto|inflight|delay) with deploy.drain_timeout, plus proxy in-flight request tracking per container before old-container stop
  • extend runtime contract with container health status lookup and implement Docker adapter support
  • wire new deploy settings/defaults in app config and document all new keys, defaults, and tuning guidance in config docs

Test Plan

  • make mocks
  • make lint
  • make test

Summary by CodeRabbit

  • New Features

    • Zero-downtime deploys: coordinated proxy cache invalidation and optional in-flight request draining
    • New readiness and drain strategies (delay, docker-health, auto) with configurable timeouts and delays
    • Proxy now tracks in-flight requests and exposes waiting for drains
    • Docker container health status exposed for readiness checks
  • Documentation

    • Updated deploy configuration docs with new options, defaults, and restart/hot-reload notes

bnema added 2 commits February 7, 2026 20:59
…ng zero-downtime deploy

The proxy cache invalidation was happening asynchronously via the event bus
while the old container was stopped synchronously. Requests between the old
container stop and the event bus processing hit a dead target, causing 503s.

Add a ProxyCacheInvalidator boundary interface and call it synchronously in
Deploy() before the drain grace period and old container cleanup. This ensures
the proxy routes to the new container before the old one is killed.
Copilot AI review requested due to automatic review settings February 7, 2026 20:21
@coderabbitai

This comment was marked as outdated.

This comment was marked as outdated.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/usecase/container/service.go (1)

20-40: ⚠️ Potential issue | 🟡 Minor

Wire DrainDelay from config to avoid always falling back to 2s.

Config.DrainDelay is added but not populated in createContainerService, so user settings are ignored and the service always defaults to 2 seconds. Add the wiring to Viper:

Suggested changes

In internal/app/run.go at createContainerService:

 	containerConfig := container.Config{
 		...
 		ReadinessDelay:           v.GetDuration("deploy.readiness_delay"),
+		DrainDelay:               v.GetDuration("deploy.drain_delay"),
 	}

In internal/app/run.go at loadConfig:

 	v.SetDefault("deploy.readiness_delay", "5s")
+	v.SetDefault("deploy.drain_delay", "2s")
🤖 Fix all issues with AI agents
In `@internal/app/run.go`:
- Around line 463-466: The code wires synchronous invalidation via
svc.containerSvc.SetProxyCacheInvalidator(svc.proxySvc) while a
container.deployed handler later also triggers invalidation, risking duplicate
invalidations; fix by choosing one path or making the wiring conditional: either
remove this SetProxyCacheInvalidator call and rely on the container.deployed
handler, or add a guard before calling SetProxyCacheInvalidator (e.g., only set
if no existing invalidator or a config flag enables synchronous wiring), or make
the proxy invalidator implementation (svc.proxySvc.InvalidateTarget) explicitly
idempotent/cheap so duplicate calls are safe; update the logic around
SetProxyCacheInvalidator and the container.deployed handler to ensure only one
active invalidation path.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@internal/usecase/container/service.go`:
- Around line 373-384: The code treats s.config.DrainDelay == 0 as "unset" and
forces a 2s default, preventing an explicit "0s" from disabling the delay;
change the config and logic to distinguish unset from explicit zero (e.g., make
DrainDelay *time.Duration in the config struct or use a sentinel like -1) and
update the invalidation branch in service.go (the block referencing
s.config.DrainDelay and drainDelay) to: read the pointer/sentinel under s.mu, if
nil/unset use 2s default, if pointer/value == 0 use instant switchover, and if
sentinel (-1) treat as disabled/instant as intended (or document behavior if you
choose sentinel).

@bnema bnema changed the title fix(deploy): make proxy switch safer during container replacement fix(deploy): use signal-based readiness and drain for zero-downtime swaps Feb 7, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/config/index.md`:
- Line 64: Update the inline comment for the drain_delay config entry to use a
grammatically complete and clearer phrase; replace "Wait after proxy
invalidation before old stop" with something like "Wait after proxy invalidation
before stopping the old container" (refer to the drain_delay configuration line
shown in the diff) so the intent matches other comments in the section.

In `@internal/usecase/container/service.go`:
- Around line 327-360: In createStartedContainer, if
s.runtime.InspectContainer(ctx, newContainer.ID) returns an error you must stop
and remove the newly started container before returning; call the same cleanup
path used for readiness failures (e.g., s.cleanupFailedContainer(ctx,
newContainer.ID) or s.runtime.RemoveContainer(ctx, newContainer.ID, true) to
ensure the running container is not left behind) and then return the wrapped
error from InspectContainer so the failed deploy does not leave an untracked
container.
- Around line 409-416: The call to waiter.WaitForNoInFlight in the
shouldUseInFlight block ignores its bool result; capture the returned bool and
handle the timeout case by logging a warning that draining timed out and the old
container (use oldContainerID) may still have in-flight traffic, including the
timeout value (cfg.DrainTimeout or the default 30s) to aid operators; keep the
existing flow (return after the wait) but ensure you log when the function
returns false.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@internal/usecase/container/service.go`:
- Around line 407-410: The default drainMode is set to "delay" but docs state it
should be "auto"; update the logic where drainMode is initialized (variable
drainMode and cfg.DrainMode in internal/usecase/container/service.go) to set
drainMode = "auto" when cfg.DrainMode is empty so runtime behavior matches
documentation.
- Around line 1900-1903: The code sets readinessMode from cfg.ReadinessMode and
defaults to "delay" when empty, but docs say the default should be "auto";
change the fallback so that when cfg.ReadinessMode is empty you assign "auto"
instead of "delay" (update the assignment where readinessMode is computed and
any related comments or constants referencing readiness defaults to reflect
"auto"); ensure the symbol readinessMode and the use of cfg.ReadinessMode are
the only places changed so behavior matches docs.

Comment on lines +407 to +410
drainMode := cfg.DrainMode
if drainMode == "" {
drainMode = "delay"
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Default drainMode is inconsistent with documentation.

The code defaults drainMode to "delay" when empty, but the documentation specifies the default as "auto". This should be aligned with the documented behavior.

🐛 Proposed fix
 	drainMode := cfg.DrainMode
 	if drainMode == "" {
-		drainMode = "delay"
+		drainMode = "auto"
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
drainMode := cfg.DrainMode
if drainMode == "" {
drainMode = "delay"
}
drainMode := cfg.DrainMode
if drainMode == "" {
drainMode = "auto"
}
🤖 Prompt for AI Agents
In `@internal/usecase/container/service.go` around lines 407 - 410, The default
drainMode is set to "delay" but docs state it should be "auto"; update the logic
where drainMode is initialized (variable drainMode and cfg.DrainMode in
internal/usecase/container/service.go) to set drainMode = "auto" when
cfg.DrainMode is empty so runtime behavior matches documentation.

Comment on lines +1900 to +1903
readinessMode := cfg.ReadinessMode
if readinessMode == "" {
readinessMode = "delay"
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Default readinessMode is inconsistent with documentation.

The code defaults readinessMode to "delay" when empty, but the documentation in docs/config/index.md specifies the default as "auto". This inconsistency could cause unexpected behavior for users relying on the documented defaults.

🐛 Proposed fix
 	readinessMode := cfg.ReadinessMode
 	if readinessMode == "" {
-		readinessMode = "delay"
+		readinessMode = "auto"
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
readinessMode := cfg.ReadinessMode
if readinessMode == "" {
readinessMode = "delay"
}
readinessMode := cfg.ReadinessMode
if readinessMode == "" {
readinessMode = "auto"
}
🤖 Prompt for AI Agents
In `@internal/usecase/container/service.go` around lines 1900 - 1903, The code
sets readinessMode from cfg.ReadinessMode and defaults to "delay" when empty,
but docs say the default should be "auto"; change the fallback so that when
cfg.ReadinessMode is empty you assign "auto" instead of "delay" (update the
assignment where readinessMode is computed and any related comments or constants
referencing readiness defaults to reflect "auto"); ensure the symbol
readinessMode and the use of cfg.ReadinessMode are the only places changed so
behavior matches docs.

@bnema bnema merged commit 5e11a00 into main Feb 7, 2026
2 of 3 checks passed
@bnema bnema deleted the fix/proxy-cache-race-condition-clean branch February 7, 2026 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants