fix(deploy): use signal-based readiness and drain for zero-downtime swaps#66
fix(deploy): use signal-based readiness and drain for zero-downtime swaps#66
Conversation
…ng zero-downtime deploy The proxy cache invalidation was happening asynchronously via the event bus while the old container was stopped synchronously. Requests between the old container stop and the event bus processing hit a dead target, causing 503s. Add a ProxyCacheInvalidator boundary interface and call it synchronously in Deploy() before the drain grace period and old container cleanup. This ensures the proxy routes to the new container before the old one is killed.
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/usecase/container/service.go (1)
20-40:⚠️ Potential issue | 🟡 MinorWire
DrainDelayfrom config to avoid always falling back to 2s.
Config.DrainDelayis added but not populated increateContainerService, so user settings are ignored and the service always defaults to 2 seconds. Add the wiring to Viper:Suggested changes
In
internal/app/run.goatcreateContainerService:containerConfig := container.Config{ ... ReadinessDelay: v.GetDuration("deploy.readiness_delay"), + DrainDelay: v.GetDuration("deploy.drain_delay"), }In
internal/app/run.goatloadConfig:v.SetDefault("deploy.readiness_delay", "5s") + v.SetDefault("deploy.drain_delay", "2s")
🤖 Fix all issues with AI agents
In `@internal/app/run.go`:
- Around line 463-466: The code wires synchronous invalidation via
svc.containerSvc.SetProxyCacheInvalidator(svc.proxySvc) while a
container.deployed handler later also triggers invalidation, risking duplicate
invalidations; fix by choosing one path or making the wiring conditional: either
remove this SetProxyCacheInvalidator call and rely on the container.deployed
handler, or add a guard before calling SetProxyCacheInvalidator (e.g., only set
if no existing invalidator or a config flag enables synchronous wiring), or make
the proxy invalidator implementation (svc.proxySvc.InvalidateTarget) explicitly
idempotent/cheap so duplicate calls are safe; update the logic around
SetProxyCacheInvalidator and the container.deployed handler to ensure only one
active invalidation path.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@internal/usecase/container/service.go`:
- Around line 373-384: The code treats s.config.DrainDelay == 0 as "unset" and
forces a 2s default, preventing an explicit "0s" from disabling the delay;
change the config and logic to distinguish unset from explicit zero (e.g., make
DrainDelay *time.Duration in the config struct or use a sentinel like -1) and
update the invalidation branch in service.go (the block referencing
s.config.DrainDelay and drainDelay) to: read the pointer/sentinel under s.mu, if
nil/unset use 2s default, if pointer/value == 0 use instant switchover, and if
sentinel (-1) treat as disabled/instant as intended (or document behavior if you
choose sentinel).
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@docs/config/index.md`:
- Line 64: Update the inline comment for the drain_delay config entry to use a
grammatically complete and clearer phrase; replace "Wait after proxy
invalidation before old stop" with something like "Wait after proxy invalidation
before stopping the old container" (refer to the drain_delay configuration line
shown in the diff) so the intent matches other comments in the section.
In `@internal/usecase/container/service.go`:
- Around line 327-360: In createStartedContainer, if
s.runtime.InspectContainer(ctx, newContainer.ID) returns an error you must stop
and remove the newly started container before returning; call the same cleanup
path used for readiness failures (e.g., s.cleanupFailedContainer(ctx,
newContainer.ID) or s.runtime.RemoveContainer(ctx, newContainer.ID, true) to
ensure the running container is not left behind) and then return the wrapped
error from InspectContainer so the failed deploy does not leave an untracked
container.
- Around line 409-416: The call to waiter.WaitForNoInFlight in the
shouldUseInFlight block ignores its bool result; capture the returned bool and
handle the timeout case by logging a warning that draining timed out and the old
container (use oldContainerID) may still have in-flight traffic, including the
timeout value (cfg.DrainTimeout or the default 30s) to aid operators; keep the
existing flow (return after the wait) but ensure you log when the function
returns false.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@internal/usecase/container/service.go`:
- Around line 407-410: The default drainMode is set to "delay" but docs state it
should be "auto"; update the logic where drainMode is initialized (variable
drainMode and cfg.DrainMode in internal/usecase/container/service.go) to set
drainMode = "auto" when cfg.DrainMode is empty so runtime behavior matches
documentation.
- Around line 1900-1903: The code sets readinessMode from cfg.ReadinessMode and
defaults to "delay" when empty, but docs say the default should be "auto";
change the fallback so that when cfg.ReadinessMode is empty you assign "auto"
instead of "delay" (update the assignment where readinessMode is computed and
any related comments or constants referencing readiness defaults to reflect
"auto"); ensure the symbol readinessMode and the use of cfg.ReadinessMode are
the only places changed so behavior matches docs.
| drainMode := cfg.DrainMode | ||
| if drainMode == "" { | ||
| drainMode = "delay" | ||
| } |
There was a problem hiding this comment.
Default drainMode is inconsistent with documentation.
The code defaults drainMode to "delay" when empty, but the documentation specifies the default as "auto". This should be aligned with the documented behavior.
🐛 Proposed fix
drainMode := cfg.DrainMode
if drainMode == "" {
- drainMode = "delay"
+ drainMode = "auto"
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| drainMode := cfg.DrainMode | |
| if drainMode == "" { | |
| drainMode = "delay" | |
| } | |
| drainMode := cfg.DrainMode | |
| if drainMode == "" { | |
| drainMode = "auto" | |
| } |
🤖 Prompt for AI Agents
In `@internal/usecase/container/service.go` around lines 407 - 410, The default
drainMode is set to "delay" but docs state it should be "auto"; update the logic
where drainMode is initialized (variable drainMode and cfg.DrainMode in
internal/usecase/container/service.go) to set drainMode = "auto" when
cfg.DrainMode is empty so runtime behavior matches documentation.
| readinessMode := cfg.ReadinessMode | ||
| if readinessMode == "" { | ||
| readinessMode = "delay" | ||
| } |
There was a problem hiding this comment.
Default readinessMode is inconsistent with documentation.
The code defaults readinessMode to "delay" when empty, but the documentation in docs/config/index.md specifies the default as "auto". This inconsistency could cause unexpected behavior for users relying on the documented defaults.
🐛 Proposed fix
readinessMode := cfg.ReadinessMode
if readinessMode == "" {
- readinessMode = "delay"
+ readinessMode = "auto"
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| readinessMode := cfg.ReadinessMode | |
| if readinessMode == "" { | |
| readinessMode = "delay" | |
| } | |
| readinessMode := cfg.ReadinessMode | |
| if readinessMode == "" { | |
| readinessMode = "auto" | |
| } |
🤖 Prompt for AI Agents
In `@internal/usecase/container/service.go` around lines 1900 - 1903, The code
sets readinessMode from cfg.ReadinessMode and defaults to "delay" when empty,
but docs say the default should be "auto"; change the fallback so that when
cfg.ReadinessMode is empty you assign "auto" instead of "delay" (update the
assignment where readinessMode is computed and any related comments or constants
referencing readiness defaults to reflect "auto"); ensure the symbol
readinessMode and the use of cfg.ReadinessMode are the only places changed so
behavior matches docs.
Summary
deploy.readiness_mode:auto|docker-health|delay) withdeploy.health_timeoutand strictdocker-healthfailureno healthcheck detecteddeploy.drain_mode:auto|inflight|delay) withdeploy.drain_timeout, plus proxy in-flight request tracking per container before old-container stopTest Plan
Summary by CodeRabbit
New Features
Documentation