Skip to content

fix(cachewd): decouple scheduler context from signal context#312

Merged
joshfriend merged 1 commit into
mainfrom
jfriend/fix-scheduler-shutdown-context
May 15, 2026
Merged

fix(cachewd): decouple scheduler context from signal context#312
joshfriend merged 1 commit into
mainfrom
jfriend/fix-scheduler-shutdown-context

Conversation

@joshfriend
Copy link
Copy Markdown
Contributor

@joshfriend joshfriend commented May 15, 2026

The graceful shutdown PR (#307) passes the signal-notified context to the scheduler, so SIGTERM immediately kills all workers even though server.Shutdown keeps HTTP handlers alive for up to 150s. In-flight jobs (snapshots, repacks, fetches) get context-cancelled mid-execution.

Gives the scheduler its own context via context.WithoutCancel (same pattern already used for the HTTP server BaseContext) and cancels it after server.Shutdown completes. Workers finish their current jobs during the drain window instead of dying instantly. The drain is bounded to 10s so long-running jobs don't block process exit past the pod's terminationGracePeriodSeconds.

Production evidence from today's rolling deploys:

  • Every SIGTERM produced 32+ simultaneous Worker terminated log lines per pod
  • Snapshot jobs killed mid-write: tar failed: signal: killed, context canceled
  • cash-server snapshot killed 50s in, ios-register mirror snapshot killed 5m31s in

Shutdown order after this change:

  1. SIGTERM → readiness flips to 503
  2. server.Shutdown drains in-flight HTTP requests (up to 150s)
  3. Scheduler context cancelled → workers finish current job and exit
  4. Wait up to 10s for workers to drain, then exit

@joshfriend joshfriend force-pushed the jfriend/fix-scheduler-shutdown-context branch from efc4687 to 692f2e3 Compare May 15, 2026 20:11
@joshfriend joshfriend marked this pull request as ready for review May 15, 2026 20:12
@joshfriend joshfriend requested a review from a team as a code owner May 15, 2026 20:12
@joshfriend joshfriend requested review from worstell and removed request for a team May 15, 2026 20:12
@joshfriend joshfriend force-pushed the jfriend/fix-scheduler-shutdown-context branch from 692f2e3 to 8a6ea50 Compare May 15, 2026 20:14
The graceful shutdown PR (#307) passes the signal-notified context to
the scheduler, so SIGTERM immediately kills all workers even though
server.Shutdown keeps HTTP handlers alive for up to 150s. In-flight
jobs (snapshots, repacks, fetches) get context-cancelled mid-execution.

Give the scheduler its own context via context.WithoutCancel (same
pattern already used for the HTTP server BaseContext) and cancel it
after server.Shutdown completes. This lets workers finish current jobs
during the drain window.

The scheduler drain is bounded to 10s so long-running jobs don't block
process exit past the pod's terminationGracePeriodSeconds.
@joshfriend joshfriend force-pushed the jfriend/fix-scheduler-shutdown-context branch from 8a6ea50 to e913611 Compare May 15, 2026 20:14
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 692f2e3fc7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/cachewd/main.go Outdated
@joshfriend joshfriend merged commit 572fc02 into main May 15, 2026
9 checks passed
@joshfriend joshfriend deleted the jfriend/fix-scheduler-shutdown-context branch May 15, 2026 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants