deploy: exec through pnpm/ts-node so PID 1 catches SIGTERM #4860

Merged: habdelra merged 1 commit into main from worker-exec-pid1-sigterm-forwarding on May 18, 2026

@habdelra commented May 18, 2026

What this changes

Two kinds of one-word exec insertion across 11 files: each deployed Dockerfile's CMD pnpm ... becomes CMD exec pnpm ..., and each deployed start script's ts-node ... invocation becomes exec ts-node ....

That's it for the diff. The interesting part is why.
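
As a quick illustration of what exec changes, the sketch below (not part of the PR, runnable in a POSIX sh) prints a shell's PID and then the PID of the command it launches, once as an ordinary child and once via exec. The trailing `:` in the first case is an assumption-driven guard: it keeps shells that turn a final command into an implicit exec from doing so.

```shell
#!/bin/sh
# Illustration only; not taken from the PR's Dockerfiles or start scripts.

# Without exec: the outer shell stays alive and the inner command runs as
# a child process, so the two PIDs differ.
without_exec=$(sh -c 'echo $$; sh -c "echo \$\$"; :')

# With exec: the outer shell replaces itself with the inner command, so
# the PID (and therefore the signal target) stays the same.
with_exec=$(sh -c 'echo $$; exec sh -c "echo \$\$"')

# Succeeds when both lines of a two-line PID listing are identical.
same_pid() {
  first=$(printf '%s\n' "$1" | sed -n 1p)
  second=$(printf '%s\n' "$1" | sed -n 2p)
  [ "$first" = "$second" ]
}

if same_pid "$without_exec"; then
  echo "without exec: same PID"
else
  echo "without exec: new PID"
fi

if same_pid "$with_exec"; then
  echo "with exec: same PID"
else
  echo "with exec: new PID"
fi
```

The second case is the property the change relies on: whatever PID the shell had, the exec'd program inherits it.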

The incident this fixes

Today's staging deploy of boxel-worker-staging orphaned 4 indexing-job reservations (jobs 409972, 409988, 409999, 410002). ECS killed the workers mid-job during the rolling deploy, but the SIGTERM-driven drain code in packages/realm-server/worker-manager.ts never ran. That code is supposed to mark in-flight reservations completion_reason='interrupted' so a sibling worker can re-claim them within ~10s. Instead, those reservations were left with completed_at IS NULL, and the jobs stay stalled until the 7200s (2h) lease ages out.

Why the drain didn't run

The deployed images have a process tree like this:

PID 1: /bin/sh -c "pnpm --filter ./packages/realm-server $worker_script"     (Docker shell-form CMD)
PID 2:   pnpm
PID 3:     /bin/sh ./scripts/start-worker-<env>.sh                            (npm script)
PID 4:       ts-node --transpileOnly worker-manager ...                       (the actual app)

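ECS sends SIGTERM to PID 1. A POSIX sh that is waiting on a foreground child and has no trap does not forward SIGTERM to that child. Because the kernel does not apply a signal's default action to PID 1 of a PID namespace, the shell itself also keeps running. After stopTimeout (60s) ECS sends SIGKILL, which Node cannot catch. So worker-manager's process.on('SIGTERM', ...) handler, and with it the orphan-reservation drain, never fires.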
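
The behaviour can be reproduced outside a container with a small sketch. Here app.sh is a hypothetical stand-in for the Node worker: it writes a marker file when its TERM handler runs. One difference from the deployed case is worth stating: outside a container the wrapping shell simply dies on SIGTERM, whereas as PID 1 it would typically stay alive. Either way the child never sees the signal.

```shell
#!/bin/sh
# Illustration only: app.sh is a hypothetical stand-in for the Node worker.
tmp=$(mktemp -d)

cat > "$tmp/app.sh" <<'EOF'
#!/bin/sh
# Record our PID so the demo can clean up, then "drain" on SIGTERM.
echo $$ > "$PIDFILE"
trap 'echo drained > "$MARKER"; exit 0' TERM
while :; do sleep 1; done
EOF
chmod +x "$tmp/app.sh"

# run_case LABEL WRAPPER: start the app under WRAPPER, send SIGTERM to the
# wrapper's PID, and report whether the app's TERM handler ran.
run_case() {
  label=$1
  MARKER="$tmp/$label.marker" PIDFILE="$tmp/$label.pid" APP="$tmp/app.sh" \
    sh -c "$2" &
  top=$!
  sleep 1                      # give the app time to install its trap
  kill -TERM "$top"
  wait "$top" 2>/dev/null || true
  sleep 2                      # allow a handler, if any, to finish
  if [ -f "$tmp/$label.marker" ]; then
    result="handler ran"
  else
    result="handler did NOT run"
    # Stand-in for the eventual SIGKILL: remove the orphaned app.
    kill -KILL "$(cat "$tmp/$label.pid")" 2>/dev/null || true
  fi
  echo "$label: $result"
}

# The trailing ':' keeps the wrapper shell alive as the app's parent.
run_case wrapped '"$APP"; :'
wrapped_result=$result

# With exec, the PID that receives SIGTERM is the app itself.
run_case exec 'exec "$APP"'
exec_result=$result

rm -rf "$tmp"
```

In the wrapped case the signal stops at the shell, so the marker is never written; in the exec case the same PID is the app and its handler runs.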

After this PR, both shell layers replace themselves with their child via exec, collapsing the tree to:

PID 1: pnpm
PID 2:   ts-node ... worker-manager

pnpm 11.x forwards signals to its npm-script child, so SIGTERM reaches Node, the registered handler runs, and the drain's finalize step commits before the process exits.
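
For readers unfamiliar with what "forwarding" means for a parent process, here is a minimal shell sketch of the idea. It is an assumption-laden illustration, not pnpm's implementation: parent.sh and app.sh are hypothetical stand-ins, and the parent relays SIGTERM to its child with a trap and then exits with the child's status.

```shell
#!/bin/sh
# Illustration only: parent.sh and app.sh are hypothetical stand-ins, not
# pnpm's or worker-manager's actual code.
tmp=$(mktemp -d)

cat > "$tmp/app.sh" <<'EOF'
#!/bin/sh
# "Drain" on SIGTERM by writing a marker file, then exit cleanly.
trap 'echo drained > "$MARKER"; exit 0' TERM
while :; do sleep 1; done
EOF

cat > "$tmp/parent.sh" <<'EOF'
#!/bin/sh
# Start the child, relay SIGTERM to it, and exit with its status.
sh "$APP" &
child=$!
trap 'kill -TERM "$child"' TERM
wait "$child"   # returns early when the trap fires
wait "$child"   # collects the child's real exit status
EOF

MARKER="$tmp/marker" APP="$tmp/app.sh" sh "$tmp/parent.sh" &
parent=$!
sleep 1                         # let both traps get installed
kill -TERM "$parent"
wait "$parent"
parent_status=$?

if [ -f "$tmp/marker" ]; then forwarded=yes; else forwarded=no; fi
echo "child handler ran: $forwarded (parent exit status $parent_status)"
rm -rf "$tmp"
```

The PR avoids writing this kind of relay by hand: exec removes the shells that would need it, and the remaining parent (pnpm) is reported to forward signals already.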

Scope

Touches only the 4 deployed Dockerfiles and the 7 deployed start scripts. Dev / local scripts (start-pg.sh, start-matrix.sh, start-host-dist.sh, start-icons.sh, start-without-matrix.sh) are intentionally left alone — their lifetime model is different (mise dev-all Ctrl-C, not ECS rolling deploy).

Test plan

  • docker build -f <dockerfile> . succeeds for each of the 4 modified images.
  • After staging deploy, run aws ecs update-service --force-new-deployment on boxel-worker-staging and confirm the rolling task shows these lines from worker-manager during stop:
    • Shutting down server for worker manager...
    • Stopping N worker(s)...
    • Draining reservations for N worker(s)...
  • Any reservation written during the rollover window should land with completion_reason = 'interrupted' (not NULL) in job_reservations. A sibling worker should re-claim each affected job within ~10s of the kill.
  • Repeat on boxel-realm-server-staging and boxel-prerender-staging to confirm the same change works for those containers (the realm-server's own SIGTERM handlers, and prerender's graceful HTTP shutdown).
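
The log check in the second step can be scripted. The helper below is a sketch under two assumptions: the task's log output has already been captured as plain text, and N is rendered as a decimal integer. The function name check_drain_logs is hypothetical.

```shell
#!/bin/sh
# check_drain_logs: read log text on stdin and succeed only if all three
# shutdown messages named in the test plan appear.
# Assumption: the "N" in the messages is printed as a decimal integer.
check_drain_logs() {
  log=$(cat)
  for pattern in \
    'Shutting down server for worker manager\.\.\.' \
    'Stopping [0-9]+ worker\(s\)\.\.\.' \
    'Draining reservations for [0-9]+ worker\(s\)\.\.\.'
  do
    if ! printf '%s\n' "$log" | grep -Eq "$pattern"; then
      echo "missing: $pattern" >&2
      return 1
    fi
  done
  echo "all drain messages present"
}
```

It could be used as `check_drain_logs < worker-stop.log`, where worker-stop.log is a hypothetical file holding the stopped task's log output.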

[Claude Code 🤖]

ECS sends SIGTERM to PID 1 at deploy / scale-in. With shell-form Docker
CMDs and shell-form start scripts in place, PID 1 was `/bin/sh -c`, which
doesn't forward signals to its foreground child. Result: the worker
manager's SIGTERM handler never ran during rolling deploys and in-flight
job reservations were orphaned until their 2h lease expired.

Prepend `exec` to:
- the CMD line in worker.Dockerfile, realm-server.Dockerfile,
  prerender.Dockerfile, and prerender-manager.Dockerfile so the
  containers' top shell replaces itself with pnpm
- the `ts-node` invocation in the seven deployed start scripts so the
  start-script shell replaces itself with ts-node

After this change PID 1 is pnpm and ts-node/Node is its direct child;
pnpm 11.x forwards signals so SIGTERM reaches Node and the existing
shutdown path runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@habdelra requested review from a team and Copilot on May 18, 2026 16:37

Copilot AI left a comment

Pull request overview

Updates deployed container entrypoints so the long-running Node processes receive SIGTERM reliably during ECS rollovers by ensuring PID 1 and intermediate launch steps exec into the intended process.

Changes:

  • Add exec to Dockerfile CMD lines so /bin/sh -c is replaced by pnpm (PID 1) at container start.
  • Add exec to deployed ts-node start scripts so the shell process is replaced by ts-node (and signal handling lands on Node).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| packages/realm-server/worker.Dockerfile | CMD now execs into pnpm to avoid a long-lived shell as PID 1. |
| packages/realm-server/realm-server.Dockerfile | Same exec adjustment for the realm-server image. |
| packages/realm-server/prerender.Dockerfile | Same exec adjustment for the prerender image. |
| packages/realm-server/prerender-manager.Dockerfile | Same exec adjustment for the prerender-manager image. |
| packages/realm-server/scripts/start-worker-staging.sh | exec ts-node so the script shell is replaced by Node. |
| packages/realm-server/scripts/start-worker-production.sh | Same exec ts-node change for production worker. |
| packages/realm-server/scripts/start-staging.sh | exec ts-node for staging realm-server startup. |
| packages/realm-server/scripts/start-production.sh | exec ts-node for production realm-server startup. |
| packages/realm-server/scripts/start-prerender-staging.sh | exec ts-node for staging prerender startup. |
| packages/realm-server/scripts/start-prerender-production.sh | exec ts-node for production prerender startup. |
| packages/realm-server/scripts/start-prerender-manager.sh | exec ts-node for prerender-manager startup. |


github-actions Bot commented May 18, 2026

Host Test Results

    1 files      1 suites   1h 30m 15s ⏱️
2 661 tests 2 646 ✅ 15 💤 0 ❌
2 680 runs  2 665 ✅ 15 💤 0 ❌

Results for commit 5392418.

Realm Server Test Results

    1 files      1 suites   8m 27s ⏱️
1 386 tests 1 386 ✅ 0 💤 0 ❌
1 467 runs  1 467 ✅ 0 💤 0 ❌

Results for commit 5392418.

@habdelra merged commit a2bf28b into main on May 18, 2026
71 checks passed