deploy: exec through pnpm/ts-node so PID 1 catches SIGTERM #4860
Merged
ECS sends SIGTERM to PID 1 at deploy / scale-in. With shell-form Docker CMDs and shell-form start scripts in place, PID 1 was `/bin/sh -c`, which doesn't forward signals to its foreground child. Result: the worker manager's SIGTERM handler never ran during rolling deploys and in-flight job reservations were orphaned until their 2h lease expired.

Prepend `exec` to:
- the CMD line in worker.Dockerfile, realm-server.Dockerfile, prerender.Dockerfile, and prerender-manager.Dockerfile so the containers' top shell replaces itself with pnpm
- the `ts-node` invocation in the seven deployed start scripts so the start-script shell replaces itself with ts-node

After this change PID 1 is pnpm and ts-node/Node is its direct child; pnpm 11.x forwards signals so SIGTERM reaches Node and the existing shutdown path runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Updates deployed container entrypoints so the long-running Node processes receive SIGTERM reliably during ECS rollovers by ensuring PID 1 and intermediate launch steps exec into the intended process.
Changes:
- Add `exec` to Dockerfile `CMD` lines so `/bin/sh -c` is replaced by `pnpm` (PID 1) at container start.
- Add `exec` to deployed `ts-node` start scripts so the shell process is replaced by `ts-node` (and signal handling lands on Node).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/realm-server/worker.Dockerfile | CMD now execs into pnpm to avoid a long-lived shell as PID 1. |
| packages/realm-server/realm-server.Dockerfile | Same exec adjustment for the realm-server image. |
| packages/realm-server/prerender.Dockerfile | Same exec adjustment for the prerender image. |
| packages/realm-server/prerender-manager.Dockerfile | Same exec adjustment for the prerender-manager image. |
| packages/realm-server/scripts/start-worker-staging.sh | exec ts-node so the script shell is replaced by Node. |
| packages/realm-server/scripts/start-worker-production.sh | Same exec ts-node change for production worker. |
| packages/realm-server/scripts/start-staging.sh | exec ts-node for staging realm-server startup. |
| packages/realm-server/scripts/start-production.sh | exec ts-node for production realm-server startup. |
| packages/realm-server/scripts/start-prerender-staging.sh | exec ts-node for staging prerender startup. |
| packages/realm-server/scripts/start-prerender-production.sh | exec ts-node for production prerender startup. |
| packages/realm-server/scripts/start-prerender-manager.sh | exec ts-node for prerender-manager startup. |
Contributor
lukemelia
approved these changes
May 18, 2026
What this changes

Two tiny `exec` insertions in 11 files: each deployed Dockerfile's `CMD pnpm ...` becomes `CMD exec pnpm ...`, and each deployed start script's `ts-node ...` invocation becomes `exec ts-node ...`.

That's it for the diff. The interesting part is why.
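The effect of the `exec` prefix can be seen outside Docker with a minimal sketch. It is not taken from the PR's Dockerfiles or start scripts: plain `sh` stands in for both the wrapper shell and the process it launches, and the helper name `same_pid` is made up for illustration. Each command prints the wrapper's PID followed by the PID of the process it starts.

```shell
#!/bin/sh
# Each command prints "<pid of outer shell> <pid of inner shell>".
# The trailing `true` keeps the outer shell from exec-ing its last command
# by itself, an optimization some shells apply to `sh -c`.
without_exec=$(sh -c 'printf "%s " "$$"; sh -c "echo \$\$"; true')
with_exec=$(sh -c 'printf "%s " "$$"; exec sh -c "echo \$\$"')

# Hypothetical helper: compares the two PIDs from one line of output.
same_pid() {
  [ "$1" = "$2" ] && echo same || echo different
}

# The variables are left unquoted on purpose so they split into two arguments.
echo "without exec: $(same_pid $without_exec)"
echo "with exec:    $(same_pid $with_exec)"
```

This should report `different` for the first case and `same` for the second. With `exec` the launched process takes over the wrapper's PID, which is the property that lets it become PID 1 in a container.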
The incident this fixes

Today's staging deploy of `boxel-worker-staging` orphaned 4 indexing-job reservations (jobs 409972, 409988, 409999, 410002). The workers were killed mid-job by ECS during the rolling deploy, but the SIGTERM-driven drain code in `packages/realm-server/worker-manager.ts`, which is supposed to mark in-flight reservations `completion_reason='interrupted'` so a sibling worker can re-claim within ~10s, never ran. So those reservations sat with `completed_at IS NULL` and the jobs are now stalled until the 7200s (2h) lease ages out.

Why the drain didn't run
In the deployed images, PID 1 is the `/bin/sh -c` created by the shell-form `CMD`. It launches pnpm, and the start script that pnpm runs is a second shell that launches ts-node/Node as its own child.

ECS sends SIGTERM to PID 1. POSIX `sh` with a foreground child and no `trap` does not forward SIGTERM. After `stopTimeout` (60s) ECS sends SIGKILL, which Node cannot catch. So `worker-manager`'s `process.on('SIGTERM', ...)` handler, and therefore the orphan-reservation drain, never fires.

After this PR, both shell layers replace themselves with their child via `exec`, collapsing the tree to pnpm as PID 1 with ts-node/Node as its direct child. pnpm 11.x forwards signals to its npm-script child, so SIGTERM reaches Node, the registered handler runs, and the drain finalize commits before the process exits.
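The signal behavior can be reproduced locally with a small sketch. Everything here is a stand-in rather than code from the PR: `child.sh` plays the role of Node's SIGTERM handler by writing a marker file, and the marker names and the `signal_wrapper` helper are hypothetical. One caveat is that outside a container the wrapper shell is not PID 1 and is simply killed by SIGTERM, so the details differ from ECS, but in both situations the signal never reaches the child unless the wrapper uses `exec`.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Stand-in for the Node process: on SIGTERM it records that it "drained".
cat > "$tmp/child.sh" <<'EOF'
marker=$1
trap 'echo drained > "$marker"; kill "$sleeper" 2>/dev/null; exit 0' TERM
sleep 5 &
sleeper=$!
wait "$sleeper"
EOF

# Hypothetical helper: send SIGTERM to the wrapper PID, as ECS does to PID 1.
signal_wrapper() {
  sleep 1              # give the child time to install its trap
  kill -TERM "$1"
  sleep 1              # give the handler time to run
}

# Case 1: the wrapper shell stays alive as the parent of the child.
sh -c 'sh "$0" "$1"; true' "$tmp/child.sh" "$tmp/plain.marker" >/dev/null 2>&1 &
signal_wrapper $!

# Case 2: the wrapper shell replaces itself with the child via exec.
sh -c 'exec sh "$0" "$1"' "$tmp/child.sh" "$tmp/exec.marker" >/dev/null 2>&1 &
signal_wrapper $!

plain_result=$([ -f "$tmp/plain.marker" ] && echo "handler ran" || echo "handler never ran")
exec_result=$([ -f "$tmp/exec.marker" ] && echo "handler ran" || echo "handler never ran")
echo "without exec: $plain_result"
echo "with exec:    $exec_result"
rm -rf "$tmp"
```

Only the `exec` case should report that the handler ran. In the other case the child is left running without ever seeing the signal, which mirrors how the drain handler was skipped.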
Scope

Touches only the 4 deployed Dockerfiles and the 7 deployed start scripts. Dev / local scripts (`start-pg.sh`, `start-matrix.sh`, `start-host-dist.sh`, `start-icons.sh`, `start-without-matrix.sh`) are intentionally left alone: their lifetime model is different (`mise dev-all` Ctrl-C, not ECS rolling deploy).

Test plan
- `docker build -f <dockerfile> .` succeeds for each of the 4 modified images.
- `aws ecs update-service --force-new-deployment` on `boxel-worker-staging` and confirm the rolling task shows these lines from `worker-manager` during stop:
  - `Shutting down server for worker manager...`
  - `Stopping N worker(s)...`
  - `Draining reservations for N worker(s)...`
- Interrupted reservations show `completion_reason = 'interrupted'` (not NULL) in `job_reservations`. A sibling worker should re-claim each affected job within ~10s of the kill.
- Repeat on `boxel-realm-server-staging` and `boxel-prerender-staging` to confirm the same change works for those containers (the realm-server's own SIGTERM handlers, and prerender's graceful HTTP shutdown).

[Claude Code 🤖]