deploy: exec through pnpm/ts-node so PID 1 catches SIGTERM by habdelra · Pull Request #4860 · cardstack/boxel

habdelra · 2026-05-18T16:35:21Z

What this changes

Two kinds of one-word `exec` insertion across 11 files: each deployed Dockerfile's `CMD pnpm ...` becomes `CMD exec pnpm ...`, and each deployed start script's `ts-node ...` invocation becomes `exec ts-node ...`.

That's it for the diff. The interesting part is why.

The incident this fixes

Today's staging deploy of boxel-worker-staging orphaned 4 indexing-job reservations (jobs 409972, 409988, 409999, 410002). The workers were killed mid-job by ECS during the rolling deploy, but the SIGTERM-driven drain code in packages/realm-server/worker-manager.ts — which is supposed to mark in-flight reservations completion_reason='interrupted' so a sibling worker can re-claim within ~10s — never ran. So those reservations sat with completed_at IS NULL and the jobs are now stalled until the 7200s (2h) lease ages out.

Why the drain didn't run

The deployed images have a process tree like this:

PID 1: /bin/sh -c "pnpm --filter ./packages/realm-server $worker_script"     (Docker shell-form CMD)
PID 2:   pnpm
PID 3:     /bin/sh ./scripts/start-worker-<env>.sh                            (npm script)
PID 4:       ts-node --transpileOnly worker-manager ...                       (the actual app)

ECS sends SIGTERM to PID 1. POSIX sh with a foreground child and no trap does not forward SIGTERM. After stopTimeout (60s) ECS sends SIGKILL, which Node cannot catch. So worker-manager's process.on('SIGTERM', ...) handler — and therefore the orphan-reservation drain — never fires.
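
The missing forwarding can be reproduced outside ECS with a small sketch. Everything here is hypothetical and not code from this PR: `worker.sh` is a stand-in for worker-manager that writes a marker file when its TERM trap runs, and `run_case` is an illustrative helper. One caveat: run as an ordinary process, the wrapper shell simply dies on SIGTERM, whereas as PID 1 in a container the kernel does not deliver a signal for which the process has no handler installed. In both situations the child never sees the signal, which is the behaviour described above.

```shell
#!/bin/sh
# Hypothetical demo: compares a wrapper shell that keeps its child in the
# foreground with one that exec's into the child.

workdir=$(mktemp -d)

# Stand-in for worker-manager: installs a TERM trap, publishes its PID, idles.
cat > "$workdir/worker.sh" <<'EOF'
trap 'echo drained > "$1/marker"; exit 0' TERM
echo $$ > "$1/pid"
while :; do sleep 1; done
EOF

# run_case plain|exec: start the worker behind a wrapper shell, send SIGTERM
# to the wrapper, and report whether the worker's trap ran.
run_case() {
  rm -f "$workdir/pid" "$workdir/marker"
  if [ "$1" = exec ]; then
    # The wrapper replaces itself with the worker, so $! is the worker.
    sh -c 'exec sh "$0/worker.sh" "$0"' "$workdir" >/dev/null 2>&1 &
  else
    # The trailing ":" keeps the wrapper alive as the worker's parent.
    sh -c 'sh "$0/worker.sh" "$0"; :' "$workdir" >/dev/null 2>&1 &
  fi
  wrapper=$!
  while [ ! -s "$workdir/pid" ]; do sleep 1; done   # wait for the worker to start
  kill -TERM "$wrapper"
  tries=0
  while [ ! -f "$workdir/marker" ] && [ "$tries" -lt 4 ]; do
    sleep 1
    tries=$((tries + 1))
  done
  if [ -f "$workdir/marker" ]; then result=drained; else result=orphaned; fi
  kill -KILL "$(cat "$workdir/pid")" 2>/dev/null || true   # clean up a survivor
  echo "$result"
}
```

Calling `run_case plain` and then `run_case exec` shows the difference: in the plain case the signal stops at the wrapper and the worker's trap never runs, while in the exec case the worker is the process that was signalled.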

After this PR, both shell layers replace themselves with their child via exec, collapsing the tree to:

PID 1: pnpm
PID 2:   ts-node ... worker-manager

pnpm 11.x forwards signals to its npm-script child, so SIGTERM reaches Node, the registered handler runs, and the drain finalize commits before the process exits.
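
The way `exec` collapses a layer of the tree can be checked with a minimal sketch. The function names are hypothetical, and a nested `sh -c` stands in for pnpm or ts-node. Each function prints the outer shell's PID followed by the PID of the command it launches.

```shell
#!/bin/sh
# Hypothetical demo, not code from this PR.

# Without exec: the inner shell runs as a child and gets its own PID.
# The trailing ":" stops the outer shell from exec'ing its last command on
# its own, which some shells do as an optimization.
pids_without_exec() {
  sh -c 'echo $$; sh -c "echo \$\$"; :'
}

# With exec: the inner shell replaces the outer one and keeps its PID.
pids_with_exec() {
  sh -c 'echo $$; exec sh -c "echo \$\$"'
}
```

`pids_with_exec` prints the same PID twice, while `pids_without_exec` prints two different ones. Docker runs a shell-form CMD as `/bin/sh -c "<command>"`, so the same mechanism is what lets `CMD exec pnpm ...` leave pnpm holding PID 1.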

Scope

Touches only the 4 deployed Dockerfiles and the 7 deployed start scripts. Dev / local scripts (start-pg.sh, start-matrix.sh, start-host-dist.sh, start-icons.sh, start-without-matrix.sh) are intentionally left alone — their lifetime model is different (mise dev-all Ctrl-C, not ECS rolling deploy).

Test plan

docker build -f <dockerfile> . succeeds for each of the 4 modified images.
After staging deploy, run aws ecs update-service --force-new-deployment on boxel-worker-staging and confirm the rolling task shows these lines from worker-manager during stop:
- Shutting down server for worker manager...
- Stopping N worker(s)...
- Draining reservations for N worker(s)...
Any reservation written during the rollover window should land with completion_reason = 'interrupted' (not NULL) in job_reservations. A sibling worker should re-claim each affected job within ~10s of the kill.
Repeat on boxel-realm-server-staging and boxel-prerender-staging to confirm the same change works for those containers (the realm-server's own SIGTERM handlers, and prerender's graceful HTTP shutdown).

[Claude Code 🤖]

ECS sends SIGTERM to PID 1 at deploy / scale-in. With shell-form Docker CMDs and shell-form start scripts in place, PID 1 was `/bin/sh -c`, which doesn't forward signals to its foreground child. Result: the worker manager's SIGTERM handler never ran during rolling deploys and in-flight job reservations were orphaned until their 2h lease expired.

Prepend `exec` to:
- the CMD line in worker.Dockerfile, realm-server.Dockerfile, prerender.Dockerfile, and prerender-manager.Dockerfile so the containers' top shell replaces itself with pnpm
- the `ts-node` invocation in the seven deployed start scripts so the start-script shell replaces itself with ts-node

After this change PID 1 is pnpm and ts-node/Node is its direct child; pnpm 11.x forwards signals so SIGTERM reaches Node and the existing shutdown path runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Updates deployed container entrypoints so the long-running Node processes receive SIGTERM reliably during ECS rollovers by ensuring PID 1 and intermediate launch steps exec into the intended process.

Changes:

Add exec to Dockerfile CMD lines so /bin/sh -c is replaced by pnpm (PID 1) at container start.
Add exec to deployed ts-node start scripts so the shell process is replaced by ts-node (and signal handling lands on Node).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

File	Description
packages/realm-server/worker.Dockerfile	`CMD` now `exec`s into pnpm to avoid a long-lived shell as PID 1.
packages/realm-server/realm-server.Dockerfile	Same `exec` adjustment for the realm-server image.
packages/realm-server/prerender.Dockerfile	Same `exec` adjustment for the prerender image.
packages/realm-server/prerender-manager.Dockerfile	Same `exec` adjustment for the prerender-manager image.
packages/realm-server/scripts/start-worker-staging.sh	`exec ts-node` so the script shell is replaced by Node.
packages/realm-server/scripts/start-worker-production.sh	Same `exec ts-node` change for production worker.
packages/realm-server/scripts/start-staging.sh	`exec ts-node` for staging realm-server startup.
packages/realm-server/scripts/start-production.sh	`exec ts-node` for production realm-server startup.
packages/realm-server/scripts/start-prerender-staging.sh	`exec ts-node` for staging prerender startup.
packages/realm-server/scripts/start-prerender-production.sh	`exec ts-node` for production prerender startup.
packages/realm-server/scripts/start-prerender-manager.sh	`exec ts-node` for prerender-manager startup.

github-actions · 2026-05-18T17:02:01Z

Host Test Results

1 files 1 suites 1h 30m 15s ⏱️
2 661 tests 2 646 ✅ 15 💤 0 ❌
2 680 runs 2 665 ✅ 15 💤 0 ❌

Results for commit 5392418.

Realm Server Test Results

1 files 1 suites 8m 27s ⏱️
1 386 tests 1 386 ✅ 0 💤 0 ❌
1 467 runs 1 467 ✅ 0 💤 0 ❌

Results for commit 5392418.

habdelra requested review from a team and Copilot May 18, 2026 16:37

Copilot started reviewing on behalf of habdelra May 18, 2026 16:39

Copilot AI reviewed May 18, 2026

lukemelia approved these changes May 18, 2026

habdelra merged commit a2bf28b into main May 18, 2026
71 checks passed

habdelra mentioned this pull request May 18, 2026

Drop pnpm from CMD so Node is PID 1 and SIGTERM reaches its handler #4874 (Open, 9 tasks)

habdelra merged 1 commit into main from worker-exec-pid1-sigterm-forwarding

3 participants
