Skip to content

Drop pnpm from CMD so Node is PID 1 and SIGTERM reaches its handler#4874

Merged
habdelra merged 2 commits into
mainfrom
sigterm-pid1-drop-pnpm-and-logging
May 19, 2026
Merged

Drop pnpm from CMD so Node is PID 1 and SIGTERM reaches its handler#4874
habdelra merged 2 commits into
mainfrom
sigterm-pid1-drop-pnpm-and-logging

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

Summary

PR #4860 tried to fix the worker-reservation orphaning by prepending exec to the CMD line and to the ts-node invocation inside the start scripts, on the hypothesis that pnpm 11.x would forward signals to its child once it was no longer behind an extra sh layer. That hypothesis is wrong, and this PR replaces it with the actual fix plus a layer of diagnostic logging.

Diagnostic data

Staging task-def boxel-worker-staging:2083 (commit a2bf28b — PR #4860's exec-prepend fix) has been deployed for 1+ hour across multiple rolling deploys. Across that window:

  • Zero occurrences of Shutting down server for worker manager..., Stopping N worker(s)..., or Draining reservations for N worker(s)... — the three signature lines from runShutdown in packages/realm-server/worker-manager.ts.
  • Zero rows in job_reservations with completion_reason='interrupted' from the new tasks.
  • The dying workers' last 3 log lines are uniformly pnpm's ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL followed by Command failed with signal "SIGTERM", and nothing from Node — i.e. pnpm received the signal and exited, but Node never logged a thing on its way out.

Net effect: SIGTERM doesn't reach Node's process.on('SIGTERM') handler, so the worker-manager never runs its drain UPDATE that marks in-flight reservations interrupted. The reservations stay locked until the 2-hour lease ages out. Job 410956 (ctse/personal from-scratch) is currently stuck this way.

Fix

Drop pnpm from PID 1. The deployed start scripts already exec into ts-node, so removing the pnpm wrapper makes Node the direct child of sh and Node becomes PID 1 once the script execs.

  • The 4 deployed Dockerfile CMDs change from exec pnpm --filter \"./packages/realm-server\" \$<x>_script to exec /realm-server/packages/realm-server/\$<x>_script.
  • The matching *_script build-args in .github/workflows/manual-deploy.yml change from npm script names (start:staging, start:worker-staging, etc.) to script paths (scripts/start-staging.sh, scripts/start-worker-staging.sh, etc.).
  • Because the scripts no longer inherit CWD/PATH from pnpm --filter, each script now cds into the realm-server package directory and prepends ./node_modules/.bin to PATH up front (matching what pnpm --filter did implicitly).

Process chain after the change: Docker sh -c \"exec /realm-server/packages/realm-server/\$worker_script\" → sh expands the env var, execs into the script (sh replaced — script is PID 1) → the script execs into ts-node (ts-node replaces the script — ts-node is PID 1).

Logging (independent insurance — keep even if the fix works)

Two layers, both using process.stderr.write (synchronous) so the line lands even if Node terminates immediately after.

Layer 1 — start scripts. Each of the 7 deployed start scripts now emits [<script-name>] pid=… ppid=… about to exec ts-node at <iso-ts> to stderr immediately before exec ts-node. Proves the shell got far enough to hand off.

Layer 2 — Node entry files. Each of worker-manager.ts, worker.ts, main.ts, prerender-server.ts, manager-server.ts now writes a [<tag>] STARTUP pid=… ppid=… argv=… line synchronously near the top of the file, and each SIGTERM/SIGINT/disconnect/uncaughtException handler writes [<tag>] <signal> received pid=… ppid=… synchronously as its first action — before calling the existing shutdown function. The body of runShutdown is unchanged.

What we learn from the next staging deploy:

  • [start-worker-staging] … about to exec ts-node then [worker-manager] STARTUP → process chain is healthy.
  • [worker-manager] SIGTERM received → signal delivery to Node is working now.
  • Shutting down server for worker manager... + Stopping N worker(s)... + Draining reservations for N worker(s)... → the existing graceful shutdown is running end-to-end.
  • If we see STARTUP but no SIGTERM received on the next rolling deploy, signal delivery is still broken upstream of Node and we have a different root cause to chase.

Scope

Dev-only scripts (start-pg.sh, start-matrix.sh, start-host-dist.sh, start-icons.sh, start-without-matrix.sh) are intentionally not touched — their lifetime model is different.

Test plan

  • After this PR merges and the next staging deploy completes, the rolling deploy that replaces the worker tasks should emit (per killed task, in CloudWatch):
    • [start-worker-staging] pid=… ppid=… about to exec ts-node at <ts>
    • [worker-manager] STARTUP pid=… ppid=… argv=[…]
    • [worker-manager] SIGTERM received pid=… ppid=…
    • Shutting down server for worker manager...
    • Stopping N worker(s)...
    • Draining reservations for N worker(s)...
  • At least one row in job_reservations from the killed task should land with completion_reason='interrupted'.
  • If only the STARTUP line appears and the SIGTERM-received line doesn't, the fix isn't enough and we follow up with a different root-cause investigation.

🤖 Generated with Claude Code

The previous attempt (prepending `exec` to CMD/ts-node) was insufficient:
after that change, staging logs still show zero `Shutting down server for
worker manager...` / `Stopping N worker(s)...` / `Draining reservations...`
lines on rolling deploys, and zero reservations end up with
completion_reason='interrupted'. The dying workers' last lines are pnpm's
`ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL` and `Command failed with signal "SIGTERM"`
— pnpm is intercepting SIGTERM and not forwarding it to Node, so the
registered SIGTERM handler in worker-manager.ts never fires and in-flight
reservations stay locked until the 2h lease ages out.

Change the four deployed Dockerfile CMDs from
  exec pnpm --filter "./packages/realm-server" $<x>_script
to
  exec /realm-server/packages/realm-server/$<x>_script
and switch the `*_script` build-args from npm script names to script paths
relative to `packages/realm-server/`. The start scripts already `exec` into
ts-node, so removing the pnpm wrapper makes Node the direct child of sh,
and Node becomes PID 1 once the script execs. Because the scripts no longer
inherit CWD/PATH from `pnpm --filter`, each script now `cd`s into the
realm-server package directory and prepends `./node_modules/.bin` to PATH
(matching what `pnpm --filter` set up implicitly).

Independent insurance — keep regardless of whether the fix lands cleanly:

  Layer 1 (start scripts): each of the 7 deployed start scripts now writes
  `[<script-name>] pid=… ppid=… about to exec ts-node` to stderr immediately
  before `exec ts-node`. Proves the shell got far enough to hand off.

  Layer 2 (Node): each of the 5 deployed Node entry files now writes a
  `[<tag>] STARTUP pid=… ppid=… argv=…` line to stderr synchronously near
  the top of the file, and each SIGTERM/SIGINT/disconnect/uncaughtException
  handler writes `[<tag>] <signal> received pid=… ppid=…` synchronously as
  its first action — before calling the existing shutdown function. The
  writes use `process.stderr.write` (not the logger) so the line lands even
  if Node terminates milliseconds later. If after a deploy we see STARTUP
  but no SIGTERM received, signal delivery is broken upstream of Node and
  we know to look elsewhere; if we see SIGTERM received but no draining
  log, the shutdown path itself is wedged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the deployed realm-server/worker/prerender containers so PID 1 ultimately becomes the Node/ts-node process (instead of pnpm), improving SIGTERM delivery to Node signal handlers and adding additional startup/signal diagnostic logging to help confirm signal flow during ECS rolling deploys.

Changes:

  • Switch deployed Dockerfile CMD from pnpm --filter ... $*_script to executing the start-script path directly (/realm-server/packages/realm-server/$*_script).
  • Update manual-deploy.yml build-args to pass script paths (e.g. scripts/start-staging.sh) instead of pnpm script names.
  • Update deployed start scripts to set CWD + prepend ./node_modules/.bin to PATH, and add stderr “about to exec ts-node” stamps; add stderr startup/signal stamps in Node entrypoints.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
.github/workflows/manual-deploy.yml Pass script path build-args (instead of pnpm script names) to match new Dockerfile CMD behavior.
packages/realm-server/main.ts Add stderr startup + SIGTERM/SIGINT receipt stamps near process startup/shutdown wiring.
packages/realm-server/prerender.Dockerfile Execute prerender start script directly from CMD (drop pnpm from PID 1).
packages/realm-server/prerender-manager.Dockerfile Execute prerender-manager start script directly from CMD (drop pnpm from PID 1).
packages/realm-server/prerender/manager-server.ts Add stderr startup + SIGTERM/SIGINT receipt stamps.
packages/realm-server/prerender/prerender-server.ts Add stderr startup + SIGTERM/SIGINT receipt stamps.
packages/realm-server/realm-server.Dockerfile Execute realm-server start script directly from CMD (drop pnpm from PID 1).
packages/realm-server/scripts/start-prerender-manager.sh Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
packages/realm-server/scripts/start-prerender-production.sh Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
packages/realm-server/scripts/start-prerender-staging.sh Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
packages/realm-server/scripts/start-production.sh Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
packages/realm-server/scripts/start-staging.sh Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
packages/realm-server/scripts/start-worker-production.sh Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
packages/realm-server/scripts/start-worker-staging.sh Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
packages/realm-server/worker.Dockerfile Execute worker start script directly from CMD (drop pnpm from PID 1).
packages/realm-server/worker-manager.ts Add stderr startup + SIGTERM/SIGINT/disconnect/uncaughtException receipt stamps.
packages/realm-server/worker.ts Add stderr startup + SIGTERM/SIGINT/disconnect receipt stamps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/realm-server/worker.ts Outdated
Comment thread packages/realm-server/worker-manager.ts Outdated
Comment thread packages/realm-server/main.ts Outdated
Comment thread packages/realm-server/prerender/prerender-server.ts Outdated
Comment thread packages/realm-server/prerender/manager-server.ts Outdated
`process.stderr.write` is libuv-async when stderr is a pipe (the
Docker / ECS case for these processes), so writes can be lost if the
process exits before libuv flushes — which is the exact scenario the
SIGTERM / SIGINT / disconnect / uncaughtException stamps are designed
to capture. `writeSync(2, ...)` calls write(2) directly, hands the
bytes to the kernel pipe buffer synchronously, and returns only after
that completes. Apply this to the STARTUP stamp as well, both for
consistency and so an immediate post-startup crash doesn't lose the
proof-of-life line.

No behavioural change other than the write semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from a team May 18, 2026 23:15
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

Host Test Results

    1 files      1 suites   1h 28m 39s ⏱️
2 661 tests 2 646 ✅ 15 💤 0 ❌
2 680 runs  2 665 ✅ 15 💤 0 ❌

Results for commit 6086166.

Realm Server Test Results

    1 files  ±  0      1 suites  ±0   8m 25s ⏱️ + 4m 27s
1 408 tests +546  1 408 ✅ +546  0 💤 ±0  0 ❌ ±0 
1 495 runs  +577  1 495 ✅ +577  0 💤 ±0  0 ❌ ±0 

Results for commit 6086166. ± Comparison against earlier commit 4f23c66.

@habdelra habdelra merged commit fec6659 into main May 19, 2026
68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants