Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
60 changes: 60 additions & 0 deletions codev/reviews/324-shellper-processes-do-not-survive.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
# Review: Bugfix #324 — Shellper Processes Do Not Survive Tower Restart

## Summary

Fixed shellper processes dying when Tower restarts. Root cause was a pipe-based stdio dependency — shellper stderr was piped to Tower, and when Tower exited, the broken pipe caused unhandled EPIPE errors that crashed the shellper.

## Root Cause Analysis

**Spawn configuration in `session-manager.ts`:**
```typescript
stdio: ['ignore', 'pipe', 'pipe'] // stderr piped to Tower
```

When Tower exited via `process.exit(0)`:
1. The stderr pipe's read end closed
2. Shellper's next async write to `process.stderr` triggered EPIPE
3. Node.js delivered this as an unhandled `'error'` event on the stream
4. No error handler existed → process crashed

Despite `detached: true` and `child.unref()`, the pipe FD created a parent→child lifecycle dependency.

## Fix

**Two-part approach:**

1. **Primary fix** (`session-manager.ts`): Redirect shellper stderr to a log file (`socketPath.replace('.sock', '.log')`) instead of a pipe. File FDs have no parent dependency.

2. **Defense-in-depth** (`shellper-main.ts`): Add `stream.on('error', () => {})` handlers on `process.stdout` and `process.stderr` at startup to prevent crashes from any future stdio issues.

## Files Changed

| File | Change |
|------|--------|
| `session-manager.ts` | Redirect stderr to file FD; add .log cleanup; add explicit exit logging |
| `shellper-main.ts` | Add defensive error handlers on stdout/stderr |
| `shellper-survive-parent-exit.test.ts` | New regression test (2 cases) |
| `session-manager.test.ts` | Update 3 stderr tail tests for new behavior |

## Test Results

- 211 terminal tests pass (0 failures)
- 2 new regression tests:
- File-based stderr: shellper survives parent exit
- Error handler: shellper survives broken pipe with forced post-break write

## Consultation Results

| Model | Verdict | Confidence | Key Feedback |
|-------|---------|------------|--------------|
| Gemini | APPROVE | HIGH | Clean fix, proper FD lifecycle, noted dead code for cleanup |
| Codex | COMMENT | MEDIUM | Suggested strengthening broken-pipe test (addressed) |
| Claude | APPROVE | HIGH | Thorough analysis, no issues found |

## Lessons Learned

1. **`detached: true` is necessary but not sufficient** — any pipe-based stdio creates a lifecycle dependency between parent and child processes. Use file FDs or `'ignore'` for truly independent children.

2. **Node.js async EPIPE is dangerous** — when a writable stream is connected to a broken pipe, Node.js delivers EPIPE as an `'error'` event. Without a handler, this crashes the process. Always add error handlers on process stdio streams for long-lived daemon processes.

3. **Defense-in-depth matters** — the file-based stderr is the primary fix, but the error handlers protect against regressions if someone accidentally changes the spawn configuration back to pipes.
4 changes: 2 additions & 2 deletions packages/codev/src/commands/porch/__tests__/next.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -214,7 +214,7 @@ describe('porch next', () => {
expect(result.status).toBe('gate_pending');
expect(result.gate).toBe('spec-approval');
expect(result.tasks).toBeDefined();
expect(result.tasks![0].description).toContain('porch gate');
expect(result.tasks![0].description).toContain('STOP and wait for human approval');
});

// --------------------------------------------------------------------------
Expand Down Expand Up @@ -265,7 +265,7 @@ describe('porch next', () => {
expect(result.status).toBe('gate_pending');
expect(result.gate).toBe('spec-approval');
expect(result.tasks).toBeDefined();
expect(result.tasks![0].description).toContain('porch gate');
expect(result.tasks![0].description).toContain('STOP and wait for human approval');
});

// --------------------------------------------------------------------------
Expand Down
40 changes: 23 additions & 17 deletions packages/codev/src/terminal/__tests__/session-manager.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -1445,7 +1445,10 @@ describe('stderr tail logging (integration)', () => {
rmrf(socketDir);
});

it.skipIf(!!process.env.CI)('logs stderr tail when session exits normally', async () => {
it.skipIf(!!process.env.CI)('logs session exit without stderr tail (stderr goes to file)', async () => {
// Bugfix #324: stderr is redirected to a log file (not a pipe), so
// logStderrTail returns early (stderrBuffer is null). No "Last stderr"
// message is expected — diagnostics are in the .log file instead.
const logs: string[] = [];
const manager = new SessionManager({
socketDir,
Expand All @@ -1467,18 +1470,19 @@ describe('stderr tail logging (integration)', () => {
try { await manager.killSession('stderr-exit-test'); } catch { /* noop */ }
});

// Wait for exit event + stderr processing
// Wait for exit event
await new Promise<void>((resolve) => {
manager.on('session-exit', () => resolve());
});
// Wait for stderr close + log emission
await new Promise((r) => setTimeout(r, 1500));
await new Promise((r) => setTimeout(r, 500));

expect(logs.some((m) => m.includes('Session stderr-exit-test exited (code=0)'))).toBe(true);
expect(logs.some((m) => m.includes('Last stderr'))).toBe(true);
// stderr goes to file now — no pipe-based "Last stderr" tail
expect(logs.some((m) => m.includes('last stderr'))).toBe(false);
}, 15000);

it.skipIf(!!process.env.CI)('logs stderr tail when session is killed', async () => {
it.skipIf(!!process.env.CI)('logs session kill without stderr tail (stderr goes to file)', async () => {
// Bugfix #324: stderr goes to a file, not a pipe — no "Last stderr" log.
const logs: string[] = [];
const manager = new SessionManager({
socketDir,
Expand All @@ -1497,16 +1501,16 @@ describe('stderr tail logging (integration)', () => {
rows: 24,
});

// Wait briefly for shellper to start and emit stderr startup logs
// Wait briefly for shellper to start
await new Promise((r) => setTimeout(r, 500));

await manager.killSession('stderr-kill-test');
// Wait for stderr close + log emission
await new Promise((r) => setTimeout(r, 1500));
await new Promise((r) => setTimeout(r, 500));

// killSession logs stderr with exitCode=-1
expect(logs.some((m) => m.includes('Session stderr-kill-test exited (code=-1)'))).toBe(true);
expect(logs.some((m) => m.includes('Last stderr'))).toBe(true);
// stderr goes to file now — no pipe-based "Last stderr" tail
expect(logs.some((m) => m.includes('last stderr'))).toBe(false);
}, 15000);

it.skipIf(!!process.env.CI)('does not log stderr for reconnected sessions', async () => {
Expand Down Expand Up @@ -1543,11 +1547,13 @@ describe('stderr tail logging (integration)', () => {
await manager.killSession('no-stderr-test');
await new Promise((r) => setTimeout(r, 200));
// No stderr tail log — reconnected sessions have no stderr capture
expect(logs.some((m) => m.includes('Last stderr'))).toBe(false);
expect(logs.some((m) => m.includes('last stderr'))).toBe(false);
}
});

it.skipIf(!!process.env.CI)('deduplicates stderr tail logging', async () => {
it.skipIf(!!process.env.CI)('no stderr tail logged for file-based stderr (Bugfix #324)', async () => {
// Bugfix #324: stderr goes to a file — stderrBuffer is null, so
// logStderrTail returns early. No "Last stderr" messages at all.
const logs: string[] = [];
const manager = new SessionManager({
socketDir,
Expand All @@ -1569,15 +1575,15 @@ describe('stderr tail logging (integration)', () => {
try { await manager.killSession('stderr-dedup-test'); } catch { /* noop */ }
});

// Wait for exit + close events + stderr processing
// Wait for exit event
await new Promise<void>((resolve) => {
manager.on('session-exit', () => resolve());
});
await new Promise((r) => setTimeout(r, 1500));
await new Promise((r) => setTimeout(r, 500));

// Count how many "Last stderr" entries — should be exactly 1
const stderrLogCount = logs.filter((m) => m.includes('Last stderr')).length;
expect(stderrLogCount).toBe(1);
// No "Last stderr" entries — stderr goes to file, not pipe
const stderrLogCount = logs.filter((m) => m.includes('last stderr')).length;
expect(stderrLogCount).toBe(0);
}, 15000);
});

Expand Down
Loading
Loading