@ammar-agent
Collaborator
Problem

Commands like `cat /tmp/results.json | jq '.'` would sometimes hang forever over SSH, ignoring their timeouts. This was observed with the bash tool over SSH; it is unclear whether it also happens on local workspaces.

Root Cause

The issue stems from using `WritableStream.close()` to close stdin:

  • `close()` is async and waits for a graceful acknowledgment from the remote end
  • Over SSH with ControlMaster multiplexing, the SSH channel's stdin buffer may not be flushed immediately
  • The `close()` promise therefore waits for an acknowledgment that may never come
  • Even timeouts couldn't help, because the promise was already stuck inside Node.js stream machinery

Solution

Use `stdin.abort()` instead of `stdin.close()` for immediate closure:

  • `abort()` - immediate force-close: marks the stream as errored, releases locks, and doesn't wait for acknowledgment
  • `close()` - graceful, async close: waits for all writes to be acknowledged

Applied selectively:

  • bash tool: use `abort()` - commands don't need stdin at all
  • SSHRuntime `mv`/`rm`/`stat`: use `abort()` - these commands don't read stdin
  • SSHRuntime `writeFile`: keep `close()` - it must flush data written to stdin
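The contrast can be seen in a minimal, standalone sketch (not the cmux code): a WHATWG `WritableStream` whose sink acknowledges writes slowly stands in for an SSH channel that is slow, or fails, to acknowledge.

```typescript
// Sketch: close() waits for every queued write to be acknowledged;
// abort() discards queued writes and settles without waiting for their acks.
const delay = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

function makeSlowSink(): WritableStream<string> {
  return new WritableStream<string>({
    async write() {
      await delay(100); // slow acknowledgment per chunk
    },
  });
}

async function timedClose(): Promise<number> {
  const writer = makeSlowSink().getWriter();
  for (const chunk of ["a", "b", "c"]) writer.write(chunk).catch(() => {});
  const t0 = Date.now();
  await writer.close(); // waits until every queued write is acknowledged
  return Date.now() - t0;
}

async function timedAbort(): Promise<number> {
  const writer = makeSlowSink().getWriter();
  for (const chunk of ["a", "b", "c"]) writer.write(chunk).catch(() => {});
  const t0 = Date.now();
  await writer.abort(new Error("force close")); // discards queued writes
  return Date.now() - t0;
}

const closeMs = await timedClose();
const abortMs = await timedAbort();
console.log(`close() waited ~${closeMs} ms; abort() returned in ~${abortMs} ms`);
```

With a sink that never acknowledges at all, the `close()` path would never settle, which is the hang described above.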

Testing

  • All 45 bash tool tests pass
  • SSH integration tests pass:
    • runtimeExecuteBash - verifies bash commands over SSH work
    • renameWorkspace - verifies mv operations work
    • runtimeFileEditing - verifies file writes (which use stdin) work correctly
  • All 921 unit tests pass
  • Typecheck passes

Implementation Details

Changed 4 locations:

  1. `src/services/tools/bash.ts` - use `abort()` to close stdin immediately
  2. `src/runtime/SSHRuntime.ts` (3 places) - use `abort()` for `mv`/`rm`/`stat` commands

Kept `close()` in 1 location:

  • `SSHRuntime.writeFile()` - needs a graceful close to ensure all data is written
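Why `writeFile` keeps `close()` can be shown with a hypothetical standalone sketch (the path and command here are illustrative, not from SSHRuntime): when data is piped to a process's stdin, EOF must not be sent until every chunk is flushed, and `abort()` would drop queued data and truncate the output.

```typescript
// Pipe data through `cat > file`, mirroring a writeFile-over-stdin flow.
import { spawn } from "node:child_process";
import { readFileSync } from "node:fs";
import { Writable } from "node:stream";

const child = spawn("sh", ["-c", "cat > /tmp/writefile-demo.txt"]);
// Wrap the Node stream as a WHATWG WritableStream, as a runtime wrapper might.
const stdin = Writable.toWeb(child.stdin!);
const writer = stdin.getWriter();
await writer.write("hello over stdin\n");
// close(), not abort(): wait until the data is flushed before sending EOF.
await writer.close();
await new Promise((resolve) => child.once("close", resolve));
const written = readFileSync("/tmp/writefile-demo.txt", "utf8");
console.log(JSON.stringify(written));
```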

Generated with cmux

Adds an integration test verifying that commands like `cat file | jq` don't hang when piping through stdin-reading commands. Tests both SSH and local runtimes.

The test:
- Creates a JSON file
- Pipes it through jq (which reads from stdin)
- Verifies command completes quickly (< 8s SSH, < 5s local)
- Would hang forever before the stdin.abort() fix
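A standalone sketch of the regression idea (not the actual cmux test): a command that pipes through a stdin-reading program must finish well before its timeout rather than hanging. `head` stands in for `jq` here so the sketch has no external dependency, and the file path is illustrative.

```typescript
// Regression-style check: the piped command completes quickly; the timeout
// option would fire (rejecting the promise) if stdin handling caused a hang.
import { execFile } from "node:child_process";
import { writeFileSync } from "node:fs";
import { promisify } from "node:util";

const exec = promisify(execFile);
writeFileSync("/tmp/results-demo.json", JSON.stringify({ ok: true }));

const t0 = Date.now();
const { stdout } = await exec(
  "sh",
  ["-c", "cat /tmp/results-demo.json | head -c 1000"],
  { timeout: 5000 },
);
const elapsed = Date.now() - t0;
console.log(`completed in ${elapsed} ms: ${stdout}`);
```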

Regression test for the stdin.close() hang issue.

Generated with `cmux`
ammario added this pull request to the merge queue Nov 4, 2025
Merged via the queue into main with commit 7a213e5 Nov 4, 2025
13 checks passed
ammario deleted the bash-timeout branch November 4, 2025 03:13
ammar-agent added a commit that referenced this pull request Nov 8, 2025
## Problem

Commands like `grep -RIn ... | head -n 200` would hang over the SSH runtime, not respecting timeouts. The previous fix (PR #504), which used `stdin.abort()`, addressed a symptom but not the root cause.

## Root Cause

`bash.ts` waited for THREE conditions before finalizing:
- `exitCode !== null`
- `stdoutEnded` (stream 'close' event)
- `stderrEnded` (stream 'close' event)

Over SSH with ControlMaster multiplexing, stream 'close' events don't
propagate reliably. The 50 ms grace-period fallback was insufficient, allowing
hangs on longer-running commands or under high SSH latency.
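A runnable illustration (the promise shapes are assumed, not the real `bash.ts` internals) of why the three-condition wait can hang: if one stream 'close' event never fires, the combined wait never settles even though the process has already exited.

```typescript
// Simulate the three conditions; stderr's 'close' never arrives (as over SSH).
const exited = Promise.resolve(0);                 // exitCode !== null
const stdoutEnded = Promise.resolve();             // stdout 'close' fired
const stderrEnded = new Promise<void>(() => {});   // stderr 'close' never fires

const graceTimeout = new Promise<string>((resolve) =>
  setTimeout(() => resolve("timed out"), 200),
);

// Old finalizer: all three conditions, so it stalls until a fallback fires.
const oldResult = await Promise.race([
  Promise.all([exited, stdoutEnded, stderrEnded]).then(() => "finalized"),
  graceTimeout,
]);

// New finalizer: process exit is the single source of truth.
const newResult = await exited.then(() => "finalized");
console.log(`old: ${oldResult}; new: ${newResult}`);
```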

## Solution

**Single source of truth: process exit triggers all cleanup**

Created DisposableProcess wrapper that:
1. Registers cleanup callbacks (stream destruction, etc.)
2. Auto-executes cleanup when process exits
3. Makes it IMPOSSIBLE to wait for stream events that never arrive

Key changes:
- SSHRuntime.exec(): Use DisposableProcess, cancel streams on exit
- LocalRuntime.exec(): Same pattern for consistency
- bash.ts: Simplified from complex multi-condition wait to simple `await exitCode`
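A hedged sketch of the DisposableProcess idea (names and shape assumed, not the actual cmux implementation): cleanup callbacks run on the process 'close' event, before the `exitCode` promise resolves, so callers can never end up waiting on stream events that might not arrive.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

class DisposableProcess {
  private cleanups: Array<() => void> = [];
  readonly exitCode: Promise<number | null>;

  constructor(child: ChildProcess) {
    this.exitCode = new Promise((resolve) => {
      child.once("close", (code) => {
        // Run every registered cleanup before exposing the exit code, so
        // "exitCode resolved" implies "resources released".
        for (const fn of this.cleanups.splice(0)) {
          try { fn(); } catch { /* best-effort teardown */ }
        }
        resolve(code);
      });
    });
  }

  /** Register teardown (e.g. destroying a stdio stream) bound to process exit. */
  defer(fn: () => void): void {
    this.cleanups.push(fn);
  }
}

// Usage: callers just await exitCode; stream teardown happens automatically.
const child = spawn("sh", ["-c", "echo hello"]);
const proc = new DisposableProcess(child);
let streamsDestroyed = false;
proc.defer(() => {
  child.stdout?.destroy();
  child.stderr?.destroy();
  streamsDestroyed = true;
});
const code = await proc.exitCode;
console.log(`exit ${code}, streams destroyed: ${streamsDestroyed}`);
```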

## Benefits

- ✅ Fixes the hang completely
- ✅ Simpler: Removed ~130 LoC of complex finalization logic
- ✅ Faster: No 50ms grace period on every command
- ✅ Correct by construction: Process lifetime bounds all resources

## Impact

**Invariant**: When exitCode resolves, streams are already destroyed.

**Proof**: DisposableProcess registers cleanup on process 'close' event,
which fires before exitCode promise resolves. Therefore: exitCode resolved ⟹
streams destroyed. No race conditions possible.

## Testing

- All 955 unit tests pass
- Added regression test for grep|head pattern
- bash.ts tests validate no hangs
- LocalRuntime/SSHRuntime tests pass

---
_Generated with `cmux`_