
[Bug]: BlockingIOError crashes dstack server due to SSHTunnel resource leaks #3291

@peterschmidt85

Description

Steps to reproduce

  1. Deploy dstack 0.19.36 server on AWS Fargate with default settings
  2. Configure AWS backend with 5+ GPU instances
  3. Run for 6-8 hours with active instances
  4. Observe the process count growing with ps aux | wc -l (a sampling helper is sketched after this list)
  5. Eventually hits subprocess limit and crashes
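
To confirm that the growth comes from leaked ssh children rather than general process churn, a small sampler along the following lines can run alongside the server. This is a hypothetical helper, not part of dstack, and it assumes a Linux host with /proc available (as in the Fargate container):

import os
import time

def count_ssh_processes() -> int:
    """Count live processes whose command name is exactly "ssh"."""
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() == "ssh":
                    count += 1
        except OSError:
            continue  # process exited between listing and reading
    return count

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), "ssh processes:", count_ssh_processes())
        time.sleep(60)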

Environment

  • dstack version: 0.19.36
  • Platform: AWS ECS Fargate (4 vCPU, 8GB RAM)
  • Python version: 3.11
  • Number of managed GPU instances: 3-10 concurrent

Actual behaviour

After running the dstack server for several hours (typically 4-8 hours), it crashes with:

BlockingIOError: [Errno 11] Resource temporarily unavailable

Subprocesses accumulate over time, eventually exhausting available PIDs and causing server crash.

Stack Trace

timestamp: 2025-11-13 15:50:05,062
logger: apscheduler.executors.default
level: ERROR
message: Job "process_instances (trigger: interval[0:00:04], next run at: 2025-11-13 15:50:09 UTC)" raised an exception

Traceback (most recent call last):
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/apscheduler/executors/base.py", line 181, in run_coroutine_job
    retval = await job.func(*job.args, **job.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py", line 138, in process_instances
    await asyncio.gather(*tasks)
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/utils/sentry_utils.py", line 10, in wrapper
    return await f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py", line 193, in _process_next_instance
    await _process_instance(session=session, instance=instance)
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py", line 240, in _process_instance
    await _check_instance(session, instance)
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py", line 772, in _check_instance
    instance_check = await run_async(
                     ^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/utils/common.py", line 21, in run_async
    return await asyncio.get_running_loop().run_in_executor(None, func_with_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/services/runner/ssh.py", line 91, in wrapper
    with SSHTunnel(
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/core/services/ssh/tunnel.py", line 233, in __enter__
    self.open()
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/core/services/ssh/tunnel.py", line 172, in open
    r = subprocess.run(
        ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/lib/python3.11/subprocess.py", line 1885, in _execute_child
    self.pid = _fork_exec(
               ^^^^^^^^^^^
BlockingIOError: [Errno 11] Resource temporarily unavailable

Expected behaviour

SSH tunnel subprocesses should be properly cleaned up after each health check, preventing accumulation.

dstack version

0.19.36

Server logs

Additional information

Root Cause Analysis

SSH tunnel subprocesses appear to accumulate over time and are not properly cleaned up (a minimal sketch of the pattern follows this list):

  1. Health checks run every 4 seconds per instance (apscheduler interval)
  2. Each check spawns an SSH subprocess via SSHTunnel context manager
  3. Over hours of operation, these processes accumulate until hitting OS limits
  4. System can no longer fork new processes → BlockingIOError
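
A minimal, self-contained sketch of the suspected pattern (not dstack's actual code): a tunnel-like context manager whose __enter__ starts a long-lived child process and whose __exit__ never terminates it. Each "health check" then leaks one process, and once the OS limit is reached, fork() fails with EAGAIN, which Python surfaces as BlockingIOError: [Errno 11].

import subprocess

class LeakyTunnel:
    """Illustrative stand-in for an SSH tunnel wrapper; not dstack code."""

    def __enter__(self):
        # Stand-in for SSHTunnel.open(); the real code spawns ssh here.
        self.proc = subprocess.Popen(["sleep", "3600"])
        return self

    def __exit__(self, exc_type, exc, tb):
        # The bug being illustrated: the child is neither killed nor waited on.
        return False

children = []
for _ in range(100):  # one iteration per 4-second health check
    with LeakyTunnel() as tunnel:
        children.append(tunnel.proc)  # track the leaked child for the demo

print("live children after 100 checks:", sum(p.poll() is None for p in children))

for p in children:  # clean up the demo's own leak
    p.kill()
    p.wait()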

Evidence

  • The log shows an interval[0:00:04] trigger, i.e. a health check every 4 seconds
  • The error is raised from _fork_exec() inside subprocess.py's _execute_child()
  • Temporary fix: restarting dstack clears the accumulated processes
  • The crash occurs after hundreds to thousands of health checks have run

Workarounds Attempted

  1. ✅ Increased ulimits (nproc: 16384) - delays the crash but doesn't prevent it
  2. ✅ Set initProcessEnabled: true - helps reap zombies but doesn't stop the accumulation
  3. Set DSTACK_SERVER_INSTANCE_HEALTH_MIN_COLLECT_INTERVAL_SECONDS=30 - appears to be ignored (the log still shows a 4-second interval)
  4. ✅ Reduced DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR to 1 - helps but doesn't fix the root cause

Suggested Fix

Possible issues to investigate:

  1. SSHTunnel.__exit__ cleanup: ensure subprocess termination in all code paths, including when exceptions are raised (a defensive-cleanup sketch follows this list)
  2. ThreadPoolExecutor lifecycle: use an explicit executor with a max_workers limit instead of the default None executor (see the sketch after Related Code)
  3. subprocess.run cleanup: Ensure processes are waited on properly
  4. Context manager exceptions: SSH failures might prevent cleanup block execution
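
For item 1, a hedged sketch of what defensive cleanup could look like, written against a generic Tunnel class rather than dstack's actual SSHTunnel API: terminate and reap the child in a close() that is reached from __exit__ on every path, and also when open() itself fails.

import subprocess
from typing import Optional

class Tunnel:
    """Generic tunnel wrapper used only to illustrate the cleanup pattern."""

    def __init__(self, cmd: list[str]):
        self.cmd = cmd
        self.proc: Optional[subprocess.Popen] = None

    def open(self) -> None:
        # The real SSHTunnel.open() runs ssh; any long-lived command works here.
        self.proc = subprocess.Popen(self.cmd)

    def close(self) -> None:
        if self.proc is None:
            return
        try:
            self.proc.terminate()
            try:
                self.proc.wait(timeout=5)  # reap the child to avoid zombies
            except subprocess.TimeoutExpired:
                self.proc.kill()
                self.proc.wait()
        finally:
            self.proc = None

    def __enter__(self) -> "Tunnel":
        try:
            self.open()
        except Exception:
            self.close()  # don't leak a half-opened tunnel
            raise
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()  # runs even when the body raised

The key point is that close() is idempotent, always waits on the child, and cannot be skipped by an exception in either open() or the with-block body.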

Related Code

  • dstack/_internal/core/services/ssh/tunnel.py:172 in open() method
  • dstack/_internal/server/background/tasks/process_instances.py:772 in _check_instance()
  • dstack/_internal/utils/common.py:21 in run_async() using default executor
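
For item 2 under Suggested Fix, a hedged sketch of bounding the blocking SSH checks with an explicit executor instead of the loop's default (None) executor; run_async and MAX_SSH_CHECKS below are illustrative names, not dstack's actual API. This caps how many ssh subprocesses can be alive at once, though it does not by itself reclaim a leaked child.

import asyncio
import functools
from concurrent.futures import ThreadPoolExecutor

MAX_SSH_CHECKS = 8  # upper bound on concurrently running blocking checks

_ssh_executor = ThreadPoolExecutor(
    max_workers=MAX_SSH_CHECKS, thread_name_prefix="ssh-check"
)

async def run_async(func, *args, **kwargs):
    # Same shape as a run-in-executor helper, but with a bounded pool.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        _ssh_executor, functools.partial(func, *args, **kwargs)
    )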

Impact

  • Server becomes unusable after several hours
  • Requires manual restarts
  • Affects production deployments on resource-constrained platforms (Fargate, smaller VMs)
  • Particularly impacts users managing multiple GPU instances

Additional Context

This issue appears to be a resource leak that becomes critical on platforms with strict process limits (like AWS Fargate). The problem is reproducible and affects production deployments where dstack needs to run continuously for extended periods.

The workarounds (increased ulimits, reduced background processing) only delay the inevitable crash but don't address the root cause of subprocess accumulation.
