Description
Steps to reproduce
- Deploy dstack 0.19.36 server on AWS Fargate with default settings
- Configure AWS backend with 5+ GPU instances
- Run for 6-8 hours with active instances
- Observe the process count growing: `ps aux | wc -l` (see the monitoring sketch below)
- Eventually the server hits the subprocess limit and crashes
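For convenience, the following stdlib-only sketch counts SSH-related processes from inside the container by scanning `/proc`. It is an illustrative monitoring aid (assuming a Linux host), not part of dstack.

```python
# Illustrative monitoring aid (assumes Linux with /proc mounted): counts
# processes whose command line mentions "ssh" so the leak can be tracked over time.
import os
import time


def count_ssh_processes() -> int:
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # the process exited between listdir() and open()
        if "ssh" in cmdline:
            count += 1
    return count


if __name__ == "__main__":
    while True:
        print(f"{time.strftime('%H:%M:%S')} ssh-related processes: {count_ssh_processes()}")
        time.sleep(60)
```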
Environment
- dstack version: 0.19.36
- Platform: AWS ECS Fargate (4 vCPU, 8GB RAM)
- Python version: 3.11
- Number of managed GPU instances: 3-10 concurrent
Actual behaviour
After running the dstack server for several hours (typically 4-8 hours), it crashes with:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Subprocesses accumulate over time, eventually exhausting the available PIDs and crashing the server.
Stack Trace
{
"timestamp": "2025-11-13 15:50:05,062",
"logger": "apscheduler.executors.default",
"level": "ERROR",
"message": "Job \"process_instances (trigger: interval[0:00:04], next run at: 2025-11-13 15:50:09 UTC)\" raised an exception",
"exc_info": "Traceback (most recent call last):\n File \"/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/apscheduler/executors/base.py\", line 181, in run_coroutine_job\n retval = await job.func(*job.args, **job.kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py\", line 138, in process_instances\n await asyncio.gather(*tasks)\n File \"/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/utils/sentry_utils.py\", line 10, in wrapper\n return await f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py\", line 193, in _process_next_instance\n await _process_instance(session=session, instance=instance)\n File \"/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py\", line 240, in _process_instance\n await _check_instance(session, instance)\n File \"/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py\", line 772, in _check_instance\n instance_check = await run_async(\n ^^^^^^^^^^^^^^^^\n File \"/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/utils/common.py\", line 21, in run_async\n return await asyncio.get_running_loop().run_in_executor(None, func_with_args)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/concurrent/futures/thread.py\", line 58, in run\n result = self.fn(*self.args, **self.kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/services/runner/ssh.py\", line 91, in wrapper\n with SSHTunnel(\n File \"/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/core/services/ssh/tunnel.py\", line 233, in __enter__\n self.open()\n File \"/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/core/services/ssh/tunnel.py\", line 172, in open\n r = subprocess.run(\n ^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/subprocess.py\", line 548, in run\n with Popen(*popenargs, **kwargs) as process:\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/subprocess.py\", line 1026, in __init__\n self._execute_child(args, executable, preexec_fn, close_fds,\n File \"/usr/local/lib/python3.11/subprocess.py\", line 1885, in _execute_child\n self.pid = _fork_exec(\n ^^^^^^^^^^^\nBlockingIOError: [Errno 11] Resource temporarily unavailable"
}
Expected behaviour
SSH tunnel subprocesses should be properly cleaned up after each health check, preventing accumulation.
dstack version
0.19.36
Server logs
Additional information
Root Cause Analysis
SSH tunnel subprocesses appear to accumulate over time and are not properly cleaned up:
- Health checks run every 4 seconds per instance (the `apscheduler` interval)
- Each check spawns an SSH subprocess via the `SSHTunnel` context manager (a sketch of the suspected leak pattern follows this list)
- Over hours of operation, these processes accumulate until they hit OS limits
- The system can no longer fork new processes → `BlockingIOError`
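To make the suspected mechanism concrete, here is a hypothetical sketch of how a ControlMaster-style tunnel can leak a background process. The class name, flags, and control flow are assumptions for illustration only, not the actual `SSHTunnel` implementation in `tunnel.py`.

```python
# Hypothetical sketch of the suspected leak pattern -- NOT the dstack code.
import subprocess


class LeakyTunnel:
    def __init__(self, host: str, control_path: str):
        self.host = host
        self.control_path = control_path

    def open(self) -> None:
        # `ssh -f -N -M` forks a detached background master process.
        # subprocess.run() only sees the short-lived foreground parent,
        # so this object never holds a handle to the process that persists.
        subprocess.run(
            ["ssh", "-f", "-N", "-M", "-S", self.control_path, self.host],
            check=True,
        )

    def close(self) -> None:
        # Cleanup depends entirely on this control command succeeding.
        # If __exit__ is never reached (e.g. open() raised partway through)
        # or the control socket is gone, the background master process leaks.
        subprocess.run(
            ["ssh", "-S", self.control_path, "-O", "exit", self.host],
            check=False,
        )

    def __enter__(self) -> "LeakyTunnel":
        self.open()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()
```

Run every 4 seconds per instance, any path where `close()` is skipped or silently fails adds one long-lived `ssh` process, which matches the slow PID growth observed above.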
Evidence
- The log shows an `interval[0:00:04]` trigger
- The error is raised in `subprocess.py` at `_execute_child` → `_fork_exec()`
- Temporary fix: restarting dstack clears the accumulated processes
- The crash occurs after hundreds to thousands of health checks have been processed
Workarounds Attempted:
- ✅ Increased ulimits (`nproc: 16384`) - delays the crash but doesn't prevent it
- ✅ Set `initProcessEnabled: true` - helps with zombies but not with cleanup
- ❌ `DSTACK_SERVER_INSTANCE_HEALTH_MIN_COLLECT_INTERVAL_SECONDS=30` - appears to be ignored (the logs still show a 4s interval)
- ✅ Reduced `DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR=1` - helps but doesn't fix the root cause
Suggested Fix
Possible issues to investigate:
- `SSHTunnel.__exit__` cleanup: ensure subprocess termination in all code paths, including exceptions (see the sketch after this list)
- ThreadPoolExecutor lifecycle: use an explicit executor with a `max_workers` limit instead of `None`
- `subprocess.run` cleanup: ensure processes are waited on properly
- Context manager exceptions: SSH failures might prevent cleanup block execution
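A minimal sketch of the first two points, assuming the tunnel keeps a direct `Popen` handle to the process it spawns (which may not match how the current code is structured):

```python
# Illustrative defensive cleanup: the child is always terminated and reaped,
# even when the body of the `with` block raises.
import subprocess
from typing import Optional


class SafeTunnel:
    def __init__(self, cmd: list[str]):
        self.cmd = cmd
        self.proc: Optional[subprocess.Popen] = None

    def __enter__(self) -> "SafeTunnel":
        self.proc = subprocess.Popen(self.cmd)
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if self.proc is not None:
            try:
                self.proc.terminate()
                self.proc.wait(timeout=5)  # reap the child so its PID is released
            except subprocess.TimeoutExpired:
                self.proc.kill()           # escalate if SIGTERM is ignored
                self.proc.wait()
        return False  # never swallow exceptions raised inside the with-block
```

The key property is that the child is always `wait()`ed on, so its PID is released even when the health check raises.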
Related Code
- `dstack/_internal/core/services/ssh/tunnel.py:172` in the `open()` method
- `dstack/_internal/server/background/tasks/process_instances.py:772` in `_check_instance()`
- `dstack/_internal/utils/common.py:21` in `run_async()`, which uses the default executor (a bounded-executor sketch follows below)
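For the `run_async()` point, here is a sketch of what a bounded, explicit executor could look like; the signature and pool size are illustrative assumptions, not the existing function:

```python
# Sketch of the "explicit executor with a max_workers limit" suggestion.
import asyncio
import functools
from concurrent.futures import ThreadPoolExecutor

# A single, bounded pool caps how many blocking SSH checks (and therefore
# subprocess spawns) can be in flight at once. The size is an illustrative guess.
_EXECUTOR = ThreadPoolExecutor(max_workers=8, thread_name_prefix="dstack-bg")


async def run_async(func, *args, **kwargs):
    loop = asyncio.get_running_loop()
    call = functools.partial(func, *args, **kwargs)
    return await loop.run_in_executor(_EXECUTOR, call)
```

Bounding the pool doesn't replace proper tunnel cleanup, but it limits the blast radius when cleanup fails.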
Impact
- Server becomes unusable after several hours
- Requires manual restarts
- Affects production deployments on resource-constrained platforms (Fargate, smaller VMs)
- Particularly impacts users managing multiple GPU instances
Additional Context
This issue appears to be a resource leak that becomes critical on platforms with strict process limits (like AWS Fargate). The problem is reproducible and affects production deployments where dstack needs to run continuously for extended periods.
The workarounds (increased ulimits, reduced background processing) only delay the inevitable crash but don't address the root cause of subprocess accumulation.