
[Bug]: BlockingIOError crashes dstack server due to SSHTunnel resource leaks #3291

@peterschmidt85

Description

Steps to reproduce

  1. Deploy dstack 0.19.36 server on AWS Fargate with default settings
  2. Configure AWS backend with 5+ GPU instances
  3. Run for 6-8 hours with active instances
  4. Observe the process count growing with ps aux | wc -l (a sampling helper is sketched after this list)
  5. Eventually hits subprocess limit and crashes
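
To confirm that the growth comes from leaked ssh children rather than general process churn, a small sampler along the following lines can run alongside the server. This is a hypothetical helper, not part of dstack, and it assumes a Linux host with /proc available (as in the Fargate container):

import os
import time

def count_ssh_processes() -> int:
    """Count live processes whose command name is exactly "ssh"."""
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() == "ssh":
                    count += 1
        except OSError:
            continue  # process exited between listing and reading
    return count

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), "ssh processes:", count_ssh_processes())
        time.sleep(60)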

Environment

  • dstack version: 0.19.36
  • Platform: AWS ECS Fargate (4 vCPU, 8GB RAM)
  • Python version: 3.11
  • Number of managed GPU instances: 3-10 concurrent

Actual behaviour

After running the dstack server for several hours (typically 4-8 hours), it crashes with:

BlockingIOError: [Errno 11] Resource temporarily unavailable

Subprocesses accumulate over time, eventually exhausting available PIDs and causing server crash.

Stack Trace

timestamp: 2025-11-13 15:50:05,062
logger: apscheduler.executors.default
level: ERROR
message: Job "process_instances (trigger: interval[0:00:04], next run at: 2025-11-13 15:50:09 UTC)" raised an exception

Traceback (most recent call last):
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/apscheduler/executors/base.py", line 181, in run_coroutine_job
    retval = await job.func(*job.args, **job.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py", line 138, in process_instances
    await asyncio.gather(*tasks)
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/utils/sentry_utils.py", line 10, in wrapper
    return await f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py", line 193, in _process_next_instance
    await _process_instance(session=session, instance=instance)
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py", line 240, in _process_instance
    await _check_instance(session, instance)
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/background/tasks/process_instances.py", line 772, in _check_instance
    instance_check = await run_async(
                     ^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/utils/common.py", line 21, in run_async
    return await asyncio.get_running_loop().run_in_executor(None, func_with_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/server/services/runner/ssh.py", line 91, in wrapper
    with SSHTunnel(
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/core/services/ssh/tunnel.py", line 233, in __enter__
    self.open()
  File "/root/.local/share/uv/tools/dstack/lib/python3.11/site-packages/dstack/_internal/core/services/ssh/tunnel.py", line 172, in open
    r = subprocess.run(
        ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/lib/python3.11/subprocess.py", line 1885, in _execute_child
    self.pid = _fork_exec(
               ^^^^^^^^^^^
BlockingIOError: [Errno 11] Resource temporarily unavailable

Expected behaviour

SSH tunnel subprocesses should be properly cleaned up after each health check, preventing accumulation.

dstack version

0.19.36

Server logs

Additional information

Root Cause Analysis

SSH tunnel subprocesses appear to accumulate over time and are not properly cleaned up (a minimal sketch of the pattern follows this list):

  1. Health checks run every 4 seconds per instance (apscheduler interval)
  2. Each check spawns an SSH subprocess via SSHTunnel context manager
  3. Over hours of operation, these processes accumulate until hitting OS limits
  4. System can no longer fork new processes → BlockingIOError
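
A minimal, self-contained sketch of the suspected pattern (not dstack's actual code): a tunnel-like context manager whose __enter__ starts a long-lived child process and whose __exit__ never terminates it. Each "health check" then leaks one process, and once the OS limit is reached, fork() fails with EAGAIN, which Python surfaces as BlockingIOError: [Errno 11].

import subprocess

class LeakyTunnel:
    """Illustrative stand-in for an SSH tunnel wrapper; not dstack code."""

    def __enter__(self):
        # Stand-in for SSHTunnel.open(); the real code spawns ssh here.
        self.proc = subprocess.Popen(["sleep", "3600"])
        return self

    def __exit__(self, exc_type, exc, tb):
        # The bug being illustrated: the child is neither killed nor waited on.
        return False

children = []
for _ in range(100):  # one iteration per 4-second health check
    with LeakyTunnel() as tunnel:
        children.append(tunnel.proc)  # track the leaked child for the demo

print("live children after 100 checks:", sum(p.poll() is None for p in children))

for p in children:  # clean up the demo's own leak
    p.kill()
    p.wait()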

Evidence

  • The log shows an interval[0:00:04] trigger, i.e. a health check every 4 seconds
  • The error is raised from _fork_exec() inside subprocess.py's _execute_child()
  • Temporary fix: restarting dstack clears the accumulated processes
  • The crash occurs after hundreds to thousands of health checks have run

Workarounds Attempted

  1. ✅ Increased ulimits (nproc: 16384) - delays the crash but doesn't prevent it
  2. ✅ Set initProcessEnabled: true - helps reap zombies but doesn't stop the accumulation
  3. Set DSTACK_SERVER_INSTANCE_HEALTH_MIN_COLLECT_INTERVAL_SECONDS=30 - appears to be ignored (the log still shows a 4-second interval)
  4. ✅ Reduced DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR to 1 - helps but doesn't fix the root cause

Suggested Fix

Possible issues to investigate:

  1. SSHTunnel.__exit__ cleanup: ensure subprocess termination in all code paths, including when exceptions are raised (a defensive-cleanup sketch follows this list)
  2. ThreadPoolExecutor lifecycle: use an explicit executor with a max_workers limit instead of the default None executor (see the sketch after Related Code)
  3. subprocess.run cleanup: Ensure processes are waited on properly
  4. Context manager exceptions: SSH failures might prevent cleanup block execution
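
For item 1, a hedged sketch of what defensive cleanup could look like, written against a generic Tunnel class rather than dstack's actual SSHTunnel API: terminate and reap the child in a close() that is reached from __exit__ on every path, and also when open() itself fails.

import subprocess
from typing import Optional

class Tunnel:
    """Generic tunnel wrapper used only to illustrate the cleanup pattern."""

    def __init__(self, cmd: list[str]):
        self.cmd = cmd
        self.proc: Optional[subprocess.Popen] = None

    def open(self) -> None:
        # The real SSHTunnel.open() runs ssh; any long-lived command works here.
        self.proc = subprocess.Popen(self.cmd)

    def close(self) -> None:
        if self.proc is None:
            return
        try:
            self.proc.terminate()
            try:
                self.proc.wait(timeout=5)  # reap the child to avoid zombies
            except subprocess.TimeoutExpired:
                self.proc.kill()
                self.proc.wait()
        finally:
            self.proc = None

    def __enter__(self) -> "Tunnel":
        try:
            self.open()
        except Exception:
            self.close()  # don't leak a half-opened tunnel
            raise
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()  # runs even when the body raised

The key point is that close() is idempotent, always waits on the child, and cannot be skipped by an exception in either open() or the with-block body.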

Related Code

  • dstack/_internal/core/services/ssh/tunnel.py:172 in open() method
  • dstack/_internal/server/background/tasks/process_instances.py:772 in _check_instance()
  • dstack/_internal/utils/common.py:21 in run_async() using default executor
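
For item 2 under Suggested Fix, a hedged sketch of bounding the blocking SSH checks with an explicit executor instead of the loop's default (None) executor; run_async and MAX_SSH_CHECKS below are illustrative names, not dstack's actual API. This caps how many ssh subprocesses can be alive at once, though it does not by itself reclaim a leaked child.

import asyncio
import functools
from concurrent.futures import ThreadPoolExecutor

MAX_SSH_CHECKS = 8  # upper bound on concurrently running blocking checks

_ssh_executor = ThreadPoolExecutor(
    max_workers=MAX_SSH_CHECKS, thread_name_prefix="ssh-check"
)

async def run_async(func, *args, **kwargs):
    # Same shape as a run-in-executor helper, but with a bounded pool.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        _ssh_executor, functools.partial(func, *args, **kwargs)
    )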

Impact

  • Server becomes unusable after several hours
  • Requires manual restarts
  • Affects production deployments on resource-constrained platforms (Fargate, smaller VMs)
  • Particularly impacts users managing multiple GPU instances

Additional Context

This issue appears to be a resource leak that becomes critical on platforms with strict process limits (like AWS Fargate). The problem is reproducible and affects production deployments where dstack needs to run continuously for extended periods.

The workarounds (increased ulimits, reduced background processing) only delay the inevitable crash but don't address the root cause of subprocess accumulation.
