[Optimization] bwrap sandbox process management causes PID exhaustion #636

@Clawiee

Tags: performance, optimization, bug, sandbox
Quality Rating: ⭐ 9/10


Reporter: xiaoan

Description

The local sandbox (bwrap) can cause PID exhaustion in container environments, leading to crashes and service degradation.

Alert Context

Container: fybclaw-backend
PID Usage: 80%
Current Processes: 241 / 300
Alert Threshold: 80%
Detected: 2026-05-31 10:15:01

Root Cause Analysis

Factor 1: Each code execution forks multiple processes

Every execute_code tool call creates a process chain:

uvicorn (main process)
└── bwrap (sandbox process, with --unshare-pid and new namespace)
    └── python3 / bash / node (actual code execution)
        └── User code may fork additional child processes

A single execute_code call therefore consumes roughly 3-5 PIDs, and more if user code forks additional children.
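To put that per-call cost in perspective, here is a rough headroom estimate using the alert figures above (241 of 300 PIDs in use) and the 3-5 PIDs per call quoted in this section. The function name is illustrative only and does not exist in the codebase.

```python
def max_additional_calls(pid_limit: int, pids_in_use: int, pids_per_call: int) -> int:
    """Return how many more execute_code calls fit before the PID limit is hit."""
    headroom = max(pid_limit - pids_in_use, 0)
    return headroom // pids_per_call


# Figures from the alert: 241 PIDs in use out of a limit of 300.
best_case = max_additional_calls(300, 241, 3)   # 59 // 3
worst_case = max_additional_calls(300, 241, 5)  # 59 // 5
print(best_case, worst_case)  # 19 11
```

At the alert level, somewhere between 11 and 19 concurrent calls would exhaust the remaining quota, before counting any orphaned processes.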

Factor 2: Orphaned child processes after timeout kill

Current timeout handling in source code:

except asyncio.TimeoutError:
    proc.kill()               # Only kills the bwrap process
    await proc.communicate()  # Waits for exit

Problem: proc.kill() sends SIGKILL only to the bwrap process itself. Although bwrap is launched with the --die-with-parent flag, in certain edge cases (e.g., when bwrap itself is force-killed) its child processes may be orphaned instead of being reaped, and they continue to count against the PID quota.

Recommended Optimizations

1. Process Tree Cleanup

Use process group killing instead of single process:

import os
import signal
# Send SIGTERM to the entire process group (bwrap must run in its own group, or this also signals uvicorn)
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
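A minimal sketch of how the timeout path could use a process group, assuming the sandbox is launched through asyncio.create_subprocess_exec. The helper name run_with_group_kill is hypothetical, and the sketch sends SIGKILL (mirroring the current proc.kill()) rather than SIGTERM. How this interacts with bwrap's --unshare-pid has not been verified here.

```python
import asyncio
import os
import signal
from typing import Optional, Sequence, Tuple


async def run_with_group_kill(
    cmd: Sequence[str], timeout: float
) -> Tuple[Optional[int], bytes]:
    """Run cmd in its own session; on timeout, SIGKILL its whole process group.

    Returns (returncode, output), with returncode None on timeout.
    """
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
        start_new_session=True,  # child leads a new session and process group
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout)
        return proc.returncode, out
    except asyncio.TimeoutError:
        try:
            # As group leader, the child's PID is also its process group ID.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group has already exited
        # Drain the pipes and reap the direct child so it is not left a zombie.
        await proc.communicate()
        return None, b""
```

Without start_new_session=True, the child shares the server's process group and killpg would signal uvicorn as well. A descendant that calls setsid() leaves the group and would still escape this cleanup.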

2. Enhanced Orphan Process Detection

Implement a periodic cleanup task that:

  • Detects orphaned processes belonging to terminated bwrap instances
  • Reaps zombie processes with wait() or waitpid()
  • Reports PID usage metrics for monitoring
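The detection step in the list above could start from a scan of /proc, sketched below under the assumption of a Linux container. The names ProcStat, parse_stat, and find_zombies are illustrative. The sketch only reports zombies: a zombie can be reaped only by its parent, and calling waitpid(-1, ...) inside an asyncio application may interfere with the event loop's own child handling, so the reaping strategy is left open.

```python
import os
from typing import List, NamedTuple


class ProcStat(NamedTuple):
    pid: int
    comm: str
    state: str
    ppid: int


def parse_stat(text: str) -> ProcStat:
    """Parse the leading fields of a /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left, _, rest = text.partition("(")
    comm, _, tail = rest.rpartition(")")
    fields = tail.split()
    return ProcStat(int(left), comm, fields[0], int(fields[1]))


def find_zombies(proc_root: str = "/proc") -> List[ProcStat]:
    """Return every process currently in state 'Z' (zombie)."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = parse_stat(fh.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process exited (or is unreadable) while scanning
        if stat.state == "Z":
            zombies.append(stat)
    return zombies
```

Filtering the result by ppid (for example, entries whose parent is the uvicorn process) would narrow it to zombies the backend itself is responsible for reaping.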

3. Process Pool / Reuse Strategy

Consider implementing a process pool for sandbox execution to reduce fork overhead and improve PID efficiency.
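A full worker pool is a larger design. As a simpler stand-in aimed at the same PID budget, the sketch below caps how many sandboxes run at once with an asyncio.Semaphore. This is a concurrency limit, not a pool, and SandboxLimiter is a hypothetical helper rather than existing code.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class SandboxLimiter:
    """Cap the number of sandbox jobs running at the same time."""

    def __init__(self, max_concurrent: int) -> None:
        # Create the limiter from within the running event loop.
        self._sem = asyncio.Semaphore(max_concurrent)
        self.running = 0  # jobs currently inside the limiter
        self.peak = 0     # highest value `running` has reached

    async def run(self, job: Callable[[], Awaitable[T]]) -> T:
        """Wait for a free slot, then run the job and release the slot."""
        async with self._sem:
            self.running += 1
            self.peak = max(self.peak, self.running)
            try:
                return await job()
            finally:
                self.running -= 1
```

With 3-5 PIDs per call, a cap of N bounds sandbox PID usage at roughly 5N, provided timed-out process trees are fully cleaned up.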

4. PID Quota Monitoring

Add proactive monitoring:

  • Warn when PID usage exceeds 60%
  • Alert when approaching 80% threshold
  • Auto-trigger cleanup when exceeding threshold
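The thresholds above can be checked against the container's own cgroup counters. The sketch below assumes cgroup v2, where pids.current and pids.max are exposed under /sys/fs/cgroup inside the container (cgroup v1 uses a different path). The function names are illustrative.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_pid_usage(cgroup_dir: str = "/sys/fs/cgroup") -> Optional[Tuple[int, int]]:
    """Return (current, limit) from the cgroup v2 pids files.

    Returns None when the files are missing or the limit is "max" (unlimited).
    """
    base = Path(cgroup_dir)
    try:
        current = int((base / "pids.current").read_text())
        raw_max = (base / "pids.max").read_text().strip()
        if raw_max == "max":
            return None
        return current, int(raw_max)
    except (OSError, ValueError):
        return None


def classify(current: int, limit: int, warn: float = 0.60, alert: float = 0.80) -> str:
    """Map PID usage onto the warn (60%) and alert (80%) thresholds."""
    ratio = current / limit
    if ratio >= alert:
        return "alert"
    if ratio >= warn:
        return "warn"
    return "ok"


usage = read_pid_usage()
if usage is not None:
    print(classify(*usage), usage)
```

For the figures in the alert, classify(241, 300) returns "alert", which is the point where an automatic cleanup could be triggered.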

Expected Behavior

  • PID usage should remain stable during normal operations
  • No orphaned processes after timeout or errors
  • Graceful degradation when approaching PID limits

Actual Behavior

  • PID usage continuously grows due to accumulating orphaned processes
  • Container eventually reaches PID limit and crashes

Additional Context

  • Affects: fybclaw-backend container
  • Related to: execute_code tool, bwrap sandbox implementation
  • Impact: Service availability and stability
