Run and coordinate multiple Claude Code instances in tmux panes with automatic health monitoring, crash recovery, and reliable message delivery.

andrewjsauer/panebus

PaneBus

A distributed worker coordination system that uses tmux as the execution surface and Redis as the control plane.

PaneBus lets you dispatch commands to workers running in tmux panes, with automatic health monitoring, crash recovery, and reliable message delivery.

Why PaneBus?

The Problem: You need to run multiple long-lived worker processes (AI agents, background jobs, processing pipelines) and coordinate them reliably. Traditional approaches require complex container orchestration or custom process management.

The Solution: PaneBus treats tmux panes as lightweight execution containers. Redis provides the coordination layer. You get:

  • Reliable message delivery via Redis BLMOVE (at-least-once semantics)
  • Automatic health monitoring with TTL-based heartbeats
  • Crash recovery with intelligent respawn
  • Crash loop protection to prevent runaway respawns
  • Dead letter queue for failed commands
  • Simple CLI for management and dispatch

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Redis                                 │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────────┐ │
│  │cmd:%1   │  │cmd:%2   │  │hb:%1    │  │panes:active     │ │
│  │(queue)  │  │(queue)  │  │(TTL key)│  │(set of pane ids)│ │
│  └─────────┘  └─────────┘  └─────────┘  └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
        │              │            ▲
        │              │            │ heartbeat
        ▼              ▼            │
┌──────────────┐  ┌──────────────┐  │
│  Worker %1   │  │  Worker %2   │──┘
│  (tmux pane) │  │  (tmux pane) │
└──────────────┘  └──────────────┘
        ▲              ▲
        │ respawn      │ health check
        │              │
┌──────────────────────────────────┐
│           Watchdog               │
│  - monitors heartbeats           │
│  - respawns crashed workers      │
│  - quarantines crash-loopers     │
└──────────────────────────────────┘

Components

| Component | Role |
| --- | --- |
| Worker | Runs in a tmux pane, pulls commands from its Redis queue, sends heartbeats |
| Watchdog | Monitors all workers, respawns crashed ones, manages crash loop protection |
| Admin | Dispatches commands, queries status, manages the dead letter queue |
| CLI | Command-line interface for all operations |

Installation

# Clone the repository
git clone https://github.com/andrewjsauer/panebus.git
cd panebus

# Install with pip (editable mode for development)
pip install -e .

# Or install dependencies directly
pip install -r requirements.txt

Requirements

  • Python 3.11+
  • Redis 6.2+ (for BLMOVE support)
  • tmux 3.0+ (workers must run inside tmux panes)

Quick Start

1. Start Redis

# Using Docker
docker run -d --name redis -p 6379:6379 redis:7-alpine

# Or install locally
brew install redis && brew services start redis

2. Create a tmux session with workers

Important: Workers must run inside tmux panes. The worker automatically detects its pane ID from tmux (e.g., %0, %1) and uses it for command routing. Outside tmux, the worker falls back to a random standalone-* ID, which is harder to target with dispatch commands.

# Create a new tmux session
tmux new-session -d -s panebus -n main

# Split into panes and start workers
tmux split-window -t panebus:main
tmux send-keys -t panebus:main.0 'panebus worker --role processor' Enter
tmux send-keys -t panebus:main.1 'panebus worker --role processor' Enter

# Attach to see the workers
tmux attach -t panebus

Or use the bootstrap script:

./scripts/bootstrap.sh

3. Start the watchdog (in another terminal)

panebus watchdog

4. Send tasks (natural language)

# Simple natural language task dispatch
panebus ask Fix the failing tests

# With specific working directory
panebus ask -d /path/to/project Add user authentication

# Fire and forget (don't wait for completion)
panebus ask --no-wait Refactor the payment module

# Verbose mode (show detailed logs)
panebus ask -v Investigate the codebase

The ask command automatically orchestrates the full workflow:

  1. Review - Analyze the codebase
  2. Plan - Create implementation plan
  3. Work - Execute the plan
  4. Compound - Document learnings (if valuable)

5. Low-level dispatch (advanced)

# Ping a specific worker
panebus dispatch %1 ping

# Send a prompt to a worker
panebus dispatch %1 prompt --payload '{"text": "Hello, agent!"}'

# Broadcast to all workers
panebus broadcast ping

# Dispatch to all workers with a specific role
panebus dispatch-role processor ping

6. Check status

# System overview
panebus status

# List all panes
panebus panes

# Check the dead letter queue
panebus dlq list

# View recent events
panebus events

Multi-Agent Workflow Setup

For orchestrated multi-agent workflows with visual debugging:

# Set up tmux session with 5 specialized panes
panebus workflow-setup

# This creates:
#   Pane 0: Orchestrator - coordinates workflow phases
#   Pane 1: Planner - creates implementation plans
#   Pane 2: Reviewer - analyzes codebase
#   Pane 3: Worker - executes the plan
#   Pane 4: Compounder - documents learnings

# Then from another terminal, send tasks:
panebus ask Fix the authentication bug

# Tear down when done
panebus workflow-teardown

Note: For simple tasks, you may not need the full multi-agent setup. Consider using Claude Code directly:

claude -p "Fix the authentication bug"

Configuration

PaneBus is configured via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| PANEBUS_REDIS_URL | redis://localhost:6379/0 | Redis connection URL |
| PANEBUS_HEARTBEAT_INTERVAL | 10 | Seconds between heartbeats |
| PANEBUS_HEARTBEAT_TTL | 60 | Seconds before a pane is considered dead |
| PANEBUS_COMMAND_TIMEOUT | 300 | Default command timeout in seconds |
| PANEBUS_MAX_QUEUE_DEPTH | 100 | Maximum commands per queue |
| PANEBUS_RESPAWN_LIMIT | 5 | Respawns before quarantine |
| PANEBUS_RESPAWN_WINDOW | 300 | Window for counting respawns (seconds) |
| PANEBUS_LOG_LEVEL | INFO | Logging level |

Example .env file

PANEBUS_REDIS_URL=redis://localhost:6379/0
PANEBUS_HEARTBEAT_INTERVAL=10
PANEBUS_HEARTBEAT_TTL=60
PANEBUS_LOG_LEVEL=DEBUG
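
To make the variables and defaults above concrete, here is a minimal sketch of how they could be read in Python. The `Settings` class and `load_settings` function are hypothetical names used for illustration, not PaneBus's actual config module, and the TTL check is an added assumption: a heartbeat key that expires before the next heartbeat arrives would make a healthy worker look dead.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Illustrative settings holder mirroring the documented defaults."""

    redis_url: str = "redis://localhost:6379/0"
    heartbeat_interval: int = 10
    heartbeat_ttl: int = 60
    command_timeout: int = 300
    max_queue_depth: int = 100
    respawn_limit: int = 5
    respawn_window: int = 300
    log_level: str = "INFO"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read PANEBUS_* variables, falling back to the defaults above."""
    defaults = Settings()
    overrides = {}
    for name, default in vars(defaults).items():
        raw = env.get(f"PANEBUS_{name.upper()}")
        if raw is None:
            continue
        # Integer fields are parsed; string fields are passed through.
        overrides[name] = int(raw) if isinstance(default, int) else raw
    settings = Settings(**overrides)
    # Assumption: the TTL should outlast the gap between two heartbeats.
    if settings.heartbeat_ttl <= settings.heartbeat_interval:
        raise ValueError(
            "PANEBUS_HEARTBEAT_TTL must exceed PANEBUS_HEARTBEAT_INTERVAL"
        )
    return settings
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"PANEBUS_LOG_LEVEL": "DEBUG"})`.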

CLI Reference

Task Commands

# Send a natural language task (recommended)
panebus ask TASK...

# Options:
#   --wait/--no-wait    Wait for workflow completion (default: wait)
#   --timeout N         Timeout in seconds (default: 600)
#   -d, --working-dir   Working directory (default: current)
#   -v, --verbose       Show detailed logs

# Examples:
panebus ask Fix the failing tests
panebus ask -d /path/to/project Add user authentication
panebus ask --no-wait Refactor the payment module

Workflow Commands

# Set up multi-agent tmux session
panebus workflow-setup [--session NAME] [--no-attach]

# Tear down session
panebus workflow-teardown [SESSION]

# Check workflow status
panebus workflow-status WORKFLOW_ID [--json]

# List workflows
panebus workflows [--count N] [--active] [--json]

Worker Commands

# Start a worker inside a tmux pane (auto-detects pane ID)
panebus worker --role processor

# Options:
#   --role       Worker role for routing (default: worker)

Note: Run workers inside tmux panes. The pane ID (e.g., %0) is auto-detected and used for command routing. Running outside tmux generates a random standalone-* ID.

Watchdog Commands

# Start the watchdog
panebus watchdog

# Options:
#   --check-interval    Seconds between health checks (default: 30)

Admin Commands

# System status
panebus status

# List panes (optionally filter by role)
panebus panes [--role ROLE]

# Dispatch a command to a specific pane
panebus dispatch PANE_ID COMMAND_TYPE [--payload JSON] [--timeout MS]

# Dispatch to all panes with a role
panebus dispatch-role ROLE COMMAND_TYPE [--payload JSON]

# Broadcast to all panes
panebus broadcast COMMAND_TYPE [--payload JSON]

# Dead letter queue management
panebus dlq list [--limit N]
panebus dlq purge

# View events
panebus events [--count N]

Command Types

| Type | Description | Payload |
| --- | --- | --- |
| ping | Health check | None |
| prompt | Send text to Claude Code | {"text": "...", "context": {"working_dir": "..."}} |
| shutdown | Graceful shutdown | None |
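
Hand-writing JSON for --payload is error-prone, so a small helper can build it. The sketch below is illustrative only: `build_payload` and `BUILT_IN_TYPES` are hypothetical names, and the validation rules are inferred from the table above rather than taken from PaneBus's schema.

```python
import json
from typing import Optional

# The three command types listed in the table above.
BUILT_IN_TYPES = {"ping", "prompt", "shutdown"}


def build_payload(
    command_type: str,
    text: Optional[str] = None,
    working_dir: Optional[str] = None,
) -> Optional[str]:
    """Return a JSON string for --payload, or None if the type takes none."""
    if command_type not in BUILT_IN_TYPES:
        raise ValueError(f"unknown command type: {command_type}")
    if command_type != "prompt":
        return None  # ping and shutdown carry no payload
    if not text:
        raise ValueError("prompt commands need non-empty text")
    payload = {"text": text}
    if working_dir:
        payload["context"] = {"working_dir": working_dir}
    return json.dumps(payload)
```

The returned string can be passed as the value of --payload, for example to panebus dispatch %1 prompt.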

Claude Code Integration

PaneBus workers invoke Claude Code CLI for prompt commands:

# Install Claude Code CLI
npm install -g @anthropic-ai/claude-code

# Install plugins (optional but recommended)
claude /install-plugin https://github.com/EveryInc/compound-engineering-plugin

When a worker receives a prompt command, it runs:

claude --dangerously-skip-permissions -p "your prompt text"

The --dangerously-skip-permissions flag enables autonomous execution without permission prompts.

Example dispatch with working directory:

panebus dispatch %1 prompt --payload '{"text": "Fix the failing tests", "context": {"working_dir": "/path/to/project"}}'

Custom Command Handling

Override the handle_command method in your worker subclass:

from panebus.worker import Worker
from panebus.schema import Command

class MyWorker(Worker):
    def handle_command(self, cmd: Command) -> None:
        if cmd.type == "prompt":
            text = cmd.payload.get("text", "")
            # Process the prompt...
            self.log.info("processed_prompt", text=text[:50])
        elif cmd.type == "custom_action":
            # Handle custom command type
            pass
        else:
            super().handle_command(cmd)

How It Works

Reliable Message Delivery

PaneBus uses Redis's BLMOVE command for reliable queue processing:

  1. Commands are pushed to a pane's queue (cmd:<pane_id>, for example cmd:%1)
  2. Worker atomically moves the next command to a processing list (processing:<pane_id>) with BLMOVE
  3. Worker executes the command
  4. Worker acknowledges by removing from processing list
  5. If worker crashes, watchdog can recover unacknowledged commands
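
The steps above can be sketched as follows. This is a minimal illustration of the pattern, not PaneBus's worker code: `process_one` is a hypothetical function, and `FakeRedis` is an in-memory stand-in so the example runs without a server. Its three methods mirror the redis-py calls `lpush`, `blmove`, and `lrem`, so a real `redis.Redis` client could be passed instead.

```python
import json
from collections import deque


class FakeRedis:
    """In-memory stand-in for the three redis-py calls used below."""

    def __init__(self):
        self.lists = {}

    def lpush(self, name, *values):
        items = self.lists.setdefault(name, deque())
        for value in values:
            items.appendleft(value)
        return len(items)

    def blmove(self, first_list, second_list, timeout, src="LEFT", dest="RIGHT"):
        source = self.lists.get(first_list)
        if not source:
            return None  # real Redis would block for up to `timeout` seconds
        value = source.popleft() if src == "LEFT" else source.pop()
        target = self.lists.setdefault(second_list, deque())
        if dest == "LEFT":
            target.appendleft(value)
        else:
            target.append(value)
        return value

    def lrem(self, name, count, value):
        # Simplified: removes at most one match, enough for count=1.
        try:
            self.lists.get(name, deque()).remove(value)
        except ValueError:
            return 0
        return 1


def process_one(client, pane_id, handler, timeout=5):
    """Move one command to the processing list, run it, then acknowledge."""
    queue = f"cmd:{pane_id}"
    processing = f"processing:{pane_id}"
    # Producers LPUSH, so taking from the RIGHT yields FIFO order.
    raw = client.blmove(queue, processing, timeout, src="RIGHT", dest="LEFT")
    if raw is None:
        return False  # queue was empty for the whole timeout
    handler(json.loads(raw))  # if this raises, the command stays in processing
    client.lrem(processing, 1, raw)  # acknowledge only after success
    return True
```

Because the acknowledgement happens after the handler returns, a crash mid-command leaves the entry in the processing list, which is what gives at-least-once rather than at-most-once delivery. Handlers should therefore tolerate seeing the same command twice.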

Heartbeat Liveness

Workers send heartbeats every 10 seconds (configurable):

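  1. Worker sets a Redis key (hb:<pane_id>, for example hb:%1) with a 60-second TTL and refreshes it on every heartbeat
  2. Watchdog checks whether each pane's heartbeat key still exists
  3. A missing key means the worker has not refreshed it for the full TTL, so the worker is treated as dead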
  4. Watchdog respawns using tmux respawn-pane
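
A rough sketch of the heartbeat cycle described above, under stated assumptions: `send_heartbeat`, `find_dead_panes`, and `respawn_argv` are hypothetical names, the key value "alive" is arbitrary, and the `-k` flag (kill whatever is still running in the pane) is a guess at how the respawn is issued. `FakeRedis` is an in-memory stand-in whose `set(name, value, ex=...)` and `exists(*names)` mirror the redis-py methods of the same names.

```python
import time


class FakeRedis:
    """In-memory stand-in for SET with expiry and EXISTS."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.expiry = {}  # key -> absolute expiry time

    def set(self, name, value, ex=None):
        self.expiry[name] = (
            self.clock() + ex if ex is not None else float("inf")
        )
        return True

    def exists(self, *names):
        now = self.clock()
        return sum(
            1 for n in names if self.expiry.get(n, float("-inf")) > now
        )


def send_heartbeat(client, pane_id, ttl=60):
    """Worker side: refresh the TTL key so the pane counts as alive."""
    client.set(f"hb:{pane_id}", "alive", ex=ttl)


def find_dead_panes(client, pane_ids):
    """Watchdog side: panes whose heartbeat key has expired."""
    return [p for p in pane_ids if not client.exists(f"hb:{p}")]


def respawn_argv(pane_id, command):
    """Build (but do not run) a tmux respawn-pane invocation."""
    return ["tmux", "respawn-pane", "-k", "-t", pane_id, command]
```

With the defaults, a worker has to miss roughly six consecutive heartbeats (60-second TTL, 10-second interval) before the watchdog sees its key disappear.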

Crash Loop Protection

To prevent runaway respawns:

  1. Each respawn is recorded with a timestamp
  2. If 5 respawns occur within 5 minutes (PANEBUS_RESPAWN_LIMIT and PANEBUS_RESPAWN_WINDOW), the pane is quarantined
  3. Quarantined panes are not respawned again and require manual intervention
  4. Use panebus status to see quarantined panes
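
One way to implement the rule above is a sliding window of respawn timestamps per pane. The sketch below is an assumption about the mechanism, not the watchdog's actual code; `RespawnTracker` is a hypothetical name, and it keeps state in memory where the real system presumably stores it in Redis.

```python
from collections import deque


class RespawnTracker:
    """Sliding-window crash loop detector (illustrative only)."""

    def __init__(self, limit=5, window=300.0):
        self.limit = limit        # PANEBUS_RESPAWN_LIMIT
        self.window = window      # PANEBUS_RESPAWN_WINDOW, in seconds
        self.history = {}         # pane_id -> deque of respawn timestamps
        self.quarantined = set()

    def record_respawn(self, pane_id, now):
        """Record a respawn; return True if the pane is now quarantined."""
        stamps = self.history.setdefault(pane_id, deque())
        stamps.append(now)
        # Forget respawns that have fallen out of the window.
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()
        if len(stamps) >= self.limit:
            self.quarantined.add(pane_id)
        return pane_id in self.quarantined
```

A pane that crashes once a minute hits the limit on its fifth respawn, while one that crashes every 100 seconds never has more than four respawns inside any 300-second window and keeps being restarted.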

Dead Letter Queue

Failed commands go to the DLQ:

  1. Command execution fails or times out
  2. Error details recorded with original command
  3. Commands can be inspected and retried
  4. Purge old entries when no longer needed
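
To illustrate what "error details recorded with original command" might look like, here is a small sketch of a DLQ entry. The field names (`command`, `pane_id`, `error_type`, `error_message`, `failed_at`) and both function names are hypothetical; the actual entry format is defined by PaneBus's schema and may differ.

```python
import json
import time
from typing import Optional


def make_dlq_entry(
    command: dict,
    error: Exception,
    pane_id: str,
    failed_at: Optional[float] = None,
) -> str:
    """Wrap a failed command with enough context to diagnose it later."""
    entry = {
        "command": command,  # the original command, kept intact for retry
        "pane_id": pane_id,
        "error_type": type(error).__name__,
        "error_message": str(error),
        "failed_at": failed_at if failed_at is not None else time.time(),
    }
    return json.dumps(entry)


def command_for_retry(raw_entry: str) -> dict:
    """Extract the original command from a DLQ entry so it can be re-sent."""
    return json.loads(raw_entry)["command"]
```

Keeping the original command untouched inside the entry means a retry is just a matter of dispatching it again once the root cause is fixed.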

Development

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=panebus --cov-report=term-missing

# Run specific test file
pytest tests/test_worker.py -v

Project Structure

pane-bus/
├── src/panebus/
│   ├── __init__.py
│   ├── admin.py        # Service discovery and dispatch
│   ├── cli.py          # Click CLI commands
│   ├── config.py       # Configuration management
│   ├── logging.py      # Structured logging setup
│   ├── redis_client.py # Redis operations wrapper
│   ├── schema.py       # Pydantic models
│   ├── watchdog.py     # Health monitoring
│   └── worker.py       # Worker implementation
├── tests/
│   ├── conftest.py     # Pytest fixtures
│   ├── test_admin.py
│   ├── test_redis_client.py
│   ├── test_schema.py
│   └── test_worker.py
├── scripts/
│   └── bootstrap.sh    # tmux setup script
├── pyproject.toml
└── README.md

Code Style

  • Type hints on all public functions
  • Pydantic V2 for data validation
  • structlog for structured logging
  • Ruff for linting and formatting

Production Considerations

High Availability

  • Run multiple watchdog instances with leader election (not yet implemented)
  • Use Redis Sentinel or Redis Cluster for HA
  • Consider running workers across multiple tmux sessions on different hosts

Monitoring

PaneBus emits events to a Redis stream (events):

# View events
panebus events --count 100

# Or directly from Redis
redis-cli XRANGE events - + COUNT 100

Key events to monitor:

  • worker_started / worker_stopped
  • command_received / command_completed / command_failed
  • pane_respawned / pane_quarantined
  • heartbeat_missed

Scaling

  • Each pane has its own queue (horizontal scaling by adding panes)
  • Use roles to route commands to specific worker types
  • Queue depth limits prevent memory exhaustion
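
The interaction between roles and queue depth limits can be sketched with plain dictionaries. This is an illustration under assumptions: `panes_for_role` and `dispatch_to_role` are hypothetical names, the registry and queues live in Redis in the real system, and rejecting a command when a queue is full is one plausible policy, not necessarily what PaneBus does.

```python
def panes_for_role(registry: dict, role: str) -> list:
    """Return pane ids registered with the given role, in sorted order."""
    return sorted(p for p, r in registry.items() if r == role)


def dispatch_to_role(registry, queues, role, command, max_depth=100):
    """Queue `command` for every pane with `role`, skipping full queues.

    Returns (delivered, rejected) lists of pane ids.
    """
    delivered, rejected = [], []
    for pane_id in panes_for_role(registry, role):
        queue = queues.setdefault(pane_id, [])
        if len(queue) >= max_depth:
            rejected.append(pane_id)  # PANEBUS_MAX_QUEUE_DEPTH reached
        else:
            queue.append(command)
            delivered.append(pane_id)
    return delivered, rejected
```

Because every pane owns its queue, adding a pane with an existing role increases capacity for that role without touching the other queues.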

Troubleshooting

Worker not receiving commands

  1. Check Redis connectivity: redis-cli ping
  2. Verify worker is registered: panebus panes
  3. Check queue depth: redis-cli LLEN 'cmd:<pane_id>' (for example, redis-cli LLEN 'cmd:%1')
  4. Look at worker logs for errors

Watchdog not respawning

  1. Ensure watchdog is running: panebus watchdog
  2. Check if pane is quarantined: panebus status
  3. Verify tmux target is correct
  4. Check watchdog logs for respawn errors

Commands going to DLQ

  1. List DLQ entries: panebus dlq list
  2. Check error messages for root cause
  3. Fix the issue and retry manually
  4. Purge processed entries: panebus dlq purge

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Acknowledgments

  • Built with Redis for coordination
  • Uses tmux for process management
  • Inspired by distributed systems patterns from Kafka and Celery
