A Kubernetes multi-agent system that:

- Automates investigation: transforms high-level failure signals into targeted diagnostic reports
- Coordinates multi-cluster debugging: seamlessly queries both workload and management clusters
- Optimizes cost and speed: uses a powerful reasoning model only for coordination and simpler models for data collection
- Provides structured output: returns concise, actionable diagnostic reports instead of raw data dumps
Multi-agent system for Kubernetes E2E debugging:
The system is workload-cluster-first: runtime evidence is gathered from the workload cluster; the management cluster is used only for App/HelmRelease deployment status and Cluster API (CAPI) object status.
- Coordinator Agent: Orchestrates the investigation, synthesizes findings from the collectors, and generates diagnostic reports. Uses a powerful reasoning model (configurable via `ANTHROPIC_COORDINATOR_MODEL`).
- WC Collector Agent: Collects diagnostic data from the workload cluster via the Kubernetes MCP server (`workload_cluster_*` tools).
- MC Collector Agent: Collects diagnostic data from the management cluster via the Kubernetes MCP server (`management_cluster_*` tools).
```mermaid
graph TD
    User[User Query] --> API[FastAPI Endpoint]
    API --> Coordinator[Coordinator Agent<br/>High-level reasoning]
    Coordinator -->|delegates| WC[WC Collector Agent<br/>Workload Cluster]
    Coordinator -->|delegates| MC[MC Collector Agent<br/>Management Cluster]
    WC -->|MCP tools| WC_K8s[Workload Cluster<br/>Kubernetes API]
    MC -->|MCP tools| MC_K8s[Management Cluster<br/>Kubernetes API]
    WC_K8s -->|data| WC
    MC_K8s -->|data| MC
    WC -->|findings| Coordinator
    MC -->|findings| Coordinator
    Coordinator -->|synthesizes| Report[Diagnostic Report]
    Report --> API
    API --> User
```
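The delegation flow above can be sketched schematically. This is an illustrative sketch only, not the project's actual implementation; all function names are hypothetical:

```python
# Illustrative sketch of the delegation flow (hypothetical names, not the
# project's actual code): the coordinator fans the query out to per-cluster
# collectors, then synthesizes their findings into one report.

def wc_collector(query: str) -> str:
    # Stands in for MCP tool calls against the workload cluster.
    return f"WC findings for: {query}"

def mc_collector(query: str) -> str:
    # Stands in for MCP tool calls against the management cluster.
    return f"MC findings for: {query}"

def coordinator(query: str) -> str:
    # High-level reasoning: delegate to both collectors, then synthesize.
    findings = [wc_collector(query), mc_collector(query)]
    return "Diagnostic Report\n" + "\n".join(f"- {f}" for f in findings)
```

In the real system the collectors run as subagents with their own (cheaper) model, which is where the cost optimization comes from.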
- Anthropic API key
- Access to Kubernetes clusters via Teleport (`tsh` CLI)
- Docker (for containerized setup) or Python 3.11+ with uv (for native setup)
Create the local configuration directory and templates:

```shell
make -f Makefile.local.mk local-setup
```

This creates `local_config/` with a `.env` template.

Edit `local_config/.env` and add your Anthropic API key:

```shell
ANTHROPIC_API_KEY=sk-ant-your-key-here
```

Optional configuration:
```shell
# Model selection (defaults shown)
ANTHROPIC_COORDINATOR_MODEL=claude-sonnet-4-5-20250929
ANTHROPIC_COLLECTOR_MODEL=claude-3-5-haiku-20241022

# Cluster context for prompts
WC_CLUSTER=my-workload-cluster
ORG_NS=org-myorg
```

Use Teleport to create kubeconfigs:
```shell
# For separate management and workload clusters:
make -f Makefile.local.mk local-kubeconfig MC=<management-cluster> WC=<workload-cluster>

# Or use the same cluster for both:
make -f Makefile.local.mk local-kubeconfig MC=<cluster-name>
```

This creates:

- `local_config/mc-kubeconfig.yaml` (management cluster)
- `local_config/wc-kubeconfig.yaml` (workload cluster)
Build and run with Docker:

```shell
# Build the image
make -f Makefile.local.mk docker-build

# Run the container
make -f Makefile.local.mk docker-run
```

The API will be available at http://localhost:8000.
Download the MCP Kubernetes binary:

```shell
make -f Makefile.local.mk local-mcp
```

Run with uvicorn (automatically creates a virtualenv and installs dependencies):

```shell
make -f Makefile.local.mk local-run
```

The API will be available at http://localhost:8000 with hot-reload enabled.
```shell
# Basic health check
curl http://localhost:8000/health

# Deep check (validates configuration, kubeconfig, API key, MCP binary)
curl "http://localhost:8000/ready?deep=true"
```

Using the convenience command:
```shell
# Default query (lists namespaces)
make -f Makefile.local.mk local-query

# Custom query
make -f Makefile.local.mk local-query Q="Check pod status in kube-system"
make -f Makefile.local.mk local-query Q="Investigate non-ready deployments"
```

Or using curl directly:

```shell
# Blocking query
curl http://localhost:8000/ -d '{"query": "Investigate non-ready deployments"}'

# Streaming query
curl -N http://localhost:8000/stream -d '{"query": "List all pods in default namespace"}'
```

Endpoints:

- `GET /health` - Basic health check
- `GET /ready` - Readiness check (optional `?deep=true` for configuration validation)
- `GET /schema` - Returns the DiagnosticReport JSON schema
- `POST /` - Blocking query endpoint (returns the complete response)
- `POST /stream` - Streaming query endpoint (returns chunks as they are generated)
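For scripted use, the blocking endpoint can also be called from Python. This is a minimal sketch using only the standard library, assuming the server is running on localhost:8000; the helper names are illustrative:

```python
import json
import urllib.request

def build_request_body(query: str, timeout_seconds: int = 300,
                       max_turns: int = 15) -> bytes:
    # Mirrors the documented request schema; both optional fields
    # default to the documented values.
    return json.dumps({
        "query": query,
        "timeout_seconds": timeout_seconds,
        "max_turns": max_turns,
    }).encode()

def diagnose(query: str, base_url: str = "http://localhost:8000") -> dict:
    # POST to the blocking endpoint and return the parsed JSON response.
    req = urllib.request.Request(
        base_url + "/",
        data=build_request_body(query),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The returned dict has the shape of the response documented below (`result`, `request_id`, `metrics`).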
Request body:

```json
{
  "query": "Your diagnostic query here",
  "timeout_seconds": 300,
  "max_turns": 15
}
```

Both `timeout_seconds` (default 300) and `max_turns` (default 15) are optional.

Response:

```json
{
  "result": "Diagnostic report text...",
  "request_id": "uuid-here",
  "metrics": {
    "duration_ms": 12345,
    "num_turns": 8,
    "total_cost_usd": 0.0245,
    "usage": {
      "input_tokens": 1234,
      "output_tokens": 567,
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 890
    },
    "breakdown": {
      "wc_collector": {
        "usage": {"input_tokens": 500, "output_tokens": 200},
        "total_cost_usd": 0.01,
        "duration_ms": 3000
      },
      "mc_collector": {
        "usage": {"input_tokens": 300, "output_tokens": 150},
        "total_cost_usd": 0.008,
        "duration_ms": 2000
      }
    }
  }
}
```

The metrics object includes:
- `duration_ms`: Total investigation time in milliseconds
- `num_turns`: Number of agent conversation turns
- `total_cost_usd`: Total cost of the API calls in USD (includes coordinator + all subagents)
- `usage`: Overall token usage breakdown
  - `input_tokens`: Tokens sent to the model
  - `output_tokens`: Tokens generated by the model
  - `cache_creation_input_tokens`: Tokens used to create the prompt cache
  - `cache_read_input_tokens`: Tokens read from the prompt cache (cost savings)
- `breakdown`: Per-agent cost and token breakdown (the coordinator uses the Task tool; the collectors gather data)
  - `wc_collector`: Workload cluster collector metrics
  - `mc_collector`: Management cluster collector metrics
  - Each agent reports its own usage, cost, and duration
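Because the coordinator's own cost does not appear as a separate `breakdown` entry, it can be derived by subtracting the subagent costs from `total_cost_usd`. A small illustrative helper (hypothetical, not part of the API):

```python
def per_agent_cost(metrics: dict) -> dict:
    # Split total_cost_usd into subagent costs plus the coordinator's remainder.
    breakdown = metrics.get("breakdown", {})
    costs = {name: agent["total_cost_usd"] for name, agent in breakdown.items()}
    costs["coordinator"] = round(
        metrics["total_cost_usd"] - sum(costs.values()), 6
    )
    return costs
```

With the sample response above (wc_collector 0.01, mc_collector 0.008, total 0.0245), the coordinator's share comes out to 0.0065 USD.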
```shell
# Install pre-commit hooks for code quality
pre-commit install

# Run code quality checks
pre-commit run --all-files

# Individual checks
black src/              # Format code
flake8 src/             # Lint
mypy src/               # Type checking
bandit -c .bandit src/  # Security scan
```

"Claude Code not found" error:
- For native Python setup, download the MCP binary: `make -f Makefile.local.mk local-mcp`
- Ensure `MCP_KUBERNETES_PATH` points to the correct binary location
"model: claude-sonnet-4-5-20250514" not found:
- Update your model configuration to use valid model IDs (see `.env.example`)
- Latest valid models: `claude-sonnet-4-5-20250929`, `claude-3-5-haiku-20241022`
Authentication errors:
- Verify your `ANTHROPIC_API_KEY` is valid
- Refresh cluster kubeconfigs: `make -f Makefile.local.mk local-kubeconfig MC=<cluster>`
For more detailed development guidance, see CLAUDE.md.