AgentOps

Multi-Agent Infrastructure Remediation Platform

Specialized AI agents coordinate via A2A (Agent-to-Agent) protocol to detect, diagnose, and fix infrastructure issues — with human-in-the-loop approval and automatic rollback safety guarantees.

"By 2028, 40% of large enterprises will deploy agentic AI in SRE workflows, up from less than 5% in 2024." — Gartner, Predicts 2025: IT Operations

AgentOps is what that future looks like: a coordinated team of purpose-built agents that handle the full incident lifecycle — from metric anomaly to verified fix — while keeping humans in control of every irreversible action.

Why AgentOps?

Modern infrastructure generates thousands of alerts per day. Human operators spend 80% of their incident response time on diagnosis, not remediation. AgentOps automates the cognitive labor while preserving human judgment where it matters most.

Without AgentOps	With AgentOps
Alert → page human → triage → diagnose → fix → verify	Alert → agents diagnose + plan fix → human approves → agents execute + verify
30-90 min MTTR	< 5 min MTTR (with auto-approve)
Knowledge trapped in runbooks	Knowledge encoded in agent capabilities
Single-threaded incident response	Parallel multi-agent coordination

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Orchestrator                           │
│                  (DAG-based pipeline)                        │
│                                                             │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌────────┐ │
│   │ Monitor  │──▶│ Diagnose │──▶│Remediate │──▶│ Verify │ │
│   │  Agent   │   │  Agent   │   │  Agent   │   │ Agent  │ │
│   └──────────┘   └──────────┘   └──────────┘   └────────┘ │
│        │              │              │    ▲           │      │
│        │              │              │    │           │      │
│        ▼              ▼              ▼    │           ▼      │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐    ┌─────────┐  │
│   │ Metrics │   │Evidence │   │Approval │    │Rollback │  │
│   │Threshold│   │Topology │   │  Gate   │    │ Manager │  │
│   │ Anomaly │   │  Corr.  │   │ (HITL)  │    │  (Auto) │  │
│   └─────────┘   └─────────┘   └─────────┘    └─────────┘  │
│                                                             │
│   ┌──────────────────────────────────────────────────────┐  │
│   │           A2A Protocol (Agent-to-Agent)              │  │
│   │  Agent Cards • Task Delegation • Priority Routing    │  │
│   └──────────────────────────────────────────────────────┘  │
│                                                             │
│   ┌──────────────────────────────────────────────────────┐  │
│   │              Safety Layer                            │  │
│   │  Kill Switch • Auto-Rollback • Blast Radius Limits  │  │
│   └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

The Four Agents

Agent	Role	Capabilities
MonitorAgent	Detection	Metric collection, threshold evaluation, anomaly detection (z-score), alert generation
DiagnoserAgent	Analysis	Root cause analysis, topology-aware correlation, evidence collection, hypothesis ranking
RemediatorAgent	Action	Fix generation from templates, config change proposals, rollback preparation, blast radius assessment
VerifierAgent	Validation	Post-fix metric comparison, SLA compliance checks, improvement scoring, rollback recommendation

A2A Protocol

Agents discover each other through Agent Cards — machine-readable capability declarations. The protocol handles:

Discovery: Find agents by capability (find_best_agent("root_cause_analysis"))
Task Delegation: Route work to the most capable available agent
Priority Routing: Critical incidents jump the queue (P1 > P5 > P10)
Conversation Tracking: Multi-message interactions maintain state
Broadcast: System-wide announcements (kill signals, heartbeats)

Safety-First Design

Every design decision prioritizes safety:

Kill Switch: Instantly halt all agents. One call pauses everything.
Auto-Rollback: If post-fix metrics don't improve within the observation window, changes are automatically reverted.
HITL Approval Gate: No remediation executes without human approval (configurable: always-require, auto-low-risk, auto-all for demos).
Blast Radius Limits: Plans exceeding the configured blast radius are automatically rejected.
Audit Trail: Every agent action, decision, and state change is logged.

Quick Start

Install

pip install -e .

Run the Demo

See all five scenarios resolve end-to-end:

agentops demo

Output:

╭──── Demo Mode ────╮
│ AgentOps Full Demo │
│ Scenarios: 5       │
╰────────────────────╯

============================================================
Scenario: link-down
Device: core-rtr-01 | Description: Network link failure on core router
============================================================

  Result: RESOLVED
  Root Cause: Physical cable disconnected or damaged
  Confidence: high
  Plan: REM-a3f1b2c4 (high risk)
  Verification: passed

  ...

┌───────────── Demo Results Summary ──────────────┐
│ Scenario      │ Device       │ Result           │
├───────────────┼──────────────┼──────────────────┤
│ link-down     │ core-rtr-01  │ resolved         │
│ cpu-spike     │ web-srv-01   │ resolved         │
│ bgp-flap      │ core-rtr-02  │ resolved         │
│ disk-full     │ db-srv-01    │ resolved         │
│ memory-leak   │ web-srv-02   │ resolved         │
└───────────────┴──────────────┴──────────────────┘

Total: 5/5 incidents resolved automatically

Simulate a Single Incident

# With human approval required (default)
agentops simulate-incident --device web-srv-01 --scenario cpu-spike

# With auto-approval for demo
agentops simulate-incident --device core-rtr-01 --scenario link-down --auto-approve

Start the API Server

agentops start --port 8080

Then interact via REST:

# Submit an incident
curl -X POST http://localhost:8080/api/v1/incidents \
  -H "Content-Type: application/json" \
  -d '{"device_id": "web-srv-01", "scenario": "cpu_spike", "description": "High CPU"}'

# Process it
curl -X POST http://localhost:8080/api/v1/incidents/INC-xxxx/process

# Check status
curl http://localhost:8080/api/v1/status

Start the Dashboard

agentops dashboard --port 8888

Open http://localhost:8888 for the web interface showing agent status, incidents, approval queue, and audit trail.

Docker

docker compose up -d

API: http://localhost:8080
Dashboard: http://localhost:8888

Pre-Built Scenarios

Scenario	Device	What Happens	Remediation
`link-down`	core-rtr-01	Physical link failure, error rate spikes	Bounce interface, activate failover path
`cpu-spike`	web-srv-01	Runaway process, response time degrades	Identify process, apply CPU limit, restart service
`bgp-flap`	core-rtr-02	BGP prefixes drop, routing instability	Soft reset BGP, apply route dampening
`disk-full`	db-srv-01	Disk utilization >90%, database at risk	Rotate logs, clean temp files, expand volume
`memory-leak`	web-srv-02	Gradual memory increase, OOM risk	Capture heap dump, set memory limit, graceful restart

Testing

# Run all tests
pytest -v

# With coverage
pytest --cov=agentops --cov-report=term-missing

# Run specific test category
pytest tests/test_orchestrator.py -v

The test suite includes 50+ tests covering:

Agent lifecycle and state machine transitions
Threshold evaluation and anomaly detection
Root cause analysis and hypothesis ranking
Remediation plan generation, approval, execution, rollback
Verification and SLA compliance
A2A protocol discovery and routing
Kill switch and safety mechanisms
REST API endpoints
End-to-end integration for all 5 scenarios

Project Structure

agentops/
├── src/agentops/
│   ├── agents/          # Agent implementations
│   │   ├── base.py      #   Base class, Agent Card, lifecycle
│   │   ├── monitor.py   #   Metric collection, thresholds, anomaly detection
│   │   ├── diagnoser.py #   Root cause analysis, evidence, hypotheses
│   │   ├── remediator.py#   Fix plans, approval gates, rollback
│   │   └── verifier.py  #   Post-fix validation, SLA checks
│   ├── protocol/        # A2A protocol
│   │   ├── a2a.py       #   Discovery, routing, conversations
│   │   └── messages.py  #   Message types, task delegation
│   ├── orchestrator/    # Pipeline orchestration
│   │   └── engine.py    #   DAG execution, incident lifecycle
│   ├── safety/          # Safety mechanisms
│   │   ├── kill_switch.py#  Immediate halt
│   │   ├── rollback.py  #   Auto-rollback on metric degradation
│   │   └── approval.py  #   HITL approval gates
│   ├── observe/         # Observability
│   │   └── tracer.py    #   OTel tracing, decision audit, metrics
│   ├── inventory/       # Infrastructure inventory
│   │   └── registry.py  #   Devices, topology, dependencies
│   ├── api/             # REST API
│   │   └── routes.py    #   Flask API endpoints
│   ├── dashboard/       # Web dashboard
│   │   └── app.py       #   Flask dashboard with embedded templates
│   └── cli.py           # Click CLI
├── tests/               # 50+ tests
├── scenarios/           # 5 pre-built incident scenarios
├── pyproject.toml       # Project configuration
├── Dockerfile           # Container image
├── docker-compose.yml   # Multi-service deployment
└── README.md            # This file

Philosophy

AgentOps is built on three convictions:

Agents should be specialists, not generalists. A monitoring agent that does one thing extremely well is more reliable than an all-purpose agent. Specialization enables composability.
Safety is not a feature — it's the architecture. Kill switches, rollback guarantees, blast radius limits, and HITL gates aren't bolted on. They're structural. Removing them would require rewriting the system.
Observability is non-negotiable. Every agent action, every decision, every state change is traced and logged. When something goes wrong (and it will), you need to reconstruct exactly what happened and why.

License

MIT License — see LICENSE

Author

Corey A. Wade — GitHub

AgentOps demonstrates the architecture and coordination patterns for multi-agent infrastructure remediation. All agent implementations use mock infrastructure for safe demonstration without requiring real devices.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AgentOps

Why AgentOps?

Architecture

The Four Agents

A2A Protocol

Safety-First Design

Quick Start

Install

Run the Demo

Simulate a Single Incident

Start the API Server

Start the Dashboard

Docker

Pre-Built Scenarios

Testing

Project Structure

Philosophy

License

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github		.github
scenarios		scenarios
src/agentops		src/agentops
tests		tests
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

AgentOps

Why AgentOps?

Architecture

The Four Agents

A2A Protocol

Safety-First Design

Quick Start

Install

Run the Demo

Simulate a Single Incident

Start the API Server

Start the Dashboard

Docker

Pre-Built Scenarios

Testing

Project Structure

Philosophy

License

Author

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages