benmoggee/sre-agent

SRE Agent MCP Server

The Problem

Production incident response is one of the most high-pressure, time-sensitive activities in software engineering. When a service goes down at 3 AM, an on-call Site Reliability Engineer (SRE) is paged and must answer a cascade of questions under extreme time pressure:

  • What is actually broken? Is it a single endpoint, an entire service, or a cascading failure across multiple systems?
  • What changed? Was there a recent deployment, a configuration change, or an infrastructure event that correlates with the start of the incident?
  • Where is the evidence? Logs, metrics, and traces are scattered across different observability platforms. The engineer must query each one individually, mentally correlate timestamps, and piece together a timeline.
  • What does the runbook say? Most teams maintain runbooks for known failure modes, but finding the right runbook and following its steps while simultaneously investigating is cognitively demanding.
  • What is the fix? Once the root cause is identified, the engineer must decide on remediation: rollback a deployment, scale infrastructure, toggle a feature flag, or apply a hotfix. Each option carries risk.

This process is slow, error-prone, and mentally exhausting. Industry surveys commonly put Mean Time to Resolution (MTTR) for production incidents at 1-4 hours, with a significant portion of that time spent on investigation rather than the actual fix. During an outage, every minute costs money: lost revenue, SLA violations, customer churn, and drained engineering productivity.

The core challenge is that incident triage is fundamentally a multi-step reasoning task that requires gathering evidence from multiple sources, forming hypotheses, testing them against data, and arriving at a root cause. Today, this reasoning happens entirely in the engineer's head, with no structured framework to guide the investigation or prevent cognitive shortcuts that lead to wrong conclusions.

Why Existing Tools Fall Short

Current incident response tooling addresses individual pieces of the puzzle but not the investigation workflow itself:

  • Observability platforms (Datadog, Grafana, New Relic) are excellent at storing and visualizing data, but they don't reason about what the data means or guide investigation.
  • Alerting systems (PagerDuty, OpsGenie) notify the right person, but provide no investigation support beyond the initial alert context.
  • Runbook tools (Confluence, Notion) store institutional knowledge, but there's no automated way to match symptoms to the right runbook or verify that its steps are being followed.
  • ChatOps bots (custom Slack bots) can query individual data sources, but lack the ability to plan a structured investigation or synthesize findings across sources.

What's missing is an intelligent orchestration layer that can take a natural-language description of symptoms, plan a structured investigation, execute it by querying the right tools in the right order, and synthesize the results into an actionable incident report with remediation proposals.

What SRE Agent Does

SRE Agent is a domain-specific MCP server that acts as an AI-powered first responder for production incidents. It bridges the gap between raw observability data and actionable incident response by automating the investigation workflow that SREs currently perform manually.

When an on-call engineer reports symptoms (e.g. "Prod checkout is spiking 500s since 14:05 UTC"), the server:

  1. Plans the investigation — An LLM-powered orchestrator analyzes the symptoms, generates ranked hypotheses (e.g. "bad deployment", "dependency failure", "resource exhaustion"), and produces a structured triage plan with phased tasks and dependency ordering.

  2. Executes the investigation in parallel — Specialized workers query logs, metrics, deployment records, and runbooks simultaneously. Same-phase tasks run in parallel for speed; cross-phase dependencies are enforced automatically.

  3. Synthesizes findings into an incident report — Worker outputs are aggregated and analyzed to produce a root-cause assessment, evidence summary, severity classification, and recommended next actions.

  4. Proposes remediation with human approval — The system generates patch artifacts (e.g. rollback diffs, runbook updates) but never applies them automatically. A human approval gate ensures the engineer reviews and explicitly approves any changes before they take effect.

The result: an engineer who would normally spend 30-60 minutes querying dashboards, reading logs, and cross-referencing deployment history can get a structured incident report with root-cause analysis and remediation options in under a minute.
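The plan-then-execute flow above can be sketched with simplified data shapes. These dataclasses are illustrative only — the real models live in mcp_server/src/entities/ and are Pydantic-based, and the field names here are assumptions drawn from this README:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    summary: str          # e.g. "bad deployment"
    rank: int             # 1 = most likely

@dataclass
class TriageTask:
    task_id: str
    tool: str             # e.g. "logs_query"
    phase: int            # same-phase tasks run in parallel
    depends_on: list[str] = field(default_factory=list)

@dataclass
class IncidentTriagePlan:
    hypotheses: list[Hypothesis]
    tasks: list[TriageTask]

# A plan the orchestrator might emit for the checkout-api example:
plan = IncidentTriagePlan(
    hypotheses=[Hypothesis("bad deployment", 1), Hypothesis("dependency failure", 2)],
    tasks=[
        TriageTask("t1", "deploys_list", phase=1),
        TriageTask("t2", "logs_query", phase=1),
        TriageTask("t3", "patch_generate", phase=2, depends_on=["t1", "t2"]),
    ],
)
print(len(plan.tasks))  # 3
```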

Use Cases

  • On-call triage acceleration — Reduce MTTR by automating evidence gathering and correlation during production incidents.
  • Incident postmortem preparation — Generate structured timelines and evidence summaries that feed directly into postmortem documents.
  • Runbook validation — Automatically match incident symptoms to relevant runbooks and verify that remediation steps are applicable.
  • Junior SRE training — The structured triage plan serves as an educational tool, showing less experienced engineers how to systematically investigate an outage.
  • AI-assisted pair debugging — Use the MCP client interactively to explore an incident step by step, with the agent providing context and suggestions at each stage.

Demo

demo_sre_agent.mp4


Architecture & System Design

The system follows Clean Architecture with an explicit composition root, ensuring that business logic is decoupled from infrastructure concerns and the MCP protocol boundary.

sre_design.svg

Layering

| Layer | Path | Responsibility |
|---|---|---|
| Entities | mcp_server/src/entities/ | Pure domain models (Pydantic). No external dependencies. |
| Use Cases | mcp_server/src/app/ | Orchestration, phase execution, synthesis, worker dispatch. Depends only on port interfaces (ports.py). |
| Infrastructure | mcp_server/src/infrastructure/ | Adapter implementations: Gemini LLM gateway, SRE tool gateway, approval store. |
| MCP Routers | mcp_server/src/routers/ | Protocol boundary. Registers tools, resources, and prompts with FastMCP. |
| Composition Root | mcp_server/src/workflows/main_workflow.py | Wires infrastructure adapters into use cases and drives the end-to-end flow. |

Ports and Adapters

Dependency injection interfaces are defined in mcp_server/src/app/ports.py:

  • LanguageModelPort — generate_json() and generate_text() for LLM calls
  • SreToolGatewayPort — logs, metrics, deploys, runbooks, and patch operations
  • ApprovalStorePort — create, approve, and apply patch proposals

Concrete adapters in mcp_server/src/infrastructure/gateways.py:

  • GeminiLanguageModelGateway — Google Gemini with structured output
  • SreGuardianToolGateway — backed by local mcp_tools/sre_guardian
  • PatchApprovalStoreGateway — in-memory approval store
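The benefit of the ports-and-adapters split is that use-case code depends only on an interface, so any conforming adapter can be swapped in. A dependency-free sketch of the pattern (method names here are illustrative, not copied from ports.py):

```python
from typing import Protocol

class LanguageModelPort(Protocol):
    def generate_text(self, prompt: str) -> str: ...

class FakeLanguageModelGateway:
    """Stand-in adapter, e.g. for tests; the real one wraps Gemini."""
    def generate_text(self, prompt: str) -> str:
        return f"echo: {prompt}"

def summarize(llm: LanguageModelPort, findings: str) -> str:
    # Use-case code depends only on the port, never on a concrete gateway.
    return llm.generate_text(f"Summarize: {findings}")

print(summarize(FakeLanguageModelGateway(), "500s spiking"))
```

The same substitution is what makes the deterministic offline fallback possible: a non-LLM adapter can satisfy the port when Gemini is unavailable.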

Workflow Phases

  1. Start — Accepts user query and initializes progress tracking.
  2. Orchestrator — Builds an IncidentTriagePlan with hypotheses, tasks, dependencies, and phase numbers. Uses Gemini structured output or deterministic fallback.
  3. Worker Execution — Groups tasks by phase. Runs same-phase tasks in parallel via asyncio.gather(). Enforces depends_on gates before downstream tasks.
  4. Synthesis — Aggregates worker outputs into a structured incident report with root-cause assessment, evidence, and next actions. Creates a patch proposal.
  5. Approval Gate — Patch proposals remain pending until explicit human approval. If approved, the patch is applied; otherwise it stays unapplied.
  6. Completion — Returns final structured payload including plan, per-task results, synthesis output, and approval state.
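The phased execution in step 3 can be approximated in a few lines. The task shape and field names below are illustrative, not the project's actual API:

```python
import asyncio

async def run_task(task_id: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a worker call
    return f"{task_id}: ok"

async def run_phases(tasks: list[dict]) -> dict[str, str]:
    results: dict[str, str] = {}
    for phase in sorted({t["phase"] for t in tasks}):
        batch = [t for t in tasks if t["phase"] == phase]
        # Dependency gate: skip tasks whose upstream results are missing.
        runnable = [t for t in batch
                    if all(dep in results for dep in t.get("depends_on", []))]
        # Same-phase tasks run concurrently.
        outputs = await asyncio.gather(*(run_task(t["id"]) for t in runnable))
        results.update({t["id"]: out for t, out in zip(runnable, outputs)})
    return results

tasks = [
    {"id": "logs", "phase": 1},
    {"id": "metrics", "phase": 1},
    {"id": "patch", "phase": 2, "depends_on": ["logs", "metrics"]},
]
print(asyncio.run(run_phases(tasks)))
```

Here "logs" and "metrics" run in parallel in phase 1, and "patch" only runs once both of its dependencies have produced results.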

Project Structure

.
├── mcp_server/
│   └── src/
│       ├── server.py                  # FastMCP entry point
│       ├── routers/                   # MCP tool/resource/prompt registration
│       │   ├── tools.py
│       │   ├── resources.py
│       │   └── prompts.py
│       ├── app/                       # Use-case logic
│       │   ├── ports.py               # Dependency injection interfaces
│       │   ├── orchestrator.py        # Query -> IncidentTriagePlan
│       │   ├── phase_runner.py        # Phased parallel task execution
│       │   ├── task_runner.py         # Single task dispatch
│       │   ├── synthesizer.py         # Findings -> report + patch
│       │   └── worker_registry.py     # Tool name -> worker mapping
│       ├── infrastructure/
│       │   └── gateways.py            # Gemini, SRE tool, approval adapters
│       ├── entities/                  # Pydantic domain models
│       │   ├── incident_triage_plan.py
│       │   ├── tasks.py
│       │   ├── workers.py
│       │   ├── hypothesis.py
│       │   ├── severity.py
│       │   └── ...
│       ├── nodes/                     # Concrete worker implementations
│       │   ├── base.py                # BaseWorker abstract class
│       │   ├── logs_query_worker.py
│       │   ├── metrics_query_worker.py
│       │   ├── deploy_list_worker.py
│       │   ├── incident_context_worker.py
│       │   ├── runbooks_search_worker.py
│       │   ├── runbook_get_worker.py
│       │   └── patch_generate_worker.py
│       ├── tools/                     # Tool implementation layer
│       ├── workflows/
│       │   └── main_workflow.py       # Composition root
│       ├── mcp_tools/
│       │   └── sre_guardian.py        # Local fallback tool backends
│       ├── prompts/                   # LLM prompt templates
│       ├── resources/                 # MCP resource implementations
│       ├── config/
│       │   └── settings.py            # Environment-based config (Pydantic)
│       ├── models/
│       │   └── get_model.py           # Gemini client initialization
│       └── utils/
│           ├── approval_store.py      # In-memory patch proposals
│           ├── opik_utils.py          # Observability integration
│           └── ...
├── mcp_client/
│   └── src/
│       ├── client.py                  # Interactive MCP client
│       ├── settings.py                # Client configuration
│       └── utils/                     # Agent loop, LLM, parsing, commands
├── data/scenarios/                    # Sample incident data
│   ├── bad_deploy/
│   ├── dependency_latency/
│   └── saturation_db_pool/
├── docs/
├── pyproject.toml
├── .python-version
├── .env.sample
└── uv.lock

Workers

All workers inherit from BaseWorker and implement async run(task, context_id) -> WorkerResult.

| Worker | Tool | Purpose |
|---|---|---|
| LogsQueryWorker | logs_query | Query structured logs by service, level, and time range |
| MetricsQueryWorker | metrics_query | Fetch time-series metrics (e.g. http_5xx_rate) |
| DeploysListWorker | deploys_list | List recent deployments in the incident window |
| IncidentContextWorker | incident_get_context | Retrieve incident context and environment info |
| RunbooksSearchWorker | runbooks_search | Search runbooks by symptom or keyword |
| RunbookGetWorker | runbooks_get | Fetch full runbook content by ID |
| PatchGenerateWorker | patch_generate | Generate rollback/remediation diffs |
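A minimal sketch of that worker contract, assuming simplified WorkerResult fields (the real signature lives in nodes/base.py and may differ):

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class WorkerResult:
    task_id: str
    status: str
    output: dict

class BaseWorker(ABC):
    @abstractmethod
    async def run(self, task: dict, context_id: str) -> WorkerResult: ...

class LogsQueryWorker(BaseWorker):
    async def run(self, task: dict, context_id: str) -> WorkerResult:
        # A real worker would call the SRE tool gateway here.
        return WorkerResult(task["id"], "success",
                            {"matches": 3, "service": task.get("service")})

result = asyncio.run(
    LogsQueryWorker().run({"id": "t1", "service": "checkout-api"}, "ctx-1"))
print(result.status)  # success
```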

MCP Interface

Tools

Registered in mcp_server/src/routers/tools.py:

  • Context management: incident_start_context, incident_get_context
  • Observability: logs_query, metrics_query
  • Deployment: deploys_list
  • Runbooks: runbooks_search, runbooks_get
  • Patching: patch_generate
  • Approval: request_patch_approval, set_patch_approval, apply_approved_patch, get_patch_approval_status
  • Workflow: process_user_query_workflow (main orchestrator entry point)
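FastMCP registers tools through decorators in the routers layer. The dependency-free mock below shows the registration shape only; it is not FastMCP's actual API, and the tool signatures are illustrative:

```python
# A decorator-based tool registry mimicking the router pattern.
TOOL_REGISTRY: dict[str, callable] = {}

def tool(name: str):
    def decorator(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator

@tool("logs_query")
def logs_query(service: str, level: str = "ERROR") -> dict:
    """Query structured logs by service and level."""
    return {"service": service, "level": level, "matches": []}

@tool("deploys_list")
def deploys_list(window_minutes: int = 60) -> list:
    """List recent deployments in the incident window."""
    return []

print(sorted(TOOL_REGISTRY))  # ['deploys_list', 'logs_query']
```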

Resources

  • system://status — System health (CPU, uptime)
  • system://memory — Memory usage stats
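A hedged sketch of the payload a system://status resource might return; the real resource is registered with FastMCP in routers/resources.py and the exact fields are assumptions:

```python
import os
import time

_START = time.monotonic()

def system_status() -> dict:
    # Lightweight stdlib stand-ins for CPU and uptime readings.
    return {
        "cpu_count": os.cpu_count() or 1,
        "uptime_seconds": round(time.monotonic() - _START, 2),
        "pid": os.getpid(),
    }

print(system_status())
```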

Prompts

  • sre_triage_execution_prompt — Full SRE instructions for incident triage. This prompt ties together the investigation tools into a guided agentic workflow and includes steps where user feedback and confirmation are required (e.g. reviewing the triage plan, approving patch proposals before application).

Dataset Scenarios

Sample incident datasets in data/scenarios/ for local fallback and demonstration:

| Scenario | Description |
|---|---|
| bad_deploy | A code deployment introduces a connection pool exhaustion bug, causing 500 errors |
| dependency_latency | A downstream service dependency starts timing out, cascading failures upstream |
| saturation_db_pool | Database connection pool saturation under increased load |

Each scenario includes logs.jsonl, metrics.json, deploys.json, and runbooks.md. These are consumed by mcp_server/src/mcp_tools/sre_guardian.py when local fallback is enabled.
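Reading a scenario's logs.jsonl is a line-per-record JSON parse. The sample records below are made up to mirror the structure described here, not copied from the repository:

```python
import io
import json

sample = io.StringIO(
    '{"ts": "2024-05-01T14:05:12Z", "service": "checkout-api", "level": "ERROR", "msg": "HTTP 500"}\n'
    '{"ts": "2024-05-01T14:05:13Z", "service": "checkout-api", "level": "INFO", "msg": "health ok"}\n'
)

def load_error_logs(fh, service: str) -> list[dict]:
    # Keep only ERROR-level records for the service under investigation.
    return [rec for line in fh
            if (rec := json.loads(line))["service"] == service
            and rec["level"] == "ERROR"]

errors = load_error_logs(sample, "checkout-api")
print(len(errors))  # 1
```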


Setup Instructions

Prerequisites

  • Python 3.14+
  • uv (fast Python package manager)

Installation

  1. Clone the repository:
git clone https://github.com/your-username/sre-agent.git
cd sre-agent
  2. Install dependencies:
uv sync
  3. Create your environment file from the sample:
cp .env.sample .env
  4. Fill in .env with your real keys (see Environment Variables below).

Environment Variables

A sample file is provided at .env.sample. Copy it to .env and replace the placeholder values.

| Variable | Required | Default | Description |
|---|---|---|---|
| GOOGLE_API_KEY | Yes | — | Gemini API key. Get it from Google AI Studio. |
| MODEL_ID | No | gemini-2.5-flash | Gemini model ID to use for orchestration and synthesis. |
| ORCHESTRATOR_ENABLE_OFFLINE_FALLBACK | No | true | Use deterministic task plan when LLM is unavailable. |
| MCP_ENABLE_LOCAL_FALLBACK | No | true | Use local scenario data when MCP backend is unavailable. |
| SRE_SAMPLE_SCENARIO | No | bad_deploy | Which scenario to load (bad_deploy, dependency_latency, saturation_db_pool). |
| LOG_LEVEL | No | 20 | Logging level (10=DEBUG, 20=INFO, 30=WARNING). |
| LOG_LEVEL_DEPENDENCIES | No | 30 | Logging level for third-party libraries. |
| OPIK_API_KEY | No | — | Opik API key for observability. |
| OPIK_WORKSPACE | No | — | Opik workspace name. |
| OPIK_PROJECT_NAME | No | sre_guardian | Opik project name. |
| OPENAI_API_KEY | No | — | OpenAI API key (reserved for future provider support). |

Running the Server

MCP Server (stdio)

uv --directory mcp_server run -m src.server --transport stdio

MCP Server (streamable HTTP)

uv --directory mcp_server run -m src.server --transport streamable-http --port 8001

Interactive MCP Client

In-memory server mode (MCP server runs in the same process):

uv run python -m mcp_client.src.client

Stdio server mode (connects to external MCP server):

uv run python -m mcp_client.src.client --transport stdio

Connecting from an MCP Client

Cursor / Generic MCP Client

{
  "mcp_servers": {
    "sre-first-responder": {
      "command": "uv",
      "args": [
        "--directory",
        "/absolute/path/to/sre-agent/mcp_server",
        "run",
        "-m",
        "src.server",
        "--transport",
        "stdio"
      ],
      "env": {
        "GOOGLE_API_KEY": "your-google-api-key",
        "OPIK_API_KEY": "your-opik-api-key",
        "OPIK_PROJECT_NAME": "sre_guardian"
      }
    }
  }
}

Claude Desktop

A working example is included in claude_desktop_config.json.


Example Queries

Prod checkout is spiking 500s since 14:05 UTC. Start triage for checkout-api.
Start triage for checkout-api and check if a recent deploy caused regressions.
Investigate checkout latency and timeout errors in production.

Testing

Compile check:

python -m compileall mcp_server/src mcp_client/src

Workflow smoke test:

python - <<'PY'
import asyncio
from mcp_server.src.workflows.main_workflow import process_user_query

async def main():
    result = await process_user_query(
        user_query='Prod checkout is spiking 500s since 14:05 UTC. Start triage for checkout-api.',
        render_output=False,
    )
    print(result['status'], result['task_count'], result['result_count'])

asyncio.run(main())
PY

Troubleshooting

ModuleNotFoundError: No module named 'mcp_server'
Use module-based invocation:

uv --directory mcp_server run -m src.server --transport stdio

Gemini quota errors (429 RESOURCE_EXHAUSTED)
This is Gemini rate limiting. Wait for the quota to reset, or enable the offline fallback by setting ORCHESTRATOR_ENABLE_OFFLINE_FALLBACK=true in your .env.

Opik logs to the Default Project
Set OPIK_PROJECT_NAME in your .env file and ensure Opik is configured in the process that creates spans.


Mandatory Features Implemented

  1. MCP server in Python using FastMCP

    • Implemented in mcp_server/src/server.py. The server is created with FastMCP and supports both stdio and streamable-http transports. All tools, resources, and prompts are registered via the routers layer in mcp_server/src/routers/.
  2. At least one MCP tool with meaningful functionality

    • Multiple MCP tools are implemented and registered in mcp_server/src/routers/tools.py, including logs_query, metrics_query, deploys_list, runbooks_search, runbooks_get, patch_generate, incident_start_context, incident_get_context, approval workflow tools (request_patch_approval, set_patch_approval, apply_approved_patch, get_patch_approval_status), and the main process_user_query_workflow orchestrator entry point.
  3. At least one MCP prompt with user feedback workflow

    • The sre_triage_execution_prompt is implemented in mcp_server/src/prompts/sre_instructions_prompt.py and registered in mcp_server/src/routers/prompts.py. This prompt drives a full agentic triage workflow and includes explicit user feedback steps: the human approval gate requires the engineer to review the generated patch proposal and explicitly approve or reject it before any changes are applied (via request_patch_approval -> set_patch_approval -> apply_approved_patch).
  4. Published on a public GitHub repository

    • The complete project is hosted on GitHub.
  5. Initialized using uv for dependency management

    • The project uses uv with pyproject.toml, .python-version (3.14.0), and uv.lock. Dependencies are installed via uv sync.
  6. Organized source structure

    • Code is organized under mcp_server/src/ and mcp_client/src/ with clear separation: server.py (entry point), routers/ (MCP registration), tools/ (tool implementations), app/ (business logic), entities/ (domain models), infrastructure/ (adapters), config/settings.py (Pydantic-based configuration), prompts/ (prompt templates), resources/ (resource implementations), and utils/ (utilities).
  7. Comprehensive README.md

    • This README includes: project description, use cases, architecture diagram, setup instructions, environment variable documentation, MCP client configuration JSON, mandatory and custom feature sections, and troubleshooting guide.
  8. No API keys or sensitive credentials committed

    • A .env.sample file with placeholder values is provided. The .env file is gitignored. The README explains which keys are needed and how to obtain them.

Custom Features Implemented

  1. MCP client with custom orchestration/planning (Client & Integration)

    • mcp_client/src/client.py implements a full interactive MCP client with a custom agent loop (handle_agent_loop_utils.py), tool-call orchestration, thinking mode toggle, transport mode selection (in-memory or stdio), and conversation history management. This demonstrates programmatic usage of the MCP server beyond Cursor or Claude Desktop.
  2. MCP resources for contextual information (Client & Integration)

    • Two MCP resources are implemented in mcp_server/src/resources/: system://status provides real-time system health data (CPU, uptime) and system://memory provides memory usage statistics. These are registered in mcp_server/src/routers/resources.py and give the agent contextual awareness of the environment it's running in.
  3. Multiple workflow patterns (sequential, parallel, conditional) (Workflow & Agentic Patterns)

    • The phase runner (mcp_server/src/app/phase_runner.py) implements three distinct patterns: sequential phase ordering (phase 1 completes before phase 2 begins), parallel execution within each phase (all same-phase tasks run concurrently via asyncio.gather()), and conditional branching through dependency gates (depends_on fields that skip downstream tasks when upstream dependencies fail).
  4. Plan-and-Execute agentic pattern (Workflow & Agentic Patterns)

    • The system implements a full Plan-and-Execute pattern: the orchestrator (mcp_server/src/app/orchestrator.py) uses Gemini structured output to produce an IncidentTriagePlan with ranked hypotheses, phased tasks, and dependency ordering. The phase runner then executes the plan. The synthesizer (mcp_server/src/app/synthesizer.py) composes the final incident report. This three-stage pipeline (plan -> execute -> synthesize) is wired together in the composition root (mcp_server/src/workflows/main_workflow.py).
  5. Human-in-the-loop validation (Human-in-the-Loop & UX)

    • The system generates patch proposals (e.g. rollback diffs, runbook updates) as structured artifacts and explicitly waits for human approval before applying them. The approval workflow uses request_patch_approval to create a pending proposal, set_patch_approval for the human to accept or reject, and apply_approved_patch to apply only after explicit approval. This goes beyond simple prompt feedback by implementing a full "AI Generation -> Diff Preview -> Human Review -> Gated Application" pattern.
  6. Multi-agent orchestration (Multi-Agent Systems)

    • The system implements a manager-worker pattern where the orchestrator acts as a manager agent that decomposes the problem and assigns tasks, and 7 specialized worker agents (LogsQueryWorker, MetricsQueryWorker, DeploysListWorker, IncidentContextWorker, RunbooksSearchWorker, RunbookGetWorker, PatchGenerateWorker) collaborate to investigate different aspects of the incident. The worker registry (mcp_server/src/app/worker_registry.py) maps tool names to worker instances, and the synthesizer aggregates outputs from all workers into a unified report.
  7. Observability with Opik (Observability & Evaluation)

    • Both the MCP server (mcp_server/src/utils/opik_utils.py) and the MCP client (mcp_client/src/utils/opik_handler.py) integrate with Opik for LLM call tracing, tool invocation tracking, and agent behavior analysis. All MCP tools are decorated with @opik.track() for automatic span creation.
  8. Custom domain-specific data (Custom Data)

    • Three incident scenario datasets are provided in data/scenarios/: bad_deploy, dependency_latency, and saturation_db_pool. Each scenario includes structured logs (logs.jsonl), time-series metrics (metrics.json), deployment records (deploys.json), and runbooks (runbooks.md). These are consumed by mcp_server/src/mcp_tools/sre_guardian.py to provide realistic, domain-specific incident data for investigation.
  9. Structured outputs with Pydantic models (Structured Outputs)

    • The orchestrator uses Gemini's structured output mode with Pydantic response schemas to ensure consistent, validated responses. Key models include IncidentTriagePlan, Task, Hypothesis, Severity, WorkerResult, and WorkerStatus (all defined in mcp_server/src/entities/). This guarantees that LLM outputs conform to expected shapes and can be reliably consumed by downstream components.
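The human-in-the-loop approval gate described above (request -> set -> apply) reduces to a small state machine. This in-memory sketch uses illustrative function and field names, not the project's actual implementation in utils/approval_store.py:

```python
import uuid

_PROPOSALS: dict[str, dict] = {}

def request_patch_approval(diff: str) -> str:
    pid = str(uuid.uuid4())
    _PROPOSALS[pid] = {"diff": diff, "state": "pending", "applied": False}
    return pid

def set_patch_approval(pid: str, approved: bool) -> None:
    _PROPOSALS[pid]["state"] = "approved" if approved else "rejected"

def apply_approved_patch(pid: str) -> bool:
    p = _PROPOSALS[pid]
    if p["state"] != "approved":
        return False  # the gate: never apply without explicit approval
    p["applied"] = True
    return True

pid = request_patch_approval("revert deploy v2.3.1 -> v2.3.0")
print(apply_approved_patch(pid))  # False (still pending)
set_patch_approval(pid, approved=True)
print(apply_approved_patch(pid))  # True
```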

Contributing

Open an issue or PR with a clear problem statement, reproduction steps, and proposed changes.

License

This project is licensed under the MIT License. See LICENSE.

About

SRE Agent — An AI-powered MCP server for production incident triage. Takes natural-language symptom reports, plans structured investigations using Gemini, executes parallel workers (logs, metrics, deploys, runbooks), synthesizes root-cause reports, and proposes remediation patches with human approval gates.
