- The Problem
- What SRE Agent Does
- Demo
- Architecture & System Design
- Project Structure
- Workers
- MCP Interface
- Dataset Scenarios
- Setup Instructions
- Environment Variables
- Running the Server
- Connecting from an MCP Client
- Example Queries
- Testing
- Troubleshooting
- Mandatory Features Implemented
- Custom Features Implemented
- Contributing
- License
## The Problem

Production incident response is one of the most high-pressure, time-sensitive activities in software engineering. When a service goes down at 3 AM, an on-call Site Reliability Engineer (SRE) is paged and must answer a cascade of questions under extreme time pressure:
- What is actually broken? Is it a single endpoint, an entire service, or a cascading failure across multiple systems?
- What changed? Was there a recent deployment, a configuration change, or an infrastructure event that correlates with the start of the incident?
- Where is the evidence? Logs, metrics, and traces are scattered across different observability platforms. The engineer must query each one individually, mentally correlate timestamps, and piece together a timeline.
- What does the runbook say? Most teams maintain runbooks for known failure modes, but finding the right runbook and following its steps while simultaneously investigating is cognitively demanding.
- What is the fix? Once the root cause is identified, the engineer must decide on remediation: rollback a deployment, scale infrastructure, toggle a feature flag, or apply a hotfix. Each option carries risk.
This process is slow, error-prone, and mentally exhausting. Studies show that Mean Time to Resolution (MTTR) for production incidents averages 1-4 hours across the industry, with a significant portion of that time spent on the investigation phase rather than the actual fix. During an outage, every minute costs money: lost revenue, SLA violations, customer churn, and engineering productivity drain.
The core challenge is that incident triage is fundamentally a multi-step reasoning task that requires gathering evidence from multiple sources, forming hypotheses, testing them against data, and arriving at a root cause. Today, this reasoning happens entirely in the engineer's head, with no structured framework to guide the investigation or prevent cognitive shortcuts that lead to wrong conclusions.
Current incident response tooling addresses individual pieces of the puzzle but not the investigation workflow itself:
- Observability platforms (Datadog, Grafana, New Relic) are excellent at storing and visualizing data, but they don't reason about what the data means or guide investigation.
- Alerting systems (PagerDuty, OpsGenie) notify the right person, but provide no investigation support beyond the initial alert context.
- Runbook tools (Confluence, Notion) store institutional knowledge, but there's no automated way to match symptoms to the right runbook or verify that its steps are being followed.
- ChatOps bots (custom Slack bots) can query individual data sources, but lack the ability to plan a structured investigation or synthesize findings across sources.
What's missing is an intelligent orchestration layer that can take a natural-language description of symptoms, plan a structured investigation, execute it by querying the right tools in the right order, and synthesize the results into an actionable incident report with remediation proposals.
## What SRE Agent Does

SRE Agent is a domain-specific MCP server that acts as an AI-powered first responder for production incidents. It bridges the gap between raw observability data and actionable incident response by automating the investigation workflow that SREs currently perform manually.
When an on-call engineer reports symptoms (e.g. "Prod checkout is spiking 500s since 14:05 UTC"), the server:

1. **Plans the investigation** — An LLM-powered orchestrator analyzes the symptoms, generates ranked hypotheses (e.g. "bad deployment", "dependency failure", "resource exhaustion"), and produces a structured triage plan with phased tasks and dependency ordering.
2. **Executes the investigation in parallel** — Specialized workers query logs, metrics, deployment records, and runbooks simultaneously. Same-phase tasks run in parallel for speed; cross-phase dependencies are enforced automatically.
3. **Synthesizes findings into an incident report** — Worker outputs are aggregated and analyzed to produce a root-cause assessment, evidence summary, severity classification, and recommended next actions.
4. **Proposes remediation with human approval** — The system generates patch artifacts (e.g. rollback diffs, runbook updates) but never applies them automatically. A human approval gate ensures the engineer reviews and explicitly approves any changes before they take effect.
The result: an engineer who would normally spend 30-60 minutes querying dashboards, reading logs, and cross-referencing deployment history can get a structured incident report with root-cause analysis and remediation options in under a minute.
- On-call triage acceleration — Reduce MTTR by automating evidence gathering and correlation during production incidents.
- Incident postmortem preparation — Generate structured timelines and evidence summaries that feed directly into postmortem documents.
- Runbook validation — Automatically match incident symptoms to relevant runbooks and verify that remediation steps are applicable.
- Junior SRE training — The structured triage plan serves as an educational tool, showing less experienced engineers how to systematically investigate an outage.
- AI-assisted pair debugging — Use the MCP client interactively to explore an incident step by step, with the agent providing context and suggestions at each stage.
## Architecture & System Design

The system follows Clean Architecture with an explicit composition root, ensuring that business logic is decoupled from infrastructure concerns and the MCP protocol boundary.
| Layer | Path | Responsibility |
|---|---|---|
| Entities | `mcp_server/src/entities/` | Pure domain models (Pydantic). No external dependencies. |
| Use Cases | `mcp_server/src/app/` | Orchestration, phase execution, synthesis, worker dispatch. Depends only on port interfaces (`ports.py`). |
| Infrastructure | `mcp_server/src/infrastructure/` | Adapter implementations: Gemini LLM gateway, SRE tool gateway, approval store. |
| MCP Routers | `mcp_server/src/routers/` | Protocol boundary. Registers tools, resources, and prompts with FastMCP. |
| Composition Root | `mcp_server/src/workflows/main_workflow.py` | Wires infrastructure adapters into use cases and drives the end-to-end flow. |
Dependency injection interfaces are defined in `mcp_server/src/app/ports.py`:

- `LanguageModelPort` — `generate_json()` and `generate_text()` for LLM calls
- `SreToolGatewayPort` — logs, metrics, deploys, runbooks, and patch operations
- `ApprovalStorePort` — create, approve, and apply patch proposals

Concrete adapters live in `mcp_server/src/infrastructure/gateways.py`:

- `GeminiLanguageModelGateway` — Google Gemini with structured output
- `SreGuardianToolGateway` — backed by the local `mcp_tools/sre_guardian`
- `PatchApprovalStoreGateway` — in-memory approval store
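The ports above can be sketched as Python `Protocol` classes. This is an illustrative shape only: the exact method signatures in `ports.py` may differ.

```python
from typing import Any, Protocol


class LanguageModelPort(Protocol):
    """LLM boundary, implemented by an adapter such as GeminiLanguageModelGateway."""

    async def generate_json(self, prompt: str, schema: type) -> Any: ...
    async def generate_text(self, prompt: str) -> str: ...


class ApprovalStorePort(Protocol):
    """Patch-proposal lifecycle: create a pending proposal, approve it, apply it."""

    def create(self, patch: dict) -> str: ...
    def approve(self, proposal_id: str) -> None: ...
    def apply(self, proposal_id: str) -> dict: ...
```

Because use cases depend only on these interfaces, adapters can be swapped (for example, a fake LLM in tests) without touching business logic.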
The end-to-end workflow proceeds through six stages:

- **Start** — Accepts the user query and initializes progress tracking.
- **Orchestrator** — Builds an `IncidentTriagePlan` with hypotheses, tasks, dependencies, and phase numbers. Uses Gemini structured output or a deterministic fallback.
- **Worker Execution** — Groups tasks by phase. Runs same-phase tasks in parallel via `asyncio.gather()`. Enforces `depends_on` gates before downstream tasks.
- **Synthesis** — Aggregates worker outputs into a structured incident report with root-cause assessment, evidence, and next actions. Creates a patch proposal.
- **Approval Gate** — Patch proposals remain pending until explicit human approval. If approved, the patch is applied; otherwise it stays unapplied.
- **Completion** — Returns the final structured payload including the plan, per-task results, synthesis output, and approval state.
## Project Structure

```text
.
├── mcp_server/
│   └── src/
│       ├── server.py                # FastMCP entry point
│       ├── routers/                 # MCP tool/resource/prompt registration
│       │   ├── tools.py
│       │   ├── resources.py
│       │   └── prompts.py
│       ├── app/                     # Use-case logic
│       │   ├── ports.py             # Dependency injection interfaces
│       │   ├── orchestrator.py      # Query -> IncidentTriagePlan
│       │   ├── phase_runner.py      # Phased parallel task execution
│       │   ├── task_runner.py       # Single task dispatch
│       │   ├── synthesizer.py       # Findings -> report + patch
│       │   └── worker_registry.py   # Tool name -> worker mapping
│       ├── infrastructure/
│       │   └── gateways.py          # Gemini, SRE tool, approval adapters
│       ├── entities/                # Pydantic domain models
│       │   ├── incident_triage_plan.py
│       │   ├── tasks.py
│       │   ├── workers.py
│       │   ├── hypothesis.py
│       │   ├── severity.py
│       │   └── ...
│       ├── nodes/                   # Concrete worker implementations
│       │   ├── base.py              # BaseWorker abstract class
│       │   ├── logs_query_worker.py
│       │   ├── metrics_query_worker.py
│       │   ├── deploy_list_worker.py
│       │   ├── incident_context_worker.py
│       │   ├── runbooks_search_worker.py
│       │   ├── runbook_get_worker.py
│       │   └── patch_generate_worker.py
│       ├── tools/                   # Tool implementation layer
│       ├── workflows/
│       │   └── main_workflow.py     # Composition root
│       ├── mcp_tools/
│       │   └── sre_guardian.py      # Local fallback tool backends
│       ├── prompts/                 # LLM prompt templates
│       ├── resources/               # MCP resource implementations
│       ├── config/
│       │   └── settings.py          # Environment-based config (Pydantic)
│       ├── models/
│       │   └── get_model.py         # Gemini client initialization
│       └── utils/
│           ├── approval_store.py    # In-memory patch proposals
│           ├── opik_utils.py        # Observability integration
│           └── ...
├── mcp_client/
│   └── src/
│       ├── client.py                # Interactive MCP client
│       ├── settings.py              # Client configuration
│       └── utils/                   # Agent loop, LLM, parsing, commands
├── data/scenarios/                  # Sample incident data
│   ├── bad_deploy/
│   ├── dependency_latency/
│   └── saturation_db_pool/
├── docs/
├── pyproject.toml
├── .python-version
├── .env.sample
└── uv.lock
```
## Workers

All workers inherit from `BaseWorker` and implement `async run(task, context_id) -> WorkerResult`.
| Worker | Tool | Purpose |
|---|---|---|
| `LogsQueryWorker` | `logs_query` | Query structured logs by service, level, and time range |
| `MetricsQueryWorker` | `metrics_query` | Fetch time-series metrics (e.g. `http_5xx_rate`) |
| `DeploysListWorker` | `deploys_list` | List recent deployments in the incident window |
| `IncidentContextWorker` | `incident_get_context` | Retrieve incident context and environment info |
| `RunbooksSearchWorker` | `runbooks_search` | Search runbooks by symptom or keyword |
| `RunbookGetWorker` | `runbooks_get` | Fetch full runbook content by ID |
| `PatchGenerateWorker` | `patch_generate` | Generate rollback/remediation diffs |
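Given the `BaseWorker` interface above, a new worker might look like the following sketch. The base-class internals and the `WorkerResult` fields shown here are illustrative stand-ins, not the project's actual definitions.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class WorkerResult:  # simplified stand-in for the entities module
    task_id: str
    status: str
    output: dict


class BaseWorker:
    tool_name: str = ""

    async def run(self, task: dict, context_id: str) -> WorkerResult:
        raise NotImplementedError


class LogsQueryWorker(BaseWorker):
    tool_name = "logs_query"

    async def run(self, task: dict, context_id: str) -> WorkerResult:
        # A real implementation would call the SRE tool gateway here.
        params = task.get("params", {})
        return WorkerResult(
            task_id=task["task_id"],
            status="success",
            output={"service": params.get("service"), "entries": []},
        )


result = asyncio.run(
    LogsQueryWorker().run(
        {"task_id": "t1", "params": {"service": "checkout-api"}}, "ctx-1"
    )
)
```

Because every worker shares the same `run` signature, the worker registry can dispatch any task by tool name without knowing the concrete class.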
## MCP Interface

### Tools

Registered in `mcp_server/src/routers/tools.py`:

- **Context management**: `incident_start_context`, `incident_get_context`
- **Observability**: `logs_query`, `metrics_query`
- **Deployment**: `deploys_list`
- **Runbooks**: `runbooks_search`, `runbooks_get`
- **Patching**: `patch_generate`
- **Approval**: `request_patch_approval`, `set_patch_approval`, `apply_approved_patch`, `get_patch_approval_status`
- **Workflow**: `process_user_query_workflow` (main orchestrator entry point)
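The approval tools implement a small state machine over patch proposals. A minimal in-memory sketch of that lifecycle follows; the class and method names are illustrative, not the project's actual `PatchApprovalStoreGateway`.

```python
import uuid


class PatchApprovalStore:
    """Proposal lifecycle: pending -> approved/rejected -> applied."""

    def __init__(self) -> None:
        self._proposals: dict[str, dict] = {}

    def request(self, patch: str) -> str:
        proposal_id = str(uuid.uuid4())
        self._proposals[proposal_id] = {
            "patch": patch, "status": "pending", "applied": False,
        }
        return proposal_id

    def set_approval(self, proposal_id: str, approved: bool) -> None:
        self._proposals[proposal_id]["status"] = (
            "approved" if approved else "rejected"
        )

    def apply(self, proposal_id: str) -> bool:
        proposal = self._proposals[proposal_id]
        if proposal["status"] != "approved":
            return False  # gate: never apply without explicit human approval
        proposal["applied"] = True
        return True

    def status(self, proposal_id: str) -> dict:
        return dict(self._proposals[proposal_id])


store = PatchApprovalStore()
pid = store.request("--- rollback diff ---")
assert store.apply(pid) is False  # blocked while still pending
store.set_approval(pid, approved=True)
```

The key property is that `apply` is a no-op until a human has explicitly flipped the approval state.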
### Resources

- `system://status` — System health (CPU, uptime)
- `system://memory` — Memory usage stats
### Prompts

- `sre_triage_execution_prompt` — Full SRE instructions for incident triage. This prompt ties the investigation tools together into a guided agentic workflow and includes steps where user feedback and confirmation are required (e.g. reviewing the triage plan, approving patch proposals before application).
## Dataset Scenarios

Sample incident datasets in `data/scenarios/` for local fallback and demonstration:

| Scenario | Description |
|---|---|
| `bad_deploy` | A code deployment introduces a connection pool exhaustion bug, causing 500 errors |
| `dependency_latency` | A downstream service dependency starts timing out, cascading failures upstream |
| `saturation_db_pool` | Database connection pool saturation under increased load |

Each scenario includes `logs.jsonl`, `metrics.json`, `deploys.json`, and `runbooks.md`. These are consumed by `mcp_server/src/mcp_tools/sre_guardian.py` when local fallback is enabled.
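To inspect a scenario locally, the four files can be read with the standard library. The paths follow the layout above; the helper below is a convenience sketch, not part of the project.

```python
import json
from pathlib import Path


def load_scenario(root: str, name: str) -> dict:
    """Read one scenario directory into a dict of parsed artifacts."""
    base = Path(root) / name
    # logs.jsonl: one JSON object per line
    logs = [json.loads(line)
            for line in (base / "logs.jsonl").read_text().splitlines()
            if line.strip()]
    metrics = json.loads((base / "metrics.json").read_text())
    deploys = json.loads((base / "deploys.json").read_text())
    runbooks = (base / "runbooks.md").read_text()
    return {"logs": logs, "metrics": metrics,
            "deploys": deploys, "runbooks": runbooks}
```

For example, `load_scenario("data/scenarios", "bad_deploy")` returns the parsed logs, metrics, deployment records, and runbook text for the bad-deploy scenario.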
## Setup Instructions

### Prerequisites

- Python 3.14+
- `uv` (fast Python package manager)

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/sre-agent.git
   cd sre-agent
   ```

2. Install dependencies:

   ```bash
   uv sync
   ```

3. Create your environment file from the sample:

   ```bash
   cp .env.sample .env
   ```

4. Fill in `.env` with your real keys (see Environment Variables below).
## Environment Variables

A sample file is provided at `.env.sample`. Copy it to `.env` and replace the placeholder values.

| Variable | Required | Default | Description |
|---|---|---|---|
| `GOOGLE_API_KEY` | Yes | — | Gemini API key. Get it from Google AI Studio. |
| `MODEL_ID` | No | `gemini-2.5-flash` | Gemini model ID used for orchestration and synthesis. |
| `ORCHESTRATOR_ENABLE_OFFLINE_FALLBACK` | No | `true` | Use a deterministic task plan when the LLM is unavailable. |
| `MCP_ENABLE_LOCAL_FALLBACK` | No | `true` | Use local scenario data when the MCP backend is unavailable. |
| `SRE_SAMPLE_SCENARIO` | No | `bad_deploy` | Scenario to load (`bad_deploy`, `dependency_latency`, `saturation_db_pool`). |
| `LOG_LEVEL` | No | `20` | Logging level (10=DEBUG, 20=INFO, 30=WARNING). |
| `LOG_LEVEL_DEPENDENCIES` | No | `30` | Logging level for third-party libraries. |
| `OPIK_API_KEY` | No | — | Opik API key for observability. |
| `OPIK_WORKSPACE` | No | — | Opik workspace name. |
| `OPIK_PROJECT_NAME` | No | `sre_guardian` | Opik project name. |
| `OPENAI_API_KEY` | No | — | OpenAI API key (reserved for future provider support). |
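The project reads these variables through a Pydantic-based settings module (`config/settings.py`). An equivalent minimal reading of the defaults above, using only the standard library for illustration:

```python
import os


def load_settings() -> dict:
    """Read the documented variables with their documented defaults."""
    return {
        "google_api_key": os.environ.get("GOOGLE_API_KEY"),  # required
        "model_id": os.environ.get("MODEL_ID", "gemini-2.5-flash"),
        "offline_fallback": os.environ.get(
            "ORCHESTRATOR_ENABLE_OFFLINE_FALLBACK", "true").lower() == "true",
        "local_fallback": os.environ.get(
            "MCP_ENABLE_LOCAL_FALLBACK", "true").lower() == "true",
        "scenario": os.environ.get("SRE_SAMPLE_SCENARIO", "bad_deploy"),
        "log_level": int(os.environ.get("LOG_LEVEL", "20")),
    }
```

Boolean flags are parsed case-insensitively, so `true`, `True`, and `TRUE` all enable the fallback behavior.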
## Running the Server

Stdio transport:

```bash
uv --directory mcp_server run -m src.server --transport stdio
```

Streamable HTTP transport:

```bash
uv --directory mcp_server run -m src.server --transport streamable-http --port 8001
```

In-memory server mode (MCP server runs in the same process as the client):

```bash
uv run python -m mcp_client.src.client
```

Stdio server mode (connects to an external MCP server):

```bash
uv run python -m mcp_client.src.client --transport stdio
```

## Connecting from an MCP Client

```json
{
  "mcp_servers": {
    "sre-first-responder": {
      "command": "uv",
      "args": [
        "--directory",
        "/absolute/path/to/sre-agent/mcp_server",
        "run",
        "-m",
        "src.server",
        "--transport",
        "stdio"
      ],
      "env": {
        "GOOGLE_API_KEY": "your-google-api-key",
        "OPIK_API_KEY": "your-opik-api-key",
        "OPIK_PROJECT_NAME": "sre_guardian"
      }
    }
  }
}
```

A working example is included in `claude_desktop_config.json`.
## Example Queries

```text
Prod checkout is spiking 500s since 14:05 UTC. Start triage for checkout-api.
Start triage for checkout-api and check if a recent deploy caused regressions.
Investigate checkout latency and timeout errors in production.
```
## Testing

Compile check:

```bash
python -m compileall mcp_server/src mcp_client/src
```

Workflow smoke test:

```bash
python - <<'PY'
import asyncio
from mcp_server.src.workflows.main_workflow import process_user_query

async def main():
    result = await process_user_query(
        user_query='Prod checkout is spiking 500s since 14:05 UTC. Start triage for checkout-api.',
        render_output=False,
    )
    print(result['status'], result['task_count'], result['result_count'])

asyncio.run(main())
PY
```

## Troubleshooting

**`ModuleNotFoundError: No module named 'mcp_server'`** — Use module-based invocation:

```bash
uv --directory mcp_server run -m src.server --transport stdio
```

**Gemini quota errors (`429 RESOURCE_EXHAUSTED`)** — This is Gemini rate limiting. Wait for the quota to reset, or enable offline fallback by setting `ORCHESTRATOR_ENABLE_OFFLINE_FALLBACK=true` in your `.env`.

**Opik logs to the Default Project** — Set `OPIK_PROJECT_NAME` in your `.env` file and ensure Opik is configured in the process creating spans.
## Mandatory Features Implemented

- **MCP server in Python using FastMCP**
  - Implemented in `mcp_server/src/server.py`. The server is created with FastMCP and supports both `stdio` and `streamable-http` transports. All tools, resources, and prompts are registered via the routers layer in `mcp_server/src/routers/`.
- **At least one MCP tool with meaningful functionality**
  - Multiple MCP tools are implemented and registered in `mcp_server/src/routers/tools.py`, including `logs_query`, `metrics_query`, `deploys_list`, `runbooks_search`, `runbooks_get`, `patch_generate`, `incident_start_context`, `incident_get_context`, the approval workflow tools (`request_patch_approval`, `set_patch_approval`, `apply_approved_patch`, `get_patch_approval_status`), and the main `process_user_query_workflow` orchestrator entry point.
- **At least one MCP prompt with user feedback workflow**
  - The `sre_triage_execution_prompt` is implemented in `mcp_server/src/prompts/sre_instructions_prompt.py` and registered in `mcp_server/src/routers/prompts.py`. This prompt drives a full agentic triage workflow and includes explicit user-feedback steps: the human approval gate requires the engineer to review the generated patch proposal and explicitly approve or reject it before any changes are applied (via `request_patch_approval` -> `set_patch_approval` -> `apply_approved_patch`).
- **Published on a public GitHub repository**
  - The complete project is hosted on GitHub.
- **Initialized using `uv` for dependency management**
  - The project uses `uv` with `pyproject.toml`, `.python-version` (3.14.0), and `uv.lock`. Dependencies are installed via `uv sync`.
- **Organized source structure**
  - Code is organized under `mcp_server/src/` and `mcp_client/src/` with clear separation: `server.py` (entry point), `routers/` (MCP registration), `tools/` (tool implementations), `app/` (business logic), `entities/` (domain models), `infrastructure/` (adapters), `config/settings.py` (Pydantic-based configuration), `prompts/` (prompt templates), `resources/` (resource implementations), and `utils/` (utilities).
- **Comprehensive README.md**
  - This README includes: project description, use cases, architecture diagram, setup instructions, environment-variable documentation, MCP client configuration JSON, mandatory and custom feature sections, and a troubleshooting guide.
- **No API keys or sensitive credentials committed**
  - A `.env.sample` file with placeholder values is provided. The `.env` file is gitignored. The README explains which keys are needed and how to obtain them.
## Custom Features Implemented

- **MCP client with custom orchestration/planning** (Client & Integration)
  - `mcp_client/src/client.py` implements a full interactive MCP client with a custom agent loop (`handle_agent_loop_utils.py`), tool-call orchestration, a thinking-mode toggle, transport-mode selection (in-memory or stdio), and conversation-history management. This demonstrates programmatic usage of the MCP server beyond Cursor or Claude Desktop.
- **MCP resources for contextual information** (Client & Integration)
  - Two MCP resources are implemented in `mcp_server/src/resources/`: `system://status` provides real-time system health data (CPU, uptime) and `system://memory` provides memory usage statistics. These are registered in `mcp_server/src/routers/resources.py` and give the agent contextual awareness of the environment it is running in.
- **Multiple workflow patterns (sequential, parallel, conditional)** (Workflow & Agentic Patterns)
  - The phase runner (`mcp_server/src/app/phase_runner.py`) implements three distinct patterns: sequential phase ordering (phase 1 completes before phase 2 begins), parallel execution within each phase (all same-phase tasks run concurrently via `asyncio.gather()`), and conditional branching through dependency gates (`depends_on` fields that skip downstream tasks when upstream dependencies fail).
- **Plan-and-Execute agentic pattern** (Workflow & Agentic Patterns)
  - The system implements a full Plan-and-Execute pattern: the orchestrator (`mcp_server/src/app/orchestrator.py`) uses Gemini structured output to produce an `IncidentTriagePlan` with ranked hypotheses, phased tasks, and dependency ordering. The phase runner then executes the plan, and the synthesizer (`mcp_server/src/app/synthesizer.py`) composes the final incident report. This three-stage pipeline (plan -> execute -> synthesize) is wired together in the composition root (`mcp_server/src/workflows/main_workflow.py`).
- **Human-in-the-loop validation** (Human-in-the-Loop & UX)
  - The system generates patch proposals (e.g. rollback diffs, runbook updates) as structured artifacts and explicitly waits for human approval before applying them. The approval workflow uses `request_patch_approval` to create a pending proposal, `set_patch_approval` for the human to accept or reject it, and `apply_approved_patch` to apply only after explicit approval. This goes beyond simple prompt feedback by implementing a full "AI Generation -> Diff Preview -> Human Review -> Gated Application" pattern.
- **Multi-agent orchestration** (Multi-Agent Systems)
  - The system implements a manager-worker pattern: the orchestrator acts as a manager agent that decomposes the problem and assigns tasks, and seven specialized worker agents (`LogsQueryWorker`, `MetricsQueryWorker`, `DeploysListWorker`, `IncidentContextWorker`, `RunbooksSearchWorker`, `RunbookGetWorker`, `PatchGenerateWorker`) investigate different aspects of the incident. The worker registry (`mcp_server/src/app/worker_registry.py`) maps tool names to worker instances, and the synthesizer aggregates outputs from all workers into a unified report.
- **Observability with Opik** (Observability & Evaluation)
  - Both the MCP server (`mcp_server/src/utils/opik_utils.py`) and the MCP client (`mcp_client/src/utils/opik_handler.py`) integrate with Opik for LLM call tracing, tool-invocation tracking, and agent behavior analysis. All MCP tools are decorated with `@opik.track()` for automatic span creation.
- **Custom domain-specific data** (Custom Data)
  - Three incident scenario datasets are provided in `data/scenarios/`: `bad_deploy`, `dependency_latency`, and `saturation_db_pool`. Each scenario includes structured logs (`logs.jsonl`), time-series metrics (`metrics.json`), deployment records (`deploys.json`), and runbooks (`runbooks.md`). These are consumed by `mcp_server/src/mcp_tools/sre_guardian.py` to provide realistic, domain-specific incident data for investigation.
- **Structured outputs with Pydantic models** (Structured Outputs)
  - The orchestrator uses Gemini's structured output mode with Pydantic response schemas to ensure consistent, validated responses. Key models include `IncidentTriagePlan`, `Task`, `Hypothesis`, `Severity`, `WorkerResult`, and `WorkerStatus` (all defined in `mcp_server/src/entities/`). This guarantees that LLM outputs conform to expected shapes and can be reliably consumed by downstream components.
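A simplified sketch of how such response schemas look in Pydantic follows; the field names here are illustrative, not the project's actual entity definitions.

```python
from pydantic import BaseModel, Field


class Hypothesis(BaseModel):
    summary: str
    rank: int = Field(ge=1)  # 1 = most likely


class Task(BaseModel):
    task_id: str
    tool: str
    phase: int
    depends_on: list[str] = []


class IncidentTriagePlan(BaseModel):
    hypotheses: list[Hypothesis]
    tasks: list[Task]


# Validating a raw LLM response against the schema rejects malformed output.
plan = IncidentTriagePlan.model_validate({
    "hypotheses": [{"summary": "bad deployment", "rank": 1}],
    "tasks": [{"task_id": "t1", "tool": "logs_query", "phase": 1}],
})
```

Passing such a schema as the response format means a malformed LLM reply raises a validation error instead of silently corrupting downstream stages.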
## Contributing

Open an issue or PR with a clear problem statement, reproduction steps, and proposed changes.
## License

This project is licensed under the MIT License. See `LICENSE`.