- The Problem
- What SRE Agent Does
- Demo
- Architecture & System Design
- Project Structure
- Workers
- MCP Interface
- Dataset Scenarios
- Setup Instructions
- Environment Variables
- Running the Server
- Connecting from an MCP Client
- Example Queries
- Testing
- Troubleshooting
- Mandatory Features Implemented
- Custom Features Implemented
- Contributing
- License
## The Problem

Production incident response is one of the most high-pressure, time-sensitive activities in software engineering. When a service goes down at 3 AM, an on-call Site Reliability Engineer (SRE) is paged and must answer a cascade of questions under extreme time pressure:
- What is actually broken? Is it a single endpoint, an entire service, or a cascading failure across multiple systems?
- What changed? Was there a recent deployment, a configuration change, or an infrastructure event that correlates with the start of the incident?
- Where is the evidence? Logs, metrics, and traces are scattered across different observability platforms. The engineer must query each one individually, mentally correlate timestamps, and piece together a timeline.
- What does the runbook say? Most teams maintain runbooks for known failure modes, but finding the right runbook and following its steps while simultaneously investigating is cognitively demanding.
- What is the fix? Once the root cause is identified, the engineer must decide on remediation: rollback a deployment, scale infrastructure, toggle a feature flag, or apply a hotfix. Each option carries risk.
This process is slow, error-prone, and mentally exhausting. Studies show that Mean Time to Resolution (MTTR) for production incidents averages 1-4 hours across the industry, with a significant portion of that time spent on the investigation phase rather than the actual fix. During an outage, every minute costs money: lost revenue, SLA violations, customer churn, and engineering productivity drain.
The core challenge is that incident triage is fundamentally a multi-step reasoning task that requires gathering evidence from multiple sources, forming hypotheses, testing them against data, and arriving at a root cause. Today, this reasoning happens entirely in the engineer's head, with no structured framework to guide the investigation or prevent cognitive shortcuts that lead to wrong conclusions.
Current incident response tooling addresses individual pieces of the puzzle but not the investigation workflow itself:
- Observability platforms (Datadog, Grafana, New Relic) are excellent at storing and visualizing data, but they don't reason about what the data means or guide investigation.
- Alerting systems (PagerDuty, OpsGenie) notify the right person, but provide no investigation support beyond the initial alert context.
- Runbook tools (Confluence, Notion) store institutional knowledge, but there's no automated way to match symptoms to the right runbook or verify that its steps are being followed.
- ChatOps bots (custom Slack bots) can query individual data sources, but lack the ability to plan a structured investigation or synthesize findings across sources.
What's missing is an intelligent orchestration layer that can take a natural-language description of symptoms, plan a structured investigation, execute it by querying the right tools in the right order, and synthesize the results into an actionable incident report with remediation proposals.
## What SRE Agent Does

SRE Agent is a domain-specific MCP server that acts as an AI-powered first responder for production incidents. It bridges the gap between raw observability data and actionable incident response by automating the investigation workflow that SREs currently perform manually.
When an on-call engineer reports symptoms (e.g. "Prod checkout is spiking 500s since 14:05 UTC"), the server:

1. **Plans the investigation** — An LLM-powered orchestrator analyzes the symptoms, generates ranked hypotheses (e.g. "bad deployment", "dependency failure", "resource exhaustion"), and produces a structured triage plan with phased tasks and dependency ordering.
2. **Executes the investigation in parallel** — Specialized workers query logs, metrics, deployment records, and runbooks simultaneously. Same-phase tasks run in parallel for speed; cross-phase dependencies are enforced automatically.
3. **Synthesizes findings into an incident report** — Worker outputs are aggregated and analyzed to produce a root-cause assessment, evidence summary, severity classification, and recommended next actions.
4. **Proposes remediation with human approval** — The system generates patch artifacts (e.g. rollback diffs, runbook updates) but never applies them automatically. A human approval gate ensures the engineer reviews and explicitly approves any changes before they take effect.
The result: an engineer who would normally spend 30-60 minutes querying dashboards, reading logs, and cross-referencing deployment history can get a structured incident report with root-cause analysis and remediation options in under a minute.
- On-call triage acceleration — Reduce MTTR by automating evidence gathering and correlation during production incidents.
- Incident postmortem preparation — Generate structured timelines and evidence summaries that feed directly into postmortem documents.
- Runbook validation — Automatically match incident symptoms to relevant runbooks and verify that remediation steps are applicable.
- Junior SRE training — The structured triage plan serves as an educational tool, showing less experienced engineers how to systematically investigate an outage.
- AI-assisted pair debugging — Use the MCP client interactively to explore an incident step by step, with the agent providing context and suggestions at each stage.
## Architecture & System Design

The system follows Clean Architecture with an explicit composition root, ensuring that business logic is decoupled from infrastructure concerns and the MCP protocol boundary.
| Layer | Path | Responsibility |
|---|---|---|
| Entities | `mcp_server/src/entities/` | Pure domain models (Pydantic). No external dependencies. |
| Use Cases | `mcp_server/src/app/` | Orchestration, phase execution, synthesis, worker dispatch. Depends only on port interfaces (`ports.py`). |
| Infrastructure | `mcp_server/src/infrastructure/` | Adapter implementations: Gemini LLM gateway, SRE tool gateway, approval store. |
| MCP Routers | `mcp_server/src/routers/` | Protocol boundary. Registers tools, resources, and prompts with FastMCP. |
| Composition Root | `mcp_server/src/workflows/main_workflow.py` | Wires infrastructure adapters into use cases and drives the end-to-end flow. |
Dependency injection interfaces are defined in `mcp_server/src/app/ports.py`:

- `LanguageModelPort` — `generate_json()` and `generate_text()` for LLM calls
- `SreToolGatewayPort` — logs, metrics, deploys, runbooks, and patch operations
- `ApprovalStorePort` — create, approve, and apply patch proposals

Concrete adapters live in `mcp_server/src/infrastructure/gateways.py`:

- `GeminiLanguageModelGateway` — Google Gemini with structured output
- `SreGuardianToolGateway` — backed by the local `mcp_tools/sre_guardian`
- `PatchApprovalStoreGateway` — in-memory approval store
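The ports above can be sketched as Python `Protocol` classes. This is an illustrative shape only: the exact method signatures in `ports.py` may differ.

```python
from typing import Any, Protocol


class LanguageModelPort(Protocol):
    """LLM boundary, implemented by an adapter such as GeminiLanguageModelGateway."""

    async def generate_json(self, prompt: str, schema: type) -> Any: ...
    async def generate_text(self, prompt: str) -> str: ...


class ApprovalStorePort(Protocol):
    """Patch-proposal lifecycle: create a pending proposal, approve it, apply it."""

    def create(self, patch: dict) -> str: ...
    def approve(self, proposal_id: str) -> None: ...
    def apply(self, proposal_id: str) -> dict: ...
```

Because use cases depend only on these interfaces, adapters can be swapped (for example, a fake LLM in tests) without touching business logic.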
The end-to-end workflow proceeds through six stages:

- **Start** — Accepts the user query and initializes progress tracking.
- **Orchestrator** — Builds an `IncidentTriagePlan` with hypotheses, tasks, dependencies, and phase numbers. Uses Gemini structured output or a deterministic fallback.
- **Worker Execution** — Groups tasks by phase. Runs same-phase tasks in parallel via `asyncio.gather()`. Enforces `depends_on` gates before downstream tasks.
- **Synthesis** — Aggregates worker outputs into a structured incident report with root-cause assessment, evidence, and next actions. Creates a patch proposal.
- **Approval Gate** — Patch proposals remain pending until explicit human approval. If approved, the patch is applied; otherwise it stays unapplied.
- **Completion** — Returns the final structured payload including the plan, per-task results, synthesis output, and approval state.
## Project Structure

```text
.
├── mcp_server/
│   └── src/
│       ├── server.py                # FastMCP entry point
│       ├── routers/                 # MCP tool/resource/prompt registration
│       │   ├── tools.py
│       │   ├── resources.py
│       │   └── prompts.py
│       ├── app/                     # Use-case logic
│       │   ├── ports.py             # Dependency injection interfaces
│       │   ├── orchestrator.py      # Query -> IncidentTriagePlan
│       │   ├── phase_runner.py      # Phased parallel task execution
│       │   ├── task_runner.py       # Single task dispatch
│       │   ├── synthesizer.py       # Findings -> report + patch
│       │   └── worker_registry.py   # Tool name -> worker mapping
│       ├── infrastructure/
│       │   └── gateways.py          # Gemini, SRE tool, approval adapters
│       ├── entities/                # Pydantic domain models
│       │   ├── incident_triage_plan.py
│       │   ├── tasks.py
│       │   ├── workers.py
│       │   ├── hypothesis.py
│       │   ├── severity.py
│       │   └── ...
│       ├── nodes/                   # Concrete worker implementations
│       │   ├── base.py              # BaseWorker abstract class
│       │   ├── logs_query_worker.py
│       │   ├── metrics_query_worker.py
│       │   ├── deploy_list_worker.py
│       │   ├── incident_context_worker.py
│       │   ├── runbooks_search_worker.py
│       │   ├── runbook_get_worker.py
│       │   └── patch_generate_worker.py
│       ├── tools/                   # Tool implementation layer
│       ├── workflows/
│       │   └── main_workflow.py     # Composition root
│       ├── mcp_tools/
│       │   └── sre_guardian.py      # Local fallback tool backends
│       ├── prompts/                 # LLM prompt templates
│       ├── resources/               # MCP resource implementations
│       ├── config/
│       │   └── settings.py          # Environment-based config (Pydantic)
│       ├── models/
│       │   └── get_model.py         # Gemini client initialization
│       └── utils/
│           ├── approval_store.py    # In-memory patch proposals
│           ├── opik_utils.py        # Observability integration
│           └── ...
├── mcp_client/
│   └── src/
│       ├── client.py                # Interactive MCP client
│       ├── settings.py              # Client configuration
│       └── utils/                   # Agent loop, LLM, parsing, commands
├── data/scenarios/                  # Sample incident data
│   ├── bad_deploy/
│   ├── dependency_latency/
│   └── saturation_db_pool/
├── docs/
├── pyproject.toml
├── .python-version
├── .env.sample
└── uv.lock
```
## Workers

All workers inherit from `BaseWorker` and implement `async run(task, context_id) -> WorkerResult`.
| Worker | Tool | Purpose |
|---|---|---|
| `LogsQueryWorker` | `logs_query` | Query structured logs by service, level, and time range |
| `MetricsQueryWorker` | `metrics_query` | Fetch time-series metrics (e.g. `http_5xx_rate`) |
| `DeploysListWorker` | `deploys_list` | List recent deployments in the incident window |
| `IncidentContextWorker` | `incident_get_context` | Retrieve incident context and environment info |
| `RunbooksSearchWorker` | `runbooks_search` | Search runbooks by symptom or keyword |
| `RunbookGetWorker` | `runbooks_get` | Fetch full runbook content by ID |
| `PatchGenerateWorker` | `patch_generate` | Generate rollback/remediation diffs |
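Given the `BaseWorker` interface above, a new worker might look like the following sketch. The base-class internals and the `WorkerResult` fields shown here are illustrative stand-ins, not the project's actual definitions.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class WorkerResult:  # simplified stand-in for the entities module
    task_id: str
    status: str
    output: dict


class BaseWorker:
    tool_name: str = ""

    async def run(self, task: dict, context_id: str) -> WorkerResult:
        raise NotImplementedError


class LogsQueryWorker(BaseWorker):
    tool_name = "logs_query"

    async def run(self, task: dict, context_id: str) -> WorkerResult:
        # A real implementation would call the SRE tool gateway here.
        params = task.get("params", {})
        return WorkerResult(
            task_id=task["task_id"],
            status="success",
            output={"service": params.get("service"), "entries": []},
        )


result = asyncio.run(
    LogsQueryWorker().run(
        {"task_id": "t1", "params": {"service": "checkout-api"}}, "ctx-1"
    )
)
```

Because every worker shares the same `run` signature, the worker registry can dispatch any task by tool name without knowing the concrete class.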
## MCP Interface

### Tools

Registered in `mcp_server/src/routers/tools.py`:

- **Context management**: `incident_start_context`, `incident_get_context`
- **Observability**: `logs_query`, `metrics_query`
- **Deployment**: `deploys_list`
- **Runbooks**: `runbooks_search`, `runbooks_get`
- **Patching**: `patch_generate`
- **Approval**: `request_patch_approval`, `set_patch_approval`, `apply_approved_patch`, `get_patch_approval_status`
- **Workflow**: `process_user_query_workflow` (main orchestrator entry point)
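The approval tools implement a small state machine over patch proposals. A minimal in-memory sketch of that lifecycle follows; the class and method names are illustrative, not the project's actual `PatchApprovalStoreGateway`.

```python
import uuid


class PatchApprovalStore:
    """Proposal lifecycle: pending -> approved/rejected -> applied."""

    def __init__(self) -> None:
        self._proposals: dict[str, dict] = {}

    def request(self, patch: str) -> str:
        proposal_id = str(uuid.uuid4())
        self._proposals[proposal_id] = {
            "patch": patch, "status": "pending", "applied": False,
        }
        return proposal_id

    def set_approval(self, proposal_id: str, approved: bool) -> None:
        self._proposals[proposal_id]["status"] = (
            "approved" if approved else "rejected"
        )

    def apply(self, proposal_id: str) -> bool:
        proposal = self._proposals[proposal_id]
        if proposal["status"] != "approved":
            return False  # gate: never apply without explicit human approval
        proposal["applied"] = True
        return True

    def status(self, proposal_id: str) -> dict:
        return dict(self._proposals[proposal_id])


store = PatchApprovalStore()
pid = store.request("--- rollback diff ---")
assert store.apply(pid) is False  # blocked while still pending
store.set_approval(pid, approved=True)
```

The key property is that `apply` is a no-op until a human has explicitly flipped the approval state.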
### Resources

- `system://status` — System health (CPU, uptime)
- `system://memory` — Memory usage stats
### Prompts

- `sre_triage_execution_prompt` — Full SRE instructions for incident triage. This prompt ties the investigation tools together into a guided agentic workflow and includes steps where user feedback and confirmation are required (e.g. reviewing the triage plan, approving patch proposals before application).
## Dataset Scenarios

Sample incident datasets in `data/scenarios/` for local fallback and demonstration:

| Scenario | Description |
|---|---|
| `bad_deploy` | A code deployment introduces a connection pool exhaustion bug, causing 500 errors |
| `dependency_latency` | A downstream service dependency starts timing out, cascading failures upstream |
| `saturation_db_pool` | Database connection pool saturation under increased load |

Each scenario includes `logs.jsonl`, `metrics.json`, `deploys.json`, and `runbooks.md`. These are consumed by `mcp_server/src/mcp_tools/sre_guardian.py` when local fallback is enabled.
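To inspect a scenario locally, the four files can be read with the standard library. The paths follow the layout above; the helper below is a convenience sketch, not part of the project.

```python
import json
from pathlib import Path


def load_scenario(root: str, name: str) -> dict:
    """Read one scenario directory into a dict of parsed artifacts."""
    base = Path(root) / name
    # logs.jsonl: one JSON object per line
    logs = [json.loads(line)
            for line in (base / "logs.jsonl").read_text().splitlines()
            if line.strip()]
    metrics = json.loads((base / "metrics.json").read_text())
    deploys = json.loads((base / "deploys.json").read_text())
    runbooks = (base / "runbooks.md").read_text()
    return {"logs": logs, "metrics": metrics,
            "deploys": deploys, "runbooks": runbooks}
```

For example, `load_scenario("data/scenarios", "bad_deploy")` returns the parsed logs, metrics, deployment records, and runbook text for the bad-deploy scenario.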
## Setup Instructions

### Prerequisites

- Python 3.14+
- `uv` (fast Python package manager)

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/sre-agent.git
   cd sre-agent
   ```

2. Install dependencies:

   ```bash
   uv sync
   ```

3. Create your environment file from the sample:

   ```bash
   cp .env.sample .env
   ```

4. Fill in `.env` with your real keys (see Environment Variables below).
## Environment Variables

A sample file is provided at `.env.sample`. Copy it to `.env` and replace the placeholder values.

| Variable | Required | Default | Description |
|---|---|---|---|
| `GOOGLE_API_KEY` | Yes | — | Gemini API key. Get it from Google AI Studio. |
| `MODEL_ID` | No | `gemini-2.5-flash` | Gemini model ID used for orchestration and synthesis. |
| `ORCHESTRATOR_ENABLE_OFFLINE_FALLBACK` | No | `true` | Use a deterministic task plan when the LLM is unavailable. |
| `MCP_ENABLE_LOCAL_FALLBACK` | No | `true` | Use local scenario data when the MCP backend is unavailable. |
| `SRE_SAMPLE_SCENARIO` | No | `bad_deploy` | Scenario to load (`bad_deploy`, `dependency_latency`, `saturation_db_pool`). |
| `LOG_LEVEL` | No | `20` | Logging level (10=DEBUG, 20=INFO, 30=WARNING). |
| `LOG_LEVEL_DEPENDENCIES` | No | `30` | Logging level for third-party libraries. |
| `OPIK_API_KEY` | No | — | Opik API key for observability. |
| `OPIK_WORKSPACE` | No | — | Opik workspace name. |
| `OPIK_PROJECT_NAME` | No | `sre_guardian` | Opik project name. |
| `OPENAI_API_KEY` | No | — | OpenAI API key (reserved for future provider support). |
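The project reads these variables through a Pydantic-based settings module (`config/settings.py`). An equivalent minimal reading of the defaults above, using only the standard library for illustration:

```python
import os


def load_settings() -> dict:
    """Read the documented variables with their documented defaults."""
    return {
        "google_api_key": os.environ.get("GOOGLE_API_KEY"),  # required
        "model_id": os.environ.get("MODEL_ID", "gemini-2.5-flash"),
        "offline_fallback": os.environ.get(
            "ORCHESTRATOR_ENABLE_OFFLINE_FALLBACK", "true").lower() == "true",
        "local_fallback": os.environ.get(
            "MCP_ENABLE_LOCAL_FALLBACK", "true").lower() == "true",
        "scenario": os.environ.get("SRE_SAMPLE_SCENARIO", "bad_deploy"),
        "log_level": int(os.environ.get("LOG_LEVEL", "20")),
    }
```

Boolean flags are parsed case-insensitively, so `true`, `True`, and `TRUE` all enable the fallback behavior.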
## Running the Server

Stdio transport:

```bash
uv --directory mcp_server run -m src.server --transport stdio
```

Streamable HTTP transport:

```bash
uv --directory mcp_server run -m src.server --transport streamable-http --port 8001
```

In-memory server mode (MCP server runs in the same process as the client):

```bash
uv run python -m mcp_client.src.client
```

Stdio server mode (connects to an external MCP server):

```bash
uv run python -m mcp_client.src.client --transport stdio
```

## Connecting from an MCP Client

```json
{
  "mcp_servers": {
    "sre-first-responder": {
      "command": "uv",
      "args": [
        "--directory",
        "/absolute/path/to/sre-agent/mcp_server",
        "run",
        "-m",
        "src.server",
        "--transport",
        "stdio"
      ],
      "env": {
        "GOOGLE_API_KEY": "your-google-api-key",
        "OPIK_API_KEY": "your-opik-api-key",
        "OPIK_PROJECT_NAME": "sre_guardian"
      }
    }
  }
}
```

A working example is included in `claude_desktop_config.json`.
## Example Queries

```text
Prod checkout is spiking 500s since 14:05 UTC. Start triage for checkout-api.
Start triage for checkout-api and check if a recent deploy caused regressions.
Investigate checkout latency and timeout errors in production.
```
## Testing

Compile check:

```bash
python -m compileall mcp_server/src mcp_client/src
```

Workflow smoke test:

```bash
python - <<'PY'
import asyncio
from mcp_server.src.workflows.main_workflow import process_user_query

async def main():
    result = await process_user_query(
        user_query='Prod checkout is spiking 500s since 14:05 UTC. Start triage for checkout-api.',
        render_output=False,
    )
    print(result['status'], result['task_count'], result['result_count'])

asyncio.run(main())
PY
```

## Troubleshooting

**`ModuleNotFoundError: No module named 'mcp_server'`** — Use module-based invocation:

```bash
uv --directory mcp_server run -m src.server --transport stdio
```

**Gemini quota errors (`429 RESOURCE_EXHAUSTED`)** — This is Gemini rate limiting. Wait for the quota to reset, or enable offline fallback by setting `ORCHESTRATOR_ENABLE_OFFLINE_FALLBACK=true` in your `.env`.

**Opik logs to the Default Project** — Set `OPIK_PROJECT_NAME` in your `.env` file and ensure Opik is configured in the process creating spans.
## Mandatory Features Implemented

- **MCP server in Python using FastMCP**
  - Implemented in `mcp_server/src/server.py`. The server is created with FastMCP and supports both `stdio` and `streamable-http` transports. All tools, resources, and prompts are registered via the routers layer in `mcp_server/src/routers/`.
- **At least one MCP tool with meaningful functionality**
  - Multiple MCP tools are implemented and registered in `mcp_server/src/routers/tools.py`, including `logs_query`, `metrics_query`, `deploys_list`, `runbooks_search`, `runbooks_get`, `patch_generate`, `incident_start_context`, `incident_get_context`, the approval workflow tools (`request_patch_approval`, `set_patch_approval`, `apply_approved_patch`, `get_patch_approval_status`), and the main `process_user_query_workflow` orchestrator entry point.
- **At least one MCP prompt with user feedback workflow**
  - The `sre_triage_execution_prompt` is implemented in `mcp_server/src/prompts/sre_instructions_prompt.py` and registered in `mcp_server/src/routers/prompts.py`. This prompt drives a full agentic triage workflow and includes explicit user-feedback steps: the human approval gate requires the engineer to review the generated patch proposal and explicitly approve or reject it before any changes are applied (via `request_patch_approval` -> `set_patch_approval` -> `apply_approved_patch`).
- **Published on a public GitHub repository**
  - The complete project is hosted on GitHub.
- **Initialized using `uv` for dependency management**
  - The project uses `uv` with `pyproject.toml`, `.python-version` (3.14.0), and `uv.lock`. Dependencies are installed via `uv sync`.
- **Organized source structure**
  - Code is organized under `mcp_server/src/` and `mcp_client/src/` with clear separation: `server.py` (entry point), `routers/` (MCP registration), `tools/` (tool implementations), `app/` (business logic), `entities/` (domain models), `infrastructure/` (adapters), `config/settings.py` (Pydantic-based configuration), `prompts/` (prompt templates), `resources/` (resource implementations), and `utils/` (utilities).
- **Comprehensive README.md**
  - This README includes: project description, use cases, architecture diagram, setup instructions, environment-variable documentation, MCP client configuration JSON, mandatory and custom feature sections, and a troubleshooting guide.
- **No API keys or sensitive credentials committed**
  - A `.env.sample` file with placeholder values is provided. The `.env` file is gitignored. The README explains which keys are needed and how to obtain them.
## Custom Features Implemented

- **MCP client with custom orchestration/planning** (Client & Integration)
  - `mcp_client/src/client.py` implements a full interactive MCP client with a custom agent loop (`handle_agent_loop_utils.py`), tool-call orchestration, a thinking-mode toggle, transport-mode selection (in-memory or stdio), and conversation-history management. This demonstrates programmatic usage of the MCP server beyond Cursor or Claude Desktop.
- **MCP resources for contextual information** (Client & Integration)
  - Two MCP resources are implemented in `mcp_server/src/resources/`: `system://status` provides real-time system health data (CPU, uptime) and `system://memory` provides memory usage statistics. These are registered in `mcp_server/src/routers/resources.py` and give the agent contextual awareness of the environment it is running in.
- **Multiple workflow patterns (sequential, parallel, conditional)** (Workflow & Agentic Patterns)
  - The phase runner (`mcp_server/src/app/phase_runner.py`) implements three distinct patterns: sequential phase ordering (phase 1 completes before phase 2 begins), parallel execution within each phase (all same-phase tasks run concurrently via `asyncio.gather()`), and conditional branching through dependency gates (`depends_on` fields that skip downstream tasks when upstream dependencies fail).
- **Plan-and-Execute agentic pattern** (Workflow & Agentic Patterns)
  - The system implements a full Plan-and-Execute pattern: the orchestrator (`mcp_server/src/app/orchestrator.py`) uses Gemini structured output to produce an `IncidentTriagePlan` with ranked hypotheses, phased tasks, and dependency ordering. The phase runner then executes the plan, and the synthesizer (`mcp_server/src/app/synthesizer.py`) composes the final incident report. This three-stage pipeline (plan -> execute -> synthesize) is wired together in the composition root (`mcp_server/src/workflows/main_workflow.py`).
- **Human-in-the-loop validation** (Human-in-the-Loop & UX)
  - The system generates patch proposals (e.g. rollback diffs, runbook updates) as structured artifacts and explicitly waits for human approval before applying them. The approval workflow uses `request_patch_approval` to create a pending proposal, `set_patch_approval` for the human to accept or reject it, and `apply_approved_patch` to apply only after explicit approval. This goes beyond simple prompt feedback by implementing a full "AI Generation -> Diff Preview -> Human Review -> Gated Application" pattern.
- **Multi-agent orchestration** (Multi-Agent Systems)
  - The system implements a manager-worker pattern: the orchestrator acts as a manager agent that decomposes the problem and assigns tasks, and seven specialized worker agents (`LogsQueryWorker`, `MetricsQueryWorker`, `DeploysListWorker`, `IncidentContextWorker`, `RunbooksSearchWorker`, `RunbookGetWorker`, `PatchGenerateWorker`) investigate different aspects of the incident. The worker registry (`mcp_server/src/app/worker_registry.py`) maps tool names to worker instances, and the synthesizer aggregates outputs from all workers into a unified report.
- **Observability with Opik** (Observability & Evaluation)
  - Both the MCP server (`mcp_server/src/utils/opik_utils.py`) and the MCP client (`mcp_client/src/utils/opik_handler.py`) integrate with Opik for LLM call tracing, tool-invocation tracking, and agent behavior analysis. All MCP tools are decorated with `@opik.track()` for automatic span creation.
- **Custom domain-specific data** (Custom Data)
  - Three incident scenario datasets are provided in `data/scenarios/`: `bad_deploy`, `dependency_latency`, and `saturation_db_pool`. Each scenario includes structured logs (`logs.jsonl`), time-series metrics (`metrics.json`), deployment records (`deploys.json`), and runbooks (`runbooks.md`). These are consumed by `mcp_server/src/mcp_tools/sre_guardian.py` to provide realistic, domain-specific incident data for investigation.
- **Structured outputs with Pydantic models** (Structured Outputs)
  - The orchestrator uses Gemini's structured output mode with Pydantic response schemas to ensure consistent, validated responses. Key models include `IncidentTriagePlan`, `Task`, `Hypothesis`, `Severity`, `WorkerResult`, and `WorkerStatus` (all defined in `mcp_server/src/entities/`). This guarantees that LLM outputs conform to expected shapes and can be reliably consumed by downstream components.
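A simplified sketch of how such response schemas look in Pydantic follows; the field names here are illustrative, not the project's actual entity definitions.

```python
from pydantic import BaseModel, Field


class Hypothesis(BaseModel):
    summary: str
    rank: int = Field(ge=1)  # 1 = most likely


class Task(BaseModel):
    task_id: str
    tool: str
    phase: int
    depends_on: list[str] = []


class IncidentTriagePlan(BaseModel):
    hypotheses: list[Hypothesis]
    tasks: list[Task]


# Validating a raw LLM response against the schema rejects malformed output.
plan = IncidentTriagePlan.model_validate({
    "hypotheses": [{"summary": "bad deployment", "rank": 1}],
    "tasks": [{"task_id": "t1", "tool": "logs_query", "phase": 1}],
})
```

Passing such a schema as the response format means a malformed LLM reply raises a validation error instead of silently corrupting downstream stages.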
## Contributing

Open an issue or PR with a clear problem statement, reproduction steps, and proposed changes.
## License

This project is licensed under the MIT License. See `LICENSE`.