# SREnity Agent Demo - Enterprise SRE Agent

## 🤖 How SREnity Works

SREnity is an **Enterprise SRE Agent** that uses advanced AI to help resolve production incidents by combining:

### 🧠 **Intelligent Reasoning**
- **LangGraph ReAct Pattern**: 2-node architecture for reasoning and tool execution
- **Context-Aware Decision Making**: Analyzes queries to determine the best approach
- **Multi-Step Problem Solving**: Can chain multiple tools for complex issues
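
The two-node ReAct pattern can be pictured as a loop between a reasoning node and a tool node. The sketch below uses plain Python rather than LangGraph so that it runs without API keys. `fake_model`, `TOOLS`, and `run_react` are hypothetical names for illustration only; the real graph is built inside `src.agents.sre_agent` and is not shown in this notebook.

```python
from typing import Callable, Dict, List, Optional, Tuple

Message = Tuple[str, str]  # (role, text)

# Hypothetical stand-ins for the real tools; they return canned text.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search_runbooks": lambda q: f"[runbook excerpt for: {q}]",
    "search_web": lambda q: f"[web result for: {q}]",
}


def fake_model(messages: List[Message]) -> Tuple[Optional[str], str]:
    """Stand-in for the LLM: request one tool call, then answer.

    Returns (tool_name, text); tool_name is None for a final answer.
    """
    already_used_tool = any(role == "tool" for role, _ in messages)
    if not already_used_tool:
        return "search_runbooks", messages[-1][1]
    return None, "Final answer based on: " + messages[-1][1]


def run_react(query: str, max_steps: int = 5) -> List[Message]:
    """Alternate between the agent node and the tool node until done."""
    messages: List[Message] = [("human", query)]
    for _ in range(max_steps):
        tool_name, text = fake_model(messages)  # agent node: reason
        if tool_name is None:
            messages.append(("ai", text))  # final answer, stop looping
            break
        messages.append(("ai", f"call {tool_name}"))
        messages.append(("tool", TOOLS[tool_name](text)))  # tool node: act
    return messages


history = run_react("How to monitor Redis memory usage?")
```

This run produces four messages (query, tool call, tool result, final answer), the same shape as the four-message traces in the demo output further down.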

### 🔍 **Advanced Retrieval System**
- **Ensemble Retriever**: Combines vector similarity + BM25 + Cohere Reranking
- **GitLab Runbooks**: Access to comprehensive SRE procedures and troubleshooting guides
- **Smart Chunking**: Optimized document processing for better retrieval
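
One common way to merge rankings from several retrievers is reciprocal rank fusion (RRF). The sketch below shows the idea in plain Python. It is an assumption about how such an ensemble could be combined, not a description of SREnity's own code, and the document IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs; a higher fused score ranks first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (descending), then by ID for a deterministic tie-break.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a vector retriever and a BM25 retriever.
vector_hits = ["redis-memory", "redis-eviction", "redis-latency"]
bm25_hits = ["redis-eviction", "redis-persistence", "redis-memory"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that appear near the top of both lists rise to the top of the fused list, while a document found by only one retriever still gets a chance to be passed on to the reranker.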

### 🛠️ **Dual-Tool Architecture**

#### 1. **Runbook Search** (`search_runbooks`)
- **Source**: GitLab SRE runbooks (Redis, PostgreSQL, Elastic, CI/CD, etc.)
- **Use Case**: Established procedures, troubleshooting steps, command references
- **Strength**: Reliable, tested procedures from production experience

#### 2. **Web Search** (`search_web`) 
- **Source**: Tavily search API for latest information
- **Use Case**: Recent CVEs, version updates, breaking changes, latest best practices
- **Strength**: Real-time information, current security updates

### 🔄 **Agent Workflow**

```
User Query → Agent Analysis → Tool Selection → Information Retrieval → Response Synthesis
```

1. **Query Analysis**: Determines whether the query is SRE-related and which tools to use
2. **Tool Selection**: Starts with runbooks, adds web search if needed
3. **Information Retrieval**: Uses ensemble retriever for comprehensive coverage
4. **Response Synthesis**: Combines information into actionable guidance
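
The "runbooks first, web search if needed" rule in step 2 can be approximated with a simple heuristic. In the real agent the LLM makes this choice; the keyword pattern below is a hypothetical stand-in that only illustrates the routing idea.

```python
import re
from typing import List

# Hypothetical signals that a query needs fresh information from the web:
# CVE identifiers, words such as "latest", or an explicit version number.
RECENCY_PATTERN = re.compile(
    r"\b(cve-\d{4}-\d+|latest|breaking change|v?\d+\.\d+(\.\d+)?)\b",
    re.IGNORECASE,
)


def choose_tools(query: str) -> List[str]:
    """Always start with runbooks; add web search for time-sensitive queries."""
    tools = ["search_runbooks"]
    if RECENCY_PATTERN.search(query):
        tools.append("search_web")
    return tools


print(choose_tools("How to monitor Redis memory usage?"))
print(choose_tools("Redis 7.2 memory leak issues and fixes"))
```

With this heuristic the first query is routed to runbooks only and the second, which names a version, to both tools.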

### 🎯 **Key Features**

- ✅ **Production-Ready**: Tested on real GitLab SRE runbooks
- ✅ **Context-Aware**: Understands SRE terminology and procedures  
- ✅ **Guardrails**: Refuses non-technical queries, focuses on SRE/DevOps
- ✅ **Comprehensive**: Combines established procedures with latest information
- ✅ **Actionable**: Provides step-by-step instructions and specific commands

### 📊 **Performance**
- **RAGAS Evaluation**: Superior performance across faithfulness, relevancy, and correctness
- **Ensemble Retrieval**: +131% improvement in context recall vs baseline
- **Enterprise Scale**: Handles complex production incident scenarios
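
A relative improvement such as the +131% figure is computed as (new - baseline) / baseline. The sketch below shows the arithmetic; the scores in it are placeholders chosen for illustration, not SREnity's measured RAGAS values.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Placeholder scores, not measured values.
example = relative_improvement(baseline=0.40, new=0.60)
print(f"{example:+.0f}%")
```

Read the other way, a +131% improvement means the new score is 2.31 times the baseline score.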


## Setup and Imports

This section sets up the environment and imports all necessary components for the SREnity agent.


## Package Installation

Required packages are defined in `pyproject.toml` and should already be installed.
If any are missing, run `pip install -e .` from the project root.
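
To check the environment before running the notebook, a small helper can report which modules are missing. This is a minimal sketch: `missing_packages` is a hypothetical helper, and the `REQUIRED` list simply mirrors the top-level modules imported in the next cell.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Top-level modules used by the import cell below.
REQUIRED = [
    "dotenv",
    "langchain_core",
    "langchain_openai",
    "langgraph",
    "langchain_community",
]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing), "-> run: pip install -e .")
else:
    print("All required modules found")
```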


In [None]:
# Core imports
import os
import sys
import logging
from pathlib import Path
from typing import TypedDict, Annotated, Sequence
import operator

# Add project root to Python path
sys.path.insert(0, "..")

# Set up minimal logging
logging.basicConfig(level=logging.WARNING)

# Load environment variables
from dotenv import load_dotenv
load_dotenv("../.env")

# LangChain imports
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

# LangGraph imports
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode

# Tavily search
from langchain_community.tools.tavily_search import TavilySearchResults

# Local imports
from src.utils.config import get_config, get_model_factory

print("✅ All imports successful")


✅ All imports successful


## Create SRE Agent

The `SREAgent` class handles the following setup automatically:
- Database initialization and vector store creation
- Ensemble retriever setup (Naive + BM25 + Reranker)
- Tools initialization with database components
- LangGraph ReAct pattern implementation
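
The retriever setup follows a two-stage "retrieve, then rerank" shape: a cheap first stage collects a wider candidate set, and a stronger scorer keeps only the best few (the cell output below reports BM25 k=12 and rerank k=4). The sketch below illustrates that shape with toy scoring functions. `term_overlap` and `phrase_bonus` are hypothetical stand-ins for BM25 and the Cohere reranker, not their real scoring logic.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def term_overlap(query: str, doc: str) -> float:
    """Cheap first-stage score: count of shared lowercase terms (BM25 stand-in)."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_bonus(query: str, doc: str) -> float:
    """Toy second-stage score: overlap plus a bonus for an exact phrase match."""
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return term_overlap(query, doc) + bonus


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    first_k: int = 12,
    final_k: int = 4,
    first_stage: Scorer = term_overlap,
    reranker: Scorer = phrase_bonus,
) -> List[str]:
    """Keep the top first_k candidates, then rerank them down to final_k."""
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:first_k]
    return sorted(candidates, key=lambda d: reranker(query, d), reverse=True)[:final_k]


docs = [
    "check memory on a redis node with INFO",
    "postgres vacuum tuning",
    "redis memory usage alerts",
    "redis eviction policy",
]
top = retrieve_then_rerank("redis memory", docs, first_k=3, final_k=2)
```

In this toy run the first stage drops the unrelated PostgreSQL document, and the second stage moves the document containing the exact phrase to the top.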


In [3]:
# Create SRE Agent (handles everything automatically)
from src.agents.sre_agent import SREAgent

print("🔄 Creating SRE agent...")
agent = SREAgent()
print("✅ SRE agent ready!")


🔄 Creating SRE agent...
🔄 Initializing SRE agent database components...
📚 Loaded 696 documents
🔍 Filtered to 33 Redis documents
🔄 Preprocessing documents...
HTML to Markdown conversion results:
  Original: 290,437 - 575,312 chars
  Markdown: 52,226 - 96,814 chars
  Reduction: 81.5%
🔄 Chunking documents...
📄 Created 685 chunks
🔄 Creating/loading vector store...
Vector database exists. Loading...
Loaded existing vector store from ../qdrant_db
🔄 Initializing tools with provided database...
Advanced retrieval module loaded with rerank-v3.5
Creating BM25 + Reranker retriever...
Creating BM25 retriever from 685 documents...
BM25 retriever created (k=12)
BM25 + Reranker retriever created (BM25 k=12, Rerank k=4)
✅ Tools initialized with database
✅ Database components initialized
✅ SRE agent ready!
✅ SRE agent ready!




## Test Function

Utility function to test the agent with queries and show the reasoning process.


In [6]:
def test_agent(query: str, verbose: bool = True):
    """
    Test the agent with a query and show the reasoning process
    """
    print(f"\n{'='*60}")
    print(f"QUERY: {query}")
    print(f"{'='*60}")
    
    if verbose:
        print("\n🤖 Agent reasoning process:")
        
    # Use the SREAgent
    result = agent.invoke(query, verbose=verbose)
    
    if verbose:
        print("\n📝 Final Response:")
        print("-" * 40)
    
    print(result)
    
    return result

print("✅ Test function ready")


✅ Test function ready


## Demo Scenarios

Test the agent with various types of SRE queries to demonstrate its capabilities.


In [7]:
# Test 1: Standard SRE query (should use runbooks only)
test_agent("How to monitor Redis memory usage?")



QUERY: How to monitor Redis memory usage?

🤖 Agent reasoning process:
Agent reasoning steps: 4 messages
Step 2: Called tools: ['search_runbooks']
Step 4: Final response

📝 Final Response:
----------------------------------------
**1. Direct Answer:**  
To monitor Redis memory usage effectively, you should regularly collect memory metrics using `redis-cli MEMORY_STATS`, monitor the configured maxmemory setting, and observe eviction and eviction-related metrics. Additionally, set up alerting for when Redis approaches its memory limit or when evicted keys increase, indicating saturation.

---

**2. Step-by-Step Instructions:**

**Step 1: Gather Memory Usage Data**  
- Run the `MEMORY_STATS` command via `redis-cli` to get detailed memory metrics.  
  ```bash
  redis-cli MEMORY_STATS
  ```  
- Run the `INFO` command to get overall server info, including memory usage and configuration.  
  ```bash
  redis-cli INFO
  ```

**Step 2: Monitor Max Memory Configuration**  
- Check the `maxmemory`

"**1. Direct Answer:**  \nTo monitor Redis memory usage effectively, you should regularly collect memory metrics using `redis-cli MEMORY_STATS`, monitor the configured maxmemory setting, and observe eviction and eviction-related metrics. Additionally, set up alerting for when Redis approaches its memory limit or when evicted keys increase, indicating saturation.\n\n---\n\n**2. Step-by-Step Instructions:**\n\n**Step 1: Gather Memory Usage Data**  \n- Run the `MEMORY_STATS` command via `redis-cli` to get detailed memory metrics.  \n  ```bash\n  redis-cli MEMORY_STATS\n  ```  \n- Run the `INFO` command to get overall server info, including memory usage and configuration.  \n  ```bash\n  redis-cli INFO\n  ```\n\n**Step 2: Monitor Max Memory Configuration**  \n- Check the `maxmemory` setting to understand the configured limit:  \n  ```bash\n  redis-cli CONFIG GET maxmemory\n  ```  \n- Ensure that your monitoring tools are aware of this limit for proportional memory usage alerts.\n\n**Step 3

In [8]:
# Test 2: Version-specific query (should use both tools)
test_agent("Redis 7.2 memory leak issues and fixes")



QUERY: Redis 7.2 memory leak issues and fixes

🤖 Agent reasoning process:




Agent reasoning steps: 5 messages
Step 2: Called tools: ['search_runbooks', 'search_web']
Step 5: Final response

📝 Final Response:
----------------------------------------
The search results indicate that Redis 7.2 has experienced memory leak issues, which can lead to increased memory consumption and potential instability. To address these issues, the recommended steps include upgrading Redis to a version where the leak is fixed, configuring memory and eviction policies properly, and monitoring key metrics.

### Summary of Fixes and Recommendations:
- **Upgrade Redis**: Ensure you are running Redis 7.2.0-v11 or later, as these versions include bug fixes related to memory leaks. Check the [Redis 7.2 release notes](https://redis.io/docs/latest/operate/oss_and_stack/stack-with-enterprise/release-notes/redisstack/redisstack-7.2-release-notes/) for specific fixes.
- **Configure Memory Limits**: Set `maxmemory` and choose an appropriate eviction policy such as `volatile-ttl` to prevent memo

'The search results indicate that Redis 7.2 has experienced memory leak issues, which can lead to increased memory consumption and potential instability. To address these issues, the recommended steps include upgrading Redis to a version where the leak is fixed, configuring memory and eviction policies properly, and monitoring key metrics.\n\n### Summary of Fixes and Recommendations:\n- **Upgrade Redis**: Ensure you are running Redis 7.2.0-v11 or later, as these versions include bug fixes related to memory leaks. Check the [Redis 7.2 release notes](https://redis.io/docs/latest/operate/oss_and_stack/stack-with-enterprise/release-notes/redisstack/redisstack-7.2-release-notes/) for specific fixes.\n- **Configure Memory Limits**: Set `maxmemory` and choose an appropriate eviction policy such as `volatile-ttl` to prevent memory overuse.\n- **Monitor Memory Usage**: Use Redis INFO commands and monitoring dashboards to track memory growth and eviction events.\n- **Analyze Memory Usage**: Use 

In [9]:
# Test 3: Off-topic query (should refuse)
test_agent("What's the weather like today?")



QUERY: What's the weather like today?

🤖 Agent reasoning process:
Agent reasoning steps: 2 messages
Step 2: Final response

📝 Final Response:
----------------------------------------
I'm specialized in SRE incident response and can only help with infrastructure troubleshooting, runbook procedures, and production issues. Please ask about system operations or technical problems.



In [10]:
# Test 4: Complex SRE query (should use both tools)
test_agent("PostgreSQL connection pool exhaustion in production - how to diagnose and fix?")



QUERY: PostgreSQL connection pool exhaustion in production - how to diagnose and fix?

🤖 Agent reasoning process:
Agent reasoning steps: 4 messages
Step 2: Called tools: ['search_runbooks']
Step 4: Final response

📝 Final Response:
----------------------------------------
### 1. Direct Answer
PostgreSQL connection pool exhaustion occurs when all available client connections are in use, preventing new clients from connecting. To diagnose and fix this, identify the cause of high connection usage, optimize connection management, and adjust configuration parameters as needed.

---

### 2. Step-by-Step Instructions

#### Step 1: Diagnose the Issue
1. **Check current connection count:**
   - Run:
     ```sql
     SELECT count(*) FROM pg_stat_activity;
     ```
   - Or use `psql`:
     ```bash
     psql -c "SELECT count(*) FROM pg_stat_activity;"
     ```
2. **Identify active and idle connections:**
   - Run:
     ```sql
     SELECT pid, usename, application_name, client_addr, state, query 


'### 1. Direct Answer\nPostgreSQL connection pool exhaustion occurs when all available client connections are in use, preventing new clients from connecting. To diagnose and fix this, identify the cause of high connection usage, optimize connection management, and adjust configuration parameters as needed.\n\n---\n\n### 2. Step-by-Step Instructions\n\n#### Step 1: Diagnose the Issue\n1. **Check current connection count:**\n   - Run:\n     ```sql\n     SELECT count(*) FROM pg_stat_activity;\n     ```\n   - Or use `psql`:\n     ```bash\n     psql -c "SELECT count(*) FROM pg_stat_activity;"\n     ```\n2. **Identify active and idle connections:**\n   - Run:\n     ```sql\n     SELECT pid, usename, application_name, client_addr, state, query \n     FROM pg_stat_activity \n     ORDER BY state;\n     ```\n3. **Check for long-running or idle transactions:**\n   - Run:\n     ```sql\n     SELECT pid, age(current_timestamp, xact_start) AS xact_age, query \n     FROM pg_stat_activity \n     WHERE x

## Interactive Demo

Try your own SRE queries by passing them to `test_agent`.
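
For a prompt-driven session, a small loop around the agent is one option. The sketch below is a hypothetical helper, not part of the SREnity codebase: `interactive_demo` takes any callable that answers a query, so `test_agent` could be passed in directly.

```python
from typing import Callable, List


def interactive_demo(
    ask: Callable[[str], str],
    read_line: Callable[[str], str] = input,
) -> List[str]:
    """Read queries until a blank line, 'quit', or 'exit'; return the answers."""
    answers: List[str] = []
    while True:
        query = read_line("SRE query (blank to exit): ").strip()
        if not query or query.lower() in {"quit", "exit"}:
            break
        answers.append(ask(query))
    return answers


# In the notebook this could be started with: interactive_demo(test_agent)
```

Passing `read_line` as a parameter keeps the loop testable, because a scripted sequence of queries can replace keyboard input.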


In [11]:
# Test 5: Command-specific query (should use runbooks)
test_agent("Show me the exact syntax for Redis MEMORY STATS command")



QUERY: Show me the exact syntax for Redis MEMORY STATS command

🤖 Agent reasoning process:
Agent reasoning steps: 2 messages
Step 2: Final response

📝 Final Response:
----------------------------------------
I'm specialized in SRE incident response and can only help with infrastructure troubleshooting, runbook procedures, and production issues. Please ask about system operations or technical problems.

