<a href="https://colab.research.google.com/github/dsarkar123/cicd-pipeline-gradle/blob/master/AgenticAI_Observability_v0_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create DevOps Agent

In [None]:
!pip install "autogen-agentchat[groq]~=0.2"

In [None]:
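## Import required libraries
# autogen-agentchat[groq], installed above, should already provide the `autogen`
# package and the Groq client, so pyautogen and groq are not installed again here.
import os
import autogen
from autogen import ConversableAgent, UserProxyAgent
from dotenv import load_dotenv

load_dotenv(override=True)  # load environment variables from a .env file, if present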

In [None]:
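# Read the Groq API key from the environment (for example, from the .env file
# loaded above) instead of hardcoding it in the notebook.
# The variable name GROQ_API_KEY is an assumption; use whatever name your .env defines.
config_list = [{
    "model": "llama-3.3-70b-versatile",
    "api_key": os.environ["GROQ_API_KEY"],
    "api_type": "groq"
}]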
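
Keeping the API key out of the notebook source avoids leaking it through version control. The following is a minimal sketch of building the same configuration with a readable error when the key is missing. The helper name `build_config_list` and the environment variable name `GROQ_API_KEY` are assumptions for illustration and are not part of the original notebook.

```python
import os


def build_config_list(env=None, model="llama-3.3-70b-versatile"):
    """Return an AutoGen-style config list, reading the API key from the environment."""
    # `env` can be any mapping; it defaults to the process environment.
    env = os.environ if env is None else env
    api_key = env.get("GROQ_API_KEY", "").strip()  # assumed variable name
    if not api_key:
        raise RuntimeError(
            "GROQ_API_KEY is not set; add it to the environment or to a .env file."
        )
    return [{"model": model, "api_key": api_key, "api_type": "groq"}]
```

With the key exported, `config_list = build_config_list()` could replace the literal list.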

## DevOps Engineer Proxy Agent

In [None]:
devops_engineer_proxy_agent = UserProxyAgent(
    name="DevOps Engineer",
    system_message="You are a DevOps engineer. Your role is to observe various logs and identify errors.",
    human_input_mode="NEVER",
    code_execution_config=False,
    llm_config={"config_list": config_list}
)


## ResolverAgent

In [None]:
resolver_agent = ConversableAgent(
    name="Issue Resolver",
    system_message='''You are a DevOps engineer and problem solver. Your role is to:
    1. Observe various logs and identify errors
    2. Resolve the identified issues
    3. Provide detailed explanations and steps to resolve the issues
    Ensure thorough resolution and clarity for each identified problem.''',
    llm_config={"config_list": config_list},
    human_input_mode="NEVER"
)

## Add tasks for the agents

In [None]:
chats = [
    {
        "sender": devops_engineer_proxy_agent,
        "recipient": resolver_agent,
        "message":
            "Hello, I have identified a 500 database error while observing the logs. "
            "Could you investigate and resolve the issue? "
            "Here are the details: "
            "Timestamp: 2025-02-23 08:00:00, "
            "Error Message: 'Internal Server Error', "
            "Affected Service: 'Database Service'.",
        "summary_method": "reflection_with_llm",
        "summary_args": {
            "summary_prompt": "Summarize the error investigation and resolution "
                              "into a JSON object: "
                              "{'step1': {'action': '', 'result': ''}, "
                              "'step2': {...}, 'step3': {...}}",
        },
        "max_turns": 2,
        "clear_history": True,
    },
    {
        "sender": devops_engineer_proxy_agent,
        "recipient": devops_engineer_proxy_agent,
        "message": "Based on the response from the problem solver agent, summarize the investigation and resolution details clearly. If the issue is resolved, confirm the resolution steps. If further action is needed, specify the key areas to address.",
        "max_turns": 1,
        "summary_method": "reflection_with_llm",
        "summary_args": {
            "summary_prompt": "Summarize the investigation and resolution as a JSON object: {'status': 'resolved'/'unresolved', 'resolution_steps': [], 'key_areas_to_address': []}",
        },
    },
]
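
A malformed entry in the chat queue only surfaces once the conversation starts, so a quick check beforehand can save a round trip to the model. The sketch below is a hypothetical helper, `find_chat_problems`, that inspects a list shaped like `chats`. The set of required keys is an assumption based on how this notebook uses them, not a rule taken from the AutoGen documentation.

```python
REQUIRED_KEYS = ("sender", "recipient", "message")  # assumed from this notebook's usage


def find_chat_problems(chats):
    """Return a list of human-readable problems found in a chat queue."""
    problems = []
    for index, chat in enumerate(chats):
        for key in REQUIRED_KEYS:
            if key not in chat:
                problems.append(f"chat {index}: missing '{key}'")
        for key in ("sender", "recipient"):
            # Agents must be objects; a quoted variable name is an easy slip.
            if isinstance(chat.get(key), str):
                problems.append(
                    f"chat {index}: '{key}' is a string, expected an agent object"
                )
        max_turns = chat.get("max_turns")
        if max_turns is not None and (not isinstance(max_turns, int) or max_turns < 1):
            problems.append(f"chat {index}: 'max_turns' must be a positive integer")
    return problems
```

An empty list from `find_chat_problems(chats)` means none of these checks failed; it does not guarantee that the conversation itself will succeed.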

## Start discussion

In [None]:
from autogen import initiate_chats

chat_results = initiate_chats(chats)


********************************************************************************
Starting a new chat....

********************************************************************************
DevOps Engineer (to Issue Resolver):

Hello, I have identified a 500 database error while observing the logs. Could you investigate and resolve the issue? Here are the details: Timestamp: 2025-02-23 08:00:00, Error Message: 'Internal Server Error', Affected Service: 'Database Service'.

--------------------------------------------------------------------------------
Issue Resolver (to DevOps Engineer):

# Step-by-step analysis of the problem:
1. **Understanding the error**: A 500 Internal Server Error is a generic error message that indicates a problem occurred on the server, but it does not provide specific details about the issue.
2. **Database error identification**: Given that the affected service is the 'Database Service', it's likely that the error is related to a database query or connection is



DevOps Engineer (to Issue Resolver):

# Step-by-step analysis of the problem:
1. **Understanding the error**: The code I provided earlier attempts to establish a database connection and execute a query. However, the error handling mechanism only catches `mysql.connector.Error` exceptions, which might not cover all possible error scenarios.
2. **Error handling limitations**: The current implementation logs the error and returns an error message. However, this approach might not provide sufficient information for debugging purposes.
3. **Additional error handling mechanisms**: To improve the error handling, we can add more specific exception handling for other potential errors, such as connection timeouts, query syntax errors, or database server errors.
4. **Database connection validation**: Before executing a query, we should validate the database connection to ensure it is active and ready for use.

# Fixed solution:
```python
import logging
import mysql.connector
from mysql.connector 
```

In [None]:
for chat_result in chat_results:
    print(chat_result.summary)
    print("\n")

{'content': 'Now let\'s summarize the error investigation and resolution into a JSON object:\n\n```json\n{\n  "step1": {\n    "action": "Identify the error",\n    "result": "500 Internal Server Error"\n  },\n  "step2": {\n    "action": "Determine the affected service",\n    "result": "Database Service"\n  },\n  "step3": {\n    "action": "Implement a retry mechanism with error handling",\n    "result": "Added retries with a specified number of attempts and a delay between attempts"\n  },\n  "step4": {\n    "action": "Improve error logging and messaging",\n    "result": "Enhanced error logging to include the attempt number and the total number of retries"\n  },\n  "step5": {\n    "action": "Validate the database connection",\n    "result": "Validated the database connection using the is_connected() method"\n  },\n  "step6": {\n    "action": "Test the updated solution",\n    "result": "Monitored the logs and application output to see if the changes improve the error handling and provide m
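
The printed summary wraps the requested JSON object in explanatory text and a fenced `json` block, so it cannot be passed to `json.loads` directly. The sketch below shows one way to pull the object out. The helper name `extract_summary_json` is hypothetical, and the handling of both dict and string summaries is an assumption based on the output shown above.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def extract_summary_json(summary):
    """Parse the first JSON object found in a chat summary, or return None."""
    # The summary printed above is a dict with a 'content' key; plain strings work too.
    text = summary.get("content", "") if isinstance(summary, dict) else str(summary)
    # Prefer a fenced block; otherwise fall back to the outermost pair of braces.
    fenced = re.search(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text[text.find("{"): text.rfind("}") + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # truncated or malformed output
```

A call such as `extract_summary_json(chat_result.summary)` returns a dict when the model followed the prompt and `None` when the output is cut off or is not valid JSON.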

In [None]:
for chat_result in chat_results:
    print(chat_result.cost)
    print("\n")

{'usage_including_cached_inference': {'total_cost': 0.0, 'llama-3.3-70b-versatile': {'cost': 0.0, 'prompt_tokens': 5010, 'completion_tokens': 2497, 'total_tokens': 7507}}, 'usage_excluding_cached_inference': {'total_cost': 0.0, 'llama-3.3-70b-versatile': {'cost': 0.0, 'prompt_tokens': 5010, 'completion_tokens': 2497, 'total_tokens': 7507}}}


{'usage_including_cached_inference': {'total_cost': 0.0, 'llama-3.3-70b-versatile': {'cost': 0.0, 'prompt_tokens': 5378, 'completion_tokens': 2432, 'total_tokens': 7810}}, 'usage_excluding_cached_inference': {'total_cost': 0.0, 'llama-3.3-70b-versatile': {'cost': 0.0, 'prompt_tokens': 5378, 'completion_tokens': 2432, 'total_tokens': 7810}}}
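
Each cost dict reports tokens per model, so totals across the whole run have to be summed by hand. The sketch below is a hypothetical helper, `total_token_usage`, that assumes every cost dict has the shape printed above.

```python
def total_token_usage(costs, section="usage_including_cached_inference"):
    """Sum prompt, completion, and total tokens across chat cost dicts."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    for cost in costs:
        for usage in cost.get(section, {}).values():
            if not isinstance(usage, dict):
                continue  # skip the scalar 'total_cost' entry
            for key in totals:
                totals[key] += usage.get(key, 0)
    return totals
```

For the two results printed above, `total_token_usage(r.cost for r in chat_results)` would give 10388 prompt tokens, 4929 completion tokens, and 15317 tokens in total.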




In [None]:
log_analyzer_agent = ConversableAgent(
    name="Log Analyzer",
    system_message='''You are a log analysis expert. Your role is to:
    1. Monitor and parse logs from various systems
    2. Identify anomalies and potential issues
    3. Generate reports with insights and recommendations based on log data
    Ensure accurate and actionable insights from log analysis.''',
    llm_config={"config_list": config_list},
    human_input_mode="NEVER"
)
metrics_monitor_agent = ConversableAgent(
    name="Metrics Monitor",
    system_message='''You are a metrics monitoring specialist. Your role is to:
    1. Track and analyze system performance metrics
    2. Identify trends and performance bottlenecks
    3. Provide recommendations for optimizing system performance
    Ensure continuous monitoring and improvement of system metrics.''',
    llm_config={"config_list": config_list},
    human_input_mode="NEVER"
)
trace_investigator_agent = ConversableAgent(
    name="Trace Investigator",
    system_message='''You are a trace analysis expert. Your role is to:
    1. Collect and analyze distributed traces
    2. Identify root causes of performance issues and errors
    3. Recommend solutions to improve system reliability and performance
    Ensure thorough investigation and resolution of trace-based issues.''',
    llm_config={"config_list": config_list},
    human_input_mode="NEVER"
)
health_check_agent = ConversableAgent(
    name="Health Check",
    system_message='''You are a system health check specialist. Your role is to:
    1. Perform regular health checks on systems and services
    2. Detect and report any deviations from expected performance
    3. Provide actionable steps to maintain system health
    Ensure proactive monitoring and maintenance of system health.''',
    llm_config={"config_list": config_list},
    human_input_mode="NEVER"
)
alert_manager_agent = ConversableAgent(
    name="Alert Manager",
    system_message='''You are an alert management expert. Your role is to:
    1. Configure and manage alerting mechanisms
    2. Prioritize and escalate alerts based on severity
    3. Ensure timely response and resolution of critical alerts
    Ensure efficient and effective alert management.''',
    llm_config={"config_list": config_list},
    human_input_mode="NEVER"
)
user_proxy_agent = ConversableAgent(
    name="User Proxy",
    system_message='''You are the interface between the user and the SRE team. Your role is to:
    1. Collect input from users regarding issues, queries, or requests
    2. Interpret user inputs and direct them to the appropriate SRE agents
    3. Relay responses and updates from the SRE team back to the users
    Ensure clear communication and effective resolution of user inputs.''',
    llm_config={"config_list": config_list},
    human_input_mode="ALWAYS"
)
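
The agents above share one prompt layout: a role sentence, three numbered duties, and a closing goal. A small helper could generate that layout from data instead of repeating it by hand. This is a sketch only; `build_system_message` is a hypothetical name and is not part of AutoGen.

```python
def build_system_message(role, duties, goal):
    """Render a role prompt in the numbered layout used by the agents above."""
    lines = [f"You are {role}. Your role is to:"]
    # Number the duties from 1 and indent them like the hand-written prompts.
    lines += [f"    {number}. {duty}" for number, duty in enumerate(duties, start=1)]
    lines.append(f"    {goal}")
    return "\n".join(lines)
```

The result could be passed as `system_message=` when constructing a `ConversableAgent`, which keeps the wording of each role in one place.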


In [None]:
# Collect the tasks in a list and reference the agent objects, not their names as
# strings, so the list can be passed to initiate_chats.
# The variable name `observability_chats` is illustrative.
observability_chats = [
    {
        "sender": user_proxy_agent,
        "recipient": trace_investigator_agent,
        "message":
            "Hello, I have noticed performance degradation in the system. "
            "Could you analyze the distributed traces and identify the root cause? "
            "Here are the details: "
            "Timestamp: 2025-03-02 14:30:00, "
            "Affected Services: 'User Authentication Service', 'Payment Processing Service', 'Order Management Service', "
            "Tracing Details: "
            "'Request ID: abc123', 'Span ID: span456', 'Trace ID: trace789', "
            "'Latency: 500ms', 'Error Rate: 5%', 'Critical Path: User -> AuthService -> PaymentService -> OrderService'.",
        "summary_method": "reflection_with_llm",
        "summary_args": {
            "summary_prompt": "Summarize the trace analysis and findings into a JSON object: "
                              "{'step1': {'action': '', 'result': ''}, "
                              "'step2': {...}, 'step3': {...}}",
        },
        "max_turns": 1,
        "clear_history": True,
    },
    {
        "sender": user_proxy_agent,
        "recipient": metrics_monitor_agent,
        "message":
            "Hi, I have observed unusual spikes in various key metrics. "
            "Could you analyze the metrics and provide insights? "
            "Here are the details: "
            "Timestamp: 2025-03-02 14:30:00, "
            "Metrics: "
            "'CPU Usage', 'Memory Consumption', 'Disk I/O', 'Network Throughput', "
            "'HTTP Response Time', 'Database Query Latency', "
            "Affected Services: 'Web Frontend', 'API Gateway', 'Database Service'.",
        "summary_method": "reflection_with_llm",
        "summary_args": {
            "summary_prompt": "Summarize the metrics analysis and insights into a JSON object: "
                              "{'step1': {'action': '', 'result': ''}, "
                              "'step2': {...}, 'step3': {...}}",
        },
        "max_turns": 1,
        "clear_history": True,
    },
]
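
The task messages follow a fixed pattern: an opening request, the phrase "Here are the details:", and a comma-separated run of labelled fields. A minimal sketch of assembling such a message from structured data is shown below; `format_incident_message` is a hypothetical helper introduced only for illustration.

```python
def format_incident_message(intro, details):
    """Join an opening request and labelled details into one chat message."""
    # `details` maps a label (for example "Timestamp") to its already-formatted value.
    rendered = ", ".join(f"{label}: {value}" for label, value in details.items())
    return f"{intro} Here are the details: {rendered}."
```

Building messages this way keeps the field labels consistent across tasks and makes it easier to fill them from real alert data later.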


In [None]:
summary_reviewer_agent = ConversableAgent(
    name="Summary Reviewer",
    system_message='''You are a summary reviewer specialist. Your role is to:
    1. Review the outputs and insights provided by other agents
    2. Summarize these insights or solutions into a clear and concise format
    3. Ensure the summaries are easy to understand and actionable
    Provide a final, polished summary of the insights or solutions.''',
    llm_config={"config_list": config_list},
    human_input_mode="NEVER"
)
