# Create Agent Offline Evaluation Dataset (Optional)
This notebook demonstrates how to structure and generate an agent evaluation dataset using LangGraph.
Such datasets let you systematically benchmark agent and tool performance before production deployment.

> Note: An example dataset created in this notebook is already available at `./data/agent_evaluation_dataset.json`. You can use it directly or customize it by following the instructions below.

In [None]:
# Install the following dependencies. Any warnings can be ignored.
!pip install --upgrade -r requirements.txt -q

## Reusing LangGraph Agent & Tool Setup
Import the existing LangGraph agent for ETL error resolution.

This setup mirrors the agent logic from the main workflow in the sagemaker-mlflow-agents-introduction notebook lab. Reusing the same agent keeps the evaluation dataset consistent with the interactions that agent actually produces.

In [6]:
# Reuse the LangGraph Agents for ETL error resolution in 04-sagemaker-mlflow-agents-introduction

import boto3
import os

## Import Sample Data 
from data.data import log_data_set, log_data
from data.solution_book import solution_book

import mlflow

## Check AWS Credentials 
try:
    # Creating a client alone does not validate credentials, so make a lightweight STS call
    boto3.client('sts').get_caller_identity()
except Exception as e:
    print(f"Error validating AWS credentials: {e}")
    print("Please set your AWS credentials before proceeding.")

## Set up MLflow 
# tracking_server_arn = "ENTER YOUR MLFLOW TRACKING SERVER ARN HERE" 
# experiment_name = "agent-mlflow-demo"
# mlflow.set_tracking_uri(tracking_server_arn) 
# mlflow.set_experiment(experiment_name)

# ## Set up LangChain Autolog 
# mlflow.langchain.autolog()

## Import LangGraph Packages
from langgraph.prebuilt import create_react_agent
from langchain.chat_models import init_chat_model
from langchain_core.tools import tool

## Define LangGraph Tools 
@tool 
def log_identifier(ticket_id: str) -> str:
    """Get error type from ticket number

    Args:
        ticket_id: ticket id

    Returns:
        an error type

    """
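    if ticket_id not in log_data_set:
        return "ticket id not found in the database"
    
    for item in log_data:
        if item["id"] == ticket_id:
            return item['error_name']

    # Fallback so the tool never returns None if log_data_set and log_data disagree
    return "ticket id not found in the database"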

@tool(return_direct=True)
def information_retriever(error_type: str) -> str:
    """Retrieve the error solution based on error type

    Args:
        error_type: user input error type
    
    Returns:
        a str of steps 
    """

    if error_type not in solution_book:
        return "error type not found in the knowledge base, please use your own knowledge"
    
    return solution_book[error_type]

## Init LLM 
llm = init_chat_model(
    model= "us.anthropic.claude-3-5-haiku-20241022-v1:0",
    model_provider="bedrock_converse",
)

## Build System Prompt 
system_prompt = """
You are an expert at resolving ETL errors. You are equipped with two tools: 
1. log_identifier: Get error type from ticket number
2. information_retriever: Retrieve the error solution based on error type

You will use the ticket ID to gather information about the error using the log_identifier tool. 
Then you should search the database for information on how to resolve the error using the information_retriever tool.

Return ONLY the numbered steps without any introduction or conclusion. Format as:
1. step 1 text
2. step 2 text
...
"""

## Create ReAct Agent 
agent = create_react_agent(
    model=llm,
    tools= [log_identifier, information_retriever], 
    prompt=system_prompt
)

def get_langGraph_agent_response(user_prompt):
    # Prepare input for the agent
    agent_input = {"messages": [{"role": "user", "content": user_prompt}]}
    response = agent.invoke(agent_input)
    return response['messages'][-1].content

## Prepare and Structure Evaluation Test Cases
Define your evaluation test cases, capturing the expected agent input and output, tool usage, and stepwise solutions.

The predefined test cases guide the curation of the evaluation dataset. The first test case is shown below as an example.

Each test case is a dictionary with the following fields: 
- `user_prompt`: the user message containing the ticket ID (this is the input to the agent)
- `error_name`: the expected error name for the ticket ID
- `solution`: the expected solution steps
- `expected_tools`: the tools the agent is expected to call, stored as a Python list in call order
- `expected_arguments`: the expected arguments for those tools, in the same order
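
Before running the agent, it can help to check that every test case has the expected shape. The following is a minimal sketch, not part of the original notebook: `validate_test_case` and `REQUIRED_FIELDS` are hypothetical names, and the field names are assumed from the example test case printed in this section. The example dictionary is abbreviated to a single solution step.

```python
# Hypothetical validator: field names are taken from the example test case,
# but this helper itself is an illustration, not part of the notebook.
REQUIRED_FIELDS = {
    "user_prompt": str,
    "error_name": str,
    "solution": list,
    "expected_tools": list,
    "expected_arguments": list,
}


def validate_test_case(case):
    """Return a list of problems found in one test case (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in case:
            problems.append(f"missing field: {field}")
        elif not isinstance(case[field], expected_type):
            problems.append(f"{field} should be a {expected_type.__name__}")
    # expected_tools and expected_arguments are paired by position,
    # so they should have the same length.
    tools = case.get("expected_tools")
    args = case.get("expected_arguments")
    if isinstance(tools, list) and isinstance(args, list) and len(tools) != len(args):
        problems.append("expected_tools and expected_arguments differ in length")
    return problems


# Abbreviated copy of the first test case (only one solution step kept).
example = {
    "user_prompt": "Can you help me solve the ticket: TICKET-001?",
    "error_name": "Connection Timeout",
    "solution": ["1. Check network connectivity between client and server"],
    "expected_tools": ["log_identifier", "information_retriever"],
    "expected_arguments": [
        {"ticket_id": "TICKET-001"},
        {"error_type": "Connection Timeout"},
    ],
}
print(validate_test_case(example))
```

In the notebook, the same check could be applied to every entry of `TEST_CASES` before the generation loop starts.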

In [2]:
from data.test_cases_mlflow import TEST_CASES

# Display first test case
TEST_CASES[0]

{'user_prompt': 'Can you help me solve the ticket: TICKET-001?',
 'error_name': 'Connection Timeout',
 'solution': ['1. Check network connectivity between client and server',
  '2. Verify if the server is running and accessible',
  '3. Increase the connection timeout settings',
  '4. Check for firewall rules blocking the connection',
  '5. Monitor network latency and bandwidth'],
 'expected_tools': ['log_identifier', 'information_retriever'],
 'expected_arguments': [{'ticket_id': 'TICKET-001'},
  {'error_type': 'Connection Timeout'}]}
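
The `expected_tools` and `expected_arguments` fields can be compared against the tool calls the agent actually made. The sketch below is one possible way to do that and is not part of the original notebook: `extract_tool_calls` and `matches_expected` are hypothetical helpers. It assumes the messages follow the LangChain convention where AI messages carry a `tool_calls` list of dictionaries with `name` and `args` keys. To stay runnable offline, it uses `SimpleNamespace` stand-ins rather than real agent messages.

```python
from types import SimpleNamespace


def extract_tool_calls(messages):
    """Collect (tool_name, arguments) pairs in the order they were requested."""
    calls = []
    for message in messages:
        # Messages without a tool_calls attribute (or with an empty one)
        # contribute nothing.
        for call in getattr(message, "tool_calls", None) or []:
            calls.append((call["name"], call["args"]))
    return calls


def matches_expected(messages, test_case):
    """True if tool names and arguments match the test case, order included."""
    expected = list(
        zip(test_case["expected_tools"], test_case["expected_arguments"])
    )
    return extract_tool_calls(messages) == expected


# Stand-ins for agent messages; real ones would come from the agent's output.
fake_messages = [
    SimpleNamespace(content="Can you help me solve the ticket: TICKET-001?"),
    SimpleNamespace(
        content="",
        tool_calls=[{"name": "log_identifier", "args": {"ticket_id": "TICKET-001"}}],
    ),
    SimpleNamespace(
        content="",
        tool_calls=[
            {"name": "information_retriever", "args": {"error_type": "Connection Timeout"}}
        ],
    ),
]

test_case = {
    "expected_tools": ["log_identifier", "information_retriever"],
    "expected_arguments": [
        {"ticket_id": "TICKET-001"},
        {"error_type": "Connection Timeout"},
    ],
}
print(matches_expected(fake_messages, test_case))
```

With the real agent, the message list would be `agent.invoke(agent_input)["messages"]` rather than only the final message's content.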

## Generate Offline Evaluation Dataset
Use the agent to answer each test case.
Each test case is sent to the agent, and the results are stored in a DataFrame that is used later in evaluation. The DataFrame contains: 
1. inputs: the user input
2. actual_output: the agent-generated output 
3. expected_output: the ground-truth output 

In [3]:
from tqdm import tqdm
import pandas as pd

result = []

for test_case in tqdm(TEST_CASES, desc="Processing test cases"):
    user_input = test_case['user_prompt']
    error_solution = "\n".join(test_case['solution'])
    agent_response = get_langGraph_agent_response(user_input)

    result.append({
        'inputs': user_input,
        'actual_output': agent_response,
        'expected_output': error_solution
    })

eval_df = pd.DataFrame(result)

Processing test cases: 100%|██████████| 11/11 [07:26<00:00, 40.58s/it]
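
The run above took about seven and a half minutes for 11 test cases, so a single failed agent call late in the loop would discard a lot of work. The sketch below shows one way to make the loop tolerant of failures. It is an assumption-laden illustration, not the notebook's code: `run_test_cases` and `fake_respond` are hypothetical, the `error` field is an extra column the original DataFrame does not have, and `TICKET-999` is a made-up ticket used only to trigger the failure path.

```python
def run_test_cases(test_cases, respond):
    """Call `respond` for each test case; a failed call is recorded, not raised."""
    records = []
    for case in test_cases:
        try:
            actual, error = respond(case["user_prompt"]), None
        except Exception as exc:
            actual, error = None, str(exc)
        records.append({
            "inputs": case["user_prompt"],
            "actual_output": actual,
            "expected_output": "\n".join(case["solution"]),
            "error": error,  # extra column, not in the notebook's DataFrame
        })
    return records


# Stub standing in for get_langGraph_agent_response so the sketch runs offline.
def fake_respond(prompt):
    if "TICKET-999" in prompt:  # hypothetical ticket that simulates a failure
        raise RuntimeError("simulated throttling")
    return "1. Check network connectivity between client and server"


cases = [
    {"user_prompt": "Can you help me solve the ticket: TICKET-001?",
     "solution": ["1. Check network connectivity between client and server"]},
    {"user_prompt": "Can you help me solve the ticket: TICKET-999?",
     "solution": ["1. Placeholder step"]},
]
records = run_test_cases(cases, fake_respond)
print([r["error"] for r in records])
```

In the notebook, `respond` would be `get_langGraph_agent_response`, and rows with a non-empty `error` could be retried or dropped before saving.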


In [4]:
# store eval dataset as json
eval_df.to_json("data/agent_evaluation_dataset.json", orient="records")
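
With `orient="records"`, the file is a JSON array containing one object per DataFrame row, so it can be inspected with the standard library alone. The following sketch checks that each record has the three expected keys. `load_eval_records` is a hypothetical helper, and the demo writes a throwaway sample file rather than reading the real dataset.

```python
import json
import os
import tempfile

EXPECTED_KEYS = {"inputs", "actual_output", "expected_output"}


def load_eval_records(path):
    """Load the dataset and verify every record has the three expected keys."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for i, record in enumerate(records):
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    return records


# Throwaway sample with the same layout as the saved dataset.
sample = [{
    "inputs": "Can you help me solve the ticket: TICKET-001?",
    "actual_output": "1. Check network connectivity between client and server",
    "expected_output": "1. Check network connectivity between client and server",
}]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_dataset.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    loaded = load_eval_records(sample_path)

print(len(loaded))
```

For the real file, the call would be `load_eval_records("data/agent_evaluation_dataset.json")`.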

## Results
After the notebook runs, the dataset is saved to `data/agent_evaluation_dataset.json`, and you are ready for the evaluation workshop lab.
1. Check the `data/` folder for the offline evaluation dataset `agent_evaluation_dataset.json`.
2. Consume the offline evaluation dataset `agent_evaluation_dataset.json` in your evaluation step.