## Create Agent Evaluation Dataset 


In this tutorial, we show how to create an agent evaluation dataset with LangGraph.

In [None]:
### Reuse the LangGraph agent for ETL error resolution from 04-sagemaker-mlflow-agents-introduction

import boto3
import os

## Import Data 
from data.data import log_data_set, log_data
from data.solution_book import solution_book

## Import MLflow Libs
import sagemaker_mlflow 
import mlflow

## Check AWS Credentials 
try:
    # Creating a client alone does not validate credentials; this STS call does
    boto3.client('sts').get_caller_identity()
except Exception as e:
    print(f"Error configuring AWS credentials: {e}")
    print("Please set your AWS credentials before proceeding.")


## Set up MLflow 
tracking_server_arn = "ENTER YOUR MLFLOW TRACKING SERVER ARN HERE" 
experiment_name = "agent-mlflow-demo"
mlflow.set_tracking_uri(tracking_server_arn) 
mlflow.set_experiment(experiment_name)

## Set up LangChain Autolog 
mlflow.langchain.autolog()

## Import LangGraph Packages
from langgraph.prebuilt import create_react_agent
from langchain.chat_models import init_chat_model
from langchain_core.tools import tool

## Define LangGraph Tools 
@tool 
def log_identifier(ticket_id: str) -> str:
    """Get error type from ticket number

    Args:
        ticket_id: ticket id

    Returns:
        an error type

    """
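    if ticket_id not in log_data_set:
        return "ticket id not found in the database"

    for item in log_data:
        if item["id"] == ticket_id:
            return item['error_name']

    # Fall back explicitly instead of implicitly returning None
    return "ticket id not found in the database"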

@tool(return_direct=True)
def information_retriever(error_type: str) -> str:
    """Retrieve error solution based on error type

    Args:
        error_type: user input error type
    
    Returns:
        a str of steps 
    """

    if error_type not in solution_book.keys():
        return "error type not found in the knowledge base, please use your own knowledge"
    
    return solution_book[error_type]

## Init LLM 
llm = init_chat_model(
    model="us.anthropic.claude-3-5-haiku-20241022-v1:0",
    model_provider="bedrock_converse",
)

## Build System Prompt 
system_prompt = """
You are an expert at resolving ETL errors. You are equipped with two tools: 
1. log_identifier: Get error type from ticket number
2. information_retriever: Retrieve error solution based on error type

You will use the ticket ID to gather information about the error using the log_identifier tool. 
Then you should search the database for information on how to resolve the error using the information_retriever tool

Return ONLY the numbered steps without any introduction or conclusion. Format as:
1. step 1 text
2. step 2 text
...
"""

## Create ReAct Agent 
agent = create_react_agent(
    model=llm,
    tools=[log_identifier, information_retriever], 
    prompt=system_prompt
)


def get_langGraph_agent_response(user_prompt):
    """Invoke the agent with a single user message and return the final message's content."""
    # Prepare input for the agent
    agent_input = {"messages": [{"role": "user", "content": user_prompt}]}
    response = agent.invoke(agent_input)
    return response['messages'][-1].content
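The agent is expected to chain its two tools: ticket ID to error type with `log_identifier`, then error type to solution steps with `information_retriever`. In LangGraph's prebuilt ReAct agent, a tool declared with `return_direct=True` should end the run once it is called, so on the success path the last message is likely the retriever's raw output rather than an LLM-rewritten answer. The minimal sketch below reproduces that two-step lookup deterministically, without calling Bedrock, which can help when reasoning about what `actual_output` ought to contain. The sample records and the names `lookup_error`, `lookup_solution`, and `resolve_ticket` are hypothetical; the real data comes from `data/data.py` and `data/solution_book.py`.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical sample data; the real log_data and solution_book are imported
# from data/data.py and data/solution_book.py and will contain other values.
SAMPLE_LOG_DATA = [
    {"id": "TICKET-001", "error_name": "SchemaMismatch"},
    {"id": "TICKET-002", "error_name": "DiskQuotaExceeded"},
]
SAMPLE_SOLUTION_BOOK = {
    "SchemaMismatch": "1. Compare source and target schemas\n2. Update the column mapping",
}

# Same fallback messages the notebook's tools return
TICKET_NOT_FOUND = "ticket id not found in the database"
ERROR_NOT_FOUND = "error type not found in the knowledge base, please use your own knowledge"


def lookup_error(ticket_id: str, log_data: Sequence[Mapping[str, str]]) -> Optional[str]:
    """Step 1, mirroring log_identifier: map a ticket ID to an error type."""
    for item in log_data:
        if item["id"] == ticket_id:
            return item["error_name"]
    return None


def lookup_solution(error_type: str, solution_book: Mapping[str, str]) -> str:
    """Step 2, mirroring information_retriever: map an error type to solution steps."""
    return solution_book.get(error_type, ERROR_NOT_FOUND)


def resolve_ticket(
    ticket_id: str,
    log_data: Sequence[Mapping[str, str]],
    solution_book: Mapping[str, str],
) -> str:
    """Run both lookups in the order the system prompt asks the agent to use."""
    error_type = lookup_error(ticket_id, log_data)
    if error_type is None:
        # Stop early, as the agent should when the ticket is unknown
        return TICKET_NOT_FOUND
    return lookup_solution(error_type, solution_book)
```

For example, `resolve_ticket("TICKET-001", SAMPLE_LOG_DATA, SAMPLE_SOLUTION_BOOK)` returns the two numbered steps stored for `SchemaMismatch`, while an unknown ticket short-circuits with the "not found" message.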

Now let's introduce the test cases. We have created four test cases: test_case_1, test_case_2, test_case_3, and test_case_4. Below, we display test_case_1 as an example.

Each test case is a dictionary with the following fields:
- ticket_id: the ticket ID the agent is asked about
- user_prompt: the prompt sent to the agent (this is the agent's input)
- error_name: the expected error name for the ticket ID
- solution: the expected solution steps
- expected_tools: the tools the agent is expected to call, stored as a Python list in call order (order matters)
- expected_arguments: the expected arguments for the tools above
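A quick structural check can catch malformed test cases before any agent calls are made. The sketch below uses a hypothetical test case; the real ones live in `data/test_cases_mlflow.py`. It assumes `expected_arguments` is a list aligned one-to-one with `expected_tools` (one argument dict per tool call), which the field descriptions suggest but do not state outright. `example_test_case` and `check_test_case` are illustrative names, not part of the notebook.

```python
from typing import Any, Dict, List

# Hypothetical test case; the real values in data/test_cases_mlflow.py differ.
example_test_case: Dict[str, Any] = {
    "ticket_id": "TICKET-001",
    "user_prompt": "How do I resolve ticket TICKET-001?",
    "error_name": "SchemaMismatch",
    "solution": ["Compare source and target schemas", "Update the column mapping"],
    "expected_tools": ["log_identifier", "information_retriever"],
    # Assumption: one argument dict per expected tool call, in the same order
    "expected_arguments": [{"ticket_id": "TICKET-001"}, {"error_type": "SchemaMismatch"}],
}

REQUIRED_KEYS = {
    "ticket_id", "user_prompt", "error_name",
    "solution", "expected_tools", "expected_arguments",
}


def check_test_case(case: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a test case (empty if none)."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(case))]
    if problems:
        return problems
    if not isinstance(case["expected_tools"], list):
        problems.append("expected_tools must be a list (order matters)")
    elif len(case["expected_tools"]) != len(case["expected_arguments"]):
        problems.append("expected_tools and expected_arguments differ in length")
    return problems
```

Running `check_test_case` over all four test cases and asserting that every result is empty is a cheap guard to place before the evaluation loop.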

In [None]:
from data.test_cases_mlflow import test_case_1, test_case_2, test_case_3, test_case_4

In [None]:
test_case_1

We invoke the agent on each test case and store the results in a DataFrame, which will be used for evaluation. The DataFrame contains:
1. user_input: the user prompt containing the ticket ID
2. actual_output: the agent-generated output
3. expected_output: the ground-truth solution steps
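The system prompt asks for numbered steps ("1. step 1 text"), while `expected_output` is built by joining `test_case['solution']` with newlines, so whether the expected side carries step numbers depends on how the solution entries are written. A hypothetical helper like `normalize_steps` below strips leading "1." or "2)" style numbering and blank lines, so both columns can be compared step by step. It is a sketch of one reasonable normalization, not part of the original notebook.

```python
import re
from typing import List

# A leading step number followed by "." or ")" and at least one space, so that
# text such as "1.5 GB must be free" is left untouched.
_STEP_PREFIX = re.compile(r"^\s*\d+[.)]\s+")


def normalize_steps(text: str) -> List[str]:
    """Split multi-line step text into step bodies without numbering or blank lines."""
    steps = []
    for line in text.splitlines():
        body = _STEP_PREFIX.sub("", line).strip()
        if body:
            steps.append(body)
    return steps
```

With this, `normalize_steps("1. Check schema\n2. Rerun job")` and `normalize_steps("Check schema\n\nRerun job")` both yield `["Check schema", "Rerun job"]`.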

### Create Evaluation Dataset 

In [None]:
import pandas as pd

test_cases = [test_case_1, test_case_2, test_case_3, test_case_4]

result = [] 

for test_case in test_cases:

    user_input = test_case['user_prompt']
    # Ground truth: join the expected steps into one newline-separated string
    error_solution = "\n".join(test_case['solution'])
    agent_response = get_langGraph_agent_response(user_input)

    result.append({
        'user_input': user_input, 
        'actual_output': agent_response,
        'expected_output': error_solution
    })

eval_df = pd.DataFrame(result)

In [None]:
eval_df
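Agent calls are slow and non-deterministic, so it can be useful to persist the dataset once and reuse it across evaluation runs. The sketch below serializes records as JSON Lines (one JSON object per line) using only the standard library; `to_jsonl` and `from_jsonl` are hypothetical helper names, and this is an optional step rather than part of the original workflow.

```python
import json
from typing import Dict, List


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    # json.dumps escapes embedded newlines, so each record stays on one line
    return "".join(json.dumps(record) + "\n" for record in records)


def from_jsonl(text: str) -> List[Dict[str, str]]:
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

To save the notebook's results, something like `Path("eval_dataset.jsonl").write_text(to_jsonl(eval_df.to_dict(orient="records")), encoding="utf-8")` would work (the file name is illustrative). pandas can also write the same format directly with `eval_df.to_json(path, orient="records", lines=True)`.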