## Agent Offline Evaluation with MLflow

Offline evaluation assesses the performance of AI agents against historical ground-truth data without the need for live deployment. By testing agent behavior in a controlled environment, developers can mitigate risks, avoid costly mistakes, and catch regressions before they affect production.

MLflow provides tools for experiment tracking that let developers log metrics, parameters, and artifacts and compare different agent versions. This tutorial shows how to run an offline evaluation with MLflow, using the same example as the previous sections.


## Reuse the LangGraph Agents for ETL error resolution

In [None]:
import boto3
import os

from data.data import log_data_set, log_data
from data.solution_book import knowledge_base

import sagemaker_mlflow
import mlflow

print(sagemaker_mlflow.__version__)
print(mlflow.__version__)

tracking_server_arn = "" 
experiment_name = "agent-mlflow-demo"
mlflow.set_tracking_uri(tracking_server_arn) 
mlflow.set_experiment(experiment_name)

mlflow.langchain.autolog()

0.1.0
3.0.0


### LangGraph Agent implementation

In [2]:
from langgraph.prebuilt import create_react_agent
from langchain.chat_models import init_chat_model
from langchain_core.tools import tool

@tool 
def log_identifier(ticket_id: str) -> str:
    """Get error type from ticket number

    Args:
        ticket_id: ticket id

    Returns:
        an error type

    """
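    if ticket_id not in log_data_set:
        return "ticket id not found in the database"
    
    for item in log_data:
        if item["id"] == ticket_id:
            return item['error_name']

    # Fallback so the tool never returns None if the two data sources disagree
    return "ticket id not found in the database"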

@tool(return_direct=True)
def information_retriever(error_type: str) -> str:
    """Retrieve the error solution based on error type

    Args:
        error_type: user input error type
    
    Returns:
        a str of steps 
    """

    if error_type not in knowledge_base.keys():
        return "error type not found in the knowledge base, please use your own knowledge"
    
    return knowledge_base[error_type]

llm = init_chat_model(
    model= "us.anthropic.claude-3-5-haiku-20241022-v1:0",
    model_provider="bedrock_converse",
)

system_prompt = """
You are an expert at resolving ETL errors. You are equipped with two tools: 
1. log_identifier: Get error type from ticket number
2. information_retriever: Retrieve error solution based on error type

You will use the ticket ID to gather information about the error using the log_identifier tool. 
Then you should search the database for information on how to resolve the error using the information_retriever tool.

Return ONLY the numbered steps without any introduction or conclusion. Format as:
1. step 1 text
2. step 2 text
...
"""

agent = create_react_agent(
    model=llm,
    tools= [log_identifier, information_retriever], 
    prompt=system_prompt
)

def get_langGraph_agent_response(ticket_id = 'TICKET-001'):
    # Prepare input for the agent
    agent_input = {"messages": [{"role": "user", "content": ticket_id}]}
    response = agent.invoke(agent_input)
    return response 


#  Here we add one more function to retrieve all the tool calling information (tool names and their corresponding arguments)
from langchain_core.messages import HumanMessage, AIMessage

def extract_tool_calls(response):
    '''
    extract tool calling information (tool names and their corresponding arguments)
    '''
    tool_calls_info = []
    
    for message in response['messages']:
        if isinstance(message, AIMessage):
            if isinstance(message.content, list):
                for content_item in message.content:
                    # Content lists may also hold plain strings, so check the type first
                    if isinstance(content_item, dict) and content_item.get('type') == 'tool_use':
                        tool_calls_info.append({
                            'name': content_item['name'],
                            'arguments': content_item['input'],
                        })
    
    return tool_calls_info 
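
The function above relies on the content-block layout returned for this model (a list containing `tool_use` items). As a hedged alternative, LangChain also exposes parsed tool calls on `AIMessage.tool_calls`, a list of dicts with `name`, `args`, and `id` keys. The sketch below reads that attribute instead. The name `extract_tool_calls_from_attr` is hypothetical, and `SimpleNamespace` objects stand in for real messages so that the snippet runs without calling a model.

```python
from types import SimpleNamespace

def extract_tool_calls_from_attr(messages):
    """Collect tool names and arguments from each message's `tool_calls` list."""
    tool_calls_info = []
    for message in messages:
        # Messages without tool calls (human, tool, final answer) contribute nothing
        for call in getattr(message, "tool_calls", None) or []:
            tool_calls_info.append({"name": call["name"], "arguments": call["args"]})
    return tool_calls_info

# Stand-in messages (assumption: same shape as AIMessage.tool_calls entries)
demo_messages = [
    SimpleNamespace(content="TICKET-001"),
    SimpleNamespace(tool_calls=[
        {"name": "log_identifier", "args": {"ticket_id": "TICKET-001"}, "id": "call_1"}
    ]),
    SimpleNamespace(tool_calls=[]),
]
demo_calls = extract_tool_calls_from_attr(demo_messages)
print(demo_calls)
```

With a real agent response, the same function would be called on `response['messages']`.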

Now let's introduce the test cases. We have created four of them: `test_case_1`, `test_case_2`, `test_case_3`, and `test_case_4`. Below we show `test_case_1` as an example.

Each test case is a dictionary with the following fields: 
- ticket_id: the ticket ID (this is the input to the agent)
- error_name: the expected error name for the ticket ID 
- solution: the expected solution steps 
- expected_tools: the tools the agent is expected to call, stored in a Python list; order matters
- expected_arguments: the expected arguments for the tools above
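
A quick structural check before running the agent can catch malformed test cases early. The following is a minimal sketch, assuming tools and arguments are paired by position as in the example test case; the helper name `validate_test_case` is hypothetical, and the sample dictionary is an abridged copy of the field layout described above.

```python
REQUIRED_FIELDS = ("ticket_id", "error_name", "solution", "expected_tools", "expected_arguments")

def validate_test_case(test_case):
    """Return a list of problems found in a test case (an empty list means it looks well-formed)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in test_case]
    if problems:
        return problems
    # Assumption: tools and arguments are paired by position, so the lengths must match
    if len(test_case["expected_tools"]) != len(test_case["expected_arguments"]):
        problems.append("expected_tools and expected_arguments differ in length")
    return problems

# Abridged sample following the documented field layout
sample_case = {
    "ticket_id": "TICKET-001",
    "error_name": "Connection Timeout",
    "solution": ["1. Checked network connectivity between client and server"],
    "expected_tools": ["log_identifier", "information_retriever"],
    "expected_arguments": [{"ticket_id": "TICKET-001"}, {"error_type": "Connection Timeout"}],
}
print(validate_test_case(sample_case))
```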

In [3]:
from data.test_cases_mlflow import test_case_1, test_case_2, test_case_3, test_case_4

In [4]:
test_case_1

{'ticket_id': 'TICKET-001',
 'error_name': 'Connection Timeout',
 'solution': ['1. Checked network connectivity between client and server',
  '2. Verified server status and accessibility',
  '3. Increased connection timeout settings',
  '4. Checked firewall rules',
  '5. Monitored network latency'],
 'expected_tools': ['log_identifier', 'information_retriever'],
 'expected_arguments': [{'ticket_id': 'TICKET-001'},
  {'error_type': 'Connection Timeout'}]}

We invoke the agent on each test case and store the results in a DataFrame, which is then used for evaluation. The DataFrame contains: 
1. user_input: the input ticket number 
2. actual_output: the agent-generated output 
3. expected_output: the ground-truth output 
4. tool_uses: the actual tool calls, including both tool names and tool arguments
5. expected_tools: the ground-truth tools to call 
6. expected_arguments: the ground-truth tool arguments

### Create Evaluation Dataset 

In [5]:
import pandas as pd

test_cases = [test_case_1, test_case_2, test_case_3, test_case_4]

result = [] 

for test_case in test_cases:

    user_input = test_case['ticket_id']
    error_name = test_case['error_name']
    error_solution = "\n".join(test_case['solution'])
    expected_tools = test_case['expected_tools']
    expected_arguments = test_case['expected_arguments']

    agent_response = get_langGraph_agent_response(user_input)
    actual_output = agent_response['messages'][-1].content
    tool_uses = extract_tool_calls(agent_response)

    result.append({
        'user_input': user_input, 
        'actual_output': actual_output,
        'expected_output': error_solution,
        'tool_uses': tool_uses, 
        'expected_tools': expected_tools,
        'expected_arguments': expected_arguments
    })

eval_df = pd.DataFrame(result)
eval_df

| | user_input | actual_output | expected_output | tool_uses | expected_tools | expected_arguments |
|---|---|---|---|---|---|---|
| 0 | TICKET-001 | 1. Check network connectivity between client a... | 1. Checked network connectivity between client... | [{'name': 'log_identifier', 'arguments': {'tic... | [log_identifier, information_retriever] | [{'ticket_id': 'TICKET-001'}, {'error_type': '... |
| 1 | TICKET-002 | 1. Verify database credentials are correct\n2.... | 1. Verified database credentials\n2. Checked u... | [{'name': 'log_identifier', 'arguments': {'tic... | [log_identifier, information_retriever] | [{'ticket_id': 'TICKET-002'}, {'error_type': '... |
| 2 | TICKET-006 | 1. Remove temporary and unnecessary files\n2. ... | 1. Removed temporary and unnecessary files\n2.... | [{'name': 'log_identifier', 'arguments': {'tic... | [log_identifier, information_retriever] | [{'ticket_id': 'TICKET-006'}, {'error_type': '... |
| 3 | TICKET-008 | 1. Review user access rights\n2. Check file an... | 1. Reviewed user access rights\n2. Checked fil... | [{'name': 'log_identifier', 'arguments': {'tic... | [log_identifier, information_retriever, notify... | [{'ticket_id': 'TICKET-008'}, {'error_type': '... |



Next, we combine **user_input**, **actual_output**, and **tool_uses** into a single `predictions` column, and **expected_output**, **expected_tools**, and **expected_arguments** into a single `targets` column. The reason is that when you create an evaluation dataset for MLflow, you specify one `predictions` column and one `targets` column, and the custom metric functions in this notebook receive only those two. Each column therefore has to carry every field its metric needs. This is covered in more detail in the later sections.

In [7]:
eval_df['predictions'] = eval_df.apply(lambda row: {
    'user_input': row['user_input'],
    'actual_output': row['actual_output'],
    'tool_uses': row['tool_uses']
}, axis=1)

eval_df['targets'] = eval_df.apply(lambda row: {
    'expected_output': row['expected_output'],
    'expected_tools': row['expected_tools'],
    'expected_arguments': row['expected_arguments'],
}, axis=1)

eval_df

| | user_input | actual_output | expected_output | tool_uses | expected_tools | expected_arguments | predictions | targets |
|---|---|---|---|---|---|---|---|---|
| 0 | TICKET-001 | 1. Check network connectivity between client a... | 1. Checked network connectivity between client... | [{'name': 'log_identifier', 'arguments': {'tic... | [log_identifier, information_retriever] | [{'ticket_id': 'TICKET-001'}, {'error_type': '... | {'user_input': 'TICKET-001', 'actual_output': ... | {'expected_output': '1. Checked network connec... |
| 1 | TICKET-002 | 1. Verify database credentials are correct\n2.... | 1. Verified database credentials\n2. Checked u... | [{'name': 'log_identifier', 'arguments': {'tic... | [log_identifier, information_retriever] | [{'ticket_id': 'TICKET-002'}, {'error_type': '... | {'user_input': 'TICKET-002', 'actual_output': ... | {'expected_output': '1. Verified database cred... |
| 2 | TICKET-006 | 1. Remove temporary and unnecessary files\n2. ... | 1. Removed temporary and unnecessary files\n2.... | [{'name': 'log_identifier', 'arguments': {'tic... | [log_identifier, information_retriever] | [{'ticket_id': 'TICKET-006'}, {'error_type': '... | {'user_input': 'TICKET-006', 'actual_output': ... | {'expected_output': '1. Removed temporary and ... |
| 3 | TICKET-008 | 1. Review user access rights\n2. Check file an... | 1. Reviewed user access rights\n2. Checked fil... | [{'name': 'log_identifier', 'arguments': {'tic... | [log_identifier, information_retriever, notify... | [{'ticket_id': 'TICKET-008'}, {'error_type': '... | {'user_input': 'TICKET-008', 'actual_output': ... | {'expected_output': '1. Reviewed user access r... |


In [9]:
eval_df['predictions']
#eval_df['targets']

0    {'user_input': 'TICKET-001', 'actual_output': ...
1    {'user_input': 'TICKET-002', 'actual_output': ...
2    {'user_input': 'TICKET-006', 'actual_output': ...
3    {'user_input': 'TICKET-008', 'actual_output': ...
Name: predictions, dtype: object

### Run Evaluation

### Create custom evaluation metrics

We use DeepEval as our evaluation framework because it provides useful LLM-as-a-judge metrics as well as heuristic metrics. Let's first define the LLM-as-a-judge model.

In [None]:
from deepeval.models.base_model import DeepEvalBaseLLM

class AWSBedrock(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Bedrock Model"


aws_bedrock = AWSBedrock(model=llm)

Next, let's import what we need from DeepEval. We will use two metrics, `GEval` and `ToolCorrectnessMetric`: 
- `GEval` uses an LLM as a judge; you specify the criteria and the evaluation steps. We define one for correctness, which checks whether the facts in `actual output` contradict any facts in `expected output`.
- `ToolCorrectnessMetric` checks whether your agent uses the correct tools.
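
To build intuition for what a tool-correctness score measures, the sketch below computes a simple heuristic: the fraction of expected tools that the agent actually called. This is an illustration only, not DeepEval's implementation, and the function name `simple_tool_correctness` is hypothetical.

```python
def simple_tool_correctness(tools_called, expected_tools):
    """Fraction of expected tool names that appear among the called tool names."""
    if not expected_tools:
        # Nothing was expected: treat calling no tools as fully correct
        return 1.0 if not tools_called else 0.0
    called = set(tools_called)
    matched = sum(1 for name in expected_tools if name in called)
    return matched / len(expected_tools)

full_match = simple_tool_correctness(
    ["log_identifier", "information_retriever"],
    ["log_identifier", "information_retriever"],
)
partial_match = simple_tool_correctness(
    ["log_identifier"],
    ["log_identifier", "information_retriever"],
)
print(full_match, partial_match)  # 1.0 0.5
```

Note that this heuristic ignores call order and arguments; a stricter variant could compare the two lists position by position.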

In [None]:
from mlflow.metrics import MetricValue 
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
from deepeval.test_case import LLMTestCase
from deepeval import evaluate
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import ToolCall

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    model = aws_bedrock
)

## Tool Correctness 
tool_correctness_metric = ToolCorrectnessMetric()

In [None]:
def evaluate_tool_correctness(predictions, targets):
    scores = [] 

    for prediction, target in zip(predictions, targets):
        test_case = LLMTestCase(
            input= prediction['user_input'],
            actual_output=prediction['actual_output'],
            expected_output= target['expected_output'],
            tools_called = [ToolCall(name=i['name']) for i in prediction['tool_uses']], 
            expected_tools = [ToolCall(name=i) for i in target['expected_tools']]
        )

        result = evaluate(test_cases=[test_case], metrics=[tool_correctness_metric])

        scores.append(result.test_results[0].metrics_data[0].score)

    return MetricValue(scores= scores)

def evaluate_correctness(predictions, targets):
    scores = [] 

    for prediction, target in zip(predictions, targets):
        test_case = LLMTestCase(
            input= prediction['user_input'],
            actual_output=prediction['actual_output'],
            expected_output= target['expected_output'],
            tools_called = [ToolCall(name=i['name']) for i in prediction['tool_uses']], 
            expected_tools = [ToolCall(name=i) for i in target['expected_tools']]
        )

        result = evaluate(test_cases=[test_case], metrics=[correctness_metric])

        scores.append(result.test_results[0].metrics_data[0].score)

    return MetricValue(scores= scores)

Now let's integrate these metrics with MLflow. As discussed above, our metric functions receive only predictions and targets as arguments. The two functions above cover tool correctness and final-answer correctness: each takes predictions and targets and returns the scores for its metric. 

We use the `make_metric` function to turn our custom metrics into MLflow metrics. Note that we set `greater_is_better=True`, which indicates that a higher score is better. If you later add a metric such as incorrectness, where a lower score is better, set `greater_is_better=False`.
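
Each custom metric returns one score per row, and the metric direction determines which way counts as an improvement when two agent versions are compared. The sketch below illustrates that idea in plain Python. `summarize_scores` and `is_improvement` are hypothetical helpers, not MLflow functions, and the scores are made-up values for illustration.

```python
from statistics import mean

def summarize_scores(scores):
    """Aggregate per-row scores into a few summary statistics."""
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}

def is_improvement(candidate, baseline, greater_is_better=True):
    """Return True when the candidate value beats the baseline for this metric direction."""
    return candidate > baseline if greater_is_better else candidate < baseline

baseline_scores = [0.5, 0.75, 1.0, 0.75]   # illustrative values only
candidate_scores = [0.75, 1.0, 1.0, 0.75]  # illustrative values only

baseline_mean = summarize_scores(baseline_scores)["mean"]
candidate_mean = summarize_scores(candidate_scores)["mean"]
print(is_improvement(candidate_mean, baseline_mean, greater_is_better=True))  # True
```

For a lower-is-better metric, the same comparison would be made with `greater_is_better=False`.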

In [None]:
from mlflow.models import make_metric
answer_correctness = make_metric(
    eval_fn=evaluate_correctness, greater_is_better= True , name="answer_correctness"
)

tool_answer_correctness = make_metric(
    eval_fn=evaluate_tool_correctness, greater_is_better= True, name="tool_answer_correctness"
)

Now we can use `mlflow.evaluate()` to measure the agent's performance on the evaluation dataset we created. The evaluation computes our two custom metrics, `answer_correctness` and `tool_answer_correctness`. 

In [None]:
with mlflow.start_run() as evaluation_run:
    eval_dataset = mlflow.data.from_pandas(
        df=eval_df,
        name="eval_dataset_v1",
        targets="targets",
        predictions="predictions",
    )
    mlflow.log_input(dataset=eval_dataset)
    # Run the evaluation based on extra metrics
    # Current active model will be automatically used
    result = mlflow.evaluate(
        data=eval_dataset,
        extra_metrics=[
            answer_correctness, 
            tool_answer_correctness
        ]
    )

### Offline Evaluation Results

The results are returned as a table containing the evaluation dataset with additional columns for the scores.


In [None]:
print(f"See aggregated evaluation results below: \n{result.metrics}")
result.tables["eval_results_table"]
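
Once the per-row table is available, a common next step is to pull out the cases that scored poorly so they can be inspected by hand. The sketch below shows the idea on plain dictionaries. The column name `answer_correctness/score`, the threshold, and the sample rows are assumptions for illustration, so check the actual column names in your results table.

```python
def find_low_scores(rows, score_column, threshold=0.5):
    """Return rows whose score is missing or falls below the threshold."""
    flagged = []
    for row in rows:
        score = row.get(score_column)
        if score is None or score < threshold:
            flagged.append(row)
    return flagged

# Illustrative rows only; real rows would come from the results table,
# e.g. via DataFrame.to_dict("records")
sample_rows = [
    {"user_input": "TICKET-A", "answer_correctness/score": 0.9},
    {"user_input": "TICKET-B", "answer_correctness/score": 0.3},
    {"user_input": "TICKET-C", "answer_correctness/score": None},
]
flagged = find_low_scores(sample_rows, "answer_correctness/score")
print([row["user_input"] for row in flagged])
```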