## Agent Offline Evaluation with MLflow

Offline evaluation assesses the performance of an AI agent against historical ground-truth data without deploying it live. By testing agent behavior in a controlled environment, you can catch regressions early, avoid costly mistakes, and reduce the risk of negative impact in production.

MLflow provides experiment tracking tools that let developers log metrics, parameters, and artifacts and compare different agent versions. This tutorial shows how to run offline evaluation with MLflow, using the same example as the previous sections.


## Reuse the LangGraph Agents for ETL error resolution

In [None]:
import boto3
import os

from data.data import log_data_set, log_data
from data.solution_book import knowledge_base

import sagemaker_mlflow
import mlflow

print(sagemaker_mlflow.__version__)
print(mlflow.__version__)

tracking_server_arn = ""  # Set this to the ARN of your MLflow tracking server before running
experiment_name = "agent-mlflow-demo"
mlflow.set_tracking_uri(tracking_server_arn) 
mlflow.set_experiment(experiment_name)

mlflow.langchain.autolog()

0.1.0
3.0.0
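
The `tracking_server_arn` in the cell above is left empty and has to be filled in before anything is logged. Below is a minimal sketch of a fail-fast check. It assumes the ARN is supplied through an environment variable named `MLFLOW_TRACKING_ARN` (a hypothetical name chosen for illustration) and that tracking server ARNs follow the `arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<name>` pattern; adjust both to your setup.

```python
import os
import re

# Assumed shape of a SageMaker managed MLflow tracking server ARN.
_ARN_PATTERN = re.compile(
    r"^arn:aws[\w-]*:sagemaker:[a-z0-9-]+:\d{12}:mlflow-tracking-server/.+$"
)


def resolve_tracking_arn(env_var: str = "MLFLOW_TRACKING_ARN") -> str:
    """Return the tracking server ARN from the environment, or raise early."""
    arn = os.environ.get(env_var, "").strip()
    if not _ARN_PATTERN.match(arn):
        raise ValueError(
            f"{env_var} is not set to a valid tracking server ARN: {arn!r}"
        )
    return arn
```

The returned value could then be passed to `mlflow.set_tracking_uri(...)` in place of the hard-coded empty string.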


### LangGraph Agent implementation

In [2]:
from langgraph.prebuilt import create_react_agent
from langchain.chat_models import init_chat_model
from langchain_core.tools import tool

@tool 
def log_identifier(ticket_id: str) -> str:
    """Get error type from ticket number

    Args:
        ticket_id: ticket id

    Returns:
        an error type

    """
    if ticket_id not in log_data_set:
        return "ticket id not found in the database"
    
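    for item in log_data:
        if item["id"] == ticket_id:
            return item["error_name"]

    # Fallback so the tool always returns a string, even if the two data sources disagree
    return "ticket id not found in the database"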

@tool(return_direct=True)
def information_retriever(error_type: str) -> str:
    """Retrieve the error solution based on error type

    Args:
        error_type: user input error type

    Returns:
        a str of steps
    """

    if error_type not in knowledge_base:
        return "error type not found in the knowledge base, please use your own knowledge"

    return knowledge_base[error_type]

llm = init_chat_model(
    model= "us.anthropic.claude-3-5-haiku-20241022-v1:0",
    model_provider="bedrock_converse",
)

system_prompt = """
You are an expert at resolving ETL errors. You are equipped with two tools:
1. log_identifier: Get error type from ticket number
2. information_retriever: Retrieve the error solution based on error type

You will use the ticket ID to gather information about the error using the log_identifier tool.
Then you should search the database for information on how to resolve the error using the information_retriever tool.

Return ONLY the numbered steps without any introduction or conclusion. Format as:
1. step 1 text
2. step 2 text
...
"""

agent = create_react_agent(
    model=llm,
    tools= [log_identifier, information_retriever], 
    prompt=system_prompt
)

def get_langGraph_agent_response(ticket_id = 'TICKET-001'):
    # Prepare input for the agent
    agent_input = {"messages": [{"role": "user", "content": ticket_id}]}
    response = agent.invoke(agent_input)
    return response 

Now let's introduce the test cases. We have created four of them: `test_case_1`, `test_case_2`, `test_case_3`, and `test_case_4`. The first one, `test_case_1`, is displayed below as an example.

Each test case is a dictionary with the following fields:
- ticket_id: the ticket ID (this is the input to the agent)
- error_name: the expected error name for the ticket ID
- solution: the expected solution steps
- expected_tools: the tools the agent is expected to call, stored in a Python list; order matters
- expected_arguments: the expected arguments for the tools above, in the same order

In [3]:
from data.test_cases_mlflow import test_case_1, test_case_2, test_case_3, test_case_4

In [4]:
test_case_1

{'ticket_id': 'TICKET-001',
 'error_name': 'Connection Timeout',
 'solution': ['1. Check network connectivity between client and server',
  '2. Verify if the server is running and accessible',
  '3. Increase the connection timeout settings',
  '4. Check for firewall rules blocking the connection',
  '5. Monitor network latency and bandwidth'],
 'expected_tools': ['log_identifier', 'information_retriever'],
 'expected_arguments': [{'ticket_id': 'TICKET-001'},
  {'error_type': 'Connection Timeout'}]}
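
The `expected_tools` and `expected_arguments` fields are not used by the metrics later in this notebook, but they can be checked against the agent's actual tool calls. The following is a minimal sketch, assuming the agent's AI messages expose a `tool_calls` list of dicts with `name` and `args` keys (as LangChain AI messages do). The helper names `extract_tool_calls` and `trajectory_matches` and the stand-in messages are hypothetical and only for illustration.

```python
from types import SimpleNamespace


def extract_tool_calls(messages):
    """Collect (name, args) for every tool call, in the order they were issued."""
    calls = []
    for message in messages:
        # Messages without tool calls (user or tool messages) contribute nothing.
        for call in getattr(message, "tool_calls", None) or []:
            calls.append((call["name"], call["args"]))
    return calls


def trajectory_matches(messages, test_case):
    """True when tool names and arguments match the expected sequence exactly."""
    expected = list(
        zip(test_case["expected_tools"], test_case["expected_arguments"])
    )
    return extract_tool_calls(messages) == expected


# Stand-in messages for illustration; a real run would pass response["messages"].
fake_messages = [
    SimpleNamespace(
        tool_calls=[{"name": "log_identifier", "args": {"ticket_id": "TICKET-001"}}]
    ),
    SimpleNamespace(content="Connection Timeout"),
    SimpleNamespace(
        tool_calls=[
            {"name": "information_retriever", "args": {"error_type": "Connection Timeout"}}
        ]
    ),
]
example_case = {
    "expected_tools": ["log_identifier", "information_retriever"],
    "expected_arguments": [
        {"ticket_id": "TICKET-001"},
        {"error_type": "Connection Timeout"},
    ],
}
print(trajectory_matches(fake_messages, example_case))
```

Comparing the full ordered sequence is deliberately strict: an extra or reordered tool call counts as a mismatch, which is consistent with the note that order matters.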

We invoke the agent on each test case and store the results in a DataFrame, which is then used for evaluation. The DataFrame contains:
1. user_input: the input ticket number
2. actual_output: the agent-generated output
3. expected_output: the ground-truth output

### Create Evaluation Dataset 

In [5]:
import pandas as pd

test_cases = [test_case_1, test_case_2, test_case_3, test_case_4]

result = [] 

for test_case in test_cases:

    user_input = test_case['ticket_id']
    error_solution = "\n".join(test_case['solution'])
    agent_response = get_langGraph_agent_response(user_input)
    actual_output = agent_response['messages'][-1].content

    result.append({
        'user_input': user_input, 
        'actual_output': actual_output,
        'expected_output': error_solution
    })

# Collect all rows into a DataFrame for evaluation
eval_df = pd.DataFrame(result)


In [6]:
eval_df

| | user_input | actual_output | expected_output |
|---|---|---|---|
| 0 | TICKET-001 | 1. Check network connectivity between client a... | 1. Check network connectivity between client a... |
| 1 | TICKET-002 | 1. Verify database credentials are correct\n2.... | 1. Verify database credentials are correct\n2.... |
| 2 | TICKET-006 | 1. Remove temporary and unnecessary files\n2. ... | 1. Remove temporary and unnecessary files\n2. ... |
| 3 | TICKET-008 | 1. Review user access rights\n2. Check file an... | 1. Review user access rights\n2. Check file an... |


### Create evaluation metrics for MLflow

While MLflow has some built-in [Heuristic](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#heuristic-based-metrics) and [LLM as a Judge metrics](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#llm-as-a-judge-metrics), they are limited and may not fit every situation. This section shows how to build your own custom metric.

1. Define your custom metric function with two arguments, `predictions` and `targets`. `predictions` is a list of results predicted by your agent and `targets` is a list of ground-truth values. You can define any custom criteria for judgement; here we use exact match.
2. Use the `make_metric` function to turn the custom function into an MLflow metric. Note that we set `greater_is_better=True`, which indicates that a higher score is better. For a metric where lower is better, such as incorrectness, set `greater_is_better=False`.

In [7]:
from mlflow.metrics import MetricValue 
from mlflow.models import make_metric

def custom_correctness(predictions, targets):
    '''
    Custom correctness based on exact match.

    Args:
        predictions: list of agent-predicted values
        targets: list of ground-truth values

    Returns:
        A MetricValue holding one score per row (1 for a match, 0 otherwise)
    '''
    scores = [] 

    for prediction, target in zip(predictions, targets):
        if prediction.strip() == target.strip():
            scores.append(1)
        else:
            scores.append(0)

    return MetricValue(scores= scores)

custom_correctness_metric = make_metric(
    eval_fn=custom_correctness, greater_is_better= True , name="custom_answer_correctness"
)
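
Exact match is strict: a difference in casing or spacing in a single step scores the whole answer as 0. As one possible alternative, the sketch below scores each row by the fraction of expected steps that appear at the same position after light normalization. The helper names `normalize_steps` and `step_overlap_scores` are hypothetical, and the normalization rules (strip the step number, lowercase, collapse whitespace) are assumptions you may want to tighten or relax.

```python
import re


def normalize_steps(text: str) -> list[str]:
    """Split numbered steps into lowercased, whitespace-collapsed strings."""
    steps = []
    for line in text.strip().splitlines():
        # Drop a leading "1." or "2)" style marker, then collapse whitespace.
        body = re.sub(r"^\s*\d+[.)]\s*", "", line)
        body = re.sub(r"\s+", " ", body).strip().lower()
        if body:
            steps.append(body)
    return steps


def step_overlap_scores(predictions, targets):
    """Fraction of expected steps matched at the same position, per row."""
    scores = []
    for prediction, target in zip(predictions, targets):
        expected = normalize_steps(target)
        actual = normalize_steps(prediction)
        hits = sum(1 for a, e in zip(actual, expected) if a == e)
        scores.append(hits / len(expected) if expected else 0.0)
    return scores
```

The returned list can be wrapped in `MetricValue(scores=...)` and registered with `make_metric` in the same way as `custom_correctness`.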

### Run offline evaluation

Now we can use `mlflow.evaluate()` to measure the agent's performance on the evaluation dataset we created, using the `custom_correctness_metric` defined above.



In [8]:
with mlflow.start_run() as evaluation_run:
    eval_dataset = mlflow.data.from_pandas(
        df=eval_df,
        name="eval_dataset",
        targets="expected_output",
        predictions="actual_output",
    )
    mlflow.log_input(dataset=eval_dataset)


    result = mlflow.evaluate(
        data=eval_dataset,
        extra_metrics=[
            custom_correctness_metric
        ]
    )   

2025/09/25 05:00:08 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


🏃 View run marvelous-cat-105 at: https://us-east-1.experiments.sagemaker.aws/#/experiments/1/runs/ab98516b896a42bea5949c86ddd49323
🧪 View experiment at: https://us-east-1.experiments.sagemaker.aws/#/experiments/1


### Offline Evaluation Results

The results are returned as a table based on your evaluation dataset, with an additional column holding the score for each metric.


In [9]:
print(f"See aggregated evaluation results below: \n{result.metrics}")
result.tables["eval_results_table"]

See aggregated evaluation results below: 
{'custom_answer_correctness/mean': 1.0, 'custom_answer_correctness/variance': 0.0, 'custom_answer_correctness/p90': 1.0}



| | user_input | expected_output | actual_output | custom_answer_correctness/score |
|---|---|---|---|---|
| 0 | TICKET-001 | 1. Check network connectivity between client a... | 1. Check network connectivity between client a... | 1 |
| 1 | TICKET-002 | 1. Verify database credentials are correct\n2.... | 1. Verify database credentials are correct\n2.... | 1 |
| 2 | TICKET-006 | 1. Remove temporary and unnecessary files\n2. ... | 1. Remove temporary and unnecessary files\n2. ... | 1 |
| 3 | TICKET-008 | 1. Review user access rights\n2. Check file an... | 1. Review user access rights\n2. Check file an... | 1 |
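
To make the aggregated values (`mean`, `variance`, `p90`) easier to interpret, the sketch below recomputes them from a list of per-row scores. It assumes population variance and a linearly interpolated 90th percentile, which are NumPy's defaults; whether MLflow computes them in exactly this way is an assumption here, and `aggregate_scores` is a hypothetical helper.

```python
import math
import statistics


def aggregate_scores(scores):
    """Compute mean, population variance, and 90th percentile of row scores."""
    ordered = sorted(scores)
    # Linear interpolation between the two closest ranks.
    rank = 0.9 * (len(ordered) - 1)
    lower, upper = math.floor(rank), math.ceil(rank)
    p90 = ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)
    return {
        "mean": statistics.fmean(ordered),
        "variance": float(statistics.pvariance(ordered)),
        "p90": float(p90),
    }


# All four test cases matched exactly in the run above.
print(aggregate_scores([1, 1, 1, 1]))
# A hypothetical run where one of the four test cases failed.
print(aggregate_scores([1, 1, 0, 1]))
```

With only four binary scores, `p90` stays at 1 even when one case fails, so the mean and the per-row table are usually the more informative views for small test sets.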


### MLflow GenAI metrics with Bedrock

MLflow also has some built-in [Heuristic](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#heuristic-based-metrics) and [LLM as a Judge metrics](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#llm-as-a-judge-metrics). Here we show one example, `answer_correctness`, with an Amazon Bedrock model as the judge.

In [None]:
import os

os.environ["AWS_REGION"] = "<your-aws-region>"

# Choose ONE of the two authentication options below and remove the other.

# Option 1. Role-based authentication
os.environ["AWS_ROLE_ARN"] = "<your-aws-role-arn>"

# Option 2. Access key-based authentication
os.environ["AWS_ACCESS_KEY_ID"] = "<your-aws-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-aws-secret-access-key>"

# You can also use a session token for temporary credentials.
# os.environ["AWS_SESSION_TOKEN"] = "<your-aws-session-token>"

answer_correctness = mlflow.metrics.genai.answer_correctness(
    model="bedrock:/anthropic.claude-3-5-sonnet-20241022-v2:0",
    parameters={
        "temperature": 0,
        "max_tokens": 256,
        "anthropic_version": "bedrock-2023-05-31",
    },
)

with mlflow.start_run() as evaluation_run:
    eval_dataset = mlflow.data.from_pandas(
        df=eval_df,
        name="eval_dataset_v3",
        targets="expected_output",
        predictions="actual_output",
    )
    mlflow.log_input(dataset=eval_dataset)


    result = mlflow.evaluate(
        data=eval_dataset,
        # Use the LLM-as-a-judge metric defined above
        extra_metrics=[
            answer_correctness
        ],
        evaluator_config={"col_mapping": {"inputs": "user_input"}}
    )

print(f"See aggregated evaluation results below: \n{result.metrics}")
result.tables["eval_results_table"]