# Agent offline Evalution with MLflow 

Evaluation is a critical methodology used to assess the performance of AI agents using historical groundtruth data. This approach serves multiple essential purposes in the development and deployment lifecycle of AI systems. By testing agent behavior in a controlled environment, users can mitigate risks, avoid costly mistakes, and protect against potential negative impacts in production environments.

MLflow provides comprehensive tools and functionalities for experiment tracking, allowing developers to log metrics, parameters, and artifacts while comparing different agent versions. This tutorial we will introduce you how to do offline evaluation on MLflow. We will use the same example we showed in the previous sections


In [None]:
import boto3
import os

import sagemaker_mlflow
import mlflow
import pandas as pd 

tracking_server_arn = "ENTER YOUR MLFLOW TRACKIHG SERVER ARN HERE" 
experiment_name = "agent-mlflow-demo"
mlflow.set_tracking_uri(tracking_server_arn) 
mlflow.set_experiment(experiment_name)

mlflow.langchain.autolog()

## Evaluation Dataset

When you build your Agent, you might want to test your agent performance based on some pre-defined groudtruth dataset. The following example shows the user inputs, agent actual outputs and expected outputs. You can find the dataset creation at [04-how-to-create-evaluation-dataset.ipynb](./04-how-to-create-evaluation-dataset.ipynb)



In [3]:
eval_df = pd.DataFrame(
    {
        "user_input": [
            "Can you help me solve the ticket: TICKET-001?",
            "Can you help me with this ticket id: TICKET-002?",
            "I got a ticket: TICKET-003, can you help me with this?",
            "Help me solve the ticekt: TICKET-004"
        ],
        "actual_output": [
            '1. Check network connectivity between client and server\n2. Verify if the server is running and accessible\n3. Increase the connection timeout settings\n4. Check for firewall rules blocking the connection\n5. Monitor network latency and bandwidth\n    ',
            '1. Verify database credentials are correct\n2. Check if the database user account is locked\n3. Ensure database service is running\n4. Review database access permissions\n5. Check for recent password changes\n    ',
            '1. Remove temporary and unnecessary files\n2. Implement log rotation\n3. Archive old data\n4. Expand disk space\n5. Monitor disk usage trends\n    ',
            '1. Review user access rights\n2. Check file and directory permissions\n3. Verify group memberships\n4. Update security policies\n5. Audit access control lists\n    '
        ],
        "expected_output":[
            '1. Check network connectivity between client and server\n2. Verify if the server is running and accessible\n3. Increase the connection timeout settings\n4. Check for firewall rules blocking the connection\n5. Monitor network latency and bandwidth\n    ',
            '1. Verify database credentials are correct\n2. Check if the database user account is locked\n3. Ensure database service is running\n4. Review database access permissions\n5. Check for recent password changes\n    ',
            '1. Remove temporary and unnecessary files\n2. Implement log rotation\n3. Archive old data\n4. Expand disk space\n5. Monitor disk usage trends\n    ',
            '1. Review user access rights\n2. Check file and directory permissions\n3. Verify group memberships\n4. Update security policies\n5. Audit access control lists\n    '
        ]
    }
)

## Create evaluation metrics for MLflow

MLflow have both built-in [Heuristic](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#heuristic-based-metrics) and [LLM as a Judge metrics](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#llm-as-a-judge-metrics). In this workflow, we will show you one example using [ROUGE score](https://huggingface.co/spaces/evaluate-metric/rouge)

In [4]:
rouge_metric = mlflow.metrics.rougeL()

### MLflow Custom Metrics: 

While MLflow has some built-in [Heuristic](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#heuristic-based-metrics) and [LLM as a Judge metrics](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#llm-as-a-judge-metrics), those metrics are limited and may not be able to fit all situations. This section we will show you how to build your own custom metrics.

1. You need to define your custom metric method with two arguments `predictions` and `targets`. "predictions" is a list of predicted results from you agent and "targets" is a list of ground truth. You can define any custom certeria for judgement. Here we design it to be exact match 
2. Use `make_metrics` method to make our customer metrics to be a MLflow metrics. Notices that here we use `greater_is_better = True` This indicate the score is higher the better. In future if you have similar metrics like incorrectness, you can set `greater_is_better = False`

In [5]:
from mlflow.metrics import MetricValue 
from mlflow.models import make_metric

def custom_correctness(predictions, targets):
    '''
    custom correctness based on exact match 

    Args:
        precitions: list of agent predicted values 
        targets: list of ground truth values 

    Returns: 
        A list of MetricValue scores
    '''
    scores = [] 

    for prediction, target in zip(predictions, targets):
        if prediction.strip() == target.strip():
            scores.append(1)
        else:
            scores.append(0)

    return MetricValue(scores= scores)

custom_correctness_metric = make_metric(
    eval_fn=custom_correctness, greater_is_better= True , name="custom_answer_correctness"
)

## Run Evaluation

Now we can use `mlflow.evaluate()` to see the performance of the chain on an evaluation dataset we created. we will use both build in metrics and custom metric 



In [6]:
with mlflow.start_run() as evaluation_run:
    eval_dataset = mlflow.data.from_pandas(
        df=eval_df,
        name="eval_dataset",
        targets="expected_output",
        predictions="actual_output",
    )
    mlflow.log_input(dataset=eval_dataset)


    result = mlflow.evaluate(
        data=eval_dataset,
        extra_metrics=[
            custom_correctness_metric,
            rouge_metric
        ]
    )   

2025/09/27 11:36:03 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
  from .autonotebook import tqdm as notebook_tqdm


🏃 View run skittish-pug-937 at: https://us-east-1.experiments.sagemaker.aws/#/experiments/1/runs/3aa375956d0b42cf8b4d0fe35c5a167d
🧪 View experiment at: https://us-east-1.experiments.sagemaker.aws/#/experiments/1


## Offline Evaluation Results

The results will be attached to your evaluation dataset frameworks with additional columns for scores


In [7]:
print(f"See aggregated evaluation results below: \n{result.metrics}")
result.tables["eval_results_table"]

See aggregated evaluation results below: 
{'custom_answer_correctness/mean': np.float64(1.0), 'custom_answer_correctness/variance': np.float64(0.0), 'custom_answer_correctness/p90': np.float64(1.0), 'rougeL/v1/mean': np.float64(1.0), 'rougeL/v1/variance': np.float64(0.0), 'rougeL/v1/p90': np.float64(1.0)}


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  9.50it/s]


Unnamed: 0,user_input,expected_output,actual_output,custom_answer_correctness/score,rougeL/v1/score
0,Can you help me solve the ticket: TICKET-001?,1. Check network connectivity between client a...,1. Check network connectivity between client a...,1,1
1,Can you help me with this ticket id: TICKET-002?,1. Verify database credentials are correct\n2....,1. Verify database credentials are correct\n2....,1,1
2,"I got a ticket: TICKET-003, can you help me wi...",1. Remove temporary and unnecessary files\n2. ...,1. Remove temporary and unnecessary files\n2. ...,1,1
3,Help me solve the ticekt: TICKET-004,1. Review user access rights\n2. Check file an...,1. Review user access rights\n2. Check file an...,1,1
