# Agent Offline Evaluation with SageMaker Managed MLflow
In this lab, you will assess the performance of your IT support GenAI agent using SageMaker Managed MLflow. Offline evaluation is a critical method for testing agent behavior against historical, labeled data before deploying into production.

You’ll configure MLflow, review metric types, run evaluation on prepared data, and interpret results using the MLflow tracking server.


## Install Dependencies
First, ensure your notebook environment has all necessary packages. Ignore any warnings from pre-installed packages.

In [None]:
# Install the following dependencies. Ignore any warnings and residual dependency errors.
!pip install --upgrade -r requirements-langgraph.txt -q

In [None]:
import boto3
import os

import sagemaker_mlflow
import sagemaker
import mlflow
import pandas as pd 

print(mlflow.__version__)

## Configure SageMaker AI Managed MLflow
We will configure MLflow for experiment tracking by calling `mlflow.set_tracking_uri()` and `mlflow.set_experiment("<YOUR_MLFLOW_EXPERIMENT_NAME>")`.

We will retrieve the stored notebook values, which include the SageMaker AI MLflow tracking server ARN. If the stored value is empty, assign your tracking server ARN to the `TRACKING_SERVER_ARN` placeholder below. You can use any MLflow experiment name; in this workshop we use `agent-mlflow-demo`.
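
Before passing the ARN to MLflow, it can help to confirm that the value is not still the placeholder text. The sketch below is a minimal check, assuming the usual `arn:<partition>:sagemaker:<region>:<account-id>:mlflow-tracking-server/<name>` layout; the helper name `looks_like_tracking_server_arn` and the example ARN are hypothetical and not part of the workshop code.

```python
import re

# Assumed ARN layout for a SageMaker managed MLflow tracking server:
# arn:<partition>:sagemaker:<region>:<12-digit account>:mlflow-tracking-server/<name>
_TRACKING_ARN_PATTERN = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:mlflow-tracking-server/[A-Za-z0-9-]+$"
)


def looks_like_tracking_server_arn(value):
    """Return True if value is a string shaped like a tracking server ARN."""
    return isinstance(value, str) and bool(_TRACKING_ARN_PATTERN.match(value.strip()))


# Placeholder ARN for illustration only; substitute your own value.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/demo-server"
print(looks_like_tracking_server_arn(example_arn))
print(looks_like_tracking_server_arn("ENTER YOUR MLFLOW TRACKING SERVER ARN HERE"))
```

A shape check like this only catches obvious mistakes such as an unedited placeholder; it does not confirm that the tracking server exists or is reachable.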

In [None]:
# Retrieve values stored from previous labs
%store -r 

%store

In [None]:
TRACKING_SERVER_ARN
# If the stored value is empty set your SageMaker Managed MLflow tracking server ARN from prerequisites
# TRACKING_SERVER_ARN = "ENTER YOUR MLFLOW TRACKING SERVER ARN HERE"

In [None]:
tracking_server_arn = TRACKING_SERVER_ARN 
experiment_name = "agent-mlflow-demo"

# Set MLflow SDK to your configured tracking server 
mlflow.set_tracking_uri(tracking_server_arn) 
# Create or select an MLflow experiment
mlflow.set_experiment(experiment_name)

# Enable autologging module
mlflow.langchain.autolog()

## Offline Evaluation Dataset Curation

When you build your agent, you will want to test its performance against a predefined ground truth dataset (offline). Each record in the dataset holds the user input, the agent's actual output, and the expected output. This lets you benchmark performance, reliability, and generalization.

In this section, the offline evaluation dataset is already prepared for you, and we will load it from `./data/agent_evaluation_dataset.json`.

(Optional) We prepared the offline evaluation dataset `./data/agent_evaluation_dataset.json` using the LangGraph IT support agent you created in the previous lab. You can find the dataset creation steps in the notebook [04-how-to-create-evaluation-dataset.ipynb](./04-how-to-create-evaluation-dataset.ipynb).

Now load the JSON data:

In [None]:
evaluation_dataset_path = "./data/agent_evaluation_dataset.json"
eval_df = pd.read_json(evaluation_dataset_path, orient="records")
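
Since every metric later in this lab depends on the `inputs`, `actual_output`, and `expected_output` fields, a quick completeness check on the records can catch a malformed dataset early. The following is a minimal sketch that works on the raw list of records rather than the DataFrame; the helper name `find_incomplete_records` and the sample records are hypothetical.

```python
import json

# Field names used by the evaluation dataset in this lab.
REQUIRED_FIELDS = ("inputs", "actual_output", "expected_output")


def find_incomplete_records(records):
    """Return (index, missing_fields) pairs for records lacking a required field."""
    problems = []
    for index, record in enumerate(records):
        missing = [
            field
            for field in REQUIRED_FIELDS
            if not isinstance(record.get(field), str) or not record[field].strip()
        ]
        if missing:
            problems.append((index, missing))
    return problems


# Two made-up records: the second has an empty and a missing field.
sample = json.loads(
    '[{"inputs": "Help with TICKET-001", "actual_output": "1. Check network", '
    '"expected_output": "1. Check network"}, '
    '{"inputs": "Help with TICKET-002", "actual_output": ""}]'
)
print(find_incomplete_records(sample))
```

With the workshop dataset, the same function could be applied to `eval_df.to_dict(orient="records")`.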

(Optional/Skip) Or uncomment the following block to use the example dataset:

In [None]:
# eval_df = pd.DataFrame([{"inputs":"Can you help me solve the ticket: TICKET-001?","actual_output":"1. Check network connectivity between client and server\n2. Verify if the server is running and accessible\n3. Increase the connection timeout settings\n4. Check for firewall rules blocking the connection\n5. Monitor network latency and bandwidth\n    ","expected_output":"1. Check network connectivity between client and server\n2. Verify if the server is running and accessible\n3. Increase the connection timeout settings\n4. Check for firewall rules blocking the connection\n5. Monitor network latency and bandwidth"},{"inputs":"Can you help me with this ticket id: TICKET-002?","actual_output":"1. Verify database credentials are correct\n2. Check if the database user account is locked\n3. Ensure database service is running\n4. Review database access permissions\n5. Check for recent password changes\n    ","expected_output":"1. Verify database credentials are correct\n2. Check if the database user account is locked\n3. Ensure database service is running\n4. Review database access permissions\n5. Check for recent password changes"},{"inputs":"I got a ticket: TICKET-003, can you help me with this?","actual_output":"1. Analyze application memory usage patterns\n2. Increase available memory or swap space\n3. Look for memory leaks in the application\n4. Optimize database queries and caching\n5. Consider implementing memory pooling\n    ","expected_output":"1. Analyze application memory usage patterns\n2. Increase available memory or swap space\n3. Look for memory leaks in the application\n4. Optimize database queries and caching\n5. Consider implementing memory pooling"},{"inputs":"Help me solve the ticekt: TICKET-004","actual_output":"error type not found in the knowledge base, please use your own knowledge","expected_output":"1. Implement request throttling\n2. Use caching to reduce API calls\n3. Review API usage patterns\n4. Contact service provider for limit increase\n5. Optimize API call frequency"},{"inputs":"How do I solve this ticket: TICKET-005?","actual_output":"1. Check certificate expiration date\n2. Verify certificate chain is complete\n3. Ensure certificate matches domain name\n4. Update SSL certificate if expired\n5. Check certificate authority validity\n    ","expected_output":"1. Check certificate expiration date\n2. Verify certificate chain is complete\n3. Ensure certificate matches domain name\n4. Update SSL certificate if expired\n5. Check certificate authority validity"},{"inputs":"I need help with this ticket: TICKET-006","actual_output":"1. Remove temporary and unnecessary files\n2. Implement log rotation\n3. Archive old data\n4. Expand disk space\n5. Monitor disk usage trends\n    ","expected_output":"1. Remove temporary and unnecessary files\n2. Implement log rotation\n3. Archive old data\n4. Expand disk space\n5. Monitor disk usage trends"},{"inputs":"What do I do to resolve TICKET-007","actual_output":"1. Check physical network connections\n2. Verify router and switch status\n3. Test DNS resolution\n4. Check for network interface errors\n5. Monitor network traffic patterns\n    ","expected_output":"1. Check physical network connections\n2. Verify router and switch status\n3. Test DNS resolution\n4. Check for network interface errors\n5. Monitor network traffic patterns"},{"inputs":"I need help with this ticket: TICKET-008","actual_output":"1. Review user access rights\n2. Check file and directory permissions\n3. Verify group memberships\n4. Update security policies\n5. Audit access control lists\n    ","expected_output":"1. Review user access rights\n2. Check file and directory permissions\n3. Verify group memberships\n4. Update security policies\n5. Audit access control lists"},{"inputs":"How should I fix TICKET-009?","actual_output":"1. Check service status and logs\n2. Restart the service\n3. Verify dependencies are running\n4. Check system resources\n5. Monitor service health metrics\n    ","expected_output":"1. Check service status and logs\n2. Restart the service\n3. Verify dependencies are running\n4. Check system resources\n5. Monitor service health metrics"},{"inputs":"I was just assigned TICKET-010, what do I do?","actual_output":"1. Locate backup configuration files\n2. Restore from version control\n3. Create new configuration file with default settings\n4. Check file path and permissions\n5. Verify application deployment process\n    ","expected_output":"1. Locate backup configuration files\n2. Restore from version control\n3. Create new configuration file with default settings\n4. Check file path and permissions\n5. Verify application deployment process"},{"inputs":"I need help with this ticket","actual_output":"I apologize, but you haven't provided the ticket number. Could you please share the specific ticket ID so I can help you identify and resolve the error?","expected_output":"I apologize, but I can only provide ETL resolution steps for specific ticket IDs. Please provide a ticket ID."}])

## Create evaluation metrics for MLflow

MLflow provides a wide range of built-in and custom metric options to flexibly score agent outputs. MLflow offers two kinds of built-in metrics and a module for custom metrics:
1. [Heuristic metrics](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#heuristic-based-metrics)
2. [LLM-as-a-Judge metrics](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#llm-as-a-judge-metrics)
3. A module to create and capture custom metrics

In this section, we will create one heuristic metric, the [ROUGE score](https://huggingface.co/spaces/evaluate-metric/rouge), and one LLM-as-a-Judge metric, [Answer Similarity](https://mlflow.org/docs/2.21.3/api_reference/python_api/mlflow.metrics.html#mlflow.metrics.genai.answer_similarity). Additionally, we will show you how to create a custom metric.

### Define Heuristic Metrics
Heuristic metrics use rules or direct calculations for scoring. Common examples include ROUGE (for text overlap) and exact match. Below, we use ROUGE-L.
The score ranges from 0 to 1, where a higher score indicates higher similarity. ROUGE-L measures similarity by the longest common subsequence (LCS) of words shared between the prediction and the target, rather than by unigram or n-gram counts.
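
To make the ROUGE-L idea concrete, here is a simplified sketch of the F-measure computed from the longest common subsequence of tokens. It assumes plain lowercase whitespace tokenization, so its numbers can differ from what `mlflow.metrics.rougeL()` reports, which relies on its own tokenizer. The function names are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)  # share of predicted tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("Restart the service", "Restart the service"))
print(rouge_l_f1("error type not found", "Implement request throttling"))
```

Because the LCS respects word order, a prediction that contains the right words in a scrambled order scores lower than one that preserves the sequence.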

In [None]:
# Define the MLflow heuristic metric for text similarity
rouge_metric = mlflow.metrics.rougeL()

### Define MLflow LLM-as-a-Judge metrics using Bedrock as the evaluator LLM
LLM-as-a-Judge metrics use a large language model deployed via Bedrock or another provider to rate answer quality, correctness, and similarity.

In [None]:
os.environ["AWS_ROLE_ARN"] = sagemaker.get_execution_role()
LLM_EVALUATOR="bedrock:/us.anthropic.claude-3-5-haiku-20241022-v1:0"

First, let's make sure the IAM role and credentials allow the MLflow metric to access the Bedrock LLM for evaluation.
The next cell updates the trust policy of your IAM role so that the metric can assume the role and call Bedrock.

In [None]:
import boto3
import json

# Create an IAM client
iam = boto3.client('iam')

# Define the role name and the new trust policy (this replaces the role's existing trust policy)
role_name = os.environ["AWS_ROLE_ARN"].split("/")[-1]  # Role name is the last segment of the ARN
role_arn = os.environ["AWS_ROLE_ARN"]
new_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": role_arn  # Allow the MLflow metric to assume the role
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

try:
    # Update the assume role policy
    response = iam.update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=json.dumps(new_trust_policy)
    )
    print(f"Trust policy for role '{role_name}' updated successfully.")
    print(json.dumps(response, indent=4))

except Exception as e:
    print(f"Error updating trust policy for role '{role_name}': {e}")
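
The cell above replaces the role's trust policy with a fixed document. If your role already trusts other principals, you may prefer to append the self-assume statement to the existing policy instead. The sketch below shows only the merge step as a pure function; `add_self_assume_statement` is a hypothetical helper, and the example ARN is a placeholder.

```python
import copy


def add_self_assume_statement(trust_policy, role_arn):
    """Return a copy of trust_policy that also lets role_arn assume the role."""
    updated = copy.deepcopy(trust_policy)
    statements = updated.setdefault("Statement", [])
    for statement in statements:
        principal = statement.get("Principal", {})
        if (
            statement.get("Effect") == "Allow"
            and statement.get("Action") == "sts:AssumeRole"
            and isinstance(principal, dict)
            and principal.get("AWS") == role_arn
        ):
            return updated  # statement already present; nothing to add
    statements.append(
        {"Effect": "Allow", "Principal": {"AWS": role_arn}, "Action": "sts:AssumeRole"}
    )
    return updated


# Placeholder policy and ARN for illustration only.
existing_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
example_role_arn = "arn:aws:iam::123456789012:role/example-execution-role"
merged_policy = add_self_assume_statement(existing_policy, example_role_arn)
print(len(merged_policy["Statement"]))
```

In practice, the existing document could be read with `iam.get_role(RoleName=role_name)["Role"]["AssumeRolePolicyDocument"]` and the merged result passed to `iam.update_assume_role_policy` as in the cell above; treat that wiring as an assumption to verify in your own account.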


> Note: Wait for a few minutes for the IAM policy to take effect before continuing to the next notebook cell. 

Alternatively, instead of attaching the trust policy as shown in the previous cell, you can set the AWS credentials in the environment:

`AWS_ACCESS_KEY_ID` = `<your-aws-access-key-id>`

`AWS_SECRET_ACCESS_KEY` = `<your-aws-secret-access-key>`

`AWS_REGION` = `<your aws region>`

`AWS_SESSION_TOKEN` = `<your-aws-session-token>`
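
If you take the environment-variable route, a small check can confirm that the variables are actually set before running the evaluation. This is a minimal sketch; the helper name `missing_credential_vars` is hypothetical, and it treats `AWS_SESSION_TOKEN` as optional because it is only needed for temporary credentials.

```python
import os

# Variables listed above; the session token is left out because it is
# only required when using temporary credentials.
CREDENTIAL_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_credential_vars(env=None):
    """Return the names of credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in CREDENTIAL_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_credential_vars({"AWS_ACCESS_KEY_ID": "example", "AWS_REGION": "us-east-1"}))
```

Calling `missing_credential_vars()` with no argument inspects the current process environment.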




Now define the `mlflow.metrics.genai.answer_similarity()` metric, which evaluates how similar an agent-generated output is to the ground truth data.
This metric asks the evaluator LLM to score how close your agent's prediction is to the ideal answer on a scale of 1 (low) to 5 (high).

In [None]:
answer_similarity_mlflow_metric_bedrock = mlflow.metrics.genai.answer_similarity(
    model=LLM_EVALUATOR,
    parameters={
        "temperature": 0,
        "max_tokens": 256,
        "anthropic_version": "bedrock-2023-05-31",
    },
)


In [None]:
# Test the metric definition. Score range 1 - 5 / LOW - HIGH
answer_similarity_mlflow_metric_bedrock(
    inputs="What is the largest planet in our solar system?",
    predictions="The moon is the largest planet in our solar system.",
    targets="The largest planet in our solar system is Jupiter.",
)

> Note: If you see a Bedrock model access error, verify that the IAM trust policy is attached and re-run the previous cell.

### MLflow Custom Metrics

Custom metrics let you tailor evaluation logic to your agent’s workflow, use case, or business rules.
You can design a scorer for any aspect not covered by built-in metrics; for example, exact-match correctness as shown below. In this section, we show you how to build your own custom metric.

1. Define your custom metric function with two arguments, `predictions` and `targets`. `predictions` is a list of predicted results from your agent and `targets` is a list of ground truth values. You can define any custom criteria for judgment; here we use exact match.
2. Use the `make_metric` function to turn the custom function into an MLflow metric. Note that we set `greater_is_better=True`, which indicates that a higher score is better. For a metric where lower is better, such as incorrectness, set `greater_is_better=False`.
3. This metric returns 1 for predictions that exactly match the ground truth and 0 otherwise.
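
The scoring rule in the steps above can be tried as plain Python before it is wrapped in an MLflow metric. The sketch below is an illustration only; `exact_match_scores` is a hypothetical name, and the sample strings are made up. It shows why stripping whitespace matters, since the agent outputs in the dataset end with a trailing newline and spaces.

```python
def exact_match_scores(predictions, targets):
    """Score 1 when prediction and target match after trimming whitespace, else 0."""
    return [
        int(prediction.strip() == target.strip())
        for prediction, target in zip(predictions, targets)
    ]


sample_predictions = [
    "1. Restart the service\n    ",   # trailing whitespace, otherwise identical
    "error type not found in the knowledge base",
]
sample_targets = [
    "1. Restart the service",
    "1. Implement request throttling",
]
print(exact_match_scores(sample_predictions, sample_targets))
```

Exact match is strict by design: any rewording scores 0, which is why this lab pairs it with ROUGE-L and an LLM-as-a-Judge metric.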

In [None]:
from mlflow.metrics import MetricValue 
from mlflow.models import make_metric

def custom_correctness(predictions, targets):
    '''
    Custom correctness based on exact match.

    Args:
        predictions: list of agent predicted values
        targets: list of ground truth values

    Returns:
        A MetricValue holding one score per row (1 for a match, 0 otherwise)
    '''
    scores = []

    for prediction, target in zip(predictions, targets):
        # Compare after trimming leading and trailing whitespace
        if prediction.strip() == target.strip():
            scores.append(1)
        else:
            scores.append(0)

    return MetricValue(scores=scores)

custom_correctness_metric = make_metric(
    eval_fn=custom_correctness, greater_is_better=True, name="custom_answer_correctness"
)

## Execute the MLflow Evaluation 
Now we can use `mlflow.evaluate()` to measure the performance of the agent on the evaluation dataset we prepared. We will use the three metrics we defined:
1. Heuristic metric `rouge_metric`
2. LLM-as-a-Judge metric `answer_similarity_mlflow_metric_bedrock`
3. Custom metric `custom_correctness_metric`


> Important note: Due to throughput limits enforced on Bedrock models in AWS workshop lab accounts, you may see evaluation scores containing the null value `NaN`. Ignore these `NaN` values, as they are caused by temporary throttling under high demand.

In [None]:
with mlflow.start_run() as evaluation_run:
    eval_dataset = mlflow.data.from_pandas(
        df=eval_df,
        name="eval_dataset",
        targets="expected_output",
        predictions="actual_output"
    )

    # Set project level tracking information as desired for governance and lineage
    phase = "offline-agent-evaluation"
    stage = "offline"
    task = "it-support"
    version = "1.0.0"
    user = role_arn
    mlflow.log_input(
        dataset=eval_dataset,
        context=phase,
        tags={
            "task": task,
            "split": stage,
            "version": version
        }
    )
    mlflow.log_param("data_path", evaluation_dataset_path)
    mlflow.log_param("llm-as-a-judge", LLM_EVALUATOR)
    mlflow.set_tag("experiment_phase", phase)
    mlflow.set_tag("version", version)
    
    # Evaluation run in MLflow with results logged for ROUGE, LLM-Judge answer similarity, and custom correctness.
    result = mlflow.evaluate(
        data=eval_dataset,
        extra_metrics=[
            rouge_metric,
            answer_similarity_mlflow_metric_bedrock,
            custom_correctness_metric
        ],
    )
       

## Offline Evaluation Results

The results of the evaluation are logged to the MLflow experiment run and also returned by `mlflow.evaluate()`. The returned results table contains the original evaluation dataset along with new columns for the metric scores.


In [None]:
print(f"See aggregated evaluation results below: \n{result.metrics}")
result.tables["eval_results_table"]
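
Because throttled judge calls can leave `NaN` in a score column, it can be useful to average only the rows that were actually scored and to count the ones that were not. The following is a minimal sketch on a plain list of scores; `summarize_scores` is a hypothetical helper, and the sample values are made up.

```python
import math


def summarize_scores(scores):
    """Average the usable scores and count entries that are None or NaN."""
    valid = [
        float(score)
        for score in scores
        if score is not None and not math.isnan(float(score))
    ]
    mean = sum(valid) / len(valid) if valid else None
    return {"mean": mean, "scored": len(valid), "missing": len(scores) - len(valid)}


# Made-up judge scores with one throttled (NaN) and one absent entry.
print(summarize_scores([5, float("nan"), 1, None]))
```

To apply this to the results table, pass a score column converted to a list, for example `result.tables["eval_results_table"]["answer_similarity/v1/score"].tolist()`; that column name is an assumption about MLflow's naming, so check the actual column headers in your table first.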

### Interpreting Results and Error Analysis

Let's inspect one of the poor performing test cases:

The error for ticket TICKET-004, API Rate Limit Exceeded, is not present in the internal storage, i.e. the sample solution_book in the workshop lab. This means the `information_retriever` agent tool could not return the resolution steps for the error, which points agent developers to a possible area of improvement.

The output from the model is "error type not found in the knowledge base, please use your own knowledge" instead of the expected result: 
1. Implement request throttling
2. Use caching to reduce API calls
3. Review API usage patterns
4. Contact service provider for limit increase
5. Optimize API call frequency

Thus the LLM-as-a-Judge answer similarity score is low for this test case (1 out of 5), which exposes the gap in the `information_retriever` tool. This helps diagnose why a given ticket received a low similarity score, which is useful for improving agent logic and tool design.

Hence, you can use the evaluation to interpret agent performance and identify areas that need improvement, such as retrieval coverage.
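
To find candidates for this kind of error analysis without scanning the table by eye, the rows can be sorted by score. The sketch below works on a list of row dictionaries and skips rows whose score is missing or `NaN`; the helper `lowest_scoring`, the `score` key, and the sample rows are all hypothetical.

```python
import math


def _is_usable_score(value):
    """True for real numbers that are not NaN (booleans are excluded)."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and not math.isnan(value)
    )


def lowest_scoring(rows, score_key, limit=3):
    """Return up to `limit` rows with the lowest usable score, worst first."""
    scored = [row for row in rows if _is_usable_score(row.get(score_key))]
    return sorted(scored, key=lambda row: row[score_key])[:limit]


# Made-up rows for illustration.
sample_rows = [
    {"inputs": "TICKET-001", "score": 5},
    {"inputs": "TICKET-004", "score": 1},
    {"inputs": "TICKET-009", "score": float("nan")},
    {"inputs": "TICKET-010", "score": 4},
]
for row in lowest_scoring(sample_rows, "score", limit=2):
    print(row["inputs"], row["score"])
```

With the evaluation results, the rows could come from `result.tables["eval_results_table"].to_dict(orient="records")`, using whichever score column you want to inspect as `score_key`.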

### View Results in the SageMaker managed MLflow UI
After completing the evaluation run:
- Open the MLflow UI from SageMaker Studio, view the experiment runs, and inspect the evaluation metrics.

Summary: You have now curated an offline evaluation dataset, defined diverse MLflow metrics, and run a full agent evaluation pipeline using SageMaker Managed MLflow. In the MLflow UI, you can aggregate metrics, browse artifacts, and visualize traces for deeper analysis. Use these techniques for reliable agent deployment and ongoing improvement.