# Agent Offline Evaluation with MLflow

Offline evaluation assesses the performance of AI agents against historical ground-truth data. By testing agent behavior in a controlled environment before deployment, you can mitigate risks, avoid costly mistakes, and protect against negative impacts in production.

MLflow provides tools for experiment tracking that let developers log metrics, parameters, and artifacts and compare different agent versions. This tutorial shows how to run an offline evaluation with MLflow, using the same example as the previous sections.


In [None]:
import boto3
import os

import sagemaker_mlflow
import mlflow
import pandas as pd 

tracking_server_arn = "ENTER YOUR MLFLOW TRACKING SERVER ARN HERE" 
experiment_name = "agent-mlflow-demo"
mlflow.set_tracking_uri(tracking_server_arn) 
mlflow.set_experiment(experiment_name)

mlflow.langchain.autolog()

## Evaluation Dataset

When you build an agent, you will likely want to test its performance against a predefined ground-truth dataset. The following example contains the user inputs, the agent's actual outputs, and the expected outputs. The dataset is created in [04-how-to-create-evaluation-dataset.ipynb](./04-how-to-create-evaluation-dataset.ipynb).

Read in the JSON data you created in 04-how-to-create-evaluation-dataset.ipynb:

In [None]:
eval_df = pd.read_json("./data/agent_evaluation_dataset.json", orient="records")

Alternatively, run the following cell to use the example dataset:

In [None]:
# Example dataset: one record per line with the user input, the agent's actual output, and the expected output
eval_df = pd.DataFrame([
    {"inputs": "Can you help me solve the ticket: TICKET-001?", "actual_output": "1. Check network connectivity between client and server\n2. Verify if the server is running and accessible\n3. Increase the connection timeout settings\n4. Check for firewall rules blocking the connection\n5. Monitor network latency and bandwidth\n    ", "expected_output": "1. Check network connectivity between client and server\n2. Verify if the server is running and accessible\n3. Increase the connection timeout settings\n4. Check for firewall rules blocking the connection\n5. Monitor network latency and bandwidth"},
    {"inputs": "Can you help me with this ticket id: TICKET-002?", "actual_output": "1. Verify database credentials are correct\n2. Check if the database user account is locked\n3. Ensure database service is running\n4. Review database access permissions\n5. Check for recent password changes\n    ", "expected_output": "1. Verify database credentials are correct\n2. Check if the database user account is locked\n3. Ensure database service is running\n4. Review database access permissions\n5. Check for recent password changes"},
    {"inputs": "I got a ticket: TICKET-003, can you help me with this?", "actual_output": "1. Analyze application memory usage patterns\n2. Increase available memory or swap space\n3. Look for memory leaks in the application\n4. Optimize database queries and caching\n5. Consider implementing memory pooling\n    ", "expected_output": "1. Analyze application memory usage patterns\n2. Increase available memory or swap space\n3. Look for memory leaks in the application\n4. Optimize database queries and caching\n5. Consider implementing memory pooling"},
    {"inputs": "Help me solve the ticekt: TICKET-004", "actual_output": "error type not found in the knowledge base, please use your own knowledge", "expected_output": "1. Implement request throttling\n2. Use caching to reduce API calls\n3. Review API usage patterns\n4. Contact service provider for limit increase\n5. Optimize API call frequency"},
    {"inputs": "How do I solve this ticket: TICKET-005?", "actual_output": "1. Check certificate expiration date\n2. Verify certificate chain is complete\n3. Ensure certificate matches domain name\n4. Update SSL certificate if expired\n5. Check certificate authority validity\n    ", "expected_output": "1. Check certificate expiration date\n2. Verify certificate chain is complete\n3. Ensure certificate matches domain name\n4. Update SSL certificate if expired\n5. Check certificate authority validity"},
    {"inputs": "I need help with this ticket: TICKET-006", "actual_output": "1. Remove temporary and unnecessary files\n2. Implement log rotation\n3. Archive old data\n4. Expand disk space\n5. Monitor disk usage trends\n    ", "expected_output": "1. Remove temporary and unnecessary files\n2. Implement log rotation\n3. Archive old data\n4. Expand disk space\n5. Monitor disk usage trends"},
    {"inputs": "What do I do to resolve TICKET-007", "actual_output": "1. Check physical network connections\n2. Verify router and switch status\n3. Test DNS resolution\n4. Check for network interface errors\n5. Monitor network traffic patterns\n    ", "expected_output": "1. Check physical network connections\n2. Verify router and switch status\n3. Test DNS resolution\n4. Check for network interface errors\n5. Monitor network traffic patterns"},
    {"inputs": "I need help with this ticket: TICKET-008", "actual_output": "1. Review user access rights\n2. Check file and directory permissions\n3. Verify group memberships\n4. Update security policies\n5. Audit access control lists\n    ", "expected_output": "1. Review user access rights\n2. Check file and directory permissions\n3. Verify group memberships\n4. Update security policies\n5. Audit access control lists"},
    {"inputs": "How should I fix TICKET-009?", "actual_output": "1. Check service status and logs\n2. Restart the service\n3. Verify dependencies are running\n4. Check system resources\n5. Monitor service health metrics\n    ", "expected_output": "1. Check service status and logs\n2. Restart the service\n3. Verify dependencies are running\n4. Check system resources\n5. Monitor service health metrics"},
    {"inputs": "I was just assigned TICKET-010, what do I do?", "actual_output": "1. Locate backup configuration files\n2. Restore from version control\n3. Create new configuration file with default settings\n4. Check file path and permissions\n5. Verify application deployment process\n    ", "expected_output": "1. Locate backup configuration files\n2. Restore from version control\n3. Create new configuration file with default settings\n4. Check file path and permissions\n5. Verify application deployment process"},
    {"inputs": "I need help with this ticket", "actual_output": "I apologize, but you haven't provided the ticket number. Could you please share the specific ticket ID so I can help you identify and resolve the error?", "expected_output": "I apologize, but I can only provide ETL resolution steps for specific ticket IDs. Please provide a ticket ID."},
])
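
Before running an evaluation, it can help to confirm that every record carries the three fields used in this notebook (`inputs`, `actual_output`, `expected_output`). The following is a minimal sketch using only the standard library; the helper name `find_invalid_records` and the two sample records are hypothetical and only illustrate the idea.

```python
import json

# Field names taken from the dataset used in this notebook
REQUIRED_FIELDS = ("inputs", "actual_output", "expected_output")


def find_invalid_records(records):
    """Return (record_index, field_name) pairs for missing or empty string fields."""
    problems = []
    for index, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            # A field is invalid if it is absent, not a string, or only whitespace
            if not isinstance(value, str) or not value.strip():
                problems.append((index, field))
    return problems


# Hypothetical sample: the second record is missing its expected output
sample_json = json.dumps([
    {"inputs": "Help with TICKET-001", "actual_output": "1. Step", "expected_output": "1. Step"},
    {"inputs": "Help with TICKET-002", "actual_output": "1. Step"},
])
records = json.loads(sample_json)
problems = find_invalid_records(records)
print(problems)
```

With a DataFrame such as `eval_df`, the same check could be applied to `eval_df.to_dict("records")`.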

## Create evaluation metrics for MLflow

MLflow has both built-in [Heuristic](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#heuristic-based-metrics) and [LLM as a Judge metrics](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#llm-as-a-judge-metrics). In this workflow, we show one example using the [ROUGE score](https://huggingface.co/spaces/evaluate-metric/rouge) and one example using the LLM as a Judge [Answer Similarity](https://mlflow.org/docs/2.21.3/api_reference/python_api/mlflow.metrics.html#mlflow.metrics.genai.answer_similarity) metric.

In [None]:
rouge_metric = mlflow.metrics.rougeL()
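
ROUGE-L scores a prediction by the longest common subsequence (LCS) of tokens it shares with the reference. The sketch below is a simplified, pure-Python illustration of that idea, not MLflow's implementation: it assumes lowercase whitespace tokenization and reports the F1 of the LCS-based precision and recall. The helper names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1 (assumption: lowercase whitespace tokenization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)  # share of predicted tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("Restart the service", "Restart the service"))
print(rouge_l_f1("Restart the service", "Restart the database service"))
```

An identical prediction scores 1.0, while a prediction that drops or reorders tokens scores lower. Real ROUGE implementations differ in their tokenization and stemming, so exact values will not necessarily match.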

In [None]:
from dotenv import load_dotenv
import os

# To use LLM as a judge with MLflow and Bedrock, your AWS credentials must be configured as environment variables.
# One option is to create a .env file for API key-based authentication:
# AWS_ACCESS_KEY_ID = <your-aws-access-key-id>
# AWS_SECRET_ACCESS_KEY = <your-aws-secret-access-key>
# AWS_REGION = <your-aws-region>
# You can also use a session token for temporary credentials:
# AWS_SESSION_TOKEN = <your-aws-session-token>
# Then use load_dotenv to load the values from the .env file:
# load_dotenv()

answer_similarity_metric = mlflow.metrics.genai.answer_similarity(
    model="bedrock:/us.anthropic.claude-3-5-haiku-20241022-v1:0",
    parameters={
        "temperature": 0,
        "max_tokens": 256,
        "anthropic_version": "bedrock-2023-05-31",
    },
)


### MLflow Custom Metrics

While MLflow has some built-in [Heuristic](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#heuristic-based-metrics) and [LLM as a Judge metrics](https://mlflow.org/docs/2.21.3/llms/llm-evaluate/#llm-as-a-judge-metrics), those metrics are limited and may not fit every situation. This section shows how to build your own custom metric.

1. Define your custom metric function with two arguments, `predictions` and `targets`. `predictions` contains the results produced by your agent and `targets` contains the ground truth. You can define any custom criteria for the judgement. Here we use an exact match.
2. Use the `make_metric` function to turn the custom function into an MLflow metric. Note that we set `greater_is_better=True`, which indicates that a higher score is better. If you later define a metric where lower is better, such as incorrectness, set `greater_is_better=False`.

In [None]:
from mlflow.metrics import MetricValue 
from mlflow.models import make_metric

def custom_correctness(predictions, targets):
    '''
    Custom correctness based on exact match.

    Args:
        predictions: agent predicted values
        targets: ground truth values

    Returns:
        A MetricValue holding one score per row (1 for a match, 0 otherwise)
    '''
    scores = []

    for prediction, target in zip(predictions, targets):
        # Ignore leading and trailing whitespace when comparing
        if prediction.strip() == target.strip():
            scores.append(1)
        else:
            scores.append(0)

    return MetricValue(scores=scores)

custom_correctness_metric = make_metric(
    eval_fn=custom_correctness, greater_is_better=True, name="custom_answer_correctness"
)
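
For a metric where lower is better, such as the incorrectness case mentioned above, the scoring logic can simply be inverted. The sketch below shows only the plain-Python scoring part so it runs without MLflow. The function name `incorrectness_scores` is hypothetical, and the sample strings are shortened illustrations rather than full dataset rows.

```python
def incorrectness_scores(predictions, targets):
    """Return 1 where a prediction differs from its target (ignoring surrounding whitespace), else 0."""
    return [
        int(prediction.strip() != target.strip())
        for prediction, target in zip(predictions, targets)
    ]


# Illustrative inputs: the first pair matches after stripping, the second does not
predictions = [
    "1. Restart the service\n    ",
    "error type not found in the knowledge base, please use your own knowledge",
]
targets = [
    "1. Restart the service",
    "1. Implement request throttling",
]

scores = incorrectness_scores(predictions, targets)
mean_incorrectness = sum(scores) / len(scores)
print(scores, mean_incorrectness)
```

To use such a function with MLflow, its scores could be returned as `MetricValue(scores=...)` from an `eval_fn` and registered through `make_metric` with `greater_is_better=False`, mirroring the cell above.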

## Run Evaluation

Now we can use `mlflow.evaluate()` to measure the agent's performance on the evaluation dataset we created. We use both the built-in metrics and the custom metric.



In [None]:
with mlflow.start_run() as evaluation_run:
    eval_dataset = mlflow.data.from_pandas(
        df=eval_df,
        name="eval_dataset",
        targets="expected_output",
        predictions="actual_output"
    )
    mlflow.log_input(dataset=eval_dataset)


    result = mlflow.evaluate(
        data=eval_dataset,
        extra_metrics=[
            rouge_metric,
            answer_similarity_metric,
            custom_correctness_metric
        ],
    )
       

## Offline Evaluation Results

The results are returned as a table that contains your evaluation dataset with additional columns for the scores.


In [None]:
print(f"See aggregated evaluation results below: \n{result.metrics}")
result.tables["eval_results_table"]

Let's inspect one of the poorly performing test cases:

The error for ticket TICKET-004, API Rate Limit Exceeded, is not present in the solution book. This means the `information_retriever` tool could not return the steps to resolve the error.

The output from the model is "error type not found in the knowledge base, please use your own knowledge" instead of the expected result:
1. Implement request throttling
2. Use caching to reduce API calls
3. Review API usage patterns
4. Contact service provider for limit increase
5. Optimize API call frequency

As a result, the LLM as a Judge answer similarity score for this test case is low, 1 out of 5, which exposes the gap in the `information_retriever` tool.

In [None]:
with pd.option_context('display.max_colwidth', None):
    print(result.tables["eval_results_table"].iloc[3])
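
Rather than looking up a weak case by its row position, the lowest-scoring rows can be found by sorting on a score column. The sketch below works on plain row dictionaries so it runs without MLflow. The helper name `lowest_scoring_rows` is hypothetical, the column name `answer_similarity/v1/score` is an assumption about how the results table names the metric column, and the score of 5 for TICKET-001 is illustrative.

```python
import math


def lowest_scoring_rows(rows, score_column, limit=1):
    """Return up to `limit` rows with the smallest valid score in `score_column`."""
    def has_valid_score(row):
        score = row.get(score_column)
        # Skip rows whose score is missing or NaN (for example, a failed judge call)
        return isinstance(score, (int, float)) and not math.isnan(score)

    scored_rows = [row for row in rows if has_valid_score(row)]
    return sorted(scored_rows, key=lambda row: row[score_column])[:limit]


# Illustrative rows; the column name is an assumption, check your table's columns
rows = [
    {"inputs": "Can you help me solve the ticket: TICKET-001?", "answer_similarity/v1/score": 5},
    {"inputs": "Help me solve the ticekt: TICKET-004", "answer_similarity/v1/score": 1},
    {"inputs": "I need help with this ticket", "answer_similarity/v1/score": float("nan")},
]

worst = lowest_scoring_rows(rows, "answer_similarity/v1/score")
print(worst[0]["inputs"])
```

With the real results, the rows could be obtained via `result.tables["eval_results_table"].to_dict("records")`.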
