# Evaluate Supply Chain Agent
This notebook demonstrates how to evaluate the agent with Mosaic AI Agent Evaluation.

## Cluster Configuration
This notebook was tested on the following Databricks cluster configuration:
- **Databricks Runtime Version:** 16.4 LTS ML (includes Apache Spark 3.5.2, Scala 2.12)
- **Single Node** 
    - Azure: Standard_DS4_v2 (28 GB Memory, 8 Cores)
    - AWS: m5d.2xlarge (32 GB Memory, 8 Cores)

In [0]:
%pip install -r ../requirements.txt --quiet
dbutils.library.restartPython()

## Load the agent

In [0]:
import mlflow
from databricks import agents
from supply_chain_agent import config

# Connect to the Unity Catalog model registry
mlflow.set_registry_uri("databricks-uc")

catalog = config.get("catalog")
schema = config.get("schema")
agent_name = config.get("agent_name")

# Load the latest version of the model using pyfunc flavor
agent = mlflow.pyfunc.load_model(f"models:/{catalog}.{schema}.{agent_name}@latest")

user_email = spark.sql('select current_user() as user').collect()[0]['user']  # User email address
first_name = user_email.split('.')[0]                                         # User first name

mlflow.set_experiment(f"/Users/{user_email}/supply-chain-agent")
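
The cell above derives `first_name` by splitting the email address on its first dot, which assumes addresses of the form `first.last@domain`. For an address with no dot in the local part, that expression returns text that still contains the `@` sign. A minimal sketch of a more defensive variant follows; the helper name `first_name_from_email` is hypothetical and not part of this notebook or any library.

```python
def first_name_from_email(email: str) -> str:
    """Return the part of the local name before the first dot.

    Hypothetical helper: it assumes, as this notebook does, that the
    secret scope is named after the user's first name.
    """
    local_part = email.split("@", 1)[0]  # drop the domain
    return local_part.split(".", 1)[0]   # keep the text before the first dot


print(first_name_from_email("jane.doe@example.com"))  # jane
print(first_name_from_email("jane@example.com"))      # jane
```

If your secret scope follows a different naming scheme, setting the scope name by hand is simpler than parsing the email address.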

## Test the agent

Interact with the agent to test its output.

In [0]:
import os
import mlflow
from dbruntime.databricks_repl_context import get_context

# TODO: set WORKSPACE_URL manually if it cannot be inferred from the current notebook
WORKSPACE_URL = None
if WORKSPACE_URL is None:
  workspace_url_hostname = get_context().browserHostName
  assert workspace_url_hostname is not None, "Unable to look up current workspace URL. This can happen if running against serverless compute. Manually set WORKSPACE_URL yourself above, or run this notebook against classic compute"
  WORKSPACE_URL = f"https://{workspace_url_hostname}"

# TODO: set secret_scope and secret_key to the secret that holds your PAT
secret_scope = first_name
secret_key = "token"

os.environ["HOST"] = WORKSPACE_URL
os.environ["TOKEN"] = dbutils.secrets.get(scope=secret_scope, key=secret_key)

In [0]:
from supply_chain_agent import AGENT

agent.predict({"messages": [{"role": "user", "content": "What happens if T2_4 goes down and takes 6 weeks to recover? What should I do?"}]})

## Evaluate the agent with [Agent Evaluation](https://learn.microsoft.com/azure/databricks/mlflow3/genai/eval-monitor)

You can edit the requests or expectations in your evaluation dataset and rerun the evaluation as you iterate on your agent, using MLflow to track the computed quality metrics. Evaluate your agent with one of the [predefined LLM scorers](https://learn.microsoft.com/azure/databricks/mlflow3/genai/eval-monitor/predefined-judge-scorers), or add [custom metrics](https://learn.microsoft.com/azure/databricks/mlflow3/genai/eval-monitor/custom-scorers).

In [0]:
import json
import mlflow.genai
from mlflow.genai.judges import custom_prompt_judge
from mlflow.entities import Trace, Feedback, SpanType
from mlflow.genai.scorers import Guidelines, RelevanceToQuery, Safety, scorer
from typing import Any

Create a sample evaluation dataset.

In [0]:
# Evaluation dataset
eval_data = [
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "List all downstream production sites for the raw material supplied by T3_10, and include any related information about these sites."
                }
            ],
        },
        "expectations": {
            "expected_tool": ["data_analysis_tool"],
        }
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Tell me what happens if T2_8 is disrupted and requires 9 to recover. What should I do?"
                }
            ],
        },
        "expectations": {
            "expected_tool": ["optimization_tool"],
        }
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "What happens if T2_4 goes down and takes 6 weeks to recover? What are the recommendations?"
                }
            ],
        },
        "expectations": {
            "expected_tool": ["optimization_tool"],
        }
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Tell me the demand for T1_5 and the inventory levels of all materials needed to produce this finished goods."
                }
            ],
        },
        "expectations": {
            "expected_tool": ["data_analysis_tool"],
        }
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "There has been an incident at T3_15 and it will go down for the next 10 time units. What can I do to mitigate the risk?"
                }
            ],
        },
        "expectations": {
            "expected_tool": ["optimization_tool"],
        }
    },
]
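
Each record in the dataset above has the same shape: an `inputs.messages` list of chat messages and an `expectations.expected_tool` list. A quick structural check before running the agent can catch a malformed record early. The sketch below is one possible check, written in plain Python; `validate_eval_record` is a hypothetical helper, not an MLflow function.

```python
from typing import Any


def validate_eval_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one evaluation record."""
    problems = []
    messages = record.get("inputs", {}).get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("inputs.messages must be a non-empty list")
    else:
        for i, message in enumerate(messages):
            # Every chat message needs both keys.
            if not {"role", "content"} <= set(message):
                problems.append(f"message {i} needs 'role' and 'content'")
    expected = record.get("expectations", {}).get("expected_tool")
    if not isinstance(expected, list) or not expected:
        problems.append("expectations.expected_tool must be a non-empty list")
    return problems


sample = {
    "inputs": {"messages": [{"role": "user", "content": "What is the demand for T1_5?"}]},
    "expectations": {"expected_tool": ["data_analysis_tool"]},
}
print(validate_eval_record(sample))  # []
```

Running the same function over every item of `eval_data` should return only empty lists if the dataset is well formed.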

We will generate initial traces by running the agent. The results, including traces, are logged to the MLflow experiment defined above.

In [0]:
# Run evaluation
def evaluate_model(messages) -> dict:
    return agent.predict({"messages": messages})

@scorer
def dummy_metric():
    # This scorer is just to help generate initial traces.
    return 1
  
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=evaluate_model,
    scorers=[dummy_metric]
)

Retrieve the traces logged by the evaluation run.

In [0]:
generated_traces = mlflow.search_traces(run_id=results.run_id)

Define custom evaluation criteria and wrap them as custom scorers.

In [0]:
# Define the judge function for the tool usage
@scorer
def tool_usage(inputs: dict[str, Any], expectations: dict[str, Any], trace: Trace) -> Feedback:
    tool_spans = trace.search_spans(span_type=SpanType.TOOL)
    used_tools = [json.loads(s.to_dict()['attributes']['mlflow.spanOutputs'])['name'] for s in tool_spans]
    expected_tools = expectations.get("expected_tool", [])
    expected_tools_unused = list(set(expected_tools) - set(used_tools))
    if not expected_tools_unused:
        return Feedback(value="yes", rationale="All expected tools were used.")
    else:
        return Feedback(value="no", rationale=f"Following expected tools were not used: {expected_tools_unused}")

# Judge prompt for the right tool parameter setting
tool_parameter_prompt = """
Evaluate the trace and determine if the right parameters were passed to the tool given the question.

Question:
{{question}}

Tool spans to evaluate:
{{spans}}

Return ONLY a valid JSON object with exactly these keys:
- "rationale": short explanation (<=80 words)
- "result": one of [[pass]], [[fail]]

STRICT OUTPUT RULES:
- Output a single JSON object.
- No markdown, no code fences, no extra text before/after.
- Example (copy the format exactly):
{{"rationale":"...", "result":"pass"}}
""".strip()

# Define the judge function for the tool parameter setting
@scorer
def tool_parameter(trace: Trace) -> Feedback:
    question = trace.search_spans(span_type=SpanType.AGENT)[0].inputs['messages'][0]['content']
    tool_spans = trace.search_spans(span_type=SpanType.TOOL)
    tool_spans_str = "\n".join(
        json.dumps(s.to_dict(), ensure_ascii=False) for s in tool_spans
    )
    judge = custom_prompt_judge(
        name="tool_parameter",
        prompt_template=tool_parameter_prompt,
        numeric_values={
            "pass": 1,
            "fail": 0,
        },
    )
    return judge(question=question, spans=tool_spans_str)
 
# Define the judge function for the response time
@scorer
def response_time(trace: Trace) -> Feedback:
    
    # Search particular span type from the trace
    agent_span = trace.search_spans(span_type=SpanType.AGENT)[0]

    # Span timestamps are in nanoseconds; convert the duration to seconds.
    # The local name differs from the function name to avoid shadowing it.
    elapsed_s = (agent_span.end_time_ns - agent_span.start_time_ns) / 1e9
    max_duration = 120
    if elapsed_s <= max_duration:
        return Feedback(
            value="yes",
            rationale=f"Response time {elapsed_s:.2f}s is within the {max_duration}s limit."
        )
    else:
        return Feedback(
            value="no",
            rationale=f"Response time {elapsed_s:.2f}s exceeds the {max_duration}s limit."
        )

# Define scorers
scorers = [
    Guidelines(
        name="response_length",
        guidelines="The response MUST be concise and to the point, and no longer than 500 words.",
    ),
    Guidelines(
        name="professional_tone",
        guidelines="The response MUST be in a professional tone.",
    ),
    Guidelines(
        name="includes_recommendations",
        guidelines="The response MUST include specific, actionable recommendations.",
    ),
    RelevanceToQuery(),
    response_time,
    tool_usage,
    tool_parameter
]
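
The two code-based scorers above, `tool_usage` and `response_time`, each reduce to a small piece of plain logic: a set difference between expected and used tools, and a nanosecond-to-second conversion compared against a limit. The sketch below isolates that logic so it can be checked without a trace; `missing_tools` and `within_limit` are hypothetical names, and the 120-second default mirrors the limit used in the cell above.

```python
def missing_tools(expected: list[str], used: list[str]) -> list[str]:
    """Expected tools that never appear among the tools actually called."""
    return sorted(set(expected) - set(used))


def within_limit(start_ns: int, end_ns: int, max_seconds: float = 120) -> bool:
    """True when the duration, converted from ns to s, is within the limit."""
    return (end_ns - start_ns) / 1e9 <= max_seconds


print(missing_tools(["optimization_tool"], ["data_analysis_tool"]))  # ['optimization_tool']
print(within_limit(0, 90_000_000_000))   # True
print(within_limit(0, 150_000_000_000))  # False
```

Keeping the decision logic in plain functions like these makes it possible to unit test a scorer separately from the MLflow objects it reads.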

Run `mlflow.genai.evaluate` to apply the custom scorers defined above.

In [0]:
# Apply the scorers to the traces generated in the earlier evaluation run.
trace_check_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=scorers
)
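
Once the scorers have run, a per-scorer pass rate is a convenient summary to compare between iterations of the agent. The sketch below shows the arithmetic only. It assumes you have already collected each row's "yes"/"no" feedback values into a list of dictionaries keyed by scorer name; that `rows` structure and the `pass_rates` helper are hypothetical and do not reflect how MLflow returns its results.

```python
from collections import defaultdict


def pass_rates(rows: list[dict[str, str]]) -> dict[str, float]:
    """Fraction of rows with value "yes", per scorer name."""
    counts = defaultdict(lambda: [0, 0])  # scorer name -> [passes, total]
    for row in rows:
        for name, value in row.items():
            counts[name][0] += value == "yes"
            counts[name][1] += 1
    return {name: passes / total for name, (passes, total) in counts.items()}


# Hypothetical feedback values for two evaluated requests.
rows = [
    {"tool_usage": "yes", "response_time": "yes"},
    {"tool_usage": "no", "response_time": "yes"},
]
print(pass_rates(rows))  # {'tool_usage': 0.5, 'response_time': 1.0}
```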

## Assign `production` alias to this version of the agent

In [0]:
from mlflow import MlflowClient

client = MlflowClient()
model_info = client.get_model_version_by_alias(f"{catalog}.{schema}.{agent_name}", "latest")
client.set_registered_model_alias(f"{catalog}.{schema}.{agent_name}", "production", model_info.version)

## Next steps
In this notebook, we explored how to evaluate the agent using predefined and custom scorers. In the next notebook, we will deploy the agent to Model Serving. See you there!