
## End-to-End Evaluation of Multi-Agent Systems on Vertex AI with Cloud Run Deployment

This notebook demonstrates how to quickly deploy A2A Agents to Cloud Run and evaluate **A2A+ADK Multi-Agents** using Vertex AI Evaluation services.

**Summary**:
1. **Deploying A2A Agents to Cloud Run**: Learn how to containerize and deploy your Python-based A2A agents to Cloud Run, enabling them to communicate with each other through a secure and scalable architecture.
2. **Orchestration with a Hosting Agent**: See how to create a central "hosting" agent that orchestrates the interactions between the deployed A2A agents, routing user requests to the appropriate specialized agent.
3. **Leveraging Vertex AI for Evaluation**: Discover how to use Vertex AI's evaluation services to rigorously assess the performance of your multi-agent system. We'll cover how to:
  - Define evaluation datasets with prompts and expected tool calls (trajectories).
  - Run evaluation tasks to measure trajectory-based metrics like *trajectory_exact_match*, *trajectory_precision*, and *trajectory_recall*.
  - Evaluate the final generated responses for coherence and safety.
4. **Custom Evaluation Metrics**: Learn how to create custom metrics to evaluate specific aspects of your agent's behavior, such as whether the final response logically follows from the sequence of tool calls.
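
To make the trajectory-based metrics above concrete, the sketch below computes simplified versions of exact match, precision, and recall for one predicted trajectory against a reference. It is an illustration only: the function names are hypothetical, the tool calls are made-up examples, and the Vertex AI evaluation service may differ in details such as how duplicate calls are counted.

```python
from typing import Any

ToolCall = dict[str, Any]


def trajectory_exact_match(predicted: list[ToolCall], reference: list[ToolCall]) -> float:
    """Return 1.0 if both trajectories hold the same calls in the same order."""
    return 1.0 if predicted == reference else 0.0


def trajectory_precision(predicted: list[ToolCall], reference: list[ToolCall]) -> float:
    """Fraction of predicted tool calls that also appear in the reference."""
    if not predicted:
        return 0.0
    hits = sum(1 for call in predicted if call in reference)
    return hits / len(predicted)


def trajectory_recall(predicted: list[ToolCall], reference: list[ToolCall]) -> float:
    """Fraction of reference tool calls that also appear in the prediction."""
    if not reference:
        return 0.0
    hits = sum(1 for call in reference if call in predicted)
    return hits / len(reference)


# Hypothetical trajectories: the agent made one expected call and one extra call.
weather_call = {
    "tool_name": "send_message",
    "tool_input": {"agent_name": "Weather Agent", "task": "weather in LA, CA"},
}
airbnb_call = {
    "tool_name": "send_message",
    "tool_input": {"agent_name": "Airbnb Agent", "task": "find a room in LA, CA"},
}
reference = [weather_call]
predicted = [weather_call, airbnb_call]

print(trajectory_exact_match(predicted, reference))  # 0.0
print(trajectory_precision(predicted, reference))  # 0.5
print(trajectory_recall(predicted, reference))  # 1.0
```

In this example the extra Airbnb call lowers precision but not recall, which is why the three metrics are usually read together.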



**Prerequisites:**
1.  **Google Cloud Project:** You need a Google Cloud Project with the Vertex AI API enabled.
2.  **Authentication:** You need to be authenticated to Google Cloud. In a Colab environment, this is usually handled by running `from google.colab import auth` and `auth.authenticate_user()`.
3.  **Agent Logic:** The Airbnb A2A Agent and Weather A2A Agent are imported from GitHub into this Colab notebook and deployed directly to Cloud Run. The logic for the Hosting/Routing Agent (e.g., a `HostingAgentExecutor` class) is defined in, or importable from, this notebook. This executor should have a method like `async def execute(self, message_payload: a2a.types.MessagePayload) -> a2a.types.Message:`.

## Preparation


### Setup and Installs

In [None]:
!pip install google-cloud-aiplatform httpx "a2a-sdk==0.2.11" --quiet
!pip install --upgrade --quiet  'google-adk'
!pip install "langchain-google-genai==2.1.5" --quiet
!pip install "langchain-mcp-adapters==0.1.0" --quiet
!pip install "langchain-google-vertexai==2.0.24" --quiet
!pip install "langgraph==0.4.5" --quiet

In [None]:
import asyncio
import json
import logging
import os
import random
import string
import subprocess
import uuid

from collections.abc import Callable
from typing import Any

import httpx
import pandas as pd
import plotly.graph_objects as go

from IPython.display import HTML, Markdown, display
from a2a.client import A2ACardResolver, A2AClient
from a2a.types import (
    AgentCard,
    MessageSendParams,
    Part,
    SendMessageRequest,
    SendMessageResponse,
    SendMessageSuccessResponse,
    Task,
    TaskArtifactUpdateEvent,
    TaskStatusUpdateEvent,
)
from dotenv import load_dotenv
from google.adk import Agent
from google.adk.agents.callback_context import CallbackContext
from google.adk.agents.readonly_context import ReadonlyContext

# Build agent with adk
from google.adk.events import Event
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.adk.tools.tool_context import ToolContext
from google.cloud import aiplatform
from google.colab import auth

# Evaluate agent
from google.genai import types
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import (
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
)

### Configuration

In [None]:
# --- Google Cloud Configuration ---
# You might need to authenticate gcloud first if you haven't already

PROJECT_ID = ""  # @param {type:"string"}
PROJECT_NUM = ""  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# --- Authentication (for Colab) ---
if not PROJECT_ID:
    raise ValueError("Please set your PROJECT_ID.")

try:
    auth.authenticate_user()
    print("Colab user authenticated.")
except Exception as e:
    print(
        f"Not in a Colab environment or auth failed: {e}. Assuming local gcloud auth."
    )

aiplatform.init(project=PROJECT_ID, location=LOCATION)
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
os.environ["GOOGLE_CLOUD_LOCATION"] = LOCATION
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "True"

Colab user authenticated.


In [None]:
EXPERIMENT_NAME = "evaluate-a2a"  # @param {type:"string"}
BUCKET_NAME = "a2a-sdk-eval"  # @param {type: "string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

## Deploy A2A Agents to Cloud Run

In [None]:
# @title github pull to get A2A samples
# Download a2a-samples from GitHub so we can deploy the A2A Airbnb and weather agent samples
!git clone https://github.com/a2aproject/a2a-samples.git

Cloning into 'a2a-samples'...
remote: Enumerating objects: 3127, done.
remote: Counting objects: 100% (1075/1075), done.
remote: Compressing objects: 100% (436/436), done.
remote: Total 3127 (delta 872), reused 639 (delta 639), pack-reused 2052 (from 2)
Receiving objects: 100% (3127/3127), 4.75 MiB | 12.28 MiB/s, done.
Resolving deltas: 100% (1763/1763), done.


In [None]:
# Basic logging setup (helpful for seeing what the handler does)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

### Build Airbnb Agent and Deploy to Cloud Run

In [None]:
# Change the default host from localhost to 0.0.0.0, since the Cloud Run ingress container must not listen only on 127.0.0.1. # noqa: E501
# Also read the agent card URL from the APP_URL environment variable so it can be set to the correct Cloud Run URL. # noqa: E501
!sed -i "s/DEFAULT_HOST = 'localhost'/DEFAULT_HOST = '0.0.0.0'/" a2a-samples/samples/python/agents/airbnb_planner_multiagent/airbnb_agent/__main__.py # noqa: E501
!sed -i "s/DEFAULT_HOST = 'localhost'/DEFAULT_HOST = '0.0.0.0'/" a2a-samples/samples/python/agents/airbnb_planner_multiagent/weather_agent/__main__.py # noqa: E501
!sed -i "s/url=f'http:\/\/{host}:{port}\/'/url=os.environ[\"APP_URL\"]/" a2a-samples/samples/python/agents/airbnb_planner_multiagent/airbnb_agent/__main__.py # noqa: E501
!sed -i "s/url=f'http:\/\/{host}:{port}\/'/url=os.environ[\"APP_URL\"]/" a2a-samples/samples/python/agents/airbnb_planner_multiagent/weather_agent/__main__.py # noqa: E501

In [None]:
#@title Prepare the docker files
%%writefile a2a-samples/samples/python/Dockerfile
FROM python:3.13-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# Add Node.js and npm
# Required for airbnb_agent to work
RUN apt-get update && apt-get install -y nodejs npm

EXPOSE 10002
WORKDIR /app

COPY . /app

RUN uv sync

WORKDIR /app/agents/airbnb_planner_multiagent/airbnb_agent/

ENTRYPOINT ["uv", "run", ".", "--host", "0.0.0.0", "--port", "10002"]

Writing a2a-samples/samples/python/Dockerfile


In [None]:
#@title Build the docker image for airbnb A2A Agent
# Replace [PROJECT_ID] with your Google Cloud Project ID
# Replace [IMAGE_NAME] with the desired name for your Docker image.
# Replace [TAG] with a tag for your image (e.g., latest)
# Replace [PATH_TO_YOUR_SOURCE_CODE] with the path to the source directory.
# If your source code is in the current directory, you can use '.'

# PROJECT_ID = "agent-sandbox-944298" # @param {type:"string"}
IMAGE_NAME = "airbnb-a2a-sample-agent" # @param {type:"string"}
# LOCATION = "us-central1" # @param {type:"string"}
TAG = "latest" # @param {type:"string"}
SOURCE_PATH = "a2a-samples/samples/python/" # @param {type:"string"}
# Using Google Container Registry (GCR)
IMAGE_URL = f"gcr.io/{PROJECT_ID}/{IMAGE_NAME}:{TAG}"

print(f"Building and pushing image to: {IMAGE_URL}")

!gcloud builds submit {SOURCE_PATH} \
  --project={PROJECT_ID} \
  --tag={IMAGE_URL}

In [None]:
# Replace [SERVICE-NAME] with the desired name for your A2A Agent.
# Replace [REGION] with the Google Cloud region where you want to deploy.
# Replace [IMAGE_URL] with the full path to your container image.

# Replace with your actual service name, region, and image URL
SERVICE_NAME = "airbnb-a2a-sample-agent"  # @param {type:"string"}
# Correctly format the IMAGE_URL string
IMAGE_URL = f"gcr.io/{PROJECT_ID}/{SERVICE_NAME}:latest"
AIRBNB_APP_URL = f"https://{SERVICE_NAME}-{PROJECT_NUM}.{LOCATION}.run.app"

In [None]:
#@title Run the airbnb A2A Agent in Cloud Run
!gcloud run deploy {SERVICE_NAME} \
    --verbosity=debug \
    --memory=1.5G \
    --image={IMAGE_URL} \
    --region={LOCATION} \
    --port=10002 \
    --project={PROJECT_ID} \
    --no-allow-unauthenticated \
    --set-env-vars=GOOGLE_GENAI_USE_VERTEXAI=TRUE,GOOGLE_GENAI_MODEL="gemini-2.5-flash",PROJECT_ID={PROJECT_ID},LOCATION={LOCATION},APP_URL={AIRBNB_APP_URL}

### Build Weather Agent and Deploy to Cloud Run

In [None]:
#@title Prepare the docker file
%%writefile a2a-samples/samples/python/Dockerfile
FROM python:3.13-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# Add Node.js and npm
# Required for airbnb_agent to work
# RUN apt-get update && apt-get install -y nodejs npm

EXPOSE 10001
WORKDIR /app

COPY . /app

RUN uv sync

WORKDIR /app/agents/airbnb_planner_multiagent/weather_agent/

ENTRYPOINT ["uv", "run", ".", "--host", "0.0.0.0", "--port", "10001"]

Overwriting a2a-samples/samples/python/Dockerfile


In [None]:
#@title Build the docker image for Weather A2A Agent

# Replace [PROJECT_ID] with your Google Cloud Project ID
# Replace [IMAGE_NAME] with the desired name for your Docker image.
# Replace [TAG] with a tag for your image (e.g., latest)
# Replace [SOURCE_PATH] with the path to the source directory.
# If your source code is in the current directory, you can use '.'
IMAGE_NAME = "weather-a2a-sample-agent" # @param {type:"string"}
TAG = "latest" # @param {type:"string"}
SOURCE_PATH = "a2a-samples/samples/python/" # @param {type:"string"}
# Using Google Container Registry (GCR)
IMAGE_URL = f"gcr.io/{PROJECT_ID}/{IMAGE_NAME}:{TAG}"

print(f"Building and pushing image to: {IMAGE_URL}")
!gcloud builds submit {SOURCE_PATH} \
  --verbosity=debug \
  --project={PROJECT_ID} \
  --tag={IMAGE_URL}

In [None]:
# Replace [SERVICE-NAME] with the desired name for your Cloud Run service
# Replace [REGION] with the Google Cloud region where you want to deploy (e.g., us-central1)
# Replace [IMAGE_URL] with the full path to your container image in GCR or Artifact Registry

# Replace with your actual service name, region, and image URL
SERVICE_NAME = "weather-a2a-sample-agent"  # @param {type:"string"}
IMAGE_URL = f"gcr.io/{PROJECT_ID}/{SERVICE_NAME}:latest"
# The agent service can either run using API_KEY or Vertex AI directly.
API_KEY = ""  # @param {type:"string"}
WEATHER_APP_URL = f"https://{SERVICE_NAME}-{PROJECT_NUM}.{LOCATION}.run.app"

In [None]:
#@title Run the Weather A2A Agent in Cloud Run
!gcloud run deploy {SERVICE_NAME} \
    --verbosity=debug \
    --memory=1.5G \
    --image={IMAGE_URL} \
    --region={LOCATION} \
    --port=10001 \
    --project={PROJECT_ID} \
    --no-allow-unauthenticated \
    --set-env-vars=GOOGLE_GENAI_USE_VERTEXAI=False,GOOGLE_API_KEY={API_KEY},APP_URL={WEATHER_APP_URL}

## Quick Command-Line Test of the Agent Servers on Cloud Run

Whether the A2A endpoints are public depends on the flags passed to `gcloud run deploy` above. With `--allow-unauthenticated`, the generated A2A endpoint is a public URL that anyone can access. With `--no-allow-unauthenticated`, as used above, the URL is private and protected by IAM-based authentication. In that case, you need to send a Google Cloud identity token in the `Authorization` header when talking to these A2A endpoints.

The cells below show how to obtain the identity token, which is then used by the Hosting Agent defined later in this notebook. For public URLs, you don't need this token and can remove the auth header.

The token can be obtained with the gcloud CLI. Follow the steps below in this notebook, or run the gcloud command in a shell and copy the token value here.

In [None]:
!gcloud auth login
!gcloud config set project {PROJECT_ID}
!gcloud auth print-identity-token

In [None]:
try:
    # Run the gcloud command and capture its output
    token_bytes = subprocess.check_output(["gcloud", "auth", "print-identity-token"])  # noqa: E501

    # Decode the bytes to a string and remove any leading/trailing whitespace
    TOKEN = token_bytes.decode("utf-8").strip()

    print("Captured token:")
    print(TOKEN)

except subprocess.CalledProcessError as e:
    print(f"Error running gcloud command: {e}")
    # check_output does not capture stderr by default, so e.stderr may be None.
    if e.stderr:
        print(f"Stderr: {e.stderr.decode('utf-8')}")
except FileNotFoundError:
    print(
        "Error: gcloud command not found. Make sure the gcloud SDK is installed."
    )

In [None]:
HOST = f"{AIRBNB_APP_URL}/.well-known/agent.json"
!curl -H "Authorization: Bearer {TOKEN}" {HOST}

{"capabilities":{"pushNotifications":true,"streaming":true},"defaultInputModes":["text","text/plain"],"defaultOutputModes":["text","text/plain"],"description":"Helps with searching accommodation","name":"Airbnb Agent","protocolVersion":"0.2.5","skills":[{"description":"Helps with accommodation search using airbnb","examples":["Please find a room in LA, CA, April 15, 2025, checkout date is april 18, 2 adults"],"id":"airbnb_search","name":"Search airbnb accommodation","tags":["airbnb accommodation"]}],"url":"https://airbnb-a2a-sample-agent-559264237696.us-central1.run.app","version":"1.0.0"}

In [None]:
HOST = f"{WEATHER_APP_URL}/.well-known/agent.json"
!curl -H "Authorization: Bearer {TOKEN}" {HOST}

{"capabilities":{"streaming":true},"defaultInputModes":["text"],"defaultOutputModes":["text"],"description":"Helps with weather","name":"Weather Agent","protocolVersion":"0.2.5","skills":[{"description":"Helps with weather in city, or states","examples":["weather in LA, CA"],"id":"weather_search","name":"Search weather","tags":["weather"]}],"url":"https://weather-a2a-sample-agent-559264237696.us-central1.run.app","version":"1.0.0"}

## Define Eval helper functions

Define a set of helper functions to display the tutorial results.

In [None]:
# @title eval helper functions


def get_id(length: int = 8) -> str:
    """Generate a random alphanumeric ID of the specified length (default=8)."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length)) # noqa: E501

def parse_adk_output_to_dictionary(
    events: list[Event], *, as_json: bool = False
) -> dict[str, Any]:
    """Parse ADK event output into a structured dictionary format."""
    final_response = ""
    trajectory = []

    for event in events:
        if not getattr(event, "content", None) or not getattr(
            event.content, "parts", None
        ):
            continue
        for part in event.content.parts:
            if getattr(part, "function_call", None):
                info = {
                    "tool_name": part.function_call.name,
                    "tool_input": dict(part.function_call.args),
                }
                if info not in trajectory:
                    trajectory.append(info)
            if event.content.role == "model" and getattr(part, "text", None):
                final_response = part.text.strip()

    trajectory_out = json.dumps(trajectory) if as_json else trajectory
    return {"response": final_response, "predicted_trajectory": trajectory_out}


def format_a2a_output_as_markdown(output: dict) -> str:
    """Convert the output dictionary to a formatted markdown string."""
    markdown = "### AI Response\n"
    markdown += f"{output['response']}\n\n"

    if output["predicted_trajectory"]:
        output["predicted_trajectory"] = json.loads(output["predicted_trajectory"]) # noqa
        markdown += "### Function Calls\n"
        for call in output["predicted_trajectory"]:
            markdown += f"- **Function**: `{call['tool_name']}`\n"
            markdown += "  - **Arguments**:\n"
            for key, value in call["tool_input"].items():
                markdown += f"    - `{key}`: `{value}`\n"

    return markdown


def format_output_as_markdown(output: dict) -> str:
    """Convert the output dictionary to a formatted markdown string."""
    markdown = "### AI Response\n" + output["response"] + "\n\n"
    if output["predicted_trajectory"]:
        markdown += "### Function Calls\n"
        for call in output["predicted_trajectory"]:
            markdown += f"- **Function**: `{call['tool_name']}`\n"
            markdown += "  - **Arguments**\n"
            for key, value in call["tool_input"].items():
                markdown += f"    - `{key}`: `{value}`\n"
    return markdown


def display_eval_report(eval_result: pd.DataFrame) -> None:
    """Display the evaluation results."""
    display(Markdown("### Summary Metrics"))
    display(
        pd.DataFrame(eval_result.summary_metrics.items(), columns=["metric", "value"]) # noqa: E501
    )
    if getattr(eval_result, "metrics_table", None) is not None:
        display(Markdown("### Rowwise Metrics"))
        display(eval_result.metrics_table.head())


def display_drilldown(row: pd.Series) -> None:
    """Displays a drill-down view for trajectory data within a row."""
    style = "white-space: pre-wrap; width: 800px; overflow-x: auto;"

    if not (
        isinstance(row["predicted_trajectory"], list)
        and isinstance(row["reference_trajectory"], list)
    ):
        return

    for predicted_trajectory, reference_trajectory in zip(
        row["predicted_trajectory"], row["reference_trajectory"], strict=False
    ):
        display(
            HTML(
                f"<h3>Tool Names:</h3><div style='{style}'>{predicted_trajectory['tool_name'], reference_trajectory['tool_name']}</div>" # noqa: E501
            )
        )

        if not (
            isinstance(predicted_trajectory.get("tool_input"), dict)
            and isinstance(reference_trajectory.get("tool_input"), dict)
        ):
            continue

        for tool_input_key in predicted_trajectory["tool_input"]:
            print("Tool Input Key: ", tool_input_key)

            if tool_input_key in reference_trajectory["tool_input"]:
                print(
                    "Tool Values: ",
                    predicted_trajectory["tool_input"][tool_input_key],
                    reference_trajectory["tool_input"][tool_input_key],
                )
            else:
                print(
                    "Tool Values: ",
                    predicted_trajectory["tool_input"][tool_input_key],
                    "N/A",
                )
        print("\n")
    display(HTML("<hr>"))


def display_dataframe_rows(
    df: pd.DataFrame,
    columns: list[str] | None = None,
    num_rows: int = 3,
    display_drilldown: bool = False,
) -> None:
    """Displays a subset of rows from a DataFrame, optionally including a drill-down view.""" # noqa
    if columns:
        df = df[columns]

    base_style = "font-family: monospace; font-size: 14px; white-space: pre-wrap; width: auto; overflow-x: auto;" # noqa: E501
    header_style = base_style + "font-weight: bold;"

    for _, row in df.head(num_rows).iterrows():
        for column in df.columns:
            display(
                HTML(
                    f"<span style='{header_style}'>{column.replace('_', ' ').title()}: </span>" # noqa: E501
                )
            )
            display(HTML(f"<span style='{base_style}'>{row[column]}</span><br>"))

        display(HTML("<hr>"))

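        if (
            display_drilldown
            and "predicted_trajectory" in df.columns
            and "reference_trajectory" in df.columns
        ):
            # The `display_drilldown` parameter shadows the module-level
            # function of the same name, so look the function up explicitly.
            globals()["display_drilldown"](row)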


def plot_bar_plot(
    eval_result: pd.DataFrame, title: str, metrics: list[str] | None = None
) -> None:
    """Plot the bar plot for summary metrics."""
    fig = go.Figure()
    data = []

    summary_metrics = eval_result.summary_metrics
    if metrics:
        summary_metrics = {
            k: v
            for k, v in summary_metrics.items()
            if any(selected_metric in k for selected_metric in metrics)
        }

    data.append(
        go.Bar(
            x=list(summary_metrics.keys()),
            y=list(summary_metrics.values()),
            name=title,
        )
    )

    fig = go.Figure(data=data)

    # Change the bar mode
    fig.update_layout(barmode="group")
    fig.show()


def display_radar_plot(
    eval_results: pd.DataFrame, title: str, metrics: list[str] | None = None
) -> None:
    """Plot the radar plot."""
    fig = go.Figure()
    summary_metrics = eval_results.summary_metrics
    if metrics:
        summary_metrics = {
            k: v
            for k, v in summary_metrics.items()
            if any(selected_metric in k for selected_metric in metrics)
        }

    min_val = min(summary_metrics.values())
    max_val = max(summary_metrics.values())

    fig.add_trace(
        go.Scatterpolar(
            r=list(summary_metrics.values()),
            theta=list(summary_metrics.keys()),
            fill="toself",
            name=title,
        )
    )
    fig.update_layout(
        title=title,
        polar_radialaxis_range=[min_val, max_val],
        showlegend=True,
    )
    fig.show()

## Assemble the Hosting (ADK) Agent

The Vertex AI Gen AI evaluation service works directly with 'Queryable' agents, and also lets you plug in your own custom function with a specific structure (signature).

In this case, you assemble the agent using a custom function. The function triggers the agent for a given input and parses the agent outcome to extract the response and the called tools.
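
The contract of such a custom function can be sketched with a stand-in agent. In this minimal example, `run_stub_agent` is a hypothetical placeholder for a real agent run and its event records are invented. Only the returned shape, a dictionary with `response` and `predicted_trajectory` keys, mirrors what the helpers in this notebook produce.

```python
import json
from typing import Any


def run_stub_agent(prompt: str) -> list[dict[str, Any]]:
    """Hypothetical stand-in for an agent run: one tool call, then a final answer."""
    return [
        {
            "tool_call": {
                "name": "send_message",
                "args": {"agent_name": "Weather Agent", "task": prompt},
            }
        },
        {"text": "It is sunny in LA today. "},
    ]


def stub_parsed_outcome(prompt: str) -> dict[str, Any]:
    """Run the stub agent and extract the final response and the called tools."""
    trajectory = []
    response = ""
    for event in run_stub_agent(prompt):
        if "tool_call" in event:
            trajectory.append(
                {
                    "tool_name": event["tool_call"]["name"],
                    "tool_input": event["tool_call"]["args"],
                }
            )
        if "text" in event:
            # The last text event wins, so it is treated as the final response.
            response = event["text"].strip()
    # The trajectory is serialized to JSON here, as the `as_json=True` option
    # of `parse_adk_output_to_dictionary` does.
    return {"response": response, "predicted_trajectory": json.dumps(trajectory)}


outcome = stub_parsed_outcome("weather in LA, CA")
print(outcome["response"])
print(json.loads(outcome["predicted_trajectory"])[0]["tool_name"])
```

Keeping the agent run behind a single function that takes a prompt and returns this dictionary means the evaluation code does not need to know how the agent is hosted or invoked.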

### Defining the RemoteAgentConnections helper class
This class authenticates with the bearer token so that it can talk to the remote A2A endpoints, which are protected by Cloud IAM permissions.

In [None]:
load_dotenv()

TaskCallbackArg = Task | TaskStatusUpdateEvent | TaskArtifactUpdateEvent
TaskUpdateCallback = Callable[[TaskCallbackArg, AgentCard], Task]

# --- End Authentication Header ---


class RemoteAgentConnections:
    """A class to hold the connections to the remote agents."""

    def __init__(self, agent_card: AgentCard, agent_url: str):
        print(f"agent_card: {agent_card}")
        print(f"agent_url: {agent_url}")
        headers = {"Authorization": f"Bearer {TOKEN}"}
        self._httpx_client = httpx.AsyncClient(timeout=30, headers=headers)
        self.agent_client = A2AClient(self._httpx_client, agent_card, url=agent_url) # noqa: E501
        self.card = agent_card

    def get_agent(self) -> AgentCard:
        """Get the agent card."""
        return self.card

    async def send_message(
        self, message_request: SendMessageRequest
    ) -> SendMessageResponse:
        """Send a message to the agent."""
        return await self.agent_client.send_message(message_request)

### Defining the Hosting Agent
This hosting agent handles orchestration and routing to the different A2A agents deployed on Cloud Run.


In [None]:
load_dotenv()


def convert_part(part: Part, tool_context: ToolContext) -> str:
    """Convert a part to text. Only text parts are supported."""
    if part.type == "text":
        return part.text

    return f"Unknown type: {part.type}"


def convert_parts(parts: list[Part], tool_context: ToolContext) -> list[str]:
    """Convert parts to text."""
    rval = []
    for p in parts:
        rval.append(convert_part(p, tool_context))
    return rval


def create_send_message_payload(
    text: str, task_id: str | None = None, context_id: str | None = None
) -> dict[str, Any]:
    """Helper function to create the payload for sending a task."""
    payload: dict[str, Any] = {
        "message": {
            "role": "user",
            "parts": [{"type": "text", "text": text}],
            "messageId": uuid.uuid4().hex,
        },
    }

    if task_id:
        payload["message"]["taskId"] = task_id

    if context_id:
        payload["message"]["contextId"] = context_id
    return payload


class RoutingAgent:
    """The Routing agent.

    This is the agent responsible for choosing which remote seller agents to
    send tasks to and coordinate their work.
    """

    def __init__(
        self,
        task_callback: TaskUpdateCallback | None = None,
    ):
        self.task_callback = task_callback
        self.remote_agent_connections: dict[str, RemoteAgentConnections] = {}
        self.cards: dict[str, AgentCard] = {}
        self.agents: str = ""

    async def _async_init_components(self, remote_agent_addresses: list[str]) -> None: # noqa: E501
        """Asynchronous part of initialization."""
        # Use a single httpx.AsyncClient for all card resolutions for efficiency
        headers = {"Authorization": f"Bearer {TOKEN}"}
        print("Use auth headers")

        async with httpx.AsyncClient(timeout=30, headers=headers) as client:
            for address in remote_agent_addresses:
                card_resolver = A2ACardResolver(client, address)  # Constructor is sync
                try:
                    card = (
                        await card_resolver.get_agent_card()
                    )  # get_agent_card is async

                    remote_connection = RemoteAgentConnections(
                        agent_card=card, agent_url=address
                    )
                    self.remote_agent_connections[card.name] = remote_connection
                    self.cards[card.name] = card
                except httpx.ConnectError as e:
                    print(f"ERROR: Failed to get agent card from {address}: {e}")
                except Exception as e:  # Catch other potential errors
                    print(f"ERROR: Failed to initialize connection for {address}: {e}") # noqa

        # Populate self.agents using the logic from original __init__ (via list_remote_agents)
        agent_info = []
        for agent_detail_dict in self.list_remote_agents():
            agent_info.append(json.dumps(agent_detail_dict))
        self.agents = "\n".join(agent_info)

    @classmethod
    async def create(
        cls,
        remote_agent_addresses: list[str],
        task_callback: TaskUpdateCallback | None = None,
    ) -> "RoutingAgent":
        """Create and asynchronously initialize an instance of the RoutingAgent."""
        instance = cls(task_callback)
        await instance._async_init_components(remote_agent_addresses)
        return instance

    def create_agent(self) -> Agent:
        """Create an instance of the RoutingAgent."""
        model_id = "gemini-2.5-flash"
        print(f"Using hardcoded model: {model_id}")
        return Agent(
            model=model_id,
            name="Routing_agent",
            instruction=self.root_instruction,
            before_model_callback=self.before_model_callback,
            description=(
                "This Routing agent orchestrates the decomposition of the user asking for weather forecast or airbnb accommodation" # noqa: E501
            ),
            tools=[
                self.send_message,
            ],
        )

    def root_instruction(self, context: ReadonlyContext) -> str:
        """Generate the root instruction for the RoutingAgent."""
        current_agent = self.check_active_agent(context)
        return f"""
        **Role:** You are an expert Routing Delegator. Your primary function is to accurately delegate user inquiries regarding weather or accommodations to the appropriate specialized remote agents.

        **Core Directives:**

        * **Task Delegation:** Utilize the `send_message` function to assign actionable tasks to remote agents.
        * **Contextual Awareness for Remote Agents:** If a remote agent repeatedly requests user confirmation, assume it lacks access to the full conversation history. In such cases, enrich the task description with all necessary contextual information relevant to that specific agent.
        * **Autonomous Agent Engagement:** Never seek user permission before engaging with remote agents. If multiple agents are required to fulfill a request, connect with them directly without requesting user preference or confirmation.
        * **Transparent Communication:** Always present the complete and detailed response from the remote agent to the user.
        * **User Confirmation Relay:** If a remote agent asks for confirmation, and the user has not already provided it, relay this confirmation request to the user.
        * **Focused Information Sharing:** Provide remote agents with only relevant contextual information. Avoid extraneous details.
        * **No Redundant Confirmations:** Do not ask remote agents for confirmation of information or actions.
        * **Tool Reliance:** Strictly rely on available tools to address user requests. Do not generate responses based on assumptions. If information is insufficient, request clarification from the user.
        * **Prioritize Recent Interaction:** Focus primarily on the most recent parts of the conversation when processing requests.
        * **Active Agent Prioritization:** If an active agent is already engaged, route subsequent related requests to that agent using the appropriate task update tool.

        **Agent Roster:**

        * Available Agents: `{self.agents}`
        * Currently Active Seller Agent: `{current_agent['active_agent']}`
                """

    def check_active_agent(self, context: ReadonlyContext) -> dict[str, str]:
        """Check if there is an active agent in the current context."""
        state = context.state
        if (
            "session_id" in state
            and "session_active" in state
            and state["session_active"]
            and "active_agent" in state
        ):
            return {"active_agent": f'{state["active_agent"]}'}
        return {"active_agent": "None"}

    def before_model_callback(
        self, callback_context: CallbackContext, llm_request: Any
    ) -> None:
        """Callback to set up the session state before the model is called."""
        state = callback_context.state
        if "session_active" not in state or not state["session_active"]:
            if "session_id" not in state:
                state["session_id"] = str(uuid.uuid4())
            state["session_active"] = True

    def list_remote_agents(self) -> list[dict[str, str]]:
        """List the available remote agents you can use to delegate the task."""
        if not self.cards:
            return []

        remote_agent_info = []
        for card in self.cards.values():
            print(f"Found agent card: {card.model_dump(exclude_none=True)}")
            print("=" * 100)
            remote_agent_info.append(
                {"name": card.name, "description": card.description}
            )
        return remote_agent_info

    async def send_message(
        self, agent_name: str, task: str, tool_context: ToolContext
    ) -> Any:
        """Sends a task to remote seller agent.

        This will send a message to the remote agent named agent_name.

        Args:
            agent_name: The name of the agent to send the task to.
            task: The comprehensive conversation context summary
                and goal to be achieved regarding user inquiry and purchase request.
            tool_context: The tool context this method runs in.

        Returns:
            The resulting `Task` from the remote agent, or None if the
            response is not a successful task response.
        """
        if agent_name not in self.remote_agent_connections:
            raise ValueError(f"Agent {agent_name} not found")
        state = tool_context.state
        state["active_agent"] = agent_name
        client = self.remote_agent_connections[agent_name]

        if not client:
            raise ValueError(f"Client not available for {agent_name}")
        task_id = state["task_id"] if "task_id" in state else str(uuid.uuid4())

        if "context_id" in state:
            context_id = state["context_id"]
        else:
            context_id = str(uuid.uuid4())

        message_id = ""
        metadata: dict[str, str] = {}
        if "input_message_metadata" in state:
            metadata.update(**state["input_message_metadata"])
            if "message_id" in state["input_message_metadata"]:
                message_id = state["input_message_metadata"]["message_id"]
        if not message_id:
            message_id = str(uuid.uuid4())

        payload = {
            "message": {
                "role": "user",
                "parts": [
                    {"type": "text", "text": task}
                ],  # Use the 'task' argument here
                "messageId": message_id,
            },
        }

        if task_id:
            payload["message"]["taskId"] = task_id

        if context_id:
            payload["message"]["contextId"] = context_id

        message_request = SendMessageRequest(
            id=message_id, params=MessageSendParams.model_validate(payload)
        )
        send_response: SendMessageResponse = await client.send_message(
            message_request=message_request
        )

        if not isinstance(send_response.root, SendMessageSuccessResponse):
            print("received non-success response. Aborting get task ")
            return None

        if not isinstance(send_response.root.result, Task):
            print("received non-task response. Aborting get task ")
            return None

        return send_response.root.result


async def create_routing_agent() -> Agent:
    """Creates and asynchronously initializes the RoutingAgent."""
    routing_agent_instance = await RoutingAgent.create(
        remote_agent_addresses=[
            AIRBNB_APP_URL,
            WEATHER_APP_URL,
        ]
    )
    return routing_agent_instance.create_agent()
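
The message assembly inside `send_message` can be isolated into a small pure function, which makes the ID fallback logic easier to check without a live remote agent. The sketch below is only an illustration and not part of the notebook's agent code: `build_message_payload` is a hypothetical helper, and a plain `dict` stands in for `tool_context.state`.

```python
import uuid
from typing import Any


def build_message_payload(task: str, state: dict[str, Any]) -> dict[str, Any]:
    """Builds a message payload, reusing IDs from state when present.

    Hypothetical helper that approximates the logic in send_message.
    `state` is a plain dict standing in for tool_context.state.
    """
    # Fall back to a fresh UUID when an ID is missing or empty.
    task_id = state.get("task_id") or str(uuid.uuid4())
    context_id = state.get("context_id") or str(uuid.uuid4())
    metadata = state.get("input_message_metadata", {})
    message_id = metadata.get("message_id") or str(uuid.uuid4())

    return {
        "message": {
            "role": "user",
            "parts": [{"type": "text", "text": task}],
            "messageId": message_id,
            "taskId": task_id,
            "contextId": context_id,
        }
    }


example_state = {"context_id": "ctx-123"}
example_payload = build_message_payload(
    "What's the weather in Yosemite Valley, CA", example_state
)
```

Here the existing `context_id` is reused, while `taskId` and `messageId` are freshly generated because the example state does not contain them.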

### Define the Agent helper

In [None]:
async def agent_parsed_outcome(query: str) -> dict[str, Any]:
    """Runs the routing agent with the provided query and returns the parsed outcome."""
    app_name = "airbnb_weather_app"
    user_id = "user1"
    session_id = "session_one"

    routing_agent = await create_routing_agent()  # Await the async function

    session_service = InMemorySessionService()
    await session_service.create_session(
        app_name=app_name, user_id=user_id, session_id=session_id
    )

    runner = Runner(
        agent=routing_agent, app_name=app_name, session_service=session_service
    )

    content = types.Content(
        role="user", parts=[types.Part(text=query)]
    )
    events = [
        event
        async for event in runner.run_async(
            user_id=user_id, session_id=session_id, new_message=content
        )
    ]

    return parse_adk_output_to_dictionary(events)

In [None]:
# --- Sync wrapper for Vertex AI evaluation


def agent_parsed_outcome_sync(prompt: str) -> dict[str, Any]:
    """Synchronous wrapper for the async agent_parsed_outcome function."""
    result = asyncio.run(agent_parsed_outcome(prompt))
    result["predicted_trajectory"] = json.dumps(result["predicted_trajectory"])
    return result
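
The wrapper does two things: it drives the coroutine to completion with `asyncio.run`, and it stores the trajectory as a JSON string rather than a list. The stand-alone sketch below shows the same pattern with a stub coroutine. `fake_agent` and its output are made up for illustration; the real `agent_parsed_outcome` calls the deployed agents instead.

```python
import asyncio
import json
from typing import Any


async def fake_agent(query: str) -> dict[str, Any]:
    """Stub standing in for agent_parsed_outcome (makes no network calls)."""
    await asyncio.sleep(0)  # yield control once, as a real async call would
    return {
        "response": f"echo: {query}",
        "predicted_trajectory": [
            {"tool_name": "send_message", "tool_input": {"task": query}}
        ],
    }


def fake_agent_sync(prompt: str) -> dict[str, Any]:
    """Runs the coroutine to completion and JSON-encodes the trajectory."""
    result = asyncio.run(fake_agent(prompt))
    result["predicted_trajectory"] = json.dumps(result["predicted_trajectory"])
    return result


outcome = fake_agent_sync("hello")
# The trajectory can be recovered as a list of steps with json.loads.
decoded = json.loads(outcome["predicted_trajectory"])
```

Note that `asyncio.run` raises a `RuntimeError` when an event loop is already running in the current thread, which is the case in many notebook kernels. A common workaround is to call `nest_asyncio.apply()` once, if the notebook has not already done so.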

### Quick test with Agent Runner

In [None]:
response = agent_parsed_outcome_sync(prompt="Get product details for shoes")
display(Markdown(format_output_as_markdown(response)))

response = agent_parsed_outcome_sync(
    prompt="What's the weather in Yosemite Valley, CA")
display(Markdown(format_output_as_markdown(response)))

response = agent_parsed_outcome_sync(
    prompt="Looking for Airbnb in Yosemite for August 1 to 6, 2025"
)
display(Markdown(format_output_as_markdown(response)))

response = agent_parsed_outcome_sync(
    prompt="What's the weather in San Francisco, CA")
display(Markdown(format_output_as_markdown(response)))

response = agent_parsed_outcome_sync(
    prompt="Looking for Airbnb in Paris, France for August 10 to 12, 2025"
)
display(Markdown(format_output_as_markdown(response)))

Use auth headers
agent_card: additionalInterfaces=None capabilities=AgentCapabilities(extensions=None, pushNotifications=True, stateTransitionHistory=None, streaming=True) defaultInputModes=['text', 'text/plain'] defaultOutputModes=['text', 'text/plain'] description='Helps with searching accommodation' documentationUrl=None iconUrl=None name='Airbnb Agent' preferredTransport=None protocolVersion='0.2.5' provider=None security=None securitySchemes=None skills=[AgentSkill(description='Helps with accommodation search using airbnb', examples=['Please find a room in LA, CA, April 15, 2025, checkout date is april 18, 2 adults'], id='airbnb_search', inputModes=None, name='Search airbnb accommodation', outputModes=None, tags=['airbnb accommodation'])] supportsAuthenticatedExtendedCard=None url='https://airbnb-a2a-sample-agent-559264237696.us-central1.run.app' version='1.0.0'
agent_url: https://airbnb-a2a-sample-agent-559264237696.us-central1.run.app
agent_card: additionalInterfaces=None capabi

### AI Response
I am sorry, I cannot help you with this request. I can only help with weather or accommodations.




### AI Response
### Weather Forecast for Yosemite Valley, CA

**Tonight:**
*   **Temperature:** 47°F
*   **Wind:** 5 mph NE
*   **Short Forecast:** Mostly Clear
*   **Detailed Forecast:** Mostly clear. Low around 47, with temperatures rising to around 49 overnight. Northeast wind around 5 mph.

**Sunday:**
*   **Temperature:** 89°F
*   **Wind:** 0 to 15 mph E
*   **Short Forecast:** Sunny
*   **Detailed Forecast:** Sunny. High near 89, with temperatures falling to around 84 in the afternoon. East wind 0 to 15 mph.

**Sunday Night:**
*   **Temperature:** 49°F
*   **Wind:** 5 to 10 mph N
*   **Short Forecast:** Mostly Clear
*   **Detailed Forecast:** Mostly clear. Low around 49, with temperatures rising to around 51 overnight. North wind 5 to 10 mph.

**Monday:**
*   **Temperature:** 93°F
*   **Wind:** 5 to 10 mph SW
*   **Short Forecast:** Sunny
*   **Detailed Forecast:** Sunny, with a high near 93. Southwest wind 5 to 10 mph, with gusts as high as 20 mph.

**Monday Night:**
*   **Temperature:** 52°F
*   **Wind:** 5 to 10 mph N
*   **Short Forecast:** Mostly Clear
*   **Detailed Forecast:** Mostly clear, with a low around 52. North wind 5 to 10 mph.

### Function Calls
- **Function**: `send_message`
  - **Arguments**
    - `task`: `What's the weather in Yosemite Valley, CA`
    - `agent_name`: `Weather Agent`



### AI Response
Here are some Airbnb listings in Yosemite for August 1 to 6, 2025:

*   **Studio Near Yosemite, Little White Barn, Oakhurst**
    *   **Price:** $634 for 5 nights, originally $885
    *   **Rating:** 4.76 out of 5 average rating, 153 reviews
    *   **Badges:** Superhost
    *   **Link:** https://www.airbnb.com/rooms/53878563

*   **#14 Darling Vintage Apt | Historic Downtown Strip**
    *   **Price:** $810 for 5 nights, originally $908
    *   **Rating:** 4.92 out of 5 average rating, 95 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/942390741063724911

*   **Yosemite Way Inn Sunrise Room**
    *   **Price:** $413 for 5 nights, originally $450
    *   **Rating:** 4.86 out of 5 average rating, 249 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/6653773

*   **Peaceful Oaks Cabin**
    *   **Price:** $1,290 for 5 nights
    *   **Rating:** 4.98 out of 5 average rating, 41 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/931482680552312113

*   **THE PEAKS @ Monte Vista: Mountain Setting / Views!**
    *   **Price:** $1,857 for 5 nights, originally $2,625
    *   **Rating:** 4.96 out of 5 average rating, 196 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/52995793

*   **THE PEAKS @Ahwahnee: See-forever views! (NEW!)**
    *   **Price:** $2,122 for 5 nights, originally $3,143
    *   **Rating:** 4.95 out of 5 average rating, 174 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/647168403629178944

*   **Homestead Barn Loft: Tesla Chargers**
    *   **Price:** $862 for 5 nights
    *   **Rating:** 4.99 out of 5 average rating, 288 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/43640988

*   **Loft Condo A208 in Yosmeite West**
    *   **Price:** $1,535 for 5 nights, originally $1,985
    *   **Rating:** 4.67 out of 5 average rating, 55 reviews
    *   **Badges:** No badges
    *   **Link:** https://www.airbnb.com/rooms/772581147247566131

*   **LOV Big Wall. Yose Reservation+View~WiFi+Breakfast**
    *   **Price:** $4,455 for 5 nights
    *   **Rating:** 5.0 out of 5 average rating, 22 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/1350353080518658836

*   **Tahanan, welcome home! Family Fun and Pet Friendly**
    *   **Price:** $1,126 for 5 nights, originally $1,635
    *   **Rating:** 4.8 out of 5 average rating, 5 reviews
    *   **Badges:** No badges
    *   **Link:** https://www.airbnb.com/rooms/1436455336041392319

*   **White Feather Suite/Hot Tub/Private/Sleeps 2**
    *   **Price:** $718 for 5 nights, originally $1,479
    *   **Rating:** 4.93 out of 5 average rating, 211 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/544807652764436017

*   **Glamour Mountain Getaway by Casa Oso with Hot Tub**
    *   **Price:** $2,049 for 5 nights, originally $2,516
    *   **Rating:** 4.9 out of 5 average rating, 133 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/53974557

*   **Peacock Suite/Hot Tub/Sleeps up to 4/Bbq**
    *   **Price:** $811 for 5 nights, originally $1,949
    *   **Rating:** 4.9 out of 5 average rating, 142 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/49115775

*   **MODERN & PEACEFUL PRIVATE RETREAT**
    *   **Price:** $676 for 5 nights
    *   **Rating:** 4.95 out of 5 average rating, 275 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/13607156

*   **Rustic cabin just outside of Oakhurst*Clean*Cozy**
    *   **Price:** $826 for 5 nights, originally $920
    *   **Rating:** 4.9 out of 5 average rating, 200 reviews
    *   **Badges:** Superhost
    *   **Link:** https://www.airbnb.com/rooms/50104090

*   **Hans Basecamp In Yosemite Park**
    *   **Price:** $2,089 for 5 nights, originally $2,509
    *   **Rating:** 4.67 out of 5 average rating, 335 reviews
    *   **Badges:** No badges
    *   **Link:** https://www.airbnb.com/rooms/4842172

*   **The Base Cabin A-Frame is Your Gateway to Yosemite, Relaxation and Chill Vibes**
    *   **Price:** $1,778 for 5 nights, originally $2,036
    *   **Rating:** 4.91 out of 5 average rating, 113 reviews
    *   **Badges:** Guest favorite
    *   **Link:** https://www.airbnb.com/rooms/585520945934835827

*   **Yosemite Critter Creek 2 guests**
    *   **Price:** $759 for 5 nights, originally $899
    *   **Rating:** 4.68 out of 5 average rating, 110 reviews
    *   **Badges:** Superhost
    *   **Link:** https://www.airbnb.com/rooms/23956426

You can also view all search results here: https://www.airbnb.com/s/Yosemite/homes?checkin=2025-08-01&checkout=2025-08-06&adults=1&children=0&infants=0&pets=0

### Function Calls
- **Function**: `send_message`
  - **Arguments**
    - `task`: `Find an Airbnb in Yosemite for August 1 to 6, 2025.`
    - `agent_name`: `Airbnb Agent`



### AI Response
The weather in San Francisco, CA is as follows:

Tonight:
  Temperature: 57°F
  Wind: 7 to 10 mph W
  Short Forecast: Mostly Cloudy
  Detailed Forecast: Mostly cloudy, with a low around 57. West wind 7 to 10 mph, with gusts as high as 18 mph.

---

Sunday:
  Temperature: 66°F
  Wind: 5 to 14 mph WSW
  Short Forecast: Partly Sunny
  Detailed Forecast: Partly sunny. High near 66, with temperatures falling to around 64 in the afternoon. West southwest wind 5 to 14 mph, with gusts as high as 21 mph.

---

Sunday Night:
  Temperature: 56°F
  Wind: 6 to 14 mph WSW
  Short Forecast: Partly Cloudy
  Detailed Forecast: Partly cloudy, with a low around 56. West southwest wind 6 to 14 mph, with gusts as high as 21 mph.

---

Monday:
  Temperature: 69°F
  Wind: 5 to 14 mph WSW
  Short Forecast: Mostly Sunny
  Detailed Forecast: Mostly sunny, with a high near 69. West southwest wind 5 to 14 mph, with gusts as high as 21 mph.

---

Monday Night:
  Temperature: 56°F
  Wind: 5 to 14 mph WSW
  Short Forecast: Partly Cloudy
  Detailed Forecast: Partly cloudy, with a low around 56. West southwest wind 5 to 14 mph, with gusts as high as 21 mph.

### Function Calls
- **Function**: `send_message`
  - **Arguments**
    - `agent_name`: `Weather Agent`
    - `task`: `What's the weather in San Francisco, CA`



### AI Response
I couldn't find any listings in Paris, France for the specified dates. However, here's a link to the search results on Airbnb:
[Airbnb Search Results for Paris, France](https://www.airbnb.com/s/Paris%2C%20France/homes?checkin=2025-08-10&checkout=2025-08-12&adults=1&children=0&infants=0&pets=0)

### Function Calls
- **Function**: `send_message`
  - **Arguments**
    - `task`: `Find an Airbnb in Paris, France for August 10 to 12, 2025.`
    - `agent_name`: `Airbnb Agent`


## Evaluation

### Prepare Agent Evaluation dataset

To evaluate your AI agent using the Vertex AI Gen AI Evaluation service, you need a dataset whose columns depend on which aspects of the agent you want to evaluate.

This dataset should include the prompts given to the agent. It can also contain the ideal or expected response (ground truth) and the reference trajectory: the sequence of tool calls you expect the agent to make for each prompt.

> Optionally, you can provide both generated responses and predicted trajectory (**Bring-Your-Own-Dataset scenario**).
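
In the Bring-Your-Own-Dataset scenario, each row already carries the agent's output, so no runnable is needed at evaluation time. A minimal sketch of such a dataset follows. The column names `response`, `predicted_trajectory`, and `reference_trajectory` are assumed here to follow the Gen AI Evaluation convention, and `check_dataset` is a hypothetical helper for a quick shape check, not part of any SDK.

```python
from typing import Any

# Assumed column names for a Bring-Your-Own-Dataset evaluation.
REQUIRED_COLUMNS = (
    "prompt",
    "response",
    "predicted_trajectory",
    "reference_trajectory",
)

weather_call = {
    "tool_name": "send_message",
    "tool_input": {
        "task": "What's the weather in San Francisco, CA",
        "agent_name": "Weather Agent",
    },
}

byod_eval_data: dict[str, list[Any]] = {
    "prompt": ["What's the weather in San Francisco, CA"],
    "response": ["Tonight: mostly cloudy, with a low around 57."],
    "predicted_trajectory": [[weather_call]],
    "reference_trajectory": [[weather_call]],
}


def check_dataset(data: dict[str, list[Any]]) -> list[str]:
    """Returns the problems found; an empty list means the shape is consistent."""
    problems = [
        f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in data
    ]
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"columns differ in length: {lengths}")
    return problems


problems = check_dataset(byod_eval_data)
```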

Below is an example dataset for the routing agent in this notebook, pairing each user prompt with its reference trajectory.

In [None]:
# @title Define eval datasets
# Each prompt is paired with the trajectory the routing agent is expected to produce.
eval_data_a2a = {
    "prompt": [
        "What's the weather in Yosemite Valley, CA",
        "Looking for Airbnb in Yosemite for August 1 to 6, 2025",
        "What's the weather in San Francisco, CA",
        "Looking for Airbnb in Paris, France for August 10 to 12, 2025",
    ],
    "reference_trajectory": [
        [
            {
                "tool_name": "send_message",
                "tool_input": {
                    "task": "What's the weather in Yosemite Valley, CA",
                    "agent_name": "Weather Agent",
                },
            }
        ],
        [
            {
                "tool_name": "send_message",
                "tool_input": {
                    "task": "Find Airbnb in Yosemite for August 1 to 6, 2025",
                    "agent_name": "Airbnb Agent",
                },
            }
        ],
        [
            {
                "tool_name": "send_message",
                "tool_input": {
                    "task": "What's the weather in San Francisco, CA",
                    "agent_name": "Weather Agent",
                },
            }
        ],
        [
            {
                "tool_name": "send_message",
                "tool_input": {
                    "task": "Find Airbnb in Paris, France for August 10 to 12, 2025",
                    "agent_name": "Airbnb Agent",
                },
            }
        ],
    ],
}

eval_sample_dataset = pd.DataFrame(eval_data_a2a)

In [None]:
display_dataframe_rows(eval_sample_dataset, num_rows=30)

### Trajectory Evaluation

Trajectory evaluation analyzes the sequence of tool calls the agent makes for a given user input (the trajectory). It assesses whether the agent not only chooses the right tools but also uses them in a rational and effective order.

#### Set trajectory metrics

To evaluate an agent's trajectory, Vertex AI Gen AI Evaluation provides several ground-truth-based metrics:

* `trajectory_exact_match`: identical trajectories (same actions, same order)

* `trajectory_in_order_match`: reference actions present in predicted trajectory, in order (extras allowed)

* `trajectory_any_order_match`: all reference actions present in predicted trajectory (order and extras don't matter)

* `trajectory_precision`: proportion of predicted actions present in reference

* `trajectory_recall`: proportion of reference actions present in predicted

All metrics score 0 or 1, except `trajectory_precision` and `trajectory_recall` which range from 0 to 1.
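
To make these definitions concrete, the following sketch computes each metric in plain Python for a pair of trajectories. It is an illustration of the definitions above and not the service's implementation: it assumes steps are compared by full equality (tool name and inputs) and ignores duplicate steps, and the service's matching rules may differ. The function names and the `list_remote_agents` step are chosen for this example.

```python
from typing import Any

Step = dict[str, Any]


def exact_match(predicted: list[Step], reference: list[Step]) -> float:
    """1.0 if both trajectories have the same steps in the same order."""
    return float(predicted == reference)


def in_order_match(predicted: list[Step], reference: list[Step]) -> float:
    """1.0 if the reference steps appear in predicted in order (extras allowed)."""
    remaining = iter(predicted)
    # `ref in remaining` consumes the iterator up to the first match,
    # so later reference steps must be found after earlier ones.
    return float(all(ref in remaining for ref in reference))


def any_order_match(predicted: list[Step], reference: list[Step]) -> float:
    """1.0 if every reference step appears somewhere in predicted."""
    return float(all(ref in predicted for ref in reference))


def precision(predicted: list[Step], reference: list[Step]) -> float:
    """Share of predicted steps that also appear in the reference."""
    if not predicted:
        return 0.0
    return sum(step in reference for step in predicted) / len(predicted)


def recall(predicted: list[Step], reference: list[Step]) -> float:
    """Share of reference steps that also appear in predicted."""
    if not reference:
        return 0.0
    return sum(ref in predicted for ref in reference) / len(reference)


weather = {"tool_name": "send_message", "tool_input": {"agent_name": "Weather Agent"}}
airbnb = {"tool_name": "send_message", "tool_input": {"agent_name": "Airbnb Agent"}}
extra = {"tool_name": "list_remote_agents", "tool_input": {}}

reference = [weather, airbnb]
predicted = [weather, extra, airbnb]  # correct order, one extra step

scores = {
    "exact": exact_match(predicted, reference),          # 0.0
    "in_order": in_order_match(predicted, reference),    # 1.0
    "any_order": any_order_match(predicted, reference),  # 1.0
    "precision": precision(predicted, reference),        # 2 of 3 predicted steps
    "recall": recall(predicted, reference),              # 1.0
}
```

Under strict equality on `tool_input`, free-text arguments such as `task` are sensitive to small wording differences, for example `Find Airbnb in Yosemite ...` versus `Find an Airbnb in Yosemite ...`.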

In [None]:
trajectory_metrics = [
    "trajectory_exact_match",
    "trajectory_in_order_match",
    "trajectory_any_order_match",
    "trajectory_precision",
    "trajectory_recall",
]

#### Run an evaluation task

Submit an evaluation by running the `evaluate` method of the new `EvalTask`.

In [None]:
EXPERIMENT_RUN = f"trajectory-{get_id()}"

trajectory_eval_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=trajectory_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/multiple-metric-eval",
)

trajectory_eval_result = trajectory_eval_task.evaluate(
    runnable=agent_parsed_outcome_sync, experiment_run_name=EXPERIMENT_RUN
)

display_eval_report(trajectory_eval_result)

#### Visualize evaluation results

Print and visualize a sample of evaluation results.

In [None]:
display_dataframe_rows(trajectory_eval_result.metrics_table, num_rows=3)

In [None]:
plot_bar_plot(
    trajectory_eval_result,
    title="Trajectory Metrics",
    metrics=[f"{metric}/mean" for metric in trajectory_metrics],
)
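
The `/mean` suffix used for the plot refers to summary values that aggregate the per-row scores over the dataset. The relationship can be sketched in plain Python. The scores below are made-up values rather than results from this notebook, and `summarize` is a hypothetical helper.

```python
from statistics import mean

# Made-up per-row scores, shaped like rows of a metrics table.
rows = [
    {"trajectory_exact_match": 1.0, "trajectory_precision": 1.0},
    {"trajectory_exact_match": 0.0, "trajectory_precision": 0.5},
    {"trajectory_exact_match": 1.0, "trajectory_precision": 1.0},
    {"trajectory_exact_match": 0.0, "trajectory_precision": 0.5},
]


def summarize(rows: list[dict[str, float]], metric_names: list[str]) -> dict[str, float]:
    """Averages each metric over all rows, keyed as '<metric>/mean'."""
    return {f"{name}/mean": mean(row[name] for row in rows) for name in metric_names}


summary = summarize(rows, ["trajectory_exact_match", "trajectory_precision"])
```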

### Evaluate final response

Similar to model evaluation, you can evaluate the final response of the agent using Vertex AI Gen AI Evaluation.

#### Set response metrics

After agent inference, Vertex AI Gen AI Evaluation provides several metrics to evaluate generated responses. You can use computation-based metrics to compare the response to a reference (if needed) and use existing or custom model-based metrics to determine the quality of the final response.

Check out the [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval) to learn more.


In [None]:
response_metrics = ["safety", "coherence"]

In [None]:
EXPERIMENT_RUN = f"response-{get_id()}"

response_eval_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=response_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/response-metric-eval",
)

response_eval_result = response_eval_task.evaluate(
    runnable=agent_parsed_outcome_sync, experiment_run_name=EXPERIMENT_RUN
)

display_eval_report(response_eval_result)

#### Visualize evaluation results


Print a sample of the new evaluation results.

In [None]:
display_dataframe_rows(response_eval_result.metrics_table, num_rows=5)

### Evaluate generated response conditioned by tool choosing

When evaluating AI agents that interact with environments, standard text generation metrics like coherence may not be sufficient. This is because these metrics primarily focus on text structure, while agent responses should be assessed based on their effectiveness within the environment.

Instead, use custom metrics that assess whether the agent's response logically follows from its tool choices, like the one defined in this section.

#### Define a custom metric

According to the [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#model-based-metrics), you can define a prompt template for evaluating whether an AI agent's response follows logically from its actions by setting up criteria and a rating system for this evaluation.

Define `criteria` to set the evaluation guidelines and a `pointwise_rating_rubric` to provide a scoring system (1 or 0). Then use a `PointwiseMetricPromptTemplate` to create the template from these components.


In [None]:
criteria = {
    "Follows trajectory": (
        "Evaluate whether the agent's response logically follows from the "
        "sequence of actions it took. Consider these sub-points:\n"
        "  - Does the response reflect the information gathered during the trajectory?\n"
        "  - Is the response consistent with the goals and constraints of the task?\n"
        "  - Are there any unexpected or illogical jumps in reasoning?\n"
        "Provide specific examples from the trajectory and response to support your evaluation."
    )
}

pointwise_rating_rubric = {
    "1": "Follows trajectory",
    "0": "Does not follow trajectory",
}

response_follows_trajectory_prompt_template = PointwiseMetricPromptTemplate(
    criteria=criteria,
    rating_rubric=pointwise_rating_rubric,
    input_variables=["prompt", "predicted_trajectory"],
)

Print the `prompt_data` of this template, which contains the combined criteria and rubric information ready for use in an evaluation.

In [None]:
print(response_follows_trajectory_prompt_template.prompt_data)

After you define the evaluation prompt template, set up the associated metric to evaluate how well a response follows a specific trajectory. `PointwiseMetric` creates a metric named `response_follows_trajectory`, and `response_follows_trajectory_prompt_template` supplies the evaluation instructions you defined above.


In [None]:
response_follows_trajectory_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=response_follows_trajectory_prompt_template,
)

#### Set response metrics

Define a new set of response evaluation metrics that includes the custom metric.


In [None]:
response_tool_metrics = [
    "trajectory_exact_match",
    "trajectory_in_order_match",
    "safety",
    response_follows_trajectory_metric,
]

#### Run an evaluation task

Run a new evaluation of the agent.

In [None]:
EXPERIMENT_RUN = f"response-over-tools-{get_id()}"

response_eval_tool_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=response_tool_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/reasoning-metric-eval",
)

response_eval_tool_result = response_eval_tool_task.evaluate(
    # The runnable generates the response and trajectory for each prompt at evaluation time.
    runnable=agent_parsed_outcome_sync,
    experiment_run_name=EXPERIMENT_RUN,
)

display_eval_report(response_eval_tool_result)

#### Visualize evaluation results

Visualize a sample of the evaluation results.

In [None]:
display_dataframe_rows(response_eval_tool_result.metrics_table, num_rows=3)

In [None]:
plot_bar_plot(
    response_eval_tool_result,
    title="Response Metrics",
    metrics=[f"{metric}/mean" for metric in response_tool_metrics],
)