# MCP Evaluation

## Context

The evaluation of MCP server tool calls has three objectives:

1. Help identify problems in the tool descriptions
2. Create a test suite that can be run manually or automatically in CI
3. Allow quick iteration on the tool descriptions

We considered two options for evaluating the MCP server tool calls.

1. ✍️ **Create test cases manually**

- **Pros:**
  - Straightforward approach
  - Simple to create test cases for each tool

- **Cons:**
  - Flows (several tool calls in a row) are somewhat complicated to create
  - Needs to be maintained: every change in the MCP server requires a change in the test cases. This is true for any approach, and maintenance can be simplified by using an LLM that collects the data automatically.

2. 📊 **Collect traces using MCP tester client**

- **Pros:**
  - Testing exactly in the same way as the end users do
  - Collects the data automatically
  - Easily scalable to many users

- **Cons:**
  - It might be impossible to get to a wrong flow of tool calls
  - Handling complicated flows is not yet straightforward: all the data, including previous tool calls, has to be carried along
  - Sessions could handle this, similarly to how chatbots are tested, but they add complexity. The evaluator may get confused by the session and fail to evaluate the tool calls correctly.


## Create test cases manually

Simple example:
```
"What are the best Instagram scrapers": "search-actors"
```

Flow:
```
- user: Search for the weather MCP server and then add it into the available tools
- assistant: I'll help you to do that
- tool_use: search-actors, "input": {"search": "weather mcp","limit": 5}
- tool_use_id: 12, content: Tool \"search-actors\" successful, Actor found: jiri.spilka/wheather-mcp-server
- assistant:
```

Expected tool call: `add-actor`
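
Single-turn cases and multi-turn flows could share one structured format, which makes the flow above easier to store and extend. The sketch below is only an illustration: the `ToolCallCase` class, its field names, and the `tool_use` / `tool_result` role labels are assumptions made for this example, not part of the MCP server or Phoenix.

```python
from dataclasses import dataclass


@dataclass
class ToolCallCase:
    """Hypothetical container for one evaluation case (names are illustrative)."""

    messages: list[dict]       # conversation so far, oldest first
    expected_tools: list[str]  # tool(s) the model should call next


# Single-turn case: one user message, one expected tool.
simple_case = ToolCallCase(
    messages=[{"role": "user", "content": "What are the best Instagram scrapers"}],
    expected_tools=["search-actors"],
)

# Multi-turn flow: the history carries a previous tool call and its result.
flow_case = ToolCallCase(
    messages=[
        {"role": "user", "content": "Search for the weather MCP server and then add it into the available tools"},
        {"role": "assistant", "content": "I'll help you to do that"},
        {"role": "tool_use", "name": "search-actors", "input": {"search": "weather mcp", "limit": 5}},
        {
            "role": "tool_result",
            "tool_use_id": 12,
            "content": 'Tool "search-actors" successful, Actor found: jiri.spilka/wheather-mcp-server',
        },
    ],
    expected_tools=["add-actor"],
)

for case in (simple_case, flow_case):
    print(len(case.messages), "message(s) ->", case.expected_tools)
```

Keeping the expected tools as a list leaves room for flows where more than one tool call is acceptable at the next step.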

## Evaluation

Follow the Phoenix evaluation process:

1. **Create the dataset**
2. **Define the system prompt and tool definitions**
3. **Set up the evaluator**
4. **Run the experiment**
5. **Iterate and refine**

For evaluation, we can either specify the ground truth for the tool calls or use an LLM as a judge. Since we create the test cases manually, we can directly specify the expected tool calls. This does not rule out using an LLM as a judge at a later stage.

## Links

- [Tutorial on how to use evals](https://colab.research.google.com/github/Arize-ai/phoenix/blob/main/tutorials/evals/evaluate_agent.ipynb#scrollTo=ANh3q56OojLA).
- [System prompts for vscode, cursor](https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools)
- [Claude Desktop system prompt](https://github.com/asgeirtj/system_prompts_leaks/blob/main/Anthropic/claude.txt)



### Environment setup and project setup

You should already have your OpenAI API key.
You can find the Phoenix API key in 1Password, under the "shared" space.

In [2]:
%pip install "arize-phoenix==12.5.0" anthropic openai tqdm pandas python-dotenv --quiet

Note: you may need to restart the kernel to use updated packages.


In [3]:
# If the imports fail on the first run, run the cell again

import nest_asyncio

import json
from phoenix import Client as PhoenixClient
from phoenix.evals import TOOL_CALLING_PROMPT_TEMPLATE
from phoenix.evals.classify import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.experiments import evaluate_experiment, run_experiment
from phoenix.experiments.evaluators import create_evaluator
from phoenix.experiments.types import Example
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery
from openai import OpenAI
from anthropic import Anthropic

import pandas as pd

nest_asyncio.apply()

In [4]:
import os
from getpass import getpass

import dotenv
dotenv.load_dotenv()


model_name="gpt-4o-mini"
#model_name="gpt-4.1-nano"
#model_name="gpt-4.1-mini"
# model_name="gpt-4.1"
# model_name="gpt-5-nano"
# model_name="gpt-5-mini"
#model_name="gpt-5"
# model_name="claude-3-5-haiku-latest"
# model_name="claude-sonnet-4-20250514"

project_name = "mcp-client"
endpoint = "https://app.phoenix.arize.com/s/apify"

# Check if env vars exist, only prompt if missing (Phoenix API key is in 1pass)
if not os.environ.get("PHOENIX_API_KEY"):
    os.environ["PHOENIX_API_KEY"] = getpass("Enter YOUR PHOENIX_API_KEY")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = endpoint
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.getenv('PHOENIX_API_KEY')}"

if not os.environ.get("ANTHROPIC_API_KEY"):
    os.environ["ANTHROPIC_API_KEY"] = getpass("Enter YOUR ANTHROPIC_API_KEY")

px_client = PhoenixClient(endpoint=endpoint)
eval_model = OpenAIModel(model=model_name)

openai_client = OpenAI()
anthropic_client = Anthropic(timeout=10)


# Evaluation using manual test cases with ground truth

Follow a standard step-by-step process in Phoenix:

1. Define the system prompt and tool definitions
2. Create a dataset of test cases, and optionally, expected outputs
3. Create a task to run on each test case
4. Create evaluator(s) to run on each output of your task
5. Visualize results in Phoenix

### System prompt

In [6]:
SYSTEM_PROMPT_SIMPLE = "You are a helpful assistant"

### TOOLS (17.9.2025)

In [7]:
TOOLS = [
    {
        "name": "fetch-actor-details",
        "description": "Get detailed information about an Actor by its ID or full name (format: \"username/name\", e.g., \"apify/rag-web-browser\").\nThis returns the Actor’s title, description, URL, README (documentation), input schema, pricing/usage information, and basic stats.\nPresent the information in a user-friendly Actor card.\n\nUSAGE:\n- Use when a user asks about an Actor’s details, input schema, README, or how to use it.\n\nEXAMPLES:\n- user_input: How to use apify/rag-web-browser\n- user_input: What is the input schema for apify/rag-web-browser?\n- user_input: What is the pricing for apify/instagram-scraper?",
        "inputSchema": {
            "type": "object",
            "properties": {
                "actor": {
                    "type": "string",
                    "minLength": 1,
                    "description": "Actor ID or full name in the format \"username/name\", e.g., \"apify/rag-web-browser\"."
                }
            },
            "required": [
                "actor"
            ],
            "additionalProperties": False,
            "$schema": "http://json-schema.org/draft-07/schema#"
        }
    },
    {
        "name": "search-actors",
        "description": "Search the Apify Store for Actors or Model Context Protocol (MCP) servers using keywords.\nApify Store features solutions for web scraping, automation, and AI agents (e.g., Instagram, TikTok, LinkedIn, flights, bookings).\n\nThe results will include curated Actor cards with title, description, pricing model, usage statistics, and ratings.\nFor best results, use simple space-separated keywords (e.g., \"instagram posts\", \"twitter profile\", \"playwright mcp\").\nFor detailed information about a specific Actor, use the fetch-actor-details tool.\n\nUSAGE:\n- Use when you need to discover Actors for a specific task or find MCP servers.\n- Use to explore available tools in the Apify ecosystem based on keywords.\n\nEXAMPLES:\n- user_input: Find Actors for scraping e-commerce\n- user_input: Find browserbase MCP server\n- user_input: I need to scrape instagram profiles and comments\n- user_input: I need to get flights and airbnb data",
        "inputSchema": {
            "type": "object",
            "properties": {
                "limit": {
                    "type": "integer",
                    "minimum": 1,
                    "maximum": 100,
                    "default": 10,
                    "description": "The maximum number of Actors to return. The default value is 10."
                },
                "offset": {
                    "type": "integer",
                    "minimum": 0,
                    "default": 0,
                    "description": "The number of elements to skip at the start. The default value is 0."
                },
                "search": {
                    "type": "string",
                    "default": "",
                    "description": "A string to search for in the Actor's title, name, description, username, and readme.\nUse simple space-separated keywords, such as \"web scraping\", \"data extraction\", or \"playwright browser mcp\".\nDo not use complex queries, AND/OR operators, or other advanced syntax, as this tool uses full-text search only."
                },
                "category": {
                    "type": "string",
                    "default": "",
                    "description": "Filter the results by the specified category."
                }
            },
            "additionalProperties": False,
            "$schema": "http://json-schema.org/draft-07/schema#"
        }
    },
    {
        "name": "search-apify-docs",
        "description": "Search Apify documentation using full-text search.\n    You can use it to find relevant documentation based on keywords.\n    Apify documentation has information about Apify console, Actors (development\n    (actor.json, input schema, dataset schema, dockerfile), deployment, builds, runs),\n    schedules, storages (datasets, key-value store), Proxy, Integrations,\n    Apify Academy (crawling and webscraping with Crawlee),\n\n    The results will include the URL of the documentation page, a fragment identifier (if available),\n    and a limited piece of content that matches the search query.\n\n    Fetch the full content of the document using the fetch-apify-docs tool by providing the URL.\n\n    USAGE:\n    - Use when user asks about Apify documentation, Actor development, Crawlee, or Apify platform.\n\n    EXAMPLES:\n    - query: How to use create Apify Actor?\n    - query: How to define Actor input schema?\n    - query: How scrape with Crawlee?",
        "inputSchema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "minLength": 1,
                    "description": "Algolia full-text search query to find relevant documentation pages.\nUse only keywords, do not use full sentences or questions.\nFor example, \"standby actor\" will return documentation pages that contain the words \"standby\" and \"actor\"."
                },
                "limit": {
                    "type": "number",
                    "default": 5,
                    "description": "Maximum number of search results to return. Defaults to 5.\nYou can increase this limit if you need more results, but keep in mind that the search results are limited to the most relevant pages."
                },
                "offset": {
                    "type": "number",
                    "default": 0,
                    "description": "Offset for the search results. Defaults to 0.\nUse this to paginate through the search results. For example, if you want to get the next 5 results, set the offset to 5 and limit to 5."
                }
            },
            "required": [
                "query"
            ],
            "additionalProperties": False,
            "$schema": "http://json-schema.org/draft-07/schema#"
        }
    },
    {
        "name": "fetch-apify-docs",
        "description": "Fetch the full content of an Apify documentation page by its URL.\nUse this after finding a relevant page with the search-apify-docs tool.\n\nUSAGE:\n- Use when you need the complete content of a specific docs page for detailed answers.\n\nEXAMPLES:\n- user_input: Fetch https://docs.apify.com/platform/actors/running#builds\n- user_input: Fetch https://docs.apify.com/academy.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "minLength": 1,
                    "description": "URL of the Apify documentation page to fetch. This should be the full URL, including the protocol (e.g., https://docs.apify.com/)."
                }
            },
            "required": [
                "url"
            ],
            "additionalProperties": False,
            "$schema": "http://json-schema.org/draft-07/schema#"
        }
    },
    {
        "name": "call-actor",
        "description": "Call any Actor from the Apify Store using a mandatory two-step workflow.\nThis ensures you first get the Actor’s input schema and details before executing it safely.\n\nThe results of a successful run include a datasetId (Actor output stored as an Apify dataset) and a short preview of items.\nFetch the full output later using the get-actor-output tool by providing the datasetId.\n\nUSAGE:\n- Use when you need to run an Actor that does not have a dedicated tool.\n- Do not use if a dedicated tool exists (e.g., apify-slash-rag-web-browser).\n\nWORKFLOW:\n- Step 1 (step=\"info\", default): Get Actor details and input schema to understand required fields.\n- Step 2 (step=\"call\"): Provide valid input per the schema to execute the Actor. A datasetId will be returned in the result.\n\nEXAMPLES:\n- user_input: Show input schema for apify/instagram-scraper (step=\"info\")\n- user_input: Run apify/rag-web-browser with query=\"scrape apify.com\" and outputFormats=[\"markdown\"] (step=\"call\")",
        "inputSchema": {
            "type": "object",
            "properties": {
                "actor": {
                    "type": "string",
                    "description": "The name of the Actor to call. For example, \"apify/rag-web-browser\"."
                },
                "step": {
                    "type": "string",
                    "enum": ["info", "call"],
                    "default": "info",
                    "description": "Step to perform: \"info\" to get Actor details and input schema (required first step), \"call\" to execute the Actor (only after getting info)."
                },
                "input": {
                    "type": "object",
                    "description": "The input JSON to pass to the Actor. For example, {\"query\": \"apify\", \"maxResults\": 5, \"outputFormats\": [\"markdown\"]}. Required only when step is \"call\".",
                    "additionalProperties": True
                },
                "callOptions": {
                    "type": "object",
                    "properties": {
                        "memory": {
                            "type": "number",
                            "minimum": 128,
                            "maximum": 32768,
                            "description": "Memory allocation for the Actor in MB. Must be a power of 2 (e.g., 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768). Minimum: 128 MB, Maximum: 32768 MB (32 GB)."
                        },
                        "timeout": {
                            "type": "number",
                            "minimum": 0,
                            "description": "Maximum runtime for the Actor in seconds. After this time elapses, the Actor will be automatically terminated. Use 0 for infinite timeout (no time limit). Minimum: 0 seconds (infinite)."
                        }
                    },
                    "additionalProperties": False
                }
            },
            "required": [
                "actor", "step"
            ],
            "additionalProperties": False,
            "$schema": "http://json-schema.org/draft-07/schema#"
        }
    },
    {
        "name": "apify-slash-rag-web-browser",
        "description": "This tool calls the Actor \"apify/rag-web-browser\" and retrieves its output results.\nUse this tool instead of the \"call-actor\" if user requests this specific Actor.\nActor description: Web browser for OpenAI Assistants, RAG pipelines, or AI agents, similar to a web browser in ChatGPT. It queries Google Search, scrapes the top N pages, and returns their content as Markdown for further processing by an LLM. It can also scrape individual URLs.This tool provides general web browsing functionality, for specific sites like e-commerce, social media it is always better to search for a specific Actor",
        "inputSchema": {
            "title": "RAG Web Browser",
            "type": "object",
            "schemaVersion": 1,
            "properties": {
                "query": {
                    "title": "Search term or URL",
                    "description": "**REQUIRED** Enter Google Search keywords or a URL of a specific web page. The keywords might include the [advanced search operators](https://blog.apify.com/how-to-scrape-google-like-a-pro/). Examples:\n\n- <code>san francisco weather</code>\n- <code>https://www.cnn.com</code>\n- <code>function calling site:openai.com</code>\nExample values: \"web browser for RAG pipelines -site:reddit.com\"",
                    "type": "string",
                    "prefill": "web browser for RAG pipelines -site:reddit.com",
                    "examples": [
                        "web browser for RAG pipelines -site:reddit.com"
                    ]
                },
                "maxResults": {
                    "title": "Maximum results",
                    "description": "The maximum number of top organic Google Search results whose web pages will be extracted. If `query` is a URL, then this field is ignored and the Actor only fetches the specific web page.\nExample values: 3",
                    "type": "integer",
                    "default": 3,
                    "examples": [
                        3
                    ]
                },
                "outputFormats": {
                    "title": "Output formats",
                    "description": "Select one or more formats to which the target web pages will be extracted and saved in the resulting dataset.\nExample values: [\"markdown\"]",
                    "type": "array",
                    "default": [
                        "markdown"
                    ],
                    "items": {
                        "type": "string",
                        "enum": [
                            "text",
                            "markdown",
                            "html"
                        ],
                        "enumTitles": [
                            "Plain text",
                            "Markdown",
                            "HTML"
                        ]
                    },
                    "examples": [
                        "markdown"
                    ]
                }
            },
            "required": [
                "query"
            ],
            "$id": "https://apify.com/mcp/apify-slash-rag-web-browser/schema.json"
        }
    },
    {
        "name": "get-actor-output",
        "description": "Fetch the dataset of a specific Actor run based on datasetId.\nYou can also retrieve only specific fields from the output if needed. \n.USAGE:\nUse this tool to get Actor dataset outside of the preview, or to access fields from the Actor output dataset schema that are not included in the preview.\nEXAMPLES:\n- user_input: Get data of my last Actor run?\n- user_input: Get number_of_likes from my dataset?\n\nNote: This tool is automatically included if the Apify MCP Server is configured with any Actor tools (e.g. `apify-slash-rag-web-browser`) or tools that can interact with Actors (e.g. `call-actor`, `add-actor`).",
        "inputSchema": {
            "type": "object",
            "properties": {
                "datasetId": {
                    "type": "string",
                    "minLength": 1,
                    "description": "Actor output dataset ID to retrieve from."
                },
                "fields": {
                    "type": "string",
                    "description": "Comma-separated list of fields to include (supports dot notation like \"crawl.statusCode\"). For example: \"crawl.statusCode,text,metadata\""
                },
                "offset": {
                    "type": "number",
                    "default": 0,
                    "description": "Number of items to skip (default: 0)."
                },
                "limit": {
                    "type": "number",
                    "default": 100,
                    "description": "Maximum number of items to return (default: 100)."
                }
            },
            "required": [
                "datasetId"
            ],
            "additionalProperties": False,
            "$schema": "http://json-schema.org/draft-07/schema#"
        }
    }
]
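
Since one objective is to find problems in the tool definitions, a quick structural check before running any model can catch obvious mistakes. The following is a minimal sketch under stated assumptions: `check_tool_definitions` is a hypothetical helper, the rules it applies are illustrative rather than an MCP requirement, and `sample_tools` is an abbreviated stand-in for the full `TOOLS` list.

```python
def check_tool_definitions(tools: list[dict]) -> list[str]:
    """Return human-readable problems found in MCP-style tool definitions."""
    problems = []
    seen_names = set()
    for tool in tools:
        name = tool.get("name", "<missing name>")
        if name in seen_names:
            problems.append(f"{name}: duplicate tool name")
        seen_names.add(name)

        if not tool.get("description", "").strip():
            problems.append(f"{name}: empty description")

        schema = tool.get("inputSchema", {})
        properties = schema.get("properties", {})
        # Every required field should be declared under "properties".
        for required in schema.get("required", []):
            if required not in properties:
                problems.append(f"{name}: required field '{required}' is not in properties")
        # Undocumented parameters give the model nothing to go on.
        for prop_name, prop in properties.items():
            if not prop.get("description"):
                problems.append(f"{name}: property '{prop_name}' has no description")
    return problems


# Abbreviated sample definitions (not the full TOOLS list) to keep the example self-contained.
sample_tools = [
    {
        "name": "fetch-actor-details",
        "description": "Get detailed information about an Actor by its ID or full name.",
        "inputSchema": {
            "type": "object",
            "properties": {"actor": {"type": "string", "description": "Actor ID or full name."}},
            "required": ["actor"],
        },
    },
    {
        "name": "broken-tool",  # deliberately malformed, for illustration only
        "description": "",
        "inputSchema": {"type": "object", "properties": {}, "required": ["query"]},
    },
]

for problem in check_tool_definitions(sample_tools):
    print(problem)
```

In the notebook, the same function could be called on `TOOLS` directly.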

### Create test cases and upload dataset to Phoenix

In [8]:
import uuid

id = str(uuid.uuid4())

tool_responses = dict()

tool_get_actor_details = {
    # Basic actor information requests
    "What are the details of apify/instagram-scraper?": "fetch-actor-details",
    "Give me the documentation for apify/rag-web-browser": "fetch-actor-details",
    "Scrape details of apify/google-search-scraper": "fetch-actor-details",
    # Specific actor capabilities
    "What can apify/instagram-scraper do?": "fetch-actor-details",
    "How does apify/rag-web-browser work?": "fetch-actor-details",
    "Tell me about apify/social-media-hashtag-research features": "fetch-actor-details",
    # Pricing and usage information
    "How much does apify/instagram-scraper cost?": "fetch-actor-details",
    "What's the pricing model for apify/rag-web-browser?": "fetch-actor-details",
    # Input schema and configuration
    "What parameters does apify/instagram-scraper accept?": "fetch-actor-details",
    "Show me the input schema for apify/rag-web-browser": "fetch-actor-details",
}

# search-actors
tool_search_actors = {
    # Social media scraping
    "How to search for Instagram posts": "search-actors",
    "What are the best Instagram scrapers?": "search-actors",
    "Find actors for scraping social media": "search-actors",
    "Show me Twitter scraping tools": "search-actors",
    "What actors can scrape TikTok content?": "search-actors",
    "Find Facebook data extraction tools": "search-actors",
    "What actors can be used for scraping social media?": "search-actors",
    # General web scraping
    "Show me actors for web scraping": "search-actors",
    "Find actors that can scrape news articles": "search-actors",
    "What tools can extract data from e-commerce sites?": "search-actors",
    "Show me Amazon product scrapers": "search-actors",
    "Find actors for data extraction tasks": "search-actors",
    "Search for Playwright browser MCP server": "search-actors",
    "Look for actors that can scrape news articles": "search-actors",
    "Find actors that extract data from e-commerce sites": "search-actors",
    "I need to find solution to scrape details of Amazon products": "search-actors",
    "Fetch posts from Twitter about AI": "search-actors",
    "Get flight information from Skyscanner": "search-actors",
    "Can you find actors to scrape weather data?": "search-actors",
}

# rag-web-browser
tool_rag_web_browser = {
    "Search articles about AI from tech blogs": "apify-slash-rag-web-browser",
    "Fetch recent articles about climate change": "apify-slash-rag-web-browser",
    "Get the latest weather forecast for San Francisco": "apify-slash-rag-web-browser",
    "Get data from example.com": "apify-slash-rag-web-browser",
    "Get the latest tech industry news": "apify-slash-rag-web-browser",
}

# search-actors vs rag-web-browser
# rag-web-browser should handle general web browsing only;
# site-specific tasks (e.g. Instagram, flights) should go to search-actors
tool_search_actor_vs_rag_web_browser = {
    "Find posts about AI on Instagram": "search-actors",
    "Scrape Instagram posts about AI": "search-actors",

    "Search for AI articles on tech blogs": "apify-slash-rag-web-browser",
    "Fetch articles about AI from Wired and The Verge": "apify-slash-rag-web-browser",

    "Get the latest weather forecast for New York": "apify-slash-rag-web-browser",
    "Search for weather data scraping tools": "search-actors",

    "Fetch flight details for New York to London": "search-actors",
    "Find actors for flight data extraction": "search-actors",

    "Look for news articles on AI": "apify-slash-rag-web-browser",
    "Fetch AI-related news from CNN and BBC": "apify-slash-rag-web-browser",
}

# DOCS
tool_search_apify_docs = {
    "How to build an Apify Actor": "search-apify-docs",
    "How to define Actor input schema, provide examples": "search-apify-docs",
    "How to use Playwright library with Apify": "search-apify-docs",
    "Is there documentation for the MCP server": "search-apify-docs",
    "How to use Apify Proxy": "search-apify-docs",
    "Web scraping with Crawlee": "search-apify-docs",
    "Apify API integration guide": "search-apify-docs",
    "Error handling in Actors": "search-apify-docs",
}

tool_call_actor_scenarios = {
    # Direct actor calls
    "Run apify/instagram-scraper to scrape #dwaynejohnson": "call-actor",
    "Run apidojo/tweet-scraper to scrape twitter profiles": "call-actor",
    "Call apify/google-search-scraper to find restaurants in London": "call-actor",
    "Run apify/social-media-hashtag-research for #AI": "call-actor",
    "Scrape iPhone15 at Amazon using apify/e-commerce-scraping-tool": "call-actor",
    "Call epctex/weather-scraper for New York": "call-actor",
}

tool_actor_output_management = {
    # get-actor-output: Retrieve output from actor executions
    "Get output from my latest actor with datasetId des32s": "get-actor-output",
    "Retrieve results from dataset abc123": "get-actor-output",
    "Show me the data from my Instagram scraper run with datasetId d23d2": "get-actor-output",
    "Get the first 50 items from my datasetId abc123": "get-actor-output",
    "Retrieve all results from my web scraper with datasetID abc123": "get-actor-output",
}

tool_fetch_apify_docs = {
    "Get configuration info from: https://docs.apify.com/platform/integrations/mcp": "fetch-apify-docs",
}

# RUNS
tool_get_actor_run = {
    "What is the status of the latest run of apify/instagram-scraper?": "get-actor-run",
    "Can you fetch the status and datasetId of the run of apify/google-search-scraper?": "get-actor-run",
}

tool_get_actor_run_list = {
    "Get the Actor that failed": "get-actor-run-list",
}

tool_get_actor_log = {
    "Retrieve logs for the run of apify/instagram-scraper": "get-actor-log",
    "Show the last 20 lines of logs for apify/google-search-scraper": "get-actor-log",
    "Get the log for the latest run of apify/rag-web-browser": "get-actor-log",
}

# STORAGE
tool_get_dataset_list = {
    "List the datasets": "get-dataset-list",
}

tool_get_dataset = {
    "Can you provide details for the datasetId: 123?": "get-dataset",
}

tool_get_dataset_items = {
    "Fetch the first 10 items from the dataset apify/instagram-scraper-dataset": "get-dataset-items",
    "Retrieve the first 5 items from the dataset apify/rag-web-browser-dataset, omitting the 'metadata.timestamp' field": "get-dataset-items",
}

tool_get_dataset_schema = {
    "Get dataset schema for datasetId: xyz": "get-dataset-schema",
}

tool_get_key_value_store = {
    "Get details of the key-value store with id: xyz": "get-key-value-store",
}

tool_get_key_value_store_keys = {
    "Fetch the first 10 keys for the key-value store id: xyz": "get-key-value-store-keys",
}

tool_get_key_value_store_record = {
    "Retrieve the record for key 'user-details' in the key-value store id: xyz": "get-key-value-store-record",
}

tool_get_key_value_store_list = {
    "Show all key-value stores, including unnamed ones": "get-key-value-store-list",
}

# Adding all new tool responses to the overall tool responses


# msg = """
# - user: Search for the weather MCP server and then add it into the available tools
# - assistant: I'll help you to do that
# - tool_use: search-actors, "input": {"search": "weather mcp","limit": 5}
# - tool_use_id: 12, content: Tool \"search-actors\" successful, Actor found: jiri.spilka/wheather-mcp-server
# - assistant:
# """

# tool_responses = {
#     "Search for the weather MCP server and then call add-actor to it into available tools": "search-actors,add-actor",
#     msg: "add-actor"
# }

# CORE
tool_responses |= tool_get_actor_details
tool_responses |= tool_search_actors
tool_responses |= tool_rag_web_browser
tool_responses |= tool_search_actor_vs_rag_web_browser
tool_responses |= tool_actor_output_management
tool_responses |= tool_call_actor_scenarios
# DOCS
tool_responses |= tool_search_apify_docs
tool_responses |= tool_fetch_apify_docs
# RUNS
# tool_responses |= tool_get_actor_run
# tool_responses |= tool_get_actor_run_list
# tool_responses |= tool_get_actor_log
# # STORAGE
# tool_responses |= tool_get_dataset
# tool_responses |= tool_get_dataset_list
# tool_responses |= tool_get_dataset_items
# tool_responses |= tool_get_dataset_schema
# tool_responses |= tool_get_key_value_store
# tool_responses |= tool_get_key_value_store_keys
# tool_responses |= tool_get_key_value_store_record
# tool_responses |= tool_get_key_value_store_list


tool_calling_df = pd.DataFrame(tool_responses.items(), columns=["question", "tool_calls"])

dataset = px_client.upload_dataset(
    dataframe=tool_calling_df,
    dataset_name=f"tool_calling_ground_truth_{id}",
    input_keys=["question"],
    output_keys=["tool_calls"],
)



📤 Uploading dataset...
💾 Examples uploaded: https://app.phoenix.arize.com/s/apify/datasets/RGF0YXNldDo0/examples
🗄️ Dataset version ID: RGF0YXNldFZlcnNpb246NA==


In [9]:
#dataset_id = "RGF0YXNldDo1NQ=="
#dataset = px_client.get_dataset(id=dataset_id)
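
Merging the groups with `|=` silently overwrites a question that appears in two groups, and several of the disabled groups (for example `get-dataset-list`) expect tools that are not in `TOOLS`. The sketch below shows one way such issues could be caught before uploading. `merge_test_cases` and `find_unknown_tools` are hypothetical helpers written for this example, and the sample groups are shortened.

```python
def merge_test_cases(*groups: dict[str, str]) -> dict[str, str]:
    """Merge question -> expected-tool mappings, failing on conflicting labels."""
    merged: dict[str, str] = {}
    for group in groups:
        for question, tool in group.items():
            if question in merged and merged[question] != tool:
                raise ValueError(
                    f"Conflicting expected tools for {question!r}: "
                    f"{merged[question]!r} vs {tool!r}"
                )
            merged[question] = tool
    return merged


def find_unknown_tools(cases: dict[str, str], tool_names: set[str]) -> dict[str, str]:
    """Return the cases whose expected tool is not among the defined tool names."""
    unknown = {}
    for question, expected in cases.items():
        # The expected value may list several tools separated by commas.
        names = [name.strip() for name in expected.split(",")]
        if any(name not in tool_names for name in names):
            unknown[question] = expected
    return unknown


# Shortened sample groups; the tool names are taken from the cells above.
group_a = {"What are the best Instagram scrapers?": "search-actors"}
group_b = {"List the datasets": "get-dataset-list"}

cases = merge_test_cases(group_a, group_b)
known_tools = {"search-actors", "fetch-actor-details"}
print(find_unknown_tools(cases, known_tools))
```

With the real data, `known_tools` could be built as `{tool["name"] for tool in TOOLS}`.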

### Transform tools, define router

In [10]:
def transfrom_tools_to_openai_format(tools):
    """ Transforms the tools to the OpenAI format."""
    return [
        {
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool["description"],
                "parameters": tool["inputSchema"],
            },
        }
        for tool in tools
    ]

def transfrom_tools_to_antrophic_format(tools):
    """ Transforms the tools to the Anthropic format."""
    from copy import deepcopy
    t = deepcopy(tools)
    for tool_ in t:
        tool_["input_schema"] = tool_.pop("inputSchema")
    return t


def run_router_step_open_ai(example: Example) -> list[str]:
    """Ask an OpenAI model for the next step and return the called tool names."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT_SIMPLE},
        {"role": "user", "content": example.input.get("question")},
    ]

    response = openai_client.chat.completions.create(
        model=model_name,
        messages=messages,
        tools=transfrom_tools_to_openai_format(TOOLS),
    )
    message = response.choices[0].message
    print(example.input.get("question"), message)

    # Only the first tool call is recorded, even if the model requests several.
    tool_calls = []
    if message.tool_calls:
        tool_calls.append(message.tool_calls[0].function.name)
    return tool_calls

def run_router_step_antrophic(example: Example) -> list[str]:
    """Ask an Anthropic model for the next step and return the called tool names."""
    response = anthropic_client.messages.create(
        model=model_name,
        system=SYSTEM_PROMPT_SIMPLE,
        messages=[{"role": "user", "content": example.input.get("question")}],
        tools=transfrom_tools_to_antrophic_format(TOOLS),
        max_tokens=2048,
    )
    print(example.input.get("question"), response.content)

    # Unlike the OpenAI router, this collects every tool_use block in the response.
    tool_calls = []
    for content in response.content:
        if content.type == "tool_use":
            tool_calls.append(content.name)
    return tool_calls

# Define evaluator
Your evaluator can also be simple, since you have expected outputs. If you didn't have those expected outputs, you could instead use an LLM as a Judge here, or even basic code:

In [11]:
def tools_match(expected: dict, output: list) -> bool:
    """Return True when the called tools equal the expected tools, ignoring order."""
    expected_tools = (expected.get('tool_calls') and expected.get('tool_calls').split(', ')) or []
    matched = sorted(expected_tools) == sorted(output)
    print(f"Tool output = {output}, expected = {expected_tools}, output==expected = {matched}")
    return matched
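
To make the comparison rule concrete, here is a minimal sketch of the same order-insensitive match without the logging. The names `parse_expected_tools` and `tools_match_quiet` are hypothetical. Unlike the evaluator above, this variant also strips whitespace around each tool name, which is an assumption about how the expected values might be written.

```python
def parse_expected_tools(expected: dict) -> list[str]:
    """Split the comma-separated 'tool_calls' string into a list of tool names."""
    raw = expected.get("tool_calls") or ""
    return [name.strip() for name in raw.split(",") if name.strip()]


def tools_match_quiet(expected: dict, output: list[str]) -> bool:
    """Order-insensitive comparison of expected and actual tool names."""
    return sorted(parse_expected_tools(expected)) == sorted(output)


checks = [
    tools_match_quiet({"tool_calls": "search-actors"}, ["search-actors"]),
    tools_match_quiet({"tool_calls": "search-actors, add-actor"}, ["add-actor", "search-actors"]),
    tools_match_quiet({"tool_calls": "search-actors"}, []),
    tools_match_quiet({"tool_calls": ""}, []),
]
print(checks)  # [True, True, False, True]
```

The last case shows that an example expecting no tool call passes only when the model also makes none.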

### Evaluation (multiple models)



In [None]:
#SELECTED_MODELS = ["claude-3-5-haiku-latest"]

SELECTED_MODELS = ["gpt-4o-mini"]
#SELECTED_MODELS = ["claude-sonnet-4-5-20250929"]
#SELECTED_MODELS = [ "gpt-4o-mini", "claude-3-5-haiku-latest"]
#SELECTED_MODELS = [ "gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1"]
#SELECTED_MODELS = [ "gpt-5-nano", "gpt-5-mini", "gpt-5", "claude-sonnet-4-0"]
#SELECTED_MODELS = ["gpt-4o-mini", "gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1", "gpt-5-nano", "gpt-5-mini", "gpt-5", "claude-3-5-haiku-latest", "claude-sonnet-4-0" ]

#for model_name in SELECTED_MODELS:

experiment_name = f"Eval 21 {model_name}"
experiment_description = model_name

if model_name.startswith("gpt"):
    run_router_step = run_router_step_open_ai
elif model_name.startswith("claude"):
    run_router_step = run_router_step_antrophic
else:
    raise ValueError(f"Unsupported model: {model_name}")

experiment = run_experiment(
    dataset,
    run_router_step,
    evaluators=[tools_match],
    experiment_name=experiment_name,
    experiment_description=experiment_description,
)

🧪 Experiment started.
📺 View dataset experiments: https://app.phoenix.arize.com/s/apify/datasets/RGF0YXNldDo0/experiments
🔗 View this experiment: https://app.phoenix.arize.com/s/apify/datasets/RGF0YXNldDo0/compare?experimentId=RXhwZXJpbWVudDoy


running tasks |          | 0/64 (0.0%) | ⏳ 00:00<? | ?it/s

What are the details of apify/instagram-scraper? ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='call_x6aWzyANNrTCvHBbB1ZfoULO', function=Function(arguments='{"actor":"apify/instagram-scraper"}', name='fetch-actor-details'), type='function')])
Give me the documentation for apify/rag-web-browser ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='call_6pDJzTghGCBJODzbRqLGqunW', function=Function(arguments='{"actor":"apify/rag-web-browser"}', name='fetch-actor-details'), type='function')])
Scrape details of apify/google-search-scraper ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='call_9yurPDtx5f3MiBmEsHZ2MSCv', function=Function(a

running experiment evaluations |          | 0/64 (0.0%) | ⏳ 00:00<? | ?it/s

Tool output = ['fetch-apify-docs'], expected = ['fetch-apify-docs'], output==expected = True
Tool output = ['search-apify-docs'], expected = ['search-apify-docs'], output==expected = True
Tool output = ['search-apify-docs'], expected = ['search-apify-docs'], output==expected = True
Tool output = ['search-apify-docs'], expected = ['search-apify-docs'], output==expected = True
Tool output = ['search-apify-docs'], expected = ['search-apify-docs'], output==expected = True
Tool output = ['search-apify-docs'], expected = ['search-apify-docs'], output==expected = True
Tool output = ['search-apify-docs'], expected = ['search-apify-docs'], output==expected = True
Tool output = ['search-apify-docs'], expected = ['search-apify-docs'], output==expected = True
Tool output = ['search-apify-docs'], expected = ['search-apify-docs'], output==expected = True
Tool output = ['call-actor'], expected = ['call-actor'], output==expected = True
Tool output = ['call-actor'], expected = ['call-actor'], output==e
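
The cell above runs a single model because the `for` loop is commented out. One way to restore the loop is sketched below, with placeholder task functions standing in for `run_router_step_open_ai` and `run_router_step_antrophic`. The `ROUTERS` table and `select_router` helper are hypothetical names, and the call to `run_experiment` is replaced by a direct call to keep the sketch self-contained.

```python
from typing import Callable, Dict, List


# Placeholder task functions; in the notebook these would call the real APIs.
def openai_task(question: str) -> List[str]:
    return ["search-actors"]


def anthropic_task(question: str) -> List[str]:
    return ["search-actors"]


# Model-name prefix -> task function
ROUTERS: Dict[str, Callable[[str], List[str]]] = {
    "gpt": openai_task,
    "claude": anthropic_task,
}


def select_router(model_name: str) -> Callable[[str], List[str]]:
    """Pick the task function by model-name prefix, failing loudly on unknown models."""
    for prefix, router in ROUTERS.items():
        if model_name.startswith(prefix):
            return router
    raise ValueError(f"No router configured for model: {model_name}")


selected_models = ["gpt-4o-mini", "claude-3-5-haiku-latest"]
results = {}
for model_name in selected_models:
    router = select_router(model_name)
    # In the notebook, this is where run_experiment(...) would be called.
    results[model_name] = router("What are the best Instagram scrapers")

print(results)
```

In the notebook the task functions read the global `model_name`, so the loop variable must keep that name for each experiment to use the intended model.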

In [27]:
v = experiment.get_evaluations()
v

| run_id | name | score | label | output | input | expected | example_id |
|---|---|---|---|---|---|---|---|
| RXhwZXJpbWVudFJ1bjoxMjg= | tools_match | 1.0 | True | [fetch-apify-docs] | {'question': 'Get configuration info from: htt... | {'tool_calls': 'fetch-apify-docs'} | RGF0YXNldEV4YW1wbGU6MjQ5 |
| RXhwZXJpbWVudFJ1bjoxMjc= | tools_match | 1.0 | True | [search-apify-docs] | {'question': 'Error handling in Actors'} | {'tool_calls': 'search-apify-docs'} | RGF0YXNldEV4YW1wbGU6MjQ4 |
| RXhwZXJpbWVudFJ1bjoxMjY= | tools_match | 1.0 | True | [search-apify-docs] | {'question': 'Apify API integration guide'} | {'tool_calls': 'search-apify-docs'} | RGF0YXNldEV4YW1wbGU6MjQ3 |
| RXhwZXJpbWVudFJ1bjoxMjU= | tools_match | 1.0 | True | [search-apify-docs] | {'question': 'Web scraping with Crawlee'} | {'tool_calls': 'search-apify-docs'} | RGF0YXNldEV4YW1wbGU6MjQ2 |
| RXhwZXJpbWVudFJ1bjoxMjQ= | tools_match | 1.0 | True | [search-apify-docs] | {'question': 'How to use Apify Proxy'} | {'tool_calls': 'search-apify-docs'} | RGF0YXNldEV4YW1wbGU6MjQ1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| RXhwZXJpbWVudFJ1bjo2OQ== | tools_match | 1.0 | True | [fetch-actor-details] | {'question': 'How does apify/rag-web-browser w... | {'tool_calls': 'fetch-actor-details'} | RGF0YXNldEV4YW1wbGU6MTkw |
| RXhwZXJpbWVudFJ1bjo2OA== | tools_match | 1.0 | True | [fetch-actor-details] | {'question': 'What can apify/instagram-scraper... | {'tool_calls': 'fetch-actor-details'} | RGF0YXNldEV4YW1wbGU6MTg5 |
| RXhwZXJpbWVudFJ1bjo2Nw== | tools_match | 1.0 | True | [fetch-actor-details] | {'question': 'Scrape details of apify/google-s... | {'tool_calls': 'fetch-actor-details'} | RGF0YXNldEV4YW1wbGU6MTg4 |
| RXhwZXJpbWVudFJ1bjo2Ng== | tools_match | 1.0 | True | [fetch-actor-details] | {'question': 'Give me the documentation for ap... | {'tool_calls': 'fetch-actor-details'} | RGF0YXNldEV4YW1wbGU6MTg3 |
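
A per-tool breakdown of the scores shows which tool descriptions cause the most misrouting. The sketch below works on plain records, which could be obtained with `v.to_dict("records")`, assuming the `expected` column holds dictionaries as the table suggests. The rows and the `accuracy_by_tool` helper are illustrative and not taken from a real run.

```python
from collections import defaultdict

# Illustrative rows shaped like the evaluations table; not from a real run.
rows = [
    {"score": 1.0, "expected": {"tool_calls": "search-apify-docs"}},
    {"score": 1.0, "expected": {"tool_calls": "search-apify-docs"}},
    {"score": 0.0, "expected": {"tool_calls": "call-actor"}},
    {"score": 1.0, "expected": {"tool_calls": "fetch-actor-details"}},
]


def accuracy_by_tool(records):
    """Average the score for each expected tool name."""
    scores = defaultdict(list)
    for record in records:
        scores[record["expected"]["tool_calls"]].append(record["score"])
    return {tool: sum(values) / len(values) for tool, values in scores.items()}


overall_accuracy = sum(r["score"] for r in rows) / len(rows)
per_tool_accuracy = accuracy_by_tool(rows)
print(overall_accuracy)  # 0.75
print(per_tool_accuracy)
```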


# Evaluation using traces collected by MCP tester client

In [None]:
query = (
    SpanQuery()
    .where(
        "span_kind == 'AGENT'",
    )
    .select(question="input.value", output_messages="llm.output_messages", tool_definitions="llm.tools")
)

# The Phoenix Client can take this query and return the dataframe.
tool_calls_df = px_client.query_spans(query, project_name=project_name, timeout=None)

tool_calls_df.dropna(subset=["output_messages"], inplace=True)
tool_calls_df


| context.span_id | question | output_messages | tool_definitions |
|---|---|---|---|
| f800bf93ae2e1bac | Load comments for video https://www.youtube.co... | [{"role":"user","content":"Load comments for v... | [{"name":"get-actor-details","description":"Re... |
| db37646f75a9aaa8 | hey, search for playwrigth mcp | [{"role":"user","content":"hey, search for pla... | [{"name":"get-actor-details","description":"Ge... |
| 4f1b0d9e2317dfcc | Test 3, find instagram actors | [{"role":"user","content":"Test 3, find instag... | [{"name":"get-actor-details","description":"Re... |
| 3bc45129053c71d8 | Find Mexical restaurants in san francisco | [{"role":"user","content":"Find Mexical restau... | [{"name":"get-actor-details","description":"Re... |
| f7b9da0fc0609a88 | no, use agent for google maps | [{"role":"user","content":"Find Mexical restau... | [{"name":"get-actor-details","description":"Re... |
| 254571bc85d90808 | no, search for google maps actor and use this ... | [{"role":"user","content":"Find Mexical restau... | [{"name":"get-actor-details","description":"Re... |


In [None]:
OVERRIDE_TOOLS = False




def simulate_tools_output_openai(query):
    """ Run query with system prompt and tools."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query}],
        tools=transfrom_tools_to_openai_format(TOOLS),
        max_tokens=2048,
    )
    # Use .get() so a response without tool calls yields None instead of a KeyError
    return response.choices[0].message.to_dict().get("tool_calls")


def simulate_tools_output_anthropic(prompt):
    """ Run query with system prompt and tools."""
    message = anthropic_client.messages.create(
        model="claude-3-5-haiku-latest",
        system=SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        tools=transfrom_tools_to_antrophic_format(TOOLS),
        max_tokens=2048,
    )
    return message.content[-1].to_dict()

if OVERRIDE_TOOLS:
    # tool_calls_df["tool_call"] = tool_calls_df["question"].progress_apply(simulate_tools_output_openai)
    tool_calls_df["tool_call"] = tool_calls_df["question"].progress_apply(simulate_tools_output_anthropic)

tool_calls_df

| context.span_id | question | output_messages | tool_definitions |
|---|---|---|---|
| f800bf93ae2e1bac | Load comments for video https://www.youtube.co... | [{"role":"user","content":"Load comments for v... | [{"name":"get-actor-details","description":"Re... |
| db37646f75a9aaa8 | hey, search for playwrigth mcp | [{"role":"user","content":"hey, search for pla... | [{"name":"get-actor-details","description":"Ge... |
| 4f1b0d9e2317dfcc | Test 3, find instagram actors | [{"role":"user","content":"Test 3, find instag... | [{"name":"get-actor-details","description":"Re... |
| 3bc45129053c71d8 | Find Mexical restaurants in san francisco | [{"role":"user","content":"Find Mexical restau... | [{"name":"get-actor-details","description":"Re... |
| f7b9da0fc0609a88 | no, use agent for google maps | [{"role":"user","content":"Find Mexical restau... | [{"name":"get-actor-details","description":"Re... |
| 254571bc85d90808 | no, search for google maps actor and use this ... | [{"role":"user","content":"Find Mexical restau... | [{"name":"get-actor-details","description":"Re... |


## Optional: Simulate other tool definitions using user prompts

## Transform data

Extract the tool calls from each conversation.

In [None]:
def get_tool_calls(conversation_data):
    """
    Extract function calls from conversation data in the format:
    [{tool: "name_of_tool", input: {key: value}, output: {...}}, ...]

    Args:
        conversation_data (str or list): JSON string or parsed conversation data

    Returns:
        list: Array of function call objects with tool, input, and output
    """
    # Parse JSON if it's a string
    if isinstance(conversation_data, str):
        data = json.loads(conversation_data)
    else:
        data = conversation_data

    function_calls = []
    tool_calls_map = {}  # Map tool IDs to their calls for matching with results

    # First pass: collect tool calls
    for message in data:
        if isinstance(message.get("content"), list):
            for content_item in message["content"]:
                if content_item.get("type") == "tool_use":
                    tool_id = content_item.get("id")
                    tool_call = {
                        "tool": content_item.get("name"),
                        "input": content_item.get("input", {}),
                        # 'output': None  # Will be filled in second pass
                    }
                    tool_calls_map[tool_id] = tool_call
                    function_calls.append(tool_call)

    # Second pass: match tool results with tool calls
    # for message in data:
    #     if isinstance(message.get('content'), list):
    #         for content_item in message['content']:
    #             if content_item.get('type') == 'tool_result':
    #                 tool_id = content_item.get('tool_use_id')
    #                 if tool_id in tool_calls_map:
    #                     # Extract output from tool result
    #                     result_content = content_item.get('content', [])
    #                     output = {}

    #                     # Parse the result content
    #                     for result_item in result_content:
    #                         if result_item.get('type') == 'text':
    #                             text = result_item.get('text', '')

    #                             # Try to parse JSON from the text if it looks like JSON
    #                             if text.startswith('{') and text.endswith('}'):
    #                                 try:
    #                                     output = json.loads(text)
    #                                 except json.JSONDecodeError:
    #                                     output = {'text': text}
    #                             else:
    #                                 # If not JSON, store as text
    #                                 if 'text' not in output:
    #                                     output['text'] = text
    #                                 else:
    #                                     output['text'] += '\n' + text

    #                     # Update the tool call with output
    #                     tool_calls_map[tool_id]['output'] = output

    return function_calls


if not OVERRIDE_TOOLS:
    # Transform only original data, not overridden tools
    tool_calls_df["tool_call"] = tool_calls_df["output_messages"].apply(get_tool_calls)
    tool_calls_df.head()
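
To illustrate what the extraction returns, the sketch below applies a condensed restatement of `get_tool_calls` (named `extract_tool_calls` here, a hypothetical name) to a made-up conversation in the Anthropic message format. The conversation content and the tool ID are invented for the example.

```python
import json


def extract_tool_calls(conversation):
    """Condensed restatement of get_tool_calls: collect every tool_use item."""
    messages = json.loads(conversation) if isinstance(conversation, str) else conversation
    calls = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-text messages carry no tool calls
        for item in content:
            if item.get("type") == "tool_use":
                calls.append({"tool": item.get("name"), "input": item.get("input", {})})
    return calls


# Hypothetical conversation: user prompt, assistant tool call, tool result
sample = json.dumps([
    {"role": "user", "content": "hey, search for playwright mcp"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "I'll search for it."},
        {"type": "tool_use", "id": "toolu_01", "name": "search-actors",
         "input": {"search": "playwright mcp", "limit": 5}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01",
         "content": [{"type": "text", "text": "..."}]},
    ]},
])

extracted = extract_tool_calls(sample)
print(extracted)
# [{'tool': 'search-actors', 'input': {'search': 'playwright mcp', 'limit': 5}}]
```

Tool results are skipped on purpose, matching the commented-out second pass above: the evaluator only needs to know which tool was chosen and with which arguments.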

## Evaluation

Run an LLM classification template over each conversation to check whether the tool call was correct.
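
Before running the classifier, it can help to confirm that the DataFrame has a column for every placeholder in the template. The sketch below assumes that `llm_classify` fills each `{placeholder}` from the DataFrame column of the same name. The helper names and the shortened `SAMPLE_TEMPLATE` are hypothetical.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def missing_columns(template: str, columns: list[str]) -> list[str]:
    """List the placeholders that have no matching column."""
    return sorted(template_fields(template) - set(columns))


# Shortened stand-in for the evaluation template used in this section
SAMPLE_TEMPLATE = (
    "[Question]: {question}\n"
    "[Tool Called]: {tool_call}\n"
    "[Tool Definitions]: {tool_definitions}"
)

# Columns as they look before the tool_call column has been added
missing = missing_columns(SAMPLE_TEMPLATE, ["question", "output_messages", "tool_definitions"])
print(missing)  # ['tool_call']
```

A non-empty result means a step such as the `tool_call` transformation has not been run yet.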

In [None]:
TOOL_CALLING_BASE_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

    [Tool Definitions]: {tool_definitions}
"""

tool_call_eval = llm_classify(
    data=tool_calls_df,
    template=TOOL_CALLING_BASE_TEMPLATE,
    rails=["correct", "incorrect"],
    model=eval_model,
    provide_explanation=True,
)

tool_call_eval["score"] = tool_call_eval.apply(
    lambda x: 1 if x["label"] == "correct" else 0, axis=1
)

tool_call_eval.head()

llm_classify |          | 0/6 (0.0%) | ⏳ 00:00<? | ?it/s

Exception in worker on attempt 1: raised NotFoundError("Error code: 404 - {'error': {'message': 'The model `claude-sonnet-4-20250514` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}")
Requeuing...
Exception in worker on attempt 1: raised NotFoundError("Error code: 404 - {'error': {'message': 'The model `claude-sonnet-4-20250514` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}")
Requeuing...
Exception in worker on attempt 1: raised NotFoundError("Error code: 404 - {'error': {'message': 'The model `claude-sonnet-4-20250514` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}")
Requeuing...
Exception in worker on attempt 1: raised NotFoundError("Error code: 404 - {'error': {'message': 'The model `claude-sonnet-4-20250514` does not exist or you do not have access 

| context.span_id | label | explanation | exceptions | execution_status | execution_seconds | prompt_tokens | completion_tokens | total_tokens | score |
|---|---|---|---|---|---|---|---|---|---|
| f800bf93ae2e1bac | | | [NotFoundError("Error code: 404 - {'error': {'... | DID NOT RUN | 2.214166 | | | | 0 |
| db37646f75a9aaa8 | | | [NotFoundError("Error code: 404 - {'error': {'... | FAILED | 2.218618 | | | | 0 |
| 4f1b0d9e2317dfcc | | | [NotFoundError("Error code: 404 - {'error': {'... | DID NOT RUN | 2.18335 | | | | 0 |
| 3bc45129053c71d8 | | | [NotFoundError("Error code: 404 - {'error': {'... | DID NOT RUN | 2.180711 | | | | 0 |
| f7b9da0fc0609a88 | | | [NotFoundError("Error code: 404 - {'error': {'... | DID NOT RUN | 2.213214 | | | | 0 |
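
In the output above every row has a score of 0 even though no evaluation ran, because the scoring lambda maps a missing label to 0. A minimal sketch of a scoring step that keeps failed rows separate is shown below. The toy frame is illustrative, and the `COMPLETED` status value is an assumption, since only `FAILED` and `DID NOT RUN` appear in the output.

```python
import pandas as pd

# Toy result frame mimicking the columns above; values are illustrative.
eval_results = pd.DataFrame(
    {
        "label": ["correct", None, "incorrect", None],
        # "COMPLETED" is an assumed status value for successful rows
        "execution_status": ["COMPLETED", "FAILED", "COMPLETED", "DID NOT RUN"],
    },
    index=pd.Index(["a1", "b2", "c3", "d4"], name="context.span_id"),
)

# Map labels to scores; rows without a label become NaN rather than 0.
eval_results["score"] = eval_results["label"].map({"correct": 1, "incorrect": 0})

scored = eval_results.dropna(subset=["label"])
unscored = eval_results[eval_results["label"].isna()]

print(f"accuracy on scored rows: {scored['score'].mean()}")
print(f"rows that did not run: {len(unscored)}")
```

Reporting the two groups separately keeps an infrastructure problem, such as an unavailable evaluator model, from looking like a batch of incorrect tool calls.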


## Push evaluation results back to Phoenix

The evaluation results will then be visible in the Phoenix UI.

In [None]:
px_client.log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval (JS)", dataframe=tool_call_eval),
)