# Langfuse × Pydantic AI – Agent Evals 

## 1. Setup – install packages & add credentials

In [None]:
# If you run this on Colab or locally, remove any packages you already have installed
%pip install -q --upgrade "pydantic-ai[mcp]" langfuse openai nest_asyncio aiohttp

Note: you may need to restart the kernel to use updated packages.


In [None]:
import os

# Get keys for your project from the project settings page
# https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = ""
os.environ["LANGFUSE_SECRET_KEY"] = ""
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" # 🇪🇺 EU region
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com" # 🇺🇸 US region

# Your OpenAI API key
os.environ["OPENAI_API_KEY"] = ""
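
Because the keys above start out as empty strings, it is easy to run later cells before filling them in. The following is a minimal sketch of a pre-flight check. The helper name `missing_credentials` is hypothetical and not part of Langfuse or Pydantic AI; only the variable names come from the setup cell.

```python
import os

# Variable names taken from the setup cell; the helper itself is illustrative.
REQUIRED_KEYS = (
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
    "OPENAI_API_KEY",
)


def missing_credentials(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


missing = missing_credentials()
if missing:
    print("Still unset:", ", ".join(missing))
```

Running this right after the credentials cell surfaces a forgotten key before the `auth_check()` call in the next section fails.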

## 2. Enable Langfuse Tracing

All integrations: https://langfuse.com/integrations

In [2]:
from langfuse import get_client
from pydantic_ai.agent import Agent

# Initialise Langfuse client and verify connectivity
langfuse = get_client()
assert langfuse.auth_check(), "Langfuse auth failed - check your keys ✋"

# Turn on OpenTelemetry instrumentation for *all* future Agent instances
Agent.instrument_all()
print("✅ Pydantic AI instrumentation enabled - traces will stream to Langfuse")

✅ Pydantic AI instrumentation enabled - traces will stream to Langfuse


## Mock tool response

In [3]:
overview = """
# Langfuse

> Langfuse is an **open-source LLM engineering platform** ([GitHub](https://github.com/langfuse/langfuse)) that helps teams collaboratively debug, analyze, and iterate on their LLM applications. All platform features are natively integrated to accelerate the development workflow.

## Langfuse Docs MCP Server

Connect to the Langfuse Docs MCP server to access documentation directly in your AI editor:

- **Endpoint**: `https://langfuse.com/api/mcp`
- **Transport**: `streamableHttp`
- **Documentation**: [Langfuse Docs MCP Server](https://langfuse.com/docs/docs-mcp)

The MCP server provides tools to search Langfuse documentation, GitHub issues, and discussions. See the [installation guide](https://langfuse.com/docs/docs-mcp) for setup instructions in Cursor, VS Code, Claude Desktop, and other MCP clients.

## Docs

- [Docs](https://langfuse.com/docs)
- [Custom Dashboards](https://langfuse.com/docs/analytics/custom-dashboards)
- [Example Intent Classification](https://langfuse.com/docs/analytics/example-intent-classification)
- [Metrics Api](https://langfuse.com/docs/analytics/metrics-api)
- [Overview](https://langfuse.com/docs/analytics/overview)
- [Api](https://langfuse.com/docs/api)
- [Ask Ai](https://langfuse.com/docs/ask-ai)
- [Audit Logs](https://langfuse.com/docs/audit-logs)
- [Core Features](https://langfuse.com/docs/core-features)
- [Data Deletion](https://langfuse.com/docs/data-deletion)
- [Data Retention](https://langfuse.com/docs/data-retention)
- [Example Synthetic Datasets](https://langfuse.com/docs/datasets/example-synthetic-datasets)
- [Get Started](https://langfuse.com/docs/datasets/get-started)
- [Overview](https://langfuse.com/docs/datasets/overview)
- [Prompt Experiments](https://langfuse.com/docs/datasets/prompt-experiments)
- [Python Cookbook](https://langfuse.com/docs/datasets/python-cookbook)
- [Demo](https://langfuse.com/docs/demo)
- [Docs Mcp](https://langfuse.com/docs/docs-mcp)
- [Fine Tuning](https://langfuse.com/docs/fine-tuning)
- [Get Started](https://langfuse.com/docs/get-started)
- [Model Usage And Cost](https://langfuse.com/docs/model-usage-and-cost)
- [Example Arize](https://langfuse.com/docs/opentelemetry/example-arize)
- [Example Mlflow](https://langfuse.com/docs/opentelemetry/example-mlflow)
- [Example Openlit](https://langfuse.com/docs/opentelemetry/example-openlit)
- [Example Openllmetry](https://langfuse.com/docs/opentelemetry/example-openllmetry)
- [Example Python Sdk](https://langfuse.com/docs/opentelemetry/example-python-sdk)
- [Get Started](https://langfuse.com/docs/opentelemetry/get-started)
- [Playground](https://langfuse.com/docs/playground)
- [A B Testing](https://langfuse.com/docs/prompts/a-b-testing)
- [Example Langchain](https://langfuse.com/docs/prompts/example-langchain)
- [Example Langchain Js](https://langfuse.com/docs/prompts/example-langchain-js)
- [Example Openai Functions](https://langfuse.com/docs/prompts/example-openai-functions)
- [Get Started](https://langfuse.com/docs/prompts/get-started)
- [Mcp Server](https://langfuse.com/docs/prompts/mcp-server)
- [N8n Node](https://langfuse.com/docs/prompts/n8n-node)
- [Query Traces](https://langfuse.com/docs/query-traces)
- [Rbac](https://langfuse.com/docs/rbac)
- [Roadmap](https://langfuse.com/docs/roadmap)
- [Annotation](https://langfuse.com/docs/scores/annotation)
- [Custom](https://langfuse.com/docs/scores/custom)
- [Data Model](https://langfuse.com/docs/scores/data-model)
- [External Evaluation Pipelines](https://langfuse.com/docs/scores/external-evaluation-pipelines)
- [Model Based Evals](https://langfuse.com/docs/scores/model-based-evals)
- [Overview](https://langfuse.com/docs/scores/overview)
- [User Feedback](https://langfuse.com/docs/scores/user-feedback)
- [Overview](https://langfuse.com/docs/sdk/overview)
- [Decorators](https://langfuse.com/docs/sdk/python/decorators)
- [Example](https://langfuse.com/docs/sdk/python/example)
- [Low Level Sdk](https://langfuse.com/docs/sdk/python/low-level-sdk)
- [Sdk V3](https://langfuse.com/docs/sdk/python/sdk-v3)
- [Example Notebook](https://langfuse.com/docs/sdk/typescript/example-notebook)
- [Guide](https://langfuse.com/docs/sdk/typescript/guide)
- [Guide Web](https://langfuse.com/docs/sdk/typescript/guide-web)
- [Example Python](https://langfuse.com/docs/security/example-python)
- [Getting Started](https://langfuse.com/docs/security/getting-started)
- [Overview](https://langfuse.com/docs/security/overview)
- [Tracing](https://langfuse.com/docs/tracing)
- [Tracing Data Model](https://langfuse.com/docs/tracing-data-model)
- [Agent Graphs](https://langfuse.com/docs/tracing-features/agent-graphs)
- [Comments](https://langfuse.com/docs/tracing-features/comments)
- [Environments](https://langfuse.com/docs/tracing-features/environments)
- [Log Levels](https://langfuse.com/docs/tracing-features/log-levels)
- [Masking](https://langfuse.com/docs/tracing-features/masking)
- [Metadata](https://langfuse.com/docs/tracing-features/metadata)
- [Multi Modality](https://langfuse.com/docs/tracing-features/multi-modality)
- [Releases And Versioning](https://langfuse.com/docs/tracing-features/releases-and-versioning)
- [Sampling](https://langfuse.com/docs/tracing-features/sampling)
- [Sessions](https://langfuse.com/docs/tracing-features/sessions)
- [Tags](https://langfuse.com/docs/tracing-features/tags)
- [Trace Ids](https://langfuse.com/docs/tracing-features/trace-ids)
- [Url](https://langfuse.com/docs/tracing-features/url)
- [Users](https://langfuse.com/docs/tracing-features/users)
"""

## 3. Create an agent that can search the Langfuse docs

We use the Langfuse Docs MCP Server to provide tools to the agent: https://langfuse.com/docs/docs-mcp

In [None]:
from pydantic_ai import Agent, RunContext
from pydantic_ai.mcp import MCPServerStreamableHTTP, CallToolFunc, ToolResult
from langfuse import observe
from typing import Any

# Public MCP server that exposes Langfuse docs tools
LANGFUSE_MCP_URL = "https://langfuse.com/api/mcp"

@observe
async def run_agent(question: str, system_prompt: str, model="openai:o3-mini"):
    langfuse.update_current_trace(input=question)

    tool_call_history = []

    # Log all tool calls for trajectory analysis
    async def process_tool_call(
        ctx: RunContext[int],
        call_tool: CallToolFunc,
        tool_name: str,
        args: dict[str, Any],
    ) -> ToolResult:
        """Record the tool call for trajectory analysis, then forward it unchanged."""
        print(f"MCP Tool call: {tool_name} with args: {args}")
        tool_call_history.append({
            "tool_name": tool_name,
            "args": args
        })
        return await call_tool(tool_name, args)
    
    langfuse_docs_server = MCPServerStreamableHTTP(
        LANGFUSE_MCP_URL,
        process_tool_call=process_tool_call
    )

    agent = Agent(
        model=model,
        mcp_servers=[langfuse_docs_server],
        system_prompt=system_prompt
    )

    async with agent.run_mcp_servers():
        print("\n---")
        print("Q:", question)
        result = await agent.run(question)
        print("A:", result.output)

        langfuse.update_current_trace(
            output=result.output,
            metadata={"tool_call_history": tool_call_history},
        )

        return result.output, tool_call_history

In [5]:
await run_agent(
    question="What is Langfuse and how does it help monitor LLM applications?",
    system_prompt="You are an expert on Langfuse. Answer user questions accurately and concisely using the available MCP tools. Cite sources when appropriate. Please make sure to use the tools in the best way possible to answer.",
    model="openai:gpt-4.1-nano"
);


---
Q: What is Langfuse and how does it help monitor LLM applications?
MCP Tool call: getLangfuseOverview with args: {}
MCP Tool call: searchLangfuseDocs with args: {'query': 'What is Langfuse and how does it help monitor LLM applications'}
A: Langfuse is an open-source LLM engineering platform designed to help teams collaboratively debug, analyze, and improve their large language model (LLM) applications. It offers comprehensive features such as tracing, prompt management, evaluation, and monitoring, all integrated to streamline the development workflow. 

It helps monitor LLM applications by providing detailed tracing of LLM interactions, capturing performance metrics like latency and error rates, and allowing for evaluation of output quality over time. Langfuse enables real-time visibility into how models are functioning in production, identifying bottlenecks, errors, or biases, and ultimately ensuring the robustness and reliability of AI systems. It is model-agnostic and supports 

## Evaluation

1. Create Test Cases
    - input
    - reference for reference-based evaluations
2. Set up evaluators
3. Run experiments

In [6]:
tests_cases = [
    {
        "input": {"question": "What is Langfuse?"},
        "expected_output": {
            "response_facts": [
                "Open Source LLM Engineering Platform",
                "Product modules: Tracing, Evaluation and Prompt Management"
            ],
            "trajectory": [
                "getLangfuseOverview"
            ],
        }
    },
    {
        "input": {
            "question": "How to trace a python application with Langfuse?"
        },
        "expected_output": {
            "response_facts": [
                "Python SDK, you can use the observe() decorator",
                "Lots of integrations, LangChain, LlamaIndex, Pydantic AI, and many more."
            ],
            "trajectory": [
                "getLangfuseOverview",
                "searchLangfuseDocs"
            ],
            "search_term": "Python Tracing"
        }
    },
    {
        "input": {"question": "How to connect to the Langfuse Docs MCP server?"},
        "expected_output": {
            "response_facts": [
                "Connect via the MCP server endpoint: https://langfuse.com/api/mcp",
                "Transport protocol: `streamableHttp`"
            ],
            "trajectory": ["getLangfuseOverview"]
        }
    },
    {
        "input": {
            "question": "How long are traces retained in langfuse?",
        },
        "expected_output": {
            "response_facts": [
                "By default, traces are retained indefinitely",
                "You can set custom data retention policy in the project settings"
            ],
            "trajectory": ["getLangfuseOverview", "searchLangfuseDocs"],
            "search_term": "Data retention"
        }
    }
]
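
Since the evaluators later read `response_facts`, `trajectory`, and (optionally) `search_term` from each expected output, a malformed case would only show up as a confusing score. The sketch below checks one case against the shape used above. `validate_case` is a hypothetical helper, and the rule that `search_term` is optional is inferred from the cases in this notebook.

```python
def validate_case(case):
    """Return a list of problems found in one test case (empty if none)."""
    problems = []

    question = case.get("input", {}).get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("input.question must be a non-empty string")

    expected = case.get("expected_output", {})
    for key in ("response_facts", "trajectory"):
        value = expected.get(key)
        if not isinstance(value, list) or not value:
            problems.append(f"expected_output.{key} must be a non-empty list")

    # search_term is optional: only cases that expect a docs search carry it.
    if "search_term" in expected and not isinstance(expected["search_term"], str):
        problems.append("expected_output.search_term must be a string")

    return problems


example_case = {
    "input": {"question": "How long are traces retained in langfuse?"},
    "expected_output": {
        "response_facts": ["You can set custom data retention policy in the project settings"],
        "trajectory": ["getLangfuseOverview", "searchLangfuseDocs"],
        "search_term": "Data retention",
    },
}
print(validate_case(example_case))  # → []
```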

Upload the test cases to a Langfuse dataset:

In [7]:
DATSET_NAME = "pydantic-ai-mcp-agent-evaluation"

In [8]:
dataset = langfuse.create_dataset(
    name=DATSET_NAME
)
for case in tests_cases:
    langfuse.create_dataset_item(
        dataset_name=DATSET_NAME,
        input=case["input"],
        expected_output=case["expected_output"]
    )
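
Re-running the upload cell may add the same items to the dataset again. One way to avoid that, sketched below, is to derive a deterministic ID from each item's input. This assumes `create_dataset_item` accepts an `id` argument and updates an existing item with the same ID; please verify that against the SDK reference for your version. `stable_item_id` is a hypothetical helper.

```python
import hashlib
import json


def stable_item_id(dataset_name, item_input):
    """Derive a deterministic ID from the dataset name and the item input."""
    payload = json.dumps(
        {"dataset": dataset_name, "input": item_input},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


item_id = stable_item_id(
    "pydantic-ai-mcp-agent-evaluation",
    {"question": "What is Langfuse?"},
)
# Assumption: passing `id=item_id` to `langfuse.create_dataset_item(...)`
# updates the existing item instead of creating a duplicate.
print(item_id)
```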

### Set up Evaluations in Langfuse

#### Final response evaluation

```md
You are a teacher grading a student based on the factual correctness of his statements. In the following please find some example gradings that you did in the past.

### Examples

#### **Example 1:**
- **Response:** "The sun is shining brightly."
- **Facts to verify:** ["The sun is up.", "It is a beautiful day."]

Grading
- Reasoning: The response accurately includes both facts and aligns with the context of a beautiful day.
- Score: 1

#### **Example 2:**
- **Response:** "When I was in the kitchen, the dog was there"
- **Facts to verify:** ["The cat is on the table.", "The dog is in the kitchen."]

Grading
- Reasoning: The response includes that the dog is in the kitchen but does not mention that the cat is on the table.
- Score: 0

### New Student Response

- **Response**: {{response}}
- **Facts to verify:** {{facts_to_verify}}
```
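
Langfuse fills the `{{response}}` and `{{facts_to_verify}}` placeholders itself once the variables are mapped in the evaluator settings. To preview what the judge model will roughly see, the substitution can be approximated locally. This is a sketch only: `render_prompt` is a hypothetical helper, not Langfuse's implementation, and serializing non-string values as JSON is an assumption.

```python
import json
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, variables):
    """Replace each {{name}} placeholder with the matching variable."""

    def substitute(match):
        value = variables[match.group(1)]  # a KeyError flags an unmapped placeholder
        return value if isinstance(value, str) else json.dumps(value)

    return PLACEHOLDER.sub(substitute, template)


template = "- **Response**: {{response}}\n- **Facts to verify:** {{facts_to_verify}}"
print(render_prompt(template, {
    "response": "Langfuse is open source.",
    "facts_to_verify": ["Open Source LLM Engineering Platform"],
}))
```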

#### Trajectory

```md
You are comparing two lists of strings. Please check whether the lists contain exactly the same items. The order does not matter.

## Examples

Input
Expected: ["searchWeb", "visitWebsite"]
Output: ["searchWeb"]

Grading
Reasoning: ["searchWeb", "visitWebsite"] are expected. In the output, "visitWebsite" is missing. Thus the two arrays are not the same.
Score: 0

Input
Expected: ["drawImage", "visitWebsite", "speak"]
Output: ["visitWebsite", "speak", "drawImage"]

Grading
Reasoning: The output matches the items from the expected output.
Score: 1

Input
Expected: ["getNews"]
Output: ["getNews", "watchTv"]

Grading
Reasoning: The output contains "watchTv" which was not expected.
Score: 0

## This exercise

Expected: {{expected}}
Output: {{output}}
```
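
The trajectory comparison is deterministic, so it can also be computed in code as a sanity check on the LLM judge. The sketch below is one possible reading of "exactly the same items, order does not matter": it compares the tool names as sets, so repeated calls to the same tool count once (an assumption; use `collections.Counter` instead if call counts should matter). `trajectory_score` is a hypothetical helper.

```python
def trajectory_score(expected, output):
    """Return 1 if both lists hold the same tool names, ignoring order, else 0."""
    return int(set(expected) == set(output))


# Same shape as the `tool_call_history` list built in `run_agent`.
tool_call_history = [
    {"tool_name": "getLangfuseOverview", "args": {}},
    {"tool_name": "searchLangfuseDocs", "args": {"query": "What is Langfuse"}},
]
called = [call["tool_name"] for call in tool_call_history]

print(trajectory_score(["getLangfuseOverview"], called))  # → 0
print(trajectory_score(["searchLangfuseDocs", "getLangfuseOverview"], called))  # → 1
```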

#### Search quality

```md
You are a teacher grading a student based on whether he has looked for the right information in order to answer a question. In the following please find some example gradings that you did in the past.

The search by the student does not need to exactly match the response you expected; searches are often brief. The search term should correspond vaguely with the expected search term.

### Examples
#### **Example 1:**
- **Response:** How can I contact support?
- **Expected search topics**: Support

Grading
- Reasoning: The response accurately searches for support.
- Score: 1

#### **Example 2:**
- **Response:** Deployment
- **Expected search topics:** Tracing

Grading
- Reasoning: The response does not match the expected search topic of Tracing. Deployment questions are unrelated.
- Score: 0

#### **Example 3:**
- **Response:**
- **Expected search topics:**

Grading
- Reasoning: No search was done and no search term was expected.
- Score: 1

#### **Example 4:**
- **Response:** How to view sessions?
- **Expected search topics:**

Grading
- Reasoning: No search was expected, but search was used. This is not a problem.
- Score: 1

#### **Example 5:**
- **Response:**
- **Expected search topics:** How to run Langfuse locally?

Grading
- Reasoning: Even though we expected a search regarding running Langfuse locally, no search was made.
- Score: 0

### New Student Response

- **Response:** {{search}}
- **Expected search topics:** {{expected_search_topic}}
```
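
The `{{search}}` variable needs the queries the agent actually sent to the docs search tool. Those are already in the `tool_call_history` metadata written by `run_agent`; a minimal sketch for pulling them out follows. `extract_search_queries` is a hypothetical helper, and the `query` argument name is taken from the tool-call log printed earlier in this notebook.

```python
def extract_search_queries(tool_call_history, search_tool="searchLangfuseDocs"):
    """Collect the `query` argument of every call to the docs search tool."""
    return [
        call["args"].get("query", "")
        for call in tool_call_history
        if call["tool_name"] == search_tool
    ]


history = [
    {"tool_name": "getLangfuseOverview", "args": {}},
    {
        "tool_name": "searchLangfuseDocs",
        "args": {"query": "What is Langfuse and how does it help monitor LLM applications"},
    },
]
print(extract_search_queries(history))
# → ['What is Langfuse and how does it help monitor LLM applications']
```

An empty list corresponds to the "no search was done" examples in the prompt above.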

### Run Experiments

In [None]:
system_prompts = {
    "simple": (
        "You are an expert on Langfuse. "
        "Answer user questions accurately and concisely using the available MCP tools. "
        "Cite sources when appropriate."
    ),
    "nudge_search_and_sources": (
        "You are an expert on Langfuse. "
        "Answer user questions accurately and concisely using the available MCP tools. "
        "Always cite sources when appropriate. "
        "When you are unsure, always use getLangfuseOverview tool to do some research and then search the docs for more information. You can if needed use these tools multiple times."
    )
}

models = [
    "openai:gpt-4.1-nano",
    "openai:o4-mini"
]
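
Two prompts and two models give four experiment runs per pass over the dataset. The sketch below enumerates that grid up front, which can help when checking run names before spending API calls. `experiment_grid` is a hypothetical helper; the run-name format mirrors the one used in the experiment loop of this notebook.

```python
from datetime import datetime
from itertools import product


def experiment_grid(prompt_names, model_names, timestamp):
    """List every prompt/model combination with the run name it would get."""
    return [
        {"prompt": prompt, "model": model, "run_name": f"{model}-{prompt}-{timestamp}"}
        for prompt, model in product(prompt_names, model_names)
    ]


grid = experiment_grid(
    ["simple", "nudge_search_and_sources"],
    ["openai:gpt-4.1-nano", "openai:o4-mini"],
    datetime.now().strftime("%Y-%m-%dT%H:%M:%S"),
)
for run in grid:
    print(run["run_name"])
```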

In [10]:
from datetime import datetime

d = langfuse.get_dataset(DATSET_NAME)

for prompt_name, prompt_content in list(system_prompts.items()):
    for test_model in models:
        now = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
        
        for item in d.items:
            with item.run(
                run_name=f"{test_model}-{prompt_name}-{now}",
                run_metadata={"model": test_model, "prompt": prompt_content},
            ) as root_span:
                
                await run_agent(
                    item.input["question"],
                    prompt_content,
                    test_model
                )



---
Q: How long are traces retained in langfuse?
MCP Tool call: searchLangfuseDocs with args: {'query': 'trace retention period'}
A: In Langfuse, traces are retained based on the data retention settings configured at the project level. The minimum retention period is 3 days, but it can be set up to a longer period depending on your needs. By default, Langfuse stores event data indefinitely, but you can specify a retention period, and older data will be automatically deleted nightly once it surpasses this setting.

---
Q: How to connect to the Langfuse Docs MCP server?
MCP Tool call: searchLangfuseDocs with args: {'query': 'connect to the Langfuse Docs MCP server'}
A: To connect to the Langfuse Docs MCP server, use the following endpoint and configurations:

- **Endpoint:** `https://langfuse.com/api/mcp`
- **Transport:** `streamableHttp`

Depending on your client, you may need to add the MCP server configuration in your setup. For example:

- In JSON format:
```json
{
  "mcpServers": {