# Agent-Diff Benchmark: LangChain Agent

Run the [Agent-Diff benchmark](https://arxiv.org/abs/2602.11224) using LangChain's built-in agent with tool calling.

Unlike the [ReAct notebook](react_agent_benchmark.ipynb), which uses a custom XML-tag loop, this notebook lets LangChain drive the agent loop through the model's native function-calling protocol.

Two options are shown:
- **Option A**: Load tests from the Hugging Face dataset (no server-side test suites needed)
- **Option B**: Load tests from Agent-Diff server test suites (used in production evaluations)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/agent-diff-bench/agent-diff/blob/main/examples/langchain_agent_benchmark.ipynb)

**Links:** [Paper](https://arxiv.org/abs/2602.11224) | [Dataset](https://huggingface.co/datasets/hubertmarek/agent-diff-bench) | [GitHub](https://github.com/agent-diff-bench/agent-diff)

In [None]:
!pip install agent-diff langchain langchain-openai datasets -q

In [None]:
# Get your API key at https://www.agentdiff.dev/dashboard
%env AGENT_DIFF_API_KEY=
%env AGENT_DIFF_BASE_URL=https://api.agentdiff.dev
# OpenRouter API key, e.g. for https://openrouter.ai/anthropic/claude-haiku-4.5
# (for another OpenAI-compatible provider, also change `model` and `base_url` in the next cell)
%env OPENAI_API_KEY=

In [None]:
import time
import json
from agent_diff import AgentDiff, PythonExecutorProxy, create_langchain_tool
from langchain.agents import create_agent
from langchain_openai import ChatOpenAI

client = AgentDiff()

model = ChatOpenAI(
    model="anthropic/claude-haiku-4.5",
    base_url="https://openrouter.ai/api/v1",
)

SERVICE_PROMPTS = {
    "slack": "Use execute_python to interact with Slack API at https://slack.com/api. Authentication is handled automatically via proxy. Leave a placeholder credential where you would add a real token.",
    "box": "Use execute_python to interact with Box API at https://api.box.com/2.0. Authentication is handled automatically via proxy. Leave a placeholder credential where you would add a real token.",
    "calendar": "Use execute_python to interact with Google Calendar API at https://www.googleapis.com/calendar/v3. Authentication is handled automatically via proxy. Leave a placeholder credential where you would add a real token. Current Date/Time: Sunday, June 17, 2018 at 00:01 (midnight), timezone America/Los_Angeles.",
    "linear": "Use execute_python to interact with Linear GraphQL API at https://api.linear.app/graphql. Authentication is handled automatically via proxy. Leave a placeholder credential where you would add a real token.",
}
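The four prompts above share the same wording and differ only in the API name, the base URL, and (for Calendar) a fixed clock. As an optional sketch, the same strings could be generated from a small table. The names `PROMPT_TEMPLATE`, `SERVICE_APIS`, `EXTRA_CONTEXT`, `build_prompt`, and `GENERATED_PROMPTS` are illustrative and not part of the `agent_diff` package.

```python
# Illustrative refactor: every name here is a hypothetical helper, not agent_diff API.
PROMPT_TEMPLATE = (
    "Use execute_python to interact with {api} at {url}. "
    "Authentication is handled automatically via proxy. "
    "Leave a placeholder credential where you would add a real token."
)

# Service key -> (API name as it appears in the prompt, base URL)
SERVICE_APIS = {
    "slack": ("Slack API", "https://slack.com/api"),
    "box": ("Box API", "https://api.box.com/2.0"),
    "calendar": ("Google Calendar API", "https://www.googleapis.com/calendar/v3"),
    "linear": ("Linear GraphQL API", "https://api.linear.app/graphql"),
}

# Extra context appended for specific services (here: the fixed clock for Calendar tasks).
EXTRA_CONTEXT = {
    "calendar": (
        "Current Date/Time: Sunday, June 17, 2018 at 00:01 (midnight), "
        "timezone America/Los_Angeles."
    ),
}


def build_prompt(service: str) -> str:
    """Return the system prompt for one service key."""
    api, url = SERVICE_APIS[service]
    prompt = PROMPT_TEMPLATE.format(api=api, url=url)
    extra = EXTRA_CONTEXT.get(service)
    return f"{prompt} {extra}" if extra else prompt


GENERATED_PROMPTS = {name: build_prompt(name) for name in SERVICE_APIS}
```

With this layout, adding a service means adding one row to `SERVICE_APIS` rather than copying the full sentence.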

## Option A: Load from HuggingFace Dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("hubertmarek/agent-diff-bench", split="test")
results = []

for example in dataset.select(range(5)):  # First 5 tasks; remove .select() for full benchmark
    info = json.loads(example["info"]) if isinstance(example["info"], str) else example["info"]
    expected = json.loads(example["answer"]) if isinstance(example["answer"], str) else example["answer"]
    service = info["service"]

    print(f"Running: {example.get('test_name', example['test_id'])}")

    env = client.init_env(
        templateService=info["service"],
        templateName=info["seed_template"],
        impersonateUserId=info["impersonate_user_id"],
    )
    run = client.start_run(envId=env.environmentId)

    python_tool = create_langchain_tool(
        PythonExecutorProxy(env.environmentId, base_url=client.base_url, api_key=client.api_key)
    )

    agent = create_agent(
        model=model,
        tools=[python_tool],
        system_prompt=SERVICE_PROMPTS[service],
    )

    start = time.perf_counter()
    try:
        response = agent.invoke({"messages": [
            {"role": "user", "content": example["question"]}
        ]})
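    except Exception as e:
        # Report the failure but keep going, so the run is still evaluated and cleaned up
        response = {"error": str(e)}
        print(f"  Agent error: {e}")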
    elapsed = time.perf_counter() - start

    client.evaluate_run(runId=run.runId, expectedOutput=expected)
    result = client.get_results_for_run(runId=run.runId)

    results.append({
        "test_id": example["test_id"],
        "service": service,
        "passed": result.passed,
        "score": result.score,
        "time": round(elapsed, 1),
    })
    print(f"  {'PASS' if result.passed else 'FAIL'} | score={result.score} | {elapsed:.1f}s")

    client.delete_env(envId=env.environmentId)

passed = sum(1 for r in results if r["passed"])
print(f"\nResults: {passed}/{len(results)} passed")
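The loop above prints a single overall pass count. A per-service breakdown is often more informative, since the tasks span several APIs. The helper below is a minimal sketch: `summarize` is a hypothetical name, it assumes `score` is numeric, and the sample rows are made up for illustration (in the notebook you would pass the real `results` list).

```python
from collections import defaultdict


def summarize(results, key="service"):
    """Group result dicts by `key`; return pass count, total, mean score, and total time."""
    groups = defaultdict(list)
    for r in results:
        groups[r[key]].append(r)

    summary = {}
    for name, rows in groups.items():
        summary[name] = {
            "passed": sum(1 for r in rows if r["passed"]),
            "total": len(rows),
            "mean_score": sum(r["score"] for r in rows) / len(rows),
            "time": round(sum(r["time"] for r in rows), 1),
        }
    return summary


# Made-up rows for illustration only; they mirror the keys appended in the loop above.
sample = [
    {"test_id": "t1", "service": "slack", "passed": True, "score": 1.0, "time": 12.5},
    {"test_id": "t2", "service": "slack", "passed": False, "score": 0.5, "time": 20.0},
    {"test_id": "t3", "service": "box", "passed": True, "score": 1.0, "time": 8.0},
]

for name, stats in summarize(sample).items():
    print(
        f"{name}: {stats['passed']}/{stats['total']} passed, "
        f"mean score {stats['mean_score']:.2f}, {stats['time']}s"
    )
```

The `key` argument lets the same helper group by any field present in the result rows.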

## Option B: Load from Server Test Suites

This option uses the Agent-Diff platform's test suite API. Assertions are defined server-side, so you don't need to pass `expectedOutput`; calling `evaluate_run` with only the `runId` is enough. Available test suites are listed in the [docs](https://agentdiff.mintlify.app/test-suites/benchmarks).

In [None]:
SUITES = ["Slack Bench v2", "Box Bench v2", "Calendar Bench", "Linear Bench"]

results = []

for suite_name in SUITES:
    suite_list = client.list_test_suites(name=suite_name)
    if not suite_list.testSuites:
        print(f"[SKIP] '{suite_name}' not found")
        continue
    suite = client.get_test_suite(suite_list.testSuites[0].id, expand=True)
    tests = suite.tests[:5]  # First 5 tests per suite; remove [:5] for full benchmark

    print(f"\n{'='*50}")
    print(f"  {suite_name} — {len(tests)} tests")
    print(f"{'='*50}")

    for test in tests:
        env = client.init_env(testId=test.id)
        run = client.start_run(envId=env.environmentId, testId=test.id)

        python_tool = create_langchain_tool(
            PythonExecutorProxy(env.environmentId, base_url=client.base_url, api_key=client.api_key)
        )

        # Services without an entry in SERVICE_PROMPTS fall back to the Slack prompt below
        service = env.service
        agent = create_agent(
            model=model,
            tools=[python_tool],
            system_prompt=SERVICE_PROMPTS.get(service, SERVICE_PROMPTS["slack"]),
        )

        start = time.perf_counter()
        try:
            response = agent.invoke({"messages": [
                {"role": "user", "content": test.prompt}
            ]})
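        except Exception as e:
            # Report the failure but keep going, so the run is still evaluated and cleaned up
            response = {"error": str(e)}
            print(f"  Agent error: {e}")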
        elapsed = time.perf_counter() - start

        client.evaluate_run(runId=run.runId)
        result = client.get_results_for_run(runId=run.runId)

        results.append({
            "test_id": str(test.id),
            "suite": suite_name,
            "passed": result.passed,
            "score": result.score,
            "time": round(elapsed, 1),
        })
        status = "PASS" if result.passed else "FAIL"
        print(f"  [{status}] {getattr(test, 'name', str(test.id))[:60]}  score={result.score} | {elapsed:.1f}s")

        client.delete_env(envId=env.environmentId)

passed = sum(1 for r in results if r["passed"])
print(f"\nResults: {passed}/{len(results)} passed")