The Universal Evaluation Framework for AI Agents across CEX and Web3.
This repository provides three things:
- benchmark datasets for real-world Web3 agent tasks;
- an automated harness that runs multi-turn evaluations and produces reports;
- adapter examples that show how to connect external agents into the test flow.
The current public benchmark set is organized around practical Web3 product scenarios rather than abstract capability buckets. You can use the included datasets directly, or adapt your own agent and run the same automation pipeline against it.
The public benchmark assets are organized under two parallel dataset trees:
- `dataset/`: the primary English benchmark set
- `dataset-zh/`: the Chinese version of the same benchmark questions and scenario coverage
The benchmark is organized by scenario:
- `cex`: centralized exchange operations and account workflows
- `dex`: swap and on-chain trading workflows
- `wallet`: wallet balances and wallet-driven actions
- `market_analysis`: market lookup and market-reading tasks
- `project_research`: asset and project research tasks
- `onchain_investigation`: address and transaction investigation tasks
Both trees follow the same scenario structure and are intended to stay aligned case-for-case, with `dataset/` as the primary benchmark set and `dataset-zh/` as its Chinese counterpart.
Each case is written as multi-turn YAML and is designed to test whether an agent can complete a realistic user task, explain limitations clearly, and stay aligned with product constraints.
The runtime executes benchmark cases against a target agent, records the full conversation trace, and generates structured reports.
Core capabilities:
- multi-turn execution with session preservation;
- configurable adapters for different agent protocols;
- execution traces, including responses and any intermediate steps captured by the adapter;
- LLM-based judging with configurable models;
- JSON and HTML report generation.
This repository includes working adapter examples for:
- `gateai`
- `gate_dex`
- `gateclaw_ws`
These examples are here to make external integration easier. The project is not limited to Gate products. If your agent can expose an SSE or WebSocket interaction protocol, it can be connected directly. If not, you can still add a thin adapter layer that translates your agent's interface into the harness request/response model.
The benchmark suite is currently focused on Web3 web-agent scenarios, including tasks such as:
- balance and account queries;
- swap and execution flows;
- wallet lookups;
- market and price analysis;
- project research and information synthesis;
- on-chain investigation and transaction tracing.
The benchmark philosophy is scenario-first: cases are designed around what a user wants done in a real product environment, not around a fixed tool path or one hardcoded execution strategy.
The evaluation flow is straightforward:
- Load benchmark cases from YAML.
- Send each case to the target agent turn by turn.
- Record the execution trace and final response.
- Ask judge models to score the result based on `judge_criteria`.
- Generate machine-readable and visual reports.
Important scoring note:
- `judge_criteria` is the scoring source of truth.
- `expected_outcome` is reference context that helps define what a reasonable answer looks like.
- The harness does not require one fixed tool path in order for an agent to be considered correct.
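To make the rubric mechanics concrete, a weighted rubric with partial credit can be aggregated as below. This is a sketch of the scoring arithmetic only, not the harness's actual implementation; the function name and the input shape are hypothetical.

```python
# Sketch of weighted rubric aggregation (hypothetical; the harness's real
# scoring code may differ). Each criterion contributes weight * factor, where
# pass = 1.0, fail = 0.0, and partial uses the criterion's partial_score.

def score_case(judgements: list[dict]) -> float:
    """judgements: [{"weight": 30, "result": "pass", "partial_score": 0.6}, ...]"""
    outcome = {"pass": 1.0, "fail": 0.0}
    total_weight = sum(j["weight"] for j in judgements)
    earned = 0.0
    for j in judgements:
        if j["result"] == "partial":
            factor = j.get("partial_score", 0.5)
        else:
            factor = outcome[j["result"]]
        earned += j["weight"] * factor
    return 100.0 * earned / total_weight

# Example: pass on a weight-30 criterion, partial (0.6) on weight-50,
# pass on weight-20 -> 30 + 50*0.6 + 20 = 80.0
score_case([
    {"weight": 30, "result": "pass"},
    {"weight": 50, "result": "partial", "partial_score": 0.6},
    {"weight": 20, "result": "pass"},
])
```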
- Python 3.10+
- pip
```shell
git clone <repository-url>
cd ai-abc
pip install -r requirements.txt
```

Copy the example config:
```shell
cp eval_config.yaml.example eval_config.yaml
```

At minimum, configure:
- `agent.url`
- `agent.parser`
- `agent.api_key` if required
- `cases.base_dir`
- `judge.models`
For the current public benchmark set, point `cases.base_dir` to `./dataset` for the primary English benchmark, or `./dataset-zh` for the Chinese version of the same benchmark set.
Example:
```yaml
agent:
  url: "https://your-agent-api.com/v1/chat"
  parser: "gateai"
  api_key: "${YOUR_AGENT_API_KEY}"
  extra_headers:
    x-agent-mode: "react"
  session_mode: "body"
  session_field: "conversation_id"
  timeout: 120

cases:
  base_dir: "./dataset"
  dimensions:
    - dex
    - wallet

judge:
  models:
    - name: "claude-primary"
      provider: "anthropic"
      base_url: "https://api.anthropic.com/v1"
      api_key: "${ANTHROPIC_API_KEY}"
      model: "claude-sonnet-4-6"
      timeout: 60
```

```shell
# Run with the default config
python main.py

# Run with a custom config
python main.py --config /path/to/config.yaml

# Run only selected scenario groups
python main.py --dimensions dex wallet

# Override the dataset directory
python main.py --cases-dir ./dataset

# Run the Chinese mirror of the same benchmark set
python main.py --cases-dir ./dataset-zh

# Limit concurrency
python main.py --max-concurrent 5
```

You can use this harness with your own agent in three common ways:
If your agent already exposes an SSE chat interface, implement or adapt an SSE adapter that maps:
- benchmark turn input -> your request format
- streamed events -> intermediate steps and final answer
- session handling -> your product's conversation model
If your agent speaks over WebSocket, use the WS adapter path and map the same concepts:
- turn payload construction;
- response event parsing;
- session and message correlation.
If your agent does not natively expose SSE or WebSocket, you can still add a thin adapter layer that translates your internal API, queue, RPC, or service protocol into the harness interface.
In practice, the harness only needs a stable way to:
- send user turns;
- preserve session context;
- collect the final answer;
- optionally capture intermediate traces.
The existing adapters in `agenteval/adapters/` are reference implementations for this mapping step.
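As a rough sketch of what such a mapping can look like, the following hypothetical adapter wraps an internal chat client. Every name here (`MyServiceAdapter`, `send_turn`, `TurnResult`, the client's `chat` method and its response fields) is an illustrative assumption, not the harness's real interface; consult the adapters in `agenteval/adapters/` for the actual base classes.

```python
# Hypothetical adapter shape: translates the harness's turn-based model onto
# an internal API. Names and signatures are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class TurnResult:
    final_answer: str
    intermediate_steps: list = field(default_factory=list)

class MyServiceAdapter:
    """Maps harness turns onto an internal API, queue, or RPC client."""

    def __init__(self, client):
        self.client = client      # your internal service client
        self.session_id = None    # preserved across the turns of one case

    def send_turn(self, user_message: str) -> TurnResult:
        # 1. benchmark turn input -> your request format
        reply = self.client.chat(message=user_message, session=self.session_id)
        # 2. session handling -> your product's conversation model
        self.session_id = reply["session_id"]
        # 3. response -> final answer plus any captured intermediate steps
        return TurnResult(
            final_answer=reply["text"],
            intermediate_steps=reply.get("steps", []),
        )
```

The only contract that matters is the four capabilities listed above: send turns, keep session context, return a final answer, and optionally expose traces.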
Benchmark cases are defined as YAML files with one or more cases under a `cases` list.
Recommended fields:
- `id`: unique case identifier
- `dimension`: scenario group name
- `difficulty`: optional difficulty tag such as `L1`, `L2`, `L3`
- `name`: human-readable case title
- `description`: what the case is testing
- `turns`: ordered multi-turn conversation input
- `expected_outcome`: reference answer guidance
- `judge_criteria`: the scoring rubric used by judge models
- `timeout_seconds`: optional per-case timeout
Example:
```yaml
cases:
  - id: dex_001
    dimension: dex
    difficulty: L1
    name: Supported chains and token discovery
    description: Query supported chains and major swappable assets on Ethereum.
    turns:
      - role: user
        content: "What chains do you support? What tokens can I swap on Ethereum?"
    expected_outcome: |
      Return supported chains and representative Ethereum tokens.
      If the upstream source is unavailable, explain that clearly instead of inventing data.
    judge_criteria:
      - name: Scope coverage
        weight: 30
        pass: Answers both supported chains and Ethereum token availability.
        partial: Answers one part clearly but the other part is incomplete.
        partial_score: 0.6
        fail: Misses half the request or misclassifies it as a trade.
      - name: Result validity
        weight: 50
        pass: Returns real and usable chain and token information.
        partial: Returns related but incomplete information.
        partial_score: 0.6
        fail: Returns no useful result or fabricates data.
      - name: Presentation and failure handling
        weight: 20
        pass: Clear structure and honest failure handling.
        fail: Confusing response or false claims.
    timeout_seconds: 90
```

Difficulty is orthogonal to the scenario group. A case can be simple or complex within any benchmark area.
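When authoring new cases, a quick structural sanity check can catch common mistakes before a full run. The sketch below assumes the case has already been parsed from YAML (e.g. with `yaml.safe_load`); the field names follow the schema above, and the weights-summing-to-100 convention is taken from the example case rather than from a documented harness requirement, so treat these checks as informal.

```python
# Informal sanity check for a parsed benchmark case (assumption: field names
# as in the schema above; the weights-sum-to-100 rule mirrors the example
# case and may not be enforced by the harness itself).

REQUIRED = {"id", "dimension", "name", "turns", "expected_outcome", "judge_criteria"}

def check_case(case: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the case looks sane."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - case.keys())]
    criteria = case.get("judge_criteria", [])
    weights = [c.get("weight", 0) for c in criteria]
    if weights and sum(weights) != 100:
        problems.append(f"criteria weights sum to {sum(weights)}, expected 100")
    for c in criteria:
        if "partial" in c and not 0 <= c.get("partial_score", 0.5) <= 1:
            problems.append(f"{c.get('name')}: partial_score out of [0, 1]")
    return problems
```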
Each run writes outputs under `results/{timestamp}/`.
Key outputs:
- `report.html`: primary human-readable report
- `report.json`: machine-readable summary and case-level results
- `execution.json`: per-case execution trace when enabled
- `prompt.md`: rendered judge prompt when enabled
- `judgement.json`: raw judge output when enabled
The HTML report is the main artifact for reviewing benchmark results across cases, scenarios, and failure types.
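For programmatic post-processing, `report.json` is the natural entry point. Its schema is not documented here, so the snippet below is purely illustrative: it assumes a hypothetical top-level `cases` list whose entries carry `id` and `score` fields, which may not match the real output.

```python
# Hypothetical post-processing of report.json. The actual schema is defined
# by the harness; the "cases" list and "id"/"score" fields assumed here are
# illustrative only.
import json
from pathlib import Path

def summarize(report_path: str, threshold: float = 60.0) -> dict:
    """Count cases and list those scoring below a threshold."""
    report = json.loads(Path(report_path).read_text())
    cases = report.get("cases", [])
    failing = [c["id"] for c in cases if c.get("score", 0) < threshold]
    return {"total": len(cases), "failing": failing}
```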
- Code in this repository, unless otherwise noted, is licensed under the Apache License 2.0.
- Benchmark Assets in `dataset/` and `dataset-zh/` are licensed under CC BY-NC 4.0 with Additional Restrictions. See the local `LICENSE` files in those directories.
- Benchmark Assets may be used for non-commercial research, academic evaluation, internal reproducibility, and personal experimentation, provided attribution is preserved.
- Benchmark Assets may not be used for commercial purposes, for training or fine-tuning commercial AI models, or for building, enhancing, or operating competing benchmarks, leaderboards, or evaluation services.
- Future directories whose names include `data`, `dataset`, `prompt`, `prompts`, `task`, `tasks`, or `benchmark_cases` should be treated as restricted Benchmark Assets by default unless a more specific local license states otherwise.
- For commercial licensing, enterprise cooperation, or platform-level integrations, contact the project maintainers and the Gate platform.
Contributions to benchmark cases, adapters, and runtime improvements are welcome.
Suggested workflow:
- Fork the repository.
- Create a feature branch.
- Make your changes.
- Open a pull request with a clear summary of the benchmark or integration change.
If you have questions or suggestions, open an issue or contact the maintainer.