The Universal Evaluation Framework for AI Agents across CEX and Web3.
This repository provides three things:
- benchmark datasets for real-world Web3 agent tasks;
- an automated harness that runs multi-turn evaluations and produces reports;
- adapter examples that show how to connect external agents into the test flow.
The current public benchmark set is organized around practical Web3 product scenarios rather than abstract capability buckets. You can use the included datasets directly, or adapt your own agent and run the same automation pipeline against it.
The public benchmark assets are organized under two parallel dataset trees:
- `dataset/`: the primary English benchmark set
- `dataset-zh/`: the Chinese version of the same benchmark questions and scenario coverage
The benchmark is organized by scenario:
- `cex`: centralized exchange operations and account workflows
- `dex`: swap and on-chain trading workflows
- `wallet`: wallet balances and wallet-driven actions
- `market_analysis`: market lookup and market-reading tasks
- `project_research`: asset and project research tasks
- `onchain_investigation`: address and transaction investigation tasks
Both trees follow the same scenario structure and are intended to stay aligned case-for-case, with `dataset/` as the primary benchmark set and `dataset-zh/` as its Chinese counterpart.
Each case is written as multi-turn YAML and is designed to test whether an agent can complete a realistic user task, explain limitations clearly, and stay aligned with product constraints.
The runtime executes benchmark cases against a target agent, records the full conversation trace, and generates structured reports.
Core capabilities:
- multi-turn execution with session preservation;
- configurable adapters for different agent protocols;
- execution traces, including responses and any intermediate steps captured by the adapter;
- LLM-based judging with configurable models;
- JSON and HTML report generation.
This repository includes working adapter examples for:
- `gateai`
- `gate_dex`
- `gateclaw_ws`
These examples are here to make external integration easier. The project is not limited to Gate products. If your agent can expose an SSE or WebSocket interaction protocol, it can be connected directly. If not, you can still add a thin adapter layer that translates your agent's interface into the harness request/response model.
The benchmark suite is currently focused on Web3 web-agent scenarios, including tasks such as:
- balance and account queries;
- swap and execution flows;
- wallet lookups;
- market and price analysis;
- project research and information synthesis;
- on-chain investigation and transaction tracing.
The benchmark philosophy is scenario-first: cases are designed around what a user wants done in a real product environment, not around a fixed tool path or one hardcoded execution strategy.
The evaluation flow is straightforward:
- Load benchmark cases from YAML.
- Send each case to the target agent turn by turn.
- Record the execution trace and final response.
- Ask judge models to score the result based on `judge_criteria`.
- Generate machine-readable and visual reports.
Important scoring note:
- `judge_criteria` is the scoring source of truth.
- `expected_outcome` is reference context that helps define what a reasonable answer looks like.
- The harness does not require one fixed tool path in order for an agent to be considered correct.
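To make the rubric mechanics concrete, a weighted rubric with partial credit can be aggregated as below. This is a sketch of the scoring arithmetic only, not the harness's actual implementation; the function name and the input shape are hypothetical.

```python
# Sketch of weighted rubric aggregation (hypothetical; the harness's real
# scoring code may differ). Each criterion contributes weight * factor, where
# pass = 1.0, fail = 0.0, and partial uses the criterion's partial_score.

def score_case(judgements: list[dict]) -> float:
    """judgements: [{"weight": 30, "result": "pass", "partial_score": 0.6}, ...]"""
    outcome = {"pass": 1.0, "fail": 0.0}
    total_weight = sum(j["weight"] for j in judgements)
    earned = 0.0
    for j in judgements:
        if j["result"] == "partial":
            factor = j.get("partial_score", 0.5)
        else:
            factor = outcome[j["result"]]
        earned += j["weight"] * factor
    return 100.0 * earned / total_weight

# Example: pass on a weight-30 criterion, partial (0.6) on weight-50,
# pass on weight-20 -> 30 + 50*0.6 + 20 = 80.0
score_case([
    {"weight": 30, "result": "pass"},
    {"weight": 50, "result": "partial", "partial_score": 0.6},
    {"weight": 20, "result": "pass"},
])
```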
- Python 3.10+
- pip
```shell
git clone <repository-url>
cd ai-abc
pip install -r requirements.txt
```

Copy the example config:
```shell
cp eval_config.yaml.example eval_config.yaml
```

At minimum, configure:
- `agent.url`
- `agent.parser`
- `agent.api_key` if required
- `cases.base_dir`
- `judge.models`
For the current public benchmark set, point `cases.base_dir` to `./dataset` for the primary English benchmark, or `./dataset-zh` for the Chinese version of the same benchmark set.
Example:
```yaml
agent:
  url: "https://your-agent-api.com/v1/chat"
  parser: "gateai"
  api_key: "${YOUR_AGENT_API_KEY}"
  extra_headers:
    x-agent-mode: "react"
  session_mode: "body"
  session_field: "conversation_id"
  timeout: 120

cases:
  base_dir: "./dataset"
  dimensions:
    - dex
    - wallet

judge:
  models:
    - name: "claude-primary"
      provider: "anthropic"
      base_url: "https://api.anthropic.com/v1"
      api_key: "${ANTHROPIC_API_KEY}"
      model: "claude-sonnet-4-6"
      timeout: 60
```

```shell
# Run with the default config
python main.py

# Run with a custom config
python main.py --config /path/to/config.yaml

# Run only selected scenario groups
python main.py --dimensions dex wallet

# Override the dataset directory
python main.py --cases-dir ./dataset

# Run the Chinese mirror of the same benchmark set
python main.py --cases-dir ./dataset-zh

# Limit concurrency
python main.py --max-concurrent 5
```

You can use this harness with your own agent in three common ways:
If your agent already exposes an SSE chat interface, implement or adapt an SSE adapter that maps:
- benchmark turn input -> your request format
- streamed events -> intermediate steps and final answer
- session handling -> your product's conversation model
If your agent speaks over WebSocket, use the WS adapter path and map the same concepts:
- turn payload construction;
- response event parsing;
- session and message correlation.
If your agent does not natively expose SSE or WebSocket, you can still add a thin adapter layer that translates your internal API, queue, RPC, or service protocol into the harness interface.
In practice, the harness only needs a stable way to:
- send user turns;
- preserve session context;
- collect the final answer;
- optionally capture intermediate traces.
The existing adapters in `agenteval/adapters/` are reference implementations for this mapping step.
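As a rough sketch of what such a mapping can look like, the following hypothetical adapter wraps an internal chat client. Every name here (`MyServiceAdapter`, `send_turn`, `TurnResult`, the client's `chat` method and its response fields) is an illustrative assumption, not the harness's real interface; consult the adapters in `agenteval/adapters/` for the actual base classes.

```python
# Hypothetical adapter shape: translates the harness's turn-based model onto
# an internal API. Names and signatures are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class TurnResult:
    final_answer: str
    intermediate_steps: list = field(default_factory=list)

class MyServiceAdapter:
    """Maps harness turns onto an internal API, queue, or RPC client."""

    def __init__(self, client):
        self.client = client      # your internal service client
        self.session_id = None    # preserved across the turns of one case

    def send_turn(self, user_message: str) -> TurnResult:
        # 1. benchmark turn input -> your request format
        reply = self.client.chat(message=user_message, session=self.session_id)
        # 2. session handling -> your product's conversation model
        self.session_id = reply["session_id"]
        # 3. response -> final answer plus any captured intermediate steps
        return TurnResult(
            final_answer=reply["text"],
            intermediate_steps=reply.get("steps", []),
        )
```

The only contract that matters is the four capabilities listed above: send turns, keep session context, return a final answer, and optionally expose traces.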
Benchmark cases are defined as YAML files with one or more cases under a `cases` list.
Recommended fields:
- `id`: unique case identifier
- `dimension`: scenario group name
- `difficulty`: optional difficulty tag such as `L1`, `L2`, `L3`
- `name`: human-readable case title
- `description`: what the case is testing
- `turns`: ordered multi-turn conversation input
- `expected_outcome`: reference answer guidance
- `judge_criteria`: the scoring rubric used by judge models
- `timeout_seconds`: optional per-case timeout
Example:
```yaml
cases:
  - id: dex_001
    dimension: dex
    difficulty: L1
    name: Supported chains and token discovery
    description: Query supported chains and major swappable assets on Ethereum.
    turns:
      - role: user
        content: "What chains do you support? What tokens can I swap on Ethereum?"
    expected_outcome: |
      Return supported chains and representative Ethereum tokens.
      If the upstream source is unavailable, explain that clearly instead of inventing data.
    judge_criteria:
      - name: Scope coverage
        weight: 30
        pass: Answers both supported chains and Ethereum token availability.
        partial: Answers one part clearly but the other part is incomplete.
        partial_score: 0.6
        fail: Misses half the request or misclassifies it as a trade.
      - name: Result validity
        weight: 50
        pass: Returns real and usable chain and token information.
        partial: Returns related but incomplete information.
        partial_score: 0.6
        fail: Returns no useful result or fabricates data.
      - name: Presentation and failure handling
        weight: 20
        pass: Clear structure and honest failure handling.
        fail: Confusing response or false claims.
    timeout_seconds: 90
```

Difficulty is orthogonal to the scenario group. A case can be simple or complex within any benchmark area.
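When authoring new cases, a quick structural sanity check can catch common mistakes before a full run. The sketch below assumes the case has already been parsed from YAML (e.g. with `yaml.safe_load`); the field names follow the schema above, and the weights-summing-to-100 convention is taken from the example case rather than from a documented harness requirement, so treat these checks as informal.

```python
# Informal sanity check for a parsed benchmark case (assumption: field names
# as in the schema above; the weights-sum-to-100 rule mirrors the example
# case and may not be enforced by the harness itself).

REQUIRED = {"id", "dimension", "name", "turns", "expected_outcome", "judge_criteria"}

def check_case(case: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the case looks sane."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - case.keys())]
    criteria = case.get("judge_criteria", [])
    weights = [c.get("weight", 0) for c in criteria]
    if weights and sum(weights) != 100:
        problems.append(f"criteria weights sum to {sum(weights)}, expected 100")
    for c in criteria:
        if "partial" in c and not 0 <= c.get("partial_score", 0.5) <= 1:
            problems.append(f"{c.get('name')}: partial_score out of [0, 1]")
    return problems
```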
Each run writes outputs under `results/{timestamp}/`.
Key outputs:
- `report.html`: primary human-readable report
- `report.json`: machine-readable summary and case-level results
- `execution.json`: per-case execution trace when enabled
- `prompt.md`: rendered judge prompt when enabled
- `judgement.json`: raw judge output when enabled
The HTML report is the main artifact for reviewing benchmark results across cases, scenarios, and failure types.
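For programmatic post-processing, `report.json` is the natural entry point. Its schema is not documented here, so the snippet below is purely illustrative: it assumes a hypothetical top-level `cases` list whose entries carry `id` and `score` fields, which may not match the real output.

```python
# Hypothetical post-processing of report.json. The actual schema is defined
# by the harness; the "cases" list and "id"/"score" fields assumed here are
# illustrative only.
import json
from pathlib import Path

def summarize(report_path: str, threshold: float = 60.0) -> dict:
    """Count cases and list those scoring below a threshold."""
    report = json.loads(Path(report_path).read_text())
    cases = report.get("cases", [])
    failing = [c["id"] for c in cases if c.get("score", 0) < threshold]
    return {"total": len(cases), "failing": failing}
```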
- Code in this repository, unless otherwise noted, is licensed under the Apache License 2.0.
- Benchmark Assets in `dataset/` and `dataset-zh/` are licensed under CC BY-NC 4.0 with Additional Restrictions. See the local `LICENSE` files in those directories.
- Benchmark Assets may be used for non-commercial research, academic evaluation, internal reproducibility, and personal experimentation, provided attribution is preserved.
- Benchmark Assets may not be used for commercial purposes, for training or fine-tuning commercial AI models, or for building, enhancing, or operating competing benchmarks, leaderboards, or evaluation services.
- Future directories whose names include `data`, `dataset`, `prompt`, `prompts`, `task`, `tasks`, or `benchmark_cases` should be treated as restricted Benchmark Assets by default unless a more specific local license states otherwise.
- For commercial licensing, enterprise cooperation, or platform-level integrations, contact the project maintainers and the Gate platform.
Contributions to benchmark cases, adapters, and runtime improvements are welcome.
Suggested workflow:
- Fork the repository.
- Create a feature branch.
- Make your changes.
- Open a pull request with a clear summary of the benchmark or integration change.
If you have questions or suggestions, open an issue or contact the maintainer.