AI-ABC: AI Agent Benchmark for Crypto

The Universal Evaluation Framework for AI Agents across CEX and Web3.

This repository provides three things:

  • benchmark datasets for real-world Web3 agent tasks;
  • an automated harness that runs multi-turn evaluations and produces reports;
  • adapter examples that show how to connect external agents into the test flow.

The current public benchmark set is organized around practical Web3 product scenarios rather than abstract capability buckets. You can use the included datasets directly, or adapt your own agent and run the same automation pipeline against it.

What This Repo Provides

1. Benchmark datasets

The public benchmark assets are organized under two parallel dataset trees:

  • dataset/: the primary English benchmark set
  • dataset-zh/: the Chinese version of the same benchmark questions and scenario coverage

The benchmark is organized by scenario:

  • cex: centralized exchange operations and account workflows
  • dex: swap and on-chain trading workflows
  • wallet: wallet balances and wallet-driven actions
  • market_analysis: market lookup and market-reading tasks
  • project_research: asset and project research tasks
  • onchain_investigation: address and transaction investigation tasks

Both trees follow the same scenario structure and are intended to stay aligned case-for-case.

Each case is written as multi-turn YAML and is designed to test whether an agent can complete a realistic user task, explain limitations clearly, and stay aligned with product constraints.

2. Automated evaluation harness

The runtime executes benchmark cases against a target agent, records the full conversation trace, and generates structured reports.

Core capabilities:

  • multi-turn execution with session preservation;
  • configurable adapters for different agent protocols;
  • execution traces, including responses and any intermediate steps captured by the adapter;
  • LLM-based judging with configurable models;
  • JSON and HTML report generation.
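The core loop those capabilities describe can be sketched roughly as follows. This is a minimal illustration, not the harness's actual API: `call_agent` and the trace shape are hypothetical stand-ins for whatever the configured adapter provides.

```python
# Minimal sketch of the multi-turn evaluation loop described above.
# `call_agent` is a hypothetical stand-in for an adapter's send method:
# it takes (turn_text, session_id) and returns (reply, session_id).

def run_case(case, call_agent):
    """Send each turn of a case to the agent, preserving the session id
    across turns, and return the recorded trace plus the final response."""
    session_id = None
    trace = []
    for turn in case["turns"]:
        reply, session_id = call_agent(turn["content"], session_id)
        trace.append({"input": turn["content"], "output": reply})
    return {"id": case["id"], "trace": trace, "final": trace[-1]["output"]}
```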

3. Adapter examples for external agents

This repository includes working adapter examples for:

  • gateai
  • gate_dex
  • gateclaw_ws

These examples are here to make external integration easier. The project is not limited to Gate products. If your agent can expose an SSE or WebSocket interaction protocol, it can be connected directly. If not, you can still add a thin adapter layer that translates your agent's interface into the harness request/response model.

Benchmark Coverage

The benchmark suite is currently focused on Web3 web-agent scenarios, including tasks such as:

  • balance and account queries;
  • swap and execution flows;
  • wallet lookups;
  • market and price analysis;
  • project research and information synthesis;
  • on-chain investigation and transaction tracing.

The benchmark philosophy is scenario-first: cases are designed around what a user wants done in a real product environment, not around a fixed tool path or one hardcoded execution strategy.

How Evaluation Works

The evaluation flow is straightforward:

  1. Load benchmark cases from YAML.
  2. Send each case to the target agent turn by turn.
  3. Record the execution trace and final response.
  4. Ask judge models to score the result based on judge_criteria.
  5. Generate machine-readable and visual reports.

Important scoring note:

  • judge_criteria is the scoring source of truth.
  • expected_outcome is reference context that helps define what a reasonable answer looks like.
  • The harness does not require one fixed tool path in order for an agent to be considered correct.
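To make the scoring model concrete, here is one plausible way per-criterion verdicts could be combined into a 0–100 score under the `judge_criteria` structure shown in the case format below. The weighted-average formula itself is an assumption for illustration, not the harness's documented aggregation.

```python
# Sketch of combining per-criterion verdicts into a 0-100 score.
# The weighted average is an assumed aggregation, not the harness's
# documented behavior; criterion fields mirror the case format
# (weight, optional partial_score).

def aggregate_score(criteria, verdicts):
    """criteria: list of dicts with "name", "weight", optional "partial_score".
    verdicts: mapping of criterion name -> "pass" | "partial" | "fail"."""
    total_weight = sum(c["weight"] for c in criteria)
    earned = 0.0
    for c in criteria:
        verdict = verdicts[c["name"]]
        if verdict == "pass":
            earned += c["weight"]
        elif verdict == "partial":
            earned += c["weight"] * c.get("partial_score", 0.5)
    return 100.0 * earned / total_weight
```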

Quick Start

Requirements

  • Python 3.10+
  • pip

Install

git clone <repository-url>
cd ai-abc
pip install -r requirements.txt

Configure a target agent

Copy the example config:

cp eval_config.yaml.example eval_config.yaml

At minimum, configure:

  • agent.url
  • agent.parser
  • agent.api_key if required
  • cases.base_dir
  • judge.models

For the current public benchmark set, point cases.base_dir to ./dataset for the primary English benchmark, or ./dataset-zh for the Chinese version of the same benchmark set.

Example:

agent:
  url: "https://your-agent-api.com/v1/chat"
  parser: "gateai"
  api_key: "${YOUR_AGENT_API_KEY}"
  extra_headers:
    x-agent-mode: "react"
  session_mode: "body"
  session_field: "conversation_id"
  timeout: 120

cases:
  base_dir: "./dataset"
  dimensions:
    - dex
    - wallet

judge:
  models:
    - name: "claude-primary"
      provider: "anthropic"
      base_url: "https://api.anthropic.com/v1"
      api_key: "${ANTHROPIC_API_KEY}"
      model: "claude-sonnet-4-6"
      timeout: 60

Run a benchmark

# Run with the default config
python main.py

# Run with a custom config
python main.py --config /path/to/config.yaml

# Run only selected scenario groups
python main.py --dimensions dex wallet

# Override the dataset directory
python main.py --cases-dir ./dataset

# Run the Chinese mirror of the same benchmark set
python main.py --cases-dir ./dataset-zh

# Limit concurrency
python main.py --max-concurrent 5

Bring Your Own Agent

You can use this harness with your own agent in three common ways:

Option 1: Direct SSE integration

If your agent already exposes an SSE chat interface, implement or adapt an SSE adapter that maps:

  • benchmark turn input -> your request format
  • streamed events -> intermediate steps and final answer
  • session handling -> your product's conversation model
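The streamed-events half of that mapping can be sketched as below. This is a toy parser for the `text/event-stream` wire format; the `"step"`/`"final"` event payload types are hypothetical, and a real adapter would match whatever event names your agent actually emits.

```python
import json

# Sketch of mapping a raw SSE stream to intermediate steps and a final
# answer. The payload schema ({"type": "step"|"final", "content": ...})
# is a hypothetical example, not a fixed protocol.

def parse_sse(stream_text):
    """Collect intermediate steps and the final answer from SSE text."""
    steps, final = [], None
    for line in stream_text.splitlines():
        if not line.startswith("data:"):
            continue  # skip comments, "event:" lines, and blank separators
        payload = json.loads(line[len("data:"):].strip())
        if payload.get("type") == "step":
            steps.append(payload["content"])
        elif payload.get("type") == "final":
            final = payload["content"]
    return steps, final
```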

Option 2: Direct WebSocket integration

If your agent speaks over WebSocket, use the WS adapter path and map the same concepts:

  • turn payload construction;
  • response event parsing;
  • session and message correlation.

Option 3: A thin compatibility adapter

If your agent does not natively expose SSE or WebSocket, you can still add a thin adapter layer that translates your internal API, queue, RPC, or service protocol into the harness interface.

In practice, the harness only needs a stable way to:

  • send user turns;
  • preserve session context;
  • collect the final answer;
  • optionally capture intermediate traces.

The existing adapters in agenteval/adapters/ are reference implementations for this mapping step.
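As a rough mental model of what such a mapping layer covers, the interface might look like the sketch below. Names and signatures here are hypothetical; the authoritative shape is whatever the reference adapters in agenteval/adapters/ implement.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of the minimal surface a thin compatibility
# adapter needs: send turns, keep session context, return the answer
# (and optionally intermediate steps). Not the harness's real interface.

class AgentAdapter(ABC):
    @abstractmethod
    def send_turn(self, session_id, text):
        """Send one user turn; return (reply_text, intermediate_steps)."""

    def open_session(self):
        """Return a fresh session identifier; override this if your
        backend manages conversations server-side."""
        return None

class CallableAdapter(AgentAdapter):
    """Toy adapter wrapping a plain callable instead of SSE/WebSocket."""
    def __init__(self, backend):
        self.backend = backend  # any callable: (session_id, text) -> str

    def send_turn(self, session_id, text):
        return self.backend(session_id, text), []
```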

Case Format

Benchmark cases are defined as YAML files, each containing one or more cases under a top-level cases list.

Recommended fields:

  • id: unique case identifier
  • dimension: scenario group name
  • difficulty: optional difficulty tag such as L1, L2, L3
  • name: human-readable case title
  • description: what the case is testing
  • turns: ordered multi-turn conversation input
  • expected_outcome: reference answer guidance
  • judge_criteria: the scoring rubric used by judge models
  • timeout_seconds: optional per-case timeout

Example:

cases:
  - id: dex_001
    dimension: dex
    difficulty: L1
    name: Supported chains and token discovery
    description: Query supported chains and major swappable assets on Ethereum.

    turns:
      - role: user
        content: "What chains do you support? What tokens can I swap on Ethereum?"

    expected_outcome: |
      Return supported chains and representative Ethereum tokens.
      If the upstream source is unavailable, explain that clearly instead of inventing data.

    judge_criteria:
      - name: Scope coverage
        weight: 30
        pass: Answers both supported chains and Ethereum token availability.
        partial: Answers one part clearly but the other part is incomplete.
        partial_score: 0.6
        fail: Misses half the request or misclassifies it as a trade.
      - name: Result validity
        weight: 50
        pass: Returns real and usable chain and token information.
        partial: Returns related but incomplete information.
        partial_score: 0.6
        fail: Returns no useful result or fabricates data.
      - name: Presentation and failure handling
        weight: 20
        pass: Clear structure and honest failure handling.
        fail: Confusing response or false claims.

    timeout_seconds: 90

Difficulty is orthogonal to the scenario group. A case can be simple or complex within any benchmark area.
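When authoring new cases, a quick structural sanity check over the recommended fields above can catch mistakes before a run. The check below is illustrative and operates on an already-parsed case dict; it is not part of the harness.

```python
# Sketch of a sanity check over the recommended case fields listed
# above. The set of required fields and the check itself are
# illustrative choices, not enforced by the harness.

REQUIRED_FIELDS = ("id", "dimension", "name", "turns", "judge_criteria")

def validate_case(case):
    """Raise ValueError if a parsed case dict is missing key fields."""
    missing = [f for f in REQUIRED_FIELDS if f not in case]
    if missing:
        raise ValueError(f"case {case.get('id', '?')} missing: {missing}")
    if not case["turns"]:
        raise ValueError("turns must contain at least one entry")
    return True
```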

Reports and Outputs

Each run writes outputs under results/{timestamp}/.

Key outputs:

  • report.html: primary human-readable report
  • report.json: machine-readable summary and case-level results
  • execution.json: per-case execution trace when enabled
  • prompt.md: rendered judge prompt when enabled
  • judgement.json: raw judge output when enabled

The HTML report is the main artifact for reviewing benchmark results across cases, scenarios, and failure types.
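For quick post-run analysis, report.json can be aggregated with a few lines of Python. The field names used here ("cases", "dimension", "score") are assumed for illustration; check the schema your run actually produces before relying on them.

```python
import json
from collections import defaultdict

# Sketch of aggregating report.json into per-dimension average scores.
# The "cases"/"dimension"/"score" field names are assumptions about the
# report schema, used only for illustration.

def scores_by_dimension(report_path):
    """Return {dimension: mean score} from a report.json-style file."""
    with open(report_path) as f:
        report = json.load(f)
    buckets = defaultdict(list)
    for case in report["cases"]:
        buckets[case["dimension"]].append(case["score"])
    return {dim: sum(vals) / len(vals) for dim, vals in buckets.items()}
```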

License and Usage Policy

  • Code in this repository, unless otherwise noted, is licensed under the Apache License 2.0.
  • Benchmark Assets in dataset/ and dataset-zh/ are licensed under CC BY-NC 4.0 with Additional Restrictions. See the local LICENSE files in those directories.
  • Benchmark Assets may be used for non-commercial research, academic evaluation, internal reproducibility, and personal experimentation, provided attribution is preserved.
  • Benchmark Assets may not be used for commercial purposes, for training or fine-tuning commercial AI models, or for building, enhancing, or operating competing benchmarks, leaderboards, or evaluation services.
  • Future directories whose names include data, dataset, prompt, prompts, task, tasks, or benchmark_cases should be treated as restricted Benchmark Assets by default unless a more specific local license states otherwise.
  • For commercial licensing, enterprise cooperation, or platform-level integrations, contact the project maintainers and the Gate platform.

Contributing

Contributions to benchmark cases, adapters, and runtime improvements are welcome.

Suggested workflow:

  1. Fork the repository.
  2. Create a feature branch.
  3. Make your changes.
  4. Open a pull request with a clear summary of the benchmark or integration change.

Contact

If you have questions or suggestions, open an issue or contact the maintainer.
