ChainReason

A small benchmark for evaluating LLM reasoning on Ethereum and DeFi tasks.

Python 3.9+ · MIT License

ChainReason is a lightweight evaluation suite that asks language models to do five things a smart-contract engineer or DeFi analyst would consider routine:

  1. protocol_qa — multiple-choice questions about specific DeFi protocol mechanics.
  2. vuln_detect — classify a Solidity snippet by vulnerability category.
  3. contract_class — classify a contract from its ABI summary + an optional hint.
  4. tx_intent — given a sequence of decoded actions, infer the transaction's intent.
  5. slippage_pred — given an AMM pool state and a swap, compute the output amount.

The point of having five tasks instead of one is that each stresses a different capability — symbolic reasoning, code understanding, structural pattern recognition, numeric reasoning. A model that's strong on vuln_detect but weak on slippage_pred tells you something different from a model that's strong on both.

Why another benchmark

Existing benchmarks for Solidity / blockchain LLMs largely focus on either (a) code generation, or (b) vulnerability detection. ChainReason adds three other axes that I haven't seen consolidated elsewhere:

  • Protocol-level reasoning. Knowing what getReserves() returns is one thing; knowing what happens when you yank 30% of the reserves out of a Uniswap v2 pair is another.
  • Transaction-graph understanding. Distinguishing a sandwich from a plain swap or an arbitrage requires looking at the structure of an execution trace, not just opcodes.
  • Numeric grounding. AMMs have closed-form pricing. If a model gets the CPMM math wrong, it'll be wrong about every downstream task.
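The constant-product math behind slippage_pred is closed-form. As a reference point, the Uniswap v2 output formula (fee applied to the input, integer division rounding in the pool's favor) looks like this — whether slippage_pred uses exactly this fee model is not stated above, so treat it as an illustration:

```python
def get_amount_out(amount_in: int, reserve_in: int, reserve_out: int) -> int:
    """Uniswap v2-style constant-product output with a 0.3% fee.

    Mirrors UniswapV2Library.getAmountOut: the fee is taken from the
    input amount, and floor division rounds in the pool's favor.
    """
    amount_in_with_fee = amount_in * 997
    numerator = amount_in_with_fee * reserve_out
    denominator = reserve_in * 1000 + amount_in_with_fee
    return numerator // denominator

# Swapping 1 ETH into a pool holding 100 ETH / 200,000 USDC (token decimals
# omitted for clarity): the spot price is 2000 USDC/ETH, but fee plus price
# impact push the execution below it.
out = get_amount_out(1, 100, 200_000)
```

A model that can't reproduce this arithmetic will misjudge every slippage question built on top of it.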

The dataset is small and hand-curated — this is not a leaderboard scraper or a ten-thousand-row crawl of Etherscan. The included seed examples are meant to be illustrative; you can extend them with your own data via --data-path.
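The on-disk format expected by --data-path isn't documented here. Purely as an illustrative sketch — the field names below are hypothetical, not taken from the repo — an extended task file might be one JSON object per line:

```
{"id": "vd-013", "input": "function withdraw() external { ... }", "target": "reentrancy"}
{"id": "vd-014", "input": "unchecked { balance += amount; }", "target": "integer_overflow"}
```

Check the task loaders in chainreason/tasks/ for the actual schema before extending the seed sets.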

Installation

git clone https://github.com/joshawome/chainreason
cd chainreason
pip install -e .

For local model inference (HuggingFace), also install:

pip install torch transformers accelerate

Quick start

export OPENAI_API_KEY=...
python scripts/run_eval.py --task protocol_qa --client openai --model gpt-4o-mini --limit 5

Or run a full sweep from a YAML config:

python scripts/run_eval.py --config configs/full_run.yaml
python scripts/aggregate_results.py results/full -o results/full/SUMMARY.md
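The shape of the YAML config isn't shown in this README; a sketch of what a sweep config plausibly contains (all key names here are illustrative, not taken from configs/full_run.yaml) might be:

```yaml
# Hypothetical config shape -- key names are illustrative, not the repo's.
tasks: [protocol_qa, vuln_detect, contract_class, tx_intent, slippage_pred]
client: openai
model: gpt-4o-mini
limit: null          # evaluate every seed example
output_dir: results/full
```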

Programmatic use

from chainreason.tasks import get_task
from chainreason.models.openai_client import OpenAIClient
from chainreason.runner import run_eval

task = get_task("vuln_detect")
model = OpenAIClient(model="gpt-4o-mini")
summary = run_eval(task, model, limit=10, output_dir="results/")
print(summary["metrics"])

Tasks

| Task           | n (seed) | Output type     | Metric                |
|----------------|----------|-----------------|-----------------------|
| protocol_qa    | 14       | A/B/C/D         | accuracy              |
| vuln_detect    | 12       | label (1 of 6)  | accuracy + macro-F1   |
| contract_class | 14       | label (1 of 11) | accuracy + macro-F1   |
| tx_intent      | 14       | label (1 of 14) | accuracy + macro-F1   |
| slippage_pred  | 10       | numeric         | tiered relative error |

For baseline numbers, see results/BASELINES.md.

The seed sets are intentionally small. They exist to make sure the benchmark runs and so you can sanity-check a new model in under a minute. Real evaluation should use a larger held-out set.
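slippage_pred is scored by tiered relative error. The exact tiers aren't documented in this README, but the idea can be sketched as follows — the thresholds and credit values here are assumptions, not the repo's actual tiers:

```python
def tiered_score(pred: float, truth: float,
                 tiers=((0.001, 1.0), (0.01, 0.5), (0.05, 0.25))) -> float:
    """Illustrative tiered relative-error scoring (thresholds are
    assumptions): full credit within 0.1%, partial credit at wider
    tolerances, zero beyond the loosest tier."""
    rel_err = abs(pred - truth) / abs(truth)
    for threshold, credit in tiers:
        if rel_err <= threshold:
            return credit
    return 0.0
```

The appeal of tiered scoring over plain accuracy is that a model off by 0.4% is meaningfully better than one off by 40%, and the metric should say so.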

Adding your own task

Subclass Task, implement four methods, register it:

from chainreason.tasks.base import Task, Example

class MyTask(Task):
    name = "my_task"

    def load(self): ...                       # return the task's Examples
    def build_prompt(self, ex): ...           # Example -> prompt string
    def parse_response(self, text): ...       # raw model text -> prediction
    def score(self, prediction, target): ...  # prediction vs. gold -> metrics

Add it to TASK_REGISTRY in chainreason/tasks/__init__.py and you're set.
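A fleshed-out version of the four methods might look like the sketch below. It uses minimal stand-ins for Task and Example so it runs outside the repo, and the task itself (a hypothetical "gas tier" classifier) is invented for illustration — the real base-class API may differ:

```python
from dataclasses import dataclass

# Stand-ins for chainreason.tasks.base.Task / Example so this sketch is
# self-contained; the real base classes likely carry more machinery.
@dataclass
class Example:
    input: str
    target: str

class Task:
    name = "base"

class GasTierTask(Task):  # hypothetical example task
    name = "gas_tier"

    def load(self):
        # Toy seed set: classify an EVM operation into a coarse gas tier.
        return [
            Example(input="SSTORE to a fresh storage slot", target="high"),
            Example(input="ADD of two stack values", target="low"),
        ]

    def build_prompt(self, ex):
        return f"Gas tier (low/medium/high) for: {ex.input}\nAnswer:"

    def parse_response(self, text):
        # Take the first recognized label in the model's reply.
        for label in ("low", "medium", "high"):
            if label in text.lower():
                return label
        return "unparsed"

    def score(self, prediction, target):
        return {"correct": prediction == target}

task = GasTierTask()
examples = task.load()
pred = task.parse_response("I'd say HIGH: a fresh SSTORE costs 20k+ gas.")
result = task.score(pred, examples[0].target)
```

Keeping parse_response defensive (returning a sentinel like "unparsed" rather than raising) matters in practice, since models frequently wrap answers in extra prose.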

Citation

If this is useful in your work:

@misc{yamamoto2025chainreason,
  author       = {Yamamoto, Joshua},
  title        = {{ChainReason}: A Benchmark for LLM Reasoning over On-Chain Tasks},
  year         = {2025},
  howpublished = {\url{https://github.com/joshawome/chainreason}}
}

License

MIT — see LICENSE.
