A small benchmark for evaluating LLM reasoning on Ethereum and DeFi tasks.
ChainReason is a lightweight evaluation suite that asks language models to do five things a smart-contract engineer or DeFi analyst would consider routine:
- `protocol_qa` — multiple-choice questions about specific DeFi protocol mechanics.
- `vuln_detect` — classify a Solidity snippet by vulnerability category.
- `contract_class` — classify a contract from its ABI summary + an optional hint.
- `tx_intent` — given a sequence of decoded actions, infer the transaction's intent.
- `slippage_pred` — given an AMM pool state and a swap, compute the output amount.
The point of having five tasks instead of one is that each one stresses a different
capability — symbolic reasoning, code understanding, structural pattern recognition,
numeric reasoning. A model that's strong on `vuln_detect` but weak on `slippage_pred`
tells you something different than a model that's strong on both.
Existing benchmarks for Solidity / blockchain LLMs largely focus on either (a) code generation, or (b) vulnerability detection. ChainReason adds three other axes that I haven't seen consolidated elsewhere:
- Protocol-level reasoning. Knowing what `getReserves()` returns is one thing; knowing what happens when you yank 30% of the reserves out of a Uniswap v2 pair is another.
- Transaction-graph understanding. Telling a sandwich apart from a swap or an arbitrage requires looking at the structure of an execution trace, not just opcodes.
- Numeric grounding. AMMs have closed-form pricing. If a model gets the CPMM math wrong, it'll be wrong about every downstream task (a short sketch of the formula follows this list).
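For reference, the pricing that `slippage_pred` leans on is the standard constant-product (x*y = k) formula with a Uniswap v2-style 0.3% fee. A minimal sketch; the helper name and `fee_bps` parameter are mine, not part of the benchmark:

```python
def cpmm_amount_out(amount_in: int, reserve_in: int, reserve_out: int,
                    fee_bps: int = 30) -> int:
    """Constant-product (x*y = k) output for a Uniswap v2-style pool.

    Mirrors the usual getAmountOut math: the fee comes out of the input
    amount, and integer division rounds in the pool's favour.
    """
    amount_in_with_fee = amount_in * (10_000 - fee_bps)
    numerator = amount_in_with_fee * reserve_out
    denominator = reserve_in * 10_000 + amount_in_with_fee
    return numerator // denominator

# Swap 1_000 of token A into a pool holding 100_000 A / 200_000 B.
# Spot price suggests ~2_000 B out; the CPMM curve plus the fee gives less.
print(cpmm_amount_out(1_000, 100_000, 200_000))  # -> 1974
```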
The dataset is small and hand-curated — this is not a leaderboard scraper or a
ten-thousand-row crawl of Etherscan. The included seed examples are meant to be
illustrative; you can extend them with your own data via `--data-path`.
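As a purely hypothetical illustration of that extension path: the `--data-path` flag is real, but the record shape below is my guess at a plausible layout, not the schema the loaders actually expect; check the seed files in the repo for the authoritative format.

```python
import json

# Hypothetical protocol_qa record. Field names are assumptions on my part,
# not the real chainreason schema; match whatever the task's loader reads.
example = {
    "id": "protocol_qa_custom_001",
    "question": "In Uniswap v2, what does getReserves() return?",
    "choices": ["..."],
    "answer": "A",
}

with open("my_protocol_qa.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")

# The flag itself is from the README:
#   python scripts/run_eval.py --task protocol_qa --data-path my_protocol_qa.jsonl ...
```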
```bash
git clone https://github.com/joshawome/chainreason
cd chainreason
pip install -e .
```

For local model inference (HuggingFace), also install:

```bash
pip install torch transformers accelerate
```

Set your OpenAI key and run a single task:

```bash
export OPENAI_API_KEY=...
python scripts/run_eval.py --task protocol_qa --client openai --model gpt-4o-mini --limit 5
```

Or run a full sweep from a YAML config:

```bash
python scripts/run_eval.py --config configs/full_run.yaml
python scripts/aggregate_results.py results/full -o results/full/SUMMARY.md
```

The same evaluation can be driven from Python:

```python
from chainreason.tasks import get_task
from chainreason.models.openai_client import OpenAIClient
from chainreason.runner import run_eval
task = get_task("vuln_detect")
model = OpenAIClient(model="gpt-4o-mini")
summary = run_eval(task, model, limit=10, output_dir="results/")
print(summary["metrics"])
```

| Task | n (seed) | Output type | Metric |
|---|---|---|---|
| `protocol_qa` | 14 | A/B/C/D | accuracy |
| `vuln_detect` | 12 | label (1 of 6) | accuracy + macro-F1 |
| `contract_class` | 14 | label (1 of 11) | accuracy + macro-F1 |
| `tx_intent` | 14 | label (1 of 14) | accuracy + macro-F1 |
| `slippage_pred` | 10 | numeric | tiered relative error |
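If you'd rather script the full sweep from Python than YAML, here is a minimal sketch over the five seed tasks listed above, reusing the `get_task` / `run_eval` call shape from the earlier example (the loop and per-task output directories are my own choice, not something the repo prescribes):

```python
from chainreason.tasks import get_task
from chainreason.models.openai_client import OpenAIClient
from chainreason.runner import run_eval

# The five seed tasks from the table above.
TASKS = ["protocol_qa", "vuln_detect", "contract_class", "tx_intent", "slippage_pred"]

model = OpenAIClient(model="gpt-4o-mini")
for name in TASKS:
    # Same call shape as the single-task example; raise `limit` for a real run.
    summary = run_eval(get_task(name), model, limit=10, output_dir=f"results/sweep/{name}")
    print(name, summary["metrics"])
```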
For baseline numbers, see `results/BASELINES.md`.
The seed sets are intentionally small. They exist to make sure the benchmark runs and so you can sanity-check a new model in under a minute. Real evaluation should use a larger held-out set.
Subclass `Task`, implement four methods, and register it:
```python
from chainreason.tasks.base import Task, Example

class MyTask(Task):
    name = "my_task"

    def load(self): ...                       # return the task's examples
    def build_prompt(self, ex): ...           # format one example as a model prompt
    def parse_response(self, text): ...       # extract the predicted answer from raw model text
    def score(self, prediction, target): ...  # score a parsed prediction against the target
```

Add it to `TASK_REGISTRY` in `chainreason/tasks/__init__.py` and you're set.
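As a sketch of that last step, assuming `TASK_REGISTRY` is a plain name-to-class mapping (check `chainreason/tasks/__init__.py` for its real shape) and that the new class lives in a module of your choosing:

```python
# In chainreason/tasks/__init__.py (sketch; the import path for MyTask is
# whatever module you actually put the class in).
from chainreason.tasks.my_task import MyTask

TASK_REGISTRY["my_task"] = MyTask
```

After that, `get_task("my_task")` and `--task my_task` should both resolve to the new class.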
If this is useful in your work:
```bibtex
@misc{yamamoto2025chainreason,
  author       = {Yamamoto, Joshua},
  title        = {{ChainReason}: A Benchmark for LLM Reasoning over On-Chain Tasks},
  year         = {2025},
  howpublished = {\url{https://github.com/joshawome/chainreason}}
}
```

MIT — see LICENSE.