b2o2i edited this page Apr 3, 2026 · 1 revision

Cross-Stones Benchmark

Cross-Stones measures AI accuracy and speed across a fixed set of research domains. Run the same benchmark today, in six months, and in two years — the scores are directly comparable because the prompts, claim count, and scoring formula never change.

The name comes from curling: each AI gets one throw at the same target. The cross-product design means every AI also evaluates every other AI's throw.


Quick start

# Run the full benchmark for one domain (generates + fact-checks all AIs)
st-cross cross_stones/domains/healthcare_medical.json

# Score all 10 domains in the standard set
st-stones cross_stones/cross-stones-10.json

# Run any missing domains, then score
st-stones --run cross_stones/cross-stones-10.json

# After your first complete run — lock timing as the speed baseline (do once)
st-stones --set-baseline cross_stones/cross-stones-10.json

# Save this run as a named snapshot
st-stones --record-snapshot --snapshot-label "2026-Q1" cross_stones/cross-stones-10.json

# View score history over time
st-stones --history cross_stones/cross-stones-10.json

How it works

  1. Every AI generates a report for the same domain prompt — 5 AIs × 10 domains = 50 reports.
  2. Every AI fact-checks every report, including its own: 5 × 5 = 25 fact-checks per domain, or 250 fact-check operations across the full set.
  3. Scores are aggregated per AI across domains into the final cross_stone_score.
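
The counts in the steps above follow directly from the cross-product. A minimal sketch that enumerates them; the AI and domain labels are placeholders, and only the counts (5 AIs, 10 domains) come from the text:

```python
from itertools import product

# Placeholder labels; only the counts (5 AIs, 10 domains) come from the text.
ais = [f"ai_{i}" for i in range(1, 6)]
domains = [f"domain_{i}" for i in range(1, 11)]

# Step 1: one report per (author, domain) pair.
reports = list(product(ais, domains))

# Step 2: every AI checks every report, including its own.
fact_checks = [
    (checker, author, domain)
    for checker in ais
    for author, domain in reports
]

checks_per_domain = len(ais) * len(ais)

print(len(reports), checks_per_domain, len(fact_checks))  # 50 25 250
```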

The cross-product design reveals things a single-evaluator test cannot:

| Pattern | What it reveals |
|---|---|
| High self-score, low peer scores | AI is lenient with its own claims |
| Consistent row scores | Stable fact-checking style regardless of author |
| Diagonal vs off-diagonal gap | Degree of self-serving bias |
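
One way to quantify the diagonal vs off-diagonal gap is to compare the mean self-score with the mean peer score. The sketch below assumes a grid layout where rows are fact-checkers and columns are authors; the function name and the sample numbers are made up for illustration and are not taken from the tool:

```python
def self_bias_gap(grid):
    """Mean diagonal score minus mean off-diagonal score.

    grid[i][j] is the score that checker i gave to author j
    (assumed layout; needs at least a 2x2 grid).
    """
    n = len(grid)
    diagonal = [grid[i][i] for i in range(n)]
    off_diagonal = [grid[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(diagonal) / len(diagonal) - sum(off_diagonal) / len(off_diagonal)


# Made-up 3x3 grid: each AI scores itself higher than it scores its peers.
grid = [
    [18, 12, 10],
    [11, 16, 9],
    [12, 10, 17],
]
gap = self_bias_gap(grid)
print(round(gap, 2))  # 6.33
```

A gap near zero would suggest that the AIs grade their own reports about as strictly as they grade others.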

The 10 standard domains

| # | Domain | File |
|---|---|---|
| 1 | Software Development & Programming | software_development.prompt |
| 2 | Customer Service & Support | customer_service.prompt |
| 3 | Marketing & Content Creation | marketing_content.prompt |
| 4 | Education & Learning | education_learning.prompt |
| 5 | Data Analytics & Business Intelligence | data_analytics.prompt |
| 6 | Healthcare & Medical Analysis | healthcare_medical.prompt |
| 7 | Finance & Business Decision Making | finance_business.prompt |
| 8 | Writing, Editing & Summarizing | writing_editing.prompt |
| 9 | Research, Search & Q&A | research_qa.prompt |
| 10 | Creative Media (Images, Video, Audio) | creative_media.prompt |

Each prompt asks the AI for exactly 10 specific, fact-checkable claims at calibrated difficulty — roughly half verifiable with basic research, half requiring primary sources. Prompts are intentionally neutral across provider strengths.


Scoring

Claim-level verdicts

| Verdict | Points |
|---|---|
| True | +2 |
| Partially True | +1 |
| Opinion | 0 (excluded from average) |
| Partially False | −1 |
| False | −2 |
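
The verdict table can be read as a simple lookup. The sketch below is one interpretation, not the tool's code: it sums points over all claims and, as an assumption about what "excluded from average" means, leaves Opinion verdicts out of the per-claim average. The function name is hypothetical.

```python
# Point values copied from the verdict table above.
VERDICT_POINTS = {
    "True": 2,
    "Partially True": 1,
    "Opinion": 0,
    "Partially False": -1,
    "False": -2,
}


def score_claims(verdicts):
    """Return (total_points, average_over_scored_claims).

    Assumption: Opinion verdicts add 0 to the total and are dropped
    from the denominator of the average.
    """
    total = sum(VERDICT_POINTS[v] for v in verdicts)
    scored = [VERDICT_POINTS[v] for v in verdicts if v != "Opinion"]
    average = sum(scored) / len(scored) if scored else 0.0
    return total, average


# One fact-checker's verdicts on a 10-claim report (made-up example).
verdicts = ["True"] * 6 + ["Partially True", "Opinion", "Partially False", "False"]
total, average = score_claims(verdicts)
print(total, round(average, 2))  # 10 1.11
```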

Fact score

Each author's domain score is the average of the scores assigned by all N fact-checkers, which reduces individual evaluator bias. Summed across all 10 domains, the fact score has these bounds:

| Bound | Score |
|---|---|
| Maximum | +200 (10 domains × 10 claims × +2) |
| Minimum | −200 |
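
A minimal sketch of this aggregation, assuming each fact-checker produces one point total per report (between −20 and +20 for 10 claims). The function name and input layout are hypothetical; the tool's internal data format is not described here.

```python
from statistics import mean


def fact_score(checker_totals_by_domain):
    """Sum over domains of the mean per-checker point total for one author.

    checker_totals_by_domain maps a domain name to the list of point
    totals that each fact-checker gave this author's report (assumed layout).
    """
    return sum(mean(totals) for totals in checker_totals_by_domain.values())


# Upper bound: 5 checkers all award +20 in each of 10 domains.
perfect = {f"domain_{i}": [20] * 5 for i in range(10)}
print(fact_score(perfect))  # 200

# Made-up two-domain example: means are 13 and 8.
mixed = {"domain_a": [14, 12, 16, 10, 13], "domain_b": [8, 6, 10, 7, 9]}
print(fact_score(mixed))  # 21
```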

Composite Cross-Stone score

cross_stone_score = 0.7 × (fact_score / 200) + 0.3 × speed_ratio

speed_ratio is baseline_seconds / actual_seconds, so faster is higher. With the default weights the accuracy term contributes at most 0.7, so once you set a baseline, a composite score above 1.0 is only possible when the AI is faster than the baseline.

Without a baseline, relative mode is used (fastest AI in the current run = 100%). Relative mode is not comparable across time — set the baseline after your first complete run.

Adjust weights with --no-speed (accuracy only) or --w1/--w2 for custom splits.
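
The composite formula can be sketched as below. The parameter names `w_fact` and `w_speed` are hypothetical stand-ins for the two weights, and treating accuracy-only mode as weights of 1.0 and 0.0 is an assumption, as is the exact arithmetic of relative mode (fastest time divided by each AI's time).

```python
def speed_ratio(baseline_seconds, actual_seconds):
    """Baseline mode: values above 1.0 mean faster than the baseline."""
    return baseline_seconds / actual_seconds


def relative_speed(seconds_by_ai):
    """Relative mode (assumed form): the fastest AI in the run gets 1.0."""
    fastest = min(seconds_by_ai.values())
    return {ai: fastest / t for ai, t in seconds_by_ai.items()}


def cross_stone_score(fact_score, speed, w_fact=0.7, w_speed=0.3):
    """Composite score from the formula above; default weights are 0.7 / 0.3."""
    return w_fact * (fact_score / 200) + w_speed * speed


# Made-up example: fact score 150, run takes 60 s against a 120 s baseline.
ratio = speed_ratio(120.0, 60.0)
score = cross_stone_score(150, ratio)
print(round(score, 3))  # 1.125

# Accuracy only (assumed equivalent of dropping the speed term).
accuracy_only = cross_stone_score(150, ratio, w_fact=1.0, w_speed=0.0)
print(accuracy_only)  # 0.75

print(relative_speed({"ai_1": 40.0, "ai_2": 80.0}))  # {'ai_1': 1.0, 'ai_2': 0.5}
```

In this example the score exceeds 1.0 only because the run is twice as fast as the baseline; the accuracy term alone contributes 0.525.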


Reading the leaderboard

st-stones cross_stones/cross-stones-10.json

Key columns:

| Column | Meaning |
|---|---|
| Fact Score | Raw sum across all domains (range −200 to +200) |
| Fact% | Fact score as a percentage of 200 |
| vs Baseline | Speed ratio: 1.00× = baseline speed, 2.00× = twice as fast |
| Cross-Stone | The final ranking: composite accuracy + speed |

For a visual breakdown:

st-heatmap --display cross_stones/domains/healthcare_medical.json   # N×N score grid
st-stones --domain --ai-caption cross_stones/cross-stones-10.json   # per-domain breakdown

Historical tracking

Set the baseline once after your first complete run:

st-stones --set-baseline cross_stones/cross-stones-10.json

Then after each significant run (monthly, quarterly, annually):

st-stones --record-snapshot --snapshot-label "2026-Q1" cross_stones/cross-stones-10.json
st-stones --history cross_stones/cross-stones-10.json

The history tables show composite score, speed ratio, and accuracy per AI per snapshot — letting you see whether a provider has improved, regressed, or stayed flat over months and years.


Creating a custom domain

st-domain                          # interactive wizard
st-domain --name supply_chain      # pre-fill the slug

st-domain walks you through naming the domain, describing the topic, and smoke-testing it against one AI to confirm you get 10 fact-checkable claims back. Domains are saved to cross_stones/domains/ by default.

To add a custom domain to the standard benchmark set, register it in cross-stones-10.json or create a new named set config file.

Related: st-stones · st-cross · st-domain · st-heatmap · AI Providers
