cross stones

Cross-Stones Benchmark

Cross-Stones measures AI accuracy and speed across a fixed set of research domains. Run the same benchmark today, in six months, and in two years — the scores are directly comparable because the prompts, claim count, and scoring formula never change.

The name comes from curling: each AI gets one throw at the same target. The cross-product design means every AI also evaluates every other AI's throw.

Quick start

# Run the full benchmark for one domain (generates + fact-checks all AIs)
st-cross cross_stones/domains/healthcare_medical.json

# Score all 10 domains in the standard set
st-stones cross_stones/cross-stones-10.json

# Run any missing domains, then score
st-stones --run cross_stones/cross-stones-10.json

# After your first complete run — lock timing as the speed baseline (do once)
st-stones --set-baseline cross_stones/cross-stones-10.json

# Save this run as a named snapshot
st-stones --record-snapshot --snapshot-label "2026-Q1" cross_stones/cross-stones-10.json

# View score history over time
st-stones --history cross_stones/cross-stones-10.json

How it works

Every AI generates a report for the same domain prompt — 5 AIs × 10 domains = 50 reports.
Every AI fact-checks every report, including its own: 5 × 5 = 25 fact-checks per domain, or 250 fact-check operations across the full set.
Scores are aggregated per AI across domains into the final cross_stone_score.
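
The counts above follow from a plain cross product. A minimal sketch, assuming five placeholder AI names and ten placeholder domain slugs (both hypothetical, not the benchmark's real identifiers):

```python
from itertools import product

# Hypothetical placeholders; real AI and domain names come from your own setup.
ais = [f"ai_{i}" for i in range(1, 6)]
domains = [f"domain_{i}" for i in range(1, 11)]

# One report per (author, domain) pair.
reports = list(product(ais, domains))

# Every AI fact-checks every report, including its own.
fact_checks = [
    (checker, author, domain)
    for checker in ais
    for author, domain in reports
]

per_domain = len(ais) * len(ais)
print(len(reports), per_domain, len(fact_checks))  # 50 25 250
```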

The cross-product design reveals things a single-evaluator test cannot:

Pattern	What it reveals
High self-score, low peer scores	AI is lenient with its own claims
Consistent row scores	Stable fact-checking style regardless of author
Diagonal vs off-diagonal gap	Degree of self-serving bias
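
One way to quantify these patterns from an N×N grid is sketched below. It assumes rows are fact-checkers and columns are authors, and that the grid is available as a nested dict. The function name, the dict layout, and the example numbers are illustrative assumptions, not output from the tool.

```python
from statistics import mean, pstdev

def bias_summary(scores, ais):
    """scores[checker][author] is the score `checker` gave to `author`'s report."""
    summary = {}
    for ai in ais:
        self_score = scores[ai][ai]  # diagonal cell
        peer_scores = [scores[checker][ai] for checker in ais if checker != ai]
        row = [scores[ai][author] for author in ais]  # scores this AI handed out
        summary[ai] = {
            # Positive gap: the AI rates its own report above what peers give it.
            "self_minus_peers": self_score - mean(peer_scores),
            # Small spread: the AI scores all authors similarly.
            "row_spread": pstdev(row),
        }
    return summary

# Made-up example grid for three hypothetical AIs.
ais = ["a", "b", "c"]
scores = {
    "a": {"a": 18, "b": 10, "c": 12},
    "b": {"a": 12, "b": 14, "c": 14},
    "c": {"a": 12, "b": 14, "c": 14},
}
result = bias_summary(scores, ais)
print(result["a"]["self_minus_peers"])  # 6
```

Here AI `a` scores itself 6 points above the average its peers award it, which is the diagonal vs off-diagonal gap from the table.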

The 10 standard domains

#	Domain	File
1	Software Development & Programming	`software_development.prompt`
2	Customer Service & Support	`customer_service.prompt`
3	Marketing & Content Creation	`marketing_content.prompt`
4	Education & Learning	`education_learning.prompt`
5	Data Analytics & Business Intelligence	`data_analytics.prompt`
6	Healthcare & Medical Analysis	`healthcare_medical.prompt`
7	Finance & Business Decision Making	`finance_business.prompt`
8	Writing, Editing & Summarizing	`writing_editing.prompt`
9	Research, Search & Q&A	`research_qa.prompt`
10	Creative Media (Images, Video, Audio)	`creative_media.prompt`

Each prompt asks the AI for exactly 10 specific, fact-checkable claims at calibrated difficulty — roughly half verifiable with basic research, half requiring primary sources. Prompts are intentionally neutral across provider strengths.

Scoring

Claim-level verdicts

Verdict	Points
True	+2
Partially True	+1
Opinion	0 (excluded from average)
Partially False	−1
False	−2
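
The verdict table maps directly to a lookup. The sketch below assumes verdicts arrive as the exact strings shown in the table and that "excluded from average" means Opinion claims are dropped from both the total and the claim count. Both are assumptions about the tool's behavior.

```python
# Points per verdict, taken from the table above.
VERDICT_POINTS = {
    "True": 2,
    "Partially True": 1,
    "Partially False": -1,
    "False": -2,
}

def score_claims(verdicts):
    """Return (total_points, scored_claim_count), skipping Opinion verdicts."""
    total = 0
    counted = 0
    for verdict in verdicts:
        if verdict == "Opinion":
            continue  # contributes 0 and is left out of the average
        total += VERDICT_POINTS[verdict]
        counted += 1
    return total, counted

# Illustrative verdicts for one 10-claim report.
verdicts = ["True"] * 6 + ["Partially True"] * 2 + ["Opinion", "False"]
total, counted = score_claims(verdicts)
print(total, counted)  # 12 9
```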

Fact score

Each author's domain score is the average of the scores awarded by all N fact-checkers, which reduces individual evaluator bias. The fact score is the sum of these domain scores across all 10 domains, with the following bounds:

	Score
Maximum	+200 (10 domains × 10 claims × +2)
Minimum	−200
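
The aggregation can be sketched as follows, assuming each fact-checker's result for a report is a point total between −20 and +20 (10 claims × ±2). The input layout and function name are hypothetical.

```python
from statistics import mean

def fact_score(per_domain_checker_points):
    """Sum, over domains, of the mean point total across fact-checkers.

    per_domain_checker_points maps a domain name to a list of point totals,
    one per fact-checker, for a single author's report in that domain.
    """
    return sum(mean(points) for points in per_domain_checker_points.values())

# Illustrative inputs: 10 domains, 5 fact-checkers each.
perfect = {f"domain_{i}": [20, 20, 20, 20, 20] for i in range(10)}
mixed = {f"domain_{i}": [20, 16, 12, 16, 16] for i in range(10)}

print(fact_score(perfect))  # 200
print(fact_score(mixed))    # 160
```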

Composite Cross-Stone score

cross_stone_score = 0.7 × (fact_score / 200) + 0.3 × speed_ratio

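The speed ratio is baseline_seconds / actual_seconds, so faster runs score higher. With the default weights the accuracy term cannot exceed 0.7, so once you set a baseline, a score above 1.0 is possible only when the speed ratio is above 1.0, meaning the AI is faster than the baseline run.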

Without a baseline, relative mode is used (fastest AI in the current run = 100%). Relative mode is not comparable across time — set the baseline after your first complete run.
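
The two speed modes can be sketched as a single function. The baseline formula comes from the text above. The relative-mode formula (fastest time divided by actual time) is an assumption consistent with "fastest AI = 100%", and the function and parameter names are hypothetical.

```python
def speed_ratio(actual_seconds, baseline_seconds=None, fastest_seconds=None):
    """Return the speed ratio for one AI.

    Baseline mode: baseline_seconds / actual_seconds.
    Relative mode (no baseline): fastest_seconds / actual_seconds, so the
    fastest AI in the current run gets 1.0 (100%).
    """
    if actual_seconds <= 0:
        raise ValueError("actual_seconds must be positive")
    if baseline_seconds is not None:
        return baseline_seconds / actual_seconds
    if fastest_seconds is None:
        raise ValueError("relative mode needs fastest_seconds")
    return fastest_seconds / actual_seconds

print(speed_ratio(60, baseline_seconds=120))  # 2.0
print(speed_ratio(120, fastest_seconds=60))   # 0.5
```

In relative mode the reference time changes with every run, which is why those ratios cannot be compared across snapshots.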

Adjust weights with --no-speed (accuracy only) or --w1/--w2 for custom splits.
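
A minimal sketch of the composite formula with adjustable weights. The keyword arguments mirror the --w1, --w2, and --no-speed flags by name only. How the tool applies them internally is an assumption here, in particular that --no-speed returns the accuracy fraction on its own.

```python
def cross_stone_score(fact_score, speed_ratio, w1=0.7, w2=0.3, no_speed=False):
    """Composite of accuracy (fact_score / 200) and speed ratio."""
    accuracy = fact_score / 200
    if no_speed:
        # Assumed behavior of --no-speed: accuracy only.
        return accuracy
    return w1 * accuracy + w2 * speed_ratio

print(round(cross_stone_score(160, 1.0), 2))        # 0.86
print(round(cross_stone_score(200, 2.0), 2))        # 1.3
print(cross_stone_score(160, 1.0, no_speed=True))   # 0.8
```

With the default weights, a perfect fact score at baseline speed gives exactly 1.0, so anything higher requires a speed ratio above 1.0.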

Reading the leaderboard

st-stones cross_stones/cross-stones-10.json

Key columns:

Column	Meaning
`Fact Score`	Raw sum across all domains (out of ±200)
`Fact%`	Fact score as a percentage of 200
`vs Baseline`	Speed ratio — 1.00× = baseline speed, 2.00× = twice as fast
`Cross-Stone`	Final ranking score: the composite of accuracy and speed

For a visual breakdown:

st-heatmap --display cross_stones/domains/healthcare_medical.json   # N×N score grid
st-stones --domain --ai-caption cross_stones/cross-stones-10.json   # per-domain breakdown

Historical tracking

Set the baseline once after your first complete run:

st-stones --set-baseline cross_stones/cross-stones-10.json

Then after each significant run (monthly, quarterly, annually):

st-stones --record-snapshot --snapshot-label "2026-Q1" cross_stones/cross-stones-10.json
st-stones --history cross_stones/cross-stones-10.json

The history tables show composite score, speed ratio, and accuracy per AI per snapshot — letting you see whether a provider has improved, regressed, or stayed flat over months and years.
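
If you export the composite scores from the history tables, a simple comparison of the first and last snapshot classifies the trend. This is only a sketch: the list-of-tuples layout, the tolerance value, and the example scores are all made up for illustration and are not produced by st-stones.

```python
def trend(history, tolerance=0.01):
    """Classify a chronological list of (label, composite_score) pairs."""
    if len(history) < 2:
        return "insufficient data"
    delta = history[-1][1] - history[0][1]
    if delta > tolerance:
        return "improved"
    if delta < -tolerance:
        return "regressed"
    return "flat"

# Hypothetical snapshots for one AI.
snapshots = [("2026-Q1", 0.80), ("2026-Q2", 0.83), ("2026-Q3", 0.86)]
print(trend(snapshots))  # improved
```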

Creating a custom domain

st-domain                          # interactive wizard
st-domain --name supply_chain      # pre-fill the slug

st-domain walks you through naming the domain, describing the topic, and smoke-testing it against one AI to confirm you get 10 fact-checkable claims back. Domains are saved to cross_stones/domains/ by default.

To add a custom domain to the standard benchmark set, register it in cross-stones-10.json or create a new named set config file.

Related: st-stones · st-cross · st-domain · st-heatmap · AI Providers
