cross stones
Cross-Stones measures AI accuracy and speed across a fixed set of research domains. Run the same benchmark today, in six months, and in two years — the scores are directly comparable because the prompts, claim count, and scoring formula never change.
The name comes from curling: each AI gets one throw at the same target. The cross-product design means every AI also evaluates every other AI's throw.
# Run the full benchmark for one domain (generates + fact-checks all AIs)
st-cross cross_stones/domains/healthcare_medical.json
# Score all 10 domains in the standard set
st-stones cross_stones/cross-stones-10.json
# Run any missing domains, then score
st-stones --run cross_stones/cross-stones-10.json
# After your first complete run — lock timing as the speed baseline (do once)
st-stones --set-baseline cross_stones/cross-stones-10.json
# Save this run as a named snapshot
st-stones --record-snapshot --snapshot-label "2026-Q1" cross_stones/cross-stones-10.json
# View score history over time
st-stones --history cross_stones/cross-stones-10.json
- Every AI generates a report for the same domain prompt: 5 AIs × 10 domains = 50 reports.
- Every AI fact-checks every report: 5×5 = 25 fact-checks per domain, 250 fact-check operations across the full set.
- Scores are aggregated per AI across domains into the final cross_stone_score.
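The report and fact-check counts follow from a plain cross product. A minimal sketch, assuming five placeholder AI names and ten placeholder domain names (these identifiers are illustrative and are not taken from the tool):

```python
from itertools import product

# Hypothetical names, used only to illustrate the counts.
ais = ["ai_a", "ai_b", "ai_c", "ai_d", "ai_e"]
domains = [f"domain_{i}" for i in range(1, 11)]

# One report per (author, domain) pair.
reports = list(product(ais, domains))

# One fact-check per (checker, author, domain) triple, self-checks included.
fact_checks = list(product(ais, ais, domains))

print(len(reports))      # 50
print(len(fact_checks))  # 250
```

Adding a sixth AI would raise the totals to 60 reports and 360 fact-checks, which is why the AI roster matters when comparing runs.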
The cross-product design reveals things a single-evaluator test cannot:
| Pattern | What it reveals |
|---|---|
| High self-score, low peer scores | AI is lenient with its own claims |
| Consistent row scores | Stable fact-checking style regardless of author |
| Diagonal vs off-diagonal gap | Degree of self-serving bias |
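The diagonal versus off-diagonal comparison can be sketched as below. This assumes an N×N grid where rows are fact-checkers and columns are authors, so the diagonal holds each AI's score for its own report. The function name and the sample numbers are made up for illustration and are not part of the tool.

```python
from statistics import mean

def self_bias_gap(matrix):
    """Mean diagonal (self) score minus mean off-diagonal (peer) score."""
    n = len(matrix)
    diagonal = [matrix[i][i] for i in range(n)]
    off_diagonal = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return mean(diagonal) - mean(off_diagonal)

# Made-up 3x3 grid: rows are fact-checkers, columns are authors.
scores = [
    [18, 10, 12],
    [11, 16, 12],
    [10, 11, 17],
]

gap = self_bias_gap(scores)
print(gap)  # 6: self-scores average 17, peer scores average 11
```

A gap near zero suggests the AIs grade their own reports about as strictly as they grade others; a large positive gap points toward self-serving bias.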
The standard set contains 10 domains:
| # | Domain | File |
|---|---|---|
| 1 | Software Development & Programming | software_development.prompt |
| 2 | Customer Service & Support | customer_service.prompt |
| 3 | Marketing & Content Creation | marketing_content.prompt |
| 4 | Education & Learning | education_learning.prompt |
| 5 | Data Analytics & Business Intelligence | data_analytics.prompt |
| 6 | Healthcare & Medical Analysis | healthcare_medical.prompt |
| 7 | Finance & Business Decision Making | finance_business.prompt |
| 8 | Writing, Editing & Summarizing | writing_editing.prompt |
| 9 | Research, Search & Q&A | research_qa.prompt |
| 10 | Creative Media (Images, Video, Audio) | creative_media.prompt |
Each prompt asks the AI for exactly 10 specific, fact-checkable claims at calibrated difficulty — roughly half verifiable with basic research, half requiring primary sources. Prompts are intentionally neutral across provider strengths.
| Verdict | Points |
|---|---|
| True | +2 |
| Partially True | +1 |
| Opinion | 0 (excluded from average) |
| Partially False | −1 |
| False | −2 |
Each author's domain score is averaged across all N fact-checkers, reducing individual evaluator bias. Summed across all 10 domains:
| Score | |
|---|---|
| Maximum | +200 (10 domains × 10 claims × +2) |
| Minimum | −200 |
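As a rough illustration of the scoring rules, the sketch below maps verdicts to points, sums them per fact-checker, and averages across fact-checkers for one author in one domain. The function names are hypothetical, and the handling of Opinion is an assumption: here it simply contributes 0 points, which may differ from how the tool excludes it from the average.

```python
from statistics import mean

# Points per verdict, taken from the verdict table.
POINTS = {
    "True": 2,
    "Partially True": 1,
    "Opinion": 0,  # assumption: contributes nothing to the sum
    "Partially False": -1,
    "False": -2,
}

def checker_score(verdicts):
    """Sum the points one fact-checker awarded to one report."""
    return sum(POINTS[v] for v in verdicts)

def domain_score(verdicts_by_checker):
    """Average the per-checker sums for one author's report in one domain."""
    return mean([checker_score(v) for v in verdicts_by_checker])

# Two hypothetical fact-checkers grading the same 10-claim report.
checker_1 = ["True"] * 6 + ["Partially True"] * 2 + ["Opinion"] + ["False"]
checker_2 = ["True"] * 5 + ["Partially True"] * 3 + ["Partially False"] * 2

print(checker_score(checker_1))               # 12
print(checker_score(checker_2))               # 11
print(domain_score([checker_1, checker_2]))   # 11.5
```

Summing `domain_score` over the 10 domains gives a fact score in the ±200 range, since a single report can earn at most 10 × 2 = 20 points.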
cross_stone_score = 0.7 × (fact_score / 200) + 0.3 × speed_ratio
Speed is baseline_seconds / actual_seconds, so faster is higher. Because fact_score / 200 cannot exceed 1, a composite score above 1.0 with the default weights is only reachable when the AI runs faster than the baseline (speed_ratio > 1).
Without a baseline, relative mode is used (fastest AI in the current run = 100%). Relative mode is not comparable across time — set the baseline after your first complete run.
Adjust weights with --no-speed (accuracy only) or --w1/--w2 for custom splits.
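The composite formula and the two speed modes can be sketched as follows. The function names are hypothetical. The sketch assumes that w1 weights accuracy and w2 weights speed, that accuracy-only scoring corresponds to weights of 1.0 and 0.0, and that relative mode divides the fastest time in the run by each AI's time. The tool's actual behavior may differ.

```python
def baseline_speed_ratio(baseline_seconds, actual_seconds):
    """Speed relative to a locked baseline: above 1.0 means faster."""
    return baseline_seconds / actual_seconds

def relative_speed_ratios(seconds_by_ai):
    """Speed relative to the fastest AI in this run (fastest = 1.0)."""
    fastest = min(seconds_by_ai.values())
    return {ai: fastest / secs for ai, secs in seconds_by_ai.items()}

def cross_stone_score(fact_score, speed_ratio, w1=0.7, w2=0.3):
    """Weighted composite of accuracy (fact_score / 200) and speed."""
    return w1 * (fact_score / 200) + w2 * speed_ratio

# Fact score of 150, twice as fast as the baseline.
ratio = baseline_speed_ratio(120, 60)            # 2.0
composite = cross_stone_score(150, ratio)        # approximately 1.125

# Accuracy only: the speed term is dropped.
accuracy_only = cross_stone_score(150, ratio, w1=1.0, w2=0.0)  # 0.75

# Relative mode with made-up timings.
relative = relative_speed_ratios({"ai_a": 60, "ai_b": 120})
print(relative)  # {'ai_a': 1.0, 'ai_b': 0.5}
```

In the first example the composite exceeds 1.0 even though accuracy is 75%, because the speed term contributes 0.3 × 2.0 = 0.6.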
st-stones cross_stones/cross-stones-10.json
Key columns:
| Column | Meaning |
|---|---|
| Fact Score | Raw sum across all domains (out of ±200) |
| Fact% | Fact score as a percentage of 200 |
| vs Baseline | Speed ratio: 1.00× = baseline speed, 2.00× = twice as fast |
| Cross-Stone | The final ranking — composite accuracy + speed |
For a visual breakdown:
st-heatmap --display cross_stones/domains/healthcare_medical.json # N×N score grid
st-stones --domain --ai-caption cross_stones/cross-stones-10.json # per-domain breakdown
Set the baseline once after your first complete run:
st-stones --set-baseline cross_stones/cross-stones-10.json
Then after each significant run (monthly, quarterly, annually):
st-stones --record-snapshot --snapshot-label "2026-Q1" cross_stones/cross-stones-10.json
st-stones --history cross_stones/cross-stones-10.json
The history tables show composite score, speed ratio, and accuracy per AI per snapshot, letting you see whether a provider has improved, regressed, or stayed flat over months and years.
st-domain # interactive wizard
st-domain --name supply_chain # pre-fill the slug
st-domain walks you through naming the domain, describing the topic, and smoke-testing it against one AI to confirm you get 10 fact-checkable claims back. Domains are saved to cross_stones/domains/ by default.
To add a custom domain to the standard benchmark set, register it in cross-stones-10.json or create a new named set config file.
Related: st-stones · st-cross · st-domain · st-heatmap · AI Providers