# Cross-Stones Benchmark Cross-Stones measures AI accuracy and speed across a fixed set of research domains. Run the same benchmark today, in six months, and in two years — the scores are directly comparable because the prompts, claim count, and scoring formula never change. The name comes from curling: each AI gets one throw at the same target. The cross-product design means every AI also evaluates every other AI's throw. --- ## Quick start ```bash # Run the full benchmark for one domain (generates + fact-checks all AIs) st-cross cross_stones/domains/healthcare_medical.json # Score all 10 domains in the standard set st-stones cross_stones/cross-stones-10.json # Run any missing domains, then score st-stones --run cross_stones/cross-stones-10.json # After your first complete run — lock timing as the speed baseline (do once) st-stones --set-baseline cross_stones/cross-stones-10.json # Save this run as a named snapshot st-stones --record-snapshot --snapshot-label "2026-Q1" cross_stones/cross-stones-10.json # View score history over time st-stones --history cross_stones/cross-stones-10.json ``` --- ## How it works 1. **Every AI generates a report** for the same domain prompt — 5 AIs × 10 domains = 50 reports. 2. **Every AI fact-checks every report** — 5×5 per domain = 250 fact-check operations across the full set. 3. **Scores are aggregated** per AI across domains into the final `cross_stone_score`. The cross-product design reveals things a single-evaluator test cannot: | Pattern | What it reveals | |---------|----------------| | High self-score, low peer scores | AI is lenient with its own claims | | Consistent row scores | Stable fact-checking style regardless of author | | Diagonal vs off-diagonal gap | Degree of self-serving bias | --- ## The 10 standard domains | # | Domain | File | |---|--------|------| | 1 | Software Development & Programming | `software_development.prompt` | | 2 | Customer Service & Support | `customer_service.prompt` | | 3 | Marketing & Content Creation | `marketing_content.prompt` | | 4 | Education & Learning | `education_learning.prompt` | | 5 | Data Analytics & Business Intelligence | `data_analytics.prompt` | | 6 | Healthcare & Medical Analysis | `healthcare_medical.prompt` | | 7 | Finance & Business Decision Making | `finance_business.prompt` | | 8 | Writing, Editing & Summarizing | `writing_editing.prompt` | | 9 | Research, Search & Q&A | `research_qa.prompt` | | 10 | Creative Media (Images, Video, Audio) | `creative_media.prompt` | Each prompt asks the AI for **exactly 10 specific, fact-checkable claims** at calibrated difficulty — roughly half verifiable with basic research, half requiring primary sources. Prompts are intentionally neutral across provider strengths. --- ## Scoring ### Claim-level verdicts | Verdict | Points | |---------|--------| | True | **+2** | | Partially True | **+1** | | Opinion | **0** *(excluded from average)* | | Partially False | **−1** | | False | **−2** | ### Fact score Each author's domain score is averaged across all N fact-checkers, reducing individual evaluator bias. Summed across all 10 domains: | | Score | |---|---| | Maximum | **+200** (10 domains × 10 claims × +2) | | Minimum | **−200** | ### Composite Cross-Stone score ``` cross_stone_score = 0.7 × (fact_score / 200) + 0.3 × speed_ratio ``` Speed is `baseline_seconds / actual_seconds` — faster is higher. Once you set a baseline, scores above 1.0 mean the AI is meaningfully faster and more accurate than the baseline era. Without a baseline, relative mode is used (fastest AI in the current run = 100%). Relative mode is not comparable across time — set the baseline after your first complete run. **Adjust weights** with `--no-speed` (accuracy only) or `--w1`/`--w2` for custom splits. --- ## Reading the leaderboard ``` st-stones cross_stones/cross-stones-10.json ``` Key columns: | Column | Meaning | |--------|---------| | `Fact Score` | Raw sum across all domains (out of ±200) | | `Fact%` | Fact score as a percentage of 200 | | `vs Baseline` | Speed ratio — 1.00× = baseline speed, 2.00× = twice as fast | | **Cross-Stone** | **The final ranking — composite accuracy + speed** | For a visual breakdown: ```bash st-heatmap --display cross_stones/domains/healthcare_medical.json # N×N score grid st-stones --domain --ai-caption cross_stones/cross-stones-10.json # per-domain breakdown ``` --- ## Historical tracking Set the baseline once after your first complete run: ```bash st-stones --set-baseline cross_stones/cross-stones-10.json ``` Then after each significant run (monthly, quarterly, annually): ```bash st-stones --record-snapshot --snapshot-label "2026-Q1" cross_stones/cross-stones-10.json st-stones --history cross_stones/cross-stones-10.json ``` The history tables show composite score, speed ratio, and accuracy per AI per snapshot — letting you see whether a provider has improved, regressed, or stayed flat over months and years. --- ## Creating a custom domain ```bash st-domain # interactive wizard st-domain --name supply_chain # pre-fill the slug ``` `st-domain` walks you through naming the domain, describing the topic, and smoke-testing it against one AI to confirm you get 10 fact-checkable claims back. Domains are saved to `cross_stones/domains/` by default. To add a custom domain to the standard benchmark set, register it in `cross-stones-10.json` or create a new named set config file. **Related:** [st-stones](st-stones.md) · [st-cross](st-cross.md) · [st-domain](st-domain.md) · [st-heatmap](st-heatmap.md) · [AI Providers](ai-providers.md)