# st-stones — Run and score the Cross-Stones benchmark

Scores AI providers on the Cross-Stones benchmark: a fixed set of domain prompts, each requiring exactly 10 fact-checkable claims. Produces a composite accuracy + speed leaderboard.

**Run after:** `st-cross`

**Related:** [st-domain](st-domain)  [st-speed](st-speed)  [Cross-Stones](cross-stones)

---

## Examples

```bash
st-stones                                              # auto-detect ~/cross-stones/ or ./cross_stones/
st-stones cross_st/cross_stones/cross-stones-10.json   # named benchmark set (locked params)
st-stones cross_st/cross_stones/                       # score all domains in a directory
st-stones --run cross_st/cross_stones/cross-stones-10.json    # run missing domains then score
st-stones --no-confirmation --run cross_stones/cross-stones-10.json  # scripted / CI use
st-stones --no-speed cross_stones/cross-stones-10.json # accuracy-only scoring
st-stones --domain cross_stones/cross-stones-10.json   # per-domain score breakdown
st-stones --init                                       # seed ~/cross-stones/ with bundled prompts
st-stones --init --dir my_domains/                     # seed to a custom directory
st-stones --set-baseline cross_stones/cross-stones-10.json     # lock current timing as baseline
st-stones --record-snapshot cross_stones/cross-stones-10.json  # save today's scores as snapshot
st-stones --history cross_stones/cross-stones-10.json          # show year-over-year history
st-stones --ai-caption cross_stones/cross-stones-10.json       # AI caption of leaderboard
```

## Options

| Option | Description |
|--------|-------------|
| `path` | Directory of benchmark `.json`/`.prompt` pairs, explicit `.json` paths, or a named benchmark set config. Omit to auto-detect `~/cross-stones/` or `./cross_stones/`. |
| `--init` | Seed the benchmark domain directory with bundled `.prompt` files, then exit |
| `--dir PATH` | Destination for `--init` (default: `~/cross-stones/`) |
| `--run` | Run `st-cross` for any domain that lacks fact-check data |
| `--confirmation` | Prompt before making API calls with `--run` (default) |
| `--no-confirmation` | Skip the `--run` confirmation prompt (useful for scripting) |
| `--w1 W1` | Accuracy weight 0–1 (default: 0.7) |
| `--w2 W2` | Speed weight 0–1 (default: 0.3) |
| `--no-speed` | Ignore timing data; pure accuracy scoring (sets w1=1.0, w2=0.0) |
| `--n-claims N` | Expected scorable claims per domain (default: from benchmark set config, or 10) |
| `--domain` | Show per-domain fact-score breakdown table |
| `--cache` | Enable API cache for `--run` (default: enabled) |
| `--no-cache` | Disable API cache for `--run` |
| `--timeout TIMEOUT` | Per-job timeout in seconds for `--run` (default: 300 = 5 min) |
| `-v`, `--verbose` | Verbose output |
| `-q`, `--quiet` | Minimal output |

### Historical tracking

| Option | Description |
|--------|-------------|
| `--history` | Display full snapshot history (composite score, speed ratio, accuracy over time) |
| `--set-baseline` | Lock the current run's timing as the speed baseline in the benchmark set config |
| `--record-snapshot` | Append current leaderboard scores as a named snapshot |
| `--snapshot-label LABEL` | Human-readable label for `--record-snapshot` (default: today's ISO date) |

### AI content generation

| Option | Description |
|--------|-------------|
| `--ai-title` | Generate a ≤10-word leaderboard title → stdout |
| `--ai-short` | Generate a 40–80-word leaderboard summary → stdout |
| `--ai-caption` | Generate a 100–160-word two-paragraph caption → stdout |
| `--ai-summary` | Generate a 120–200-word summary → stdout |
| `--ai-story` | Generate an 800–1200-word narrative → stdout |
| `--agent NAME` | AI provider for content generation (default: `xai`) |

---

## For developers

Score formula: `w1 × (fact_score / max_fact_score) + w2 × (speed_score / max_speed_score)` with defaults `w1=0.7`, `w2=0.3`. The locked benchmark set is `cross_stones/cross-stones-10.json`. Pass `--no-speed` for accuracy-only scoring. `--set-baseline` must be run once after the first complete benchmark; subsequent faster runs will have `speed_ratio > 1.0` and `cross_stone_score` may exceed 1.0, indicating genuine improvement.