st stones

st-stones — Run and score the Cross-Stones benchmark

Scores AI providers on the Cross-Stones benchmark: a fixed set of domain prompts, each requiring exactly 10 fact-checkable claims. Produces a composite accuracy + speed leaderboard.

Run after: st-cross

Related: st-domain st-speed Cross-Stones

Examples

st-stones                                              # auto-detect ~/cross-stones/ or ./cross_stones/
st-stones cross_st/cross_stones/cross-stones-10.json   # named benchmark set (locked params)
st-stones cross_st/cross_stones/                       # score all domains in a directory
st-stones --run cross_st/cross_stones/cross-stones-10.json    # run missing domains then score
st-stones --no-confirmation --run cross_stones/cross-stones-10.json  # scripted / CI use
st-stones --no-speed cross_stones/cross-stones-10.json # accuracy-only scoring
st-stones --domain cross_stones/cross-stones-10.json   # per-domain score breakdown
st-stones --init                                       # seed ~/cross-stones/ with bundled prompts
st-stones --init --dir my_domains/                     # seed to a custom directory
st-stones --set-baseline cross_stones/cross-stones-10.json     # lock current timing as baseline
st-stones --record-snapshot cross_stones/cross-stones-10.json  # save today's scores as snapshot
st-stones --history cross_stones/cross-stones-10.json          # show year-over-year history
st-stones --ai-caption cross_stones/cross-stones-10.json       # AI caption of leaderboard

Options

Option	Description
`path`	Directory of benchmark `.json`/`.prompt` pairs, explicit `.json` paths, or a named benchmark set config. Omit to auto-detect `~/cross-stones/` or `./cross_stones/`.
`--init`	Seed the benchmark domain directory with bundled `.prompt` files, then exit
`--dir PATH`	Destination for `--init` (default: `~/cross-stones/`)
`--run`	Run `st-cross` for any domain that lacks fact-check data
`--confirmation`	Prompt before making API calls with `--run` (default)
`--no-confirmation`	Skip the `--run` confirmation prompt (useful for scripting)
`--w1 W1`	Accuracy weight 0–1 (default: 0.7)
`--w2 W2`	Speed weight 0–1 (default: 0.3)
`--no-speed`	Ignore timing data; pure accuracy scoring (sets w1=1.0, w2=0.0)
`--n-claims N`	Expected scorable claims per domain (default: from benchmark set config, or 10)
`--domain`	Show per-domain fact-score breakdown table
`--cache`	Enable API cache for `--run` (default: enabled)
`--no-cache`	Disable API cache for `--run`
`--timeout TIMEOUT`	Per-job timeout in seconds for `--run` (default: 300 = 5 min)
`-v`, `--verbose`	Verbose output
`-q`, `--quiet`	Minimal output

Historical tracking

Option	Description
`--history`	Display full snapshot history (composite score, speed ratio, accuracy over time)
`--set-baseline`	Lock the current run's timing as the speed baseline in the benchmark set config
`--record-snapshot`	Append current leaderboard scores as a named snapshot
`--snapshot-label LABEL`	Human-readable label for `--record-snapshot` (default: today's ISO date)

AI content generation

Option	Description
`--ai-title`	Generate a ≤10-word leaderboard title → stdout
`--ai-short`	Generate a 40–80-word leaderboard summary → stdout
`--ai-caption`	Generate a 100–160-word two-paragraph caption → stdout
`--ai-summary`	Generate a 120–200-word summary → stdout
`--ai-story`	Generate an 800–1200-word narrative → stdout
`--agent NAME`	AI provider for content generation (default: `xai`)

For developers

Score formula: w1 × (fact_score / max_fact_score) + w2 × (speed_score / max_speed_score) with defaults w1=0.7, w2=0.3. The locked benchmark set is cross_stones/cross-stones-10.json. Pass --no-speed for accuracy-only scoring. --set-baseline must be run once after the first complete benchmark; subsequent faster runs will have speed_ratio > 1.0 and cross_stone_score may exceed 1.0, indicating genuine improvement.

st stones

st-stones — Run and score the Cross-Stones benchmark

Examples

Options

Historical tracking

AI content generation

For developers

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally