-
Notifications
You must be signed in to change notification settings - Fork 0
st stones
b2o2i edited this page May 10, 2026
·
4 revisions
Scores AI providers on the Cross-Stones benchmark: a fixed set of domain prompts, each requiring exactly 10 fact-checkable claims. Produces a composite accuracy + speed leaderboard.
Run after: st-cross
Related: st-domain st-speed Cross-Stones
st-stones # auto-detect ~/cross-stones/ or ./cross_stones/
st-stones cross_st/cross_stones/cross-stones-10.json # named benchmark set (locked params)
st-stones cross_st/cross_stones/ # score all domains in a directory
st-stones --run cross_st/cross_stones/cross-stones-10.json # run missing domains then score
st-stones --no-confirmation --run cross_stones/cross-stones-10.json # scripted / CI use
st-stones --no-speed cross_stones/cross-stones-10.json # accuracy-only scoring
st-stones --domain cross_stones/cross-stones-10.json # per-domain score breakdown
st-stones --init # seed ~/cross-stones/ with bundled prompts
st-stones --init --dir my_domains/ # seed to a custom directory
st-stones --set-baseline cross_stones/cross-stones-10.json # lock current timing as baseline
st-stones --record-snapshot cross_stones/cross-stones-10.json # save today's scores as snapshot
st-stones --history cross_stones/cross-stones-10.json # show year-over-year history
st-stones --ai-caption cross_stones/cross-stones-10.json # AI caption of leaderboard| Option | Description |
|---|---|
path |
Directory of benchmark .json/.prompt pairs, explicit .json paths, or a named benchmark set config. Omit to auto-detect ~/cross-stones/ or ./cross_stones/. |
--init |
Seed the benchmark domain directory with bundled .prompt files, then exit |
--dir PATH |
Destination for --init (default: ~/cross-stones/) |
--run |
Run st-cross for any domain that lacks fact-check data |
--confirmation |
Prompt before making API calls with --run (default) |
--no-confirmation |
Skip the --run confirmation prompt (useful for scripting) |
--w1 W1 |
Accuracy weight 0–1 (default: 0.7) |
--w2 W2 |
Speed weight 0–1 (default: 0.3) |
--no-speed |
Ignore timing data; pure accuracy scoring (sets w1=1.0, w2=0.0) |
--n-claims N |
Expected scorable claims per domain (default: from benchmark set config, or 10) |
--domain |
Show per-domain fact-score breakdown table |
--cache |
Enable API cache for --run (default: enabled) |
--no-cache |
Disable API cache for --run
|
--timeout TIMEOUT |
Per-job timeout in seconds for --run (default: 300 = 5 min) |
-v, --verbose
|
Verbose output |
-q, --quiet
|
Minimal output |
| Option | Description |
|---|---|
--history |
Display full snapshot history (composite score, speed ratio, accuracy over time) |
--set-baseline |
Lock the current run's timing as the speed baseline in the benchmark set config |
--record-snapshot |
Append current leaderboard scores as a named snapshot |
--snapshot-label LABEL |
Human-readable label for --record-snapshot (default: today's ISO date) |
| Option | Description |
|---|---|
--ai-title |
Generate a ≤10-word leaderboard title → stdout |
--ai-short |
Generate a 40–80-word leaderboard summary → stdout |
--ai-caption |
Generate a 100–160-word two-paragraph caption → stdout |
--ai-summary |
Generate a 120–200-word summary → stdout |
--ai-story |
Generate an 800–1200-word narrative → stdout |
--agent NAME |
AI provider for content generation (default: xai) |
Score formula: w1 × (fact_score / max_fact_score) + w2 × (speed_score / max_speed_score) with defaults w1=0.7, w2=0.3. The locked benchmark set is cross_stones/cross-stones-10.json. Pass --no-speed for accuracy-only scoring. --set-baseline must be run once after the first complete benchmark; subsequent faster runs will have speed_ratio > 1.0 and cross_stone_score may exceed 1.0, indicating genuine improvement.