v7.7.5 — styxx gauntlet: the empirical floor as a public challenge

github-actions released this 27 May 19:57

styxx 7.7.5 — styxx gauntlet: the empirical floor as a public challenge with deployable tooling

The seven-method empirical floor shipped in 7.7.3 is now a runnable public challenge. Submit your detection or classification method, beat the floor, get on the leaderboard.

Added

  • styxx gauntlet --method <module:attr> — public-challenge runner. Loads any user-supplied detection or classification method, runs it against the labeled benchmark, scores it against pre-registered bars.
  • styxx.gauntlet module — programmatic API for use in scripts and CI.
  • LEADERBOARD.md — public leaderboard with the seven-method floor as Baseline-001. Full submission protocol, locked bars, sanity submissions (majority + zero detector), honest scope statement, citation block.
  • 20 new tests covering metric primitives, baseline failures, perfect-oracle upper bounds (validates the bars are beatable when given perfect signal), error handling, end-to-end CLI invocation. Full suite: 1078 passed, 8 skipped.
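
The `--method <module:attr>` argument names a Python module and an attribute inside it. As a rough illustration of how such a spec can be resolved with the standard library, here is a minimal sketch. It assumes the common `module:attr` convention; `load_method` is a hypothetical helper written for this note, not the styxx implementation.

```python
import importlib
from typing import Callable


def load_method(spec: str) -> Callable:
    """Resolve a 'module:attr' string to a callable (hypothetical helper)."""
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:attr', got {spec!r}")
    obj = importlib.import_module(module_name)
    # Walk dotted attribute paths such as 'pkg.mod:Class.method'.
    for part in attr_path.split("."):
        obj = getattr(obj, part)
    if not callable(obj):
        raise TypeError(f"{spec!r} does not resolve to a callable")
    return obj


if __name__ == "__main__":
    # Demonstrate with a standard-library function rather than a real submission.
    dumps = load_method("json:dumps")
    print(dumps({"score": 0.5}))
```

Failing early on a malformed spec or a non-callable attribute keeps a bad submission from surfacing as a confusing error partway through a benchmark run.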

The frame

The seven-method floor we shipped IS the public bar. We assert only that we couldn't beat it with the seven methods we tested. The gauntlet invites the field to try. If a submission beats it, the synthesis gets revised; if none does, each failed submission adds evidence that the floor holds.

Two task modes

| task | method signature | bars (PASS = all) |
| --- | --- | --- |
| classification | `predict(question: str) -> {"class": ...}` | K1 folklore F1 ≥ 0.70, K2 accuracy ≥ 0.65, K3 cross-corpus F1 ≥ 0.60 |
| detection | `detect(question: str, response: str) -> {"score": float}` | D1 misconception AUC ≥ 0.70, D2 folklore AUC ≥ 0.70 |
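
The two signatures listed above can be sketched as plain Python functions. Everything below is illustrative: the label strings `"folklore"` and `"other"`, the keyword heuristics, and the `passes` helper are assumptions made for this sketch, and the real label set and scoring are defined by the benchmark and `LEADERBOARD.md`. Only the function signatures, the return-dict keys, and the bar thresholds are taken from this release note.

```python
from typing import Dict

# Bar thresholds copied from the release note; PASS requires every bar in a task.
BARS = {
    "classification": {"K1": 0.70, "K2": 0.65, "K3": 0.60},
    "detection": {"D1": 0.70, "D2": 0.70},
}


def predict(question: str) -> Dict[str, str]:
    """Toy classifier. The labels 'folklore' and 'other' are placeholders."""
    cues = ("is it true that", "they say", "old wives")
    label = "folklore" if any(c in question.lower() for c in cues) else "other"
    return {"class": label}


def detect(question: str, response: str) -> Dict[str, float]:
    """Toy detector: score rises with the share of absolute-sounding words."""
    words = response.lower().split()
    if not words:
        return {"score": 0.0}
    absolute = {"always", "never", "definitely", "everyone", "proven"}
    hits = sum(w.strip(".,!?") in absolute for w in words)
    # Scale the hit ratio and clamp it to the range [0, 1].
    return {"score": min(1.0, hits / len(words) * 5)}


def passes(task: str, metrics: Dict[str, float]) -> bool:
    """True only if every bar for the task is met (hypothetical helper)."""
    return all(metrics.get(bar, 0.0) >= floor for bar, floor in BARS[task].items())
```

Saved as a hypothetical `my_method.py` on the import path, the detector would be handed to the runner as `styxx gauntlet --method my_method:detect`. Heuristics this crude are unlikely to clear the bars; they only show the expected input and output shapes.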

Install

```
pip install -U "styxx[mcp,nli]==7.7.5"
```

Submit

See LEADERBOARD.md for the submission protocol.

🤖 Generated with Claude Code