v7.7.5 — styxx gauntlet: the empirical floor as a public challenge
styxx 7.7.5 — styxx gauntlet: the empirical floor as a public challenge with deployable tooling
The seven-method empirical floor shipped in 7.7.3 is now a runnable public challenge. Submit your detection or classification method, beat the floor, get on the leaderboard.
Added
- `styxx gauntlet --method <module:attr>`: public-challenge runner. Loads any user-supplied detection or classification method, runs it against the labeled benchmark, and scores it against pre-registered bars.
- `styxx.gauntlet` module: programmatic API for use in scripts and CI.
- `LEADERBOARD.md`: public leaderboard with the seven-method floor as Baseline-001. Includes the full submission protocol, locked bars, sanity submissions (majority + zero detector), an honest scope statement, and a citation block.
- 20 new tests covering metric primitives, baseline failures, perfect-oracle upper bounds (validating that the bars are beatable given perfect signal), error handling, and end-to-end CLI invocation. Full suite: 1078 passed, 8 skipped.
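The `--method <module:attr>` argument follows the common `module:attr` convention for pointing at a Python callable. The sketch below shows one way such a spec can be resolved with the standard library. It illustrates the convention only and is not styxx's actual loader; `resolve_method` is a hypothetical name.

```python
import importlib
from typing import Callable


def resolve_method(spec: str) -> Callable:
    """Resolve a 'module:attr' string to a callable (hypothetical helper)."""
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:attr', got {spec!r}")
    obj = importlib.import_module(module_name)
    # Walk dotted attributes so 'pkg.mod:Class.method' also resolves.
    for name in attr_path.split("."):
        obj = getattr(obj, name)
    if not callable(obj):
        raise TypeError(f"{spec!r} does not point at a callable")
    return obj


if __name__ == "__main__":
    # Standard-library target used purely as a stand-in for a user method.
    sqrt = resolve_method("math:sqrt")
    print(sqrt(9.0))
```

Failing early on a malformed spec or a non-callable target keeps the error close to the command line rather than deep inside a benchmark run.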
The frame
The seven-method floor we shipped is the public bar. Our claim is limited: we could not beat it with the seven methods we tested. The gauntlet invites the field to try. If a submission beats it, the synthesis gets revised; if none does, the evidence for the floor accumulates with each submission.
Two task modes
| task | method signature | bars (PASS = all) |
|---|---|---|
| classification | `predict(question: str) -> {"class": ...}` | K1 folklore F1 ≥ 0.70, K2 accuracy ≥ 0.65, K3 cross-corpus F1 ≥ 0.60 |
| detection | `detect(question: str, response: str) -> {"score": float}` | D1 misconception AUC ≥ 0.70, D2 folklore AUC ≥ 0.70 |
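To make "PASS = all" concrete, here is a minimal sketch of the pass rule using the thresholds from the table above. The metric key names, the `roc_auc` and `verdict` helpers, and the convention that label `1` marks the positive class are assumptions for illustration; the real scoring lives in `styxx.gauntlet` and may differ.

```python
from typing import Dict, List

# Thresholds copied from the table; the key names are assumptions.
BARS: Dict[str, Dict[str, float]] = {
    "classification": {
        "K1_folklore_f1": 0.70,
        "K2_accuracy": 0.65,
        "K3_cross_corpus_f1": 0.60,
    },
    "detection": {
        "D1_misconception_auc": 0.70,
        "D2_folklore_auc": 0.70,
    },
}


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the chance a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def verdict(task: str, metrics: Dict[str, float]) -> str:
    """Return 'PASS' only if every bar for the task is met, else 'FAIL'."""
    bars = BARS[task]
    ok = all(metrics[name] >= bar for name, bar in bars.items())
    return "PASS" if ok else "FAIL"
```

Under this definition a detector that returns the same score for every input lands at an AUC of exactly 0.5, which is why a constant-output submission cannot clear a 0.70 bar.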
Install
```
pip install -U "styxx[mcp,nli]==7.7.5"
```
Submit
See LEADERBOARD.md for the submission protocol.
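For orientation, a submission module might look roughly like the sketch below. Only the two function signatures come from the task table; the file name `my_method.py`, the class labels, and the keyword heuristics are hypothetical placeholders, not a method expected to clear any bar.

```python
# my_method.py (hypothetical file name)
from typing import Dict

# Hypothetical labels: the real label set is defined by the benchmark.
FOLKLORE, OTHER = "folklore", "other"
CUES = ("they say", "everyone knows", "old wives", "is it true that")
ABSOLUTES = {"always", "never", "definitely"}


def predict(question: str) -> Dict[str, str]:
    """Classification signature: question -> {"class": ...}."""
    text = question.lower()
    label = FOLKLORE if any(cue in text for cue in CUES) else OTHER
    return {"class": label}


def detect(question: str, response: str) -> Dict[str, float]:
    """Detection signature: (question, response) -> {"score": float}."""
    words = response.lower().split()
    if not words:
        return {"score": 0.0}
    # Toy signal: share of absolute-sounding words, scaled and capped at 1.0.
    hits = sum(w.strip(".,!?") in ABSOLUTES for w in words)
    return {"score": min(1.0, hits / len(words) * 10)}
```

Assuming the file is importable as `my_method`, the runner would be pointed at it with `styxx gauntlet --method my_method:predict`. How the task mode is selected and what else a submission must include are covered by the protocol in LEADERBOARD.md.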
🤖 Generated with Claude Code