v7.7.8 — gauntlet detection-bar v2: D3 length-control bar fixes the artifact Baseline-007 exposed
styxx 7.7.8 — gauntlet D3 length-control bar
The fix
The gauntlet's detection bars (D1, D2) were length-gameable. Baseline-007, a 30-line token-overlap heuristic submitted as a sanity check, accidentally hit PASS=true under v1 bars (D1=0.864, D2=0.922). Investigation showed that the benchmark's expected_consensus field is length-confounded by class: truth responses average 3.9 words ("Paris", "206"), while folklore responses average 7.5 words (full council restatements). A detector that measures response length alone scores AUC=0.79–0.80 across the bars, so any submission could game them by exploiting this artifact.
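To make the confound concrete, here is a minimal sketch of how a length-only score can separate the two classes. The responses are hypothetical toy strings, not benchmark data, and the `auc` and `word_count` helpers are illustrative names rather than anything in styxx. AUC is computed as the pairwise win rate (ties count half), which is the standard rank-based definition.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, matching the rank-based definition of AUC.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def word_count(text):
    """Length-only 'detector': the score is just the number of words."""
    return len(text.split())


# Hypothetical toy responses (assumption: NOT taken from the benchmark).
truth = ["Paris", "206", "About eight minutes", "The Pacific Ocean is the largest"]
folklore = [
    "The council agrees that we only use ten percent of our brains",
    "Goldfish have a three second memory",
    "Lightning never strikes twice",
    "Bulls are enraged by the colour red",
]

# Treat folklore as the positive class and score every response by length alone.
length_auc = auc(
    [word_count(t) for t in folklore],
    [word_count(t) for t in truth],
)
print(f"length-only AUC on toy data: {length_auc:.3f}")
```

A score well above 0.5 here says nothing about content; it only reflects that the positive class tends to be longer, which is the artifact described above.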
What's added
D3_length_control_delta ≥ 0.10 — the new third detection-task bar. A real detector must beat the length-only oracle's AUC by at least 0.10 on both the misconception-vs-truth (D1 partition) AND folklore-vs-truth (D2 partition) splits.
Plus:
- `styxx.gauntlet._length_oracle_detect`: the length-only oracle that defines the D3 floor.
- 4 new inline metrics: `length_oracle_*_AUC` and `D1/D2_minus_length_AUC` deltas.
- Regression test `test_length_oracle_passes_D1_D2_but_fails_D3`: catches any future attempt to weaken or remove D3.
- Tests updated for 3-bar detection task; full suite 1084 passed.
Baseline-007 re-scored under v2 bars
| metric | value | bar |
|---|---|---|
| D1 misconception AUC | 0.864 | ≥0.70 ✓ |
| D2 folklore AUC | 0.922 | ≥0.70 ✓ |
| D1 − length AUC | 0.074 | ≥0.10 ✗ |
| D2 − length AUC | 0.117 | ≥0.10 ✓ |
| D3 length-control | — | FAIL |
| Overall | — | 2 / 3 (NOT a PASS) |
The first "PASS" on detection gets correctly downgraded to 2/3.
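The re-scoring above can be reproduced with a short sketch of the three-bar rule. This is an illustration under stated assumptions, not the gauntlet's actual implementation: `DetectionScores` and `score_detection` are hypothetical names, the thresholds are copied from the table, and the oracle's own AUCs are approximated from the 0.79–0.80 range quoted earlier.

```python
from dataclasses import dataclass

AUC_BAR = 0.70    # D1 and D2 threshold, as listed in the table
DELTA_BAR = 0.10  # D3 threshold: required margin over the length-only oracle


@dataclass
class DetectionScores:
    """Hypothetical container for the four numbers the bars depend on."""
    d1_auc: float
    d2_auc: float
    d1_minus_length: float
    d2_minus_length: float


def score_detection(s):
    """Apply the three detection bars; D3 needs BOTH deltas to clear the margin."""
    d1 = s.d1_auc >= AUC_BAR
    d2 = s.d2_auc >= AUC_BAR
    d3 = s.d1_minus_length >= DELTA_BAR and s.d2_minus_length >= DELTA_BAR
    passed = sum([d1, d2, d3])
    return {"D1": d1, "D2": d2, "D3": d3, "passed": passed, "PASS": passed == 3}


# Baseline-007, using the values from the table.
baseline_007 = DetectionScores(0.864, 0.922, 0.074, 0.117)
result = score_detection(baseline_007)

# The length oracle scored against itself has a delta of exactly 0 on both splits.
# Its AUCs here are approximations (assumption) from the quoted 0.79-0.80 range.
length_oracle = DetectionScores(0.79, 0.80, 0.0, 0.0)
oracle_result = score_detection(length_oracle)

print(result)
print(oracle_result)
```

Because D3 requires both deltas to clear the margin, Baseline-007's D2 delta of 0.117 does not rescue its D1 delta of 0.074, and the oracle can never clear a bar defined relative to itself.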
Why this matters
This is the sixth in-session falsification today. A real submission, run through the new gauntlet infrastructure, exposed a genuine validity weakness in both the bars and the benchmark; the discipline pattern caught it, and the fix shipped in the next patch. The system caught its own flaw within the same session.
The discovery chain — Baseline-007 unexpected PASS → diagnosed length artifact → D3 bar definition → tests + regression guard → patch release — is exactly the property the gauntlet infrastructure was built to enable. Every real submission either confirms the floor (compounds the credibility) or surfaces a fix (improves the infrastructure). Both outcomes benefit the project.
Why patch (not minor)
Additive (new D3 bar; D1 and D2 unchanged). External submissions that PASSed D1+D2 alone are re-scored as 2/3 under v2, not retroactively invalidated. Classification bars (K1/K2/K3) are unaffected. No public API breakage.
Install
pip install -U styxx==7.7.8
Verify
pip install styxx==7.7.8
styxx gauntlet --method styxx.gauntlet:_length_oracle_detect --task detection
# expect: 2 / 3 (D1 + D2 pass on length-confound; D3 correctly fails with delta=0)