v7.7.8 — gauntlet detection-bar v2: D3 length-control bar fixes the artifact Baseline-007 exposed

@github-actions github-actions released this 27 May 21:21
· 474 commits to main since this release

styxx 7.7.8 — gauntlet D3 length-control bar

The fix

The gauntlet's detection bars (D1, D2) were length-gameable. Baseline-007, a 30-line token-overlap heuristic submitted as a sanity check, unexpectedly scored PASS=true under the v1 bars (D1=0.864, D2=0.922). Investigation showed that the benchmark's expected_consensus field is length-confounded by class: truth responses average 3.9 words ("Paris", "206"), while folklore responses average 7.5 words (full council restatements). A detector that measures response length alone scores AUC=0.79–0.80 on these bars, which clears the 0.70 threshold, so any submission could pass D1 and D2 by exploiting this artifact rather than by detecting anything.
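To make the length confound concrete, here is a minimal sketch of how a length-only score can produce a high AUC. It is not the styxx implementation: the `auc` and `word_count` helpers and the toy responses are invented for illustration, and treating folklore as the positive class is an assumption.

```python
def auc(scores_pos, scores_neg):
    """Pairwise AUC: chance that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def word_count(text):
    """The only 'feature' the length oracle looks at."""
    return len(text.split())


# Toy responses, NOT taken from the benchmark: truth is short, folklore is long.
truth = ["Paris", "206", "About 8 minutes", "Mount Everest"]
folklore = [
    "Humans only use ten percent of their brains",
    "Lightning never strikes the same place twice",
    "Goldfish have a three second memory",
    "No",
]

# Assumption: folklore is the positive class, so longer means "more suspicious".
length_auc = auc(
    [word_count(t) for t in folklore],
    [word_count(t) for t in truth],
)
print(f"length-only AUC on toy data: {length_auc:.4f}")
```

On this toy data the length-only score separates the classes well despite knowing nothing about content, which is the shape of the artifact described above.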

What's added

D3_length_control_delta ≥ 0.10 — the new third detection-task bar. A real detector must beat the length-only oracle's AUC by at least 0.10 on both the misconception-vs-truth (D1 partition) AND folklore-vs-truth (D2 partition) splits.
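A minimal sketch of the D3 rule as described here, assuming per-partition AUCs are already computed. The function name `d3_length_control` and the dict layout are hypothetical and are not the styxx API. Only the 0.10 threshold and the "both partitions" requirement come from these notes.

```python
D3_MIN_DELTA = 0.10  # threshold stated in these notes


def d3_length_control(detector_auc, length_oracle_auc, min_delta=D3_MIN_DELTA):
    """Hypothetical helper: both dicts are keyed by partition name ("D1", "D2").

    Returns the per-partition deltas and whether D3 passes, i.e. whether the
    detector beats the length-only oracle by min_delta on BOTH partitions.
    """
    deltas = {
        part: detector_auc[part] - length_oracle_auc[part]
        for part in ("D1", "D2")
    }
    passed = all(delta >= min_delta for delta in deltas.values())
    return deltas, passed


# Illustrative oracle AUCs in the 0.79-0.80 range quoted in "The fix".
oracle = {"D1": 0.79, "D2": 0.80}

# The oracle scored against itself has delta 0 on both partitions, so it fails D3.
oracle_deltas, oracle_passed = d3_length_control(oracle, oracle)

# A made-up detector that clearly beats the oracle on both partitions passes.
strong_deltas, strong_passed = d3_length_control({"D1": 0.95, "D2": 0.95}, oracle)
```

Requiring the margin on both partitions means a detector cannot pass D3 by beating length on one partition only.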

Plus:

  • styxx.gauntlet._length_oracle_detect — the length-only oracle that defines the D3 floor.
  • 4 new inline metrics: length_oracle_*_AUC and D1/D2_minus_length_AUC deltas.
  • Regression test test_length_oracle_passes_D1_D2_but_fails_D3 — catches any future attempt to weaken or remove D3.
  • Tests updated for 3-bar detection task; full suite 1084 passed.

Baseline-007 re-scored under v2 bars

| metric | value | bar |
|---|---|---|
| D1 misconception AUC | 0.864 | ≥0.70 ✓ |
| D2 folklore AUC | 0.922 | ≥0.70 ✓ |
| D1 − length AUC | 0.074 | ≥0.10 ✗ |
| D2 − length AUC | 0.117 | ≥0.10 ✓ |
| D3 length-control | FAIL | |
| Overall | 2 / 3 (NOT a PASS) | |
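The re-scoring above can be checked by hand. The sketch below uses only the numbers in the table. The variable names are illustrative, and the length-only AUCs are derived by subtraction rather than quoted from the release.

```python
# Values copied from the Baseline-007 table.
d1_auc, d2_auc = 0.864, 0.922
d1_minus_length, d2_minus_length = 0.074, 0.117

# Implied length-only AUCs (derived by subtraction, not reported directly).
implied_length_d1 = round(d1_auc - d1_minus_length, 3)
implied_length_d2 = round(d2_auc - d2_minus_length, 3)

# The three detection bars as stated in these notes.
results = {
    "D1": d1_auc >= 0.70,
    "D2": d2_auc >= 0.70,
    # D3 needs the margin on both partitions, so the smaller delta decides.
    "D3": min(d1_minus_length, d2_minus_length) >= 0.10,
}

bars_passed = sum(results.values())
overall_pass = bars_passed == len(results)
print(f"{bars_passed} / {len(results)} (PASS={overall_pass})")
```

The D1 delta of 0.074 is the one that sinks D3. The D2 delta alone would have cleared the 0.10 margin.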

The first "PASS" on detection gets correctly downgraded to 2/3.

Why this matters

This is the sixth in-session falsification today. A real submission, run through the new gauntlet infrastructure, exposed a real validity weakness in the bars and the benchmark; the discipline pattern caught it, and the fix shipped in the next patch, all within the same session.

The discovery chain — Baseline-007 unexpected PASS → diagnosed length artifact → D3 bar definition → tests + regression guard → patch release — is exactly the property the gauntlet infrastructure was built to enable. Every real submission either confirms the floor (compounds the credibility) or surfaces a fix (improves the infrastructure). Both outcomes benefit the project.

Why patch (not minor)

The change is additive: D3 is a new bar, and D1 and D2 are unchanged. External submissions that passed D1 and D2 alone are re-scored as 2/3 under v2 rather than retroactively invalidated. Classification bars (K1/K2/K3) are unaffected, and there is no public API breakage.

Install

pip install -U styxx==7.7.8

Verify

pip install styxx==7.7.8
styxx gauntlet --method styxx.gauntlet:_length_oracle_detect --task detection
# expect: 2 / 3 (D1 + D2 pass on length-confound; D3 correctly fails with delta=0)

🤖 Generated with Claude Code