fix(evals): unstale 3 release-gate eval hard-fails behind legit refactors (soc-2gd6 #eval-hard-fails) #402
Merged
…tors (soc-2gd6)

The v2.42.0 release gate was red on 3 score-0/near-0 evals. All three are eval-staleness behind legitimate recent changes: verified, NOT gaming or security weakening (operator chose "update eval to match source of truth").

| Eval | Was | Cause | Fix |
|---|---|---|---|
| hook-manifest-command-counts | 0 | session-pr-counter.sh (PR #362) is the legit 37th hook script; eval hardcoded 43/36 | bump expected counts 43→44, 36→37 |
| push-worktree landing-plane | 0.14 | #387 tiered-AGENTS split moved the "Landing the Plane" section to AGENTS-WORKFLOW.md (and dropped 2 lines) | redirect eval target AGENTS.md→AGENTS-WORKFLOW.md + restore the 2 dropped policy lines |
| security-toolchain ci-soft-gate-policy | 0 | the gate is intentionally HARD (no continue-on-error); the job already runs security-gate.sh --mode quick + uploads artifacts | drop the stale continue-on-error requirement from the eval (security stays HARD) |

Security note: the security-toolchain-gate stays a HARD blocking gate. The only eval bit removed was the stale "soft gate" assertion; the actual scan (security-gate.sh --mode quick) + artifact upload + summary-blocking are unchanged.

How tested:
- hook-manifest jq check → hook-manifest-counts-ok
- security smoke ci-policy → security-toolchain-ci-policy-ok
- all 7 landing-plane strings present in AGENTS-WORKFLOW.md
- shellcheck clean on the edited smoke

Sibling pattern: same "update eval to match legitimately-changed source of truth" move as the cli-command-surface canary bumps in #396/#397.

Fitness: release-gate eval hard-fails 3 → 0. (5 minor evals 0.71-0.99 + the vil lane remain; separate remediation, NOT in this PR.)

Closes-scenario: soc-2gd6#eval-hard-fails
Bounded-context: BC4-Validation
Evidence: evals/agentops-core/fixtures/security-toolchain-governance-smoke.sh
Why
The v2.42.0 release gate (`scripts/ci-local-release.sh`) was red on 8 evals. The 3 score-0/near-0 hard fails are all eval-staleness behind legitimate recent refactors: verified, not gaming or security weakening. Operator decision: update eval to match source of truth (executable > contract).

| Eval | Was | Cause | Fix |
|---|---|---|---|
| `hook-manifest-command-counts` | 0 | `session-pr-counter.sh` (PR #362) is the legit 37th hook script; eval hardcoded 43/36 | bump expected counts 43→44, 36→37 |
| `push-worktree landing-plane` | 0.14 | #387 tiered-AGENTS split moved the "Landing the Plane" section to `AGENTS-WORKFLOW.md` (+ dropped 2 lines) | redirect eval target `AGENTS.md`→`AGENTS-WORKFLOW.md` + restore the 2 dropped policy lines |
| `security-toolchain ci-soft-gate-policy` | 0 | the gate is intentionally HARD (no `continue-on-error`); job already runs `security-gate.sh --mode quick` + uploads artifacts | drop the stale `continue-on-error` requirement (security stays HARD) |

Security note: `security-toolchain-gate` stays a HARD blocking gate. Only the stale "soft gate" assertion was removed from the eval; the actual scan + artifact upload + summary-blocking are unchanged.

How tested

- hook-manifest jq check → `hook-manifest-counts-ok`
- security smoke `ci-policy` → `security-toolchain-ci-policy-ok`
- all 7 landing-plane strings present in `AGENTS-WORKFLOW.md`
- shellcheck clean on the edited smoke

Scope honesty
This fixes the 3 hard fails only. The release gate still has 5 minor evals (0.71–0.99) + the vil/release-smoke lane — a separate remediation, deliberately NOT in this PR (no green-washing).
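For reviewers who want to picture the landing-plane check listed under "How tested", here is a minimal sketch of a required-strings assertion. It is not the eval's actual code: the function name `check_required_strings`, the `landing-plane-strings-ok` token, and the demo strings are hypothetical stand-ins (the real eval checks 7 strings that are not listed in this PR).

```shell
#!/usr/bin/env bash
# Sketch only: assert that every required policy string appears in a target doc.
# check_required_strings and the *-ok token are hypothetical, not the real eval.

check_required_strings() {
  local target="$1"; shift
  local missing=0
  local needle
  for needle in "$@"; do
    # -F: match as a fixed string; -q: quiet; '--' guards needles starting with '-'
    if ! grep -Fq -- "$needle" "$target"; then
      printf 'missing: %s\n' "$needle" >&2
      missing=$((missing + 1))
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "landing-plane-strings-ok"
  else
    echo "landing-plane-strings-missing=$missing"
    return 1
  fi
}

# Demo against a throwaway file standing in for AGENTS-WORKFLOW.md.
demo_doc="$(mktemp)"
cat > "$demo_doc" <<'EOF'
## Landing the Plane
- Push your branch before ending the session.
- Never leave work unpushed.
EOF

check_required_strings "$demo_doc" \
  "Landing the Plane" \
  "Never leave work unpushed"
```

Because the target file is an argument, redirecting the check from `AGENTS.md` to `AGENTS-WORKFLOW.md` is a one-argument change in a check of this shape, which is the kind of retargeting this PR describes.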
Sibling pattern: same "update eval to match legitimately-changed source of truth" move as the cli-command-surface canary bumps in #396/#397.
Fitness: release-gate eval hard-fails 3 → 0.
Closes-scenario: soc-2gd6#eval-hard-fails
Bounded-context: BC4-Validation
Evidence: evals/agentops-core/fixtures/security-toolchain-governance-smoke.sh
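As a rough illustration of the hard-gate policy the smoke script now asserts, the sketch below fails if a workflow file opts out of failure via `continue-on-error: true` or no longer runs `security-gate.sh --mode quick`. This is an assumption-laden sketch, not the contents of the Evidence file: the function name `check_hard_gate`, the two failure tokens, the `scripts/` path, and the demo workflow layout are hypothetical; only `security-toolchain-ci-policy-ok` comes from this PR. It is a plain text scan of the whole file, not a job-scoped YAML parse.

```shell
#!/usr/bin/env bash
# Sketch only: assert that a CI job stays a HARD (blocking) gate.
# check_hard_gate and the failure tokens are hypothetical names.

check_hard_gate() {
  local workflow="$1"
  # A hard gate must not opt out of failure anywhere in the file.
  if grep -Eq '^[[:space:]]*continue-on-error:[[:space:]]*true' "$workflow"; then
    echo "security-toolchain-ci-policy-soft"
    return 1
  fi
  # The job must still run the quick scan.
  if ! grep -Fq -- 'security-gate.sh --mode quick' "$workflow"; then
    echo "security-toolchain-ci-policy-no-scan"
    return 1
  fi
  echo "security-toolchain-ci-policy-ok"
}

# Demo against a throwaway workflow file (layout and script path are illustrative).
demo_workflow="$(mktemp)"
cat > "$demo_workflow" <<'EOF'
jobs:
  security-toolchain-gate:
    runs-on: ubuntu-latest
    steps:
      - run: scripts/security-gate.sh --mode quick
      - uses: actions/upload-artifact@v4
EOF

check_hard_gate "$demo_workflow"
rm -f "$demo_workflow"
```

Note the direction of the assertion: the stale eval required `continue-on-error` to be present, whereas a check of this shape treats its presence as the failure.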