fix(evals): unstale 3 release-gate eval hard-fails behind legit refactors (soc-2gd6 #eval-hard-fails) by boshu2 · Pull Request #402 · boshu2/agentops

boshu2 · 2026-05-22T16:05:45Z

Why

The v2.42.0 release gate (scripts/ci-local-release.sh) was red on 8 evals. The 3 score-0/near-0 hard fails are all eval-staleness behind legitimate recent refactors — verified, not gaming or security weakening. Operator decision: update eval to match source of truth (executable > contract).

| Eval | Was | Cause | Fix |
|---|---|---|---|
| `hook-manifest-command-counts` | 0 | `session-pr-counter.sh` (PR #362) is the legit 37th hook script; eval hardcoded 43/36 | bump expected counts 43→44, 36→37 |
| `push-worktree landing-plane` | 0.14 | #387 tiered-AGENTS split moved "Landing the Plane" to `AGENTS-WORKFLOW.md` (+ dropped 2 lines) | redirect eval target `AGENTS.md`→`AGENTS-WORKFLOW.md` + restore the 2 dropped policy lines |
| `security-toolchain ci-soft-gate-policy` | 0 | gate is intentionally HARD (no `continue-on-error`); job already runs `security-gate.sh --mode quick` + uploads artifacts | drop the stale `continue-on-error` requirement (security stays HARD) |

Security note: security-toolchain-gate stays a HARD blocking gate. Only the stale "soft gate" assertion was removed from the eval; the actual scan + artifact upload + summary-blocking are unchanged.
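
To illustrate what "stays HARD" means in practice, here is a minimal sketch of a policy check in the spirit of the smoke fixture. The function name `check_hard_gate` and the failure messages are hypothetical, and the real fixture may check more, or scope its checks to a single job rather than the whole workflow file. Only the `security-toolchain-ci-policy-ok` marker and the `security-gate.sh --mode quick` invocation come from this PR.

```shell
#!/usr/bin/env bash
# Sketch only: assert that a CI workflow file keeps the security gate HARD.
# check_hard_gate is a hypothetical helper, not the repo's actual fixture.
# Usage: check_hard_gate <workflow-file>
check_hard_gate() {
  local workflow="$1"

  # A hard gate must not be softened with continue-on-error: true.
  # (Assumption: a file-wide check; the real eval may be job-scoped.)
  if grep -Eq '^[[:space:]]*continue-on-error:[[:space:]]*true' "$workflow"; then
    echo "security-gate-softened"
    return 1
  fi

  # The job must still invoke the quick scan.
  if ! grep -Fq 'security-gate.sh --mode quick' "$workflow"; then
    echo "security-gate-scan-missing"
    return 1
  fi

  echo "security-toolchain-ci-policy-ok"
}
```

A check shaped like this fails if someone softens the gate, which is the opposite of the stale assertion this PR removes: the old eval required `continue-on-error` to be present.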

How tested

- hook-manifest jq → hook-manifest-counts-ok
- security smoke ci-policy → security-toolchain-ci-policy-ok
- all 7 landing-plane strings present in AGENTS-WORKFLOW.md
- shellcheck clean on edited smoke
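
The landing-plane check above amounts to a fixed-string presence test, roughly as sketched below. `check_strings_present` is a hypothetical helper, and the 7 actual policy strings are not listed in this PR, so the usage line shows only the section title plus a placeholder.

```shell
#!/usr/bin/env bash
# Sketch only: verify that every required string appears in a target file.
# check_strings_present is a hypothetical helper, not the repo's actual eval.
# Usage: check_strings_present <file> <string>...
check_strings_present() {
  local file="$1"
  shift
  local missing=0 needle

  for needle in "$@"; do
    # -F: match as a fixed string, -q: quiet,
    # --: guard against needles that begin with a dash.
    if ! grep -Fq -- "$needle" "$file"; then
      echo "missing: $needle"
      missing=$((missing + 1))
    fi
  done

  # Succeed only when nothing was missing.
  [ "$missing" -eq 0 ]
}

# Example (placeholder second string; the real 7 strings live in the eval):
#   check_strings_present AGENTS-WORKFLOW.md "Landing the Plane" "<policy line>"
```

Reporting every missing string instead of stopping at the first makes a partial move, like the 2 dropped lines from #387, visible in a single run.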

Scope honesty

This fixes the 3 hard fails only. The release gate still has 5 minor evals (0.71–0.99) + the vil/release-smoke lane — a separate remediation, deliberately NOT in this PR (no green-washing).

Sibling pattern: same "update eval to match legitimately-changed source of truth" move as the cli-command-surface canary bumps in #396/#397.

Fitness: release-gate eval hard-fails 3 → 0.

Closes-scenario: soc-2gd6#eval-hard-fails
Bounded-context: BC4-Validation
Evidence: evals/agentops-core/fixtures/security-toolchain-governance-smoke.sh

…tors (soc-2gd6)

The v2.42.0 release gate was red on 3 score-0/near-0 evals. All three are eval-staleness behind legitimate recent changes: verified, NOT gaming or security weakening (operator chose "update eval to match source of truth"):

| Eval | Was | Cause | Fix |
|---|---|---|---|
| hook-manifest-command-counts | 0 | session-pr-counter.sh (PR #362) is the legit 37th hook script; eval hardcoded 43/36 | bump expected counts 43→44, 36→37 |
| push-worktree landing-plane | 0.14 | #387 tiered-AGENTS split moved the "Landing the Plane" section to AGENTS-WORKFLOW.md (and dropped 2 lines) | redirect eval target AGENTS.md→AGENTS-WORKFLOW.md + restore the 2 dropped policy lines |
| security-toolchain ci-soft-gate-policy | 0 | the gate is intentionally HARD (no continue-on-error); the job already runs security-gate.sh --mode quick + uploads artifacts | drop the stale continue-on-error requirement from the eval (security stays HARD) |

Security note: the security-toolchain-gate stays a HARD blocking gate. The only eval bit removed was the stale "soft gate" assertion; the actual scan (security-gate.sh --mode quick) + artifact upload + summary-blocking are unchanged.

How tested:
- hook-manifest jq check → hook-manifest-counts-ok
- security smoke ci-policy → security-toolchain-ci-policy-ok
- all 7 landing-plane strings present in AGENTS-WORKFLOW.md
- shellcheck clean on the edited smoke

Sibling pattern: same "update eval to match legitimately-changed source of truth" move as the cli-command-surface canary bumps in #396/#397.

Fitness: release-gate eval hard-fails 3 → 0. (5 minor evals 0.71-0.99 + the vil lane remain: separate remediation, NOT in this PR.)

Closes-scenario: soc-2gd6#eval-hard-fails
Bounded-context: BC4-Validation
Evidence: evals/agentops-core/fixtures/security-toolchain-governance-smoke.sh

github-actions Bot added the docs label May 22, 2026

boshu2 merged commit ce9ec94 into main May 22, 2026
71 checks passed

boshu2 deleted the fix/eval-lane-hard-fails-soc-2gd6 branch May 22, 2026 16:12
