Skip to content

feat(mt#1459): Add PreToolUse hook to require execution evidence before merging PRs with new tests#909

Merged
edobry merged 4 commits into
mainfrom
task/mt-1459
Apr 30, 2026
Merged

feat(mt#1459): Add PreToolUse hook to require execution evidence before merging PRs with new tests#909
edobry merged 4 commits into
mainfrom
task/mt-1459

Conversation

@minsky-ai
Copy link
Copy Markdown
Contributor

@minsky-ai minsky-ai Bot commented Apr 30, 2026

Summary

Implements mt#1459: a PreToolUse hook on mcp__minsky__session_pr_merge that blocks merge when a PR adds new test files but the PR body lacks an Execution evidence: block. This is the structural enforcement for the feedback_behavior_detecting_artifacts_need_execution_evidence rule, which failed 4-for-4 at the memory tier.

Motivation & Context

The mt#1205 workstream (2026-04-28) violated the rule three times in 8 hours, shipping three PRs where new tests were never run. The first live run found three real bugs in 326ms, including a production bug in isPgPoolExhaustionError that silently no-op'd withPgPoolRetry for ~3 days. Per the 2026-04-28 retrospective and the mitigation-tier inversion principle, repeated failures at memory tier mandate escalation to hook tier.

Design/Approach

The hook exports pure testable functions (isTestFile, findNewTestFiles, hasExecutionEvidence, hasBypassPrefix, checkExecutionEvidence) and wraps the entry point in import.meta.main so the file can be imported for testing without executing. This follows the same pattern as parallel-work-guard.ts.

Two escape hatches are supported:

  1. [unverified-tests] marker anywhere in the title — allows merge with a warning, for use when tests cannot be run yet (e.g., infrastructure not deployed) and a follow-up verification task is filed
  2. Execution evidence: heading in PR body — paste from an actual test run

The hook is fail-open: if PR metadata cannot be fetched via gh api, it allows merge with a warning rather than blocking silently.

Key Changes

  • .claude/hooks/require-execution-evidence-before-merge.ts — new PreToolUse hook
  • .claude/hooks/require-execution-evidence-before-merge.test.ts — 81 tests covering all four acceptance tests from the spec plus negative cases
  • .claude/hooks/types.tsexecWithPath helper centralized here (NON-BLOCKING chore(deps-dev): bump eslint from 8.57.1 to 9.27.0 #5)
  • .claude/hooks/parallel-work-guard.ts — updated to import execWithPath from types.ts
  • .claude/settings.json — hook registered in the existing mcp__minsky__session_pr_merge PreToolUse matcher entry

Round 1 fixes (PR #909 review)

Round 2 fixes (PR #909 round 2 review)

  • BLOCKING build(deps-dev): bump eslint from 8.57.1 to 9.26.0 #1: Raised hook timeout in .claude/settings.json from 20s to 60s. Worst-case sequential subprocess budget is deriveRepoFromGit (5s) + gh pr list (10s) + fetchPrFiles (15s) + fetchPrMeta (15s) = 45s; 60s gives margin for degraded conditions or paginated PRs. No other hooks' timeouts changed.
  • BLOCKING build(deps-dev): bump @typescript-eslint/eslint-plugin from 7.18.0 to 8.32.0 #2: Replaced brittle task/<id> branch-name PR resolution with a two-path resolvePrNumber() helper exported for testing. Primary path: gh pr view --json number --jq .number (current branch, handles forks and non-standard names naturally). Fallback: gh pr list --repo <repo> --head task/<id>. If both fail, emits a writeOutput warning (audit trail) and fails-open — does NOT silently process.exit(0). Added ExecFn injectable type so the helper is unit-testable without module mocking.
  • BLOCKING build(deps-dev): bump @typescript-eslint/parser from 7.18.0 to 8.32.0 #3: Changed fetchPrFiles return type from PrFile[] to FetchPrFilesResult { files: PrFile[], warning?: string }. API errors now produce a warning string instead of silently returning []. The warning is collected in topLevelWarnings and merged into additionalContext (allowed path) or prepended to permissionDecisionReason (blocked path), ensuring every API failure is auditable.

Round 5 fixes (PR #909 round 5 review — substantive findings only)

  • SUBSTANTIVE build(deps-dev): bump eslint from 8.57.1 to 9.26.0 #1 — cross-PR bypass marker position: hasBypassPrefix (now semantically a marker detector, not a prefix detector) changed from startsWith to includes-style regex. The sibling skill mt#1460 tells agents to put [unverified-tests] at the start of their input title, but prepare-pr composes the visible PR title as feat(mt#X): <input title>, placing the marker mid-string. The old startsWith check silently missed this common case. Bracket delimiters provide natural boundary semantics — bare unverified-tests without brackets does not match. Added tests for: tag at start, tag mid-title after conventional-commit prefix, tag at end, no tag, bare word without brackets does not match.
  • SUBSTANTIVE build(deps-dev): bump @typescript-eslint/eslint-plugin from 7.18.0 to 8.32.0 #2 — HTML comment spoofing: hasExecutionEvidence now strips HTML comments via body.replace(/<!--[\s\S]*?-->/g, "") before scanning. A marker inside <!-- ... --> is invisible in rendered Markdown and must not count as evidence. Added tests for: marker inside single-line comment returns false; multi-line comment returns false; real marker outside comment returns true; mixed (commented + real) returns true.
  • SUBSTANTIVE build(deps-dev): bump @typescript-eslint/parser from 7.18.0 to 8.32.0 #3 — renamed/copied file detection: findNewTestFiles now counts status === "renamed" or status === "copied" entries as new test files when the new filename matches a test pattern AND the previous_filename did not match a test pattern. Updated PrFile type with optional previous_filename. Updated fetchPrFiles --jq projection to include previous_filename from the GitHub API. Added tests for: rename non-test to test counts; rename test to test does not count (just relocated); copy non-test to test counts; rename non-test to non-test ignored; rename with no previous_filename conservatively included.

Non-fixes (intentionally out of scope per task instructions): mcp__github__merge_pull_request matcher, fork workflow detection, round-5 regex tightening for ##Notes execution evidence: (self-reversal), parallel-work-guard.ts mixed stdout, task/taskId alias fallback, pagination alternative.

Testing

81 unit tests cover:

  • isTestFile: pattern matching for *.test.ts, *.integration.test.ts, *.spec.ts; negative cases
  • findNewTestFiles: status: "added" test files; renamed/copied non-test→test; renamed test→test excluded; modified and deleted ignored
  • hasExecutionEvidence: case-insensitive detection; HTML comment stripping; positive and negative cases including negation phrases, empty/whitespace-only body after heading, mid-sentence occurrence
  • hasBypassPrefix: [unverified-tests] marker detection anywhere in title, case-insensitive; bracket-delimited match required; bare word without brackets does not match
  • checkExecutionEvidence: all four spec acceptance tests (AT1–AT4) plus edge cases; mid-title bypass marker fires; bare-word does not fire
  • parseGitHubRemoteUrl: SCP-style SSH, HTTPS, URL-style SSH, embedded tokens, trailing newline, null cases
  • resolvePrNumber: primary path (gh pr view), fallback path (gh pr list), both fail → warning emitted, task-derived branch used in fallback
  • FetchPrFilesResult shape: no-warning case, warning case, downstream checkExecutionEvidence behavior with empty files

Execution evidence:

bun test --preload ./tests/setup.ts --timeout=15000 ./.claude/hooks/require-execution-evidence-before-merge.test.ts

bun test v1.2.21 (7c45ed97)

.claude/hooks/require-execution-evidence-before-merge.test.ts:
(pass) isTestFile > matches *.test.ts
(pass) isTestFile > matches *.integration.test.ts
(pass) isTestFile > matches *.spec.ts
(pass) isTestFile > does not match plain .ts files
(pass) isTestFile > does not match .test.js files (only .ts)
(pass) isTestFile > does not match files that merely contain 'test' in the name
(pass) isTestFile > does not match .test.tsx
(pass) findNewTestFiles > returns added test files
(pass) findNewTestFiles > ignores modified test files (only added counts as new)
(pass) findNewTestFiles > ignores deleted test files
(pass) findNewTestFiles > returns multiple added test files
(pass) findNewTestFiles > returns empty when no files are provided
(pass) findNewTestFiles > returns empty when no test files are added
(pass) findNewTestFiles > counts renamed non-test to test file as a new test file
(pass) findNewTestFiles > does NOT count renamed test to test file (just a test relocation)
(pass) findNewTestFiles > counts copied non-test to test file as a new test file
(pass) findNewTestFiles > does NOT count renamed non-test to non-test file
(pass) findNewTestFiles > counts renamed test file with no previous_filename (conservative include)
(pass) hasExecutionEvidence > detects heading with content on next line
(pass) hasExecutionEvidence > detects lowercase variant with inline content
(pass) hasExecutionEvidence > detects mixed case with inline content
(pass) hasExecutionEvidence > detects heading with content block (code fence)
(pass) hasExecutionEvidence > returns false when heading is absent
(pass) hasExecutionEvidence > returns false for empty body
(pass) hasExecutionEvidence > returns false for negation: No Execution evidence
(pass) hasExecutionEvidence > returns false for negation in a heading
(pass) hasExecutionEvidence > returns false when heading is present but body after it is empty
(pass) hasExecutionEvidence > returns false when heading is present but body after it is only whitespace
(pass) hasExecutionEvidence > returns false when execution evidence appears only mid-sentence in prose
(pass) hasExecutionEvidence > returns false when negation uses template placeholder pattern
(pass) hasExecutionEvidence > returns false when marker is inside an HTML comment
(pass) hasExecutionEvidence > returns false when marker is inside a multi-line HTML comment
(pass) hasExecutionEvidence > returns true when real marker exists outside HTML comment
(pass) hasExecutionEvidence > returns true when real marker exists alongside a commented-out one
(pass) hasBypassPrefix > detects marker at start of title
(pass) hasBypassPrefix > detects marker in the middle after conventional-commit prefix
(pass) hasBypassPrefix > detects marker at the end of the title
(pass) hasBypassPrefix > detects uppercase variant
(pass) hasBypassPrefix > detects mixed case
(pass) hasBypassPrefix > returns false when marker is absent
(pass) hasBypassPrefix > returns false when the word unverified-tests appears without brackets
(pass) hasBypassPrefix > handles leading whitespace in title
(pass) hasBypassPrefix > returns false for empty title
(pass) checkExecutionEvidence — no test files added > allows PR with only source code changes
(pass) checkExecutionEvidence — no test files added > allows PR with only modified test files
(pass) checkExecutionEvidence — no test files added > allows empty PR files list
(pass) checkExecutionEvidence — blocks on missing evidence > blocks PR adding test file without execution evidence
(pass) checkExecutionEvidence — blocks on missing evidence > error message references the new test file
(pass) checkExecutionEvidence — blocks on missing evidence > error message contains remediation instructions
(pass) checkExecutionEvidence — blocks on missing evidence > enumerates all new test files in error message
(pass) checkExecutionEvidence — allows when evidence block present > allows PR with execution evidence in body
(pass) checkExecutionEvidence — allows when evidence block present > allows PR with lowercase heading
(pass) checkExecutionEvidence — allows when evidence block present > allows PR with multiple new test files and evidence block
(pass) checkExecutionEvidence — bypass prefix > allows merge when title has marker (no evidence in body)
(pass) checkExecutionEvidence — bypass prefix > includes a warning when bypass is used
(pass) checkExecutionEvidence — bypass prefix > allows with uppercase UNVERIFIED-TESTS
(pass) checkExecutionEvidence — bypass prefix > bypasses when marker is mid-title after conventional-commit prefix
(pass) checkExecutionEvidence — bypass prefix > does NOT bypass when unverified-tests appears without brackets
(pass) checkExecutionEvidence — integration > block message references all new test files (acceptance test 4)
(pass) checkExecutionEvidence — integration > allows PR adding only non-test files (acceptance test 3)
(pass) checkExecutionEvidence — integration > allows PR adding integration test with evidence (acceptance test 2)
(pass) checkExecutionEvidence — integration > blocks PR adding integration test without evidence (acceptance test 1)
(pass) parseGitHubRemoteUrl > parses SCP-style SSH URL
(pass) parseGitHubRemoteUrl > parses HTTPS URL with .git suffix
(pass) parseGitHubRemoteUrl > parses HTTPS URL without .git suffix
(pass) parseGitHubRemoteUrl > parses HTTPS URL with embedded token
(pass) parseGitHubRemoteUrl > parses URL-style SSH with git+ssh prefix
(pass) parseGitHubRemoteUrl > parses URL-style SSH without prefix
(pass) parseGitHubRemoteUrl > returns null for non-GitHub remote
(pass) parseGitHubRemoteUrl > returns null for empty string
(pass) parseGitHubRemoteUrl > returns null for malformed URL
(pass) parseGitHubRemoteUrl > handles trailing newline in URL
(pass) resolvePrNumber > resolves PR via gh pr view (primary path)
(pass) resolvePrNumber > falls back to gh pr list when gh pr view fails
(pass) resolvePrNumber > falls back to gh pr list when gh pr view returns non-numeric output
(pass) resolvePrNumber > returns null and emits warning when both paths fail
(pass) resolvePrNumber > returns null and emits warning when both paths return zero/empty
(pass) resolvePrNumber > uses task-derived branch in fallback path
(pass) makeProdPrDeps.fetchPrFiles > FetchPrFilesResult with no warning has only files
(pass) makeProdPrDeps.fetchPrFiles > FetchPrFilesResult with warning has empty files and warning string
(pass) makeProdPrDeps.fetchPrFiles > checkExecutionEvidence with empty files allows merge

 81 pass
 0 fail
 135 expect() calls
Ran 81 tests across 1 file. [51.00ms]

Dogfooding note

This PR adds a new test file (require-execution-evidence-before-merge.test.ts). The hook itself enforces the rule it implements: the Execution evidence: block above is the required evidence. The hook was not yet deployed when this PR was created (it activates after merge), so this is simultaneously the specification of the rule and the first instance of compliance with it.

…re merging PRs with new tests

Adds require-execution-evidence-before-merge.ts hook that intercepts
mcp__minsky__session_pr_merge and blocks merges when a PR introduces new
test files (*.test.ts, *.integration.test.ts, *.spec.ts) but the PR body
lacks an Execution evidence: block.

Two escape hatches:
- [unverified-tests] title prefix: allows merge with a warning
- Execution evidence: heading in PR body: evidence of an actual test run

Registers in .claude/settings.json alongside require-review-before-merge.ts
on the same mcp__minsky__session_pr_merge matcher.

42 tests cover all four spec acceptance tests plus unit tests for each
exported pure function (isTestFile, findNewTestFiles, hasExecutionEvidence,
hasBypassPrefix, checkExecutionEvidence).

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@minsky-ai minsky-ai Bot added the authorship/co-authored Co-authored by human and AI agent label Apr 30, 2026
Copy link
Copy Markdown

@minsky-reviewer minsky-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown


⚠️ Reviewer did not emit a conclude_review call. Event derived from severity counts: REQUEST_CHANGES (4 BLOCKING / 1 NON-BLOCKING / 0 PRE-EXISTING findings). Executive summary unavailable.

Findings

  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:255 — Hook emits multiple JSON outputs by calling writeOutput in a loop, violating the single-JSON contract
    In the top-level entrypoint, warnings are surfaced via a for-loop that calls writeOutput once per warning (see the block starting with for (const warning of result.warnings) { ... }). types.writeOutput writes a raw JSON object to stdout without a delimiter. Emitting multiple JSON objects concatenated produces invalid JSON for consumers expecting a single HookOutput object, risking parse errors or dropped outputs. Contrast with parallel-work-guard.ts, which aggregates warnings into a single additionalContext string and calls writeOutput once. Fix by aggregating warnings and emitting exactly one writeOutput (and one for a deny), or by printing warnings to stdout (plain text) and reserving writeOutput for a single structured payload.
  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:210 — Hardcoded owner/repo slug and PR resolution logic breaks on forks or non-edobry/minsky remotes
    The hook resolves the PR number via gh pr list --repo edobry/minsky --head task/${task} and then uses makeProdPrDeps with the same hardcoded owner/repo for subsequent API calls (lines ~210–245). This makes the hook non-portable to forks or when the working directory's origin is not edobry/minsky (e.g., a teammate's fork PR). Other hooks (parallel-work-guard.ts) derive the owner/repo slug from git remote get-url origin. Align with that pattern to avoid false negatives (fail-open) or querying the wrong repo. Derive repo from git and pass it through, with a fail-open warning if unavailable.
  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:167 — PR files fetch does not paginate GitHub API results — may miss added test files beyond first page
    makeProdPrDeps.fetchPrFiles calls gh api repos/${repo}/pulls/${prNumber}/files without --paginate. GitHub's PR files API is paginated (default 30 items/page). On PRs changing >30 files, added tests beyond the first page will be silently ignored, potentially allowing merges without required evidence. Use --paginate and aggregate pages, or switch to gh pr view --json files --jq .files (which returns all files) to ensure complete coverage.
  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:119 — Evidence detection regex will produce false positives (e.g., "No Execution evidence:"), allowing merges without actual evidence
    hasExecutionEvidence uses /execution evidence:/i (line ~119), which matches any occurrence of that substring, including negations like "No Execution evidence:" or template placeholders. This defeats the guard by allowing merges when the body explicitly states that there is no evidence. Tighten the parser to require a section heading (e.g., start-of-line/Markdown heading like ^##\s*Execution evidence:\s*$ followed by non-empty content), or at minimum require non-empty lines following the marker to reduce false positives. Add tests for negative phrases and empty blocks.
  • [NON-BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:37 — Duplicate exec helper instead of reusing shared execSync or established helper
    This file introduces a bespoke execWithPath wrapper (lines 37–58) while types.ts already exports execSync and parallel-work-guard.ts has a similar execWithPath with timedOut semantics. The divergence (PATH augmentation here, but not in types.execSync) creates inconsistent behavior and duplicated code paths. Consider centralizing the PATH-augmented exec helper in types.ts (or reusing the existing one if available) to reduce maintenance risk and keep timeout/ENV behavior consistent across hooks.

…ivation, pagination, regex tightening

BLOCKING #1: aggregate warnings into single additionalContext string; emit exactly one
writeOutput call per code path (deny or allow-with-warnings).

BLOCKING #2: derive owner/repo from https://github.com/edobry/minsky.git via parseGitHubRemoteUrl
and deriveRepoFromGit; no more hardcoded edobry/minsky slug; fail-open with warning if
derivation fails.

BLOCKING #3: add --paginate to gh api repos/.../pulls/.../files call; parse multi-page
output by splitting on newlines and flattening JSON arrays.

BLOCKING #4: tighten hasExecutionEvidence to require line-start anchor (prevents
mid-sentence matches) and negation guard (skips lines containing no before the marker)
and require non-empty content after the heading.

NON-BLOCKING #5: centralize execWithPath in types.ts; both hooks import from there.

Tests: add negative cases for hasExecutionEvidence (negation, empty body, whitespace-only
body); add parseGitHubRemoteUrl suite covering all URL forms and null cases. 59 pass.
Copy link
Copy Markdown

@minsky-reviewer minsky-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown


Good structural addition with solid unit tests and prior JSON-output concerns addressed, but two blocking issues remain. First, the PreToolUse timeout is set to 20s while the hook can legitimately spend up to ~40s+ across sequential gh/git calls, risking premature termination and a silent allow. Second, PR resolution assumes a task/ branch naming convention; if branches differ, the guard exits early without warning, skipping enforcement. Please raise the hook timeout (or tighten per-call budgets) and make PR number derivation robust to alternate branch names (e.g., use current HEAD or search). Once addressed, this should be ready to merge.

Findings

  • [BLOCKING] .claude/settings.json:19 — Hook timeout (20s) is lower than cumulative gh/git timeouts (10s + 15s + 15s), risking premature kill and silent allow
    The PreToolUse entry for mcp__minsky__session_pr_merge sets timeout: 20 (line 19). However, the hook performs up to three sequential subprocesses: (1) gh pr list with default execWithPath timeout 10s (types.ts: execWithPath defaults to 10000ms), (2) gh api .../pulls/.../files with timeout 15000ms (require-execution-evidence-before-merge.ts: makeProdPrDeps > fetchPrFiles), and (3) gh api .../pulls/... with timeout 15000ms (makeProdPrDeps > fetchPrMeta). In degraded conditions this can exceed 20s, causing the hook process to be terminated by the runner before it emits a single JSON output. Since the hook is fail-open (and termination yields no deny), this can silently allow merges that should be blocked. Fix by either increasing the settings.json timeout (e.g., 45–60s) or tightening per-call timeouts and short-circuiting earlier to keep worst-case under the hook budget.
  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:325 — Deriving PR number assumes branch naming task/<taskId>; mismatch will skip enforcement without warning
    At lines 325–343, the hook constructs branch = task/${task.replace("#","-")}and then queriesgh pr list --head ... --jq '.[0].number'. This silently relies on the PR’s head ref exactly matching that pattern. Any repository using different branch conventions (e.g., feature/mt-1459, forks with username:branch, or GitHub UI-created branches) will fail to resolve a PR number (parseInt('')=> NaN) and the process exits(0) without emitting any warning or additionalContext, effectively disabling the guard. Given the stated goal is to enforce on mcp__minsky__session_pr_merge regardless of contributor branch naming, this is a silent behavior gap. Suggested fix: resolve PR by commit SHA (gh pr view --json) or by searching PRs with--searchfor the task token in title/body, or derive current branch viagit rev-parse --abbrev-ref HEADand use--head` .

Copy link
Copy Markdown

@minsky-reviewer minsky-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown


Requesting changes: while the feature is well-scoped with strong unit coverage, there are structural issues that create bypass and timeout risks. The PreToolUse timeout (20s) is lower than the cumulative gh/git timeouts in the hook, making premature termination likely; PR resolution relies on a brittle branch-name convention and exits silently when not found; and PR file fetch failures return an empty list without surfacing a warning, allowing merges when the check couldn’t run. Please raise or reconcile timeouts, make PR identification robust (and warn on failure), and propagate warnings for file-fetch errors. Non-blocking nits: broaden the negation guard and centralize remote-URL parsing to avoid duplication.

Findings

  • [BLOCKING] .claude/settings.json:19 — PreToolUse timeout (20s) is lower than cumulative gh/git timeouts in the hook, risking premature kill and silent allow
    The PreToolUse entry for mcp__minsky__session_pr_merge sets "timeout": 20 (see .claude/settings.json:19). In require-execution-evidence-before-merge.ts, the sequential subprocess calls can cumulatively exceed 20s under normal latency: deriveRepoFromGit uses a 5s timeout, the PR resolution step (gh pr list) uses the execWithPath default of 10s, and each of fetchPrFiles and fetchPrMeta uses 15s (see makeProdPrDeps). In a degraded network or paginated PR, this sequence can exceed 20s, causing the host to SIGTERM the hook mid-flight. Because the hook is designed fail-open only when it can emit a warning, a hard timeout will bypass enforcement without a structured message. Raise the hook timeout (or proportionally lower per-call budgets and ensure total fits under the configured ceiling) to prevent premature termination.
  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:290 — Brittle PR-number resolution assumes task/ branch naming and silently skips enforcement when not found
    At lines ~290–319, the hook constructs const branch = task/${task.replace("#","-")}and usesgh pr list --head --jq '.[0].number'to infer the PR number. If the active PR's branch does not follow this exact naming convention (e.g., a different prefix, or a teammate/fork uses a different branch name),gh pr listwill return no match and the codeif (!prNumber || isNaN(prNumber)) process.exit(0);exits early without any warning or audit trail. Additionally, for fork-based PRs,gh pr list --headtypically requiresowner:branch; using only against--repo owner/repooften yields no results, again causing a silent allow. This creates a reliable bypass path where the guard never runs. Please (a) add robust PR identification (e.g., resolve current HEAD’s associated PR viagh pr view --json number` or search by SHA), (b) support fork PR heads (owner:branch), and (c) emit a warning when PR resolution fails instead of exiting silently.
  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:142 — Silent allow when PR files fetch fails — returns [] without surfacing a warning to the top-level
    In makeProdPrDeps.fetchPrFiles (around lines 142–180), non-zero exit or JSON parse errors return an empty array. The top-level enforcement then computes newTestFiles from this empty list and may allow merge even when the API call failed. Unlike the PR metadata path (which fails open with a user-visible warning via writeOutput), file-fetch failures have no warning or audit. This creates a stealth bypass if gh api is unavailable or times out. Please propagate a warning (e.g., by returning a tuple with warnings or emitting a writeOutput additionalContext) and consider failing open with an explicit warning instead of silently returning [].

Inline comments

  • .claude/hooks/require-execution-evidence-before-merge.ts:307 — Consider switching to gh pr view --json number --jq .number without --head to resolve the current PR from the checked-out branch, or use gh pr list --search sha:<HEAD> to avoid relying on an exact task/<id> head name. Also, fork PRs require owner:branch for --head matching.

…, fileFetch warning

BLOCKING #1: raise require-execution-evidence hook timeout from 20s to 60s in
.claude/settings.json to cover worst-case sequential subprocess budget
(deriveRepoFromGit 5s + gh pr list 10s + fetchPrFiles 15s + fetchPrMeta 15s = 45s).

BLOCKING #2: replace brittle  branch-name PR resolution with a
two-path resolvePrNumber() helper: primary path uses  (current
branch, handles forks and non-standard names); fallback uses 921	fix(mt#1486): mt#1486: widen prior-review parsers for bare [BLOCKING] markers	task/mt-1486	OPEN	2026-04-30T22:04:07Z
920	feat(mt#1465): mt#1465: severity-monotonicity prototype + A/B replay measurement	task/mt-1465	OPEN	2026-04-30T21:12:37Z
914	feat(mt#1071): Add attention accounting module, attention.report command, and --show-attention flag	task/mt-1071	OPEN	2026-04-30T19:03:03Z
909	feat(mt#1459): Add PreToolUse hook to require execution evidence before merging PRs with new tests	task/mt-1459	OPEN	2026-04-30T18:06:16Z
908	feat(mt#1460): Add execution evidence requirement for new test files to prepare-pr skill	task/mt-1460	OPEN	2026-04-30T17:59:59Z
907	feat(mt#1348): Auto-apply review-state labels to PRs on review submission	task/mt-1348	OPEN	2026-04-30T09:19:25Z
906	feat(mt#1341): Add suggestion field to ReviewComment for GitHub suggestion blocks	task/mt-1341	OPEN	2026-04-30T09:13:52Z
900	feat(mt#1467): Wire mergeSessionPr to emit quality.review Ask before each merge	task/mt-1467	OPEN	2026-04-30T07:31:42Z
898	feat(mt#1342): Add GitHub PR review thread resolve/unresolve via GraphQL	task/mt-1342	OPEN	2026-04-30T07:16:28Z
896	feat(mt#1343): reviewThreads field in session_pr_review_context	task/mt-1343	OPEN	2026-04-30T07:11:30Z
889	feat(mt#1346): GitHub Check Run annotations as parallel review surface	task/mt-1346	OPEN	2026-04-30T07:02:20Z
887	feat(mt#1347): Add pre-flight diff anchor validation for PR review comments	task/mt-1347	OPEN	2026-04-30T06:56:49Z
886	feat(mt#1340): Update reviewer prompt to emit line-anchored comments[] from parsedDiff	task/mt-1340	OPEN	2026-04-30T06:51:22Z
882	feat(mt#1437): Typed TS env-var synthesizer for Railway services (minsky-mcp)	task/mt-1437	OPEN	2026-04-28T22:03:56Z
872	feat(mt#1309): Merge-gate denies PRs with zero CI check_runs (PR #763 webhook-miss regression detection)	task/mt-1309	OPEN	2026-04-28T07:36:22Z
858	fix(mt#1384): write canonical github-pr ref so reconciler can route Asks	task/mt-1384	OPEN	2026-04-27T23:13:27Z
855	docs(mt#1311): Document merge-flow operational reality in /review-pr and /orchestrate skills	task/mt-1311	OPEN	2026-04-27T21:00:05Z
849	feat(mt#1306): Reviewer convergence metrics: durable Postgres persistence	task/mt-1306	OPEN	2026-04-27T19:25:19Z
809	feat(mt#1255): replace console.* with structured winston logger in reviewer service	task/mt-1255	OPEN	2026-04-26T16:57:53Z
781	fix(mt#1274): [SUPERSEDED by mt#1281] propagate sessionId to internal session_update	task/mt-1274	OPEN	2026-04-25T19:11:02Z
777	fix(mt#1261): [SUPERSEDED by mt#1281] session_pr_create sessionId-shape fix	task/mt-1261	OPEN	2026-04-25T19:01:41Z
774	test(mt#1263): Add runReview sanitize wiring integration tests	task/mt-1263	OPEN	2026-04-24T18:54:29Z
710	feat(mt#1028): KB activity source — GitHub Issues/PRs provider	task/mt-1028	OPEN	2026-04-22T23:17:34Z
703	docs(mt#1109): E2E smoke artifact — add v1-deployed status to reviewer README	task/mt-1109	OPEN	2026-04-22T22:07:14Z
684	feat(mt#1066): PostToolUse hook enforcing /review-pr after session_pr_create	task/mt-1066	OPEN	2026-04-22T17:45:49Z
682	docs(mt#1050): surface attention-allocation frame in README	task/mt-1050	OPEN	2026-04-22T17:44:30Z
538	chore(deps-dev): bump ts-morph from 26.0.0 to 28.0.0	dependabot/npm_and_yarn/ts-morph-28.0.0	OPEN	2026-04-20T00:32:48Z
537	chore(deps-dev): bump @types/node from 22.19.17 to 25.6.0	dependabot/npm_and_yarn/types/node-25.6.0	OPEN	2026-04-20T00:32:43Z
536	chore(deps): bump ai from 4.3.19 to 6.0.168	dependabot/npm_and_yarn/ai-6.0.168	OPEN	2026-04-20T00:32:34Z
535	chore(deps): bump @ai-sdk/openai from 1.3.24 to 3.0.53	dependabot/npm_and_yarn/ai-sdk/openai-3.0.53	OPEN	2026-04-20T00:32:16Z. Both failing now emits a writeOutput warning instead of
silently calling process.exit(0). Added ExecFn injectable type for unit testing.

BLOCKING #3: change fetchPrFiles return type from PrFile[] to
FetchPrFilesResult { files, warning? } so API errors are surfaced in the
audit trail via additionalContext rather than silently allowing merge.
Warning is propagated through topLevelWarnings and merged with check-level
warnings in the final writeOutput call.

Tests: 8 new tests added (resolvePrNumber: 6, fetchPrFiles warning shape: 3);
all 68 tests pass.
Copy link
Copy Markdown

@minsky-reviewer minsky-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown


Solid structural guard with extensive unit tests and thoughtful fail-open posture, but several bypass and integration risks remain. The hook ignores renamed/copied files, allowing introduction of new *.test.ts via rename without evidence; the evidence parser can be spoofed via HTML comments; and the entry point exits early if task is absent instead of attempting PR resolution via gh pr view. Additionally, mixed stdout logging across hooks risks invalid JSON output for PreToolUse consumers. Please address these blocking issues before merge by expanding file status handling (including previous_filename), hardening evidence detection, making task handling robust (or independent), and standardizing JSON-only stdout for hooks.

Findings

  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:78 — New test introduction via rename/copy is not detected — only status:"added" is considered
    findNewTestFiles filters exclusively on status === "added" and isTestFile(filename), ignoring files with status: "renamed" or "copied". This creates a bypass: a PR can introduce a new *.test.ts or *.spec.ts by renaming/copying an existing non-test file, and the hook will treat it as "no new test files" and allow merge without evidence. Evidence:
  • At .claude/hooks/require-execution-evidence-before-merge.ts: the PrFile type includes multiple statuses, but findNewTestFiles (around this line) filters strictly on status === "added".
  • At .claude/hooks/require-execution-evidence-before-merge.ts: in makeProdPrDeps.fetchPrFiles, the --jq extracts only {filename, status} from the GitHub API, dropping previous_filename which would be needed to reason about renames into test-file patterns.
    Remediation: Treat renamed/copied as "new test files" when the new filename matches a test pattern and (optionally) the previous filename did not. Update the jq to extract .previous_filename and adjust detection accordingly.
  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:149 — Execution-evidence detection can be trivially spoofed by placing the marker inside an HTML comment
    hasExecutionEvidence only examines visible lines and requires a line-start "Execution evidence:" marker, but it does not exclude content inside HTML comments. A PR body like:

will be treated as valid evidence due to the line-start match, even though it's invisible and clearly not an actual run. Evidence: regex at .claude/hooks/require-execution-evidence-before-merge.ts:149 matches any line beginning with optional # then execution evidence:; there is no guard for HTML comment blocks. Tests also lack this negative case.
Remediation: Before scanning, strip HTML comments (/<!--[\s\S]*?-->/g) or explicitly reject matches within comment blocks. Add unit tests to cover comment-wrapped markers and ensure they return false.

  • [BLOCKING] .claude/hooks/types.ts:14 — Hook output shape may be incompatible: missing top-level event fields and inconsistent use across hooks
    The writeOutput function serializes only a { hookSpecificOutput: { ... } } object. However, other hooks in this repo (e.g., parallel-work-guard.ts) sometimes also write plain text to stdout prior to writeOutput, and the consumer contract for PreToolUse may expect a single JSON object with specific top-level fields (e.g., decision, reason) rather than arbitrary logs plus JSON. While you aggregated warnings into additionalContext in the new hook, parallel-work-guard still writes multiple log lines (see .claude/hooks/parallel-work-guard.ts lines ~196, 224, 258, 289) before its JSON, potentially producing mixed output. Since this PR centralizes execWithPath but leaves mixed-output behavior, the receiving system may fail JSON parsing or drop the decision.
    Remediation: Ensure all PreToolUse hooks avoid printing non-JSON to stdout. Route warnings through writeOutput.additionalContext and remove direct process.stdout.write calls. Alternatively, write logs to stderr.
  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:399 — Hook silently skips enforcement when tool_input.task is missing instead of attempting robust PR resolution
    At .claude/hooks/require-execution-evidence-before-merge.ts:399–404, the hook reads const task = (input.tool_input.task as string | undefined) ?? ""; if (!task) process.exit(0); and exits immediately. This creates a bypass: if the invoking tool omits task (or uses taskId, which this hook does not read), enforcement is skipped entirely, even though resolvePrNumber's primary path (gh pr view --json number) does not depend on task and could still resolve the PR. In contrast, parallel-work-guard supports both task and taskId aliases. Remediation: accept taskId as a fallback, and if neither is present, still attempt the gh pr view primary resolution path (only use the branch-name fallback when task is available). Emit a warning rather than exiting silently when task context is absent.

Copy link
Copy Markdown

@minsky-reviewer minsky-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown


Solid structural addition with thorough unit tests and improved robustness (timeouts, PR resolution, pagination, and single-output discipline). However, three blocking issues remain. First, the hook is only registered for mcp__minsky__session_pr_merge; merges via the GitHub MCP path (mcp__github__merge_pull_request) appear unguarded, creating a bypass. Second, hasExecutionEvidence’s regex admits malformed headings (e.g., '##Notes execution evidence:') due to allowing no-space after heading markers; tighten to require a space or pure label. Third, deriving the repo from origin targets the contributor’s fork in fork-based workflows, causing gh API calls to miss the upstream PR and fail-open, bypassing enforcement. Address these to close bypass vectors; a minor non-blocking note flags pagination parsing edge cases.

Findings

  • [BLOCKING] .claude/settings.json:7 — Policy gap: hook only enforces on mcp__minsky__session_pr_merge; merges via mcp__github__merge_pull_request bypass enforcement
    The new PreToolUse hook is registered only under the matcher "mcp__minsky__session_pr_merge" (lines 6–17). However, the repository also supports merges via the GitHub MCP path (see PostToolUse matcher entry for "mcp__github__merge_pull_request" at lines 54–66). There is no corresponding PreToolUse enforcement for "mcp__github__merge_pull_request", which means a PR that adds new tests could be merged via the GitHub MCP tool without triggering the execution-evidence guard. This creates a silent bypass of the rule the PR intends to enforce. Please register the same hook (or an equivalent guard) under a PreToolUse matcher that covers the GitHub MCP merge path as well, or otherwise document and disable that alternate path.
  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:129 — Regex allows mid-sentence "execution evidence:" matches despite spec claiming line-start anchor
    In hasExecutionEvidence (lines ~113–160), the headingPattern is defined as /^(#{0,6}\s*)(execution evidence:\s*)(.)$/im. This matches any line where the literal text "execution evidence:" appears, even if preceded by arbitrary characters other than '#'/whitespace? Actually the first capture restricts only leading '#' and whitespace, but the regex anchors at start-of-line and then requires either 0-6 '#' and spaces followed immediately by "execution evidence:". However, your tests claim to reject mid-sentence occurrences, yet the implementation also accepts inline content case (match[3]) and counts it as evidence. More importantly, the negative guard checks "beforeMarker" by slicing up to indexOf("execution") on the entire line; if the phrase appears mid-line preceded by non-# characters (e.g., "Note: execution evidence: ..."), the headingPattern WILL NOT match, but lines like "##Notes execution evidence:" (no space between heading and text) would match due to #{0,6}\s permitting no space after '#'. This allows malformed headings like "##Notes execution evidence:" to count as evidence. To enforce a true line-start anchor with a proper heading or label, require either: (a) optional heading markers with at least one space before the label (e.g., /^#{1,6}\s+execution evidence:/i), or (b) no heading markers and no preceding content (^execution evidence:). As written, variants like "#execution evidence:" (no space) or "####execution evidence:" could be incorrectly accepted.
  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:286 — Fork workflow bypass: repo derived from origin causes all API calls to target the fork, failing open for upstream PRs
    deriveRepoFromGit (lines 262–272) reads the origin remote and parseGitHubRemoteUrl to produce owner/repo. All subsequent PR resolution and API calls use this slug (e.g., resolvePrNumber passes --repo ${repo} at lines 315–327; fetchPrFiles and fetchPrMeta call gh api repos/${repo}/pulls/... at lines 172–217). In a common fork-based workflow, origin points to the contributor's fork while the PR is opened against the upstream repository. In that case, gh pr view --repo <fork> and gh api repos/<fork>/pulls/<n> will 404 or return empty, causing the hook to emit a warning and fail-open — silently allowing merges without enforcement. This creates a systematic bypass for fork-origin branches. To fix: detect and target the upstream repository that actually owns the PR (e.g., query gh repo view --json parent to prefer .parent.nameWithOwner when present; or resolve the PR via gh pr view without --repo to infer the correct base repo, then use that for subsequent API calls).
  • [NON-BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:201 — Pagination parsing path mismatch: use of --paginate with --jq should already flatten; fallback line-splitting may mask errors
    fetchPrFiles invokes gh api ... --paginate --jq "[.[] | {filename: .filename, status: .status}]" (lines 182–197). With --paginate, gh typically concatenates the JSON arrays for each page back-to-back without newlines; with --jq, the result is already flattened per page, but concatenation can yield an invalid single JSON document. Your code attempts a direct JSON.parse first, then splits by newline and parses each line as an array (lines 198–221). If gh concatenates arrays without newlines, the second path won't help and you'll return an empty list without warning, silently allowing. Consider using --paginate --template or accumulating via --jq '.[]' and then wrapping in [], or use gh api without --jq and parse paginated pages by invoking page-by-page. At minimum, log a warning when both JSON.parse and line-by-line fail but stdout is non-empty.

…marker, HTML comment strip, rename/copy detection

Substantive #1 — hasBypassPrefix now detects [unverified-tests] anywhere in the
title (was startsWith). The sibling skill mt#1460 tells agents to put the marker at
the start of their input title, but prepare-pr composes the visible title as
feat(mt#X): input title, placing the marker mid-string. Updated function docstring
and renamed the conceptual role (marker, not prefix). Added tests for: tag at start,
tag mid-title after conventional-commit prefix, tag at end, no tag, bare word without
brackets does NOT match.

Updated the checkExecutionEvidence bypass-prefix test that previously asserted
mid-title does NOT bypass (it was testing the old startsWith behavior). Replaced with
two tests: one confirming mid-title fires, one confirming bare-word-without-brackets
does not.

Substantive #2 — hasExecutionEvidence now strips HTML comments via
body.replace before scanning. A marker inside HTML comments is invisible in
rendered Markdown and must not count as evidence. Added tests for: marker inside
single-line comment returns false; multi-line comment returns false; real marker
outside comment returns true; mixed (commented + real) returns true.

Substantive #3 — findNewTestFiles now counts renamed/copied files as new test files
when the new filename matches a test pattern AND the previous_filename did not match
a test pattern. Updated PrFile type to include optional previous_filename. Updated
fetchPrFiles jq projection to include previous_filename from the GitHub API. Added
tests for: rename non-test to test counts; rename test to test does not count (just
relocated); copy non-test to test counts; rename non-test to non-test ignored;
rename with no previous_filename conservatively included.

Intentionally NOT fixed (bypass-merged per task instructions):
- mcp__github__merge_pull_request matcher (already blocked by separate hook)
- Fork workflow / upstream parent detection (out of scope)
- Round-5 regex tightening for Notes execution evidence (self-reversal)
- parallel-work-guard.ts mixed stdout (separate file, not in mt#1459 scope)
- task/taskId alias fallback (gh pr view primary path handles it)
- Pagination alternative --template vs --paginate (current impl sufficient)
@edobry edobry merged commit ac866e2 into main Apr 30, 2026
2 checks passed
@edobry edobry deleted the task/mt-1459 branch April 30, 2026 22:30
Copy link
Copy Markdown

@minsky-reviewer minsky-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown


Strong structural guard with thorough tests, but several high-impact bypass vectors and an auditability gap remain. Deriving the repo from the local origin breaks fork-based PRs and causes a fail-open path; the evidence matcher still accepts malformed Markdown headings without a space after hashes; the hook is only wired for the Minsky merge path at PreToolUse, leaving the GitHub MCP merge path unguarded if the separate ban is disabled; and the entry point silently exits when task is missing without emitting a warning. Please address these issues to ensure consistent enforcement and fail-open-with-warning behavior before merging.

Findings

  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:424 — Fork-based PRs bypass enforcement: repo slug derived from origin targets the fork, causing gh API calls to miss the upstream PR and fail-open
    At import.meta.main, the hook derives the repo slug from the local git origin (const repo = deriveRepoFromGit(input.cwd);). In fork workflows, origin points to the contributor's fork, while the open PR lives on the upstream base repo. Subsequent calls explicitly pass --repo ${repo} to gh pr view in resolvePrNumber and to gh api repos/${repo}/pulls/${prNumber}.... When repo is the fork, gh pr view --repo <fork> typically returns no PR and gh api repos/<fork>/pulls/<#> will 404. Your logic then emits a warning and fails-open, allowing merges without evidence — a silent bypass vector. Evidence in file:
  • line ~424: const repo = deriveRepoFromGit(input.cwd);
  • line ~362: gh pr view --repo repo ...
  • lines ~219–260: gh api repos/${repo}/pulls/${prNumber}...

Fix: resolve the PR first without forcing a repo (let gh infer from the current branch), or query gh pr view --json baseRepository to obtain the base repo slug and use that for subsequent gh api calls. Alternatively, detect an upstream remote and default to it when present. Do not fail-open due to fork/upstream mismatch.

  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:154 — Evidence regex admits malformed headings without a space after Markdown hashes (e.g., '##Execution evidence:')
    The hasExecutionEvidence matcher uses headingPattern = /^(#{0,6}\s*)(execution evidence:\s*)(.*)$/im. Because the first group allows #{0,6} followed by \s*, it also matches lines with no space after the heading markers (e.g., '##Execution evidence:'), and even zero hashes (plain text), which can produce false positives like '##Notes execution evidence:' when the token follows immediately after '#' characters or other text. A prior review explicitly called for requiring a space after heading markers to avoid matches like '##Notes execution evidence:'. Tighten the pattern to require either: (a) start-of-line with optional heading markers followed by at least one space, then the label; or (b) the label at line start with no hashes. For example: /^(?:#{1,6}\s+execution evidence:|execution evidence:)\s*(.*)$/i and ensure the '## Notes' case does not match.
  • [BLOCKING] .claude/settings.json:1 — Hook only registered for mcp__minsky__session_pr_merge PreToolUse; merges via mcp__github__merge_pull_request path are unguarded at PreToolUse
    Settings register the new hook under PreToolUse for matcher "mcp__minsky__session_pr_merge" only (lines 6–20). There is a PostToolUse hook for "mcp__minsky__session_pr_merge|mcp__github__merge_pull_request" (lines 86–103), and a PreToolUse block banning GitHub MCP PR-write tools (lines 45–61), but that separate hook only denies GitHub MCP write attempts and doesn't ensure execution evidence on merges routed via GitHub MCP if that denial hook is disabled or bypassed. To close the bypass, the execution-evidence hook should also be attached to a PreToolUse matcher that includes mcp__github__merge_pull_request (or rely on the denial hook's presence by explicitly asserting dependency). As-is, a configuration drift that removes or disables the ban would re-open the GitHub MCP merge path without this guard. Register this hook for both merge tool names or make the ban hook a hard dependency with tests in settings.
  • [BLOCKING] .claude/hooks/require-execution-evidence-before-merge.ts:338 — Silent bypass when task is absent: hook exits without emitting any audit warning or context
    In the entry point, the code reads const task = (input.tool_input.task as string | undefined) ?? ""; if (!task) process.exit(0); (lines ~336–340). This creates a silent allow path when the tool input lacks task, with no writeOutput warning. That makes enforcement dependent on the calling tool always supplying task and provides no audit trail when it doesn't, contradicting the "fail-open with a warning" posture described in the PR. Emit a structured additionalContext warning instead of exiting silently so operators can detect misconfigurations and bypass attempts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

authorship/co-authored Co-authored by human and AI agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant