Skip to content

v1.57.3.0 fix(ship): always-loaded PR-title-version rule + fork-PR title-sync backstop#1909

Merged
garrytan merged 8 commits into
mainfrom
garrytan/pr-title-version-fix
Jun 8, 2026
Merged

v1.57.3.0 fix(ship): always-loaded PR-title-version rule + fork-PR title-sync backstop#1909
garrytan merged 8 commits into
mainfrom
garrytan/pr-title-version-fix

Conversation

@garrytan

@garrytan garrytan commented Jun 8, 2026

Copy link
Copy Markdown
Owner

Summary

Restores a regression and hardens the title-sync backstop. The v1.54.0.0 carve moved the "PR title must start with v$NEW_VERSION" rule out of /ship's always-loaded skeleton into the lazily-loaded pr-body.md section, so the agent only applied it when it happened to read that section — recent open PRs landed with bare titles.

Ship skeleton (agent-side fix)

  • fix(ship): restore always-loaded PR-title-version invariant to skeleton — adds a one-line invariant + helper reference to ship/SKILL.md.tmpl right before the {{SECTION:pr-body}} pointer (mirrors the AUQ always-loaded precedent); full procedure stays sectioned. Regenerated all hosts.
  • fix(ship): expand binDir path in pr-body Linked Spec block${ctx.paths.binDir} is only substituted in .ts resolvers, not .tmpl files, so it shipped literally into the generated pr-body.md Linked Spec block. Now uses the hardcoded helper path.

CI backstop (fork coverage)

  • fix(ci): pr-title-sync covers fork PRs via hardened pull_request_target — plain pull_request gives a read-only token on fork PRs, so the title-sync workflow could never edit a fork/agent PR. Switched to pull_request_target (write token in base context) and made it safe: base-repo checkout only (no PR-head code execution), all PR fields via env: (never inlined into run:), VERSION read as data from the PR head, same-repo-loud / fork-soft on failure, no raw-title echo.

Guards (keep it from drifting again)

  • test(ship): guard PR-title-version rule + pull_request_target safety — adds test/pr-title-sync-workflow-safety.test.ts, a static tripwire that fails CI if the workflow checks out PR-head code or inlines an attacker-controlled ${{ github.event.pull_request.* }} field into a run: block (the injection class actionlint can't catch).
  • refactor(test): fold ship PR-title skeleton guard into carve-guard registry — main shipped a generalized carve-guard system (v1.57.0.0 feat: carve-guard system + carve cso/document-release/design-consultation #1907) that is now the single source of truth for carved-skill skeleton invariants. Registered the rule there (mustStayInSkeleton asserts v$NEW_VERSION + the helper stay always-loaded; mustMoveToSection asserts both the create and update PR paths stay carved into the section) and deleted the now-redundant standalone test.

(The merge commit pulls in main's v1.57.0.0 carve-guard system; it is not part of this PR's substantive change.)

Test Coverage

All new behavior is covered by deterministic free tests:

  • test/pr-title-sync-workflow-safety.test.ts (new) — 5 assertions: no PR-head checkout, no ${{ pull_request.* }} in run:, uses pull_request_target, title via env:.
  • test/helpers/carve-guards.ts ship entry — mustStayInSkeleton + mustMoveToSection now assert the PR-title invariant via the carve-guard ordering test (test/carve-section-ordering.test.ts).
  • Helper unchanged, still covered by test/pr-title-rewrite.test.ts (9 cases).

The only inherent gap is pull_request_target runtime behavior, which is not locally unit-testable; covered by actionlint + the static safety tripwire + a live fork-PR smoke after merge.

Pre-Landing Review

  • Eng review (/plan-eng-review): CLEAR — 5 findings, 0 critical gaps, all folded in (including a prior-learning catch that the original commit split would fail CI check-freshness; corrected to atomic tmpl+regen).
  • Codex (/codex review, outside voice): PASS (no P1) — 8 findings, all folded in. Codex caught 4 things eng-review missed: gen:skill-docs needs --host all, the workflow needs a mechanical injection guard, the refs/pull/N/head read was an unsafe assumption (now uses head-repo metadata), and a path-token issue. One Codex finding (browse skill: default to sonnet to save tokens #8, use ${ctx.paths.binDir} in the skeleton) was REVERSED after verifying that token doesn't substitute in .tmpl files.

Eval Results

Diff-selected suite (bun run test:evals), full tier: routing 10/10 PASS, $2.39, 563s (includes journey-ship). Free gate (bun test) green.

Two ios-qa agent-flow simulation tests failed under full-suite concurrency (tailnet allowlist gate, capability-tier enforcement). Proven unrelated to this branch (blame protocol):

  1. This branch's diff touches zero ios files (all changes are ship/CI/changelog/test-registry).
  2. test/skill-e2e-ios.test.ts on origin/main in isolation: 7 pass, 0 fail.
  3. Same test on this branch in isolation: 7 pass, 0 fail.
  4. The failures only appear under parallel test:evals load with (attempt 3) retry markers — timing/resource flakiness, not a deterministic regression. Pre-existing test-infra flakiness, not introduced here.

Plan Completion

Plan implemented in full (skeleton invariant, two guard tests, CI hardening, binDir fix, carve-guard integration). All eng + codex findings folded in. 0 unresolved.

Test plan

  • bun test (free gate) — passed
  • bun test targeted: carve-guard 20/0, workflow-safety 8/0, 786 skill-doc tests
  • bun run test:evals routing 10/10; ios-qa flakiness proven pre-existing
  • actionlint .github/workflows/pr-title-sync.yml — clean
  • gstack-pr-title-rewrite.sh 3-case sanity

🤖 Generated with Claude Code

garrytan and others added 7 commits June 7, 2026 19:27
The v1.54.0.0 carve moved the 'PR title MUST start with v$NEW_VERSION' rule
out of the always-loaded ship skeleton and entirely into the lazily-loaded
pr-body.md section. The agent only set the version prefix if it happened to
read that section before creating the PR, so PRs landed with bare titles.

Restore a one-line invariant (+ helper reference) to ship/SKILL.md.tmpl right
before the {{SECTION:pr-body}} pointer, mirroring the AUQ always-loaded
precedent. Full procedure stays sectioned. Regenerated all hosts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two free gate tests so a future carve or workflow refactor can't silently
regress:

- ship-pr-title-version-always-loaded: asserts the invariant lives in the
  always-loaded ship/SKILL.md skeleton (not only sections/), and that the
  skeleton+sections union keeps BOTH the create and the existing-PR update
  title paths. Modeled on test/auq-format-always-loaded.test.ts.
- pr-title-sync-workflow-safety: static tripwire that fails CI if
  pr-title-sync.yml checks out PR-head code or inlines an attacker-controlled
  ${{ github.event.pull_request.* }} field inside a run: block (the two
  pull_request_target footguns actionlint cannot catch).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under plain pull_request the GITHUB_TOKEN is read-only on fork PRs, so the
title-sync backstop could never edit a fork/agent PR title. Switch to
pull_request_target (write token in base context) and make it safe:

- Check out the base repo only (no ref:) — execute trusted infra, never
  fork-head code.
- All attacker-controlled PR fields (title, head repo, head sha) pass via
  env: and are referenced as shell-quoted "$VAR", never inlined into run:.
- Read the PR-head VERSION as data (raw media type) from the head repo at the
  head sha; guard the assignment under set -e.
- Same-repo read failure fails loudly; fork miss warns and skips (the backstop
  stays green without going silently optional).
- Never echo the raw fork title (Actions parses ::workflow-command:: from stdout).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ship/sections/pr-body.md.tmpl:98-99 used ${ctx.paths.binDir}, but the
gen-skill-docs generator only resolves {{TOKEN}} syntax in .tmpl files — the
${...} JS-template-literal form is substituted only inside .ts resolver files.
So the token passed through literally into the generated pr-body.md, leaving the
agent with an unexpandable ${ctx.paths.binDir}/gstack-paths command in the
Linked Spec auto-detect block. Use the hardcoded helper path, consistent with
every other path reference in this section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gistry

main shipped a generalized carve-guard system (PR #1907) that is now the single
source of truth for carved-skill skeleton invariants. Register the PR-title rule
there instead of a standalone test: ship's mustStayInSkeleton asserts v$NEW_VERSION
+ the rewrite helper stay always-loaded, and mustMoveToSection asserts both the
create and update PR paths stay carved into pr-body.md (present in the union, out of
the skeleton). Delete the standalone ship-pr-title-version-always-loaded test it
replaces. The CI-workflow safety tripwire stays standalone (not a carve concern).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@trunk-io

trunk-io Bot commented Jun 8, 2026

Copy link
Copy Markdown

Merging to main in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

E2E Evals: ✅ PASS

9/9 tests passed | $.94 total cost | 12 parallel runners

Suite Result Status Cost
e2e-qa-workflow 1/1 $0.12
e2e-review 2/2 $0.24
e2e-workflow 2/2 $0.27
llm-judge 2/2 $0.04
e2e-workflow 2/2 $0.27

12x ubicloud-standard-8 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

…sion-fix

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
@garrytan garrytan merged commit d8c91c6 into main Jun 8, 2026
23 checks passed
garrytan added a commit that referenced this pull request Jun 8, 2026
…rule)

#1909 added the always-loaded PR/MR-title-version invariant to the ship skill
and regenerated ship/SKILL.md, but did not regenerate the golden fixtures, so
the golden-file regression test was red on main. Refresh the three ship
baselines (claude/codex/factory) from the current generated output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Jun 9, 2026
…ions (#1916)

* fix(plan-devex-review): add missing gstack-review-log step

plan-devex-review carried the EXIT PLAN MODE GATE but never wrote a
review-log entry, so the gate's 'review log was called' check was
structurally unsatisfiable and the Review Readiness Dashboard / GSTACK
REVIEW REPORT had no plan-devex-review data to read. Add a Review Log
section before the dashboard read, logging the devex fields the report
parser already expects (status, scores, product_type, tthw, persona,
competitive_tier, unresolved, commit).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(review): make unresolved-decisions status mandatory in GSTACK REVIEW REPORT

The report's UNRESOLVED line was optional ('omit if empty') and the EXIT
PLAN MODE GATE only checked it 'if applicable', so a plan could ship with
no statement about open decisions at all — a missed ambiguity read
identically to a clean plan. Now every report ends with a mandatory
unresolved-decisions status as its final line: either the exact unbolded
sentinel 'NO UNRESOLVED DECISIONS', or a '**UNRESOLVED DECISIONS:**' block
of bullets. The gate blocks ExitPlanMode unless that final line is present.

generatePlanFileReviewReport: current-review items are listed from context;
prior reviews contribute an aggregate count computed as latest-fresh-row-
per-skill minus the current run (no double-count, dashboard 7-day window).
generateExitPlanModeGate: check #3 is now blocking with no 'if applicable'
escape; bolded sentinel does not satisfy it.

Tests: static guard in gen-skill-docs.test.ts asserts the mandatory status
across all six report consumers and the gate across gate-bearing skills;
skill-e2e-plan.test.ts asserts the written report's final line is the
status (and fixes a stale 'four review rows' -> five-row prompt).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(review): compress unresolved-status prose to fit parity budget

After merging origin/main (v1.57.3.0), plan-devex-review exceeded the 1.05x
parity ratio vs the v1.53.0.0 baseline. Rather than rebase the baseline,
compressed the new prose to stay under the cap honestly: the report's
unresolved-status block (~32 -> ~9 lines) and the EXIT PLAN MODE GATE's
final-line check (~7 -> ~5 lines), plus the plan-devex-review review-log
step. All load-bearing rules and the exact gate-checkable tokens are
preserved; the static guards in gen-skill-docs.test.ts still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: regenerate stale ship golden fixtures (#1909 follow-up)

#1909 (v1.57.3.0) added the always-loaded PR-title-version rule to ship's
template and committed the regenerated ship/SKILL.md, but did not refresh the
three ship golden fixtures, leaving the golden-file regression test red on
main. Regenerate them from current output. The diff is purely #1909 content:
the PR-title invariant line plus a previously-unresolved ${ctx.paths.binDir}
placeholder that current generation correctly resolves. No feature content
from this branch leaks into ship (ship does not consume the review report
resolvers).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(plan-devex-review): restore TIMESTAMP fill instruction in review-log

Adversarial review caught that compressing the devex review-log block dropped
the TIMESTAMP substitution guidance the three sibling plan-review skills carry.
A literal "timestamp":"TIMESTAMP" parses as JSON but is an unparseable date,
so the Review Readiness Dashboard's 7-day freshness window silently drops the
plan-devex-review row (and the report's prior-review aggregation loses it).
Restore the one-line instruction. Also: the plan-review-report E2E now derives
its last-line check from the report slice, not the whole file, so a mis-placed
report surfaces the real trailing content in the failure message.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(parity): rebase parity baseline v1.53.0.0 -> v1.57.7.0

The v1.53 anchor is four minor versions stale. v1.54-v1.57 (ship/plan carving,
carve-guards, AUQ prose fallback, the cross-session decision-log preamble) plus
this branch's mandatory unresolved-decisions status line pushed the three
plan-review skills past the 5% ratchet even after exhaustive compression. The
new baseline captures current UNION sizes (skeleton + sections/*.md, matching
what parity-harness measures) so the per-skill 1.05 ratio keeps catching future
bloat. The frozen v1.44.1 integrity anchor and the v1.47 size-budget baseline
are untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.57.7.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant