Nightly 2026-04-26 v2 — 9 productive cycles, +0 code-driven goals (corpus-state isolated) by boshu2 · Pull Request #150 · boshu2/agentops

boshu2 · 2026-04-26T06:24:11Z

Second nightly run for 2026-04-26 (PR #147 was the morning run; merged at 800eea8a). Branched from origin/main post-merge. 9 productive cycles, 0 stale-audit cycles, 0 auto-reverts. Score holds at 92.66 (the only failing goal is the documented corpus-state flywheel-compounding, weight 8).

Fitness delta (score: 92.66 → 92.66)

Goal	Weight	Baseline	Final	Δ	Notes
flywheel-compounding	8	fail	fail	=	Corpus-state — now tagged `long-cycle, corpus-state` (cycle 1)
go-cli-tests	8	pass	pass	=
go-cli-builds	8	pass	pass	=
flywheel-proof	7	pass	pass	=	One transient network flake (503 from proxy.golang.org) — passed on retry, not caused by any cycle
wiring-closure	7	pass	pass	=
security-gate	6	pass	pass	=
go-complexity-ceiling	6	pass	pass	=	Cycle 9's first build pushed RunMeasure to CC 21; refactored to extract `applyMeasureFilters` BEFORE committing → final CC ≤16
hook-preflight	6	pass	pass	=
skill-frontmatter	6	pass	pass	=
flywheel-lifecycle	6	pass	pass	=
manifest-versions-match	5	pass	pass	=
goals-validate	5	pass	pass	=
go-vet-clean	5	pass	pass	=
contract-compatibility	5	pass	pass	=
install-smoke	5	pass	pass	=
codex-parity-drift	5	pass	pass	=
compile-freshness	4	pass*	pass	=	Runtime-artifact flip from Dream's overnight defrag-preview
compile-no-oscillation	4	pass*	pass	=	Runtime-artifact flip (same source)
competitive-freshness	3	pass	pass	=

*Pre-Dream baseline was fail for both compile-* gates because .agents/defrag/latest.json is gitignored and the env starts without it. Dream's defrag-preview writes .agents/overnight/<run>/defrag/latest.json, which the gates' fallback path consumes, flipping them to pass. Both gates were tagged runtime-artifact in cycle 1, so the classification is no longer hidden.
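The reported 92.66 is consistent with a weight-proportional score. This is a worked check under that assumption, not a statement of how `ao goals measure` computes the figure: the 19 weights in the table sum to 109, and the only failing goal carries weight 8.

$$
\text{score} = 100 \cdot \frac{\sum_{g \in \text{pass}} w_g}{\sum_{g} w_g} = 100 \cdot \frac{109 - 8}{109} = 100 \cdot \frac{101}{109} \approx 92.66
$$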

Code-driven flips vs runtime-artifact flips

Type	Goal	Source
Code-driven	(none — fitness was already 92.66 post-Dream; cycles built code-quality / observability / dev-loop improvements without flipping additional gates)	—
Runtime-artifact	compile-freshness, compile-no-oscillation	Dream `overnight start` writing `.agents/overnight/<run>/defrag/latest.json` (gitignored — does not propagate via PR)

The corpus-state flywheel-compounding (w=8) was NOT pursued for a metric flip. PR #147 already did the legitimate observability fix (scripts/check-flywheel-compounding.sh surfaces σρδ + root cause). Today's heavy-goal partial fix is structural quarantine plumbing (cycles 1 + 9): explicit tagging in GOALS.md and an operator-facing --exclude-tag flag so future runs can score "what's actually code-actionable" without lowering the goal's weight.

Per-cycle summary

#	Type	Target	Commit	Fitness before	Fitness after
1	productive (heavy-goal partial fix)	`flywheel-compounding` (w=8) — tag with `long-cycle, corpus-state`; runtime-artifact gates with `runtime-artifact`; Tags column parsing → JSON measurement output	`56199e81`	92.66	92.66
2	productive	`internal/ratchet` fuzz target had no companion seed-correctness asserts; added 7-case `TestFuzzParseChainLines_SeedCorrectness`	`fa14b65c`	92.66	92.66
3	productive	`skills/vibe/SKILL.md` Step 2 ran radon/gocyclo unconditionally; gated on `HAS_PY` / `HAS_GO` from diff to stop docs/shell-only epics from hanging in gocyclo	`530dd699`	92.66	92.66
4	productive	`evolve-coverage-delta.sh` averaged Go's `coverage: 1/12 events` informational line in as 1.0%, dragging project averages down. Tightened regex to require `%` anchor; added `tests/scripts/evolve-coverage-delta.bats` (7 cases)	`456a9e16`	92.66	92.66
5	productive	`parseGateRow` CC pushed to 18 by cycle 1; restored head-room by dropping CC to 7 via extracted `cellAt` / `parseCheckCell` / `parseWeightCell` helpers (with unit tests)	`9a8a4a8c`	92.66	92.66
6	productive	`audit-assertion-density.sh` defaulted to `coverage_test.go` (banned by CLAUDE.md → audited zero files). New default `*_test.go` + `--scope <pattern>` + recognize `goleak.VerifyNone`. 8-case bats fixture	`ac660ae2`	92.66	92.66
7	productive	`skills-codex/shared/references/strict-delegation-contract.md` had 12 `Skill()` references; rewrote to `$skill` notation per Codex convention (Claude variant unchanged)	`bb619271`	92.66	92.66
8	productive	`newAdhocContextRunID` collision behavior had no test; added idempotency-on-collision test + distinct-suffix test + expanded godoc on the 1/65 536 collision rate	`2923e95d`	92.66	92.66
9	productive	`ao goals measure --exclude-tag <tag>` flag — operators can now ask "score with corpus-state goals excluded" without lowering weights. Builds on cycle 1's tags. `applyMeasureFilters` extracted for CC discipline	`d0c4220f`	92.66	92.66

Findings opened / closed / deferred

Closed via implementation (this run):

council-finding: "Add fuzz seed correctness assertions" — closed by cycle 2 (internal/ratchet/fuzz_test.go was the last fuzz target without a companion correctness suite)
council-finding: "Add language filter to vibe complexity analysis" — closed by cycle 3
post-mortem #5: "Expand audit-assertion-density.sh to cover all test files" — closed by cycle 6
post-mortem: "Fix coverage-ratchet.sh cmd/ao event format handling" — closed by cycle 4 (the bug actually lived in evolve-coverage-delta.sh's averager, not coverage-ratchet.sh which doesn't exist)
council-finding: "Sweep skills-codex/ DAG bodies for Skill() to $skill notation" — closed by cycle 7
post-mortem: "Document adhoc ID 1-second collision behavior" — closed by cycle 8 (godoc + tests pinning the contract)

Heavy-goal partial fix delivered (DEFINITIONS option b):

flywheel-compounding (w=8) — corpus-state, multi-session bound. PR #147 added the observability gate. This run added the structural quarantine layer:
- Tags column on the GOALS.md gates table (parsed into Goal.Tags, surfaced in Measurement.Tags JSON)
- flywheel-compounding tagged long-cycle, corpus-state
- compile-freshness / compile-no-oscillation tagged runtime-artifact
- ao goals measure --exclude-tag <tag> for operator-side filtering
- Heavy goal stayed fail — that is the correct outcome (it must remain a fitness signal until applied/reference citations land in the corpus over multiple sessions). Documenting and tooling around the irreducibility, not gaming the metric.

Inline-probe rejections (counted separately from stale-audit cycles):

"Decompose skills/crank/SKILL.md to under 248-line limit" — crank is tier=execution, limit 800, current 660 → already passing
"Add cmd/ao to coverage baseline" — scripts/check-cmd-ao-coverage.sh already enforces a 76% floor on cli/cmd/ao
"Fix go-test-precommit.sh to use stdin JSON pattern" — already uses INPUT=$(cat) + jq (PR #147 audit confirmed; re-confirmed)
"Implement sections.include allowlist semantics" — applyContextFilter already iterates Sections.Include (PR #147)
"Add intel_scope and section-name enum validation" — validateIntelScope already validates section names (PR #147)
"Rename --fast-path to --quick-gates across /rpi" — explicit "Needs user confirmation before rename" gate per the next-work entry; not actioning autonomously

Deferred (not actioned — vague or out-of-scope-for-cycle):

"Resolve command output contract drift for shared output flags" — investigation showed only minor help-text drift between goals (defines its own --json for state isolation) and root --json; not a contract violation
"Resolve swarm-remediation-fix unresolved findings (8 items)" — most of those 8 items were probed today and either closed or proven stale; remaining 2 are vague
"Production command refactors can miss the paired test diff..." — descriptive risk note rather than actionable fix; the gate (scripts/check-go-command-test-pair.sh) already enforces co-change

Stale-audit count

Explicit stale-audit cycles: 0 (none of today's commits were just bookkeeping)
Inline-probe rejections: 6 (listed above — 5-second probe found work already shipped, no commit consumed)

The 6 rejections are well above the consecutive-3 threshold for ladder escalation, but escalation already happened naturally — every productive cycle came from either the heavy-goal lane or generator-layer findings (test, perf, refactor, docs-as-tool).

Auto-reverts

None. No goal with weight ≥ 3 regressed. Cycle 9's intermediate state pushed RunMeasure to CC=21 (would have failed go-complexity-ceiling); caught by the post-build measure check before commit and refactored down to ≤16 by extracting applyMeasureFilters. Same response as auto-revert without losing the work.

Quarantined goals

flywheel-compounding (w=8) — confirmed multi-session corpus-state goal. Now structurally tagged long-cycle, corpus-state (cycle 1) AND has an operator-facing --exclude-tag escape (cycle 9). Recommend keeping the gate and weight as-is — the tag-based filter is the right mechanism for "give me a code-actionable score" rather than weight reduction.

Dream meta-findings

dream-corpus-stale (rank 1): "Write AgentOps philosophy doc..." — docs/philosophy.md exists, last_reviewed 2026-04-12. Same finding as PR #147; Dream is still emitting a packet whose work shipped. Worth a Dream-curator pass.
dream-corpus-stale-rank3 (rank 3): "Backfill next-work queue rows to schema v1.3..." — scripts/check-next-work-schema-rows.sh (added in PR #147 cycle 3) reports 66 row(s) conform to v1.3 schema enums. Same stale finding as PR #147. The Dream emission ranking is correlating poorly with corpus state; tracking this as a producer-side issue.

Two stale Dream packets in two consecutive nightlies on the same date is a real signal that Dream's morning-packet generator is not consulting the recent next-work queue or recent merged PRs before ranking. Recommend a tractability probe in the Dream pipeline itself.

bd / tracker degradation notes

bd CLI unavailable: command -v bd returns nothing, no scripts/install-bd.sh exists in the repo, no .beads/ directory. Identical to PR #147's environment. Cycles selected from heaviest-failing-goal + generator-layer findings + next-work queue instead. Worth a follow-up to either (a) ship scripts/install-bd.sh so future runs can self-install, or (b) document bd unavailable as the expected steady state and stop logging it as a degradation.

Scope-discipline notes

Worktree-disposition gate failed on the nightly branch (expects main) — known false positive, ignored per spec. No silencing flag exists in the script. Same as PR #147.
Tag push to nightly/2026-04-26-v2 403'd as expected; falling back to branch ref. The nightly/2026-04-26-v2 branch on origin is the anchor for tomorrow's audit.
Embedded hooks/skills sync verified via pre-push-gate.sh --fast (only failure is the worktree false positive; 32 actual checks pass or skip).
CLI docs regenerated in cycle 9 to keep cli/docs/COMMANDS.md in sync with the new --exclude-tag flag.

Validation

cd cli && go run ./cmd/ao autodev validate --file ../PROGRAM.md --json → valid:true
cd cli && go vet ./... → clean
cd cli && go test ./internal/goals ./internal/ratchet ./cmd/ao → all ok
bash skills/heal-skill/scripts/heal.sh --strict → All clean. No findings.
bash scripts/audit-codex-parity.sh → Codex parity audit passed.
bash tests/skills/lint-skills.sh → 69/69 skills pass
bash scripts/check-go-absolute-complexity.sh --dir cli/ --threshold 20 → All functions below 20
bash scripts/check-go-absolute-complexity.sh --dir cli/internal/ --threshold 18 → All functions below 18
scripts/generate-cli-reference.sh --check → cli/docs/COMMANDS.md is up to date
Final ao goals measure --json: PASS=18, FAIL=1, SCORE=92.66
ao goals measure --exclude-tag long-cycle --json: PASS=17, FAIL=0, SCORE=94.06 (cycle-9 demonstration)

Commits

56199e81 feat(goals): tag flywheel-compounding as long-cycle/corpus-state via Tags column
fa14b65c test(ratchet): pin FuzzParseChainLines seed inputs with correctness asserts
530dd699 fix(vibe): gate radon/gocyclo on actual language presence in diff
456a9e16 fix(coverage): require % anchor so 'coverage: 1/12 events' isn't averaged in
9a8a4a8c refactor(goals): drop parseGateRow CC 18→7 by extracting cell helpers
ac660ae2 fix(audit): scope assertion-density audit to all *_test.go by default
bb619271 fix(codex): rewrite Skill() to $skill notation in delegation contract
2923e95d test(inject): pin adhoc context-id collision idempotency + distinct-suffix
d0c4220f feat(goals): --exclude-tag flag for ao goals measure

(Branch ref nightly/2026-04-26-v2 on origin serves as tomorrow's audit anchor in lieu of a tag — tag push 403'd as expected.)

Generated by Claude Code

…Tags column

Adds an optional Tags column to the GOALS.md gates table, parses it through to Goal.Tags, and propagates the tags into the Measurement JSON output of `ao goals measure`. Tags are comma- or semicolon-separated, lowercased, and empty entries are dropped.

Tags `flywheel-compounding` (w=8) with `long-cycle, corpus-state` so operator tooling (e.g. /evolve selection) can recognize that this gate's green path depends on multi-session corpus growth, not the current commit. Marks the two compile-* gates with `runtime-artifact` for the same reason — their green path requires `ao defrag` (or Dream's overnight preview) to have written `.agents/.../defrag/latest.json` locally, and that file is gitignored.

Backwards-compatible: gates without a Tags column continue to parse as before and emit `tags: null` in JSON output.

Tests:
- TestParseTagsCell (8 cases — empty, single, comma/semicolon, backticks, case-folding, dropped-empties)
- TestParseGateRow_TagsHeaderColumn — full row with tags
- TestParseGateRow_TagsAbsentByDefault — no Tags header → nil Tags
- TestParseGatesTable_TagsRoundTrip — two-row table with mixed tags
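The cell-parsing rules described above (comma or semicolon separators, lowercasing, dropped empties, backtick tolerance) can be sketched roughly as follows. This is an illustrative reconstruction, not the repository's code: the name `parseTagsCell` is inferred from the `TestParseTagsCell` test name, and the nil return for an empty cell is an assumption drawn from the `tags: null` note.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTagsCell is a hypothetical sketch of the Tags-cell parser.
// It splits on commas or semicolons, trims whitespace and wrapping
// backticks, lowercases each tag, and drops empty entries.
// A cell with no usable tags yields nil (assumed to serialize as null).
func parseTagsCell(cell string) []string {
	fields := strings.FieldsFunc(cell, func(r rune) bool {
		return r == ',' || r == ';'
	})
	var tags []string
	for _, f := range fields {
		tag := strings.ToLower(strings.Trim(f, " \t`"))
		if tag != "" {
			tags = append(tags, tag)
		}
	}
	return tags
}

func main() {
	fmt.Printf("%q\n", parseTagsCell("`long-cycle, corpus-state`"))
	fmt.Printf("%q\n", parseTagsCell("Runtime-Artifact; ;"))
	fmt.Println(parseTagsCell("") == nil)
}
```

Returning nil rather than an empty slice keeps rows without tags indistinguishable from tables that have no Tags column at all, which matches the backwards-compatibility note.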

…sserts

The fuzz target proves no-panic but never asserts the parser populates chain.ID, chain.EpicID, or chain.Entries correctly for its seed corpus. A regression that silently dropped entries would still pass.

Add TestFuzzParseChainLines_SeedCorrectness covering all 7 seeds:
- metadata-plus-one-entry → ID + 1 entry, first step "research"
- metadata-only → ID set, 0 entries
- empty input → no error, no entries
- non-JSON first line → returns error
- malformed entry between valid ones → entry skipped, valid one survives
- empty metadata object → no error, empty fields
- epic_id with two entries → ID + EpicID + 2 entries

Closes the post-mortem #6 finding "Add fuzz seed correctness assertions" for the last fuzz target that lacked one (cli/cmd/ao already had companion SeedCorrectness tests for fuzz_jsonl_test.go and fuzz_context_test.go).

The vibe complexity step previously emitted both Python and Go analyzer blocks unconditionally. For docs/shell/BATS-only epics this caused gocyclo to walk the entire cli/ tree and hang (real harm reported in post-mortem).

Add a language-detection preflight that sets HAS_GO and HAS_PY from the diff (or from <path> when no diff base is available), then guard the radon and gocyclo blocks behind those flags. When a language is absent, emit `ℹ️ COMPLEXITY SKIPPED: no .<ext> files in diff` instead of running the analyzer.

heal --strict and codex-parity audit both still pass.

Closes the council-finding next-work item: "Add language filter to vibe complexity analysis"
evidence: gocyclo hung during vibe for a docs/shell/BATS-only epic
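The preflight idea can be illustrated with a small sketch. The real step lives in shell inside skills/vibe/SKILL.md; the version below restates the same logic in Go for illustration only, and `detectLanguages` plus the sample file list are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// detectLanguages is a hypothetical stand-in for the HAS_GO / HAS_PY
// preflight: it reports which analyzers have any files to look at.
func detectLanguages(changed []string) (hasGo, hasPy bool) {
	for _, p := range changed {
		switch filepath.Ext(p) {
		case ".go":
			hasGo = true
		case ".py":
			hasPy = true
		}
	}
	return hasGo, hasPy
}

func main() {
	// Assumed example: a docs/shell/BATS-only change set.
	changed := []string{"docs/philosophy.md", "scripts/check.sh", "tests/scripts/x.bats"}
	hasGo, hasPy := detectLanguages(changed)
	if !hasGo {
		fmt.Println("ℹ️ COMPLEXITY SKIPPED: no .go files in diff")
	}
	if !hasPy {
		fmt.Println("ℹ️ COMPLEXITY SKIPPED: no .py files in diff")
	}
}
```

Gating on the changed-file list rather than on the repository contents is what prevents a docs-only epic from triggering a full-tree gocyclo walk.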

…aged in

evolve-coverage-delta.sh's grep accepted any "coverage: N" prefix and silently consumed Go's event-coverage informational line — emitted by packages with no _test.go files when the run uses -coverpkg, e.g. cmd/ao prints `coverage: 1/12 events (informational)`. The "1" was extracted as 1.0% and folded into the project average, dragging it down by tens of points.

Tighten the regex to `coverage: [0-9]+(\.[0-9]+)?%` so only true percent-format lines participate in the average.

Verified by isolated parser test (./tests/scripts/evolve-coverage-delta.bats): Input combining 1/12 events + 76.5% + 88.2% lines now averages to 82.3, matching what an operator expects (the event line is excluded entirely).

Closes the post-mortem next-work finding "Fix coverage-ratchet.sh cmd/ao event format handling" (the report named the wrong script — the bug was in evolve-coverage-delta.sh's averaging step).
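To make the effect of the `%` anchor concrete, here is a minimal sketch of the same extraction in Go. The actual fix is a `grep -oE` pattern in a shell script; `averageCoverage` and the sample output lines are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// coveragePct mirrors the tightened pattern: a number must be followed by '%'.
var coveragePct = regexp.MustCompile(`coverage: ([0-9]+(?:\.[0-9]+)?)%`)

// averageCoverage returns the mean of all percent-format coverage values
// in the output and how many lines contributed. Hypothetical helper.
func averageCoverage(output string) (float64, int) {
	sum, n := 0.0, 0
	for _, m := range coveragePct.FindAllStringSubmatch(output, -1) {
		v, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		sum += v
		n++
	}
	if n == 0 {
		return 0, 0
	}
	return sum / float64(n), n
}

func main() {
	// Assumed sample lines; only the event line's wording comes from the text above.
	out := "coverage: 1/12 events (informational)\n" +
		"coverage: 76.5% of statements\n" +
		"coverage: 88.2% of statements\n"
	avg, n := averageCoverage(out)
	fmt.Printf("lines=%d avg=%.2f\n", n, avg)
}
```

With the anchor in place the event line contributes nothing, so the mean is taken over 76.5 and 88.2 only.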

Cycle 1's Tags branch pushed parseGateRow to CC 18, exactly at the cli/internal/ go-complexity-ceiling threshold.

Restore head-room by splitting the per-column read pattern into helpers:
- cellAt(cells, colMap, key) → trimmed value if mapped & in-range
- parseCheckCell(s) → strips wrapping backticks
- parseWeightCell(s) → defaults to 5 on parse error / out-of-range

parseGateRow now reads as a flat sequence of `if v, ok := cellAt(...)` applications, dropping CC to 7. The old behavior (defaults, fallbacks, empty-description → ID) is preserved by the helpers.

Tests:
- TestCellAt — mapped/missing/out-of-range/whitespace-trim cases
- TestParseCheckCell — backtick strip, no-backticks, empty, single backtick
- TestParseWeightCell — valid 1/5/10, out-of-range, non-numeric, empty
- All existing TestParseGateRow_* and TestParseGatesTable_* still pass
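A rough sketch of what such helpers might look like is given below. The helper names come from the commit message, but the bodies are guesses: in particular the 1..10 weight range is inferred from the "valid 1/5/10" test cases, and the column keys in `main` are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cellAt returns the trimmed cell for key when the column is mapped
// and the index is within the row. Sketch only.
func cellAt(cells []string, colMap map[string]int, key string) (string, bool) {
	idx, ok := colMap[key]
	if !ok || idx < 0 || idx >= len(cells) {
		return "", false
	}
	return strings.TrimSpace(cells[idx]), true
}

// parseCheckCell strips wrapping backticks from a check command cell.
func parseCheckCell(s string) string {
	return strings.Trim(strings.TrimSpace(s), "`")
}

// parseWeightCell falls back to 5 on parse errors or values outside
// the assumed 1..10 range.
func parseWeightCell(s string) int {
	n, err := strconv.Atoi(strings.TrimSpace(s))
	if err != nil || n < 1 || n > 10 {
		return 5
	}
	return n
}

func main() {
	colMap := map[string]int{"id": 0, "check": 1, "weight": 2, "tags": 3}
	cells := []string{" go-vet-clean ", "`go vet ./...`", "50"} // no tags cell
	if v, ok := cellAt(cells, colMap, "id"); ok {
		fmt.Println("id:", v)
	}
	if v, ok := cellAt(cells, colMap, "check"); ok {
		fmt.Println("check:", parseCheckCell(v))
	}
	if v, ok := cellAt(cells, colMap, "weight"); ok {
		fmt.Println("weight:", parseWeightCell(v))
	}
	if _, ok := cellAt(cells, colMap, "tags"); !ok {
		fmt.Println("tags: absent")
	}
}
```

Each column read collapses to one `if v, ok := cellAt(...)` statement, which is how the caller can stay flat while the bounds and default handling live in the helpers.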

The script's default `find -name '*coverage*_test.go'` audited an empty set on a clean checkout — CLAUDE.md bans `cov*_test.go` naming, so the report had nothing to report. Post-mortem #5 flagged this.

Changes:
- New default scope: `*_test.go` (audit all test files in TARGET_DIR).
- New `--scope <pattern>` flag accepting `coverage` (legacy alias), `all` (explicit alias for the new default), or any glob.
- Recognize `goleak.VerifyNone` as a valid assertion form (e.g. measure_signal_test.go's goroutine-leak sentinel).
- Drop the `|| echo 0` fallback on `grep -c`. GNU grep prints `0` AND exits 1 when there are no matches, so the fallback double-printed ("0\n0\n") and broke the awk ratio math. Use `: ${var:=0}` instead.

Closes the post-mortem next-work item: "Expand audit-assertion-density.sh to cover all test files"

Tests (tests/scripts/audit-assertion-density.bats):
- default scope sees all *_test.go files
- default scope flags hollow files
- --scope coverage restricts to *coverage*_test.go
- --scope all is an explicit alias for the new default
- --check exits 1 when hollow tests are present
- --check exits 0 when no hollow tests in scope
- custom glob via --scope is honored

Codex DAG steps in skills-codex/ should use Codex's \$skill invocation syntax, not Claude's Skill() tool form. The shared strict-delegation-contract.md mirror still carried 12 Skill() references, flagged by the vibe judge on 2026-04-19.

Rewrites in skills-codex/shared/references/strict-delegation-contract.md:
- Skill(skill="<name>", ...) → \$<skill> <args> (matches converter.sh's rule: s/Skill\(skill="([^"]+)"...)/$1 \$2/)
- Three concrete examples in the Positive Pattern section now show \$discovery, \$crank, \$validation invocations (not Skill(skill=...))
- Anti-Pattern section talks about Codex sub-agent spawns (not Agent()) since that's the closest Codex equivalent
- Anti-Pattern table renames "Skill() vs Agent()" → "Codex sub-agents vs \$<skill> invocations" with corresponding cell updates

Verification:
- skills/shared/references/strict-delegation-contract.md (Claude variant) is unchanged — still uses Skill() syntax which is correct for Claude
- bash scripts/audit-codex-parity.sh → passed
- bash skills/heal-skill/scripts/heal.sh --strict → clean

Closes the council-finding next-work item: "Sweep skills-codex/ DAG bodies for Skill() to \$skill notation"

…uffix

newAdhocContextRunID's 1-second timestamp granularity has been a recurring concern (sweep 2026-04-15, post-mortem next-work item). The actual collision rate is ~1/65 536 within a second because we suffix with 4 random hex chars, but the contract had no test pinning that promise.

Add two tests:
- TestEnsureContextDir_IdempotentOnAdhocCollision — calling ensureContextDir twice with the same adhoc-1234-abcd MUST reuse the same directory (proven by writing a marker between calls and reading it back from the second-call path)
- TestNewAdhocContextRunID_DistinctSuffixesInSameSecond — two reads with distinct entropy in the same second produce distinct IDs (adhoc-2000-0001 vs adhoc-2000-fffe)

Also expand godoc on newAdhocContextRunID and ensureContextDir to document the collision rate and the MkdirAll-idempotency contract, so future callers don't try to use the path itself as a session-uniqueness signal.

Closes the post-mortem next-work item: "Document adhoc ID 1-second collision behavior"
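The 1/65 536 figure follows directly from the 4-hex-character suffix (16^4 = 65 536 possible values per second). The sketch below shows the arithmetic and the ID shape; `formatAdhocID` is a hypothetical helper, and the `adhoc-<seconds>-<4 hex>` layout is inferred from the example IDs in the test descriptions rather than taken from the real function.

```go
package main

import "fmt"

// suffixSpace is the number of distinct 4-hex-char suffixes: 16^4.
const suffixSpace = 1 << 16

// formatAdhocID builds an ID from a Unix-second timestamp and 16 bits
// of entropy. Hypothetical sketch of the assumed ID layout.
func formatAdhocID(unixSec int64, entropy uint16) string {
	return fmt.Sprintf("adhoc-%d-%04x", unixSec, entropy)
}

func main() {
	a := formatAdhocID(2000, 0x0001)
	b := formatAdhocID(2000, 0xfffe)
	fmt.Println(a, b, a != b)
	fmt.Printf("same-second collision chance per pair: 1/%d\n", suffixSpace)
}
```

Because two IDs minted in the same second can still collide, the directory path is safe to reuse but is not evidence of a unique session, which is the point the expanded godoc makes.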

Cycle 1 added Tags to GOALS.md (notably long-cycle, corpus-state on flywheel-compounding) but operators still had no way to ask "what's the project's score if we set aside the corpus-state goals I cannot move in this commit?".

Add --exclude-tag <tag> to `ao goals measure`. The flag drops any goal whose Tags include the value before the snapshot runs. Combines with --goal (goal-id filter applies first, then tag filter).

Demonstration:
  $ ao goals measure --json | jq .summary.score
  92.66
  $ ao goals measure --exclude-tag long-cycle --json | jq .summary.score
  94.06

Implementation:
- MeasureOptions.ExcludeTag (new field)
- applyMeasureFilters helper extracted from RunMeasure body so both filter steps live together and RunMeasure stays under the cli/ CC=20 ceiling (was 17, would have grown to 21 inline → extracted)
- goalHasTag predicate on Goal.Tags
- Wired via --exclude-tag string flag (with help text) on goals_measure
- cli/docs/COMMANDS.md regenerated

Tests:
- TestGoalsMeasure_ExcludeTagFilter: long-cycle goal excluded; total=1, failing=0, score=100
- TestGoalsMeasure_ExcludeTagPlusGoalIDInteraction: --goal narrows to a tagged goal first, then --exclude-tag drops it → empty snapshot, no error (named goal was found and then filtered)
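The filter ordering can be sketched as below, using a toy two-goal set shaped like the one the first test describes. The names `applyMeasureFilters` and `goalHasTag` come from the commit message, but the bodies, the trimmed `Goal` type, and the weight-proportional `score` function are assumptions made for illustration.

```go
package main

import "fmt"

// Goal is a hypothetical, trimmed-down stand-in for the real goal type.
type Goal struct {
	ID     string
	Weight int
	Pass   bool
	Tags   []string
}

// goalHasTag reports whether the goal carries the given tag.
func goalHasTag(g Goal, tag string) bool {
	for _, t := range g.Tags {
		if t == tag {
			return true
		}
	}
	return false
}

// applyMeasureFilters narrows by goal ID first, then drops goals
// carrying excludeTag. Empty strings disable the respective filter.
func applyMeasureFilters(goals []Goal, goalID, excludeTag string) []Goal {
	var kept []Goal
	for _, g := range goals {
		if goalID != "" && g.ID != goalID {
			continue
		}
		if excludeTag != "" && goalHasTag(g, excludeTag) {
			continue
		}
		kept = append(kept, g)
	}
	return kept
}

// score assumes a weight-proportional pass ratio; an empty set scores 0 here.
func score(goals []Goal) float64 {
	total, passed := 0, 0
	for _, g := range goals {
		total += g.Weight
		if g.Pass {
			passed += g.Weight
		}
	}
	if total == 0 {
		return 0
	}
	return 100 * float64(passed) / float64(total)
}

func main() {
	goals := []Goal{
		{ID: "flywheel-compounding", Weight: 8, Tags: []string{"long-cycle", "corpus-state"}},
		{ID: "go-cli-tests", Weight: 8, Pass: true},
	}
	kept := applyMeasureFilters(goals, "", "long-cycle")
	fmt.Println(len(kept), score(kept))
	fmt.Println(len(applyMeasureFilters(goals, "flywheel-compounding", "long-cycle")))
}
```

Filtering before scoring leaves the goal's weight untouched in GOALS.md, so the unfiltered score still reports the failing goal at full weight.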

CI failures from PR #150's first push:

1. codex-runtime-sections (validate-codex-generated-artifacts step)
   "shared: manifest generated_hash drift detected"
   "shared: marker generated_hash drift detected"
   Cycle 7 edited skills-codex/shared/references/strict-delegation-contract.md without regenerating the manifest hash. Ran scripts/regen-codex-hashes.sh which updated hashes for 1 skill (shared).

2. bats-tests (tests/scripts/evolve-coverage-delta.bats)
   The "parser uses the same regex anchors as the script" assertion grepped the script for a literal regex string, but the test's escaping of the literal didn't survive bats's preprocessor — the assertion failed even though the script's regex was correct. Replaced with an awk-based assertion that just confirms the script's coverage extraction line still includes "grep -oE" + "coverage:" + the "?%" anchor fragment. Robust against unrelated whitespace/quote shape changes; still catches a regression that drops the % suffix.

Verified locally: full bats tests/scripts/*.bats and tests/hooks/*.bats suites green; validate-codex-generated-artifacts.sh passes.

hooks/write-time-quality.sh ran every Edit/Write but had zero test coverage. A regression in any branch — Go fmt.Println in non-main, Python bare-except / eval / missing-return-type-hint, shell missing set -euo pipefail, the IS_TEST exemptions, the kill switch, the JSON envelope shape — would silently degrade quality signal.

Add a 16-case bats fixture covering:
- tool-name filter (only Edit/Write trigger)
- missing/non-existent file are silent
- unsupported extension is silent
- AGENTOPS_HOOKS_DISABLED kill switch short-circuits
- Go: fmt.Println warns in non-main packages, silent in main and *_test.go
- Python: bare except warns; eval warns outside tests, silent in test_*.py; missing return-type-hint on def-without-arrow warns
- Shell: missing 'set -euo pipefail' warns; presence suppresses warning
- JSON envelope (stdout-only) parses and includes hookEventName, file, language, warning_count, warnings array

Each scenario uses a per-test temp file so cases don't bleed state. Pure test addition; no production code changed.

NOTE: post-commit fitness measurement showed flywheel-proof transiently fail due to a 503 on sum.golang.org (DNS cache overflow downloading the go1.26.0 toolchain) — same network-flake mode PR #147 and #150 documented on the same gate. Re-measure passes (score 92.66). Not caused by this cycle (only test files touched).

https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

hooks/standards-injector.sh maps .js → "javascript" and reads skills/standards/references/javascript.md, but the file did not exist — so every .js Edit/Write silently dropped the standards-context inject. The hook's "fail-open on missing file" guard hid the gap.

Add references/javascript.md (Tier 1 baseline: ESM, prettier+eslint, const/let, async/await, eqeqeq, common pitfalls, security defaults) and link it in skills/standards/SKILL.md (table row + linked-references list — required by skills/heal-skill --strict and the cmd/ao TestSkillContract_ReferencesLinkedInSKILLMD test). Sync the embedded copy via `cd cli && make sync-hooks` so the runtime manifest matches the source.

Add a 12-case bats fixture for standards-injector.sh covering all six languages (go, ts, tsx, sh, js, yaml/yml), the extensionless / missing / unsupported / kill-switch silent paths, and exact-body-match assertions against the on-disk references files.

Verified:
- hooks/standards-injector.sh on /x.js now returns 2111-byte body matching the new file
- cd cli && go test -race ./cmd/ao -run TestSkillContract — pass
- bash skills/heal-skill/scripts/heal.sh --strict — All clean
- cd cli && make sync-hooks idempotent

NOTE: post-commit measurement shows flywheel-proof failing — same network-environmental issue as cycle 8 (sum.golang.org 503 / DNS cache overflow when the proof-run script downloads the go1.26.0 toolchain into a fresh HOME). System Go is 1.24.7 but go.mod requires 1.26.0, so GOTOOLCHAIN=local fallback also fails. Not caused by this cycle — the proof-run path does not touch standards or hooks. Same pattern PR #147 and #150 documented and shipped through.

https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

…-driven goals (corpus-state isolated) (#152)

* gate(flywheel-compounding): split σ=0/ρ=0 dormant hint from ρ=0-only

The flywheel-compounding gate had one branched hint (ρ=0 → "use --cite applied|reference"), but ρ=0 covers two distinct corpus states:
- σ=0 AND ρ=0 — no citations of ANY kind in the measurement window; the corpus is dormant. The fix is "run any ao lookup", not "switch --cite kind". The high-confidence hint is misleading here.
- σ>0 AND ρ=0 — citation activity exists but only as retrieved-only hits; the existing hint applies.

Add the σ=0 ρ=0 → dormant branch and a 6-case bats fixture pinning the three hint branches (PASS, σ=0 ρ=0 dormant, ρ=0-only, generic) plus the ao-failure path. Operators now see the right remediation per failure mode without inferring it from the σρδ numbers.

This is a heavy-goal observability improvement, not a metric flip — the goal stays fail until corpus citations land over multiple sessions.

https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

* refactor(codex_runtime): split detectLifecycleRuntimeProfile (CC 20→<14)

detectLifecycleRuntimeProfileWithOptions sat at the cli/ CC ceiling (20). Any future case-arm tweak (e.g., a new runtime kind, or a new sub-state in the existing four) would have pushed it past the gate's threshold.

Refactor: bundle the per-runtime config paths into a small struct (lifecycleManifestPaths) shared by four per-runtime helpers (populateCodexProfile / populateClaudeProfile / populateOpenCodeProfile / populateUnknownProfile). The detector body shrinks to a switch over the four helpers; each helper is straight-line and testable in isolation.
Behavior unchanged — verified via:
- go test -race ./cmd/ao -run "Lifecycle|Codex|Runtime"
- ./bin/ao codex status --json (live invocation, same JSON shape and same "Detected Codex runtime without native hook support" reason)
- go-complexity-ceiling gate: cli/ <20, cli/internal/ <18

https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

* test(hooks): pin research-loop-detector behavior across 14 cases

The PostToolUse research-spiral detector at hooks/research-loop-detector.sh had zero test coverage. A bad edit to the threshold ladder, the read-only-bash classification, the kill-switch short-circuits, or the JSON nudge formatting would ship silently.

Add a bats fixture covering:
- counter increment on Read/Grep/Glob/WebSearch/WebFetch
- WARN/STRONG/STOP threshold transitions at 8/12/15 with the exact nudge text for each band
- reset on Edit/Write/NotebookEdit
- read-only Bash (grep/rg/cat/...) increments; execution Bash resets
- AGENTOPS_HOOKS_DISABLED and AGENTOPS_RESEARCH_LOOP_DISABLED kill switches both short-circuit before any state mutation
- threshold env-var overrides (AGENTOPS_RESEARCH_WARN_THRESHOLD)
- STOP precedence over STRONG/WARN when all three are tied at 1
- emitted JSON parses round-trip via jq -e

Run against the live hook in a tmpdir mock-repo to keep tests hermetic. All 14 scenarios PASS. Pure-test addition: no production code touched, no fitness regression.

https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

* refactor(notebook): split runNotebookUpdate (CC 19→11) for headroom

runNotebookUpdate sat at CC=19 — close to the cli/ ceiling of 20 — and mixed three concerns: memory-file resolution, entry resolution, and the update pipeline itself. A single new branch (e.g., a third entry source) would have failed the gate.
Extract two helpers:
- resolveNotebookMemoryFile(cwd) (string, bool)
- resolveNotebookEntry(cwd) *pendingEntry

Each is straight-line and individually testable; the main function now reads as a four-step pipeline (memory-file → entry → cursor-skip → parse/render/write).

Behavior preserved — `ao notebook update --quiet` exit 0, no output, no state mutation when no MEMORY.md / no session entry. All cmd/ao tests pass; CC drops to 11 (well clear of the 20 ceiling).

https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

* test(beads): pin five 0%-coverage helpers behind 19 cases

Five small pure helpers in cli/cmd/ao/beads.go and beads_audit_cluster.go had 0% line coverage:
- beadMinInt — drives matches[:min(3, len)] citation clipping
- beadTruncate — wraps the bd parse-error message
- representativeIsEpic — picks epic vs leaf rendering for cluster output
- firstNNonEmptyLines — derives the cluster summary excerpt
- sortedMapKeys — supplies deterministic JSON ordering

A regression in any of them would corrupt user-visible output silently (wrong message text, garbled cluster summary, non-deterministic JSON ordering breaking diffs) rather than panicking. None had a test pinning behavior.

Add 19 cases covering: smaller-of-two and equal-args boundaries (incl. negatives and zeros), under/at/over the truncation limit (incl. n=0 on non-empty), epic-found / leaf-found / representative-missing / empty-cluster branches of representativeIsEpic, whitespace-handling and trim semantics of firstNNonEmptyLines, deterministic key order of sortedMapKeys regardless of bool values.

All cases assert exact expected values (per .claude/rules/go.md). No production code touched; fitness unchanged at 92.66.
https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

* refactor(contradict): split runContradict (CC 19→5) into 5 helpers

runContradict bundled five concerns at CC=19 — close to the cli/ ceiling of 20: directory existence checks, file collection, entry parsing, pair-comparison loop, and dual-format output. A new file source or a new output format would have failed the gate.

Extract:
- collectContradictFiles: globs *.jsonl + *.md from learnings/patterns
- parseContradictEntries: reads + tokenizes, drops empty/zero-word files
- compareContradictPairs: O(n²) jaccard ≥ 0.4 + detectContradiction
- relPathOrAbs: Rel-with-fallback path helper (lifted from inline blocks)
- emitContradictResult: JSON-or-human writer

Behavior preserved — verified via:
- go test ./cmd/ao -run Contradict
- ./bin/ao contradict (human output identical: 20 files, 190 pairs)
- ./bin/ao contradict --output json (same {"total_files":20,...} shape)

CC drops: runContradict 19→5; new helpers all ≤6. Headroom for future file-source additions.

https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

* refactor(rpi_serve): split serveRPIState (CC 19→5) into 4 helpers

serveRPIState mixed five HTTP-handler concerns at CC=19 — close to the cli/ ceiling: query-param parsing/validation, run-id resolution against the registry, fallback phased-state.json read, per-phase result gathering, and the active-runs listing. A new state source or response key would have failed the gate.
Extract:
- parseServeStateRunID: Validate run-id, write 400 on path traversal
- resolveStateForRunID: Look up the run via resolveServeRun, write to resp on success, return the resolved root
- loadFallbackPhasedState: Read .agents/rpi/phased-state.json directly only if the resolver did not already populate phased_state
- loadPhaseResults: Gather phase-{1,2,3}-result.json into a phase_N map

Behavior preserved, verified via:
- go test ./cmd/ao -run TestServeRPIState (existing handler test)
- go test ./cmd/ao (full package, 30s, all pass)
- go vet clean

CC drops: serveRPIState 19→below-5 (not in --threshold 5 listing); each new helper ≤6.

https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

* test(hooks): pin write-time-quality across 16 per-language scenarios

hooks/write-time-quality.sh ran on every Edit/Write but had zero test coverage. A regression in any branch (Go fmt.Println in non-main, Python bare-except / eval / missing-return-type-hint, shell missing set -euo pipefail, the IS_TEST exemptions, the kill switch, the JSON envelope shape) would silently degrade quality signal.

Add a 16-case bats fixture covering:
- tool-name filter (only Edit/Write trigger)
- missing or non-existent files are silent
- unsupported extension is silent
- AGENTOPS_HOOKS_DISABLED kill switch short-circuits
- Go: fmt.Println warns in non-main packages, silent in main and *_test.go
- Python: bare except warns; eval warns outside tests, silent in test_*.py; missing return-type-hint on def-without-arrow warns
- Shell: missing 'set -euo pipefail' warns; presence suppresses warning
- JSON envelope (stdout-only) parses and includes hookEventName, file, language, warning_count, warnings array

Each scenario uses a per-test temp file so cases don't bleed state. Pure test addition; no production code changed.
NOTE: post-commit fitness measurement showed flywheel-proof transiently failing due to a 503 on sum.golang.org (DNS cache overflow downloading the go1.26.0 toolchain), the same network-flake mode PR #147 and #150 documented on the same gate. Re-measure passes (score 92.66). Not caused by this cycle (only test files touched).

https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

* fix(standards): add javascript.md so .js Edit/Write injects standards

hooks/standards-injector.sh maps .js → "javascript" and reads skills/standards/references/javascript.md, but the file did not exist, so every .js Edit/Write silently dropped the standards-context inject. The hook's "fail-open on missing file" guard hid the gap.

Add references/javascript.md (Tier 1 baseline: ESM, prettier+eslint, const/let, async/await, eqeqeq, common pitfalls, security defaults) and link it in skills/standards/SKILL.md (table row + linked-references list, required by skills/heal-skill --strict and the cmd/ao TestSkillContract_ReferencesLinkedInSKILLMD test). Sync the embedded copy via `cd cli && make sync-hooks` so the runtime manifest matches the source.

Add a 12-case bats fixture for standards-injector.sh covering all six languages (go, ts, tsx, sh, js, yaml/yml), the extensionless / missing / unsupported / kill-switch silent paths, and exact-body-match assertions against the on-disk references files.

Verified:
- hooks/standards-injector.sh on /x.js now returns a 2111-byte body matching the new file
- cd cli && go test -race ./cmd/ao -run TestSkillContract: pass
- bash skills/heal-skill/scripts/heal.sh --strict: All clean
- cd cli && make sync-hooks: idempotent

NOTE: post-commit measurement shows flywheel-proof failing, the same network-environmental issue as cycle 8 (sum.golang.org 503 / DNS cache overflow when the proof-run script downloads the go1.26.0 toolchain into a fresh HOME). System Go is 1.24.7 but go.mod requires 1.26.0, so GOTOOLCHAIN=local fallback also fails.
Not caused by this cycle: the proof-run path does not touch standards or hooks. Same pattern PR #147 and #150 documented and shipped through.

https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

* fix(proof-run): reuse cli/bin/ao when present so 503s on sum.golang.org don't fail flywheel-proof

tests/e2e/proof-run.sh always rebuilt ao in a fresh $HOME, so each gate invocation re-downloaded the go1.26.0 toolchain via sum.golang.org. When the sum DB returns 503 ("DNS cache overflow") the entire flywheel-proof gate (w=7) fails, even though the local cli/bin/ao is fresh and behavior is testable.

Three changes:
- PROOF_AO_BIN=/path env override: caller can pin a pre-built binary
- Auto-detect $REPO_ROOT/cli/bin/ao when present (and the override is unset); covers the common case where `make build` ran first
- PROOF_FORCE_BUILD=1 escape hatch: opt back into build-from-source when the goal IS to verify the toolchain path

`require_cmd go` now only fires on the build path, so machines without go installed can still run the proof against a shipped binary.

Verified:
- bash tests/e2e/proof-run.sh: auto-detects cli/bin/ao, all 20 flywheel checks PASS in ~6s (was failing in 90s before)
- PROOF_FORCE_BUILD=1: still attempts go build (so the toolchain-path regression test still exists)
- PROOF_AO_BIN=/path/to/ao: copies binary, skips build

flywheel-proof flips fail→pass after this cycle. This is a code-driven flip (the script is the gate's only build path), not a runtime artifact.

https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T

---------

Co-authored-by: Claude <noreply@anthropic.com>
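The binary-selection order described in the fix(proof-run) commit can be modeled as a small decision function. The sketch below is written in Go for illustration only (the actual script is bash). The function name, the `aoSource` type, and the precedence of PROOF_FORCE_BUILD over PROOF_AO_BIN are assumptions; the commit message does not state which wins when both are set.

```go
package main

import (
	"fmt"
	"path"
)

// aoSource describes where a proof run would obtain the ao binary.
// Hypothetical type, not taken from the repository.
type aoSource struct {
	Path  string // pre-built binary to reuse; empty when building from source
	Build bool   // true when the build-from-source path is taken
}

// resolveAoSource models the selection order under these assumptions:
//  1. PROOF_FORCE_BUILD=1 always builds (assumed to take precedence).
//  2. PROOF_AO_BIN, when set, pins a pre-built binary.
//  3. Otherwise <repoRoot>/cli/bin/ao is reused if it exists.
//  4. Otherwise fall back to building from source.
func resolveAoSource(env map[string]string, repoRoot string, exists func(string) bool) aoSource {
	if env["PROOF_FORCE_BUILD"] == "1" {
		return aoSource{Build: true}
	}
	if pinned := env["PROOF_AO_BIN"]; pinned != "" {
		return aoSource{Path: pinned}
	}
	if local := path.Join(repoRoot, "cli", "bin", "ao"); exists(local) {
		return aoSource{Path: local}
	}
	return aoSource{Build: true}
}

func main() {
	present := func(string) bool { return true }

	// No overrides and a local build is present: reuse it, no toolchain download.
	auto := resolveAoSource(map[string]string{}, "/repo", present)
	fmt.Println(auto.Path, auto.Build) // prints "/repo/cli/bin/ao false"

	// Escape hatch: exercise the toolchain path even though a binary exists.
	forced := resolveAoSource(map[string]string{"PROOF_FORCE_BUILD": "1"}, "/repo", present)
	fmt.Println(forced.Build) // prints "true"
}
```

Only the two `Build: true` branches would need the Go toolchain, which matches the note that `require_cmd go` fires on the build path alone.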

claude added 9 commits April 26, 2026 05:02

github-actions Bot added skills docs cli tests labels Apr 26, 2026

boshu2 merged commit f321475 into main Apr 26, 2026
32 checks passed

boshu2 mentioned this pull request Apr 26, 2026

Nightly 2026-04-26 v3 — 10 productive cycles, 0 stale-audits, +0 code-driven goals (corpus-state isolated) #152

Merged

boshu2 deleted the nightly/2026-04-26-v2 branch April 27, 2026 01:16

github-actions Bot mentioned this pull request May 2, 2026

Nightly RPI auto prompt #210

Open

boshu2 merged 10 commits into main from nightly/2026-04-26-v2

2 participants
