feat: evolution mutex, failure ceiling, and main-agent memory path by mcheemaa · Pull Request #60 · ghostwright/phantom

mcheemaa · 2026-04-14T20:24:24Z

Summary

Phase 0 of the evolution rethink. Two independent changes that land together because they close different sides of the same safety issue.

Part A: evolution safety floor

Process-wide mutex around EvolutionEngine.afterSession. A second call that arrives while a cycle is in flight logs a skip and returns without spawning any judge subprocess. This removes the stacking-cycles failure mode where multi-turn sessions fired overlapping evolution cycles under load.
Cycle-local judge failure ceiling inside validateAllWithJudges. On the second judge subprocess error within one cycle, the remaining deltas are dropped and the cycle aborts cleanly via a CycleAborted exception. This prevents a single failing environment from fanning out 50+ subprocess spawns per cycle.
Partial cost capture on subprocess failure. runJudgeQuery now throws JudgeSubprocessError carrying whatever token and cost numbers were visible on the stream before the subprocess died. Makes SIGKILL-era API spend at least partially observable in the log.
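
The first two guards above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SingleFlightEngine`, `validateWithCeiling`, `FAILURE_CEILING`, the `Delta` and `Verdict` shapes, and the log text are all hypothetical names. It assumes one engine instance per process and validates deltas sequentially, whereas the real validation may run judges in parallel.

```typescript
// Hypothetical sketch of the two Part A guards; not the code from this PR.
type Delta = { id: string };
type Verdict = { deltaId: string; approved: boolean };

class CycleAborted extends Error {
	partialResults: Verdict[];
	constructor(partialResults: Verdict[]) {
		super("judge failure ceiling reached");
		// Keep instanceof working when compiled to older targets.
		Object.setPrototypeOf(this, new.target.prototype);
		this.partialResults = partialResults;
	}
}

// Assumed constant: abort on the second judge error within one cycle.
const FAILURE_CEILING = 2;

// Validates deltas one at a time and aborts once the ceiling is reached,
// handing back whatever verdicts were already complete.
async function validateWithCeiling(
	deltas: Delta[],
	judge: (delta: Delta) => Promise<boolean>,
): Promise<Verdict[]> {
	const results: Verdict[] = [];
	let failures = 0;
	for (const delta of deltas) {
		try {
			results.push({ deltaId: delta.id, approved: await judge(delta) });
		} catch {
			failures += 1;
			if (failures >= FAILURE_CEILING) throw new CycleAborted(results);
		}
	}
	return results;
}

class SingleFlightEngine {
	private cycleInFlight = false;
	private runCycle: () => Promise<void>;
	skipCount = 0;

	constructor(runCycle: () => Promise<void>) {
		this.runCycle = runCycle;
	}

	// Resolves to true if this call ran a cycle, false if it was skipped.
	async afterSession(): Promise<boolean> {
		if (this.cycleInFlight) {
			this.skipCount += 1;
			console.log(`[evolution] cycle in flight, skipping (skips=${this.skipCount})`);
			return false;
		}
		this.cycleInFlight = true;
		try {
			await this.runCycle();
			return true;
		} finally {
			// Release even if the cycle throws, so one failure cannot wedge the engine.
			this.cycleInFlight = false;
		}
	}
}
```

The flag is set before the first `await`, so a second call that arrives while the cycle is pending sees it and returns without doing any work.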

Part B: main-agent memory path

New canonical agent notes file at phantom-config/memory/agent-notes.md, committed with a short append-only header.
New prompt block agent-memory-instructions.ts that teaches the agent to append short dated entries to that file via the Write or Edit tool. Specific about what to write, what not to write, how to format, and when. Under 400 tokens.
Wired into prompt-assembler.ts after the evolved config block so the agent knows what is already managed for it and what lives in its own notes.

The agent notes file is intentionally NOT injected into the system prompt as read-only context. The agent reads its own writes with the Read tool when it needs them, which avoids a feedback loop that would re-present the agent's own past entries as canonical context on every query.
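
A minimal sketch of the ordering described above, assuming a simple list-of-blocks assembler. `assemblePrompt`, `PromptInputs`, and the placeholder instruction text are hypothetical and do not reproduce prompt-assembler.ts or agent-memory-instructions.ts; only the file path comes from this PR.

```typescript
// Hypothetical sketch of the block ordering; not the real prompt-assembler.ts.
const AGENT_NOTES_PATH = "phantom-config/memory/agent-notes.md";

// Placeholder text; the real instruction block is not reproduced here.
const AGENT_MEMORY_INSTRUCTIONS =
	`Append short dated entries to ${AGENT_NOTES_PATH}. ` +
	"Use the Read tool when you need to recall them.";

interface PromptInputs {
	basePrompt: string;
	evolvedConfig: string;
}

function assemblePrompt(inputs: PromptInputs): string {
	// The contents of the notes file are deliberately not an input here,
	// so past entries are never replayed as system context.
	return [inputs.basePrompt, inputs.evolvedConfig, AGENT_MEMORY_INSTRUCTIONS].join("\n\n");
}
```

Only the path and the instructions reach the prompt; the entries themselves stay on disk until the agent reads them.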

What this does not ship

Everything else in the evolution rethink is deferred to later phases:

Haiku conditional-firing gate
SQLite-backed evolution queue and cadence batching
Reduction from 6 judges to 3
Minority veto removal
Rewrite of buildCritiqueFromObservations in reflection.ts
Deletion of safety-judge.ts, regression-judge.ts, quality-judge.ts
Golden suite deletion
Auto-rollback removal

Phase 0's job is to keep the process alive and to light up the main-agent memory path. Nothing more.

Test plan

bun test: 1331 pass / 10 skip / 0 fail, +11 new tests vs main
bun run lint: clean
bun run typecheck: clean
Soak on a single instance for 24 hours, confirm no fork-bomb pattern and that the evolution mutex skip log appears at least once under real traffic
Within 48 hours of deploy, confirm memory_file_audit_log shows at least one agent-originated write to phantom-config/memory/agent-notes.md. A zero count indicates the prompt block is not landing and should be investigated before further phases ship

Phase 0 of the evolution rethink. Two independent changes that land together because they close different sides of the same safety issue.

Part A: evolution safety floor
- Process-wide mutex around EvolutionEngine.afterSession. A second call that arrives while a cycle is in flight logs a skip and returns without spawning any judge subprocess.
- Cycle-local judge failure ceiling inside validateAllWithJudges. On the second judge subprocess error within one cycle, the remaining deltas are dropped and the cycle aborts cleanly via a CycleAborted exception that the engine catches and records as partial spend.
- Partial cost capture on subprocess failure. runJudgeQuery now throws JudgeSubprocessError carrying whatever token and cost numbers were visible on the stream before the subprocess died.

Part B: main-agent memory path
- New canonical agent notes file at phantom-config/memory/agent-notes.md, committed with a short append-only header.
- New prompt block agent-memory-instructions.ts that teaches the main agent to append short dated entries to that file via the Write or Edit tool, with specific rules for what to write, what not to write, how to format, and when. Target under 400 tokens.
- Wired into prompt-assembler.ts after the evolved config block so the agent knows what is already managed for it by the evolution engine and what lives in its own notes.

The agent notes file is intentionally NOT injected into the system prompt as read-only context. The agent reads its own writes with the Read tool when it needs them, which avoids a feedback loop that would re-present the agent's own past entries as canonical context on every query.

Tests: 1331 pass / 10 skip / 0 fail (+11 new tests vs main). Lint and typecheck clean.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 463deab205

chatgpt-codex-connector · 2026-04-14T20:28:07Z

src/evolution/validation.ts

+		if (error instanceof JudgeSubprocessError) {
+			const p = error.partialCost;
+			console.warn(
+				`[evolution] judge subprocess died mid-flight: ${msg} ` +
+					`(partial: in=${p.inputTokens} out=${p.outputTokens} cost=$${p.costUsd.toFixed(4)} model=${p.model})`,
+			);


Accumulate partial judge cost on subprocess failures

recordJudgeFailure logs JudgeSubprocessError.partialCost but never adds those tokens/USD into judgeCosts. Since EvolutionEngine uses partialJudgeCosts to update metrics.json and the daily cap, any SIGKILL/OOM judge failures are effectively free in accounting, which underreports spend and can let additional judge subprocesses run after the configured cost ceiling should have been reached.
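
One way the suggested accumulation could look, as a hedged sketch rather than the actual fix. The `partialCost` field names follow the diff above; the `JudgeCosts` shape, the per-gate keying, the `totalCostUsd` helper, and the gate name used in the usage note are assumptions for illustration.

```typescript
// Hypothetical sketch; field names follow the diff above, the rest is assumed.
interface PartialCost {
	inputTokens: number;
	outputTokens: number;
	costUsd: number;
	model: string;
}

class JudgeSubprocessError extends Error {
	partialCost: PartialCost;
	constructor(message: string, partialCost: PartialCost) {
		super(message);
		// Keep instanceof working when compiled to older targets.
		Object.setPrototypeOf(this, new.target.prototype);
		this.partialCost = partialCost;
	}
}

interface GateCost {
	inputTokens: number;
	outputTokens: number;
	costUsd: number;
}
type JudgeCosts = Record<string, GateCost>;

// Adds the usage seen before the subprocess died to the gate's running total.
function recordJudgeFailure(judgeCosts: JudgeCosts, gate: string, error: unknown): void {
	if (!(error instanceof JudgeSubprocessError)) return; // no usage to recover
	const p = error.partialCost;
	const bucket = judgeCosts[gate] ?? { inputTokens: 0, outputTokens: 0, costUsd: 0 };
	bucket.inputTokens += p.inputTokens;
	bucket.outputTokens += p.outputTokens;
	bucket.costUsd += p.costUsd;
	judgeCosts[gate] = bucket;
}

// Total spend as a cost cap would see it, including dead subprocess spend.
function totalCostUsd(judgeCosts: JudgeCosts): number {
	let total = 0;
	for (const gate of Object.keys(judgeCosts)) {
		const bucket = judgeCosts[gate];
		if (bucket) total += bucket.costUsd;
	}
	return total;
}
```

Calling `recordJudgeFailure(costs, "safety", err)` for each failure means a later cap check against `totalCostUsd(costs)` counts spend from killed subprocesses as well as from successful judge calls.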

chatgpt-codex-connector · 2026-04-14T20:28:07Z

src/evolution/engine.ts

+					if (this.llmJudgesEnabled) {
+						this.recordJudgeCosts(judgeCosts);
+					}
+					return this.skippedResult();


Apply partial validation results when cycle aborts

When validateAllWithJudges throws CycleAborted, the catch path returns skippedResult() immediately and ignores error.partialResults. That drops deltas that were already fully validated before the failure ceiling was hit, so a late second failure causes the entire cycle to be discarded rather than only the remaining deltas, which can stall evolution during intermittent judge instability.
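
A rough sketch of the suggested behaviour, assuming `CycleAborted` carries a `partialResults` array of already-finished verdicts. `runValidationStep`, `CycleOutcome`, and the `Verdict` shape are hypothetical, and the example is synchronous for brevity although the real validation is async.

```typescript
// Hypothetical sketch; synchronous for brevity, not the real engine.ts.
type Verdict = { deltaId: string; approved: boolean };

class CycleAborted extends Error {
	partialResults: Verdict[];
	constructor(partialResults: Verdict[]) {
		super("judge failure ceiling reached");
		// Keep instanceof working when compiled to older targets.
		Object.setPrototypeOf(this, new.target.prototype);
		this.partialResults = partialResults;
	}
}

interface CycleOutcome {
	appliedIds: string[];
	aborted: boolean;
}

// Runs validation and applies approved deltas, including those that were
// fully validated before an abort; only the remaining deltas are dropped.
function runValidationStep(validateAll: () => Verdict[]): CycleOutcome {
	let verdicts: Verdict[] = [];
	let aborted = false;
	try {
		verdicts = validateAll();
	} catch (error) {
		if (!(error instanceof CycleAborted)) throw error; // unrelated failures still propagate
		verdicts = error.partialResults;
		aborted = true;
	}
	const appliedIds = verdicts.filter((v) => v.approved).map((v) => v.deltaId);
	return { appliedIds, aborted };
}
```

The `aborted` flag lets the caller skip later steps on an unhealthy environment while still keeping the work that finished.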

Resolves the 9 items on the independent reviewer's punch list for the phase 0 safety and agent-memory PR plus an operator-requested dashboard surfacing for agent-notes.md.

CRIT fixes:
- docker-entrypoint.sh seeds agent-notes.md on every container start so upgrade deploys land the committed baseline in the live volume even when the first-run bootstrap skip fires. Chowns the tree to 999:999.
- validation.ts recordJudgeFailure takes a gate argument and accumulates JudgeSubprocessError.partialCost tokens and USD into judgeCosts[gate] so SIGKILL-era dead spend is visible to the daily cost cap and the metrics.json persistence.
- engine.ts CycleAborted catch applies error.partialResults through the normal applyApproved path instead of short-circuiting to skippedResult, so deltas that cleared all 5 gates before the failure ceiling tripped are no longer dropped on the floor. Skips steps 6/7/8 on the aborted path because the environment is known unhealthy.

MAJOR fixes:
- validation.ts has a block comment above recordJudgeFailure explaining wrapper-vs-subprocess failure counting, the Promise.all sibling semantics, the true subprocess upper bound, and the phase 3 multiJudge closure plan.
- mutex-and-retry-ceiling.test.ts adds a real thrown-cycle mutex test that monkey-patches runCycle to throw synchronously, a partial-cost accumulation test for C2, an end-to-end partial-apply test for C3, a skip-log format test for M5, and a session-counter undercount test for M4.
- judge-query.ts exports __absorbUsageForTest and judge-query.test.ts has three new tests exercising the streaming usage absorber against assistant BetaMessage usage, top-level usage-bearing messages, and missing-result stream ends.
- engine.ts moves updateAfterSession to the top of afterSession so mutex-skipped sessions still bump dashboard counters. Accepts the small normal-path double-count per the reviewer's tradeoff, since phase 2 replaces the drop-on-floor mutex with a real queue.
- engine.ts tracks activeCycleSessionId and activeCycleSkipCount and includes both in the mutex skip log line so operators can pair a skip to its cause and watch the skip count climb in a tight burst.

MINOR:
- agent-memory-instructions.ts clarifies that Write creates the file on first use and Edit is for every subsequent append, so the agent cannot read "Write tool to create" as license to overwrite.

Dashboard visibility for agent-notes.md:
- memory-files storage adds a second read-only root for phantom-config/memory/ with an explicit allow-list (currently just agent-notes.md) exposed through the existing /ui/api/memory-files endpoint. Writes and deletes on the virtual prefix return 400/422 so manual edits cannot race the agent's own appends.
- Dashboard memory-files.js renders the read-only flag with hidden Save and Delete buttons, a description banner, the agent-notes description text, and a new phantom-config group in the list.

Tests: 1342 pass (+11), 10 skip, 0 fail. lint clean. typecheck clean.
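
The read-only root with an allow-list could be sketched like this. It is an assumption-laden illustration, not the memory-files storage code: `checkMemoryFileAccess`, the `Decision` values, and the constant names are hypothetical, and the mapping from a rejected decision to a specific HTTP status is left to the caller.

```typescript
// Hypothetical sketch; names and return values are illustrative only.
const READ_ONLY_PREFIX = "phantom-config/memory/";
const READ_ONLY_ALLOW_LIST: ReadonlySet<string> = new Set(["agent-notes.md"]);

type Operation = "read" | "write" | "delete";
type Decision = "allow" | "reject-read-only" | "not-listed" | "delegate";

function checkMemoryFileAccess(path: string, op: Operation): Decision {
	// Paths outside the virtual prefix belong to the normal writable root.
	if (!path.startsWith(READ_ONLY_PREFIX)) return "delegate";
	const name = path.slice(READ_ONLY_PREFIX.length);
	// Only explicitly listed files are exposed; this also rejects nested
	// or traversal-style names such as "../secrets".
	if (!READ_ONLY_ALLOW_LIST.has(name)) return "not-listed";
	// Listed files can be read but never written or deleted from the dashboard.
	return op === "read" ? "allow" : "reject-read-only";
}
```

Keeping the allow-list explicit means a new file under phantom-config/memory/ stays hidden until someone opts it in.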

chatgpt-codex-connector bot reviewed Apr 14, 2026

mcheemaa merged commit dba09be into main Apr 14, 2026
1 check passed

mcheemaa merged 2 commits into main from feat/evolution-phase-0-safety-and-agent-memory
