
fix: phase 1+2 reviewer findings (C1 runtime refresh, C2 failed-row preservation, C3 gate preset, M1 single-count, M2 integration test, M3 dedup)#63

Merged
mcheemaa merged 2 commits into main from fix/evolution-phase-1-and-2-review
Apr 15, 2026

Conversation

@mcheemaa
Member

Summary

Follow-up PR that closes the combined Codex + independent reviewer punch list against PR #61 (Phase 1 conditional firing gate + Phase 2 SQLite cadence batching). Three CRITs, three MAJORs, all shipped. Optional MINOR items deferred.

CRITs

C1. Wire runtime refresh after cadence drain

The queued path in enqueueIfWorthy returns no inlineResult, so the .then callback in src/index.ts never called runtime.setEvolvedConfig after a drain. When the cadence drained a batch and applied changes to disk, AgentRuntime's in-memory evolvedConfig snapshot stayed at its boot-time state until process restart. Effective result: the queued path silently disabled live evolution.

Fix: EvolutionEngine now owns an onConfigApplied callback, set via a new setOnConfigApplied method next to setQueueWiring. It fires from both the normal apply path in runCycle and the CycleAborted partial-apply recovery, whenever applied.length > 0. Wired in src/index.ts to call runtime.setEvolvedConfig(evolution.getConfig()) on every fire. notifyConfigApplied swallows callback errors so a telemetry/refresh failure cannot wedge the pipeline.
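The callback plumbing can be sketched as follows; the class internals here are a minimal assumption built around the method names the PR describes (setOnConfigApplied, notifyConfigApplied), not the real engine:

```typescript
// Minimal sketch of the C1 callback wiring; everything except the
// method names from the PR description is an assumption.
type ConfigAppliedCallback = () => void;

class EvolutionEngine {
  private onConfigApplied?: ConfigAppliedCallback;

  // Set next to setQueueWiring in src/index.ts, e.g.
  //   evolution.setOnConfigApplied(() =>
  //     runtime.setEvolvedConfig(evolution.getConfig()));
  setOnConfigApplied(cb: ConfigAppliedCallback): void {
    this.onConfigApplied = cb;
  }

  // Called from both the normal apply path in runCycle and the
  // CycleAborted partial-apply recovery, whenever applied.length > 0.
  notifyConfigApplied(): void {
    try {
      this.onConfigApplied?.();
    } catch (err) {
      // Swallow callback errors so a telemetry/refresh failure
      // cannot wedge the evolution pipeline.
      console.warn(`[evolution] onConfigApplied callback failed: ${err}`);
    }
  }
}
```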

C2. Preserve failed queue rows for retry

cadence.runDrain was calling markProcessed with every drained id regardless of per-row success. Transient pipeline failures silently deleted rows from evolution_queue, defeating the whole durability argument for SQLite backing.

Fix: filter to ok=true ids only before markProcessed, emit a warn log per failed row including id and error string. Failed rows retry on the next drain.
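A minimal sketch of the filtering step, assuming a per-row result shape with id/ok/error fields (the actual drain-result type is not shown in the PR):

```typescript
// Sketch of the C2 fix inside cadence.runDrain; the DrainedRow shape
// and the markProcessed signature are assumptions.
interface DrainedRow {
  id: number;
  ok: boolean;
  error?: string;
}

function markSuccessfulRows(
  results: DrainedRow[],
  markProcessed: (ids: number[]) => void,
): void {
  const okIds: number[] = [];
  for (const entry of results) {
    if (entry.ok) {
      okIds.push(entry.id);
    } else {
      // Failed rows stay in evolution_queue and retry on the next drain.
      console.warn(
        `[evolution] queue row id=${entry.id} pipeline failed, leaving in queue: ${entry.error}`,
      );
    }
  }
  markProcessed(okIds);
}
```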

C3. Strip the claude_code preset overhead from the gate

Observed per-call gate cost was 20x to 180x the ~$0.0006 research target across the live fleet. Root cause: runJudgeQuery injects systemPrompt: { type: "preset", preset: "claude_code", append: gatePrompt }, which bundles the full Claude Code base prompt plus the tool catalog into every Haiku call. The gate is a pure evaluation call and does not need any of that overhead.

Fix: new omitPreset?: boolean option on JudgeQueryOptions. When true, the SDK receives a plain-string systemPrompt instead of the preset envelope. Verified against node_modules/@anthropic-ai/claude-agent-sdk/sdk.d.ts:1281 — the SDK explicitly accepts both shapes, and the plain-string path is documented as "Use a custom system prompt" which skips the preset's base prompt. Flipped on only at the gate call site. Every other judge (observation, constitution, regression, safety, quality) keeps the preset.

Extracted buildSystemPrompt as an exported helper so the invariant is unit-testable without mocking the SDK's streaming query().
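A sketch of what the exported helper might look like; the envelope fields match the PR description, while the union type alias and exact signature are assumptions:

```typescript
// Sketch of buildSystemPrompt; the SDK accepts either a plain string
// ("Use a custom system prompt") or the preset envelope. Only the
// envelope shape here is taken from the PR; the rest is assumed.
type SystemPrompt =
  | string
  | { type: "preset"; preset: "claude_code"; append: string };

function buildSystemPrompt(prompt: string, omitPreset?: boolean): SystemPrompt {
  if (omitPreset) {
    // Plain-string path: skips the Claude Code base prompt and tool
    // catalog, which the gate's pure evaluation call does not need.
    return prompt;
  }
  // All other judges keep the full preset envelope.
  return { type: "preset", preset: "claude_code", append: prompt };
}
```

Because the helper is a pure function, the preset-vs-plain invariant is unit-testable without mocking the SDK's streaming query().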

Predicted cost reduction: 10x to 100x depending on VM and cache state. Post-deploy verification reads gate_stats.haiku_cost_usd_total / total_decisions on the fleet.

MAJORs

M1. Collapse session_count triple-increment to single-increment

Phase 0's M4 fix moved updateAfterSession to the top of afterSession so the mutex-skip path would still bump the counter. Phase 2 replaced the drop-on-floor model with a persistent queue, so the M4 premise is obsolete: every enqueued session eventually hits runCycle. But the M4 call was still there, plus a new one inside enqueueIfWorthy, plus the original one inside runCycle. Fired sessions were counted 3x.

Fix: deleted the calls in enqueueIfWorthy and at the top of afterSessionInternal. Kept only the call inside runCycle where hadCorrections is known. Reverted the engine test expectation from 2 back to 1 and rewrote the Phase 0 mutex-skip test to assert activeCycleSkipCount === 2 instead of the old 3x count.

M2. End-to-end integration test across engine + queue + cadence + runtime

Every CRIT and MAJOR in this review is a coordination bug between modules, and the existing unit tests stub the engine so none of them could catch any of it. New test file src/evolution/__tests__/phase-1-2-integration.test.ts wires the real EvolutionEngine, real EvolutionQueue over :memory:, real EvolutionCadence, and a stub AgentRuntime. Two tests cover queue depth post-drain, single-count metrics, runtime refresh callback firing, and failed-row retention through two drains. 283 LOC.

M3. Dedup queue rows by session_key

A busy multi-turn session that crosses the gate threshold multiple times between drains was producing multiple queue rows, each triggering the full Sonnet judge pipeline on the drain. Cost amplifier scaling with conversation length.

Fix: EvolutionQueue.enqueue wraps DELETE FROM evolution_queue WHERE session_key = ? and the INSERT in a db.transaction(...). No migration required. A repeated enqueue for the same session_key replaces the prior row with the latest summary.
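The replace-on-enqueue semantics can be sketched as follows. A tiny fake db stands in for bun:sqlite so the behavior is illustrable without a native dependency; the SQL in the comments mirrors the PR description, everything else is an assumption:

```typescript
// Sketch of M3's dedup-on-enqueue. FakeDb mimics bun:sqlite's
// db.transaction(fn) call shape; real code would run the DELETE and
// INSERT statements shown in the comments inside one transaction.
interface QueueRow {
  session_key: string;
  summary: string;
}

class FakeDb {
  rows: QueueRow[] = [];
  transaction<T>(fn: () => T): () => T {
    // bun:sqlite wraps fn in BEGIN/COMMIT; the fake just returns it.
    return fn;
  }
}

class EvolutionQueue {
  constructor(private db: FakeDb) {}

  enqueue(sessionKey: string, summary: string): void {
    const run = this.db.transaction(() => {
      // DELETE FROM evolution_queue WHERE session_key = ?
      this.db.rows = this.db.rows.filter((r) => r.session_key !== sessionKey);
      // INSERT INTO evolution_queue (session_key, summary) VALUES (?, ?)
      this.db.rows.push({ session_key: sessionKey, summary });
    });
    run();
  }
}
```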

MINORs

  • N1 gate prompt tuning: deferred, needs a week of evolution-gate-log.jsonl data
  • N2 thread reactions into SessionSummary: deferred, out of scope
  • N3 thread gate cost into judge_costs: deferred, Phase 3 cost tracking restructure
  • N7 canonical session_key format: deferred, one-line cleanup outside the in-scope file list

Test plan

  • bun test: 1385 pass / 10 skip / 0 fail (+11 new tests over the 1374 baseline from PR #61, feat: evolution conditional firing gate and cadence batching)
  • bun run lint: clean
  • bun run typecheck: clean
  • Post-deploy: confirm gate_stats.haiku_cost_usd_total / total_decisions drops into the sub-cent range on at least one VM with active gate traffic
  • Post-deploy: confirm metrics.session_count bumps by exactly 1 per drained session (not 3)
  • Post-deploy: confirm a cadence drain applies changes AND refreshes the runtime in the same cycle (no stale config window)
  • Post-deploy: confirm the queue retains failed rows for retry by tailing warn logs for the per-row failure line

Notes

engine.ts is now 607 lines, up from 580, past the 600-line guideline. It is a Phase 3 split target; it was intentionally not split here because Phase 3's rewrite deletes most of the 6-judge bulk, and splitting first would produce throwaway work.

… dedup, single-count, failed-row preservation, integration test)

Closes the Codex and independent reviewer punch list against PR #61.

C1 runtime refresh after cadence drain. EvolutionEngine now owns an
onConfigApplied callback that fires from both the normal apply path and
the CycleAborted partial-apply recovery whenever applied.length > 0.
Wired into src/index.ts next to setQueueWiring so the AgentRuntime
evolved config snapshot refreshes after every drain that applies
changes. Pre-fix the queued path rewrote phantom-config/ files but the
live agent kept its boot-time snapshot until process restart.

C2 failed queue rows now stay in the queue. cadence.runDrain filters
the batch result to ok-only ids before markProcessed and emits a warn
line per failed row including the queue id and error string. Failed
rows retry on the next drain instead of being silently deleted, which
was reintroducing the pre-Phase-2 drop-on-floor shape for the exact
sessions the safety floor exists to protect.

C3 gate Haiku call drops the claude_code preset envelope. New
omitPreset option on JudgeQueryOptions, flipped on only at the gate
call site, swaps the SDK systemPrompt from the preset envelope to a
plain string. The SDK accepts both shapes; the plain-string path
skips the full Claude Code base prompt and tool catalog. Live fleet
data showed gate cost running 20 to 180x the research target because
of preset overhead. Other judges keep the preset envelope unchanged.

M1 collapses session_count triple-increment to single-increment per
fired session. Removed updateAfterSession from enqueueIfWorthy and
from the top of afterSessionInternal. The remaining call site inside
runCycle has the real hadCorrections value after observation
extraction. The Phase 0 M4 mutex-skip increment is obsolete now that
Phase 2 replaced the drop-on-floor path with a persistent queue: every
enqueued session eventually reaches runCycle.

M2 adds a phase-1-2 integration test that wires the real
EvolutionEngine, the real EvolutionQueue over in-memory sqlite, the
real EvolutionCadence, and a stub AgentRuntime. Two tests cover queue
depth post-drain, single-count metrics, runtime refresh callback
firing, and failed-row retention through two drains. Single highest
leverage item in the punch list; covers every coordination concern
between engine, queue, cadence, and runtime in one fixture.

M3 dedups queue rows by session_key. EvolutionQueue.enqueue wraps a
DELETE WHERE session_key = ? and the INSERT in a sqlite transaction so
the dedup is atomic with respect to a concurrent drainAll. No
migration required. A busy multi-turn session that crosses the gate
threshold multiple times between drains now collapses to a single row
with the latest summary, instead of running the full Sonnet judge
pipeline once per turn against progressively shorter snapshots.

Test plan
- bun test: 1385 pass / 10 skip / 0 fail (+11 over the 1374 baseline)
- bun run lint: clean
- bun run typecheck: clean

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3917fc1d71


console.warn(`[evolution] queue row id=${entry.id} pipeline failed, leaving in queue: ${entry.error}`);
}
}
this.queue.markProcessed(okIds);

P1: Avoid counting retried failed rows as new sessions

Keeping failed rows in the queue is correct for durability, but deleting only ok IDs means the same row can be processed multiple times, and runCycle increments session_count/success_count before several later failure points (for example, an exception after observation extraction). With this change, one transiently failing row can inflate session metrics and skew auto-rollback/consolidation decisions on every retry. Consider making metrics idempotent per queue row/session or deferring session-counter updates until the row is successfully finalized.


console.log(
`[evolution] Applied ${applied.length} changes (v${this.getCurrentVersion()}) in ${Date.now() - startTime}ms`,
);
this.notifyConfigApplied();

P1: Refresh runtime again after rollback path

The new config-refresh callback is invoked immediately after apply, but auto-rollback is evaluated later in the same cycle; if rollback triggers, disk state moves back while runtime has already been refreshed to the pre-rollback version. Since rollback() does not call notifyConfigApplied, queued evolution can leave runtime and on-disk config diverged exactly in the safety-critical rollback case.


Two follow-up fixes for findings Codex raised against the PR #63 fix pass.

- engine.ts: dedup updateAfterSession by session.session_key so the M1
  single-count contract holds under the C2 retry path. A failed row that
  retries on the next drain re-enters runCycle under the same key; the
  new countedSessionKeys Set guarantees one increment per unique
  conversation for the engine's lifetime. Memory is bounded by unique
  session_keys; process restart clears the set, which is an acceptable
  per-restart blip.
- engine.ts: call notifyConfigApplied() from rollback() so the
  auto-rollback branch in runCycle no longer leaves the runtime serving
  a snapshot newer than the disk state. getConfig() re-reads from disk
  on every call, so the callback observes the post-rollback content
  with no extra in-memory work.
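The session-key dedup in the first fix can be sketched as follows, with the engine state pared down to just the counter (the field and method names follow the commit message; everything else is an assumption):

```typescript
// Sketch of the countedSessionKeys dedup. One increment per unique
// session_key for the engine's lifetime, so a C2 retry re-entering
// runCycle under the same key cannot inflate session_count.
class SessionMetrics {
  sessionCount = 0;
  private countedSessionKeys = new Set<string>();

  updateAfterSession(sessionKey: string): void {
    if (this.countedSessionKeys.has(sessionKey)) return;
    this.countedSessionKeys.add(sessionKey);
    this.sessionCount += 1;
  }
}
```

Memory is bounded by the number of unique session_keys; a process restart clears the set, which the commit message accepts as a per-restart blip.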

Tests: +2 (retry dedup in phase-1-2-integration.test.ts, rollback
callback in engine.test.ts). 1387 pass / 10 skip / 0 fail. Lint and
typecheck clean.
@mcheemaa mcheemaa merged commit 5200517 into main Apr 15, 2026
1 check passed