
fix: phase 1+2 reviewer findings (C1 runtime refresh, C2 failed-row preservation, C3 gate preset, M1 single-count, M2 integration test, M3 dedup)#63

Merged
mcheemaa merged 2 commits into main from fix/evolution-phase-1-and-2-review
Apr 15, 2026

Conversation

@mcheemaa
Member

Summary

Follow-up PR that closes the combined Codex + independent reviewer punch list against PR #61 (Phase 1 conditional firing gate + Phase 2 SQLite cadence batching). Three CRITs, three MAJORs, all shipped. Optional MINOR items deferred.

CRITs

C1. Wire runtime refresh after cadence drain

The queued path in enqueueIfWorthy returns no inlineResult, so the .then callback in src/index.ts never called runtime.setEvolvedConfig after a drain. When the cadence drained a batch and applied changes to disk, AgentRuntime's in-memory evolvedConfig snapshot stayed at its boot-time state until process restart. Effective result: the queued path silently disabled live evolution.

Fix: EvolutionEngine now owns an onConfigApplied callback, set via a new setOnConfigApplied method next to setQueueWiring. It fires from both the normal apply path in runCycle and the CycleAborted partial-apply recovery, whenever applied.length > 0. Wired in src/index.ts to call runtime.setEvolvedConfig(evolution.getConfig()) on every fire. notifyConfigApplied swallows callback errors so a telemetry/refresh failure cannot wedge the pipeline.
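The callback plumbing can be sketched as follows; the class internals here are a minimal assumption built around the method names the PR describes (setOnConfigApplied, notifyConfigApplied), not the real engine:

```typescript
// Minimal sketch of the C1 callback wiring; everything except the
// method names from the PR description is an assumption.
type ConfigAppliedCallback = () => void;

class EvolutionEngine {
  private onConfigApplied?: ConfigAppliedCallback;

  // Set next to setQueueWiring in src/index.ts, e.g.
  //   evolution.setOnConfigApplied(() =>
  //     runtime.setEvolvedConfig(evolution.getConfig()));
  setOnConfigApplied(cb: ConfigAppliedCallback): void {
    this.onConfigApplied = cb;
  }

  // Called from both the normal apply path in runCycle and the
  // CycleAborted partial-apply recovery, whenever applied.length > 0.
  notifyConfigApplied(): void {
    try {
      this.onConfigApplied?.();
    } catch (err) {
      // Swallow callback errors so a telemetry/refresh failure
      // cannot wedge the evolution pipeline.
      console.warn(`[evolution] onConfigApplied callback failed: ${err}`);
    }
  }
}
```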

C2. Preserve failed queue rows for retry

cadence.runDrain was calling markProcessed with every drained id regardless of per-row success. Transient pipeline failures silently deleted rows from evolution_queue, defeating the whole durability argument for SQLite backing.

Fix: filter to ok=true ids only before markProcessed, emit a warn log per failed row including id and error string. Failed rows retry on the next drain.
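A minimal sketch of the filtering step, assuming a per-row result shape with id/ok/error fields (the actual drain-result type is not shown in the PR):

```typescript
// Sketch of the C2 fix inside cadence.runDrain; the DrainedRow shape
// and the markProcessed signature are assumptions.
interface DrainedRow {
  id: number;
  ok: boolean;
  error?: string;
}

function markSuccessfulRows(
  results: DrainedRow[],
  markProcessed: (ids: number[]) => void,
): void {
  const okIds: number[] = [];
  for (const entry of results) {
    if (entry.ok) {
      okIds.push(entry.id);
    } else {
      // Failed rows stay in evolution_queue and retry on the next drain.
      console.warn(
        `[evolution] queue row id=${entry.id} pipeline failed, leaving in queue: ${entry.error}`,
      );
    }
  }
  markProcessed(okIds);
}
```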

C3. Strip the claude_code preset overhead from the gate

Observed per-call gate cost was 20x to 180x the ~$0.0006 research target across the live fleet. Root cause: runJudgeQuery injects systemPrompt: { type: "preset", preset: "claude_code", append: gatePrompt }, which bundles the full Claude Code base prompt plus the tool catalog into every Haiku call. The gate is a pure evaluation call and does not need any of that overhead.

Fix: new omitPreset?: boolean option on JudgeQueryOptions. When true, the SDK receives a plain-string systemPrompt instead of the preset envelope. Verified against node_modules/@anthropic-ai/claude-agent-sdk/sdk.d.ts:1281 — the SDK explicitly accepts both shapes, and the plain-string path is documented as "Use a custom system prompt" which skips the preset's base prompt. Flipped on only at the gate call site. Every other judge (observation, constitution, regression, safety, quality) keeps the preset.

Extracted buildSystemPrompt as an exported helper so the invariant is unit-testable without mocking the SDK's streaming query().
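A sketch of what the exported helper might look like; the envelope fields match the PR description, while the union type alias and exact signature are assumptions:

```typescript
// Sketch of buildSystemPrompt; the SDK accepts either a plain string
// ("Use a custom system prompt") or the preset envelope. Only the
// envelope shape here is taken from the PR; the rest is assumed.
type SystemPrompt =
  | string
  | { type: "preset"; preset: "claude_code"; append: string };

function buildSystemPrompt(prompt: string, omitPreset?: boolean): SystemPrompt {
  if (omitPreset) {
    // Plain-string path: skips the Claude Code base prompt and tool
    // catalog, which the gate's pure evaluation call does not need.
    return prompt;
  }
  // All other judges keep the full preset envelope.
  return { type: "preset", preset: "claude_code", append: prompt };
}
```

Because the helper is a pure function, the preset-vs-plain invariant is unit-testable without mocking the SDK's streaming query().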

Predicted cost reduction: 10x to 100x depending on VM and cache state. Post-deploy verification reads gate_stats.haiku_cost_usd_total / total_decisions on the fleet.

MAJORs

M1. Collapse session_count triple-increment to single-increment

Phase 0's M4 fix moved updateAfterSession to the top of afterSession so the mutex-skip path would still bump the counter. Phase 2 replaced the drop-on-floor model with a persistent queue, so the M4 premise is obsolete: every enqueued session eventually hits runCycle. But the M4 call was still there, plus a new one inside enqueueIfWorthy, plus the original one inside runCycle. Fired sessions were counted 3x.

Fix: deleted the calls in enqueueIfWorthy and at the top of afterSessionInternal. Kept only the call inside runCycle where hadCorrections is known. Reverted the engine test expectation from 2 back to 1 and rewrote the Phase 0 mutex-skip test to assert activeCycleSkipCount === 2 instead of the old 3x count.

M2. End-to-end integration test across engine + queue + cadence + runtime

Every CRIT and MAJOR in this review is a coordination bug between modules, and the existing unit tests stub the engine so none of them could catch any of it. New test file src/evolution/__tests__/phase-1-2-integration.test.ts wires the real EvolutionEngine, real EvolutionQueue over :memory:, real EvolutionCadence, and a stub AgentRuntime. Two tests cover queue depth post-drain, single-count metrics, runtime refresh callback firing, and failed-row retention through two drains. 283 LOC.

M3. Dedup queue rows by session_key

A busy multi-turn session that crosses the gate threshold multiple times between drains was producing multiple queue rows, each triggering the full Sonnet judge pipeline on the drain. Cost amplifier scaling with conversation length.

Fix: EvolutionQueue.enqueue wraps DELETE FROM evolution_queue WHERE session_key = ? and the INSERT in a db.transaction(...). No migration required. A repeated enqueue for the same session_key replaces the prior row with the latest summary.
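The replace-on-enqueue semantics can be sketched as follows. A tiny fake db stands in for bun:sqlite so the behavior is illustrable without a native dependency; the SQL in the comments mirrors the PR description, everything else is an assumption:

```typescript
// Sketch of M3's dedup-on-enqueue. FakeDb mimics bun:sqlite's
// db.transaction(fn) call shape; real code would run the DELETE and
// INSERT statements shown in the comments inside one transaction.
interface QueueRow {
  session_key: string;
  summary: string;
}

class FakeDb {
  rows: QueueRow[] = [];
  transaction<T>(fn: () => T): () => T {
    // bun:sqlite wraps fn in BEGIN/COMMIT; the fake just returns it.
    return fn;
  }
}

class EvolutionQueue {
  constructor(private db: FakeDb) {}

  enqueue(sessionKey: string, summary: string): void {
    const run = this.db.transaction(() => {
      // DELETE FROM evolution_queue WHERE session_key = ?
      this.db.rows = this.db.rows.filter((r) => r.session_key !== sessionKey);
      // INSERT INTO evolution_queue (session_key, summary) VALUES (?, ?)
      this.db.rows.push({ session_key: sessionKey, summary });
    });
    run();
  }
}
```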

MINORs

  • N1 gate prompt tuning: deferred, needs a week of evolution-gate-log.jsonl data
  • N2 thread reactions into SessionSummary: deferred, out of scope
  • N3 thread gate cost into judge_costs: deferred, Phase 3 cost tracking restructure
  • N7 canonical session_key format: deferred, one-line cleanup outside the in-scope file list

Test plan

  • bun test: 1385 pass / 10 skip / 0 fail (+11 new tests over the 1374 baseline from PR #61, feat: evolution conditional firing gate and cadence batching)
  • bun run lint: clean
  • bun run typecheck: clean
  • Post-deploy: confirm gate_stats.haiku_cost_usd_total / total_decisions drops into the sub-cent range on at least one VM with active gate traffic
  • Post-deploy: confirm metrics.session_count bumps by exactly 1 per drained session (not 3)
  • Post-deploy: confirm a cadence drain applies changes AND refreshes the runtime in the same cycle (no stale config window)
  • Post-deploy: confirm the queue retains failed rows for retry by tailing warn logs for the per-row failure line

Notes

engine.ts is now 607 lines, up from 580, past the 600-line guideline. It is a Phase 3 split target; it was intentionally not split here because Phase 3's rewrite deletes most of the 6-judge bulk, and splitting first would produce throwaway work.

… dedup, single-count, failed-row preservation, integration test)

Closes the Codex and independent reviewer punch list against PR #61.

C1 runtime refresh after cadence drain. EvolutionEngine now owns an
onConfigApplied callback that fires from both the normal apply path and
the CycleAborted partial-apply recovery whenever applied.length > 0.
Wired into src/index.ts next to setQueueWiring so the AgentRuntime
evolved config snapshot refreshes after every drain that applies
changes. Pre-fix the queued path rewrote phantom-config/ files but the
live agent kept its boot-time snapshot until process restart.

C2 failed queue rows now stay in the queue. cadence.runDrain filters
the batch result to ok-only ids before markProcessed and emits a warn
line per failed row including the queue id and error string. Failed
rows retry on the next drain instead of being silently deleted, which
was reintroducing the pre-Phase-2 drop-on-floor shape for the exact
sessions the safety floor exists to protect.

C3 gate Haiku call drops the claude_code preset envelope. New
omitPreset option on JudgeQueryOptions, flipped on only at the gate
call site, swaps the SDK systemPrompt from the preset envelope to a
plain string. The SDK accepts both shapes; the plain-string path
skips the full Claude Code base prompt and tool catalog. Live fleet
data showed gate cost running 20 to 180x the research target because
of preset overhead. Other judges keep the preset envelope unchanged.

M1 collapses session_count triple-increment to single-increment per
fired session. Removed updateAfterSession from enqueueIfWorthy and
from the top of afterSessionInternal. The remaining call site inside
runCycle has the real hadCorrections value after observation
extraction. The Phase 0 M4 mutex-skip increment is obsolete now that
Phase 2 replaced the drop-on-floor path with a persistent queue: every
enqueued session eventually reaches runCycle.

M2 adds a phase-1-2 integration test that wires the real
EvolutionEngine, the real EvolutionQueue over in-memory sqlite, the
real EvolutionCadence, and a stub AgentRuntime. Two tests cover queue
depth post-drain, single-count metrics, runtime refresh callback
firing, and failed-row retention through two drains. Single highest
leverage item in the punch list; covers every coordination concern
between engine, queue, cadence, and runtime in one fixture.

M3 dedups queue rows by session_key. EvolutionQueue.enqueue wraps a
DELETE WHERE session_key = ? and the INSERT in a sqlite transaction so
the dedup is atomic with respect to a concurrent drainAll. No
migration required. A busy multi-turn session that crosses the gate
threshold multiple times between drains now collapses to a single row
with the latest summary, instead of running the full Sonnet judge
pipeline once per turn against progressively shorter snapshots.

Test plan
- bun test: 1385 pass / 10 skip / 0 fail (+11 over the 1374 baseline)
- bun run lint: clean
- bun run typecheck: clean

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3917fc1d71


console.warn(`[evolution] queue row id=${entry.id} pipeline failed, leaving in queue: ${entry.error}`);
}
}
this.queue.markProcessed(okIds);

P1: Avoid counting retried failed rows as new sessions

Keeping failed rows in the queue is correct for durability, but deleting only ok IDs means the same row can be processed multiple times, and runCycle increments session_count/success_count before several later failure points (for example, an exception after observation extraction). With this change, one transiently failing row can inflate session metrics and skew auto-rollback/consolidation decisions on every retry. Consider making metrics idempotent per queue row/session or deferring session-counter updates until the row is successfully finalized.


console.log(
`[evolution] Applied ${applied.length} changes (v${this.getCurrentVersion()}) in ${Date.now() - startTime}ms`,
);
this.notifyConfigApplied();

P1: Refresh runtime again after rollback path

The new config-refresh callback is invoked immediately after apply, but auto-rollback is evaluated later in the same cycle; if rollback triggers, disk state moves back while runtime has already been refreshed to the pre-rollback version. Since rollback() does not call notifyConfigApplied, queued evolution can leave runtime and on-disk config diverged exactly in the safety-critical rollback case.


Two follow-up fixes for findings Codex raised against the PR #63 fix pass.

- engine.ts: dedup updateAfterSession by session.session_key so the M1
  single-count contract holds under the C2 retry path. A failed row that
  retries on the next drain re-enters runCycle under the same key; the
  new countedSessionKeys Set guarantees one increment per unique
  conversation for the engine's lifetime. Memory is bounded by unique
  session_keys; process restart clears the set, which is an acceptable
  per-restart blip.
- engine.ts: call notifyConfigApplied() from rollback() so the
  auto-rollback branch in runCycle no longer leaves the runtime serving
  a snapshot newer than the disk state. getConfig() re-reads from disk
  on every call, so the callback observes the post-rollback content
  with no extra in-memory work.
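The session-key dedup in the first fix can be sketched as follows, with the engine state pared down to just the counter (the field and method names follow the commit message; everything else is an assumption):

```typescript
// Sketch of the countedSessionKeys dedup. One increment per unique
// session_key for the engine's lifetime, so a C2 retry re-entering
// runCycle under the same key cannot inflate session_count.
class SessionMetrics {
  sessionCount = 0;
  private countedSessionKeys = new Set<string>();

  updateAfterSession(sessionKey: string): void {
    if (this.countedSessionKeys.has(sessionKey)) return;
    this.countedSessionKeys.add(sessionKey);
    this.sessionCount += 1;
  }
}
```

Memory is bounded by the number of unique session_keys; a process restart clears the set, which the commit message accepts as a per-restart blip.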

Tests: +2 (retry dedup in phase-1-2-integration.test.ts, rollback
callback in engine.test.ts). 1387 pass / 10 skip / 0 fail. Lint and
typecheck clean.
@mcheemaa mcheemaa merged commit 5200517 into main Apr 15, 2026
1 check passed