The learning loop is now Cardinal Rule compliant end to end. The six-judge validation pipeline and the hardcoded buildCritiqueFromObservations for-loop are deleted. In their place:
- reflection-subprocess.ts spawns the Agent SDK as a sandboxed memory manager. The agent reads the batch, reads the memory files, and decides what to learn, what to compact, when to skip, when to promote between files, and which model tier to run at. TypeScript is plumbing: snapshot, spawn, parse sentinel, invariant check, commit or rollback.
- subprocess-prompt.ts carries the teaching prompt as a TypeScript constant with a runtime-facts header. The prompt teaches file purposes, signal selection, format, compaction, promotion, skip, escalation, and constitution immutability. Includes one worked good/bad bullet example.
- invariant-check.ts implements nine deterministic post-write invariants. Hard fails roll back via snapshot, soft warnings log. I6 is two-tier: hard fail on credential patterns, soft warn on external URLs. I4 growth cap per file is 80 lines.
- versioning.ts adds snapshot/restore and deletes the broken rollback reverse-change path.
- queue.ts adds retry_count and the evolution_queue_poison table for bounded retries on invariant failures (three strikes then graduate).
- metrics.ts adds reflection_stats and drops judge_costs and auto-rollback.
- Docker compose adds pids_limit=256, oom_score_adj=-500, cpu_shares=2048 on the phantom service. No cpus cap.
- Trims engine.ts from 627 to ~280 lines, deletes reflection.ts, validation.ts, application.ts, consolidation.ts, golden-suite.ts, and the whole judges/ directory (10 source files + 7 test files).
Net: approximately -2200 lines of source, +600 lines of new source, +700 lines of new tests, -800 lines of deleted tests. Test count 1400 pass / 10 skip / 0 fail (up 13 from the 1387 baseline). Lint clean. Typecheck clean.
Fixes the 100 percent appends-to-user-profile pathology, reduces evolution cost roughly 7 to 15x, and unlocks the chat channel project on a clean memory substrate.
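To make the two-tier I6 rule above concrete, here is a minimal sketch of a content-safety check that hard-fails on credential-like strings and soft-warns on external URLs. It is an illustration only: the function name checkContentSafety, the Finding type, the regexes, and the exact host suffixes are assumptions, not the code in invariant-check.ts.

```typescript
// Hypothetical sketch of a two-tier content-safety invariant (I6).
// All names and patterns below are illustrative assumptions.

type Finding = { severity: "hard" | "soft"; detail: string };

// Credential-like patterns, modeled on the examples named in the PR summary.
const CREDENTIAL_PATTERNS: RegExp[] = [/sk-ant-/, /ANTHROPIC_API_KEY/, /api_key\s*=/i];

// Hosts assumed acceptable; the real allowlist matching may differ.
const ALLOWED_HOST_SUFFIXES = ["github.com", "slack.com", "telegram.org", "anthropic.com", "localhost"];

function isAllowedHost(host: string): boolean {
  return ALLOWED_HOST_SUFFIXES.some((s) => host === s || host.endsWith("." + s));
}

function checkContentSafety(content: string): Finding[] {
  const findings: Finding[] = [];

  // Tier 1: any credential-like match is a hard failure.
  for (const pattern of CREDENTIAL_PATTERNS) {
    if (pattern.test(content)) {
      findings.push({ severity: "hard", detail: `credential pattern ${pattern.source}` });
    }
  }

  // Tier 2: URLs whose host is not on the allowlist are soft warnings.
  const urlPattern = /https?:\/\/([^\/\s:?#)>\]]+)/g;
  let match: RegExpExecArray | null;
  while ((match = urlPattern.exec(content)) !== null) {
    const host = (match[1] ?? "").toLowerCase();
    if (!isAllowedHost(host)) {
      findings.push({ severity: "soft", detail: `external URL host ${host}` });
    }
  }
  return findings;
}
```

A caller would presumably roll back when any finding has severity "hard" and only log the "soft" ones.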
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bf1a06c889
    const result = await runReflectionSubprocess({
      batch,
      config: this.config,
      phantomConfig: this.runtime ? this.runtime.getPhantomConfig() : null,
    });
Honor reflection.enabled before spawning subprocess
resolveReflectionMode() computes this.reflectionEnabled from config/credentials, but the batch path still calls runReflectionSubprocess unconditionally here (and in runSingleSession). With reflection.enabled: "never" (or auto with no auth), drains still attempt SDK calls, return transient errors, and rows stay in evolution_queue forever because cadence only removes ok/invariant-failed rows. This makes the disable flag ineffective and can wedge queue processing in non-credential environments.
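A minimal sketch of the guard this comment asks for, assuming the setting takes the values "auto" and "never" as quoted above. The function names, the DrainResult shape, and the injected runSubprocess callback are hypothetical; how disabled-mode rows leave the queue is deliberately left out.

```typescript
// Hypothetical sketch: resolve the flag once, then short-circuit before spawning.
// Only "auto" and "never" are named in the review; other values may exist.
type ReflectionSetting = "auto" | "never";

function resolveReflectionEnabled(setting: ReflectionSetting, hasCredentials: boolean): boolean {
  if (setting === "never") return false;
  // "auto": only reflect when the SDK would be able to authenticate.
  return hasCredentials;
}

type DrainResult = { spawned: boolean; sessionsCounted: number };

function drain<T>(enabled: boolean, batch: T[], runSubprocess: (batch: T[]) => void): DrainResult {
  if (!enabled) {
    // Disabled mode: session counting can still happen, but nothing is spawned.
    return { spawned: false, sessionsCounted: batch.length };
  }
  runSubprocess(batch);
  return { spawned: true, sessionsCounted: batch.length };
}
```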
src/evolution/engine.ts
Outdated
    const session = q.session_summary;
    if (this.countedSessionKeys.has(session.session_key)) continue;
    this.countedSessionKeys.add(session.session_key);
    updateAfterSession(this.config, session.outcome, false);
Stop hardcoding correction flag to false
Session metrics are updated with hadCorrections fixed to false, so correction_count/correction_rate_7d can never increase regardless of actual session content. Since these fields are still persisted and exposed (for example via MCP metrics output), this silently degrades telemetry accuracy and makes correction-rate reporting permanently zero.
Eleven items from the independent review of PR #64. Three CRITs land the dispatcher fixes that block merge; five MAJORs close test gaps and type drift; three MINORs are cosmetic cleanups bundled into the same commit.
- CRIT-1 short-circuits runDrainPipeline and runSingleSession when reflection.enabled is false so a no-auth install does not spawn the SDK subprocess on every cadence tick. The disabled-mode path still records stats and ticks session_count.
- CRIT-2 deletes correction_count and correction_rate_7d entirely. The reflection_stats.files_touched map is the single source of truth for per-file drain counts; TypeScript stops pretending to know what a correction is.
- CRIT-3 turns the implicit ok/invariantFailed boolean pair on SessionBatchEntry into an explicit four-value disposition enum so the cadence routes queue rows by name, not by deriving a boolean from an unrelated string field.
- MAJOR-1 adds engine and cadence integration tests for the disable path.
- MAJOR-2 adds a cadence transient-failure test that pins the rows-stay-in-queue-without-retry-bump semantics.
- MAJOR-3 deletes dead code in reflection-subprocess.ts and threads drainId into the SDK user message so the agent can read its batch file directly.
- MAJOR-4 renames getHistory to getEvolutionLog, returns the new EvolutionLogEntry shape, and adds migrateOldLogEntry for read-time backward compat against old log rows on disk.
- MAJOR-5 drops compact from the SubprocessStatus union; compaction is a per-change annotation only.
- MINOR-4 deletes the vestigial ConstitutionChecker class and inlines the existence check into the engine constructor.
- MINOR-5 rewrites the docker-compose memory comment to describe the Phase 3 profile.
- MINOR-7 sets the rollback branch status label to skip so the top-level field matches the truth on disk.
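As a rough illustration of the read-time migration in MAJOR-4, the sketch below upgrades an old log row keyed by session_id into the drain_id plus session_ids[] shape. Only those three field names come from this PR; the type layout and the legacy- prefix used to synthesize a drain id for old rows are assumptions made for the example.

```typescript
// Hypothetical sketch of a read-time migration for old evolution log rows.
// The "legacy-" drain id prefix is an invented placeholder, not the PR's behavior.

type OldLogEntry = { session_id: string };
type EvolutionLogEntry = { drain_id: string; session_ids: string[] };

function migrateOldLogEntry(raw: Partial<OldLogEntry & EvolutionLogEntry>): EvolutionLogEntry {
  // Rows already in the new shape pass through unchanged.
  if (typeof raw.drain_id === "string" && Array.isArray(raw.session_ids)) {
    return { drain_id: raw.drain_id, session_ids: raw.session_ids };
  }
  // Old rows carried exactly one session; wrap it in a single-element list.
  const sessionId = raw.session_id;
  return {
    drain_id: `legacy-${sessionId ?? "unknown"}`,
    session_ids: sessionId === undefined ? [] : [sessionId],
  };
}
```

Doing this at read time means rows on disk never need rewriting; callers only ever see the new shape.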
Summary
Phase 3 of the evolution rethink. Replaces the six-judge validation pipeline and the hardcoded buildCritiqueFromObservations for-loop with a single reflection subprocess that manages Phantom's memory files as a first-class memory manager. The agent decides what to learn, when to compact, when to skip, whether to promote between files, and which model tier to run at. TypeScript is plumbing: snapshot, spawn, parse sentinel, invariant check, commit or rollback.
Bundled with Phase 4 (cleanup of orphaned code) in a single PR because Phase 4's deletions are unreachable the moment Phase 3 lands.
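The "parse sentinel" step in the plumbing list above can be sketched as follows. This is a hedged illustration that assumes the sentinel is the last JSON object in the final assistant message and that it carries a string status field; the parseSentinel name and the scanning strategy are assumptions, not the PR's parser.

```typescript
// Hypothetical sketch: extract the trailing sentinel JSON from the final message.
type Sentinel = { status: string } & Record<string, unknown>;

function parseSentinel(finalMessage: string): Sentinel | null {
  const end = finalMessage.lastIndexOf("}");
  if (end === -1) return null;

  // Walk candidate "{" positions right to left until one parses as an
  // object with a string "status" field that ends at the last "}".
  let start = finalMessage.lastIndexOf("{", end);
  while (start !== -1) {
    try {
      const parsed: unknown = JSON.parse(finalMessage.slice(start, end + 1));
      if (
        typeof parsed === "object" &&
        parsed !== null &&
        typeof (parsed as { status?: unknown }).status === "string"
      ) {
        return parsed as Sentinel;
      }
    } catch {
      // Not valid JSON from this "{"; keep scanning left.
    }
    start = start > 0 ? finalMessage.lastIndexOf("{", start - 1) : -1;
  }
  return null;
}
```

Returning null rather than throwing lets the caller count a parse failure as its own outcome.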
The pathology this fixes
Before this PR, 100 percent of lifetime applied evolution deltas across the fleet are appends to user-profile.md. Zero changes to constitution.md, persona.md, domain-knowledge.md, strategies/*, or memory/principles.md. Root cause: reflection.ts::buildCritiqueFromObservations was a hardcoded for-loop that mapped every observation to a user-profile.md append regardless of type. The observation judge produced rich multi-type output and the for-loop threw everything except correction and preference on the floor.
Phase 3 deletes the for-loop and hands file-selection to the agent via a teaching prompt that tells it to check whether each observation belongs in domain-knowledge, persona, strategies, or memory/principles before defaulting to user-profile.
New modules
- reflection-subprocess.ts: entry point for the reflection pass. Snapshots phantom-config/, spawns the Agent SDK as a sandboxed memory manager (cwd=phantom-config, tools: Read/Write/Edit/Glob/Grep, plain-string systemPrompt, explicit permission allow/deny rules), parses the sentinel JSON from the final assistant message, escalates tiers on request, runs the invariant check, and either commits the version bump or restores the snapshot. Ships with a runner override hook so tests inject deterministic fixtures without mocking the whole SDK.
- invariant-check.ts: nine post-write invariants as a single pure function. I1 file scope, I2 constitution byte-compare, I3 canonical file existence, I4 size bounds (per-file cap 80 lines, total cap 100, 70 percent shrinkage soft bound with sentinel override, zero-byte hard bound), I5 markdown/JSONL syntax, I6 content safety two-tier (hard fail on credential patterns like sk-ant-, ANTHROPIC_API_KEY, api_key =; soft warn on external URLs outside github/slack/telegram/anthropic/localhost), I7 near-duplicate idempotence, I8 sentinel cross-check, I9 staging cleanup. Zero LLM calls, runs in milliseconds, replaces roughly 2000 lines of the old judge pipeline.
- subprocess-prompt.ts: the teaching prompt as a TypeScript constant. Teaches memory file purposes, signal selection, format rules, when to compact, when to promote, when to do nothing (skip default repeated three times), how to escalate tier, constitution immutability, and the final-message sentinel format. One worked bad-vs-good bullet example. buildSubprocessSystemPrompt prepends a runtime facts header (batch id, batch sessions, version, tier, file sizes) above the static teaching. No preset envelope.
- versioning.ts additions: snapshotDirectory, restoreSnapshot, DirectorySnapshot, buildVersionChanges. Replace the fragile append-reverse rollback/reverseChange, which only worked for the narrow "undo the last few appends" shape the old pipeline needed.
- gate-prompt.ts and judge-models.ts: migration targets for the Phase 1 gate prompt, the GateJudgeResult schema, and the JUDGE_MODEL_* constants. Migrated BEFORE the source files were deleted.
Model tiering
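A minimal sketch of the escalation rules this section describes, written as a pure decision function plus a small driver. It assumes a synchronous, injected spawn callback for brevity (a real subprocess spawn would be asynchronous and cancelled via AbortController), and the names nextTier, runDrain, and the non-escalate status values are hypothetical.

```typescript
// Hypothetical sketch of tier escalation: Haiku first, escalate only upward,
// Opus is terminal, and at most three spawns per drain.
type Tier = "haiku" | "sonnet" | "opus";

const TIER_ORDER: Tier[] = ["haiku", "sonnet", "opus"];
const TIER_TIMEOUT_MS: Record<Tier, number> = { haiku: 60000, sonnet: 180000, opus: 300000 };
const MAX_SPAWNS_PER_DRAIN = 3;

type Sentinel =
  | { status: "escalate"; target: "sonnet" | "opus"; reason: string }
  | { status: "ok" | "skip" }; // non-escalate statuses are assumed names

function nextTier(current: Tier, sentinel: Sentinel, spawnsSoFar: number): Tier | null {
  if (sentinel.status !== "escalate") return null; // the agent did the work
  if (current === "opus") return null; // Opus cannot escalate further
  if (spawnsSoFar >= MAX_SPAWNS_PER_DRAIN) return null; // spawn cap reached
  // Ignore requests that would not move to a higher tier.
  if (TIER_ORDER.indexOf(sentinel.target) <= TIER_ORDER.indexOf(current)) return null;
  return sentinel.target;
}

function runDrain(spawn: (tier: Tier, timeoutMs: number) => Sentinel): { tiers: Tier[]; final: Sentinel } {
  let tier: Tier = "haiku"; // TypeScript always starts at the cheapest tier
  const tiers: Tier[] = [];
  for (;;) {
    tiers.push(tier);
    const result = spawn(tier, TIER_TIMEOUT_MS[tier]);
    const next = nextTier(tier, result, tiers.length);
    if (next === null) return { tiers, final: result };
    tier = next;
  }
}
```

Because the tier can only move upward, the loop terminates after at most three spawns even without the explicit cap; the cap is kept to mirror the stated rule.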
Two-stage return-signal pattern matching the Phase 1 gate. TypeScript always spawns Haiku first. The agent either does the work at its current tier or emits {"status":"escalate","target":"sonnet|opus","reason":"..."}. TypeScript respawns at the requested tier. Opus cannot escalate further. One escalation per stage, capped at three subprocess spawns per drain. The agent picks the tier based on task complexity; TypeScript does not classify batch size or content to pick tiers.
Per-tier timeouts via AbortController: Haiku 60s, Sonnet 180s, Opus 300s.
Queue retry and poison pile
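The retry rules in this section can be sketched as a small pure function: invariant hard failures bump retry_count and graduate the row at three, while transient failures leave the row untouched. The QueueRow shape beyond retry_count, the FailureKind names, and applyFailure itself are hypothetical.

```typescript
// Hypothetical sketch of bounded retries with a poison pile.
const POISON_THRESHOLD = 3;

type FailureKind = "invariant_hard_fail" | "transient";
type QueueRow = { id: number; retry_count: number };
type RowOutcome = { row: QueueRow; destination: "queue" | "poison" };

function applyFailure(row: QueueRow, kind: FailureKind): RowOutcome {
  if (kind === "transient") {
    // SIGKILL, timeout, or parse fail with no writes: the row is innocent,
    // so it stays queued for the next drain with its count unchanged.
    return { row, destination: "queue" };
  }
  const bumped: QueueRow = { ...row, retry_count: row.retry_count + 1 };
  return {
    row: bumped,
    destination: bumped.retry_count >= POISON_THRESHOLD ? "poison" : "queue",
  };
}
```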
New migration 26: ALTER TABLE evolution_queue ADD COLUMN retry_count INTEGER NOT NULL DEFAULT 0. New migration 27: CREATE TABLE evolution_queue_poison. Invariant hard failures increment retry count; rows graduate to the poison pile at retry_count >= 3. Transient subprocess failures (SIGKILL, timeout, parse fail with no writes) do NOT increment retry count because the row itself is innocent; they just leave the row in place for the next drain. listPoisonPile() ships in queue.ts as plumbing for a future operator inspection path.
Metrics
New reflection_stats block on metrics.json: drains, stage_haiku_runs, stage_sonnet_runs, stage_opus_runs, escalation counters, status counters, timeout counters per tier, sigkill counters, invariant_failed_hard, invariant_warned_soft, sentinel_parse_fail, total_cost_usd, compactions_performed, files_touched per path. Replaces the old judge_costs block. Auto-rollback metrics (rollback_count, sessions_since_consolidation) are deleted; auto-rollback has fired zero times across the fleet lifetime.
Docker compose hardening
Adds pids_limit: 256, oom_score_adj: -500, cpu_shares: 2048 to the phantom service. OS-level cap against a future fork-bomb regression as secondary defense on top of the Phase 0 mutex, OOM killer bias toward qdrant/ollama under pressure, and 2x CPU weighting against qdrant/ollama under saturation. No cpus cap added; current "no cpu.max quota" behavior is correct (nr_throttled = 0 across the fleet lifetime).
Deletions (Phase 4 cleanup)
Full file deletions:
- reflection.ts (the hardcoded for-loop)
- validation.ts (the 6-gate validation pipeline with triple-judge minority veto for constitution and safety)
- application.ts (applyDelta, applyApproved)
- consolidation.ts (heuristic path; the LLM consolidation-judge has fired zero times lifetime)
- golden-suite.ts (dead regression judge backing)
- judges/client.ts, judges/consolidation-judge.ts, judges/constitution-judge.ts, judges/observation-judge.ts, judges/quality-judge.ts, judges/regression-judge.ts, judges/safety-judge.ts, judges/prompts.ts, judges/schemas.ts, judges/types.ts (whole judges/ directory)
- __tests__/
engine.ts shrinks from 627 lines to roughly 280. Removes runCycle, daily cost cap, auto-rollback caller, recordJudgeCosts, pruneGoldenSuite, rollback, reverseChange, mergeCosts, totalCostFromJudgeCosts, isWithinCostCap, resetDailyCostIfNewDay. Preserves the Phase 0 activeCycle mutex as belt-and-suspenders.
constitution.ts, config.ts, metrics.ts, types.ts trimmed to match the new shape.
VersionChange.type union changes from append|replace|remove to edit|compact|new|delete to match the file-operation model.
EvolutionLogEntry.session_id replaced with drain_id plus session_ids[] because a drain can process many sessions. Old log entries on disk still parse via Partial<EvolutionVersion>.
src/memory/consolidation.ts: simplified to heuristic-only. The LLM path (consolidateSessionWithLLM) called into the now-deleted judges directory. The heuristic path is what was actually running in production (the LLM consolidation-judge had zero lifetime calls). No behavior regression.
Net LOC delta
Source: approximately -2200 deletions, +600 additions. Tests: approximately -800 deletions (judge tests, validation tests, application tests, cost cap tests, golden suite tests), +700 additions (reflection-subprocess tests, invariant-check tests, snapshot tests, sentinel parser tests, subprocess-prompt pin tests, mutex tests, batch-processor tests). Net: approximately -1700 lines.
Test plan
- bun test: 1400 pass / 10 skip / 0 fail. Up +13 from the 1387 baseline after the Phase 1+2 second fix pass merge.
- bun run lint: clean.
- bun run typecheck: clean.
- reflection_stats block appears in metrics.json after first drain, reflection subprocess completes, no invariant hard fails, compaction of the accumulated bloated user-profile.md produces a shorter file with semantic targeting.
- reflection_stats distribution after 24 hours of drain activity. Expected: 70 to 80 percent Haiku runs, 15 to 25 percent Sonnet, 1 to 5 percent Opus, zero invariant hard fails, 30 to 40 percent skip ratio.
What this does not ship
phantom evolution list-poison and friends). Deferred to a follow-up PR. The SQLite table and the markFailed / auto-graduation-at-3 logic ship; operators can sqlite3 phantom.db 'SELECT * FROM evolution_queue_poison' directly in the interim.
Known deploy drift
- config/evolution.yaml on instances with the previous code will produce non-fatal zod warnings on the first restart until the trimmed YAML lands alongside the code.
- phantom-config/meta/metrics.json on instances with the previous code still carries the old judge_costs and rollback_count blocks. readMetrics silently drops unknown fields on write; the blocks bleed out naturally.
- src/mcp/resources.ts serializes getVersionHistory() for the MCP changelog resource. New entries use the new VersionChange shape. Old entries still parse via Partial<EvolutionVersion>. Downstream consumers of the MCP changelog see the new vocabulary on fresh data.