feat: replace 6-judge pipeline with sandboxed reflection subprocess#64

Merged
mcheemaa merged 2 commits into main from feat/evolution-phase-3-reflection-subprocess on Apr 15, 2026

@mcheemaa
Member

Summary

Phase 3 of the evolution rethink. Replaces the six-judge validation pipeline and the hardcoded buildCritiqueFromObservations for-loop with a single reflection subprocess that manages Phantom's memory files as a first-class memory manager. The agent decides what to learn, when to compact, when to skip, whether to promote between files, and which model tier to run at. TypeScript is plumbing: snapshot, spawn, parse sentinel, invariant check, commit or rollback.
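
The plumbing sequence above (snapshot, spawn, parse sentinel, invariant check, commit or rollback) can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in reflection-subprocess.ts: the names `Files`, `Runner`, `InvariantCheck`, `parseSentinel`, and `runDrain` are hypothetical, files are modeled as a Map instead of the phantom-config/ directory, and the sentinel is assumed to be a JSON object on the last line of the final message.

```typescript
// Hypothetical types for illustration only; none of these names come from the PR.
type Files = Map<string, string>;

interface Sentinel {
  status: string;
}

interface DrainResult {
  committed: boolean;
  reason: string;
}

// The runner stands in for the sandboxed subprocess: it may mutate `files`
// and returns the final assistant message.
type Runner = (files: Files) => string;

// The invariant check returns a list of hard failures (empty means pass).
type InvariantCheck = (before: Files, after: Files) => string[];

// Assumption: the sentinel is a JSON object on the last line of the message.
function parseSentinel(finalMessage: string): Sentinel | null {
  const lastLine = finalMessage.trim().split("\n").pop() ?? "";
  try {
    const parsed: unknown = JSON.parse(lastLine);
    if (
      typeof parsed === "object" &&
      parsed !== null &&
      typeof (parsed as { status?: unknown }).status === "string"
    ) {
      return parsed as Sentinel;
    }
  } catch {
    // Not valid JSON: treated as a parse failure below.
  }
  return null;
}

function runDrain(files: Files, run: Runner, check: InvariantCheck): DrainResult {
  // 1. Snapshot the current state so any later failure can be undone.
  const snapshot: Files = new Map(files);
  const restore = (): void => {
    files.clear();
    snapshot.forEach((content, path) => files.set(path, content));
  };

  // 2. Spawn: the runner may write to `files`.
  const finalMessage = run(files);

  // 3. Parse the sentinel from the final message.
  const sentinel = parseSentinel(finalMessage);
  if (sentinel === null) {
    restore();
    return { committed: false, reason: "sentinel parse failed" };
  }

  // 4. Invariant check against the snapshot.
  const failures = check(snapshot, files);
  if (failures.length > 0) {
    restore(); // 5b. Roll back on any hard failure.
    return { committed: false, reason: failures.join("; ") };
  }

  // 5a. Commit: the writes stay in place.
  return { committed: true, reason: sentinel.status };
}
```

Injecting the runner as a parameter mirrors the runner override hook described under New modules below, which lets tests supply deterministic fixtures.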

Bundled with Phase 4 (cleanup of orphaned code) in a single PR because Phase 4's deletions are unreachable the moment Phase 3 lands.

The pathology this fixes

Before this PR, 100 percent of lifetime applied evolution deltas across the fleet were appends to user-profile.md, with zero changes to constitution.md, persona.md, domain-knowledge.md, strategies/*, or memory/principles.md. Root cause: reflection.ts::buildCritiqueFromObservations was a hardcoded for-loop that mapped every observation to a user-profile.md append regardless of type. The observation judge produced rich multi-type output and the for-loop threw everything except correction and preference on the floor.

Phase 3 deletes the for-loop and hands file-selection to the agent via a teaching prompt that tells it to check whether each observation belongs in domain-knowledge, persona, strategies, or memory/principles before defaulting to user-profile.

New modules

  • reflection-subprocess.ts: entry point for the reflection pass. Snapshots phantom-config/, spawns the Agent SDK as a sandboxed memory manager (cwd=phantom-config, tools: Read/Write/Edit/Glob/Grep, plain-string systemPrompt, explicit permission allow/deny rules), parses the sentinel JSON from the final assistant message, escalates tiers on request, runs the invariant check, and either commits the version bump or restores the snapshot. Ships with a runner override hook so tests inject deterministic fixtures without mocking the whole SDK.

  • invariant-check.ts: nine post-write invariants as a single pure function. I1 file scope, I2 constitution byte-compare, I3 canonical file existence, I4 size bounds (per-file cap 80 lines, total cap 100, 70 percent shrinkage soft bound with sentinel override, zero-byte hard bound), I5 markdown/JSONL syntax, I6 content safety two-tier (hard fail on credential patterns like sk-ant-, ANTHROPIC_API_KEY, api_key =; soft warn on external URLs outside github/slack/telegram/anthropic/localhost), I7 near-duplicate idempotence, I8 sentinel cross-check, I9 staging cleanup. Zero LLM calls, runs in milliseconds, replaces roughly 2000 lines of the old judge pipeline.

  • subprocess-prompt.ts: the teaching prompt as a TypeScript constant. Teaches memory file purposes, signal selection, format rules, when to compact, when to promote, when to do nothing (skip default repeated three times), how to escalate tier, constitution immutability, and the final-message sentinel format. One worked bad-vs-good bullet example. buildSubprocessSystemPrompt prepends a runtime facts header (batch id, batch sessions, version, tier, file sizes) above the static teaching. No preset envelope.

  • versioning.ts additions: snapshotDirectory, restoreSnapshot, DirectorySnapshot, buildVersionChanges. Replace the fragile append-reverse rollback / reverseChange which only worked for the narrow "undo the last few appends" shape the old pipeline needed.

  • gate-prompt.ts and judge-models.ts: migration targets for the Phase 1 gate prompt, the GateJudgeResult schema, and the JUDGE_MODEL_* constants. Migrated BEFORE the source files were deleted.
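
As an illustration of the pure-function style described for invariant-check.ts, here is a minimal sketch of two of the nine checks: the I4 zero-byte hard bound and the two-tier I6 content safety check. The regexes, the concrete host names in the allow-list, and the names `InvariantReport` and `checkContent` are assumptions made for the example; only the credential markers and the list of allowed services come from the description above.

```typescript
interface InvariantReport {
  hard: string[]; // failures that would trigger a rollback
  soft: string[]; // warnings that would only be logged
}

// Credential markers named in the PR description; the exact regexes are assumptions.
const CREDENTIAL_PATTERNS: RegExp[] = [/sk-ant-/, /ANTHROPIC_API_KEY/, /api_key\s*=/i];

// Assumed concrete hosts for "github/slack/telegram/anthropic/localhost".
const ALLOWED_HOSTS: string[] = [
  "github.com",
  "slack.com",
  "telegram.org",
  "anthropic.com",
  "localhost",
];

function isAllowedHost(host: string): boolean {
  return ALLOWED_HOSTS.some((allowed) => host === allowed || host.endsWith("." + allowed));
}

// Pure function: takes file contents, returns a report, performs no I/O.
function checkContent(files: Map<string, string>): InvariantReport {
  const report: InvariantReport = { hard: [], soft: [] };
  files.forEach((content, path) => {
    // I4 (zero-byte hard bound): an emptied file is always a hard failure.
    if (content.length === 0) {
      report.hard.push(`I4: ${path} is empty`);
      return;
    }
    // I6 tier 1: credential-looking content is a hard failure.
    if (CREDENTIAL_PATTERNS.some((pattern) => pattern.test(content))) {
      report.hard.push(`I6: ${path} matches a credential pattern`);
    }
    // I6 tier 2: URLs outside the allow-list are soft warnings.
    const urls = content.match(/https?:\/\/[^\s)\]]+/g) ?? [];
    for (const url of urls) {
      const host = url.replace(/^https?:\/\//, "").split(/[/:]/)[0].toLowerCase();
      if (!isAllowedHost(host)) {
        report.soft.push(`I6: ${path} links to external host ${host}`);
      }
    }
  });
  return report;
}
```

Keeping the check free of I/O and LLM calls is what lets it run in milliseconds and be tested with plain fixtures.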

Model tiering

Two-stage return-signal pattern matching the Phase 1 gate. TypeScript always spawns Haiku first. The agent either does the work at its current tier or emits {"status":"escalate","target":"sonnet|opus","reason":"..."}. TypeScript respawns at the requested tier. Opus cannot escalate further. One escalation per stage, capped at three subprocess spawns per drain. The agent picks the tier based on task complexity; TypeScript does not classify batch size or content to pick tiers.
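
The escalation rules above can be sketched as a small loop. This is an illustrative reading, not the shipped code: `nextTier`, `runWithEscalation`, and the `spawn` callback are hypothetical names, and treating a sideways, downward, or unrecognized `target` as "no escalation" is an assumption.

```typescript
type Tier = "haiku" | "sonnet" | "opus";

// Shape of the sentinel as described above; fields other than status are optional.
interface Sentinel {
  status: string;
  target?: string;
  reason?: string;
}

const TIER_ORDER: Tier[] = ["haiku", "sonnet", "opus"];
const MAX_SPAWNS = 3; // haiku -> sonnet -> opus is the longest possible path

// Returns the tier to respawn at, or null if the drain should stop here.
function nextTier(current: Tier, sentinel: Sentinel): Tier | null {
  if (sentinel.status !== "escalate") return null;
  const target = TIER_ORDER.find((tier) => tier === sentinel.target);
  if (target === undefined) return null;
  // Only upward moves count, so Opus can never escalate further.
  return TIER_ORDER.indexOf(target) > TIER_ORDER.indexOf(current) ? target : null;
}

// `spawn` stands in for one subprocess run at the given tier.
function runWithEscalation(
  spawn: (tier: Tier) => Sentinel,
): { tiers: Tier[]; final: Sentinel } {
  let tier: Tier = "haiku"; // TypeScript always starts at Haiku
  let final = spawn(tier);
  const tiers: Tier[] = [tier];
  while (tiers.length < MAX_SPAWNS) {
    const next = nextTier(tier, final);
    if (next === null) break;
    tier = next;
    final = spawn(tier);
    tiers.push(tier);
  }
  return { tiers, final };
}
```

Note that the loop never inspects the batch: the only input to the tier decision is the sentinel the agent returned, which matches the statement that TypeScript does not classify content to pick tiers.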

Per-tier timeouts via AbortController: Haiku 60s, Sonnet 180s, Opus 300s.
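
A minimal sketch of how per-tier timeouts can be wired with AbortController follows. The timeout values are the ones stated above; `withTierTimeout` and the `work` callback are hypothetical, and how the real subprocess consumes the abort signal is not shown in this PR description.

```typescript
type Tier = "haiku" | "sonnet" | "opus";

// Timeouts from the PR description, in milliseconds.
const TIER_TIMEOUT_MS: Record<Tier, number> = {
  haiku: 60 * 1000,
  sonnet: 180 * 1000,
  opus: 300 * 1000,
};

// Runs `work` with a signal that aborts once the tier's timeout elapses.
// `work` stands in for the subprocess call and must honor the signal itself.
async function withTierTimeout<T>(
  tier: Tier,
  work: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number = TIER_TIMEOUT_MS[tier],
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await work(controller.signal);
  } finally {
    // Always clear the timer so a fast run does not keep the process alive.
    clearTimeout(timer);
  }
}
```

The optional `timeoutMs` parameter exists only so tests can use a short timeout instead of waiting a full minute.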

Queue retry and poison pile

New migration 26: ALTER TABLE evolution_queue ADD COLUMN retry_count INTEGER NOT NULL DEFAULT 0. New migration 27: CREATE TABLE evolution_queue_poison. Invariant hard failures increment retry count; rows graduate to the poison pile at retry_count >= 3. Transient subprocess failures (SIGKILL, timeout, parse fail with no writes) do NOT increment retry count because the row itself is innocent; they just leave the row in place for the next drain. listPoisonPile() ships in queue.ts as plumbing for a future operator inspection path.
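
The retry semantics above reduce to a small routing decision per queue row. The sketch below is an assumption-laden illustration: the outcome names, `RowAction`, and `routeRow` are hypothetical, and how the real queue.ts keeps or re-queues a failed row is not specified here. Only the threshold of 3 and the rule that transient failures do not bump the counter come from the description.

```typescript
// Hypothetical outcome labels for one row after a drain.
type DrainOutcome = "ok" | "invariant_failed" | "transient";

interface QueueRow {
  id: number;
  retry_count: number;
}

type RowAction =
  | { action: "remove" }
  | { action: "keep"; retry_count: number }
  | { action: "poison"; retry_count: number };

const POISON_THRESHOLD = 3; // rows graduate at retry_count >= 3

function routeRow(row: QueueRow, outcome: DrainOutcome): RowAction {
  switch (outcome) {
    case "ok":
      // Processed successfully: the row leaves the queue.
      return { action: "remove" };
    case "transient":
      // SIGKILL, timeout, or parse failure with no writes: the row is innocent,
      // so it stays in place with its retry count unchanged.
      return { action: "keep", retry_count: row.retry_count };
    case "invariant_failed": {
      // Hard invariant failure: bump the counter, then check the threshold.
      const retry_count = row.retry_count + 1;
      return retry_count >= POISON_THRESHOLD
        ? { action: "poison", retry_count }
        : { action: "keep", retry_count };
    }
  }
}
```

Separating the two failure classes keeps infrastructure flakiness from pushing healthy rows into the poison pile.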

Metrics

New reflection_stats block on metrics.json: drains, stage_haiku_runs, stage_sonnet_runs, stage_opus_runs, escalation counters, status counters, timeout counters per tier, sigkill counters, invariant_failed_hard, invariant_warned_soft, sentinel_parse_fail, total_cost_usd, compactions_performed, files_touched per path. Replaces the old judge_costs block. Auto-rollback metrics (rollback_count, sessions_since_consolidation) are deleted; auto-rollback has fired zero times across the fleet lifetime.

Docker compose hardening

Adds pids_limit: 256, oom_score_adj: -500, cpu_shares: 2048 to the phantom service. In order, these provide: an OS-level cap against a future fork-bomb regression, as secondary defense on top of the Phase 0 mutex; an OOM-killer bias toward qdrant/ollama rather than phantom under memory pressure; and 2x CPU weighting against qdrant/ollama under saturation. No cpus cap added; current "no cpu.max quota" behavior is correct (nr_throttled = 0 across the fleet lifetime).

Deletions (Phase 4 cleanup)

Full file deletions:

  • reflection.ts (the hardcoded for-loop)
  • validation.ts (the 6-gate validation pipeline with triple-judge minority veto for constitution and safety)
  • application.ts (applyDelta, applyApproved)
  • consolidation.ts (heuristic path; the LLM consolidation-judge has fired zero times lifetime)
  • golden-suite.ts (dead regression judge backing)
  • judges/client.ts, judges/consolidation-judge.ts, judges/constitution-judge.ts, judges/observation-judge.ts, judges/quality-judge.ts, judges/regression-judge.ts, judges/safety-judge.ts, judges/prompts.ts, judges/schemas.ts, judges/types.ts (whole judges/ directory)
  • Associated test files under __tests__/

engine.ts shrinks from 627 lines to roughly 280. Removes runCycle, daily cost cap, auto-rollback caller, recordJudgeCosts, pruneGoldenSuite, rollback, reverseChange, mergeCosts, totalCostFromJudgeCosts, isWithinCostCap, resetDailyCostIfNewDay. Preserves the Phase 0 activeCycle mutex as belt-and-suspenders.

constitution.ts, config.ts, metrics.ts, types.ts trimmed to match the new shape. VersionChange.type union changes from append|replace|remove to edit|compact|new|delete to match the file-operation model. EvolutionLogEntry.session_id replaced with drain_id plus session_ids[] because a drain can process many sessions. Old log entries on disk still parse via Partial<EvolutionVersion>.
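
To make the shape change concrete, here is a hedged sketch of the new VersionChange.type union and of reading an old log entry that still carries session_id. The field subset, the `legacy-` drain id, and the function name `migrateLegacyEntry` are assumptions for illustration; the actual types in types.ts carry more fields and may handle old rows differently.

```typescript
// New file-operation vocabulary, as listed above.
type VersionChangeType = "edit" | "compact" | "new" | "delete";

const VERSION_CHANGE_TYPES: VersionChangeType[] = ["edit", "compact", "new", "delete"];

// Runtime guard for values read from disk.
function isVersionChangeType(value: unknown): value is VersionChangeType {
  return VERSION_CHANGE_TYPES.some((known) => known === value);
}

// Only the fields relevant to the session_id -> drain_id change are modeled.
interface EvolutionLogEntry {
  drain_id: string;
  session_ids: string[];
}

// Old rows on disk may have session_id and lack the new fields.
type StoredLogEntry = Partial<EvolutionLogEntry> & { session_id?: string };

function migrateLegacyEntry(raw: StoredLogEntry): EvolutionLogEntry {
  if (raw.drain_id !== undefined && Array.isArray(raw.session_ids)) {
    // Already in the new shape.
    return { drain_id: raw.drain_id, session_ids: raw.session_ids };
  }
  // Assumption: synthesize a drain id from the single legacy session id.
  const sessionId = raw.session_id;
  return {
    drain_id: `legacy-${sessionId ?? "unknown"}`,
    session_ids: sessionId !== undefined ? [sessionId] : [],
  };
}
```

Doing the conversion at read time means old log files never need to be rewritten on disk.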

src/memory/consolidation.ts: simplified to heuristic-only. The LLM path (consolidateSessionWithLLM) called into the now-deleted judges directory. The heuristic path is what was actually running in production (the LLM consolidation-judge had zero lifetime calls). No behavior regression.

Net LOC delta

Source: approximately -2200 deletions, +600 additions. Tests: approximately -800 deletions (judge tests, validation tests, application tests, cost cap tests, golden suite tests), +700 additions (reflection-subprocess tests, invariant-check tests, snapshot tests, sentinel parser tests, subprocess-prompt pin tests, mutex tests, batch-processor tests). Net: approximately -1700 lines.

Test plan

  • bun test: 1400 pass / 10 skip / 0 fail. Up 13 from the 1387 baseline after the Phase 1+2 second fix pass merge.
  • bun run lint: clean.
  • bun run typecheck: clean.
  • Soak on a single instance for qualitative validation. Watch for: reflection_stats block appears in metrics.json after first drain, reflection subprocess completes, no invariant hard fails, compaction of the accumulated bloated user-profile.md produces a shorter file with semantic targeting.
  • Observe reflection_stats distribution after 24 hours of drain activity. Expected: 70 to 80 percent Haiku runs, 15 to 25 percent Sonnet, 1 to 5 percent Opus, zero invariant hard fails, 30 to 40 percent skip ratio.
  • Observe evolved memory files growing with meaningful content in domain-knowledge and strategies, not just user-profile appends.
  • Confirm per-evolution cost drops 7 to 15x versus the baseline.
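
For the 24-hour distribution check, a small helper along these lines could turn the per-tier run counters into percentages. The three counter names come from the reflection_stats block described under Metrics; `TierRuns` and `tierShares` are hypothetical, and reading metrics.json from disk is left out.

```typescript
// Subset of reflection_stats needed for the tier distribution.
interface TierRuns {
  stage_haiku_runs: number;
  stage_sonnet_runs: number;
  stage_opus_runs: number;
}

interface TierShares {
  haiku: number;
  sonnet: number;
  opus: number;
}

// Returns each tier's share of all subprocess runs, in percent.
function tierShares(stats: TierRuns): TierShares {
  const total = stats.stage_haiku_runs + stats.stage_sonnet_runs + stats.stage_opus_runs;
  if (total === 0) {
    // No drains yet: report zeros instead of dividing by zero.
    return { haiku: 0, sonnet: 0, opus: 0 };
  }
  return {
    haiku: (stats.stage_haiku_runs * 100) / total,
    sonnet: (stats.stage_sonnet_runs * 100) / total,
    opus: (stats.stage_opus_runs * 100) / total,
  };
}
```

Comparing the result against the expected 70 to 80 / 15 to 25 / 1 to 5 percent bands gives a quick read on whether the agent is escalating more often than planned.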

What this does not ship

  • CLI helper for the poison pile (phantom evolution list-poison and friends). Deferred to a follow-up PR. The SQLite table and the markFailed / auto-graduation-at-3 logic ship; operators can sqlite3 phantom.db 'SELECT * FROM evolution_queue_poison' directly in the interim.
  • File-splitting restructuring operations. The reflection subprocess can propose restructuring in its sentinel but does not execute it. Deferred to a future PR with explicit operator review on first use.

Known deploy drift

  • config/evolution.yaml on instances with the previous code will produce non-fatal zod warnings on the first restart until the trimmed YAML lands alongside the code.
  • phantom-config/meta/metrics.json on instances with the previous code still carries the old judge_costs and rollback_count blocks. readMetrics silently drops unknown fields on write; the blocks bleed out naturally.
  • src/mcp/resources.ts serializes getVersionHistory() for the MCP changelog resource. New entries use the new VersionChange shape. Old entries still parse via Partial<EvolutionVersion>. Downstream consumers of the MCP changelog see the new vocabulary on fresh data.

The learning loop is now Cardinal Rule compliant end to end. The six-judge
validation pipeline and the hardcoded buildCritiqueFromObservations for-loop
are deleted. In their place:

- reflection-subprocess.ts spawns the Agent SDK as a sandboxed memory
  manager. The agent reads the batch, reads the memory files, and decides
  what to learn, what to compact, when to skip, when to promote between
  files, and which model tier to run at. TypeScript is plumbing: snapshot,
  spawn, parse sentinel, invariant check, commit or rollback.
- subprocess-prompt.ts carries the teaching prompt as a TypeScript
  constant with a runtime-facts header. The prompt teaches file purposes,
  signal selection, format, compaction, promotion, skip, escalation, and
  constitution immutability. Includes one worked good/bad bullet example.
- invariant-check.ts implements nine deterministic post-write invariants.
  Hard fails roll back via snapshot, soft warnings log. I6 is two-tier:
  hard fail on credential patterns, soft warn on external URLs. I4 growth
  cap per file is 80 lines.
- versioning.ts adds snapshot/restore and deletes the broken rollback
  reverse-change path.
- queue.ts adds retry_count and the evolution_queue_poison table for
  bounded retries on invariant failures (three strikes then graduate).
- metrics.ts adds reflection_stats and drops judge_costs and auto-rollback.
- Docker compose adds pids_limit=256, oom_score_adj=-500, cpu_shares=2048
  on the phantom service. No cpus cap.
- Trims engine.ts from 627 to ~280 lines, deletes reflection.ts,
  validation.ts, application.ts, consolidation.ts, golden-suite.ts, and
  the whole judges/ directory (10 source files + 7 test files).

Net: approximately -2200 lines of source, +600 lines of new source, +700
lines of new tests, -800 lines of deleted tests. Test count 1400 pass /
10 skip / 0 fail (up 13 from the 1387 baseline). Lint clean. Typecheck
clean.

Fixes the 100 percent appends-to-user-profile pathology, reduces evolution
cost roughly 7 to 15x, and unlocks the chat channel project on a clean
memory substrate.

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf1a06c889

Comment on lines +164 to +168
const result = await runReflectionSubprocess({
    batch,
    config: this.config,
    phantomConfig: this.runtime ? this.runtime.getPhantomConfig() : null,
});

P1: Honor reflection.enabled before spawning subprocess

resolveReflectionMode() computes this.reflectionEnabled from config/credentials, but the batch path still calls runReflectionSubprocess unconditionally here (and in runSingleSession). With reflection.enabled: "never" (or auto with no auth), drains still attempt SDK calls, return transient errors, and rows stay in evolution_queue forever because cadence only removes ok/invariant-failed rows. This makes the disable flag ineffective and can wedge queue processing in non-credential environments.

const session = q.session_summary;
if (this.countedSessionKeys.has(session.session_key)) continue;
this.countedSessionKeys.add(session.session_key);
updateAfterSession(this.config, session.outcome, false);

P2: Stop hardcoding correction flag to false

Session metrics are updated with hadCorrections fixed to false, so correction_count/correction_rate_7d can never increase regardless of actual session content. Since these fields are still persisted and exposed (for example via MCP metrics output), this silently degrades telemetry accuracy and makes correction-rate reporting permanently zero.

Eleven items from the independent review of PR #64. Three CRITs land
the dispatcher fixes that block merge; five MAJORs close test gaps
and type drift; three MINORs are cosmetic cleanups bundled into the
same commit.

CRIT-1 short-circuits runDrainPipeline and runSingleSession when
reflection.enabled is false so a no-auth install does not spawn the
SDK subprocess on every cadence tick. The disabled-mode path still
records stats and ticks session_count.

CRIT-2 deletes correction_count and correction_rate_7d entirely. The
reflection_stats.files_touched map is the single source of truth for
per-file drain counts; TypeScript stops pretending to know what a
correction is.

CRIT-3 turns the implicit ok/invariantFailed boolean pair on
SessionBatchEntry into an explicit four-value disposition enum so
the cadence routes queue rows by name, not by deriving a boolean
from an unrelated string field.

MAJOR-1 adds engine and cadence integration tests for the disable
path. MAJOR-2 adds a cadence transient-failure test that pins the
rows-stay-in-queue-without-retry-bump semantics. MAJOR-3 deletes
dead code in reflection-subprocess.ts and threads drainId into the
SDK user message so the agent can read its batch file directly.
MAJOR-4 renames getHistory to getEvolutionLog, returns the new
EvolutionLogEntry shape, and adds migrateOldLogEntry for read-time
backward compat against old log rows on disk. MAJOR-5 drops compact
from the SubprocessStatus union; compaction is a per-change
annotation only.

MINOR-4 deletes the vestigial ConstitutionChecker class and inlines
the existence check into the engine constructor. MINOR-5 rewrites
the docker-compose memory comment to describe the Phase 3 profile.
MINOR-7 sets the rollback branch status label to skip so the
top-level field matches the truth on disk.

mcheemaa merged commit 3fdfe34 into main on Apr 15, 2026
1 check passed