The world-class revamp: corrected economics, context discipline, a deep surface, the
learning moat wired end-to-end, a falsifiable benchmark, and engine-owned worktrees.
Added
- Worktree Phase 1, engine-owned (`pqa/worktrees.py`): one isolated git worktree
per branch on ephemeral `pqa/<run>-bN` branches, with a write-ahead registry in
`.pqa/state.json` so strays survive even a mid-run SIGKILL; rollback on partial
spawn; `reconcile()` merges `--no-ff`, aborts on conflict preserving the survivor
branch, and always prunes. `Branch.workdir` + `run(workdirs=...)` thread isolation
through the engine; `spawn_branches.sh`/`reconcile.sh` became thin engine callers;
the orchestrator and reconciler honor `branches_mode = "worktree"` with a stray
sweep at init. Zero-orphan recovery is a tested invariant.
- Locked eval benchmark: 8 tasks under
`evals/tasks/` (each `task.toml` + LOCKED
`verify.py` + `reference.py` must-pass + `sabotage.py` must-fail) and
`scripts/eval_harness.py` (deterministic `score`/`report`/`smoke`, zero model
calls); `/eval` and `pqa-eval-runner` wired to it; nightly `eval-smoke` workflow
re-proves verifier integrity. The README documents the methodology; live numbers
land only from a live run, losses included.
- The learning moat, wired: conviction signals get their outcomes back-filled
post-collapse; `pqa/instincts.py` synthesizes instincts from precipitates + failures
(overlap clustering; confidence from support and contradictions); prior-art injects
top instincts; `RunReport` carries `instincts_injected` and per-instinct agreement;
the dashboard gains calibration + instincts sections; the self-reflector reads the
engine's `calibration()`.
- Run resume: crash-resumable run journal (`pqa/state.py` → `.pqa/state.json`,
atomic tmp+rename): `/pqa --resume` re-enters at the first incomplete stage.
Journal writes preserve foreign top-level keys (the file is shared with the
worktree registry).
- Generated configuration reference:
`docs/configuration.md` rendered from
`pqa/config.py` by `scripts/generate_config_doc.py`, drift-pinned by tests.
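
The run-resume entry above describes an atomic tmp+rename write that preserves foreign top-level keys in a shared state file. A minimal sketch of that pattern, not the shipped `pqa/state.py`: the function name `write_journal` and the `"journal"` key are hypothetical, chosen only for illustration.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_journal(path: Path, journal: dict, key: str = "journal") -> None:
    """Update one top-level key of a shared JSON file via tmp+rename.

    `key` is a hypothetical name; the real journal layout is not shown here.
    """
    try:
        state = json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        state = {}
    # Only our own key is touched; keys owned by other writers pass through.
    state[key] = journal

    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one
    # filesystem, which is what makes os.replace atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Leave no stray temp file behind if the write or rename failed.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

A reader of the file sees either the old content or the new content, never a half-written one, which is the property a crash-resumable journal depends on.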
Changed
- Economics corrected; tokens primary: cost-model defects fixed, model aliases
(fable/opus/sonnet/haiku) wired to real pricing/dispatch via
`pqa.cost.resolve_model`, budgets token-primary with USD secondary, and a
pre-flight `would_abort` gate before every dispatch (not just after the spend).
- Model routing per role: Fable 5 where output quality is decided (generators,
unknown-scout, adversary, collapse-judge, baseline control); sonnet/haiku for
mechanical and bookkeeping tiers. "Every agent on Opus" is gone from docs and
dispatch.
- Context discipline in the orchestrator: branch payloads live on disk and are
read only by the subagent that needs them; the orchestrator holds ≤200 tokens of
state per branch (digests only) and reports per-stage context telemetry.
- Surface: depth over breadth. 34 agents · 59 skills · 27 commands trimmed to
14 agents · 12 commands · 12 deep skills (each skill a protocol + worked
example + anti-patterns playbook); `validate_components.py` gained census, depth,
and description-budget gates and drift-gates `docs/catalog.json`.
- Memory retrieves by relevance under a hard token budget (not recency), and
every injected memory id is reported per run.
- Docs truth pass: README (counts, hook claims, stage wording, workflow count,
status), architecture.md (rewritten to the shipped reality), CONTRIBUTING and the
plugin manifests; hooks language unified with SECURITY.md (two blocking hooks,
the rest fail open; the binding guarantee lives in CI).
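
The economics entry above mentions token-primary budgets with a USD secondary limit and a pre-flight `would_abort` gate checked before each dispatch. The sketch below illustrates that idea under stated assumptions: only the name `would_abort` comes from this changelog, while the `Budget` class, its fields, `BudgetExceeded`, and `dispatch` are hypothetical and do not describe the actual `pqa.cost` API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class BudgetExceeded(RuntimeError):
    """Raised before a dispatch that would overrun the budget (hypothetical)."""


@dataclass
class Budget:
    max_tokens: int                  # primary limit
    max_usd: Optional[float] = None  # secondary limit, optional
    spent_tokens: int = 0
    spent_usd: float = 0.0

    def would_abort(self, est_tokens: int, est_usd: float = 0.0) -> bool:
        """True if spending the estimate would cross either limit."""
        if self.spent_tokens + est_tokens > self.max_tokens:
            return True
        return self.max_usd is not None and self.spent_usd + est_usd > self.max_usd

    def record(self, tokens: int, usd: float = 0.0) -> None:
        self.spent_tokens += tokens
        self.spent_usd += usd


def dispatch(budget: Budget, est_tokens: int, est_usd: float,
             call: Callable[[], str]) -> str:
    # Pre-flight: refuse before any spend happens, not after.
    if budget.would_abort(est_tokens, est_usd):
        raise BudgetExceeded(f"estimate of {est_tokens} tokens exceeds budget")
    result = call()
    # Assumption for brevity: the estimate is recorded as the actual spend.
    budget.record(est_tokens, est_usd)
    return result
```

Checking before the call means an over-budget request never reaches the model, so the recorded spend cannot exceed the limit by the size of one dispatch.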
Fixed
- Hook hardening: per-hook kill-switches (`PQA_DISABLED_HOOKS`), once-per-session
research gate, fail-closed fixes on the security/secrets gates.
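
As an illustration of the per-hook kill-switch above, a hook might consult `PQA_DISABLED_HOOKS` before running. This is a sketch only: the comma-separated format, the helper name `hook_disabled`, and the hook names in the usage line are assumptions, not the documented behavior.

```python
import os
from typing import Mapping, Optional


def hook_disabled(hook_name: str,
                  environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if `hook_name` is listed in PQA_DISABLED_HOOKS.

    Assumes a comma-separated list; the real format may differ.
    """
    env = os.environ if environ is None else environ
    raw = env.get("PQA_DISABLED_HOOKS", "")
    disabled = {name.strip() for name in raw.split(",") if name.strip()}
    return hook_name in disabled


# Usage with hypothetical hook names:
# hook_disabled("research-gate", {"PQA_DISABLED_HOOKS": "research-gate, lint"})
```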