v0.6.0 — evaluator agent + kind=agent installer support
First customization beyond skills: the evaluator sub-agent — a read-only fresh-context grader. Caller supplies an ARTIFACT: reference and a numbered RUBRIC: inline in the dispatch prompt; the evaluator returns a structured PASS / NEEDS_WORK verdict with per-rubric-item reasoning citing the artifact.
Tool allowlist is [Read, Glob, Grep] only — no Bash, no Write, no Edit. The "fresh context that never saw the build" framing is structurally protected: the evaluator cannot re-execute, cannot mutate, cannot fetch.
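As a rough illustration of the input contract described above, a caller might assemble the dispatch prompt and read back the verdict as sketched below. The function names, the example artifact path, and the exact `Verdict:` line layout are assumptions made for this sketch; agents/evaluator.md remains the authority on the real contract.

```python
import re


def build_dispatch_prompt(artifact: str, rubric: list[str]) -> str:
    """Assemble a prompt with labeled ARTIFACT: and RUBRIC: sections.

    Hypothetical helper: only the two labels and the numbered rubric
    come from the release notes; the spacing is a guess.
    """
    if not rubric:
        raise ValueError("rubric must contain at least one item")
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(rubric, start=1))
    return f"ARTIFACT: {artifact}\n\nRUBRIC:\n{numbered}\n"


def parse_verdict(reply: str) -> str:
    """Extract PASS or NEEDS_WORK from a line starting with 'Verdict:'.

    The 'Verdict:' prefix is an assumed layout, not a documented one.
    """
    match = re.search(r"^Verdict:\s*(PASS|NEEDS_WORK)\b", reply, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no Verdict line found in evaluator reply")
    return match.group(1)


# Example usage with a made-up artifact path and rubric.
prompt = build_dispatch_prompt(
    "docs/design.md",
    ["Every public function has a docstring", "No TODO markers remain"],
)
```

Numbering the rubric in the helper, rather than trusting the caller to do it, keeps the per-item reasoning in the reply easy to match back to the item it grades.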
Paired with agentic-harness v2.1.0, which adds a new optional §3b "evaluator augmentation" section to /review documenting how to dispatch the evaluator alongside the existing adversarial-reviewer. The two are complementary: adversarial finds defects ("the code contains bugs"); evaluator grades against an explicit rubric ("did this satisfy claims 1–5?"). The harness works standalone without the toolkit — §3b graceful-skips when agent-toolkit is absent.
Highlights
- agents/evaluator.md: full body covering Purpose / Tool allowlist / Input contract (ARTIFACT + RUBRIC labeled sections) / Output contract (PASS/NEEDS_WORK header + per-rubric-item PASS/FAIL + Verdict line) / Workflow / 4 Failure modes / 7 Anti-patterns / 2 worked examples.
- First-class kind: agent dispatch across install.sh + install.ps1: per-host paths (single file for claude-code + gemini-cli; sub-agent-as-skill wrap for Antigravity); new --agent flag; validate-manifests.py knows agents; bundle dispatch handles inner agents.
- ADR 0002 (evaluator design): captures Context (forces driving the new primitive), Decision (5 locked design calls), Consequences (5 positive + 4 negative + load-bearing assumptions).
- How to use the evaluator: practical recipe with three worked rubrics and four common failure modes with symptoms + fixes.
- MANAGED_PARENTS extended with .claude/agents + .gemini/agents for true-sync --update orphan cleanup of toolkit-managed agents.
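The orphan cleanup mentioned in the last item can be pictured roughly as follows. This is a sketch only: the real logic lives in install.sh and install.ps1, and the function names, the `previously_installed` record, and the dry-run default are all assumptions introduced here. Only the two directory names come from the notes above.

```python
from pathlib import Path

# Directory names taken from the release notes; the rest is illustrative.
MANAGED_PARENTS = [".claude/agents", ".gemini/agents"]


def find_orphans(root: Path, previously_installed: set[str], expected: set[str]) -> list[Path]:
    """Return toolkit-installed files that the current manifest no longer lists.

    Both sets hold POSIX-style paths relative to `root`. Files the toolkit
    never installed are not candidates, so user-authored agents are left alone.
    """
    orphans = []
    for rel in sorted(previously_installed - expected):
        # Only touch paths inside a managed parent directory.
        if not any(rel.startswith(parent + "/") for parent in MANAGED_PARENTS):
            continue
        path = root / rel
        if path.is_file():
            orphans.append(path)
    return orphans


def remove_orphans(root: Path, previously_installed: set[str], expected: set[str],
                   dry_run: bool = True) -> list[Path]:
    """Delete orphans unless dry_run is set; return the affected paths either way."""
    orphans = find_orphans(root, previously_installed, expected)
    if not dry_run:
        for path in orphans:
            path.unlink()
    return orphans
```

Restricting deletion to paths the toolkit previously installed is one way to keep a true-sync update from removing agents a user added by hand to the same directories.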
Future consumers
The evaluator is the load-bearing primitive for several future roadmap items:
- Design skill (#6) — per-step grader for the design-doc → execution loop.
- Quality-gates bundle (#10) — packages evaluator + kill-switch + steer + commit-on-stop + evidence-tracker for one-shot install.
- Long-running custom skills — any skill that wants automated PASS / NEEDS_WORK grading on its own output.
See CHANGELOG.md for the full v0.6.0 entry.