v0.6.0 — evaluator agent + kind=agent installer support

@alexherrero alexherrero released this 14 May 01:59

The first customization beyond skills is the evaluator sub-agent, a read-only, fresh-context grader. The caller supplies an ARTIFACT: reference and a numbered RUBRIC: inline in the dispatch prompt; the evaluator returns a structured PASS / NEEDS_WORK verdict, with reasoning for each rubric item that cites the artifact.
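As a rough illustration of the input contract, a caller might assemble the dispatch prompt like this. This is a minimal sketch, not the toolkit's code: the helper name `build_dispatch_prompt`, the blank line between sections, the `1. ` numbering style, and the example path are all assumptions; only the `ARTIFACT:` and `RUBRIC:` labels come from these notes.

```python
from typing import Sequence


def build_dispatch_prompt(artifact_ref: str, rubric: Sequence[str]) -> str:
    """Assemble a prompt with labeled ARTIFACT: and RUBRIC: sections.

    Hypothetical helper: the exact layout expected by the evaluator is
    defined in agents/evaluator.md and is only approximated here.
    """
    if not artifact_ref.strip():
        raise ValueError("artifact reference must not be empty")
    if not rubric:
        raise ValueError("rubric needs at least one item")
    # Number the rubric items starting at 1 so the verdict can cite them.
    numbered = [f"{i}. {item.strip()}" for i, item in enumerate(rubric, start=1)]
    return "\n".join([f"ARTIFACT: {artifact_ref.strip()}", "", "RUBRIC:", *numbered])


prompt = build_dispatch_prompt(
    "docs/design.md",  # hypothetical artifact path
    [
        "States the problem being solved.",
        "Lists at least one rejected alternative.",
    ],
)
print(prompt)
```

Keeping the rubric numbered and inline means each item in the returned verdict can be matched back to the claim it grades.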

The tool allowlist is [Read, Glob, Grep] only: no Bash, no Write, no Edit. This structurally protects the "fresh context that never saw the build" framing, because the evaluator cannot re-execute, cannot mutate, and cannot fetch.
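The read-only guarantee amounts to a set-membership check on the tools an agent declares. The sketch below shows that idea only; how the toolkit actually declares or validates the allowlist is not described in these notes, and `READ_ONLY_TOOLS` and `disallowed_tools` are hypothetical names.

```python
# Allowlist taken from the release notes; the constant name is hypothetical.
READ_ONLY_TOOLS = frozenset({"Read", "Glob", "Grep"})


def disallowed_tools(declared):
    """Return the declared tools that fall outside the read-only allowlist, sorted."""
    return sorted(set(declared) - READ_ONLY_TOOLS)


# An agent that declares exactly the allowlist has nothing to flag.
assert disallowed_tools(["Read", "Glob", "Grep"]) == []

# Any execution or mutation tool is reported.
print(disallowed_tools(["Read", "Bash", "Write"]))  # prints ['Bash', 'Write']
```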

This release is paired with agentic-harness v2.1.0, which adds a new optional §3b "evaluator augmentation" section to /review that documents how to dispatch the evaluator alongside the existing adversarial-reviewer. The two are complementary: the adversarial-reviewer finds defects ("the code contains bugs"); the evaluator grades against an explicit rubric ("did this satisfy claims 1–5?"). The harness works standalone without the toolkit: §3b is skipped gracefully when agent-toolkit is absent.

Highlights

  • agents/evaluator.md — full body covering Purpose / Tool allowlist / Input contract (ARTIFACT + RUBRIC labeled sections) / Output contract (PASS/NEEDS_WORK header + per-rubric-item PASS/FAIL + Verdict line) / Workflow / 4 Failure modes / 7 Anti-patterns / 2 worked examples.
  • First-class kind: agent dispatch across install.sh + install.ps1: per-host paths (single file for claude-code + gemini-cli; sub-agent-as-skill wrap for Antigravity); new --agent flag; validate-manifests.py knows agents; bundle dispatch handles inner agents.
  • ADR 0002 — evaluator design — captures Context (forces driving the new primitive), Decision (5 locked design calls), Consequences (5 positive + 4 negative + load-bearing assumptions).
  • How to use the evaluator — practical recipe with three worked rubrics and four common failure modes with symptoms + fixes.
  • MANAGED_PARENTS extended with .claude/agents + .gemini/agents for true-sync --update orphan cleanup of toolkit-managed agents.
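On the consuming side, a caller could parse the evaluator's report along the lines of the output contract listed in the highlights (header, per-rubric-item PASS/FAIL, Verdict line). The sketch below is illustrative only: the exact wire format lives in agents/evaluator.md and is not reproduced in these notes, so the `N. PASS - reason` item layout, the `Verdict:` prefix, the sample report, and the rule that PASS requires every item to pass are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class ItemResult:
    number: int
    passed: bool
    reasoning: str


# Assumed line shapes; the real format may differ.
ITEM_RE = re.compile(r"^(\d+)\.\s+(PASS|FAIL)\b[\s:-]*(.*)$")
VERDICT_RE = re.compile(r"^Verdict:\s*(PASS|NEEDS_WORK)\s*$")


def parse_report(text):
    """Extract the verdict and per-item results from an evaluator report."""
    items, verdict = [], None
    for line in text.splitlines():
        line = line.strip()
        if m := ITEM_RE.match(line):
            items.append(ItemResult(int(m.group(1)), m.group(2) == "PASS", m.group(3)))
        elif m := VERDICT_RE.match(line):
            verdict = m.group(1)
    if verdict is None:
        raise ValueError("no Verdict line found")
    # Assumed consistency rule: the verdict is PASS only if every item passed.
    if (verdict == "PASS") != all(item.passed for item in items):
        raise ValueError("verdict disagrees with per-item results")
    return verdict, items


sample = """NEEDS_WORK
1. PASS - problem statement appears in the first section
2. FAIL - no rejected alternatives are listed
Verdict: NEEDS_WORK
"""
verdict, items = parse_report(sample)
print(verdict, [item.passed for item in items])  # prints NEEDS_WORK [True, False]
```

Rejecting a report whose verdict contradicts its item results is a cheap guard against acting on a malformed response.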

Future consumers

The evaluator is the load-bearing primitive for several future roadmap items:

  • Design skill (#6) — per-step grader for the design-doc → execution loop.
  • Quality-gates bundle (#10) — packages evaluator + kill-switch + steer + commit-on-stop + evidence-tracker for one-shot install.
  • Long-running custom skills — any skill that wants automated PASS / NEEDS_WORK grading on its own output.

See CHANGELOG.md for the full v0.6.0 entry.