Context
Implements the protocol surface defined in docs/spec/28-judge-layer.md (merged via #111, originating RFC #110). This is the first follow-up implementation issue for the judge layer — it lands the protocol, the canonical dataclasses, the filesystem-default reference judge, and the conformance suite. Subsequent issues build on top.
Mirrors the MemoryBackend protocol-pattern template established by PR #57 / spec/20.
Scope
Module layout
atomic_agents/judge/
├── __init__.py # registry: register_backend() / get_backend()
├── backend.py # JudgeBackend Protocol + all dataclasses + exception taxonomy
├── proposal.py # framework-side proposal assembly
├── llm.py # LLMJudgeBackend (default — wraps LLMBackend #87)
└── rules.py # PolicyJudge / RuleEngineJudgeBackend (always-on baseline)
Dataclasses (per spec §"Action proposal" + §"Four-outcome model" + §"Audit shape")
ActionProposal — full proposal with framework-introspected + actor-side-channel fields, including side_channel_for_tool_call_id binding
ProposalAmendment — judge-amendable subset (judge cannot forge framework-managed fields)
Evidence, Authorization, SkillRef — proposal sub-types
Judgment — what the backend returns (outcome / reason / amendment / escalation_queue_id)
JudgmentEvent — framework-wrapped audit shape (adds raw_outcome, enforcement_action, cost_source, binding)
JudgePolicyContext — what the judge sees (persona digest + tools.md entry + class_policy + recent runs + cited notes + delegate_chain + loaded_skills)
JudgeRuntimeConfig — framework-only (NEVER in LLM judge prompt; conformance test asserts)
JudgmentContext — wrapper containing both
ClassPolicySnapshot — per-class policy after project-floor + default-fill merging, with source per class
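As a rough illustration of the shapes listed above, the backend-facing return type could be modelled as frozen dataclasses. This is a minimal sketch, not the spec's definition: the four Judgment fields come from the list above, while the JudgmentOutcome member names and the ProposalAmendment fields are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class JudgmentOutcome(str, Enum):
    # Assumed member names for the spec's four-outcome model.
    ALLOW = "allow"
    REVISE = "revise"
    ESCALATE = "escalate"
    BLOCK = "block"


@dataclass(frozen=True)
class ProposalAmendment:
    # Hypothetical fields: only judge-amendable data lives here, so
    # framework-managed fields (hashes, tool_call_id) cannot be forged.
    arguments: Mapping[str, Any]
    rationale: str = ""


@dataclass(frozen=True)
class Judgment:
    # Field names taken from the issue text: outcome / reason /
    # amendment / escalation_queue_id.
    outcome: JudgmentOutcome
    reason: str
    amendment: Optional[ProposalAmendment] = None
    escalation_queue_id: Optional[str] = None


judgment = Judgment(outcome=JudgmentOutcome.BLOCK, reason="deny rule matched")
```

Freezing the dataclasses is one way to make the "evaluate does not mutate" conformance requirement cheap to uphold; whether the real implementation freezes them is not stated here.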
Protocol surface
from typing import Protocol, runtime_checkable

# ActionProposal, JudgmentContext, Judgment, and JudgmentOutcome are defined in backend.py.
@runtime_checkable
class JudgeBackend(Protocol):
    def evaluate(self, proposal: ActionProposal, context: JudgmentContext) -> Judgment: ...
    def supported_outcomes(self) -> set[JudgmentOutcome]: ...
    def supports_read_audit(self) -> bool: ...
    def supports_specialist_composition(self) -> bool: ...
    @property
    def judge_id(self) -> str: ...
    @property
    def policy_version(self) -> str: ...
    def close(self) -> None: ...
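To show what satisfying this surface looks like, here is a minimal sketch of a structural check against the protocol. AlwaysBlockJudge is a hypothetical backend invented for the example, and Judgment / JudgmentOutcome are trimmed stand-ins rather than the real dataclasses.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Protocol, runtime_checkable


class JudgmentOutcome(str, Enum):  # assumed member names
    ALLOW = "allow"
    BLOCK = "block"


@dataclass(frozen=True)
class Judgment:  # trimmed stand-in for the real dataclass
    outcome: JudgmentOutcome
    reason: str


@runtime_checkable
class JudgeBackend(Protocol):
    def evaluate(self, proposal: object, context: object) -> Judgment: ...
    def supported_outcomes(self) -> set[JudgmentOutcome]: ...
    def supports_read_audit(self) -> bool: ...
    def supports_specialist_composition(self) -> bool: ...
    @property
    def judge_id(self) -> str: ...
    @property
    def policy_version(self) -> str: ...
    def close(self) -> None: ...


class AlwaysBlockJudge:
    """Hypothetical backend that blocks every proposal."""

    judge_id = "always-block"
    policy_version = "v0"

    def evaluate(self, proposal: object, context: object) -> Judgment:
        return Judgment(JudgmentOutcome.BLOCK, "default deny")

    def supported_outcomes(self) -> set[JudgmentOutcome]:
        return {JudgmentOutcome.BLOCK}

    def supports_read_audit(self) -> bool:
        return False

    def supports_specialist_composition(self) -> bool:
        return False

    def close(self) -> None:
        pass  # nothing to release, so repeated calls are harmless


backend = AlwaysBlockJudge()
print(isinstance(backend, JudgeBackend))
```

Note that isinstance() against a runtime_checkable protocol only verifies that the members exist, not their signatures or return types, which is one reason a behavioural conformance suite is still needed.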
Exception taxonomy
JudgeError (base) + JudgeUnavailable / JudgePolicyInvalid / JudgeBudgetExhausted / JudgeProposalInvalid / JudgeAmendedProposalRejected. Each maps to a default outcome via judges.md's failure_policy (default: all → block).
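The taxonomy and its default mapping can be sketched as follows. The class names come from the text above; the dictionary form of failure_policy, its keys, and the outcome_for_failure helper are assumptions for illustration, since the judges.md schema is defined in the spec rather than here.

```python
from typing import Mapping, Optional


class JudgeError(Exception):
    """Base class for judge-layer failures."""


class JudgeUnavailable(JudgeError): ...
class JudgePolicyInvalid(JudgeError): ...
class JudgeBudgetExhausted(JudgeError): ...
class JudgeProposalInvalid(JudgeError): ...
class JudgeAmendedProposalRejected(JudgeError): ...


# Default stated in the issue: every failure maps to "block".
DEFAULT_FAILURE_POLICY = {
    cls.__name__: "block"
    for cls in (
        JudgeUnavailable,
        JudgePolicyInvalid,
        JudgeBudgetExhausted,
        JudgeProposalInvalid,
        JudgeAmendedProposalRejected,
    )
}


def outcome_for_failure(
    exc: JudgeError, failure_policy: Optional[Mapping[str, str]] = None
) -> str:
    """Resolve a judge failure to an outcome (hypothetical helper).

    Overrides are keyed by exception class name, which is an assumed format.
    Anything not covered falls back to "block".
    """
    policy = {**DEFAULT_FAILURE_POLICY, **(failure_policy or {})}
    return policy.get(type(exc).__name__, "block")
```

For example, outcome_for_failure(JudgeUnavailable()) resolves to "block" under the defaults, while passing {"JudgeUnavailable": "escalate"} would resolve it to "escalate".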
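Default reference implementations
PolicyJudge (rule-engine, atomic_agents/judge/rules.py) — always-on baseline; matches tools.md write paths, allowlists, deny rules, and class-policy enforcement. Microseconds latency.
LLMJudgeBackend (LLM-backed, atomic_agents/judge/llm.py) — runs after PolicyJudge if PolicyJudge allowed. Default model gpt-5-nano (OpenAI; a different family than the default Anthropic actor, per the correlated-judgment mitigation). Wraps LLMBackend (#87); composition only takes effect after #87 lands.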
Framework integration
atomic_agents/agent.py — agent.call() multi-turn loop gains judge dispatch between LLM tool_use parsing and tool handler dispatch (per spec §"Where the judge sits in agent.call()")
atomic_agents/_canonical.py — new helper module for canonical-JSON hashing of arguments_hash + tool_definition_hash (sort_keys=True, separators=(',',':'), ensure_ascii=False); tool_definition_hash covers module + qualname, NOT bytecode
atomic_agents/_costs.py — cost_source field added to cost-event dataclass with Literal['actor', 'judge'] = 'actor' default for legacy-records backward-compat; sum_cost_for_period() gains source: Literal['actor','judge'] | None = None filter
atomic_agents/judges_md.py — new judges.md parser following the parser-rules section in the spec (default-fill class policy + failure policy, JudgePolicyInvalid on malformed YAML)
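The canonical-JSON hashing described for _canonical.py could look roughly like this. The json.dumps parameters are the ones named above; the choice of SHA-256 and the function names are assumptions, since the issue does not name a digest or an API.

```python
import hashlib
import json
from typing import Any, Callable, Mapping


def canonical_json(value: Any) -> str:
    # Parameters as listed in the issue: sorted keys, no whitespace,
    # non-ASCII characters kept as-is.
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def _digest(value: Any) -> str:
    # SHA-256 is an assumption; the issue does not specify the hash function.
    return hashlib.sha256(canonical_json(value).encode("utf-8")).hexdigest()


def arguments_hash(arguments: Mapping[str, Any]) -> str:
    """Hash tool-call arguments independently of key order."""
    return _digest(dict(arguments))


def tool_definition_hash(handler: Callable[..., Any]) -> str:
    """Hash the handler's identity (module + qualname), not its bytecode."""
    return _digest({"module": handler.__module__, "qualname": handler.__qualname__})
```

Because only module and qualname are hashed, editing a handler's body leaves tool_definition_hash unchanged, while renaming or moving the handler changes it.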
Conformance suite (~30 tests per spec)
In tests/test_judge_backend_conformance.py, parameterized across the two reference backends:
- evaluate returns valid Judgment for each outcome in supported_outcomes()
- evaluate does not mutate proposal or JudgePolicyContext
- Latency bounded by configurable timeout → JudgeUnavailable
- Concurrent evaluate does not corrupt named state (policy cache, LLM client, judge budget counter, ensemble vote buffer, JSONL writer position, escalation queue file, backend registry)
- policy_version changes on policy source change; atomic snapshot (no partial reads); invalid UTF-8 → JudgePolicyInvalid
- Framework recomputes amended-proposal classification; judge cannot influence it
- Schema-invalid amended → JudgeAmendedProposalRejected
- Stricter class applies when amended class higher than original
- Revise loop bounded (second revise → BLOCK with revise_loop_exhausted)
- Exception taxonomy maps per failure_policy
- Side-channel mismatch detection (missing / unbound / duplicate)
- Audit JSONL includes tool_definition_hash, arguments_hash, tool_call_id, raw_outcome, enforcement_action, cost_source
- Read-audit mode bypasses block but emits event with enforcement_action: "audit_bypass"
- Escalation writes PENDING file with full proposal; resolution writes RESOLVED event; redacted leaves marker
- Hash determinism + sensitivity
- Project-floor judges.md cannot be relaxed by delegate judges.md → JudgePolicyInvalid at load time
- JudgeRuntimeConfig never appears in LLM judge prompt (conformance test reads the assembled prompt and asserts)
- close() idempotent
Plus per-backend tests for PolicyJudge (rule matching, write-path enforcement) and LLMJudgeBackend (prompt assembly, structured judgment parsing, policy_source hash invalidation).
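Two of the conformance checks above (non-mutation and idempotent close()) can be sketched as backend-agnostic helpers. FakeProposal and RecordingJudge are hypothetical stand-ins so the example runs on its own; the real suite would iterate over the two reference backends and would likely use the project's test runner rather than a bare loop.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class FakeProposal:
    # Hypothetical stand-in for ActionProposal.
    tool_name: str
    arguments: dict = field(default_factory=dict)


class RecordingJudge:
    # Hypothetical backend under test; counts close() calls.
    def __init__(self) -> None:
        self.closed = 0

    def evaluate(self, proposal, context):
        return "block"

    def close(self) -> None:
        self.closed += 1


def check_evaluate_does_not_mutate(backend, proposal, context) -> None:
    # Snapshot inputs by deep copy, then compare after the call.
    before = (copy.deepcopy(proposal), copy.deepcopy(context))
    backend.evaluate(proposal, context)
    assert (proposal, context) == before, "evaluate mutated its inputs"


def check_close_idempotent(backend) -> None:
    backend.close()
    backend.close()  # a second call must not raise


# Stand-in for "parameterized across the two reference backends".
REFERENCE_BACKENDS = [RecordingJudge]

for factory in REFERENCE_BACKENDS:
    check_evaluate_does_not_mutate(
        factory(),
        FakeProposal("write_file", {"path": "notes/a.md"}),
        {"class_policy": {}},
    )
    check_close_idempotent(factory())
```

Comparing against a deep copy catches in-place edits to nested structures such as the arguments dict, which a shallow comparison of object identity would miss.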
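Dependencies
LLMJudgeBackend wraps LLMBackend. This issue blocks until #87 ([backend] LLMBackend Protocol + canonical types + native reference implementations) ships; once #87 lands, the LLM judge implementation can be completed.
Split decision: the alternative is to ship PolicyJudge + protocol + conformance suite first (no LLM dependency) and layer LLMJudgeBackend on once #87 is in. Either approach is valid; the size of the implementation PR will determine which one is used.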
Doctor checks
Added by this impl:
check_judge_health — recent JudgeUnavailable rate
check_judge_policy_sync — tools.md + judges.md hash lag warning
check_judge_policy_floor — project-floor relax-violation surfacer
check_judge_model_family — warns when configured judge model family matches actor model family
vault_synced_judge_captures_off — warns on Obsidian Sync / iCloud / syncthing signals + judge_captures: false
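As one illustration of these checks, check_judge_health could be computed from recent audit events along these lines. This is a sketch under assumptions: the event shape (an "error" key holding the exception class name), the 5% threshold, and the return convention (warning string or None) are all invented for the example.

```python
from typing import Iterable, Mapping, Optional


def check_judge_health(
    events: Iterable[Mapping[str, object]], threshold: float = 0.05
) -> Optional[str]:
    """Return a warning when the recent JudgeUnavailable rate is too high.

    `events` is assumed to be the most recent judgment audit records, each
    carrying an optional "error" field (hypothetical schema).
    """
    recent = list(events)
    if not recent:
        return None  # nothing to measure yet
    unavailable = sum(1 for e in recent if e.get("error") == "JudgeUnavailable")
    rate = unavailable / len(recent)
    if rate > threshold:
        return (
            f"JudgeUnavailable rate {rate:.1%} exceeds {threshold:.1%} "
            f"over the last {len(recent)} judgments"
        )
    return None
```

Returning None for an empty window avoids a division by zero and keeps a freshly initialised project from reporting a spurious warning.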
Estimated size
~1500-2500 lines of implementation (protocol + filesystem-default + agent.py wiring + canonical hash helper + judges.md parser + costs.py changes). ~30 conformance tests + ~20 per-backend tests + ~10 framework-integration tests. Roughly the shape + scale of PR #57 (MemoryBackend).
Out of scope (filed as separate issues)
References
docs/spec/28-judge-layer.md — the spec this implements
docs/spec/20-memory-backend.md + PR #57 (refactor(memory): extract MemoryBackend protocol; FilesystemBackend default) — protocol-pattern template