[backend] JudgeBackend Protocol + filesystem-default reference implementation #112

@dep0we

Description

Context

Implements the protocol surface defined in docs/spec/28-judge-layer.md (merged via #111, originating RFC #110). This is the first follow-up implementation issue for the judge layer — it lands the protocol, the canonical dataclasses, the filesystem-default reference judge, and the conformance suite. Subsequent issues build on top.

Mirrors the MemoryBackend protocol-pattern template established by PR #57 / spec/20.

Scope

Module layout

atomic_agents/judge/
├── __init__.py        # registry: register_backend() / get_backend()
├── backend.py         # JudgeBackend Protocol + all dataclasses + exception taxonomy
├── proposal.py        # framework-side proposal assembly
├── llm.py             # LLMJudgeBackend (default — wraps LLMBackend #87)
└── rules.py           # PolicyJudge / RuleEngineJudgeBackend (always-on baseline)

Dataclasses (per spec §"Action proposal" + §"Four-outcome model" + §"Audit shape")

  • ActionProposal — full proposal with framework-introspected + actor-side-channel fields, including side_channel_for_tool_call_id binding
  • ProposalAmendment — judge-amendable subset (judge cannot forge framework-managed fields)
  • Evidence, Authorization, SkillRef — proposal sub-types
  • Judgment — what the backend returns (outcome / reason / amendment / escalation_queue_id)
  • JudgmentEvent — framework-wrapped audit shape (adds raw_outcome, enforcement_action, cost_source, binding)
  • JudgePolicyContext — what the judge sees (persona digest + tools.md entry + class_policy + recent runs + cited notes + delegate_chain + loaded_skills)
  • JudgeRuntimeConfig — framework-only (NEVER in LLM judge prompt; conformance test asserts)
  • JudgmentContext — wrapper containing both
  • ClassPolicySnapshot — per-class policy after project-floor + default-fill merging, with source per class
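
As a rough illustration of the shapes listed above, here is a minimal sketch of Judgment as a frozen dataclass. The field names come from the list; the enum member names (ALLOW, BLOCK, REVISE, ESCALATE), the field types, and the single ProposalAmendment field are assumptions made for the example, and the spec remains authoritative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JudgmentOutcome(Enum):
    # Member names are assumed; see the spec's "Four-outcome model" section.
    ALLOW = "allow"
    BLOCK = "block"
    REVISE = "revise"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class ProposalAmendment:
    # Hypothetical field: only the judge-amendable subset would live here,
    # never framework-managed fields.
    arguments: dict


@dataclass(frozen=True)
class Judgment:
    # What a backend returns from evaluate().
    outcome: JudgmentOutcome
    reason: str
    amendment: Optional[ProposalAmendment] = None
    escalation_queue_id: Optional[str] = None


judgment = Judgment(
    outcome=JudgmentOutcome.BLOCK,
    reason="write outside allowed path",
)
```

Freezing the dataclass is one way to make the "evaluate does not mutate" conformance requirement cheap to uphold, though the spec may choose differently.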

Protocol surface

@runtime_checkable
class JudgeBackend(Protocol):
    def evaluate(self, proposal: ActionProposal, context: JudgmentContext) -> Judgment: ...
    def supported_outcomes(self) -> set[JudgmentOutcome]: ...
    def supports_read_audit(self) -> bool: ...
    def supports_specialist_composition(self) -> bool: ...
    @property
    def judge_id(self) -> str: ...
    @property
    def policy_version(self) -> str: ...
    def close(self) -> None: ...
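
To show what satisfying this protocol looks like in practice, here is a minimal sketch with a toy backend. AlwaysBlockJudge is hypothetical, the parameter and return types are loosened to Any so the example runs without the real dataclasses, and the dict returned by evaluate() is a stand-in for Judgment.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class JudgeBackend(Protocol):
    # Same members as the protocol above, with types loosened to Any.
    def evaluate(self, proposal: Any, context: Any) -> Any: ...
    def supported_outcomes(self) -> set: ...
    def supports_read_audit(self) -> bool: ...
    def supports_specialist_composition(self) -> bool: ...
    @property
    def judge_id(self) -> str: ...
    @property
    def policy_version(self) -> str: ...
    def close(self) -> None: ...


class AlwaysBlockJudge:
    """Hypothetical toy backend: blocks every proposal."""

    def __init__(self) -> None:
        self._closed = False

    def evaluate(self, proposal: Any, context: Any) -> dict:
        return {"outcome": "block", "reason": "toy backend blocks everything"}

    def supported_outcomes(self) -> set:
        return {"block"}

    def supports_read_audit(self) -> bool:
        return False

    def supports_specialist_composition(self) -> bool:
        return False

    @property
    def judge_id(self) -> str:
        return "always-block"

    @property
    def policy_version(self) -> str:
        return "v0"

    def close(self) -> None:
        self._closed = True  # setting a flag is safe to repeat


backend = AlwaysBlockJudge()
# Structural check: no inheritance from JudgeBackend is required.
is_backend = isinstance(backend, JudgeBackend)
```

Note that isinstance() against a runtime_checkable protocol only verifies that the members exist, not their signatures or return types, which is one reason a conformance suite is still needed.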

Exception taxonomy

JudgeError (base) + JudgeUnavailable / JudgePolicyInvalid / JudgeBudgetExhausted / JudgeProposalInvalid / JudgeAmendedProposalRejected. Each maps to a default outcome via judges.md's failure_policy (default: all → block).
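
A minimal sketch of the taxonomy and the default "all → block" mapping follows. The class names come from the line above; keying the policy by class name, the outcome strings, and the outcome_for_failure() helper are assumptions for illustration, not the parser's actual output format.

```python
from typing import Dict, Optional


class JudgeError(Exception):
    """Base class for all judge-layer failures."""


class JudgeUnavailable(JudgeError): ...
class JudgePolicyInvalid(JudgeError): ...
class JudgeBudgetExhausted(JudgeError): ...
class JudgeProposalInvalid(JudgeError): ...
class JudgeAmendedProposalRejected(JudgeError): ...


# Default failure_policy: every exception type maps to "block".
DEFAULT_FAILURE_POLICY: Dict[str, str] = {
    cls.__name__: "block" for cls in JudgeError.__subclasses__()
}


def outcome_for_failure(
    exc: JudgeError, failure_policy: Optional[Dict[str, str]] = None
) -> str:
    """Resolve the outcome for a judge failure (hypothetical helper)."""
    policy = {**DEFAULT_FAILURE_POLICY, **(failure_policy or {})}
    # Unknown JudgeError subclasses fall back to the safe default.
    return policy.get(type(exc).__name__, "block")


default_outcome = outcome_for_failure(JudgeUnavailable("timed out"))
overridden = outcome_for_failure(
    JudgeUnavailable("timed out"), {"JudgeUnavailable": "escalate"}
)
```

Falling back to "block" for anything unrecognised keeps the sketch fail-closed, which matches the stated default.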

Default reference implementations

Framework integration

  • atomic_agents/agent.py: the agent.call() multi-turn loop gains judge dispatch between LLM tool_use parsing and tool handler dispatch (per spec §"Where the judge sits in agent.call()")
  • atomic_agents/_canonical.py — new helper module for canonical-JSON hashing of arguments_hash + tool_definition_hash (sort_keys=True, separators=(',',':'), ensure_ascii=False); tool_definition_hash covers module + qualname, NOT bytecode
  • atomic_agents/_costs.py: cost_source field added to cost-event dataclass with Literal['actor', 'judge'] = 'actor' default for legacy-records backward-compat; sum_cost_for_period() gains source: Literal['actor','judge'] | None = None filter
  • atomic_agents/judges_md.py — new judges.md parser following the parser-rules section in the spec (default-fill class policy + failure policy, JudgePolicyInvalid on malformed YAML)
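
The canonical-JSON hashing described for _canonical.py could be sketched as below. The json.dumps options are the ones listed above; the choice of SHA-256 with hex output, the function names other than the two hash field names, and the exact identity payload are assumptions.

```python
import hashlib
import json
from typing import Any, Callable


def canonical_json(value: Any) -> str:
    # Sorted keys and compact separators make the encoding order-independent.
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )


def _sha256_hex(text: str) -> str:
    # Digest algorithm is an assumption; the source does not name one.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def arguments_hash(arguments: dict) -> str:
    return _sha256_hex(canonical_json(arguments))


def tool_definition_hash(tool: Callable[..., Any]) -> str:
    # Covers module + qualname only, so editing the body (bytecode)
    # does not change the hash.
    identity = {"module": tool.__module__, "qualname": tool.__qualname__}
    return _sha256_hex(canonical_json(identity))


same = arguments_hash({"a": 1, "b": 2}) == arguments_hash({"b": 2, "a": 1})
```

Because key order is normalised before hashing, two argument dicts that differ only in insertion order hash identically, which is the "hash determinism" property the conformance suite checks.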

Conformance suite (~30 tests per spec)

In tests/test_judge_backend_conformance.py, parameterized across the two reference backends:

  • evaluate returns valid Judgment for each outcome in supported_outcomes()
  • evaluate does not mutate proposal or JudgePolicyContext
  • Latency bounded by configurable timeout → JudgeUnavailable
  • Concurrent evaluate calls do not corrupt any of the named shared state (policy cache, LLM client, judge budget counter, ensemble vote buffer, JSONL writer position, escalation queue file, backend registry)
  • policy_version changes on policy source change; atomic snapshot (no partial reads); invalid utf-8 → JudgePolicyInvalid
  • Framework recomputes amended-proposal classification; judge cannot influence it
  • Schema-invalid amended → JudgeAmendedProposalRejected
  • Stricter class applies when amended class higher than original
  • Revise loop bounded (second revise → BLOCK with revise_loop_exhausted)
  • Exception taxonomy maps per failure_policy
  • Side-channel mismatch detection (missing / unbound / duplicate)
  • Audit JSONL includes tool_definition_hash, arguments_hash, tool_call_id, raw_outcome, enforcement_action, cost_source
  • Read-audit mode bypasses block but emits event with enforcement_action: "audit_bypass"
  • Escalation writes PENDING file with full proposal; resolution writes RESOLVED event; redacted leaves marker
  • Hash determinism + sensitivity
  • Project-floor judges.md cannot be relaxed by delegate judges.md → JudgePolicyInvalid at load time
  • JudgeRuntimeConfig never appears in LLM judge prompt (conformance test reads the assembled prompt and asserts)
  • close() idempotent

Plus per-backend tests for PolicyJudge (rule matching, write-path enforcement) and LLMJudgeBackend (prompt assembly, structured judgment parsing, policy_source hash invalidation).
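
Two of the conformance checks above (no mutation of inputs, idempotent close()) could be sketched as plain functions run against every backend. FakeProposal, FakeBackend, and run_conformance are hypothetical stand-ins; the real suite would presumably use the project's test runner and the actual ActionProposal and JudgmentContext types.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class FakeProposal:
    # Hypothetical stand-in for ActionProposal.
    tool_name: str
    arguments: dict = field(default_factory=dict)


class FakeBackend:
    """Hypothetical backend used only to exercise the checks below."""

    def __init__(self) -> None:
        self.close_calls = 0

    def evaluate(self, proposal, context):
        return {"outcome": "block", "reason": "fake"}

    def close(self) -> None:
        self.close_calls += 1


def check_evaluate_does_not_mutate(backend) -> None:
    proposal = FakeProposal("write_file", {"path": "notes/a.md"})
    context = {"class_policy": {"write": "block"}}
    proposal_before = copy.deepcopy(proposal)
    context_before = copy.deepcopy(context)
    backend.evaluate(proposal, context)
    assert proposal == proposal_before, "evaluate mutated the proposal"
    assert context == context_before, "evaluate mutated the policy context"


def check_close_idempotent(backend) -> None:
    backend.close()
    backend.close()  # a second call must not raise


def run_conformance(backends: list) -> int:
    """Run every check against every backend; return the number of checks run."""
    checks = [check_evaluate_does_not_mutate, check_close_idempotent]
    for backend in backends:
        for check in checks:
            check(backend)
    return len(backends) * len(checks)


checks_run = run_conformance([FakeBackend(), FakeBackend()])
```

Comparing deep copies taken before the call catches in-place edits to nested structures such as the arguments dict, which a shallow comparison of references would miss.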

Dependencies

Split decision: ship PolicyJudge + protocol + conformance suite first (no LLM dependency), then layer LLMJudgeBackend on once #87 is in. Either approach is valid; the size of the impl PR will determine which one is used.

Doctor checks

Added by this impl:

  • check_judge_health — recent JudgeUnavailable rate
  • check_judge_policy_sync: tools.md + judges.md hash lag warning
  • check_judge_policy_floor — project-floor relax-violation surfacer
  • check_judge_model_family — warns when configured judge model family matches actor model family
  • vault_synced_judge_captures_off — warns on Obsidian Sync / iCloud / syncthing signals + judge_captures: false
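
As an illustration of the first check, here is a minimal sketch of check_judge_health computing a recent JudgeUnavailable rate. The DoctorResult shape, the input format (one entry per recent judge call), and the 5% threshold are all assumptions; the source only states that the check reports the recent rate.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DoctorResult:
    # Hypothetical result shape for a doctor check.
    name: str
    ok: bool
    detail: str


def check_judge_health(
    recent_errors: Sequence[Optional[str]],
    max_unavailable_rate: float = 0.05,  # assumed threshold
) -> DoctorResult:
    """Each entry is the exception class name for a call, or None on success."""
    total = len(recent_errors)
    if total == 0:
        return DoctorResult("check_judge_health", True, "no recent judge calls")
    unavailable = sum(1 for err in recent_errors if err == "JudgeUnavailable")
    rate = unavailable / total
    detail = f"JudgeUnavailable rate {rate:.1%} over {total} calls"
    return DoctorResult("check_judge_health", rate <= max_unavailable_rate, detail)


result = check_judge_health([None] * 9 + ["JudgeUnavailable"])
```

In this example one failure in ten calls gives a 10% rate, which exceeds the assumed threshold, so the check reports not ok.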

Estimated size

~1500-2500 lines of implementation (protocol + filesystem-default + agent.py wiring + canonical hash helper + judges.md parser + costs.py changes). ~30 conformance tests + ~20 per-backend tests + ~10 framework-integration tests. Roughly the shape + scale of PR #57 (MemoryBackend).

Out of scope (filed as separate issues)

References

Metadata

Assignees

No one assigned

    Labels

    agent-loop: Agent call loop, tool iteration, cancellation, run state
    backend: Protocol-pattern backend abstractions (memory, logs, locks, etc.)
    enhancement: New feature or request
    spec: Implementation of an Atomic Agents spec doc
