[backend] JudgeBackend Protocol + filesystem-default reference implementation #112

## Context

Implements the protocol surface defined in [`docs/spec/28-judge-layer.md`](docs/spec/28-judge-layer.md) (merged via #111, originating RFC #110). This is the **first follow-up implementation issue** for the judge layer — it lands the protocol, the canonical dataclasses, the filesystem-default reference judge, and the conformance suite. Subsequent issues build on top.

Mirrors the MemoryBackend protocol-pattern template established by PR #57 / `spec/20`.

## Scope

### Module layout

```
atomic_agents/judge/
├── __init__.py        # registry: register_backend() / get_backend()
├── backend.py         # JudgeBackend Protocol + all dataclasses + exception taxonomy
├── proposal.py        # framework-side proposal assembly
├── llm.py             # LLMJudgeBackend (default — wraps LLMBackend #87)
└── rules.py           # PolicyJudge / RuleEngineJudgeBackend (always-on baseline)
```

### Dataclasses (per spec §"Action proposal" + §"Four-outcome model" + §"Audit shape")

- `ActionProposal` — full proposal with framework-introspected + actor-side-channel fields, including `side_channel_for_tool_call_id` binding
- `ProposalAmendment` — judge-amendable subset (judge cannot forge framework-managed fields)
- `Evidence`, `Authorization`, `SkillRef` — proposal sub-types
- `Judgment` — what the backend returns (outcome / reason / amendment / escalation_queue_id)
- `JudgmentEvent` — framework-wrapped audit shape (adds `raw_outcome`, `enforcement_action`, `cost_source`, `binding`)
- `JudgePolicyContext` — what the judge sees (persona digest + tools.md entry + class_policy + recent runs + cited notes + delegate_chain + loaded_skills)
- `JudgeRuntimeConfig` — framework-only (NEVER in LLM judge prompt; conformance test asserts)
- `JudgmentContext` — wrapper containing both
- `ClassPolicySnapshot` — per-class policy after project-floor + default-fill merging, with `source` per class
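
A minimal sketch of how the `Judgment` shape above might look as frozen dataclasses. The four outcome names and the fields of `ProposalAmendment` are assumptions for illustration; the spec is the authority on both.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JudgmentOutcome(Enum):
    # Assumed member names; this issue only refers to a "four-outcome model".
    ALLOW = "allow"
    REVISE = "revise"
    ESCALATE = "escalate"
    BLOCK = "block"


@dataclass(frozen=True)
class ProposalAmendment:
    # Hypothetical judge-amendable subset. Framework-managed fields are
    # simply absent here, so a judge has no way to set them.
    arguments: dict
    rationale: str = ""


@dataclass(frozen=True)
class Judgment:
    # Field names follow the bullet above: outcome / reason / amendment /
    # escalation_queue_id.
    outcome: JudgmentOutcome
    reason: str
    amendment: Optional[ProposalAmendment] = None
    escalation_queue_id: Optional[str] = None


verdict = Judgment(
    outcome=JudgmentOutcome.BLOCK,
    reason="write path not in tools.md allowlist",
)
```

Freezing the dataclasses is one way to make the "judge cannot forge framework-managed fields" property hard to violate by accident, since reassigning a field raises `dataclasses.FrozenInstanceError`.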

### Protocol surface

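```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class JudgeBackend(Protocol):
    # Core entry point: judge one proposal against its context.
    def evaluate(self, proposal: ActionProposal, context: JudgmentContext) -> Judgment: ...

    # Capability queries.
    def supported_outcomes(self) -> set[JudgmentOutcome]: ...
    def supports_read_audit(self) -> bool: ...
    def supports_specialist_composition(self) -> bool: ...

    # Identity and policy version.
    @property
    def judge_id(self) -> str: ...
    @property
    def policy_version(self) -> str: ...

    # Release backend resources; must be idempotent (see conformance suite).
    def close(self) -> None: ...
```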
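
As a rough illustration of what conforming to this surface involves, here is a toy backend checked structurally with `isinstance`. `AlwaysBlockJudge` and the simplified `Judgment` stand-in are hypothetical and not part of the planned implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol, Set, runtime_checkable


@dataclass(frozen=True)
class Judgment:
    # Simplified stand-in for the real dataclass in backend.py.
    outcome: str
    reason: str


@runtime_checkable
class JudgeBackend(Protocol):
    def evaluate(self, proposal: Any, context: Any) -> Judgment: ...
    def supported_outcomes(self) -> Set[str]: ...
    def supports_read_audit(self) -> bool: ...
    def supports_specialist_composition(self) -> bool: ...
    @property
    def judge_id(self) -> str: ...
    @property
    def policy_version(self) -> str: ...
    def close(self) -> None: ...


class AlwaysBlockJudge:
    """Hypothetical toy backend: blocks every proposal."""

    def __init__(self) -> None:
        self._closed = False

    def evaluate(self, proposal: Any, context: Any) -> Judgment:
        return Judgment("block", "toy backend blocks all proposals")

    def supported_outcomes(self) -> Set[str]:
        return {"block"}

    def supports_read_audit(self) -> bool:
        return False

    def supports_specialist_composition(self) -> bool:
        return False

    @property
    def judge_id(self) -> str:
        return "always-block"

    @property
    def policy_version(self) -> str:
        return "v0"

    def close(self) -> None:
        # Setting a flag keeps repeated close() calls harmless.
        self._closed = True


backend = AlwaysBlockJudge()
is_conformant = isinstance(backend, JudgeBackend)
```

Note that `isinstance` against a `runtime_checkable` protocol only checks that the members exist, not their signatures or return types, which is why the conformance suite is still needed.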

### Exception taxonomy

`JudgeError` (base) + `JudgeUnavailable` / `JudgePolicyInvalid` / `JudgeBudgetExhausted` / `JudgeProposalInvalid` / `JudgeAmendedProposalRejected`. Each maps to a default outcome via `judges.md`'s `failure_policy` (default: all → block).
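
A sketch of the taxonomy and the default-to-block mapping, under stated assumptions: the `outcome_for_failure` helper and the idea of keying `failure_policy` by exception class name are hypothetical, and the real `judges.md` schema may differ.

```python
from __future__ import annotations


class JudgeError(Exception):
    """Base class for judge-layer failures."""


class JudgeUnavailable(JudgeError): ...
class JudgePolicyInvalid(JudgeError): ...
class JudgeBudgetExhausted(JudgeError): ...
class JudgeProposalInvalid(JudgeError): ...
class JudgeAmendedProposalRejected(JudgeError): ...


# Per the issue text, every failure maps to "block" unless overridden.
DEFAULT_FAILURE_OUTCOME = "block"


def outcome_for_failure(
    exc: JudgeError,
    failure_policy: dict[str, str] | None = None,
) -> str:
    """Look up the outcome for a failure (hypothetical helper).

    Assumes failure_policy maps exception class names to outcome strings.
    """
    policy = failure_policy or {}
    return policy.get(type(exc).__name__, DEFAULT_FAILURE_OUTCOME)


# With no overrides, every failure blocks.
default_outcome = outcome_for_failure(JudgeUnavailable("timeout"))

# A hypothetical override that escalates outages instead of blocking.
overridden = outcome_for_failure(
    JudgeUnavailable("timeout"),
    {"JudgeUnavailable": "escalate"},
)
```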

### Default reference implementations

- **`PolicyJudge`** (rule-engine, `atomic_agents/judge/rules.py`): always-on baseline; matches `tools.md` write paths, allowlists, and deny rules, and enforces class policy. Microsecond-scale latency.
- **`LLMJudgeBackend`** (LLM-backed, `atomic_agents/judge/llm.py`): runs after `PolicyJudge`, and only if `PolicyJudge` allowed the proposal. Default model `gpt-5-nano` (OpenAI, deliberately a different model family from the default Anthropic actor, to mitigate correlated judgment). Wraps `LLMBackend` (#87), so the composition only takes effect after #87 lands.
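
The ordering between the two judges can be sketched as a short-circuiting chain. Everything here is illustrative: `evaluate_chain`, the `(outcome, reason)` tuple, and the `notes/` allowlist prefix are hypothetical stand-ins for the real types and rules.

```python
from typing import Callable, Dict, List, Tuple

# (outcome, reason) pair; a stand-in for the real Judgment dataclass.
Verdict = Tuple[str, str]
Judge = Callable[[Dict[str, str]], Verdict]


def policy_judge(proposal: Dict[str, str]) -> Verdict:
    """Toy rule check: writes must stay under an allowed prefix."""
    if not proposal["path"].startswith("notes/"):
        return ("block", "write path outside allowlist")
    return ("allow", "matched write-path rule")


def evaluate_chain(proposal: Dict[str, str], rule_judge: Judge, llm_judge: Judge) -> Verdict:
    """Run the rule judge first; consult the LLM judge only on allow."""
    verdict = rule_judge(proposal)
    if verdict[0] != "allow":
        return verdict  # the LLM judge is never called, so no model cost
    return llm_judge(proposal)


llm_calls: List[str] = []


def fake_llm_judge(proposal: Dict[str, str]) -> Verdict:
    # Records each call so the short-circuit is observable.
    llm_calls.append(proposal["path"])
    return ("allow", "no concerns")


blocked = evaluate_chain({"path": "/etc/passwd"}, policy_judge, fake_llm_judge)
allowed = evaluate_chain({"path": "notes/today.md"}, policy_judge, fake_llm_judge)
```

Running the cheap rule check first means a proposal that violates a hard rule never incurs LLM latency or judge-side cost.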

### Framework integration

- `atomic_agents/agent.py` — `agent.call()` multi-turn loop gains judge dispatch between LLM tool_use parsing and tool handler dispatch (per spec §"Where the judge sits in `agent.call()`")
- `atomic_agents/_canonical.py` — new helper module for canonical-JSON hashing of `arguments_hash` + `tool_definition_hash` (`sort_keys=True, separators=(',',':'), ensure_ascii=False`); `tool_definition_hash` covers module + qualname, NOT bytecode
- `atomic_agents/_costs.py` — `cost_source` field added to cost-event dataclass with `Literal['actor', 'judge'] = 'actor'` default for legacy-records backward-compat; `sum_cost_for_period()` gains `source: Literal['actor','judge'] | None = None` filter
- `atomic_agents/judges_md.py` — new `judges.md` parser following the parser-rules section in the spec (default-fill class policy + failure policy, JudgePolicyInvalid on malformed YAML)
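
A minimal sketch of the canonical-JSON hashing described for `_canonical.py`, using the `json.dumps` settings listed above. The choice of SHA-256 and the exact shape of the identity dict for `tool_definition_hash` are assumptions; this issue does not name the digest.

```python
import hashlib
import json
from typing import Any, Callable


def canonical_json(value: Any) -> str:
    """Serialize with sorted keys and compact separators for stable bytes."""
    return json.dumps(
        value,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )


def _digest(text: str) -> str:
    # SHA-256 is an assumption, not confirmed by the issue text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def arguments_hash(arguments: dict) -> str:
    return _digest(canonical_json(arguments))


def tool_definition_hash(handler: Callable[..., Any]) -> str:
    """Hash the handler's module + qualname only, never its bytecode."""
    identity = {
        "module": handler.__module__,
        "qualname": handler.__qualname__,
    }
    return _digest(canonical_json(identity))


# Key order does not affect the hash; values do.
same = arguments_hash({"a": 1, "b": 2}) == arguments_hash({"b": 2, "a": 1})
different = arguments_hash({"a": 1}) != arguments_hash({"a": 2})
```

Because the hash covers only module and qualname, editing a handler's body without renaming or moving it leaves `tool_definition_hash` unchanged.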

### Conformance suite (~30 tests per spec)

In `tests/test_judge_backend_conformance.py`, parameterized across the two reference backends:

- `evaluate` returns valid `Judgment` for each outcome in `supported_outcomes()`
- `evaluate` does not mutate proposal or `JudgePolicyContext`
- Latency bounded by configurable timeout → `JudgeUnavailable`
- Concurrent `evaluate` calls do not corrupt the named shared state (policy cache, LLM client, judge budget counter, ensemble vote buffer, JSONL writer position, escalation queue file, backend registry)
- `policy_version` changes on policy source change; atomic snapshot (no partial reads); invalid UTF-8 → `JudgePolicyInvalid`
- Framework recomputes amended-proposal classification; judge cannot influence it
- Schema-invalid amended proposal → `JudgeAmendedProposalRejected`
- Stricter class applies when the amended class is higher than the original
- Revise loop bounded (second revise → `BLOCK` with `revise_loop_exhausted`)
- Exception taxonomy maps per `failure_policy`
- Side-channel mismatch detection (missing / unbound / duplicate)
- Audit JSONL includes `tool_definition_hash`, `arguments_hash`, `tool_call_id`, `raw_outcome`, `enforcement_action`, `cost_source`
- Read-audit mode bypasses block but emits event with `enforcement_action: "audit_bypass"`
- Escalation writes PENDING file with full proposal; resolution writes RESOLVED event; redacted leaves marker
- Hash determinism + sensitivity
- Project-floor `judges.md` cannot be relaxed by delegate `judges.md` → `JudgePolicyInvalid` at load time
- `JudgeRuntimeConfig` never appears in LLM judge prompt (conformance test reads the assembled prompt and asserts)
- `close()` idempotent

Plus per-backend tests for `PolicyJudge` (rule matching, write-path enforcement) and `LLMJudgeBackend` (prompt assembly, structured judgment parsing, `policy_source` hash invalidation).
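
To give a feel for the suite's shape, here is a sketch of two of the checks above (no mutation, idempotent `close()`) written as plain functions run across a list of backend factories. `ToyJudge` and the dict-shaped proposal are hypothetical; the real suite would presumably use the test framework's own parameterization over the two reference backends.

```python
import copy


class ToyJudge:
    """Hypothetical backend used only to exercise the checks."""

    def __init__(self):
        self.closed = False

    def evaluate(self, proposal, context):
        return {"outcome": "allow", "reason": "toy"}

    def close(self):
        self.closed = True


def check_evaluate_does_not_mutate(backend):
    proposal = {"tool": "write_note", "arguments": {"path": "notes/a.md"}}
    context = {"class_policy": {"write": "block"}}
    # Deep-copy first so nested mutation is detected too.
    before = copy.deepcopy((proposal, context))
    backend.evaluate(proposal, context)
    assert (proposal, context) == before, "evaluate mutated its inputs"


def check_close_idempotent(backend):
    backend.close()
    backend.close()  # a second call must not raise


# Each check gets a fresh backend so state does not leak between checks.
BACKEND_FACTORIES = [ToyJudge]
for factory in BACKEND_FACTORIES:
    check_evaluate_does_not_mutate(factory())
    check_close_idempotent(factory())
```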

## Dependencies

- **#87 LLMBackend Protocol**: `LLMJudgeBackend` wraps `LLMBackend`, so the LLM judge portion of this issue is blocked until #87 ships.
- **None for PolicyJudge**: the rule-engine judge can ship independently of `LLMBackend`.

Split option: ship PolicyJudge + protocol + conformance suite first (no LLM dependency), then layer LLMJudgeBackend on once #87 is in. Shipping everything together once #87 lands is also valid; the size of the implementation PR will determine which approach to take.

## Doctor checks

Added by this impl:
- `check_judge_health` — recent `JudgeUnavailable` rate
- `check_judge_policy_sync` — `tools.md` + `judges.md` hash lag warning
- `check_judge_policy_floor` — project-floor relax-violation surfacer
- `check_judge_model_family` — warns when configured judge model family matches actor model family
- `vault_synced_judge_captures_off` — warns on Obsidian Sync / iCloud / syncthing signals + `judge_captures: false`
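
As an illustration of the simplest of these, `check_judge_model_family` could look roughly like the sketch below. The prefix table, the return convention (a warning string or `None`), and the actor model name are all assumptions; how the framework actually derives a model family is not specified in this issue.

```python
from __future__ import annotations

# Hypothetical prefix-to-family table; the real lookup may differ.
MODEL_FAMILY_PREFIXES = {
    "gpt-": "openai",
    "claude-": "anthropic",
}


def model_family(model: str) -> str | None:
    """Return the family for a model name, or None if unrecognized."""
    for prefix, family in MODEL_FAMILY_PREFIXES.items():
        if model.startswith(prefix):
            return family
    return None


def check_judge_model_family(judge_model: str, actor_model: str) -> str | None:
    """Return a warning when judge and actor share a known family."""
    judge_family = model_family(judge_model)
    if judge_family is not None and judge_family == model_family(actor_model):
        return (
            f"judge model {judge_model!r} shares family {judge_family!r} "
            f"with actor model {actor_model!r}"
        )
    return None


# "claude-example-actor" is a placeholder name, not a real model.
ok = check_judge_model_family("gpt-5-nano", "claude-example-actor")
warning = check_judge_model_family("claude-example-judge", "claude-example-actor")
```

Unrecognized models produce no warning in this sketch; a stricter check could instead warn whenever either family is unknown.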

## Estimated size

~1500-2500 lines of implementation (protocol + filesystem-default + agent.py wiring + canonical hash helper + judges.md parser + costs.py changes). ~30 conformance tests + ~20 per-backend tests + ~10 framework-integration tests. Roughly the shape + scale of PR #57 (MemoryBackend).

## Out of scope (filed as separate issues)

- Memory provenance labels and migration lint — separate spec extension
- Reference judge example wired against the Caldwell sample agent — separate example issue
- Escalation queue review UI (dashboard tab + operator workflow) — separate issue
- Cross-agent (fleet-wide) judge policies — see PolicyBackend (#89)
- Streaming judgment — deferred until streaming use cases appear

## References

- `docs/spec/28-judge-layer.md` — the spec this implements
- PR #111 — the merged spec PR
- #110 (closed) — origin RFC
- #87 LLMBackend — composes with this
- #89 PolicyBackend — composes with this
- `docs/spec/20-memory-backend.md` + PR #57 — protocol-pattern template
