
feat: research integrity layers 2+3 (closes #8)#21

Merged

dep0we merged 1 commit into main from feat/research-integrity-layers on May 7, 2026
Conversation

dep0we (Owner) commented May 7, 2026

Summary

Implements spec/13 Layers 2 and 3 — source-grounded eval and the research-log-per-response trail. Layer 1 (citation requirements via persona + rubric) was already done in v0.1. Together, the three layers make agent factual accuracy auditable: every claim in a response can be traced back to the helper that produced it and the source it came from.

This PR is stacked on `feat/helper-provenance` (PR #20) — Layer 3 depends on the helper provenance plumbing landing first. Merge order: #15 → #16 → #17 → #18 → #19 → #20 → this.

What landed

Layer 2 — Source-grounded eval (`atomic_agents/eval.py`)

  • `_build_judge_prompt` — when a test declares `expected_facts`, append a "Factual accuracy check" section instructing the judge to verify each fact (`stated_in_response`, `value_correct`, `cited`) and emit a `factual_checks` array in its JSON response.
  • `_render_factual_check_section` — builds the addendum: one bullet per expected fact + the JSON template the judge follows.
  • `compute_factual_accuracy_from_checks(checks)` (module-level for testability) — derive a 1–5 dimension score from a list of factual_checks:
    • Verified (stated AND value_correct AND cited) → full credit
    • Stated + correct but uncited → half credit
    • Stated + wrong value → no credit
    • Score = `round(5 × credit / total)`, where `credit` sums the full and half credits above, clamped to [1, 5]
    • Empty checks → `None` (no signal to score from)
  • `_compute_weighted_score` — when `factual_accuracy` is in the rubric weights but the judge didn't return a numeric score for it, derive one from `factual_checks`. Judge's numeric score takes priority when present (LLM may apply nuance the bare proportion misses).
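The scoring rule above can be sketched as follows. This is a minimal illustration, not the repo's implementation: the function name and the boolean check fields (`stated_in_response`, `value_correct`, `cited`) come from the PR text, but the exact signature and check shape are assumptions.

```python
def compute_factual_accuracy_from_checks(checks):
    """Derive a 1-5 factual_accuracy score from the judge's factual_checks.

    Sketch only: assumes each check is a dict with boolean
    stated_in_response / value_correct / cited fields.
    """
    if not checks:
        return None  # no expected_facts declared -> no signal to score from

    credit = 0.0
    for check in checks:
        stated = check.get("stated_in_response", False)
        correct = check.get("value_correct", False)
        cited = check.get("cited", False)
        if stated and correct and cited:
            credit += 1.0   # verified -> full credit
        elif stated and correct:
            credit += 0.5   # stated and correct but uncited -> half credit
        # stated with a wrong value, or not stated at all -> no credit

    score = round(5 * credit / len(checks))
    return max(1, min(5, score))  # clamp to [1, 5]
```

Note that a response with every fact wrong still floors at 1 rather than 0, since the rubric dimension is a 1-5 scale.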

Layer 3 — Research log per response (`atomic_agents/agent.py`)

  • `_helpers_this_run` — in-memory list of helper-call rollups; reset at the start of each `call()`, appended to by `helper_call()`.
  • `helper_call` — after logging the helper run line, append a rollup entry (model, summary, cost, latency, sources_summarized, provenance_preserved) to the parent's `_helpers_this_run`.
  • `agent.call` — when finishing, embed the rollup as `helper_provenance` in the parent's run log record. Field is omitted when no helpers were called during the run, so log shape stays unchanged for reactive agents that don't fan out.
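The rollup lifecycle can be sketched as below. This is a hypothetical class shape: the attribute and method names (`_helpers_this_run`, `helper_call`, `call`) follow the PR text, but the real `Agent` in `atomic_agents/agent.py` does far more (logging, model calls, cost tracking) than this stub shows.

```python
class Agent:
    """Sketch of the Layer 3 helper-provenance rollup lifecycle."""

    def __init__(self):
        self._helpers_this_run = []

    def helper_call(self, model, summary, cost_usd, latency_ms,
                    sources_summarized=None, provenance_preserved=True):
        # ... run the helper and write its own run-log line, then roll it up
        entry = {
            "model": model,
            "summary": summary,
            "cost_usd": cost_usd,
            "latency_ms": latency_ms,
            "provenance_preserved": provenance_preserved,
        }
        if sources_summarized:  # omitted for sources-less helpers
            entry["sources_summarized"] = sources_summarized
        self._helpers_this_run.append(entry)

    def call(self, prompt):
        self._helpers_this_run = []  # reset the rollup at the start of each run
        # ... do the actual work, possibly fanning out via helper_call() ...
        record = {"status": "ok"}  # plus ts, model, tokens, cost, summary, etc.
        if self._helpers_this_run:
            # embed only when helpers ran, so the log shape stays unchanged
            # for reactive agents that never fan out
            record["helper_provenance"] = list(self._helpers_this_run)
        return record
```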

Example: log shape after a parent run that fanned out to two helpers

```json
{
  "ts": "2026-05-08T11:32:00-05:00",
  "trigger": "skill",
  "model": "claude-opus-4-7-20260101",
  "input_tokens": 4102,
  "output_tokens": 892,
  "cost_usd": 0.1287,
  "status": "ok",
  "summary": "Q1 bonus allocation question",
  "run_id": "run-...",
  "helper_provenance": [
    {
      "model": "claude-haiku-4-5-20251001",
      "summary": "Summarize CPA memo",
      "cost_usd": 0.00094,
      "latency_ms": 240,
      "sources_summarized": ["/docs/finance/cpa/2026-05-tax-mid-year.md"],
      "provenance_preserved": true
    },
    {
      "model": "claude-haiku-4-5-20251001",
      "summary": "Extract Q1 income figures",
      "cost_usd": 0.00071,
      "latency_ms": 192,
      "sources_summarized": ["/docs/finance/balance_sheet.md"],
      "provenance_preserved": true
    }
  ]
}
```

Test plan

  • 16 new tests, 296 total passing
  • Layer 2: factual_accuracy from checks (perfect / zero floor / partial credit / mixed / wrong value), judge prompt includes/omits section based on expected_facts presence, every expected fact rendered in the prompt, weighted score derives factual_accuracy when judge omits it, judge's numeric score takes priority when present, no derivation without signal
  • Layer 3: helper_call appends to rollup, rollup omits sources_summarized for sources-less helpers, agent.call() embeds rollup in log record, rollup resets between calls
  • All prior 280 tests still passing — no regressions

Notes

🤖 Generated with Claude Code

@dep0we dep0we force-pushed the feat/helper-provenance branch from 9a0e990 to 1f07576 Compare May 7, 2026 13:45
@dep0we dep0we changed the base branch from feat/helper-provenance to main May 7, 2026 13:45
Implements spec/13 Layers 2 and 3 — source-grounded eval and the
research-log-per-response trail. Layer 1 (citation requirements via
persona + rubric) was already done. Together they make agent factual
accuracy auditable: every claim in a response can be traced back to
the helper that produced it and the source it came from.

What landed:

LAYER 2 — Source-grounded eval (atomic_agents/eval.py)
- _build_judge_prompt: when a test declares expected_facts, append a
  "Factual accuracy check" section instructing the judge to verify
  each fact (stated_in_response, value_correct, cited) and emit a
  factual_checks array in its JSON response.
- _render_factual_check_section: builds the addendum with one
  bullet per expected_fact and a JSON template the judge follows.
- compute_factual_accuracy_from_checks (module-level for test): derive
  a 1-5 dimension score from a list of factual_checks. Verified =
  stated AND value_correct AND cited (full credit). Stated + correct
  but uncited = half credit. Empty checks → None (no signal).
- _compute_weighted_score: when factual_accuracy is in rubric weights
  but the judge didn't return a numeric score for it, derive from
  factual_checks. Judge's numeric score takes priority when present.

LAYER 3 — Research log per response (atomic_agents/agent.py)
- _helpers_this_run: in-memory list of helper-call rollups, reset at
  the start of each call() and appended to by helper_call().
- helper_call: after logging the helper run line, append a rollup
  entry (model, summary, cost, latency, sources_summarized,
  provenance_preserved) to the parent's _helpers_this_run.
- agent.call: when finishing, embed the rollup as helper_provenance
  in the parent's run log record. Field is omitted when no helpers
  were called during the run, so log shape stays unchanged for
  reactive agents that don't fan out.

16 new tests, 296 total passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dep0we dep0we force-pushed the feat/research-integrity-layers branch from ea02803 to 4273cc8 Compare May 7, 2026 13:46
@dep0we dep0we merged commit 7253f62 into main May 7, 2026
@dep0we dep0we deleted the feat/research-integrity-layers branch May 7, 2026 13:46
dep0we added a commit that referenced this pull request May 7, 2026
* chore: add CI workflow + update CHANGELOG for v0.9 (closes #10)

CI:
- .github/workflows/test.yml: GitHub Actions runs pytest on push to
  main and on every PR. Matrix: Python 3.11 + 3.12. Uses
  astral-sh/setup-uv@v3 with cache, fail-fast disabled (so one
  Python version's failure doesn't kill the other), in-progress
  cancellation on new pushes to same branch.
- README.md: status badge, Python-version badge, MIT license badge.

CHANGELOG:
- New v0.9.0 section consolidating everything that landed across PRs
  #12 / #14 / #16 / #18 / #19 / #20 / #21 / #22 / #23 — eval, tuning,
  goal manager, migrate, tool-call captures, cascade loader, spec
  import, operational extras, helper provenance, research integrity
  layers 2+3.
- Tests bumped 67 → 296 across 8 new modules.
- v0.1 entry preserved unchanged below the new section.

After this lands, v0.9 is feature-complete relative to the original
spec. Remaining gaps before v1.0 (per the README status table) are
non-code: first non-Bishop agent deployed end-to-end, vault docs sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: key uv cache off pyproject.toml (uv.lock is gitignored)

Initial workflow defaulted to the **/uv.lock cache key, which fails on
this repo because uv.lock is gitignored. Switching to pyproject.toml
keeps caching working without changing the gitignore policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Dan Powers <dep0we@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dep0we added a commit that referenced this pull request May 7, 2026
…#25)

The "What's shipped" table still claimed three rows were planned that
landed in the recent PR stack:
- Helper provenance enforcement (shipped #20, v0.8)
- Research integrity layers 2+3 (shipped #21, v0.9)
- Claude Code skill wrappers (shipped #19, v0.9)

Refreshed to reflect actual state. Also added two more shipped rows
(spec docs in repo from #18, CI workflow from #24) and updated the
"first non-Bishop agent" gate to ⏳ next.

Co-authored-by: Dan Powers <dep0we@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
