feat: research integrity layers 2+3 (closes #8) #21
Merged
Conversation
Implements spec/13 Layers 2 and 3 — source-grounded eval and the research-log-per-response trail. Layer 1 (citation requirements via persona + rubric) was already done. Together they make agent factual accuracy auditable: every claim in a response can be traced back to the helper that produced it and the source it came from.

What landed:

LAYER 2 — Source-grounded eval (atomic_agents/eval.py)
- _build_judge_prompt: when a test declares expected_facts, append a "Factual accuracy check" section instructing the judge to verify each fact (stated_in_response, value_correct, cited) and emit a factual_checks array in its JSON response.
- _render_factual_check_section: builds the addendum with one bullet per expected_fact and a JSON template the judge follows.
- compute_factual_accuracy_from_checks (module-level for test): derive a 1-5 dimension score from a list of factual_checks. Verified = stated AND value_correct AND cited (full credit). Stated + correct but uncited = half credit. Empty checks → None (no signal).
- _compute_weighted_score: when factual_accuracy is in rubric weights but the judge didn't return a numeric score for it, derive from factual_checks. Judge's numeric score takes priority when present.

LAYER 3 — Research log per response (atomic_agents/agent.py)
- _helpers_this_run: in-memory list of helper-call rollups, reset at the start of each call() and appended to by helper_call().
- helper_call: after logging the helper run line, append a rollup entry (model, summary, cost, latency, sources_summarized, provenance_preserved) to the parent's _helpers_this_run.
- agent.call: when finishing, embed the rollup as helper_provenance in the parent's run log record. Field is omitted when no helpers were called during the run, so log shape stays unchanged for reactive agents that don't fan out.

16 new tests, 296 total passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
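The Layer 2 scoring rule described above can be sketched as follows. This is a minimal illustration of the stated credit scheme (verified = full credit, stated + correct but uncited = half credit, empty checks = no signal), not the actual body of `compute_factual_accuracy_from_checks` in `atomic_agents/eval.py`; the check-field names come from the judge's `factual_checks` JSON described in the PR.

```python
def compute_factual_accuracy_from_checks(checks):
    """Derive a 1-5 factual-accuracy score from judge factual_checks.

    Sketch of the rule stated in the PR description; the real
    implementation in atomic_agents/eval.py may differ in detail.
    """
    if not checks:
        return None  # empty checks: no signal, per the PR description
    credit = 0.0
    for check in checks:
        stated = check.get("stated_in_response", False)
        correct = check.get("value_correct", False)
        cited = check.get("cited", False)
        if stated and correct and cited:
            credit += 1.0  # verified fact: full credit
        elif stated and correct:
            credit += 0.5  # correct but uncited: half credit
    # Map the [0, 1] credit fraction onto the 1-5 rubric scale.
    return 1 + 4 * (credit / len(checks))
```

With this rule, a response where every expected fact is verified scores 5, and one where half the facts are verified and the rest are correct-but-uncited lands at 4.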
dep0we added a commit that referenced this pull request on May 7, 2026
* chore: add CI workflow + update CHANGELOG for v0.9 (closes #10)

CI:
- .github/workflows/test.yml: GitHub Actions runs pytest on push to main and on every PR. Matrix: Python 3.11 + 3.12. Uses astral-sh/setup-uv@v3 with cache, fail-fast disabled (so one Python version's failure doesn't kill the other), in-progress cancellation on new pushes to same branch.
- README.md: status badge, Python-version badge, MIT license badge.

CHANGELOG:
- New v0.9.0 section consolidating everything that landed across PRs #12 / #14 / #16 / #18 / #19 / #20 / #21 / #22 / #23 — eval, tuning, goal manager, migrate, tool-call captures, cascade loader, spec import, operational extras, helper provenance, research integrity layers 2+3.
- Tests bumped 67 → 296 across 8 new modules.
- v0.1 entry preserved unchanged below the new section.

After this lands, v0.9 is feature-complete relative to the original spec. Remaining gaps before v1.0 (per the README status table) are non-code: first non-Bishop agent deployed end-to-end, vault docs sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: key uv cache off pyproject.toml (uv.lock is gitignored)

Initial workflow defaulted to the **/uv.lock cache key, which fails on this repo because uv.lock is gitignored. Switching to pyproject.toml keeps caching working without changing the gitignore policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Dan Powers <dep0we@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
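The cache-key fix in the second commit might look roughly like this in the workflow step. A hedged sketch only — the actual test.yml isn't shown in this PR, and the step name is made up — but `astral-sh/setup-uv` does expose `enable-cache` and `cache-dependency-glob` inputs for exactly this case:

```yaml
# Assumed fragment of .github/workflows/test.yml (not shown in the PR).
- name: Set up uv
  uses: astral-sh/setup-uv@v3
  with:
    enable-cache: true
    # uv.lock is gitignored in this repo, so key the cache off
    # pyproject.toml instead of the default **/uv.lock glob.
    cache-dependency-glob: "pyproject.toml"
```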
dep0we added a commit that referenced this pull request on May 7, 2026
…#25)

The "What's shipped" table still claimed three rows were planned that landed in the recent PR stack:
- Helper provenance enforcement (shipped #20, v0.8)
- Research integrity layers 2+3 (shipped #21, v0.9)
- Claude Code skill wrappers (shipped #19, v0.9)

Refreshed to reflect actual state. Also added two more shipped rows (spec docs in repo from #18, CI workflow from #24) and updated the "first non-Bishop agent" gate to ⏳ next.

Co-authored-by: Dan Powers <dep0we@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Implements spec/13 Layers 2 and 3 — source-grounded eval and the research-log-per-response trail. Layer 1 (citation requirements via persona + rubric) was already done in v0.1. Together, the three layers make agent factual accuracy auditable: every claim in a response can be traced back to the helper that produced it and the source it came from.
This PR is stacked on `feat/helper-provenance` (PR #20) — Layer 3 depends on the helper provenance plumbing landing first. Merge order: #15 → #16 → #17 → #18 → #19 → #20 → this.
What landed
Layer 2 — Source-grounded eval (`atomic_agents/eval.py`)
Layer 3 — Research log per response (`atomic_agents/agent.py`)
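The Layer 3 flow — reset the rollup at the start of each `call()`, append from `helper_call()`, and embed the rollup only when helpers actually ran — can be sketched as below. Method signatures and the surrounding class are assumptions for illustration; only the field names and the reset/append/embed behavior come from this PR.

```python
class Agent:
    """Illustrative skeleton; the real class lives in atomic_agents/agent.py."""

    def __init__(self):
        self._helpers_this_run = []

    def helper_call(self, model, summary, cost_usd, latency_ms,
                    sources_summarized, provenance_preserved):
        # ... invoke the helper model and log its run line (elided) ...
        # Then append the rollup entry to the parent's in-memory list.
        self._helpers_this_run.append({
            "model": model,
            "summary": summary,
            "cost_usd": cost_usd,
            "latency_ms": latency_ms,
            "sources_summarized": sources_summarized,
            "provenance_preserved": provenance_preserved,
        })

    def call(self, prompt):
        self._helpers_this_run = []  # reset rollup at the start of each run
        # ... run the parent model, which may fan out via helper_call() ...
        record = {"summary": prompt, "status": "ok"}
        if self._helpers_this_run:
            # Omitted entirely when no helpers ran, so the log shape stays
            # unchanged for reactive agents that don't fan out.
            record["helper_provenance"] = list(self._helpers_this_run)
        return record
```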
Example: log shape after a parent run that fanned out to two helpers
```json
{
  "ts": "2026-05-08T11:32:00-05:00",
  "trigger": "skill",
  "model": "claude-opus-4-7-20260101",
  "input_tokens": 4102,
  "output_tokens": 892,
  "cost_usd": 0.1287,
  "status": "ok",
  "summary": "Q1 bonus allocation question",
  "run_id": "run-...",
  "helper_provenance": [
    {
      "model": "claude-haiku-4-5-20251001",
      "summary": "Summarize CPA memo",
      "cost_usd": 0.00094,
      "latency_ms": 240,
      "sources_summarized": ["/docs/finance/cpa/2026-05-tax-mid-year.md"],
      "provenance_preserved": true
    },
    {
      "model": "claude-haiku-4-5-20251001",
      "summary": "Extract Q1 income figures",
      "cost_usd": 0.00071,
      "latency_ms": 192,
      "sources_summarized": ["/docs/finance/balance_sheet.md"],
      "provenance_preserved": true
    }
  ]
}
```
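Because the rollup lands as plain JSON in the run log, a downstream audit can walk the records and flag any run where a helper dropped source provenance. The function below is a hypothetical consumer, not part of the library; it only assumes the record shape shown in the example above.

```python
import json

def flag_provenance_gaps(jsonl_lines):
    """Return (run_id, helper summary) pairs where provenance was lost.

    Hypothetical audit over JSONL run-log lines shaped like the example
    record above; records without helper_provenance are skipped, matching
    the field being omitted when no helpers ran.
    """
    flagged = []
    for line in jsonl_lines:
        record = json.loads(line)
        for helper in record.get("helper_provenance", []):
            if not helper.get("provenance_preserved", False):
                flagged.append((record["run_id"], helper["summary"]))
    return flagged
```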
Test plan
Notes
🤖 Generated with Claude Code