
feat: research integrity layers 2+3 (closes #8)#21

Merged

dep0we merged 1 commit into main from feat/research-integrity-layers on May 7, 2026
Conversation

dep0we (Owner) commented May 7, 2026

Summary

Implements spec/13 Layers 2 and 3 — source-grounded eval and the research-log-per-response trail. Layer 1 (citation requirements via persona + rubric) was already done in v0.1. Together, the three layers make agent factual accuracy auditable: every claim in a response can be traced back to the helper that produced it and the source it came from.

This PR is stacked on `feat/helper-provenance` (PR #20) — Layer 3 depends on the helper provenance plumbing landing first. Merge order: #15 → #16 → #17 → #18 → #19 → #20 → this.

What landed

Layer 2 — Source-grounded eval (`atomic_agents/eval.py`)

  • `_build_judge_prompt` — when a test declares `expected_facts`, append a "Factual accuracy check" section instructing the judge to verify each fact (`stated_in_response`, `value_correct`, `cited`) and emit a `factual_checks` array in its JSON response.
  • `_render_factual_check_section` — builds the addendum: one bullet per expected fact + the JSON template the judge follows.
  • `compute_factual_accuracy_from_checks(checks)` (module-level for testability) — derive a 1–5 dimension score from a list of factual_checks:
    • Verified (stated AND value_correct AND cited) → full credit
    • Stated + correct but uncited → half credit
    • Stated + wrong value → no credit
    • Score = `round(5 × credit / total)`, where `credit` sums the full and half credits above, clamped to [1, 5]
    • Empty checks → `None` (no signal to score from)
  • `_compute_weighted_score` — when `factual_accuracy` is in the rubric weights but the judge didn't return a numeric score for it, derive one from `factual_checks`. Judge's numeric score takes priority when present (LLM may apply nuance the bare proportion misses).
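The scoring rule above can be sketched as follows. This is a minimal illustration, not the repo's implementation: the function name and the boolean check fields (`stated_in_response`, `value_correct`, `cited`) come from the PR text, but the exact signature and check shape are assumptions.

```python
def compute_factual_accuracy_from_checks(checks):
    """Derive a 1-5 factual_accuracy score from the judge's factual_checks.

    Sketch only: assumes each check is a dict with boolean
    stated_in_response / value_correct / cited fields.
    """
    if not checks:
        return None  # no expected_facts declared -> no signal to score from

    credit = 0.0
    for check in checks:
        stated = check.get("stated_in_response", False)
        correct = check.get("value_correct", False)
        cited = check.get("cited", False)
        if stated and correct and cited:
            credit += 1.0   # verified -> full credit
        elif stated and correct:
            credit += 0.5   # stated and correct but uncited -> half credit
        # stated with a wrong value, or not stated at all -> no credit

    score = round(5 * credit / len(checks))
    return max(1, min(5, score))  # clamp to [1, 5]
```

Note that a response with every fact wrong still floors at 1 rather than 0, since the rubric dimension is a 1-5 scale.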

Layer 3 — Research log per response (`atomic_agents/agent.py`)

  • `_helpers_this_run` — in-memory list of helper-call rollups; reset at the start of each `call()`, appended to by `helper_call()`.
  • `helper_call` — after logging the helper run line, append a rollup entry (model, summary, cost, latency, sources_summarized, provenance_preserved) to the parent's `_helpers_this_run`.
  • `agent.call` — when finishing, embed the rollup as `helper_provenance` in the parent's run log record. Field is omitted when no helpers were called during the run, so log shape stays unchanged for reactive agents that don't fan out.
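The rollup lifecycle can be sketched as below. This is a hypothetical class shape: the attribute and method names (`_helpers_this_run`, `helper_call`, `call`) follow the PR text, but the real `Agent` in `atomic_agents/agent.py` does far more (logging, model calls, cost tracking) than this stub shows.

```python
class Agent:
    """Sketch of the Layer 3 helper-provenance rollup lifecycle."""

    def __init__(self):
        self._helpers_this_run = []

    def helper_call(self, model, summary, cost_usd, latency_ms,
                    sources_summarized=None, provenance_preserved=True):
        # ... run the helper and write its own run-log line, then roll it up
        entry = {
            "model": model,
            "summary": summary,
            "cost_usd": cost_usd,
            "latency_ms": latency_ms,
            "provenance_preserved": provenance_preserved,
        }
        if sources_summarized:  # omitted for sources-less helpers
            entry["sources_summarized"] = sources_summarized
        self._helpers_this_run.append(entry)

    def call(self, prompt):
        self._helpers_this_run = []  # reset the rollup at the start of each run
        # ... do the actual work, possibly fanning out via helper_call() ...
        record = {"status": "ok"}  # plus ts, model, tokens, cost, summary, etc.
        if self._helpers_this_run:
            # embed only when helpers ran, so the log shape stays unchanged
            # for reactive agents that never fan out
            record["helper_provenance"] = list(self._helpers_this_run)
        return record
```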

Example: log shape after a parent run that fanned out to two helpers

```json
{
  "ts": "2026-05-08T11:32:00-05:00",
  "trigger": "skill",
  "model": "claude-opus-4-7-20260101",
  "input_tokens": 4102,
  "output_tokens": 892,
  "cost_usd": 0.1287,
  "status": "ok",
  "summary": "Q1 bonus allocation question",
  "run_id": "run-...",
  "helper_provenance": [
    {
      "model": "claude-haiku-4-5-20251001",
      "summary": "Summarize CPA memo",
      "cost_usd": 0.00094,
      "latency_ms": 240,
      "sources_summarized": ["/docs/finance/cpa/2026-05-tax-mid-year.md"],
      "provenance_preserved": true
    },
    {
      "model": "claude-haiku-4-5-20251001",
      "summary": "Extract Q1 income figures",
      "cost_usd": 0.00071,
      "latency_ms": 192,
      "sources_summarized": ["/docs/finance/balance_sheet.md"],
      "provenance_preserved": true
    }
  ]
}
```

Test plan

  • 16 new tests, 296 total passing
  • Layer 2: factual_accuracy from checks (perfect / zero floor / partial credit / mixed / wrong value), judge prompt includes/omits section based on expected_facts presence, every expected fact rendered in the prompt, weighted score derives factual_accuracy when judge omits it, judge's numeric score takes priority when present, no derivation without signal
  • Layer 3: helper_call appends to rollup, rollup omits sources_summarized for sources-less helpers, agent.call() embeds rollup in log record, rollup resets between calls
  • All prior 280 tests still passing — no regressions

Notes

🤖 Generated with Claude Code

@dep0we dep0we force-pushed the feat/helper-provenance branch from 9a0e990 to 1f07576 Compare May 7, 2026 13:45
@dep0we dep0we changed the base branch from feat/helper-provenance to main May 7, 2026 13:45
Implements spec/13 Layers 2 and 3 — source-grounded eval and the
research-log-per-response trail. Layer 1 (citation requirements via
persona + rubric) was already done. Together they make agent factual
accuracy auditable: every claim in a response can be traced back to
the helper that produced it and the source it came from.

What landed:

LAYER 2 — Source-grounded eval (atomic_agents/eval.py)
- _build_judge_prompt: when a test declares expected_facts, append a
  "Factual accuracy check" section instructing the judge to verify
  each fact (stated_in_response, value_correct, cited) and emit a
  factual_checks array in its JSON response.
- _render_factual_check_section: builds the addendum with one
  bullet per expected_fact and a JSON template the judge follows.
- compute_factual_accuracy_from_checks (module-level for test): derive
  a 1-5 dimension score from a list of factual_checks. Verified =
  stated AND value_correct AND cited (full credit). Stated + correct
  but uncited = half credit. Empty checks → None (no signal).
- _compute_weighted_score: when factual_accuracy is in rubric weights
  but the judge didn't return a numeric score for it, derive from
  factual_checks. Judge's numeric score takes priority when present.

LAYER 3 — Research log per response (atomic_agents/agent.py)
- _helpers_this_run: in-memory list of helper-call rollups, reset at
  the start of each call() and appended to by helper_call().
- helper_call: after logging the helper run line, append a rollup
  entry (model, summary, cost, latency, sources_summarized,
  provenance_preserved) to the parent's _helpers_this_run.
- agent.call: when finishing, embed the rollup as helper_provenance
  in the parent's run log record. Field is omitted when no helpers
  were called during the run, so log shape stays unchanged for
  reactive agents that don't fan out.

16 new tests, 296 total passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dep0we dep0we force-pushed the feat/research-integrity-layers branch from ea02803 to 4273cc8 Compare May 7, 2026 13:46
@dep0we dep0we merged commit 7253f62 into main May 7, 2026
@dep0we dep0we deleted the feat/research-integrity-layers branch May 7, 2026 13:46
dep0we added a commit that referenced this pull request May 7, 2026
* chore: add CI workflow + update CHANGELOG for v0.9 (closes #10)

CI:
- .github/workflows/test.yml: GitHub Actions runs pytest on push to
  main and on every PR. Matrix: Python 3.11 + 3.12. Uses
  astral-sh/setup-uv@v3 with cache, fail-fast disabled (so one
  Python version's failure doesn't kill the other), in-progress
  cancellation on new pushes to same branch.
- README.md: status badge, Python-version badge, MIT license badge.

CHANGELOG:
- New v0.9.0 section consolidating everything that landed across PRs
  #12 / #14 / #16 / #18 / #19 / #20 / #21 / #22 / #23 — eval, tuning,
  goal manager, migrate, tool-call captures, cascade loader, spec
  import, operational extras, helper provenance, research integrity
  layers 2+3.
- Tests bumped 67 → 296 across 8 new modules.
- v0.1 entry preserved unchanged below the new section.

After this lands, v0.9 is feature-complete relative to the original
spec. Remaining gaps before v1.0 (per the README status table) are
non-code: first non-Bishop agent deployed end-to-end, vault docs sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: key uv cache off pyproject.toml (uv.lock is gitignored)

Initial workflow defaulted to the **/uv.lock cache key, which fails on
this repo because uv.lock is gitignored. Switching to pyproject.toml
keeps caching working without changing the gitignore policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Dan Powers <dep0we@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dep0we added a commit that referenced this pull request May 7, 2026
…#25)

The "What's shipped" table still claimed three rows were planned that
landed in the recent PR stack:
- Helper provenance enforcement (shipped #20, v0.8)
- Research integrity layers 2+3 (shipped #21, v0.9)
- Claude Code skill wrappers (shipped #19, v0.9)

Refreshed to reflect actual state. Also added two more shipped rows
(spec docs in repo from #18, CI workflow from #24) and updated the
"first non-Bishop agent" gate to ⏳ next.

Co-authored-by: Dan Powers <dep0we@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
