Add goal manager — atomic_agents.goal (closes #3) #14
Merged
Conversation
added 3 commits on May 6, 2026 20:55
Implements the evaluation runner per spec/08-evaluation. Runs an Atomic Agent
against golden tests and scores outputs via a cross-family LLM judge. Produces
structured run results in evals/runs/YYYY-MM-DD.jsonl per agent.
Key pieces:
- EvalRunner class — loads rubric.md (weights, threshold, hard fails) and
judge.md (recommended judges, strict mode, audit %)
- discover_tests() — walks evals/golden/{happy,edge,adversarial,decline}/
- run_test() — invoke agent, build judge prompt, call cross-family judge,
parse JSON scores, compute weighted score + verdict
- run_suite() — sequential execution with results written to runs/ and
agent responses to runs/responses/
- pick_judge_model() — cross-family preferred, same-family fallback if
cross-family API key is missing; never self-judge
- Strict mode in judge prompts — score against rubric, not judge's taste
- Malformed-JSON retry — one stricter retry before declaring judge_error
- Hard fails override weighted score — verdict is fail regardless of score
- Cost computed per call (agent + judge separately, summed for run total)
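A minimal sketch of the weighted-score plus hard-fail verdict logic described above. Parameter names (`weights`, `threshold`, `triggered_hard_fails`) are assumptions for illustration, not the actual EvalRunner internals:

```python
def compute_verdict(scores: dict[str, float],
                    weights: dict[str, float],
                    threshold: float,
                    triggered_hard_fails: list[str]) -> tuple[float, str]:
    # Weighted mean of per-criterion judge scores.
    total_weight = sum(weights.values())
    weighted = sum(scores[c] * w for c, w in weights.items()) / total_weight
    # Hard fails override the weighted score: verdict is fail regardless.
    if triggered_hard_fails:
        return weighted, "fail"
    return weighted, "pass" if weighted >= threshold else "fail"
```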
CLI: python -m atomic_agents.eval <agent> [--category|--test|--all]
[--summary-only] [--no-write]
Tests: 27 new tests covering rubric loading, judge selection logic
(cross-family + fallback + no-judge-available), score parsing (valid JSON,
fenced code, malformed), weighted score computation, hard-fail override,
end-to-end runs (mocked LLM), JSONL + response file writes.
Total suite: 94/94 passing.
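The fenced-JSON score-parsing case exercised by those tests might look like this sketch (the real parser and its retry wiring may differ):

```python
import json

def parse_judge_scores(raw: str) -> dict:
    # Judges sometimes wrap their JSON in a markdown code fence; strip it.
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # Drop the opening fence (``` or ```json) and the closing fence if present.
        lines = lines[1:-1] if lines[-1].strip().startswith("```") else lines[1:]
        text = "\n".join(lines)
    # A json.JSONDecodeError here is the caller's cue for the one stricter retry.
    return json.loads(text)
```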
Deferred to #8 (research integrity Layers 2-3):
- expected_facts processing (Layer 2 — source-grounded eval)
- factual_checks aggregation in scoring
- Audit sample capture to evals/runs/audit_log.jsonl
- Trajectory capture (currently passes "(not implemented)" placeholder)
These all extend EvalRunner without restructuring; pure additions.
Implements the eval-driven tuning analyzer per spec/11-tuning. Reads recent
eval results + memory state; detects recurring patterns; generates specific
edit proposals to persona/memory/tools files; operator approves before any
edit lands.
Closes the "eval observes performance, but how do you act on it" loop.
Key pieces:
- AnalysisContext — shared input loaded once per run (eval JSONLs from
evals/runs/, memory frontmatter, tuning_history)
- 4 pattern detectors (the spec/11 minimum set):
- RecurringPersonaFidelityLow — persona scores ≤3 across N+ tests with
recurring judge phrases
- HardFailRecurring — same hard-fail code firing across multiple runs
- StaleNoteRecurring — memory notes past last_seen + threshold without
refresh, not pinned/archived
- PromotableMemoryDetected — feedback/user notes referenced 5+ times
without contradiction → suggest persona promotion
- EditProposal dataclass with frontmatter ready for operator decision
- 4 proposal generators producing diff bodies + rationale + risks +
verification plan, dispatched per detector type
- Optional LLM polish (--polish flag, ~$0.02/proposal) — refines proposal
wording via Sonnet using existing target file as voice context
- Report rendering — markdown output to evals/tuning_reports/YYYY-MM-DD_proposal.md
with self-contained YAML frontmatter blocks for each proposal
- parse_report_proposals() — round-trips the report after operator edits
the operator_decision frontmatter
- apply_proposals() — records all decisions (accepted/rejected/deferred)
to tuning_history.jsonl; v0.3 doesn't auto-write the diff (operator
applies manually after reviewing) — auto-edit can layer on later
- Detector crashes don't kill the analysis (one failing detector skipped,
others run)
CLI: python -m atomic_agents.tuning <agent> [--since 60d] [--polish]
[--apply <report>] [--dry-run]
HARD RULE (per spec/11): tuning never auto-applies. Operator approval
required for every edit.
Tests: 25 new tests covering each detector (positive + negative cases),
proposal generation (unique IDs, required fields), report rendering
(with proposals + empty case), apply flow (decisions recorded, dry-run,
missing report, mixed decisions), LLM polish (success + silent fallback
on failure), and detector-crash isolation.
Total suite: 119/119 passing.
Implements the goal manager per spec/12-goals-and-intent. Lets agents
pursue persistent goals across many sessions via decomposition into
sub-goals with explicit lifecycle.
Most agents are reactive (Caldwell-style). Goal-driven agents (Muse
Director-style) hold an objective across hundreds of runs. Hybrid agents
(reactive by default, goal-driven when an active goal.md exists) get
the best of both.
Key pieces:
- Goal + SubGoal dataclasses matching spec/12 frontmatter schema
- validate_goal() + validate_agent_mode() — schema validation
- parse_agent_mode() — reads IDENTITY.md "Operating mode" section
- GoalManager class:
- load() / save() — frontmatter round-trip with validation
- has_active_goal() — quick check for hybrid runtimes to switch modes
- next_sub_goal() — finds the next pending, unblocked sub_goal
- mark_in_progress / mark_complete / mark_blocked / mark_abandoned —
state transitions with validation; idempotent where appropriate
- add_sub_goal() — operator action to extend the goal
- evaluate_completion() — reports all-criteria-met, per-status counts,
deadline pacing (overdue when active and past deadline)
- archive() — non-destructive move to goal_archive/, used for both
completion and abandonment
- abandon(reason) — operator-initiated archive with stated reason
- status_summary() — one-screen output for `goal status`
- progress_report() — structured periodic check-in for journals,
includes pacing analysis (ahead-of/behind-pace flagging)
- _append_history() — auto-appends timestamped lines to goal body's
History section on every state transition
CLI: python -m atomic_agents.goal {status|next|advance|abandon|complete|report}
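The deadline-pacing idea behind `evaluate_completion()` and `progress_report()` might amount to comparing completion fraction against elapsed-time fraction, roughly like this (field names and the exact comparison are assumptions):

```python
from datetime import date

def pacing(started: date, deadline: date, done: int, total: int,
           today: date) -> str:
    # Overdue only when past the deadline; deadlines are never auto-extended.
    if today > deadline:
        return "overdue"
    elapsed = (today - started).days / max((deadline - started).days, 1)
    progress = done / max(total, 1)
    return "ahead-of-pace" if progress >= elapsed else "behind-pace"
```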
HARD RULES (per spec/12):
- Operator-set goals only — agent doesn't auto-generate
- Operator-set success criteria — agent doesn't tune
- No auto-extension of missed deadlines (surfaces overdue, doesn't fix)
- No override of locked decisions
- Sequential goals only in v0.4 (one active per agent)
Multi-agent project queue dispatch is deferred to issue #6
(multi-agent project cascade loader). For v0.4, single-agent goal-driven
mode works end-to-end; the agent self-handles each next sub_goal as the
work item for its next invocation.
Tests: 39 new tests covering schema validation (positive + negative
across all required fields), agent mode parsing (4 cases incl. defaults),
load/save round-trip, has_active_goal, next_sub_goal logic (incl. blocker
filtering with status transitions), all lifecycle transitions
(in_progress/complete/blocked/abandoned, idempotent + invalid cases),
add_sub_goal (incl. duplicate id rejection), completion evaluation
(in-progress + all-done + overdue), archive + abandon with reason,
status summary + progress report formatting, save persistence with
history entries.
Total suite: 158/158 passing.
dep0we added a commit that referenced this pull request on May 7, 2026
* chore: add CI workflow + update CHANGELOG for v0.9 (closes #10)

  CI:
  - .github/workflows/test.yml: GitHub Actions runs pytest on push to main and on every PR. Matrix: Python 3.11 + 3.12. Uses astral-sh/setup-uv@v3 with cache, fail-fast disabled (so one Python version's failure doesn't kill the other), in-progress cancellation on new pushes to the same branch.
  - README.md: status badge, Python-version badge, MIT license badge.

  CHANGELOG:
  - New v0.9.0 section consolidating everything that landed across PRs #12 / #14 / #16 / #18 / #19 / #20 / #21 / #22 / #23 — eval, tuning, goal manager, migrate, tool-call captures, cascade loader, spec import, operational extras, helper provenance, research integrity layers 2+3.
  - Tests bumped 67 → 296 across 8 new modules.
  - v0.1 entry preserved unchanged below the new section.

  After this lands, v0.9 is feature-complete relative to the original spec. Remaining gaps before v1.0 (per the README status table) are non-code: first non-Bishop agent deployed end-to-end, vault docs sync.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: key uv cache off pyproject.toml (uv.lock is gitignored)

  Initial workflow defaulted to the **/uv.lock cache key, which fails on this repo because uv.lock is gitignored. Switching to pyproject.toml keeps caching working without changing the gitignore policy.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Dan Powers <dep0we@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
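The cache-key fix in the second commit might look roughly like this in the workflow (a sketch; the actual step layout in test.yml may differ):

```yaml
- uses: astral-sh/setup-uv@v3
  with:
    enable-cache: true
    # uv.lock is gitignored in this repo, so key the cache off
    # pyproject.toml instead of the default **/uv.lock glob.
    cache-dependency-glob: "pyproject.toml"
```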
This was referenced May 7, 2026
dep0we pushed a commit that referenced this pull request on May 13, 2026
…cing

Two stopgap patches that unblock the Kimi review wrapper (#134) for international operators. Both land properly with the LLMBackend protocol (#87) — these are minimum-blast-radius patches matching CLAUDE.md rule #14.

`_llm._call_moonshot` reads `ATOMIC_AGENTS_MOONSHOT_BASE_URL` / `MOONSHOT_BASE_URL` env vars and falls back to the existing `api.moonshot.cn` default. Operators with keys from the international portal (`api.moonshot.ai`) can set the env var and use the framework without forking `_llm.py`. No breaking changes — every existing deployment continues to hit `.cn`.

`_costs.PRICING` gains five new Moonshot model entries:
- `moonshot/moonshot-v1-{8k,32k,128k}` (the non-thinking models)
- `moonshot/kimi-k2.6`, `moonshot/kimi-k2.5` (the thinking models; available via `--model` once #146 extracts `reasoning_content`)

All five are priced at the existing placeholder rate of $0.30/$1.20 per Mtok in/out, matching `moonshot/kimi-2.6` (the legacy alias). The placeholder note in PRICING is updated to recommend verifying against current Moonshot pricing before depending on dashboard cost totals.

Refs #134

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
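The env-var fallback chain described for `_call_moonshot` amounts to something like the following sketch; the function name and the exact default URL form are illustrative, not the actual `_llm.py` code:

```python
import os

DEFAULT_MOONSHOT_BASE_URL = "https://api.moonshot.cn/v1"  # assumed default form

def resolve_moonshot_base_url() -> str:
    # Framework-specific override wins, then the generic var, then the .cn default.
    return (
        os.environ.get("ATOMIC_AGENTS_MOONSHOT_BASE_URL")
        or os.environ.get("MOONSHOT_BASE_URL")
        or DEFAULT_MOONSHOT_BASE_URL
    )
```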
Summary
Implements issue #3 — goal manager per spec/12-goals-and-intent. Lets agents pursue persistent goals across many sessions via decomposition into sub-goals with explicit lifecycle.
This is the v0.4 milestone module. Independent of evals/tuning — pure data model + lifecycle. Stacked on the tuning PR; base auto-updates as PRs land.
What's in this PR
New module:
`atomic_agents/goal.py` (~700 LOC)

Data model + validation:
- `Goal`, `SubGoal`, `CompletionEvaluation` dataclasses matching spec/12 frontmatter
- `validate_goal()` — required/optional fields, sub-goal status enum, duplicate-id detection, type checks
- `validate_agent_mode()` — reactive | goal-driven | hybrid
- `parse_agent_mode()` — reads `IDENTITY.md` "Operating mode" section; defaults to `reactive` if not declared

`GoalManager` class:
- `load()` / `save()` — frontmatter round-trip with validation
- `has_active_goal()` — quick check for hybrid runtimes to switch modes
- `next_sub_goal()` — finds the next pending, unblocked sub-goal (filters by `blocked_by` chain)
- `mark_in_progress`, `mark_complete`, `mark_blocked`, `mark_abandoned` — validated state machine, idempotent where it should be (`mark_complete` on already-complete is a no-op)
- `add_sub_goal()` — operator action to extend the goal
- `evaluate_completion()` — all-criteria-met flag, per-status counts, deadline pacing (overdue detection only when active + past deadline)
- `archive()` — non-destructive move to `goal_archive/YYYY-MM-DD_<intent_slug>.md`; used for both completion and abandonment
- `abandon(reason)` — operator-initiated archive with stated reason
- `status_summary()` — one-screen text output for `goal status`
- `progress_report()` — structured periodic check-in (suitable for appending to journal entries); includes pacing analysis with ahead-of-pace / behind-pace flags

CLI:
`python -m atomic_agents.goal {status|next|advance|abandon|complete|report}`

Hard rules (per spec/12) — all enforced:
- Operator-set goals only — the agent doesn't auto-generate goals
- Operator-set success criteria — the agent doesn't tune them
- No auto-extension of missed deadlines (surfaces overdue, doesn't fix)
- No override of locked decisions
- Sequential goals only in v0.4 (one active goal per agent)
Tests:
- `tests/test_goal.py` — 39 tests, all passing
- Total suite: 158/158 passing.
Spec coverage
Per spec/12 acceptance criteria (issue #3):
`goal report`; cadence config is operator-driven)

What's deferred
- `goal.md` at `<system>/projects/<project>/goal.md` and queue handoff to peer roles is part of issue #6 ([v0.7] Multi-agent project cascade loader — `atomic_agents.agent`), where the project structure is the right context. The data model from this PR works identically there.
- `AtomicAgent.assemble_system_prompt` doesn't yet load goal.md. The change is a one-liner conditional on `has_active_goal()`. Deferring to issue #6 (where multi-agent runtime assembly is restructured anyway) to avoid touching `agent.py` twice.

Both are pure additions; no restructuring of this module needed.
Test plan
- `uv run pytest tests/test_goal.py` — 39 tests pass
- `uv run pytest` — full suite (158 tests) passes

🤖 Generated with Claude Code