Add goal manager — atomic_agents.goal (closes #3) #14
Merged
Conversation
added 3 commits on May 6, 2026 20:55
Implements the evaluation runner per spec/08-evaluation. Runs an Atomic Agent
against golden tests and scores outputs via a cross-family LLM judge. Produces
structured run results in evals/runs/YYYY-MM-DD.jsonl per agent.
Key pieces:
- EvalRunner class — loads rubric.md (weights, threshold, hard fails) and
judge.md (recommended judges, strict mode, audit %)
- discover_tests() — walks evals/golden/{happy,edge,adversarial,decline}/
- run_test() — invoke agent, build judge prompt, call cross-family judge,
parse JSON scores, compute weighted score + verdict
- run_suite() — sequential execution with results written to runs/ and
agent responses to runs/responses/
- pick_judge_model() — cross-family preferred, same-family fallback if
cross-family API key is missing; never self-judge
- Strict mode in judge prompts — score against rubric, not judge's taste
- Malformed-JSON retry — one stricter retry before declaring judge_error
- Hard fails override weighted score — verdict is fail regardless of score
- Cost computed per call (agent + judge separately, summed for run total)
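A minimal sketch of the weighted-score plus hard-fail verdict logic described above. Parameter names (`weights`, `threshold`, `triggered_hard_fails`) are assumptions for illustration, not the actual EvalRunner internals:

```python
def compute_verdict(scores: dict[str, float],
                    weights: dict[str, float],
                    threshold: float,
                    triggered_hard_fails: list[str]) -> tuple[float, str]:
    # Weighted mean of per-criterion judge scores.
    total_weight = sum(weights.values())
    weighted = sum(scores[c] * w for c, w in weights.items()) / total_weight
    # Hard fails override the weighted score: verdict is fail regardless.
    if triggered_hard_fails:
        return weighted, "fail"
    return weighted, "pass" if weighted >= threshold else "fail"
```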
CLI: python -m atomic_agents.eval <agent> [--category|--test|--all]
[--summary-only] [--no-write]
Tests: 27 new tests covering rubric loading, judge selection logic
(cross-family + fallback + no-judge-available), score parsing (valid JSON,
fenced code, malformed), weighted score computation, hard-fail override,
end-to-end runs (mocked LLM), JSONL + response file writes.
Total suite: 94/94 passing.
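The fenced-JSON score-parsing case exercised by those tests might look like this sketch (the real parser and its retry wiring may differ):

```python
import json

def parse_judge_scores(raw: str) -> dict:
    # Judges sometimes wrap their JSON in a markdown code fence; strip it.
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # Drop the opening fence (``` or ```json) and the closing fence if present.
        lines = lines[1:-1] if lines[-1].strip().startswith("```") else lines[1:]
        text = "\n".join(lines)
    # A json.JSONDecodeError here is the caller's cue for the one stricter retry.
    return json.loads(text)
```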
Deferred to #8 (research integrity Layers 2-3):
- expected_facts processing (Layer 2 — source-grounded eval)
- factual_checks aggregation in scoring
- Audit sample capture to evals/runs/audit_log.jsonl
- Trajectory capture (currently passes "(not implemented)" placeholder)
These all extend EvalRunner without restructuring; pure additions.
Implements the eval-driven tuning analyzer per spec/11-tuning. Reads recent
eval results + memory state; detects recurring patterns; generates specific
edit proposals to persona/memory/tools files; operator approves before any
edit lands.
Closes the "eval observes performance, but how do you act on it" loop.
Key pieces:
- AnalysisContext — shared input loaded once per run (eval JSONLs from
evals/runs/, memory frontmatter, tuning_history)
- 4 pattern detectors (the spec/11 minimum set):
- RecurringPersonaFidelityLow — persona scores ≤3 across N+ tests with
recurring judge phrases
- HardFailRecurring — same hard-fail code firing across multiple runs
- StaleNoteRecurring — memory notes past last_seen + threshold without
refresh, not pinned/archived
- PromotableMemoryDetected — feedback/user notes referenced 5+ times
without contradiction → suggest persona promotion
- EditProposal dataclass with frontmatter ready for operator decision
- 4 proposal generators producing diff bodies + rationale + risks +
verification plan, dispatched per detector type
- Optional LLM polish (--polish flag, ~$0.02/proposal) — refines proposal
wording via Sonnet using existing target file as voice context
- Report rendering — markdown output to evals/tuning_reports/YYYY-MM-DD_proposal.md
with self-contained YAML frontmatter blocks for each proposal
- parse_report_proposals() — round-trips the report after operator edits
the operator_decision frontmatter
- apply_proposals() — records all decisions (accepted/rejected/deferred)
to tuning_history.jsonl; v0.3 doesn't auto-write the diff (operator
applies manually after reviewing) — auto-edit can layer on later
- Detector crashes don't kill the analysis (one failing detector skipped,
others run)
CLI: python -m atomic_agents.tuning <agent> [--since 60d] [--polish]
[--apply <report>] [--dry-run]
HARD RULE (per spec/11): tuning never auto-applies. Operator approval
required for every edit.
Tests: 25 new tests covering each detector (positive + negative cases),
proposal generation (unique IDs, required fields), report rendering
(with proposals + empty case), apply flow (decisions recorded, dry-run,
missing report, mixed decisions), LLM polish (success + silent fallback
on failure), and detector-crash isolation.
Total suite: 119/119 passing.
Implements the goal manager per spec/12-goals-and-intent. Lets agents
pursue persistent goals across many sessions via decomposition into
sub-goals with explicit lifecycle.
Most agents are reactive (Caldwell-style). Goal-driven agents (Muse
Director-style) hold an objective across hundreds of runs. Hybrid agents
(reactive by default, goal-driven when an active goal.md exists) get
the best of both.
Key pieces:
- Goal + SubGoal dataclasses matching spec/12 frontmatter schema
- validate_goal() + validate_agent_mode() — schema validation
- parse_agent_mode() — reads IDENTITY.md "Operating mode" section
- GoalManager class:
- load() / save() — frontmatter round-trip with validation
- has_active_goal() — quick check for hybrid runtimes to switch modes
- next_sub_goal() — finds the next pending, unblocked sub_goal
- mark_in_progress / mark_complete / mark_blocked / mark_abandoned —
state transitions with validation; idempotent where appropriate
- add_sub_goal() — operator action to extend the goal
- evaluate_completion() — reports all-criteria-met, per-status counts,
deadline pacing (overdue when active and past deadline)
- archive() — non-destructive move to goal_archive/, used for both
completion and abandonment
- abandon(reason) — operator-initiated archive with stated reason
- status_summary() — one-screen output for `goal status`
- progress_report() — structured periodic check-in for journals,
includes pacing analysis (ahead-of/behind-pace flagging)
- _append_history() — auto-appends timestamped lines to goal body's
History section on every state transition
CLI: python -m atomic_agents.goal {status|next|advance|abandon|complete|report}
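The deadline-pacing idea behind `evaluate_completion()` and `progress_report()` might amount to comparing completion fraction against elapsed-time fraction, roughly like this (field names and the exact comparison are assumptions):

```python
from datetime import date

def pacing(started: date, deadline: date, done: int, total: int,
           today: date) -> str:
    # Overdue only when past the deadline; deadlines are never auto-extended.
    if today > deadline:
        return "overdue"
    elapsed = (today - started).days / max((deadline - started).days, 1)
    progress = done / max(total, 1)
    return "ahead-of-pace" if progress >= elapsed else "behind-pace"
```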
HARD RULES (per spec/12):
- Operator-set goals only — agent doesn't auto-generate
- Operator-set success criteria — agent doesn't tune
- No auto-extension of missed deadlines (surfaces overdue, doesn't fix)
- No override of locked decisions
- Sequential goals only in v0.4 (one active per agent)
Multi-agent project queue dispatch is deferred to issue #6
(multi-agent project cascade loader). For v0.4, single-agent goal-driven
mode works end-to-end; the agent self-handles each next sub_goal as the
work item for its next invocation.
Tests: 39 new tests covering schema validation (positive + negative
across all required fields), agent mode parsing (4 cases incl. defaults),
load/save round-trip, has_active_goal, next_sub_goal logic (incl. blocker
filtering with status transitions), all lifecycle transitions
(in_progress/complete/blocked/abandoned, idempotent + invalid cases),
add_sub_goal (incl. duplicate id rejection), completion evaluation
(in-progress + all-done + overdue), archive + abandon with reason,
status summary + progress report formatting, save persistence with
history entries.
Total suite: 158/158 passing.
dep0we added a commit that referenced this pull request on May 7, 2026
* chore: add CI workflow + update CHANGELOG for v0.9 (closes #10)

  CI:
  - .github/workflows/test.yml: GitHub Actions runs pytest on push to main and on every PR. Matrix: Python 3.11 + 3.12. Uses astral-sh/setup-uv@v3 with cache, fail-fast disabled (so one Python version's failure doesn't kill the other), in-progress cancellation on new pushes to the same branch.
  - README.md: status badge, Python-version badge, MIT license badge.

  CHANGELOG:
  - New v0.9.0 section consolidating everything that landed across PRs #12 / #14 / #16 / #18 / #19 / #20 / #21 / #22 / #23 — eval, tuning, goal manager, migrate, tool-call captures, cascade loader, spec import, operational extras, helper provenance, research integrity layers 2+3.
  - Tests bumped 67 → 296 across 8 new modules.
  - v0.1 entry preserved unchanged below the new section.

  After this lands, v0.9 is feature-complete relative to the original spec. Remaining gaps before v1.0 (per the README status table) are non-code: first non-Bishop agent deployed end-to-end, vault docs sync.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: key uv cache off pyproject.toml (uv.lock is gitignored)

  Initial workflow defaulted to the **/uv.lock cache key, which fails on this repo because uv.lock is gitignored. Switching to pyproject.toml keeps caching working without changing the gitignore policy.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Dan Powers <dep0we@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
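The cache-key fix in the second commit might look roughly like this in the workflow (a sketch; the actual step layout in test.yml may differ):

```yaml
- uses: astral-sh/setup-uv@v3
  with:
    enable-cache: true
    # uv.lock is gitignored in this repo, so key the cache off
    # pyproject.toml instead of the default **/uv.lock glob.
    cache-dependency-glob: "pyproject.toml"
```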
This was referenced May 7, 2026
dep0we pushed a commit that referenced this pull request on May 13, 2026
…cing

Two stopgap patches that unblock the Kimi review wrapper (#134) for international operators. Both land properly with the LLMBackend protocol (#87) — these are minimum-blast-radius patches matching CLAUDE.md rule #14.

`_llm._call_moonshot` reads `ATOMIC_AGENTS_MOONSHOT_BASE_URL` / `MOONSHOT_BASE_URL` env vars and falls back to the existing `api.moonshot.cn` default. Operators with keys from the international portal (`api.moonshot.ai`) can set the env var and use the framework without forking `_llm.py`. No breaking changes — every existing deployment continues to hit `.cn`.

`_costs.PRICING` gains five new Moonshot model entries:
- `moonshot/moonshot-v1-{8k,32k,128k}` (the non-thinking models)
- `moonshot/kimi-k2.6`, `moonshot/kimi-k2.5` (the thinking models; available via `--model` once #146 extracts `reasoning_content`)

All five are priced at the existing placeholder rate of $0.30/$1.20 per Mtok in/out, matching `moonshot/kimi-2.6` (the legacy alias). The placeholder note in PRICING is updated to recommend verifying against current Moonshot pricing before depending on dashboard cost totals.

Refs #134

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
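The env-var fallback chain described for `_call_moonshot` amounts to something like the following sketch; the function name and the exact default URL form are illustrative, not the actual `_llm.py` code:

```python
import os

DEFAULT_MOONSHOT_BASE_URL = "https://api.moonshot.cn/v1"  # assumed default form

def resolve_moonshot_base_url() -> str:
    # Framework-specific override wins, then the generic var, then the .cn default.
    return (
        os.environ.get("ATOMIC_AGENTS_MOONSHOT_BASE_URL")
        or os.environ.get("MOONSHOT_BASE_URL")
        or DEFAULT_MOONSHOT_BASE_URL
    )
```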
Summary
Implements issue #3 — goal manager per spec/12-goals-and-intent. Lets agents pursue persistent goals across many sessions via decomposition into sub-goals with explicit lifecycle.
This is the v0.4 milestone module. Independent of evals/tuning — pure data model + lifecycle. Stacked on the tuning PR; base auto-updates as PRs land.
What's in this PR
New module:
`atomic_agents/goal.py` (~700 LOC)

Data model + validation:
- `Goal`, `SubGoal`, `CompletionEvaluation` dataclasses matching spec/12 frontmatter
- `validate_goal()` — required/optional fields, sub-goal status enum, duplicate-id detection, type checks
- `validate_agent_mode()` — reactive | goal-driven | hybrid
- `parse_agent_mode()` — reads `IDENTITY.md` "Operating mode" section; defaults to `reactive` if not declared

`GoalManager` class:
- `load()` / `save()` — frontmatter round-trip with validation
- `has_active_goal()` — quick check for hybrid runtimes to switch modes
- `next_sub_goal()` — finds the next pending, unblocked sub-goal (filters by `blocked_by` chain)
- `mark_in_progress`, `mark_complete`, `mark_blocked`, `mark_abandoned` — validated state machine, idempotent where it should be (`mark_complete` on already-complete is a no-op)
- `add_sub_goal()` — operator action to extend the goal
- `evaluate_completion()` — all-criteria-met flag, per-status counts, deadline pacing (overdue detection only when active + past deadline)
- `archive()` — non-destructive move to `goal_archive/YYYY-MM-DD_<intent_slug>.md`; used for both completion and abandonment
- `abandon(reason)` — operator-initiated archive with stated reason
- `status_summary()` — one-screen text output for `goal status`
- `progress_report()` — structured periodic check-in (suitable for appending to journal entries); includes pacing analysis with ahead-of-pace / behind-pace flags

CLI:
`python -m atomic_agents.goal {status|next|advance|abandon|complete|report}`

Hard rules (per spec/12) — all enforced:
- Operator-set goals only — the agent doesn't auto-generate goals
- Operator-set success criteria — the agent doesn't tune them
- No auto-extension of missed deadlines (surfaces overdue, doesn't fix)
- No override of locked decisions
- Sequential goals only in v0.4 (one active goal per agent)
Tests:
- `tests/test_goal.py` — 39 tests, all passing
- Total suite: 158/158 passing.
Spec coverage
Per spec/12 acceptance criteria (issue #3):
`goal report`; cadence config is operator-driven)

What's deferred
- `goal.md` at `<system>/projects/<project>/goal.md` and queue handoff to peer roles is part of issue #6 ([v0.7] Multi-agent project cascade loader — `atomic_agents.agent`), where the project structure is the right context. The data model from this PR works identically there.
- `AtomicAgent.assemble_system_prompt` doesn't yet load goal.md. The change is a one-liner conditional on `has_active_goal()`. Deferring to issue #6 (where multi-agent runtime assembly is restructured anyway) to avoid touching `agent.py` twice.

Both are pure additions; no restructuring of this module needed.
Test plan
- `uv run pytest tests/test_goal.py` — 39 tests pass
- `uv run pytest` — full suite (158 tests) passes

🤖 Generated with Claude Code