Add goal manager — atomic_agents.goal (closes #3) #14

Merged
dep0we merged 3 commits into main from feat/goal-manager on May 7, 2026
Conversation

dep0we (Owner) commented May 7, 2026

Summary

Implements issue #3 — goal manager per spec/12-goals-and-intent. Lets agents pursue persistent goals across many sessions via decomposition into sub-goals with explicit lifecycle.

This is the v0.4 milestone module. Independent of evals/tuning — pure data model + lifecycle. Stacked on the tuning PR; base auto-updates as PRs land.

What's in this PR

New module: atomic_agents/goal.py (~700 LOC)

Data model + validation:

  • Goal, SubGoal, CompletionEvaluation dataclasses matching spec/12 frontmatter
  • validate_goal() — required/optional fields, sub-goal status enum, duplicate-id detection, type checks
  • validate_agent_mode() — reactive | goal-driven | hybrid
  • parse_agent_mode() — reads IDENTITY.md "Operating mode" section; defaults to reactive if not declared

GoalManager class:

  • load() / save() — frontmatter round-trip with validation
  • has_active_goal() — quick check for hybrid runtimes to switch modes
  • next_sub_goal() — finds the next pending, unblocked sub-goal (filters by blocked_by chain)
  • Lifecycle transitions (mark_in_progress, mark_complete, mark_blocked, mark_abandoned) — validated state machine, idempotent where it should be (mark_complete on already-complete is a no-op)
  • add_sub_goal() — operator action to extend the goal
  • evaluate_completion() — all-criteria-met flag, per-status counts, deadline pacing (overdue detection only when active + past deadline)
  • archive() — non-destructive move to goal_archive/YYYY-MM-DD_<intent_slug>.md; used for both completion and abandonment
  • abandon(reason) — operator-initiated archive with stated reason
  • status_summary() — one-screen text output for goal status
  • progress_report() — structured periodic check-in (suitable for appending to journal entries); includes pacing analysis with ahead-of-pace / behind-pace flags
  • Auto-appended history — every state transition writes a timestamped line to the goal's body History section
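As a rough illustration of the blocker-chain filtering in next_sub_goal(), here is a minimal sketch — the dataclass fields follow this PR's description, but the actual internals of atomic_agents/goal.py may differ:

```python
from dataclasses import dataclass, field

# Sketch only: field names mirror the PR description, not the real module.
@dataclass
class SubGoal:
    id: str
    status: str = "pending"  # pending | in_progress | complete | blocked | abandoned
    blocked_by: list = field(default_factory=list)  # ids of prerequisite sub-goals

def next_sub_goal(sub_goals):
    """Return the first pending sub-goal whose blockers are all complete."""
    by_id = {sg.id: sg for sg in sub_goals}
    for sg in sub_goals:
        if sg.status != "pending":
            continue
        if all(by_id[b].status == "complete" for b in sg.blocked_by if b in by_id):
            return sg
    return None  # nothing dispatchable right now

goals = [
    SubGoal("a", status="complete"),
    SubGoal("b", blocked_by=["a"]),  # unblocked: its only blocker is complete
    SubGoal("c", blocked_by=["b"]),  # suppressed: "b" is not complete yet
]
print(next_sub_goal(goals).id)  # → b
```

Completing "b" would then make "c" dispatchable on the next call, which is the unblocking behavior the tests below exercise.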

CLI: python -m atomic_agents.goal {status|next|advance|abandon|complete|report}

Hard rules (per spec/12) — all enforced:

  • Operator-set goals (no auto-generation)
  • Operator-set success criteria (no agent-side tuning)
  • Missed deadlines surface to operator (overdue flag), never auto-extended
  • Locked decisions in policy/ never overridden (out of goal manager's scope; respected by the agent runtime)
  • Sequential goals only in v0.4 (one active per agent)
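The overdue rule (surfaced to the operator, never auto-extended) reduces to a small predicate. This is a sketch under assumed field names, not the module's exact API:

```python
from datetime import date

# Assumed shape: a goal has a status string and an optional deadline date.
def is_overdue(status, deadline, today=None):
    """Overdue only when the goal is active AND past its deadline.
    The flag is reported; nothing here extends the deadline."""
    today = today or date.today()
    return status == "active" and deadline is not None and today > deadline

print(is_overdue("active", date(2026, 5, 1), today=date(2026, 5, 7)))  # → True
```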

Tests: tests/test_goal.py — 39 tests, all passing

  • Validation: 5 cases (valid, missing field, invalid priority, invalid sub_goal status, duplicate id)
  • Mode parsing: 5 cases (each mode + missing section + missing file)
  • Load + save: round-trip preservation, corrupt YAML rejection, has_active_goal in both states
  • Next sub-goal: picks first unblocked, returns None when none dispatchable, respects blocker chain (blocked-by-blocker-not-complete suppresses; completing the blocker unblocks)
  • Lifecycle: each transition + idempotency + invalid transitions + add_sub_goal (incl. duplicate id rejection)
  • Completion evaluation: in-progress + all-done + overdue
  • Archive + abandon with reason (verifies non-destructive move, metadata captured)
  • Status summary + progress report formatting (key fields present, overdue flagged, pacing computed)
  • Save persistence: changes survive reload, last_progress_check updated, history appended

Total suite: 158/158 passing.

Spec coverage

Per spec/12 acceptance criteria (issue #3):

  • ✅ goal.md schema validation
  • ✅ Sub-goal lifecycle (pending → in_progress → complete | blocked | abandoned)
  • ✅ Decomposition loop (next_sub_goal + lifecycle transitions)
  • ✅ Completion criteria evaluator
  • ✅ Goal abandonment (non-destructive)
  • ✅ Progress check (periodic via goal report; cadence config is operator-driven)
  • ✅ Hybrid agent support (parse_agent_mode + has_active_goal)
  • ✅ CLI: status / next / advance / abandon / complete / report

What's deferred

  • Multi-agent project goals + queue dispatch — the project-level goal.md at <system>/projects/<project>/goal.md and queue handoff to peer roles belong to issue #6 ([v0.7] Multi-agent project cascade loader — atomic_agents.agent), where the project structure is the right context. The data model from this PR works identically there.
  • Runtime assembly extension at step [3.5] — AtomicAgent.assemble_system_prompt doesn't yet load goal.md. The change is a one-line conditional on has_active_goal(). Deferred to issue #6, where multi-agent runtime assembly is restructured anyway, to avoid touching agent.py twice.

Both are pure additions; no restructuring of this module needed.

Test plan

🤖 Generated with Claude Code

Dan Powers added 3 commits May 6, 2026 20:55
Implements the evaluation runner per spec/08-evaluation. Runs an Atomic Agent
against golden tests and scores outputs via a cross-family LLM judge. Produces
structured run results in evals/runs/YYYY-MM-DD.jsonl per agent.

Key pieces:

- EvalRunner class — loads rubric.md (weights, threshold, hard fails) and
  judge.md (recommended judges, strict mode, audit %)
- discover_tests() — walks evals/golden/{happy,edge,adversarial,decline}/
- run_test() — invoke agent, build judge prompt, call cross-family judge,
  parse JSON scores, compute weighted score + verdict
- run_suite() — sequential execution with results written to runs/ and
  agent responses to runs/responses/
- pick_judge_model() — cross-family preferred, same-family fallback if
  cross-family API key is missing; never self-judge
- Strict mode in judge prompts — score against rubric, not judge's taste
- Malformed-JSON retry — one stricter retry before declaring judge_error
- Hard fails override weighted score — verdict is fail regardless of score
- Cost computed per call (agent + judge separately, summed for run total)
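The judge-selection preference and the hard-fail override can be sketched roughly as follows. Function names and the `candidates` shape are illustrative, not the real EvalRunner API:

```python
import os

def pick_judge_model(agent_family, candidates):
    """Prefer a cross-family judge; fall back to same-family only when no
    cross-family API key is present. (Self-judging is assumed to be ruled
    out upstream by excluding the agent's own model from candidates.)"""
    for family, (model, key_env) in candidates.items():
        if family != agent_family and os.environ.get(key_env):
            return model  # cross-family preferred
    for family, (model, key_env) in candidates.items():
        if family == agent_family and os.environ.get(key_env):
            return model  # same-family fallback
    return None  # no judge available

def verdict(weighted_score, threshold, hard_fails):
    # Hard fails override the weighted score: fail regardless of how high it is.
    if hard_fails:
        return "fail"
    return "pass" if weighted_score >= threshold else "fail"

print(verdict(0.95, 0.8, hard_fails=["fabricated_citation"]))  # → fail
```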

CLI: python -m atomic_agents.eval <agent> [--category|--test|--all]
     [--summary-only] [--no-write]

Tests: 27 new tests covering rubric loading, judge selection logic
(cross-family + fallback + no-judge-available), score parsing (valid JSON,
fenced code, malformed), weighted score computation, hard-fail override,
end-to-end runs (mocked LLM), JSONL + response file writes.

Total suite: 94/94 passing.

Deferred to #8 (research integrity Layers 2-3):
- expected_facts processing (Layer 2 — source-grounded eval)
- factual_checks aggregation in scoring
- Audit sample capture to evals/runs/audit_log.jsonl
- Trajectory capture (currently passes "(not implemented)" placeholder)

These all extend EvalRunner without restructuring; pure additions.

Implements the eval-driven tuning analyzer per spec/11-tuning. Reads recent
eval results + memory state; detects recurring patterns; generates specific
edit proposals to persona/memory/tools files; operator approves before any
edit lands.

Closes the "eval observes performance, but how do you act on it" loop.

Key pieces:

- AnalysisContext — shared input loaded once per run (eval JSONLs from
  evals/runs/, memory frontmatter, tuning_history)
- 4 pattern detectors (the spec/11 minimum set):
  - RecurringPersonaFidelityLow — persona scores ≤3 across N+ tests with
    recurring judge phrases
  - HardFailRecurring — same hard-fail code firing across multiple runs
  - StaleNoteRecurring — memory notes past last_seen + threshold without
    refresh, not pinned/archived
  - PromotableMemoryDetected — feedback/user notes referenced 5+ times
    without contradiction → suggest persona promotion
- EditProposal dataclass with frontmatter ready for operator decision
- 4 proposal generators producing diff bodies + rationale + risks +
  verification plan, dispatched per detector type
- Optional LLM polish (--polish flag, ~$0.02/proposal) — refines proposal
  wording via Sonnet using existing target file as voice context
- Report rendering — markdown output to evals/tuning_reports/YYYY-MM-DD_proposal.md
  with self-contained YAML frontmatter blocks for each proposal
- parse_report_proposals() — round-trips the report after operator edits
  the operator_decision frontmatter
- apply_proposals() — records all decisions (accepted/rejected/deferred)
  to tuning_history.jsonl; v0.3 doesn't auto-write the diff (operator
  applies manually after reviewing) — auto-edit can layer on later
- Detector crashes don't kill the analysis (one failing detector skipped,
  others run)
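The crash-isolation loop might look roughly like this — detector names come from this commit, but the signatures are assumptions:

```python
import logging

def run_detectors(detectors, context):
    """Run every detector; a crash in one is logged and skipped so the
    remaining detectors still contribute findings."""
    findings = []
    for detect in detectors:
        try:
            findings.extend(detect(context))
        except Exception:
            logging.exception("detector %s failed; skipping", detect.__name__)
    return findings

# Hypothetical stand-ins for two of the spec/11 detectors:
def stale_note_recurring(ctx):
    raise RuntimeError("boom")  # simulates one detector crashing

def hard_fail_recurring(ctx):
    return [("HardFailRecurring", "same hard-fail code in 3 runs")]

findings = run_detectors([stale_note_recurring, hard_fail_recurring], {})
print(findings)  # → [('HardFailRecurring', 'same hard-fail code in 3 runs')]
```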

CLI: python -m atomic_agents.tuning <agent> [--since 60d] [--polish]
                                            [--apply <report>] [--dry-run]

HARD RULE (per spec/11): tuning never auto-applies. Operator approval
required for every edit.

Tests: 25 new tests covering each detector (positive + negative cases),
proposal generation (unique IDs, required fields), report rendering
(with proposals + empty case), apply flow (decisions recorded, dry-run,
missing report, mixed decisions), LLM polish (success + silent fallback
on failure), and detector-crash isolation.

Total suite: 119/119 passing.

Implements the goal manager per spec/12-goals-and-intent. Lets agents
pursue persistent goals across many sessions via decomposition into
sub-goals with explicit lifecycle.

Most agents are reactive (Caldwell-style). Goal-driven agents (Muse
Director-style) hold an objective across hundreds of runs. Hybrid agents
(reactive by default, goal-driven when an active goal.md exists) get
the best of both.

Key pieces:

- Goal + SubGoal dataclasses matching spec/12 frontmatter schema
- validate_goal() + validate_agent_mode() — schema validation
- parse_agent_mode() — reads IDENTITY.md "Operating mode" section
- GoalManager class:
  - load() / save() — frontmatter round-trip with validation
  - has_active_goal() — quick check for hybrid runtimes to switch modes
  - next_sub_goal() — finds the next pending, unblocked sub_goal
  - mark_in_progress / mark_complete / mark_blocked / mark_abandoned —
    state transitions with validation; idempotent where appropriate
  - add_sub_goal() — operator action to extend the goal
  - evaluate_completion() — reports all-criteria-met, per-status counts,
    deadline pacing (overdue when active and past deadline)
  - archive() — non-destructive move to goal_archive/, used for both
    completion and abandonment
  - abandon(reason) — operator-initiated archive with stated reason
  - status_summary() — one-screen output for `goal status`
  - progress_report() — structured periodic check-in for journals,
    includes pacing analysis (ahead-of/behind-pace flagging)
- _append_history() — auto-appends timestamped lines to goal body's
  History section on every state transition

CLI: python -m atomic_agents.goal {status|next|advance|abandon|complete|report}

HARD RULES (per spec/12):
- Operator-set goals only — agent doesn't auto-generate
- Operator-set success criteria — agent doesn't tune
- No auto-extension of missed deadlines (surfaces overdue, doesn't fix)
- No override of locked decisions
- Sequential goals only in v0.4 (one active per agent)

Multi-agent project queue dispatch is deferred to issue #6
(multi-agent project cascade loader). For v0.4, single-agent goal-driven
mode works end-to-end; the agent self-handles each next sub_goal as the
work item for its next invocation.

Tests: 39 new tests covering schema validation (positive + negative
across all required fields), agent mode parsing (4 cases incl. defaults),
load/save round-trip, has_active_goal, next_sub_goal logic (incl. blocker
filtering with status transitions), all lifecycle transitions
(in_progress/complete/blocked/abandoned, idempotent + invalid cases),
add_sub_goal (incl. duplicate id rejection), completion evaluation
(in-progress + all-done + overdue), archive + abandon with reason,
status summary + progress report formatting, save persistence with
history entries.

Total suite: 158/158 passing.
@dep0we dep0we changed the base branch from feat/tuning-analyzer to main May 7, 2026 13:42
@dep0we dep0we merged commit 4292119 into main May 7, 2026
@dep0we dep0we deleted the feat/goal-manager branch May 7, 2026 13:42
dep0we added a commit that referenced this pull request May 7, 2026
* chore: add CI workflow + update CHANGELOG for v0.9 (closes #10)

CI:
- .github/workflows/test.yml: GitHub Actions runs pytest on push to
  main and on every PR. Matrix: Python 3.11 + 3.12. Uses
  astral-sh/setup-uv@v3 with cache, fail-fast disabled (so one
  Python version's failure doesn't kill the other), in-progress
  cancellation on new pushes to same branch.
- README.md: status badge, Python-version badge, MIT license badge.

CHANGELOG:
- New v0.9.0 section consolidating everything that landed across PRs
  #12 / #14 / #16 / #18 / #19 / #20 / #21 / #22 / #23 — eval, tuning,
  goal manager, migrate, tool-call captures, cascade loader, spec
  import, operational extras, helper provenance, research integrity
  layers 2+3.
- Tests bumped 67 → 296 across 8 new modules.
- v0.1 entry preserved unchanged below the new section.

After this lands, v0.9 is feature-complete relative to the original
spec. Remaining gaps before v1.0 (per the README status table) are
non-code: first non-Bishop agent deployed end-to-end, vault docs sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: key uv cache off pyproject.toml (uv.lock is gitignored)

Initial workflow defaulted to the **/uv.lock cache key, which fails on
this repo because uv.lock is gitignored. Switching to pyproject.toml
keeps caching working without changing the gitignore policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Dan Powers <dep0we@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dep0we pushed a commit that referenced this pull request May 13, 2026
…cing

Two stopgap patches that unblock the Kimi review wrapper (#134) for
international operators. Both will land properly with the LLMBackend
protocol (#87); until then these are minimum-blast-radius patches,
matching CLAUDE.md rule #14.

`_llm._call_moonshot` reads `ATOMIC_AGENTS_MOONSHOT_BASE_URL` /
`MOONSHOT_BASE_URL` env vars and falls back to the existing
`api.moonshot.cn` default. Operators with keys from the international
portal (`api.moonshot.ai`) can set the env var and use the framework
without forking `_llm.py`. No breaking changes — every existing
deployment continues to hit `.cn`.
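A sketch of that resolution order — the env var names are from this commit, but the helper name and default string here are illustrative, not `_llm.py`'s exact internals:

```python
import os

# Assumed default; the real value lives in _llm._call_moonshot.
DEFAULT_MOONSHOT_BASE_URL = "https://api.moonshot.cn"

def moonshot_base_url():
    """Framework-specific var wins, then the generic var, then the
    existing .cn default — so current deployments are unaffected."""
    return (
        os.environ.get("ATOMIC_AGENTS_MOONSHOT_BASE_URL")
        or os.environ.get("MOONSHOT_BASE_URL")
        or DEFAULT_MOONSHOT_BASE_URL
    )

print(moonshot_base_url())  # default unless an override env var is set
```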

`_costs.PRICING` gains five new Moonshot model entries:
- `moonshot/moonshot-v1-{8k,32k,128k}` (the non-thinking models)
- `moonshot/kimi-k2.6`, `moonshot/kimi-k2.5` (the thinking models;
  available via `--model` once #146 extracts `reasoning_content`)

All five priced at the existing placeholder rate of $0.30/$1.20 per Mtok
in/out matching `moonshot/kimi-2.6` (the legacy alias). The placeholder
note in PRICING is updated to recommend verifying against current
Moonshot pricing before depending on dashboard cost totals.

Refs #134

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
