Skip to content

feat(backend): Lifecycle Manager + Session Manager lane#2

Merged
harshitsinghbhandari merged 25 commits into
mainfrom
feat/lcm-sm-contracts
May 27, 2026
Merged

feat(backend): Lifecycle Manager + Session Manager lane#2
harshitsinghbhandari merged 25 commits into
mainfrom
feat/lcm-sm-contracts

Conversation

@harshitsinghbhandari
Copy link
Copy Markdown
Collaborator

@harshitsinghbhandari harshitsinghbhandari commented May 26, 2026

Summary

This branch is the complete Lifecycle Manager (LCM) + Session Manager (SM) lane of the backend rewrite — the deterministic core that supervises agent sessions and keeps one true status per session. It was built contract-first on a single integration branch; the sub-PRs below merged into this branch, and it now lands on main as one cohesive unit.

Full documentation lives in docs/: start with docs/architecture.md, and see docs/status.md for what's done vs. what's left.

What it delivers

A per-session OBSERVE → DECIDE → ACT loop where OBSERVE lives outside the LCM (observers call in) and the LCM is a synchronous reducer:

  • domain/ — the canonical state model (CanonicalSessionLifecycle, persisted) and DeriveLegacyStatus, the sole producer of the display status (derived on read, never stored).
  • domain/decide/ — the pure, total, deterministic decision core (liveness reconciliation, the PR ladder, the anti-flap detecting quarantine), table-tested to 100%.
  • ports/ — the stabilized boundary: inbound LifecycleManager/SessionManager, outbound LifecycleStore/Notifier/AgentMessenger/Runtime/Agent/Workspace, and the SCMFacts/RuntimeFacts/ActivitySignal DTOs.
  • lifecycle/ — the LCM: the Apply* pipeline (load → decide → diff → persist), per-session serialization, the liveness-vs-activity composition rules, and the reaction table + escalation engine (TickEscalations).
  • session/ — the SM: spawn / kill / restore / cleanup / list, with eager rollback and the worktree-remove safety.

Folded-in PRs (merged into this branch)

Key invariants

  • Persist canonical state; derive the display status on read.
  • One authority for death — only the decide pipeline (via detecting) writes inferred terminal states; the SM's explicit-kill path goes through OnKillRequested.
  • Failed probe ≠ dead; PR facts dominate the soft session states once a PR exists; worktree-remove safety on teardown.

Status

Implemented and tested entirely behind in-memory fakesgofmt / go build / go vet / go test -race green across domain, decide, lifecycle, and session. The SM tests drive the real LCM for spawn/kill round-trips.

Not in this lane (integration phase, tracked in #3): swapping fakes for real adapters (persistence/CDC, SCM poller, runtime/agent/workspace plugins, notifier, API), plus three carried-forward items — the react() out-of-lock dispatch ordering, the unused ExpectedRevision CAS for multi-writer/CDC, and a real implementation of the store's new Seed/Get. See docs/status.md for the full hand-off.

Test plan

  • cd backend && gofmt -l . && go build ./... && go vet ./... && go test -race ./... all green
  • Review the cumulative diff vs. main as one lane (sub-PRs were reviewed against the integration branch)

🤖 Generated with Claude Code

Contract-first boundary for the Lifecycle Manager + Session Manager lane.
Pure shapes only — types, interfaces, and the display-status derivation —
so adil (SCM poller), Tom (persistence), and aditi (API) can review and
build against a stable boundary before any behaviour lands.

domain/
  - CanonicalSessionLifecycle: the only persisted state (session/pr/runtime
    sub-states), with Activity + Detecting sub-states added as decider inputs
    that must survive between observations.
  - DeriveLegacyStatus: the sole producer of the derived display status
    (never persisted), with 11 table tests.
ports/
  - inbound: LifecycleManager (Apply* pipeline, per-session serialised) and
    SessionManager.
  - outbound: LifecycleStore (Tom), Notifier, AgentMessenger, and the
    Runtime/Agent/Workspace plugin ports (co-owned with the agents lane).
  - facts: SCMFacts / RuntimeFacts / ActivitySignal DTOs.
decide/ pure-core signatures + I/O types; bodies stubbed for the next PR.

Folds in four design-review fixes (documented in-code, pending team confirm):
  1. Activity + Detecting persisted so the pure decider has memory across calls.
  2. Per-session serialisation documented; LifecyclePatch.ExpectedVersion
     offers optimistic-locking as an alternative.
  3. LifecyclePatch is a sparse pointer-field merge-patch (+ ClearDetecting).
  4. SCMFacts gains Fetched (failed fetch != "PR closed") and per-comment
     IsBot (bot vs human route to different reactions).

go build / go vet / go test all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces the initial contract-first Go packages for the backend “Lifecycle Manager (LCM) + Session Manager (SM)” lane, defining the canonical lifecycle domain model, inbound/outbound ports, and a single pure status-derivation function (with tests) to stabilize boundaries for parallel implementation work.

Changes:

  • Added domain canonical lifecycle/state enums, Session read-model, and DeriveLegacyStatus (plus table-style tests).
  • Added ports package with inbound interfaces (LCM/SM), outbound interfaces (store/notifier/runtime/agent/workspace), and fact DTOs (SCM/runtime/activity/spawn/kill).
  • Added domain/decide package with pure-core I/O shapes and stubbed function bodies for a follow-up PR.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
backend/internal/ports/outbound.go Outbound ports for persistence, notifications, and runtime/agent/workspace plugins
backend/internal/ports/inbound.go Inbound contracts for LifecycleManager and SessionManager
backend/internal/ports/facts.go DTOs (“facts”) crossing the LCM/SM boundary (SCM, runtime, activity, spawn/kill)
backend/internal/domain/status.go Derived display status enum + DeriveLegacyStatus
backend/internal/domain/status_test.go Tests for DeriveLegacyStatus
backend/internal/domain/session.go API read-model Session definition
backend/internal/domain/lifecycle.go Canonical persisted lifecycle model + enums/substates
backend/internal/domain/decide/decide.go Decide-core input/output shapes and tuning constants (stubbed bodies)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backend/internal/ports/outbound.go Outdated
Comment thread backend/internal/ports/outbound.go Outdated
Comment thread backend/internal/domain/status.go Outdated
Comment thread backend/internal/ports/facts.go Outdated
Comment thread backend/internal/ports/outbound.go Outdated
Comment thread backend/internal/ports/outbound.go
Comment thread backend/internal/ports/facts.go
Comment thread backend/internal/ports/outbound.go Outdated
harshitsinghbhandari and others added 2 commits May 26, 2026 20:55
Review feedback on PR #2:

- Add CanonicalSessionLifecycle.Revision (monotonic write counter) distinct
  from the schema Version; LifecyclePatch.ExpectedVersion -> ExpectedRevision
  now compares it, so optimistic locking actually works.
- LifecycleStore.List returns []domain.SessionRecord (persistence shape, no
  derived status); add SessionRecord and make Session embed it. Keeps the
  Session Manager the single producer of the derived display status.
- SpawnOutcome.RuntimeHandle is now the structured ports.RuntimeHandle, not a
  string, so Destroy/SendMessage get the handle without ad-hoc encoding.
- Agent.IsProcessRunning -> ProbeProcess returning ProcessProbe (liveness),
  not domain.ActivityState; the name no longer implies a boolean.
- Document LifecyclePatch Detecting vs ClearDetecting precedence (clear wins).
- Correct the DeriveLegacyStatus doc: hard session states outrank PR facts;
  "PR dominates" applies only to the soft idle/working states. Implementation
  was already correct (matches canonical AO); only the comment overstated it.
- Replace personal-name attributions in package/interface comments with
  role-based terms (SCM poller / persistence adapter / API layer).

go build / go vet / go test all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- go.yml: gofmt check, build, vet, and race-enabled tests for backend/,
  triggered on backend changes and pushes to main.
- gitleaks.yml: secret scanning on PRs and main using gitleaks-action v1
  (license-free; v2 requires GITLEAKS_LICENSE for org repos).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tests

Replace the stubbed deciders in domain/decide with real, total,
side-effect-free implementations:

- ResolveProbeDecision: kill-intent short-circuits to terminal; failed
  probes and probe disagreement route to detecting; only runtime-dead +
  process-dead + no-recent-activity concludes killed.
- ResolveOpenPRDecision: the PR pipeline ladder
  (ci_failing > changes_requested > approved+mergeable > approved >
  review_pending > idle-beyond > open).
- ResolveTerminalPRStateDecision: merged -> idle/merged_waiting_decision,
  closed -> idle.
- CreateDetectingDecision: anti-flap quarantine — unchanged-evidence
  counter with StartedAt preserved across the episode so the duration cap
  is a real wall-clock safety net; escalates to stuck at 3 ticks or 5m.
- HashEvidence: strips timestamps/epochs and collapses whitespace before
  hashing so restamped-but-unchanged signals compare equal.

Table tests cover every branch (100% statement coverage), including a
consistency check that the open-PR ladder's display Status matches
DeriveLegacyStatus over the canonical state it emits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
harshitsinghbhandari and others added 8 commits May 27, 2026 00:06
…ency test

Address PR review (aa-1):
- ResolveOpenPRDecision now keys MERGEABLE on Mergeable alone (checked
  before Approved). Mergeability is the authoritative merge gate, so a PR
  on a no-required-review repo no longer falls through to PR_OPEN. The
  approved+mergeable and approved-only cases are unchanged.
- Broaden the derive-consistency test to cover the probe and terminal
  deciders too, not just the open-PR ladder.
- Document the HashEvidence epoch-stripping regex's breadth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ting helper

Address PR review (aa-1):
- Move the decider input/output type definitions (LifecycleDecision,
  ProbeInput, ProcessLiveness, OpenPRInput, DetectingInput) out of the
  execution file into a dedicated types.go, leaving decide.go for behaviour.
- Expand the detecting() helper doc to spell out that it adapts a probe
  verdict into the shared CreateDetectingDecision anti-flap path so each
  probe branch doesn't re-implement the quarantine counter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clarify the timestamp-stripping block flagged in review: spell out what
each of the three regexes matches (full ISO/RFC3339 datetime, bare
time-of-day, bare unix epoch) and why order matters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… contract

Address Copilot review:
- ResolveOpenPRDecision now sets a stable, timestamp-free Evidence summary
  "<condition> #<num> <url>" for every ladder outcome, consuming the
  previously-unused OpenPRInput.Number/URL identity inputs and making PR
  decisions traceable in logs. Covered by TestResolveOpenPRDecisionEvidence.
- Document LifecycleDecision's zero-value contract: an empty PRState/PRReason
  means "this decider does not address PR — leave unchanged", not PRNone. The
  LCM must map empty PR fields to a nil LifecyclePatch.PR; writing PRNone on a
  probe tick would clobber a live PR. (Pointers were considered but the empty
  sentinel is distinguishable from every valid state and consistent with the
  codebase's value-enum style; LifecyclePatch already owns nil-means-unchanged.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(decide): pure DECIDE core + exhaustive truth-table tests
Implements ports.LifecycleManager as a synchronous observe->decide->persist
reducer. Every entrypoint runs the shared pipeline under a per-session lock:
load canonical -> run the matching pure decider -> diff into a sparse
merge-patch -> persist. Never polls, never writes the display status.

- ApplyRuntimeObservation -> probe decider; always writes the runtime axis.
- ApplySCMObservation -> open-PR / terminal-PR deciders (failed fetch is a
  no-op: failed probe != "no PR"). Open PRs write only the PR axis.
- ApplyActivitySignal -> updates the activity axis + maps onto the session
  axis; only valid-confidence signals are authoritative.
- OnSpawnCompleted -> runtime alive + handles to metadata; session stays
  not_started (display: spawning).
- OnKillRequested -> SM's explicit terminal-write authority.
- TickEscalations -> no-op stub (reaction/escalation engine is split B).

Composition rule (#1): liveness owns the runtime + death axis; activity owns
the working/idle/waiting axis. A healthy probe verdict writes the session axis
only to recover a liveness-owned state, so it never clobbers an activity-owned
needs_input/blocked. Activity is the mirror: it stays off the death axis.

Detecting clear (#3): a non-detecting probe verdict clears stale detecting
memory so the next probe reads no phantom Prior.

Built/tested against in-memory fakes (LifecycleStore with full merge-patch +
ExpectedRevision, recording Notifier/AgentMessenger). Per-session
serialisation verified under -race.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- shouldWriteSessionRuntime: never resurrect a terminal session; an
  observation may refresh the runtime axis but must touch neither the
  session axis nor the detecting memory (gated in ApplyRuntimeObservation).
- OnSpawnCompleted: error on an unseeded session instead of fabricating a
  partial record (SM must seed first — a missing seed is a contract violation).
- OnKillRequested: no-op on an unknown/already-gone session (benign race)
  instead of fabricating a terminal record.
- keyedMutex: reference-count entries and evict on last release so the lock
  map stays bounded in a long-running daemon.
- runtimeSubstateFromFacts: map RuntimeProbeIndeterminate to RuntimeUnknown
  with a neutral reason, distinct from the probe_error of a failed probe.

Adds tests for terminal non-resurrection, unseeded spawn-completed error,
and unknown-session kill no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
harshitsinghbhandari and others added 3 commits May 27, 2026 01:49
Address Harshit's PR #5 review (approve w/ design confirms + polish):

- #1 (design decision): a valid activity signal is proof of life, so it now
  resolves a detecting session — writes the activity-mapped session state and
  clears the quarantine memory. Scoped to detecting only; a liveness-escalated
  stuck stays the probe pipeline's to resolve. Terminal still never reopens.
- #2: document why a merged/closed PR parks the session axis even over an
  activity-owned needs_input/blocked (a merge is a milestone), unlike the
  open-PR path that defers to activity.
- #3: map plain idle activity to a neutral session reason instead of the
  misleading research_complete (kept for ready, which implies completion).
- #6: cover all three kill kinds (manual/cleanup/error), the open-PR review
  branches (changes_requested/mergeable/review_pending), and the neutral idle
  reason. Coverage 86.5% -> 88.6%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(lifecycle): LCM Apply* pipeline (split A)
…t B)

Add the ACT half of the LCM: map persisted status transitions to reactions
(send-to-agent / notify / auto-merge) and drive escalation.

- reactions.go: the §4.2 default reaction table, reactionEventFor (mirrors
  DeriveLegacyStatus for the ACT layer), in-memory per-(session,reaction)
  escalation trackers, the react() dispatch chokepoint, and a real
  TickEscalations for duration-based escalations the synchronous LCM can't
  wake itself for. auto-merge action exists but is off by default;
  bugbot-comments/merge-conflicts are configured but dormant (no decide-core
  producer yet).
- manager.go: mutate now returns a transition; each Apply* path fires the
  mapped reaction after persist via the single synchronous react() seam.
  OnKillRequested intentionally does not react (explicit kill != inferred
  event). Split-A load->decide->diff->persist behavior is unchanged.

ci-failed budget is persistent across fail->pending->fail oscillation;
non-persistent trackers reset when the status leaves the triggering state.
Escalation silences further auto-dispatch until the condition clears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
harshitsinghbhandari and others added 6 commits May 27, 2026 02:26
 review)

Address review finding #1: the persistent ci-failed tracker leaked and could
stale-silence a future regression. It was only cleared when leaving the
ci-failed reaction AND incidentOver held at that moment — so a recovery to
another open-PR state (ci-failed -> approved -> merged) never cleared it.

- react() now clears ALL of a session's trackers when the state REACHED is
  incident-over (PR resolved / session terminal) OR a genuine recovery
  (approved/mergeable, which the open-PR ladder guarantees means CI is no
  longer failing). Keyed on the state reached, not the one left, since the
  recovery transition is typically review_pending->approved (empty beforeKey).
- Persistent ci-failed still survives the ambiguous review_pending limbo, so
  fail->pending->fail keeps one shared budget (§4.2).
- Document the out-of-lock react() dispatch caveat for the daemon integration
  step (review #2) and the intentionally-skipped agent-stuck 10m threshold.

Tests: re-arm after a genuine recovery (regression re-nudges, not silenced);
all session trackers cleared once the incident is over.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…escalation (Copilot review)

- OnKillRequested now clears the session's escalation trackers after a
  successful kill, so a later duration-based TickEscalations can't emit
  reaction.escalated for a dead session (dispatch is still skipped).
- sendToAgent rolls back the attempt (and firstAttemptAt when it set it) on a
  messenger.Send error, so undelivered messages don't march a reaction toward
  escalation — honoring "send failures retry next tick" (§4.3).
- Duration escalation now uses an inclusive boundary (>=) in both shouldEscalate
  and TickEscalations, so a 30m reaction escalates at exactly 30m instead of
  waiting for the next tick.

Tests: kill clears trackers + no post-kill escalation; repeated failed delivery
never escalates; duration escalation fires at exactly escalateAfter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(lifecycle): ACT layer — reaction table + escalation engine (split B)
…store/cleanup)

Implements ports.SessionManager against fakes for the outbound ports. The SM is
the explicit-mutation half of the lane: it drives Runtime/Agent/Workspace, seeds
the initial lifecycle, and routes outcomes to the LCM (OnSpawnCompleted /
OnKillRequested). It never derives observed state and is the single producer of
the derived display status (attached on read, never persisted).

- Spawn: Workspace.Create -> Runtime.Create (AO_* identity env) -> Seed ->
  OnSpawnCompleted, with eager rollback of completed steps on failure.
- Kill: OnKillRequested first -> Runtime.Destroy -> Workspace.Destroy, honoring
  the worktree-remove safety (refusal surfaced, never forced).
- List/Get: derive status via DeriveLegacyStatus. Send: via AgentMessenger.
  Restore: re-seed (reopen) + relaunch via GetRestoreCommand. Cleanup: reclaim
  terminal sessions, skip worktrees holding uncommitted work.

Store-contract additions (co-owned with Tom's persistence layer, flagged for
review): LifecycleStore.Seed (explicit create-with-identity; OnSpawnCompleted
requires a seeded record) and LifecycleStore.Get (single record-with-identity
read; Load is lifecycle-only). Lifecycle test fake updated to satisfy both.

Tests route through the real LCM Manager (wrapped to record call order).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntime on post-create failure (PR #7 review)

- Restore now fails early with a clear error if MetaAgentSessionID is missing,
  rather than emitting an ambiguous "resume nothing" launch command (no stored
  prompt means a fresh-launch fallback isn't possible).
- On a post-runtime-create failure (reopen patch or OnSpawnCompleted), best-effort
  destroy the newly created runtime (never the workspace, which holds prior work)
  so we don't strand a live process while parking the session terminal.
- Added a test for Restore with a missing agent session id: errors early, touches
  no workspace/runtime, leaves the session terminal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e rollback (review follow-ups)

Adds an injectable OnSpawnCompleted failure to the recording LCM and two tests:
- Spawn: when OnSpawnCompleted fails, the seeded record is parked terminal/errored
  (via OnKillRequested(KillError)) and runtime+workspace are torn down.
- Restore: when OnSpawnCompleted fails post-create, the new runtime is destroyed
  while the workspace is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(session): Session Manager — spawn/kill/list/get/send/restore/cleanup
Add docs/ for newcomers: an index, an architecture deep-dive (the
OBSERVE→DECIDE→ACT loop, the canonical state model, the package layout, every
component, and the load-bearing invariants), and a status/roadmap (what's done
PR-by-PR, what's left, the integration to-dos + carried-forward items, the open
cross-lane contract questions, and where to plug in). Link them from the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshitsinghbhandari harshitsinghbhandari changed the title feat(backend): LCM + Session Manager contract package feat(backend): Lifecycle Manager + Session Manager lane May 27, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.

Comment thread backend/internal/session/manager.go
Comment thread backend/internal/session/manager.go
Comment thread backend/internal/ports/inbound.go Outdated
Comment thread backend/internal/lifecycle/reactions.go
Address PR #2 Copilot review comments on the merged LCM+SM lane:

- session: validate runtime handle + workspace path before Kill/Cleanup
  teardown; refuse (ErrIncompleteTeardownMetadata) or skip rather than
  hand empty args to a real adapter's Destroy (unsafe delete).
- session: reject Restore unless the session is terminal
  (ErrNotRestorable) so a live session can't spawn a duplicate
  runtime/workspace.
- ports: document SpawnConfig.OpenTerminal as reserved/not yet honored.
- lifecycle: remove the unread reactionConfig.auto field; note
  approved-and-green is notify-only (human decides to merge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(session): harden teardown/restore safety + drop dead reaction flag
@harshitsinghbhandari
Copy link
Copy Markdown
Collaborator Author

All 4 review comments addressed in #8 (f03c7c8), now merged into this branch — resolving the threads. Verified on the current tip (CI build-test + scan green; local gofmt/build/vet/test -race green):

  1. Kill/Cleanup teardownErrIncompleteTeardownMetadata: required handles (workspace path, runtime id/name) are validated before recording intent or calling any adapter Destroy, so an empty-arg (unsafe) delete can't happen on a partial/orphaned record.
  2. RestoreErrNotRestorable: Restore now rejects a non-terminal session, preventing a duplicate runtime/workspace for a live id.
  3. SpawnConfig.OpenTerminal — documented as reserved / not yet honored by Spawn (planned terminal-lane feature), so callers aren't misled.
  4. reactionConfig.auto — removed the dead/derivable flag (the only auto:false entry is a notify, which is the human-decision path; auto-merge stays off).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants