Skip to content

RCA: 7 bugs from discussion #149 smoke walk (opaque 500s, orphan rows, silent failures) #152

@harshitsinghbhandari

Description

@harshitsinghbhandari

Context

The smoke walk in discussion #149 (run by session aa-72) surfaced 7 distinct bugs across the HTTP API, CLI spawn lifecycle, and observer/CDC layers. This issue is the root-cause analysis follow-up (aa-73): each bug has been traced to a real code path on main, with a suggested fix sketch. No code has been changed yet.

Three cross-cutting patterns emerged — see the bottom of this issue.

State left for inspection: /tmp/ao-smoke-{A,B,C} data dirs, /tmp/ao_smoke_aa72 binary, throwaway repo harshitsinghbhandari/ao-smoke-149-1780787250.


Bug 1 — POST /api/v1/orchestrators returns opaque 500 + leaves orphan terminated session row for unknown projectId

Code path:

  • backend/internal/httpd/controllers/sessions.go:288-306 spawnOrchestrator validates only ProjectID == "", then calls Svc.SpawnOrchestrator.
  • backend/internal/service/session/service.go:106-120 SpawnOrchestrator does not pre-validate the project exists, despite Store.GetProject being available at line 25.
  • backend/internal/session_manager/manager.go:109 m.store.CreateSession(...) — row is created before workspace validation.
  • backend/internal/session_manager/manager.go:122-126 workspace.Creategitworktree.repoPath (backend/internal/adapters/workspace/gitworktree/workspace.go:104) fails for unknown project. Error wrapped as "spawn %s: workspace: %w" with a raw gitworktree: string — not the ErrProjectNotResolvable sentinel that exists and is mapped.
  • backend/internal/session_manager/manager.go:124 markSpawnFailedTerminated flips IsTerminated=true. Store has no DeleteSession; row sticks (comment at lines 172-176 acknowledges this is intentional given create-first ordering).
  • backend/internal/service/session/service.go:226-243 toAPIError has no match → default → raw error.
  • backend/internal/httpd/envelope/envelope.go:44-52 WriteError collapses any non-*apierr.Error to INTERNAL_ERROR 500.

Root cause: Spawn creates the DB row before validating the project, and the underlying gitworktree error is untyped, so the typed-envelope mapping is bypassed.

Suggested fix: In Service.SpawnOrchestrator/Service.Spawn, call store.GetProject first and return apierr.NotFound("PROJECT_NOT_FOUND", ...) before manager.Spawn. Additionally (or instead), make gitworktree.repoPath return ErrProjectNotResolvable so the existing toAPIError mapping fires. Pre-validation also eliminates the orphan row.

Confidence: high.


Bug 2 — POST /api/v1/sessions/{id}/restore returns opaque 500 on terminated/half-spawned sessions (Kill returns proper 409 on the same row)

Code path:

  • backend/internal/httpd/controllers/sessions.go:195-219 restore and kill handlers are symmetric — both use envelope.WriteError. Divergence is in the manager.
  • backend/internal/session_manager/manager.go:189-193 Kill has the guard: if handle.ID == "" || ws.Path == "" → returns ErrIncompleteHandle. toAPIError (service.go:236-237) maps it to apierr.Conflict("SESSION_INCOMPLETE_HANDLE", ...).
  • backend/internal/session_manager/manager.go:209-255 Restore has no equivalent guard. It checks not-found and not-terminated, then either falls through "nothing to resume from" (line 222, untyped string error) or calls workspace.Restore with empty Metadata.Branch/Metadata.WorkspacePath and wraps any failure as "restore %s: workspace: %w" (no sentinel).
  • toAPIError matches none of these → default → opaque 500.

Root cause: Restore is missing the ErrIncompleteHandle guard that Kill has. The envelope/sentinel infra is fine — Restore just never feeds into it.

Suggested fix: In Manager.Restore, mirror the Kill guard immediately after the terminated check: if rec.Metadata.WorkspacePath == "" || rec.Metadata.Branch == "" { return ... ErrIncompleteHandle }. Optionally convert the "nothing to resume from" string into a typed sentinel.

Confidence: high.


Bug 3 — ao spawn --branch <unfetched> and --branch <checked-out-elsewhere> return opaque 500

Code path:

  • backend/internal/adapters/workspace/gitworktree/workspace.go:100-117 Create calls validateBranch (only git check-ref-format, no existence/fetched check), then addWorktree.
  • workspace.go:197-216 addWorktree:
    • refExists("refs/heads/"+branch) is local-only. An unfetched refs/remotes/origin/X is invisible.
    • Falls into the "new branch" path → git worktree add -b ... <base>. If the branch is checked out in another worktree, git fails: fatal: 'X' is already checked out at ....
  • workspace.go:212-214 wraps as "gitworktree: worktree add branch ...: <commandError>" (preserves stderr) → manager.go:124-126 re-wraps "spawn %s: workspace: %w"service.go:95-98 calls toAPIError, no workspace sentinels match → default → opaque 500.

Root cause: No pre-flight for (a) branch fetched (only local heads checked), (b) branch checked out in another worktree. Useful git stderr is wrapped but toAPIError's whitelist has no workspace-error sentinels, so any workspace failure becomes opaque 500.

Suggested fix:

  • Before git worktree add, call listRecords (already wrapped) and reject early with new ErrBranchCheckedOutElsewhere naming the conflicting path.
  • In refExists/addWorktree, also probe refs/remotes/origin/<branch> and refs/tags/<branch>; if remote-only, either auto-track or return new ErrBranchNotFetched carrying the git fetch suggestion.
  • Wire both sentinels through ports.Workspace; extend toAPIErrorapierr.Conflict("BRANCH_CHECKED_OUT_ELSEWHERE", ...) and apierr.Invalid("BRANCH_NOT_FETCHED", ...).

Confidence: high.


Bug 4 — ao spawn --claim-pr <bad-url> orphan row + (claimed) exit-0; ao session claim-pr --no-takeover 409 exit-0

Code path (orphan row):

  • backend/internal/cli/spawn.go:79-94 spawn first POSTs /sessions (creates row), then POSTs /sessions/{id}/pr/claim.
  • On claim failure: spawn.go:86-89 calls rollbackSpawnedSession (spawn.go:136-139) which POSTs /sessions/{id}/kill.
  • manager.go:181-204 Killlcm.MarkTerminated (lifecycle/manager.go:119-128) only flips IsTerminated=true — not a delete. Store has no DeleteSession. Row persists under --include-terminated. Comment at manager.go:172-176 calls out this trade-off ("phantom half-spawned row is worse than terminal one").

Code path (exit code):

  • backend/internal/cli/client.go:114-121 turns any non-2xx into fmt.Errorf("%s", e.String()). Returns through cobra RunE (SilenceErrors: true) to main.go:11-14 which does os.Exit(ExitCode(err)) → exit 1.
  • Test TestSpawnClaimPRFailureRollsBackSession (spawn_test.go:82-126) confirms the error is returned.
  • Could NOT reproduce exit-0 from current source. Every traced path returns a non-nil error → exit 1. Either (a) the smoke binary at /tmp/ao_smoke_aa72 predates a fix, (b) a shell wrapper (rtk proxy / hooks) is swallowing the code, or (c) there's a code path returning 200 OK + {"ok": false} that wasn't found.

Root cause:

  • Orphan row: rollback is kill (marks terminated), and Store has no delete method. Symptom is by design but user-visible through --include-terminated.
  • Exit-0: unconfirmed in current code. Needs reproduction with the actual binary the user ran.

Suggested fix:

  • Add Store.DeleteSession(ctx, id) gated to seed-state rows only (preserves no-resurrection). Wire a Manager.rollbackSpawn(id) that deletes instead of MarkTerminated. Use from cli/spawn.go:86. Better: push claim into the daemon (POST /sessions?claim=...) so the daemon owns the all-or-nothing transaction.
  • For exit-0: get /tmp/ao_smoke_aa72 --version + echo $? transcript and the exact PR URL used; add CLI integration test asserting ExitCode != 0 for both flows.

Confidence: high on orphan-row; LOW on exit-0 — cannot reproduce in current source.


Bug 5 — pr_review_thread_resolved CDC event never lands (only _added does)

Code path:

  • backend/internal/cdc/event.go:29-30 both event types are declared.
  • backend/internal/storage/sqlite/migrations/0006_pr_session_changed_cdc.sql:43-82 two triggers:
    • pr_review_threads_cdc_insert (AFTER INSERT) emits pr_review_thread_added.
    • pr_review_threads_cdc_update (AFTER UPDATE, WHEN OLD.resolved <> NEW.resolved) emits pr_review_thread_resolved. Only fires on in-place UPDATE.
  • backend/internal/observe/scm/observer.go:720-732 picks mode: ReviewWriteMerge only when review.Partial == true, else ReviewWriteReplace.
  • backend/internal/adapters/scm/github/observer_provider.go:193 Partial: true is set only when GitHub pagination is bounded. Single-page reviews → Partial: false → Replace.
  • backend/internal/storage/sqlite/store/pr_store.go:113-117, 127-133 ReviewWriteReplace calls DeletePRReviewThreads(pr.URL) then UpsertPRReviewThread(...). Every upsert hits INSERT (rows were just deleted) → only the INSERT trigger fires. The UPDATE trigger never sees a resolved flip.
  • pr_cdc_test.go:67-101 the only resolved-CDC test seeds via Replace and then resolves via Merge — masks the bug.

Root cause: Real-world polls usually take the Replace path (DELETE-all + UPSERT). SQLite sees fresh inserts every poll. The UPDATE trigger that emits _resolved is unreachable for the common case.

Suggested fix: In writePRRows (pr_store.go:113-117), replace "DELETE then upsert" with "upsert observed threads, then DELETE threads whose thread_id is not in the observed set" (set-diff delete). The upsert becomes a true UPDATE for known rows, the resolved flip fires the UPDATE trigger, and stale threads still get pruned. Add a regression test that exercises Replace on the second poll and asserts a pr_review_thread_resolved event.

Confidence: high.


Bug 6 — ao spawn --agent qwen succeeds silently when qwen binary is not on PATH

Code path (two stacked failures):

  1. Adapter swallows missing binary. backend/internal/adapters/agent/qwen/qwen.go:149-210 ResolveQwenBinary tries exec.LookPath and well-known dirs; on miss returns "qwen", nil (line 209) — explicit "last-ditch fallback". No error path. GetLaunchCommand (lines 74-92) builds argv ["qwen", "-p", ...].
  2. Runtime liveness conflates zellij-session-alive with agent-process-alive. backend/internal/adapters/runtime/zellij/zellij.go:120-153 Create writes a layout KDL and runs zellij --new-session-with-layout. Zellij creates the session + pane regardless of whether the inner command exists; the pane's shell runs qwencommand not found → shell exits, but the pane (and zellij session) persist. findAgentPane (lines 268-291) matches by title — that pane is there. Create returns success → manager.go:163-168 lcm.MarkSpawned records idle/alive.
  3. backend/internal/observe/reaper/reaper.go:140-167 Reaper.probeOne calls runtime.IsAlive, which is zellij list-sessions (zellij.go:208-222). Zellij session is alive → Probe=Alive → session stays "live" forever. No per-pane PID tracking.

Root cause: Two layered bugs reinforce each other — qwen's resolver returns success with a string fallback (audit grok, kimi, etc. for the same pattern), and the zellij runtime has no signal for "command in pane died at execve". Stderr from the failed exec stays in the pane scroll buffer, never logged.

Suggested fix:

  • Drop the silent fallback in ResolveQwenBinary — return a typed ErrAgentBinaryNotFound when exec.LookPath misses. Audit other adapters.
  • Expose ports.Agent.Validate(ctx) (or pre-flight exec.LookPath(argv[0]) in Manager.Spawn after GetLaunchCommand) so the failure aborts BEFORE runtime.Create. Map sentinel via toAPIError to apierr.Invalid("AGENT_BINARY_NOT_FOUND", ...).
  • Longer-term (separate issue): wrap argv in a shim that reports inner-process exit to the daemon, so the reaper can mark the session dead. Also fixes the broader "agent crashed post-launch" silent class.

Confidence: high.


Bug 7 — SCM observer "scm observer disabled: provider credentials unavailable" log line never emitted on no-token startup

Code path:

  • backend/internal/daemon/scm_wiring.go:22-30 startSCMObserver builds the GitHub provider with SkipTokenPreflight: true (intentional — readiness must not block on gh). Constructs observer, starts loop. No log if token absent.
  • scm_wiring.go:44-50 logSCMProviderDisabled has the right log line, but only runs when newGitHubSCMProvider itself errors. With SkipTokenPreflight: true, it doesn't error on missing token. Dead branch in the no-token case.
  • backend/internal/observe/scm/observer.go:163-186 Start/loop immediately calls Poll.
  • observer.go:223-244 Poll runs discoverSubjects first, returns early at line 236 if len(subjects) == 0. The credential check (line 238) is unreachable until at least one non-terminated session with a branch + supported origin exists.
  • observer.go:402-426 checkCredentials contains the documented o.logger.Warn("scm observer disabled: provider credentials unavailable") at line 422. Guarded by credentialsChecked (once per process) AND only runs if Poll reaches past the subject check.

Root cause: The credential gate is hidden behind two preconditions inside Poll: subjects must exist AND Poll must reach the check. Fresh daemon with no tracked PRs → discoverSubjects empty → Poll returns nil → warn never fires. Wiring layer can't catch it because SkipTokenPreflight: true makes provider construction succeed regardless.

Suggested fix: Either (a) in Observer.loop, call checkCredentials once before the first Poll, independent of subjects, and log the disabled-state warning there; or (b) in startSCMObserver, explicitly probe token availability (non-blocking, no gh shell-out) after constructing the provider and emit the documented warn line before starting the loop. Add a daemon-level test that asserts the log when AO_GITHUB_TOKEN is unset.

Confidence: high.


Cross-cutting patterns

  1. toAPIError's sentinel whitelist is the choke point for opaque 500s (bugs 1, 2, 3). Every workspace / external error that doesn't carry a recognized sentinel collapses to INTERNAL_ERROR in the envelope. Either widen the whitelist, or make wrappers (gitworktree, etc.) emit typed errors.
  2. Spawn creates the DB row before validating preconditions (bugs 1, 4). Store.CreateSession is irreversible (no delete). Either add DeleteSession gated on seed state, OR validate everything before CreateSession. The phantom-row comment at manager.go:172-176 documents the trade-off explicitly — it's a design choice that bleeds through to users in two different code paths.
  3. Liveness/observation gates run too late (bugs 5, 6, 7): SCM credential warning gated behind a subject check; zellij session liveness gated above per-pane process liveness; CDC resolved trigger gated behind UPSERT-as-update. In each case, a "should be obvious" signal is silently lost because the upstream gate filters it out.

One unverifiable claim

Bug 4's "exit 0 on error" sub-symptom could not be reproduced from current source. Every traced path returns a non-nil error → os.Exit(ExitCode(err)) == 1. Worth getting --version + echo $? from the user before fixing.


RCA done by aa-73 (3 parallel codex agents partitioned by subsystem). Follow-up to aa-72 / discussion #149.

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions