Context
The smoke walk in discussion #149 (run by session aa-72) surfaced 7 distinct bugs across the HTTP API, CLI spawn lifecycle, and observer/CDC layers. This issue is the root-cause analysis follow-up (aa-73): each bug has been traced to a real code path on main, with a suggested fix sketch. No code has been changed yet.
Three cross-cutting patterns emerged — see the bottom of this issue.
State left for inspection: /tmp/ao-smoke-{A,B,C} data dirs, /tmp/ao_smoke_aa72 binary, throwaway repo harshitsinghbhandari/ao-smoke-149-1780787250.
Bug 1 — POST /api/v1/orchestrators returns opaque 500 + leaves orphan terminated session row for unknown projectId
Code path:
backend/internal/httpd/controllers/sessions.go:288-306 spawnOrchestrator validates only ProjectID == "", then calls Svc.SpawnOrchestrator.
backend/internal/service/session/service.go:106-120 SpawnOrchestrator does not pre-validate the project exists, despite Store.GetProject being available at line 25.
backend/internal/session_manager/manager.go:109 m.store.CreateSession(...) — row is created before workspace validation.
backend/internal/session_manager/manager.go:122-126 workspace.Create → gitworktree.repoPath (backend/internal/adapters/workspace/gitworktree/workspace.go:104) fails for unknown project. Error wrapped as "spawn %s: workspace: %w" with a raw gitworktree: string — not the ErrProjectNotResolvable sentinel that exists and is mapped.
backend/internal/session_manager/manager.go:124 markSpawnFailedTerminated flips IsTerminated=true. Store has no DeleteSession; row sticks (comment at lines 172-176 acknowledges this is intentional given create-first ordering).
backend/internal/service/session/service.go:226-243 toAPIError has no match → default → raw error.
backend/internal/httpd/envelope/envelope.go:44-52 WriteError collapses any non-*apierr.Error to INTERNAL_ERROR 500.
Root cause: Spawn creates the DB row before validating the project, and the underlying gitworktree error is untyped, so the typed-envelope mapping is bypassed.
Suggested fix: In Service.SpawnOrchestrator/Service.Spawn, call store.GetProject first and return apierr.NotFound("PROJECT_NOT_FOUND", ...) before manager.Spawn. Additionally (or instead), make gitworktree.repoPath return ErrProjectNotResolvable so the existing toAPIError mapping fires. Pre-validation also eliminates the orphan row.
Confidence: high.
Bug 2 — POST /api/v1/sessions/{id}/restore returns opaque 500 on terminated/half-spawned sessions (Kill returns proper 409 on the same row)
Code path:
backend/internal/httpd/controllers/sessions.go:195-219 restore and kill handlers are symmetric — both use envelope.WriteError. Divergence is in the manager.
backend/internal/session_manager/manager.go:189-193 Kill has the guard: if handle.ID == "" || ws.Path == "" → returns ErrIncompleteHandle. toAPIError (service.go:236-237) maps it to apierr.Conflict("SESSION_INCOMPLETE_HANDLE", ...).
backend/internal/session_manager/manager.go:209-255 Restore has no equivalent guard. It checks not-found and not-terminated, then either falls through "nothing to resume from" (line 222, untyped string error) or calls workspace.Restore with empty Metadata.Branch/Metadata.WorkspacePath and wraps any failure as "restore %s: workspace: %w" (no sentinel).
toAPIError matches none of these → default → opaque 500.
Root cause: Restore is missing the ErrIncompleteHandle guard that Kill has. The envelope/sentinel infra is fine — Restore just never feeds into it.
Suggested fix: In Manager.Restore, mirror the Kill guard immediately after the terminated check: if rec.Metadata.WorkspacePath == "" || rec.Metadata.Branch == "" { return ... ErrIncompleteHandle }. Optionally convert the "nothing to resume from" string into a typed sentinel.
Confidence: high.
Bug 3 — ao spawn --branch <unfetched> and --branch <checked-out-elsewhere> return opaque 500
Code path:
backend/internal/adapters/workspace/gitworktree/workspace.go:100-117 Create calls validateBranch (only git check-ref-format, no existence/fetched check), then addWorktree.
workspace.go:197-216 addWorktree:
refExists("refs/heads/"+branch) is local-only. An unfetched refs/remotes/origin/X is invisible.
- Falls into the "new branch" path →
git worktree add -b ... <base>. If the branch is checked out in another worktree, git fails: fatal: 'X' is already checked out at ....
workspace.go:212-214 wraps as "gitworktree: worktree add branch ...: <commandError>" (preserves stderr) → manager.go:124-126 re-wraps "spawn %s: workspace: %w" → service.go:95-98 calls toAPIError, no workspace sentinels match → default → opaque 500.
Root cause: No pre-flight for (a) branch fetched (only local heads checked), (b) branch checked out in another worktree. Useful git stderr is wrapped but toAPIError's whitelist has no workspace-error sentinels, so any workspace failure becomes opaque 500.
Suggested fix:
- Before
git worktree add, call listRecords (already wrapped) and reject early with new ErrBranchCheckedOutElsewhere naming the conflicting path.
- In
refExists/addWorktree, also probe refs/remotes/origin/<branch> and refs/tags/<branch>; if remote-only, either auto-track or return new ErrBranchNotFetched carrying the git fetch suggestion.
- Wire both sentinels through
ports.Workspace; extend toAPIError → apierr.Conflict("BRANCH_CHECKED_OUT_ELSEWHERE", ...) and apierr.Invalid("BRANCH_NOT_FETCHED", ...).
Confidence: high.
Bug 4 — ao spawn --claim-pr <bad-url> orphan row + (claimed) exit-0; ao session claim-pr --no-takeover 409 exit-0
Code path (orphan row):
backend/internal/cli/spawn.go:79-94 spawn first POSTs /sessions (creates row), then POSTs /sessions/{id}/pr/claim.
- On claim failure:
spawn.go:86-89 calls rollbackSpawnedSession (spawn.go:136-139) which POSTs /sessions/{id}/kill.
manager.go:181-204 Kill → lcm.MarkTerminated (lifecycle/manager.go:119-128) only flips IsTerminated=true — not a delete. Store has no DeleteSession. Row persists under --include-terminated. Comment at manager.go:172-176 calls out this trade-off ("phantom half-spawned row is worse than terminal one").
Code path (exit code):
backend/internal/cli/client.go:114-121 turns any non-2xx into fmt.Errorf("%s", e.String()). Returns through cobra RunE (SilenceErrors: true) to main.go:11-14 which does os.Exit(ExitCode(err)) → exit 1.
- Test
TestSpawnClaimPRFailureRollsBackSession (spawn_test.go:82-126) confirms the error is returned.
- Could NOT reproduce exit-0 from current source. Every traced path returns a non-nil error → exit 1. Either (a) the smoke binary at
/tmp/ao_smoke_aa72 predates a fix, (b) a shell wrapper (rtk proxy / hooks) is swallowing the code, or (c) there's a code path returning 200 OK + {"ok": false} that wasn't found.
Root cause:
- Orphan row: rollback is
kill (marks terminated), and Store has no delete method. Symptom is by design but user-visible through --include-terminated.
- Exit-0: unconfirmed in current code. Needs reproduction with the actual binary the user ran.
Suggested fix:
- Add
Store.DeleteSession(ctx, id) gated to seed-state rows only (preserves no-resurrection). Wire a Manager.rollbackSpawn(id) that deletes instead of MarkTerminated. Use from cli/spawn.go:86. Better: push claim into the daemon (POST /sessions?claim=...) so the daemon owns the all-or-nothing transaction.
- For exit-0: get
/tmp/ao_smoke_aa72 --version + echo $? transcript and the exact PR URL used; add CLI integration test asserting ExitCode != 0 for both flows.
Confidence: high on orphan-row; LOW on exit-0 — cannot reproduce in current source.
Bug 5 — pr_review_thread_resolved CDC event never lands (only _added does)
Code path:
backend/internal/cdc/event.go:29-30 both event types are declared.
backend/internal/storage/sqlite/migrations/0006_pr_session_changed_cdc.sql:43-82 two triggers:
pr_review_threads_cdc_insert (AFTER INSERT) emits pr_review_thread_added.
pr_review_threads_cdc_update (AFTER UPDATE, WHEN OLD.resolved <> NEW.resolved) emits pr_review_thread_resolved. Only fires on in-place UPDATE.
backend/internal/observe/scm/observer.go:720-732 picks mode: ReviewWriteMerge only when review.Partial == true, else ReviewWriteReplace.
backend/internal/adapters/scm/github/observer_provider.go:193 Partial: true is set only when GitHub pagination is bounded. Single-page reviews → Partial: false → Replace.
backend/internal/storage/sqlite/store/pr_store.go:113-117, 127-133 ReviewWriteReplace calls DeletePRReviewThreads(pr.URL) then UpsertPRReviewThread(...). Every upsert hits INSERT (rows were just deleted) → only the INSERT trigger fires. The UPDATE trigger never sees a resolved flip.
pr_cdc_test.go:67-101 the only resolved-CDC test seeds via Replace and then resolves via Merge — masks the bug.
Root cause: Real-world polls usually take the Replace path (DELETE-all + UPSERT). SQLite sees fresh inserts every poll. The UPDATE trigger that emits _resolved is unreachable for the common case.
Suggested fix: In writePRRows (pr_store.go:113-117), replace "DELETE then upsert" with "upsert observed threads, then DELETE threads whose thread_id is not in the observed set" (set-diff delete). The upsert becomes a true UPDATE for known rows, the resolved flip fires the UPDATE trigger, and stale threads still get pruned. Add a regression test that exercises Replace on the second poll and asserts a pr_review_thread_resolved event.
Confidence: high.
Bug 6 — ao spawn --agent qwen succeeds silently when qwen binary is not on PATH
Code path (two stacked failures):
- Adapter swallows missing binary.
backend/internal/adapters/agent/qwen/qwen.go:149-210 ResolveQwenBinary tries exec.LookPath and well-known dirs; on miss returns "qwen", nil (line 209) — explicit "last-ditch fallback". No error path. GetLaunchCommand (lines 74-92) builds argv ["qwen", "-p", ...].
- Runtime liveness conflates zellij-session-alive with agent-process-alive.
backend/internal/adapters/runtime/zellij/zellij.go:120-153 Create writes a layout KDL and runs zellij --new-session-with-layout. Zellij creates the session + pane regardless of whether the inner command exists; the pane's shell runs qwen → command not found → shell exits, but the pane (and zellij session) persist. findAgentPane (lines 268-291) matches by title — that pane is there. Create returns success → manager.go:163-168 lcm.MarkSpawned records idle/alive.
backend/internal/observe/reaper/reaper.go:140-167 Reaper.probeOne calls runtime.IsAlive, which is zellij list-sessions (zellij.go:208-222). Zellij session is alive → Probe=Alive → session stays "live" forever. No per-pane PID tracking.
Root cause: Two layered bugs reinforce each other — qwen's resolver returns success with a string fallback (audit grok, kimi, etc. for the same pattern), and the zellij runtime has no signal for "command in pane died at execve". Stderr from the failed exec stays in the pane scroll buffer, never logged.
Suggested fix:
- Drop the silent fallback in
ResolveQwenBinary — return a typed ErrAgentBinaryNotFound when exec.LookPath misses. Audit other adapters.
- Expose
ports.Agent.Validate(ctx) (or pre-flight exec.LookPath(argv[0]) in Manager.Spawn after GetLaunchCommand) so the failure aborts BEFORE runtime.Create. Map sentinel via toAPIError to apierr.Invalid("AGENT_BINARY_NOT_FOUND", ...).
- Longer-term (separate issue): wrap argv in a shim that reports inner-process exit to the daemon, so the reaper can mark the session dead. Also fixes the broader "agent crashed post-launch" silent class.
Confidence: high.
Bug 7 — SCM observer "scm observer disabled: provider credentials unavailable" log line never emitted on no-token startup
Code path:
backend/internal/daemon/scm_wiring.go:22-30 startSCMObserver builds the GitHub provider with SkipTokenPreflight: true (intentional — readiness must not block on gh). Constructs observer, starts loop. No log if token absent.
scm_wiring.go:44-50 logSCMProviderDisabled has the right log line, but only runs when newGitHubSCMProvider itself errors. With SkipTokenPreflight: true, it doesn't error on missing token. Dead branch in the no-token case.
backend/internal/observe/scm/observer.go:163-186 Start/loop immediately calls Poll.
observer.go:223-244 Poll runs discoverSubjects first, returns early at line 236 if len(subjects) == 0. The credential check (line 238) is unreachable until at least one non-terminated session with a branch + supported origin exists.
observer.go:402-426 checkCredentials contains the documented o.logger.Warn("scm observer disabled: provider credentials unavailable") at line 422. Guarded by credentialsChecked (once per process) AND only runs if Poll reaches past the subject check.
Root cause: The credential gate is hidden behind two preconditions inside Poll: subjects must exist AND Poll must reach the check. Fresh daemon with no tracked PRs → discoverSubjects empty → Poll returns nil → warn never fires. Wiring layer can't catch it because SkipTokenPreflight: true makes provider construction succeed regardless.
Suggested fix: Either (a) in Observer.loop, call checkCredentials once before the first Poll, independent of subjects, and log the disabled-state warning there; or (b) in startSCMObserver, explicitly probe token availability (non-blocking, no gh shell-out) after constructing the provider and emit the documented warn line before starting the loop. Add a daemon-level test that asserts the log when AO_GITHUB_TOKEN is unset.
Confidence: high.
Cross-cutting patterns
toAPIError's sentinel whitelist is the choke point for opaque 500s (bugs 1, 2, 3). Every workspace / external error that doesn't carry a recognized sentinel collapses to INTERNAL_ERROR in the envelope. Either widen the whitelist, or make wrappers (gitworktree, etc.) emit typed errors.
- Spawn creates the DB row before validating preconditions (bugs 1, 4).
Store.CreateSession is irreversible (no delete). Either add DeleteSession gated on seed state, OR validate everything before CreateSession. The phantom-row comment at manager.go:172-176 documents the trade-off explicitly — it's a design choice that bleeds through to users in two different code paths.
- Liveness/observation gates run too late (bugs 5, 6, 7): SCM credential warning gated behind a subject check; zellij session liveness gated above per-pane process liveness; CDC
resolved trigger gated behind UPSERT-as-update. In each case, a "should be obvious" signal is silently lost because the upstream gate filters it out.
One unverifiable claim
Bug 4's "exit 0 on error" sub-symptom could not be reproduced from current source. Every traced path returns a non-nil error → os.Exit(ExitCode(err)) == 1. Worth getting --version + echo $? from the user before fixing.
RCA done by aa-73 (3 parallel codex agents partitioned by subsystem). Follow-up to aa-72 / discussion #149.
Context
The smoke walk in discussion #149 (run by session aa-72) surfaced 7 distinct bugs across the HTTP API, CLI spawn lifecycle, and observer/CDC layers. This issue is the root-cause analysis follow-up (aa-73): each bug has been traced to a real code path on
main, with a suggested fix sketch. No code has been changed yet.Three cross-cutting patterns emerged — see the bottom of this issue.
State left for inspection:
/tmp/ao-smoke-{A,B,C}data dirs,/tmp/ao_smoke_aa72binary, throwaway repoharshitsinghbhandari/ao-smoke-149-1780787250.Bug 1 —
POST /api/v1/orchestratorsreturns opaque 500 + leaves orphan terminated session row for unknown projectIdCode path:
backend/internal/httpd/controllers/sessions.go:288-306spawnOrchestratorvalidates onlyProjectID == "", then callsSvc.SpawnOrchestrator.backend/internal/service/session/service.go:106-120SpawnOrchestratordoes not pre-validate the project exists, despiteStore.GetProjectbeing available at line 25.backend/internal/session_manager/manager.go:109m.store.CreateSession(...)— row is created before workspace validation.backend/internal/session_manager/manager.go:122-126workspace.Create→gitworktree.repoPath(backend/internal/adapters/workspace/gitworktree/workspace.go:104) fails for unknown project. Error wrapped as"spawn %s: workspace: %w"with a rawgitworktree:string — not theErrProjectNotResolvablesentinel that exists and is mapped.backend/internal/session_manager/manager.go:124markSpawnFailedTerminatedflipsIsTerminated=true. Store has noDeleteSession; row sticks (comment at lines 172-176 acknowledges this is intentional given create-first ordering).backend/internal/service/session/service.go:226-243toAPIErrorhas no match → default → raw error.backend/internal/httpd/envelope/envelope.go:44-52WriteErrorcollapses any non-*apierr.ErrortoINTERNAL_ERROR500.Root cause: Spawn creates the DB row before validating the project, and the underlying gitworktree error is untyped, so the typed-envelope mapping is bypassed.
Suggested fix: In
Service.SpawnOrchestrator/Service.Spawn, callstore.GetProjectfirst and returnapierr.NotFound("PROJECT_NOT_FOUND", ...)beforemanager.Spawn. Additionally (or instead), makegitworktree.repoPathreturnErrProjectNotResolvableso the existingtoAPIErrormapping fires. Pre-validation also eliminates the orphan row.Confidence: high.
Bug 2 —
POST /api/v1/sessions/{id}/restorereturns opaque 500 on terminated/half-spawned sessions (Kill returns proper 409 on the same row)Code path:
backend/internal/httpd/controllers/sessions.go:195-219restore and kill handlers are symmetric — both useenvelope.WriteError. Divergence is in the manager.backend/internal/session_manager/manager.go:189-193Killhas the guard:if handle.ID == "" || ws.Path == ""→ returnsErrIncompleteHandle.toAPIError(service.go:236-237) maps it toapierr.Conflict("SESSION_INCOMPLETE_HANDLE", ...).backend/internal/session_manager/manager.go:209-255Restorehas no equivalent guard. It checks not-found and not-terminated, then either falls through "nothing to resume from" (line 222, untyped string error) or callsworkspace.Restorewith emptyMetadata.Branch/Metadata.WorkspacePathand wraps any failure as"restore %s: workspace: %w"(no sentinel).toAPIErrormatches none of these → default → opaque 500.Root cause: Restore is missing the
ErrIncompleteHandleguard that Kill has. The envelope/sentinel infra is fine — Restore just never feeds into it.Suggested fix: In
Manager.Restore, mirror the Kill guard immediately after the terminated check:if rec.Metadata.WorkspacePath == "" || rec.Metadata.Branch == "" { return ... ErrIncompleteHandle }. Optionally convert the "nothing to resume from" string into a typed sentinel.Confidence: high.
Bug 3 —
ao spawn --branch <unfetched>and--branch <checked-out-elsewhere>return opaque 500Code path:
backend/internal/adapters/workspace/gitworktree/workspace.go:100-117CreatecallsvalidateBranch(onlygit check-ref-format, no existence/fetched check), thenaddWorktree.workspace.go:197-216addWorktree:refExists("refs/heads/"+branch)is local-only. An unfetchedrefs/remotes/origin/Xis invisible.git worktree add -b ... <base>. If the branch is checked out in another worktree, git fails:fatal: 'X' is already checked out at ....workspace.go:212-214wraps as"gitworktree: worktree add branch ...: <commandError>"(preserves stderr) →manager.go:124-126re-wraps"spawn %s: workspace: %w"→service.go:95-98callstoAPIError, no workspace sentinels match → default → opaque 500.Root cause: No pre-flight for (a) branch fetched (only local heads checked), (b) branch checked out in another worktree. Useful git stderr is wrapped but
toAPIError's whitelist has no workspace-error sentinels, so any workspace failure becomes opaque 500.Suggested fix:
git worktree add, calllistRecords(already wrapped) and reject early with newErrBranchCheckedOutElsewherenaming the conflicting path.refExists/addWorktree, also proberefs/remotes/origin/<branch>andrefs/tags/<branch>; if remote-only, either auto-track or return newErrBranchNotFetchedcarrying thegit fetchsuggestion.ports.Workspace; extendtoAPIError→apierr.Conflict("BRANCH_CHECKED_OUT_ELSEWHERE", ...)andapierr.Invalid("BRANCH_NOT_FETCHED", ...).Confidence: high.
Bug 4 —
ao spawn --claim-pr <bad-url>orphan row + (claimed) exit-0;ao session claim-pr --no-takeover409 exit-0Code path (orphan row):
backend/internal/cli/spawn.go:79-94spawn first POSTs/sessions(creates row), then POSTs/sessions/{id}/pr/claim.spawn.go:86-89callsrollbackSpawnedSession(spawn.go:136-139) which POSTs/sessions/{id}/kill.manager.go:181-204Kill→lcm.MarkTerminated(lifecycle/manager.go:119-128) only flipsIsTerminated=true— not a delete. Store has noDeleteSession. Row persists under--include-terminated. Comment atmanager.go:172-176calls out this trade-off ("phantom half-spawned row is worse than terminal one").Code path (exit code):
backend/internal/cli/client.go:114-121turns any non-2xx intofmt.Errorf("%s", e.String()). Returns through cobraRunE(SilenceErrors: true) tomain.go:11-14which doesos.Exit(ExitCode(err))→ exit 1.TestSpawnClaimPRFailureRollsBackSession(spawn_test.go:82-126) confirms the error is returned./tmp/ao_smoke_aa72predates a fix, (b) a shell wrapper (rtk proxy / hooks) is swallowing the code, or (c) there's a code path returning200 OK + {"ok": false}that wasn't found.Root cause:
kill(marks terminated), andStorehas no delete method. Symptom is by design but user-visible through--include-terminated.Suggested fix:
Store.DeleteSession(ctx, id)gated to seed-state rows only (preserves no-resurrection). Wire aManager.rollbackSpawn(id)that deletes instead ofMarkTerminated. Use fromcli/spawn.go:86. Better: push claim into the daemon (POST /sessions?claim=...) so the daemon owns the all-or-nothing transaction./tmp/ao_smoke_aa72 --version+echo $?transcript and the exact PR URL used; add CLI integration test assertingExitCode != 0for both flows.Confidence: high on orphan-row; LOW on exit-0 — cannot reproduce in current source.
Bug 5 —
pr_review_thread_resolvedCDC event never lands (only_addeddoes)Code path:
backend/internal/cdc/event.go:29-30both event types are declared.backend/internal/storage/sqlite/migrations/0006_pr_session_changed_cdc.sql:43-82two triggers:pr_review_threads_cdc_insert(AFTER INSERT) emitspr_review_thread_added.pr_review_threads_cdc_update(AFTER UPDATE,WHEN OLD.resolved <> NEW.resolved) emitspr_review_thread_resolved. Only fires on in-place UPDATE.backend/internal/observe/scm/observer.go:720-732picks mode:ReviewWriteMergeonly whenreview.Partial == true, elseReviewWriteReplace.backend/internal/adapters/scm/github/observer_provider.go:193Partial: trueis set only when GitHub pagination is bounded. Single-page reviews →Partial: false→ Replace.backend/internal/storage/sqlite/store/pr_store.go:113-117, 127-133ReviewWriteReplacecallsDeletePRReviewThreads(pr.URL)thenUpsertPRReviewThread(...). Every upsert hits INSERT (rows were just deleted) → only the INSERT trigger fires. The UPDATE trigger never sees aresolvedflip.pr_cdc_test.go:67-101the only resolved-CDC test seeds via Replace and then resolves via Merge — masks the bug.Root cause: Real-world polls usually take the Replace path (DELETE-all + UPSERT). SQLite sees fresh inserts every poll. The UPDATE trigger that emits
_resolvedis unreachable for the common case.Suggested fix: In
writePRRows(pr_store.go:113-117), replace "DELETE then upsert" with "upsert observed threads, then DELETE threads whosethread_idis not in the observed set" (set-diff delete). The upsert becomes a true UPDATE for known rows, theresolvedflip fires the UPDATE trigger, and stale threads still get pruned. Add a regression test that exercises Replace on the second poll and asserts apr_review_thread_resolvedevent.Confidence: high.
Bug 6 —
ao spawn --agent qwensucceeds silently whenqwenbinary is not on PATHCode path (two stacked failures):
backend/internal/adapters/agent/qwen/qwen.go:149-210ResolveQwenBinarytriesexec.LookPathand well-known dirs; on miss returns"qwen", nil(line 209) — explicit "last-ditch fallback". No error path.GetLaunchCommand(lines 74-92) builds argv["qwen", "-p", ...].backend/internal/adapters/runtime/zellij/zellij.go:120-153Createwrites a layout KDL and runszellij --new-session-with-layout. Zellij creates the session + pane regardless of whether the inner command exists; the pane's shell runsqwen→command not found→ shell exits, but the pane (and zellij session) persist.findAgentPane(lines 268-291) matches by title — that pane is there.Createreturns success →manager.go:163-168lcm.MarkSpawnedrecords idle/alive.backend/internal/observe/reaper/reaper.go:140-167Reaper.probeOnecallsruntime.IsAlive, which iszellij list-sessions(zellij.go:208-222). Zellij session is alive →Probe=Alive→ session stays "live" forever. No per-pane PID tracking.Root cause: Two layered bugs reinforce each other — qwen's resolver returns success with a string fallback (audit
grok,kimi, etc. for the same pattern), and the zellij runtime has no signal for "command in pane died at execve". Stderr from the failed exec stays in the pane scroll buffer, never logged.Suggested fix:
ResolveQwenBinary— return a typedErrAgentBinaryNotFoundwhenexec.LookPathmisses. Audit other adapters.ports.Agent.Validate(ctx)(or pre-flightexec.LookPath(argv[0])inManager.SpawnafterGetLaunchCommand) so the failure aborts BEFOREruntime.Create. Map sentinel viatoAPIErrortoapierr.Invalid("AGENT_BINARY_NOT_FOUND", ...).Confidence: high.
Bug 7 — SCM observer "scm observer disabled: provider credentials unavailable" log line never emitted on no-token startup
Code path:
backend/internal/daemon/scm_wiring.go:22-30startSCMObserverbuilds the GitHub provider withSkipTokenPreflight: true(intentional — readiness must not block ongh). Constructs observer, starts loop. No log if token absent.scm_wiring.go:44-50logSCMProviderDisabledhas the right log line, but only runs whennewGitHubSCMProvideritself errors. WithSkipTokenPreflight: true, it doesn't error on missing token. Dead branch in the no-token case.backend/internal/observe/scm/observer.go:163-186Start/loopimmediately callsPoll.observer.go:223-244PollrunsdiscoverSubjectsfirst, returns early at line 236 iflen(subjects) == 0. The credential check (line 238) is unreachable until at least one non-terminated session with a branch + supported origin exists.observer.go:402-426checkCredentialscontains the documentedo.logger.Warn("scm observer disabled: provider credentials unavailable")at line 422. Guarded bycredentialsChecked(once per process) AND only runs if Poll reaches past the subject check.Root cause: The credential gate is hidden behind two preconditions inside
Poll: subjects must exist AND Poll must reach the check. Fresh daemon with no tracked PRs →discoverSubjectsempty → Poll returns nil → warn never fires. Wiring layer can't catch it becauseSkipTokenPreflight: truemakes provider construction succeed regardless.Suggested fix: Either (a) in
Observer.loop, callcheckCredentialsonce before the first Poll, independent of subjects, and log the disabled-state warning there; or (b) instartSCMObserver, explicitly probe token availability (non-blocking, noghshell-out) after constructing the provider and emit the documented warn line before starting the loop. Add a daemon-level test that asserts the log whenAO_GITHUB_TOKENis unset.Confidence: high.
Cross-cutting patterns
toAPIError's sentinel whitelist is the choke point for opaque 500s (bugs 1, 2, 3). Every workspace / external error that doesn't carry a recognized sentinel collapses toINTERNAL_ERRORin the envelope. Either widen the whitelist, or make wrappers (gitworktree, etc.) emit typed errors.Store.CreateSessionis irreversible (no delete). Either addDeleteSessiongated on seed state, OR validate everything beforeCreateSession. The phantom-row comment atmanager.go:172-176documents the trade-off explicitly — it's a design choice that bleeds through to users in two different code paths.resolvedtrigger gated behind UPSERT-as-update. In each case, a "should be obvious" signal is silently lost because the upstream gate filters it out.One unverifiable claim
Bug 4's "exit 0 on error" sub-symptom could not be reproduced from current source. Every traced path returns a non-nil error →
os.Exit(ExitCode(err)) == 1. Worth getting--version+echo $?from the user before fixing.RCA done by aa-73 (3 parallel codex agents partitioned by subsystem). Follow-up to aa-72 / discussion #149.