refactor(decopilot): drop /attach orphan-resume; collapse to pure tail#3387
Merged
Conversation
Pre-PR-#3376, POST /messages dispatched runs directly, so a pod death mid-run left orphan threads that only the next /attach could resurrect. Now every user message lives inside a thread-gate DBOS workflow step, and the recovery executor replays it on a healthy pod with the streamBuffer wired in — chunks land back on the per-thread JetStream subject and the existing /attach tail picks them up. The heartbeat watcher in app.ts remains as a backstop. Also removes the now-dead fire-and-forget dispatchRun export and threadStorage from DecopilotDeps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
🧪 BenchmarkShould we run the Virtual MCP strategy benchmark for this PR? React with 👍 to run the benchmark.
Benchmark will run on the next push after you react. |
Contributor
Release OptionsSuggested: Patch ( React with an emoji to override the release type:
Current version:
|
Contributor
There was a problem hiding this comment.
1 issue found across 7 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/mesh/src/api/routes/decopilot/routes.ts">
<violation number="1" location="apps/mesh/src/api/routes/decopilot/routes.ts:403">
P2: `/attach` chooses `deliverPolicy` from pod-local `runRegistry`, so attaches handled by a different pod can miss buffered chunks of an in-flight run. Use thread state from `validateThreadAccess` (global DB view) when deciding `"all"` vs `"new"`.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
isRunning() is pod-local; a client attached to a non-owner pod (multi-pod deployment, mid-deploy, or post-DBOS-replay rehome) would silently miss chunks the owner had already pumped to the shared JetStream subject. thread.status is set synchronously by run-reactor's claimRunStart, so it's a cluster-wide signal. The buffer purges on terminal events, so "all" only ever replays the current in-flight run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
viktormarinho
added a commit
that referenced
this pull request
May 17, 2026
) * test(multi-pod): cross-pod /attach + mock-ai provider scaffolding Builds out the LLM-dependent half of the multi-pod framework so we can exercise the decopilot dispatch pipeline end-to-end without burning real provider budget. **Mock-ai service** (tests/multi-pod/mock-ai/) is a ~140-LOC Bun HTTP server that speaks the OpenAI chat-completions wire protocol — enough for mesh's `openai-compatible` adapter to drive `streamText`. Test-time controls (chunk count + delay) come from the user message text because mesh strips request headers before calling the provider; the mock parses "slow:NxMS" / "many:N" hints out of the prompt. **Setup helpers** (lib/setup.ts) gain `wireMockProvider`, `createTestAgent`, `createTestThread` for the three calls every dispatch scenario needs: register an openai-compatible credential pointing at mock-ai, pin the org's "smart" tier to it, create a virtual MCP, pre-create the thread row (without which /attach 404s before the workflow's prepareRun gets to insert it). **DB helper** (lib/db.ts) shells out to `docker compose exec postgres psql` for the rare cases where a scenario needs internal state not surfaced by the public API — currently used to look up the actual dispatch owner from `threads.run_owner_pod` for targeted-kill scenarios. **POD_NAME wired into compose** so `run_owner_pod` matches the compose service name ("mesh-1") and tests can map straight from a thread row to a `docker compose kill` target. **Scenarios:** - `attach-cross-pod` (✅ passing) — the headline test that directly validates the deliverPolicy fix from PR #3387. POSTs on pod-1, attaches on pods 2 and 3, asserts both see the buffered prefix of a slowed-down run. Catches the bug regardless of which pod DBOS picks for dispatch. - `pod-death-dbos-replay` (⚠️ skipped, fully wired) — would validate that SIGKILLing the run-owning pod still completes the run via DBOS replay. Surfaced a real architectural finding: DBOS replay on another pod fails `claimRunStart`'s strict CAS because the dead pod still owns `run_owner_pod`; the heartbeat watcher that DOES handle this (via `claimOrphanedRun`) never fires because the `KV_POD_HEARTBEATS` JetStream bucket is missing — bucket creation appears to fail silently at app.ts:867. Full repro + diagnostic notes in the test's docstring; remove `.skip` to verify once one of those is fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(multi-pod): address review fixes + revise pod-death finding Three review-flagged issues: - **dbQuery docstring/spawn cleanup**: psql was invoked with both `-A -F "|"` and `--csv`. `--csv` wins, so output was actually CSV (parser already split on `,`); the docstring's "pipe separator" warning would have misled callers picking column lists. Drop the dead flags and rewrite the warning around comma-safety. - **attach-cross-pod header timing**: header claimed "300ms × 5 chunks ≈ 1.5s", actual hint is `slow:5x500` (≈ 2.5s). Sync the header. - **mcpCall structuredContent short-circuit**: `if (sc) return sc` returned on `{}`, skipping the text-content fallback that some MCP tools rely on. Tighten to require a non-empty object so callers get the real payload (or a deliberate empty echo at the end of the chain) instead of a phantom `{}`. Also revises the pod-death-dbos-replay docstring with the corrected architectural diagnosis: original write-up blamed the heartbeat bucket in isolation; deeper investigation shows three compounding gaps (no cross-pod DBOS recovery scan because `executor_id="local"` for all pods, unconditional `streamBuffer.purge()` on resume wiping survivor buffers, and the heartbeat bucket reliability issue). Test stays `.skip`ped pending the architectural fix. Pulls in two helpers the pod-death scenario was already using: - `lib/db.ts` for direct postgres inspection of `run_owner_pod` - `lib/hooks.ts` auto-restores any pod left stopped by a previous scenario, so kill-style tests don't poison subsequent runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(multi-pod): replace racy sleep with JetStream-confirmation probe The 500ms sleep between POST and the cross-pod attach was right at the edge of the dispatch chain's typical latency. If DBOS dequeue + prepareRun + streamText init completed in under 500ms (the mock's per-chunk delay), the test's attaches would open BEFORE chunk-1 was published, and the `deliverPolicy: "new"` bug path would still deliver chunk-1 live — the regression we claim to catch wouldn't manifest. Fix: open a throwaway /attach on pod-1 first as a synchronization probe. Only proceed to opening the real test watchers once the probe has actually seen chunk-1 in its stream — at that point chunk-1 is provably already in JetStream, so receiving it from a cross-pod attach requires the `deliverPolicy: "all"` fix to work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(multi-pod): fail loudly when docker compose ps errors in hook Before, a failing `docker compose ps -a` (daemon down, compose file moved, permission issue) silently produced empty output. The hook then treated it as "no pods need restoring" and handed control to waitReady, which would time out 2 minutes later with a misleading "mesh-1 not healthy" — turning the real diagnosis into a needle in a haystack. Now we read stderr too and surface the actual compose error on a non-zero exit code, so the developer sees the root cause immediately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What is this contribution about?
Pre-PR-#3376,
POST /messagesdispatched runs directly, so a pod death mid-run left orphan threads that only the next/attachcould resurrect — hence the orphan-resume code path in/attach(routes.ts:442-537). Now every user message lives inside athreadGateWorkflowDBOS step, and the recovery executor replays it on a healthy pod withstreamBufferwired in — chunks land back on the per-thread JetStream subject and the existing/attachtail picks them up. The heartbeat watcher inapp.tsstays as a backstop. This PR strips the now-vestigial client-triggered resume, removes the dead fire-and-forgetdispatchRunexport (only the orphan-resume called it), and dropsthreadStoragefromDecopilotDeps. Net -150 lines; no behavior change for the happy path.How to Test
bun run dev, open a thread, send a message, confirm the stream renders normally and reconnecting via the page-refresh/attachtail works.bun test apps/mesh/src/api/routes/decopilot/— 350/350 pass./attachcontinues to receive chunks (DBOS replays the workflow step on another pod, which re-pumps into the same JetStream subject).Migration Notes
None.
Review Checklist
Summary by cubic
Removed the client-triggered orphan-resume from
/attachand collapsed the endpoint to a pure JetStream tail. All runs now go through the per-thread DBOSthreadGateWorkflowviadispatchRunAndWait, which pumps to JetStream whenstreamBufferis present.Refactors
/attachorphan-resume; endpoint now only tails the per-thread JetStream subject and serves SSE.dispatchRun; all producers usedispatchRunAndWait(drains and pumps when configured)./attachtails pick them up.threadStoragefromDecopilotDepsand cleaned up docs/comments to referencedispatchRunAndWait.Bug Fixes
/attachnow setsdeliverPolicyfrom DBthread.status(cluster-wide) instead of pod-localisRunning, preventing missed chunks in multi-pod setups.dispatchRunreference inprepareRunerror messaging todispatchRunAndWait.Written for commit 0cf93de. Summary will update on new commits. Review in cubic