refactor(decopilot): drop /attach orphan-resume; collapse to pure tail by viktormarinho · Pull Request #3387 · decocms/studio

viktormarinho · 2026-05-15T20:05:57Z

What is this contribution about?

Pre-PR-#3376, POST /messages dispatched runs directly, so a pod death mid-run left orphan threads that only the next /attach could resurrect — hence the orphan-resume code path in /attach (routes.ts:442-537). Now every user message lives inside a threadGateWorkflow DBOS step, and the recovery executor replays it on a healthy pod with streamBuffer wired in — chunks land back on the per-thread JetStream subject and the existing /attach tail picks them up. The heartbeat watcher in app.ts stays as a backstop. This PR strips the now-vestigial client-triggered resume, removes the dead fire-and-forget dispatchRun export (only the orphan-resume called it), and drops threadStorage from DecopilotDeps. Net -150 lines; no behavior change for the happy path.

How to Test

bun run dev, open a thread, send a message, confirm the stream renders normally and reconnecting via the page-refresh /attach tail works.
Run bun test apps/mesh/src/api/routes/decopilot/ — 350/350 pass.
Pod-death scenario (optional, requires multi-pod setup against shared Postgres + NATS): kill the pod owning an in-flight run mid-stream and confirm a client tailing /attach continues to receive chunks (DBOS replays the workflow step on another pod, which re-pumps into the same JetStream subject).

Migration Notes

None.

Review Checklist

PR title is clear and descriptive
Changes are tested and working
Documentation is updated (if needed)
No breaking changes

Summary by cubic

Removed the client-triggered orphan-resume from /attach and collapsed the endpoint to a pure JetStream tail. All runs now go through the per-thread DBOS threadGateWorkflow via dispatchRunAndWait, which pumps to JetStream when streamBuffer is present.

Refactors
- Dropped /attach orphan-resume; endpoint now only tails the per-thread JetStream subject and serves SSE.
- Removed fire-and-forget dispatchRun; all producers use dispatchRunAndWait (drains and pumps when configured).
- Recovery relies on DBOS replay and the heartbeat watcher; chunks return to JetStream and existing /attach tails pick them up.
- Removed threadStorage from DecopilotDeps and cleaned up docs/comments to reference dispatchRunAndWait.
Bug Fixes
- /attach now sets deliverPolicy from DB thread.status (cluster-wide) instead of pod-local isRunning, preventing missed chunks in multi-pod setups.
- Corrected a stale dispatchRun reference in prepareRun error messaging to dispatchRunAndWait.

^{Written for commit 0cf93de. Summary will update on new commits. Review in cubic}

Pre-PR-#3376, POST /messages dispatched runs directly, so a pod death mid-run left orphan threads that only the next /attach could resurrect. Now every user message lives inside a thread-gate DBOS workflow step, and the recovery executor replays it on a healthy pod with the streamBuffer wired in — chunks land back on the per-thread JetStream subject and the existing /attach tail picks them up. The heartbeat watcher in app.ts remains as a backstop. Also removes the now-dead fire-and-forget dispatchRun export and threadStorage from DecopilotDeps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-15T20:06:06Z

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction	Action
👍	Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.

github-actions · 2026-05-15T20:06:07Z

Release Options

Suggested: Patch (2.330.2) — based on refactor: prefix

React with an emoji to override the release type:

Reaction	Type	Next Version
👍	Prerelease	`2.330.2-alpha.1`
🎉	Patch	`2.330.2`
❤️	Minor	`2.331.0`
🚀	Major	`3.0.0`

Current version: 2.330.1

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: patch).

cubic-dev-ai

1 issue found across 7 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/api/routes/decopilot/routes.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/routes.ts:403">
P2: `/attach` chooses `deliverPolicy` from pod-local `runRegistry`, so attaches handled by a different pod can miss buffered chunks of an in-flight run. Use thread state from `validateThreadAccess` (global DB view) when deciding `"all"` vs `"new"`.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic}

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

isRunning() is pod-local; a client attached to a non-owner pod (multi-pod deployment, mid-deploy, or post-DBOS-replay rehome) would silently miss chunks the owner had already pumped to the shared JetStream subject. thread.status is set synchronously by run-reactor's claimRunStart, so it's a cluster-wide signal. The buffer purges on terminal events, so "all" only ever replays the current in-flight run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

) * test(multi-pod): cross-pod /attach + mock-ai provider scaffolding Builds out the LLM-dependent half of the multi-pod framework so we can exercise the decopilot dispatch pipeline end-to-end without burning real provider budget. **Mock-ai service** (tests/multi-pod/mock-ai/) is a ~140-LOC Bun HTTP server that speaks the OpenAI chat-completions wire protocol — enough for mesh's `openai-compatible` adapter to drive `streamText`. Test-time controls (chunk count + delay) come from the user message text because mesh strips request headers before calling the provider; the mock parses "slow:NxMS" / "many:N" hints out of the prompt. **Setup helpers** (lib/setup.ts) gain `wireMockProvider`, `createTestAgent`, `createTestThread` for the three calls every dispatch scenario needs: register an openai-compatible credential pointing at mock-ai, pin the org's "smart" tier to it, create a virtual MCP, pre-create the thread row (without which /attach 404s before the workflow's prepareRun gets to insert it). **DB helper** (lib/db.ts) shells out to `docker compose exec postgres psql` for the rare cases where a scenario needs internal state not surfaced by the public API — currently used to look up the actual dispatch owner from `threads.run_owner_pod` for targeted-kill scenarios. **POD_NAME wired into compose** so `run_owner_pod` matches the compose service name ("mesh-1") and tests can map straight from a thread row to a `docker compose kill` target. **Scenarios:** - `attach-cross-pod` (✅ passing) — the headline test that directly validates the deliverPolicy fix from PR #3387. POSTs on pod-1, attaches on pods 2 and 3, asserts both see the buffered prefix of a slowed-down run. Catches the bug regardless of which pod DBOS picks for dispatch. - `pod-death-dbos-replay` (⚠️ skipped, fully wired) — would validate that SIGKILLing the run-owning pod still completes the run via DBOS replay. Surfaced a real architectural finding: DBOS replay on another pod fails `claimRunStart`'s strict CAS because the dead pod still owns `run_owner_pod`; the heartbeat watcher that DOES handle this (via `claimOrphanedRun`) never fires because the `KV_POD_HEARTBEATS` JetStream bucket is missing — bucket creation appears to fail silently at app.ts:867. Full repro + diagnostic notes in the test's docstring; remove `.skip` to verify once one of those is fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(multi-pod): address review fixes + revise pod-death finding Three review-flagged issues: - **dbQuery docstring/spawn cleanup**: psql was invoked with both `-A -F "|"` and `--csv`. `--csv` wins, so output was actually CSV (parser already split on `,`); the docstring's "pipe separator" warning would have misled callers picking column lists. Drop the dead flags and rewrite the warning around comma-safety. - **attach-cross-pod header timing**: header claimed "300ms × 5 chunks ≈ 1.5s", actual hint is `slow:5x500` (≈ 2.5s). Sync the header. - **mcpCall structuredContent short-circuit**: `if (sc) return sc` returned on `{}`, skipping the text-content fallback that some MCP tools rely on. Tighten to require a non-empty object so callers get the real payload (or a deliberate empty echo at the end of the chain) instead of a phantom `{}`. Also revises the pod-death-dbos-replay docstring with the corrected architectural diagnosis: original write-up blamed the heartbeat bucket in isolation; deeper investigation shows three compounding gaps (no cross-pod DBOS recovery scan because `executor_id="local"` for all pods, unconditional `streamBuffer.purge()` on resume wiping survivor buffers, and the heartbeat bucket reliability issue). Test stays `.skip`ped pending the architectural fix. Pulls in two helpers the pod-death scenario was already using: - `lib/db.ts` for direct postgres inspection of `run_owner_pod` - `lib/hooks.ts` auto-restores any pod left stopped by a previous scenario, so kill-style tests don't poison subsequent runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(multi-pod): replace racy sleep with JetStream-confirmation probe The 500ms sleep between POST and the cross-pod attach was right at the edge of the dispatch chain's typical latency. If DBOS dequeue + prepareRun + streamText init completed in under 500ms (the mock's per-chunk delay), the test's attaches would open BEFORE chunk-1 was published, and the `deliverPolicy: "new"` bug path would still deliver chunk-1 live — the regression we claim to catch wouldn't manifest. Fix: open a throwaway /attach on pod-1 first as a synchronization probe. Only proceed to opening the real test watchers once the probe has actually seen chunk-1 in its stream — at that point chunk-1 is provably already in JetStream, so receiving it from a cross-pod attach requires the `deliverPolicy: "all"` fix to work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(multi-pod): fail loudly when docker compose ps errors in hook Before, a failing `docker compose ps -a` (daemon down, compose file moved, permission issue) silently produced empty output. The hook then treated it as "no pods need restoring" and handed control to waitReady, which would time out 2 minutes later with a misleading "mesh-1 not healthy" — turning the real diagnosis into a needle in a haystack. Now we read stderr too and surface the actual compose error on a non-zero exit code, so the developer sees the root cause immediately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai Bot reviewed May 15, 2026

View reviewed changes

Comment thread apps/mesh/src/api/routes/decopilot/routes.ts Outdated

viktormarinho and others added 3 commits May 15, 2026 17:10

chore(decopilot): fix stale dispatchRun reference in prepareRun error

eda5f0f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(decopilot): fix stale doc references missed by rename

0cf93de

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

viktormarinho merged commit 2b62743 into main May 17, 2026
16 checks passed

viktormarinho deleted the viktormarinho/remove-orphan-resume branch May 17, 2026 18:00

viktormarinho mentioned this pull request May 17, 2026

test(multi-pod): cross-pod /attach + mock-ai provider scaffolding #3392

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(decopilot): drop /attach orphan-resume; collapse to pure tail#3387

refactor(decopilot): drop /attach orphan-resume; collapse to pure tail#3387
viktormarinho merged 4 commits into
mainfrom
viktormarinho/remove-orphan-resume

viktormarinho commented May 15, 2026 •

edited by cubic-dev-ai Bot

Loading

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

viktormarinho commented May 15, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What is this contribution about?

How to Test

Migration Notes

Review Checklist

Summary by cubic

Uh oh!

github-actions Bot commented May 15, 2026

🧪 Benchmark

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Release Options

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

viktormarinho commented May 15, 2026 •

edited by cubic-dev-ai Bot

Loading

github-actions Bot commented May 15, 2026 •

edited

Loading