Skip to content

refactor(decopilot): drop /attach orphan-resume; collapse to pure tail#3387

Merged
viktormarinho merged 4 commits into
mainfrom
viktormarinho/remove-orphan-resume
May 17, 2026
Merged

refactor(decopilot): drop /attach orphan-resume; collapse to pure tail#3387
viktormarinho merged 4 commits into
mainfrom
viktormarinho/remove-orphan-resume

Conversation

@viktormarinho
Copy link
Copy Markdown
Contributor

@viktormarinho viktormarinho commented May 15, 2026

What is this contribution about?

Pre-PR-#3376, POST /messages dispatched runs directly, so a pod death mid-run left orphan threads that only the next /attach could resurrect — hence the orphan-resume code path in /attach (routes.ts:442-537). Now every user message lives inside a threadGateWorkflow DBOS step, and the recovery executor replays it on a healthy pod with streamBuffer wired in — chunks land back on the per-thread JetStream subject and the existing /attach tail picks them up. The heartbeat watcher in app.ts stays as a backstop. This PR strips the now-vestigial client-triggered resume, removes the dead fire-and-forget dispatchRun export (only the orphan-resume called it), and drops threadStorage from DecopilotDeps. Net -150 lines; no behavior change for the happy path.

How to Test

  1. bun run dev, open a thread, send a message, confirm the stream renders normally and reconnecting via the page-refresh /attach tail works.
  2. Run bun test apps/mesh/src/api/routes/decopilot/ — 350/350 pass.
  3. Pod-death scenario (optional, requires multi-pod setup against shared Postgres + NATS): kill the pod owning an in-flight run mid-stream and confirm a client tailing /attach continues to receive chunks (DBOS replays the workflow step on another pod, which re-pumps into the same JetStream subject).

Migration Notes

None.

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working
  • Documentation is updated (if needed)
  • No breaking changes

Summary by cubic

Removed the client-triggered orphan-resume from /attach and collapsed the endpoint to a pure JetStream tail. All runs now go through the per-thread DBOS threadGateWorkflow via dispatchRunAndWait, which pumps to JetStream when streamBuffer is present.

  • Refactors

    • Dropped /attach orphan-resume; endpoint now only tails the per-thread JetStream subject and serves SSE.
    • Removed fire-and-forget dispatchRun; all producers use dispatchRunAndWait (drains and pumps when configured).
    • Recovery relies on DBOS replay and the heartbeat watcher; chunks return to JetStream and existing /attach tails pick them up.
    • Removed threadStorage from DecopilotDeps and cleaned up docs/comments to reference dispatchRunAndWait.
  • Bug Fixes

    • /attach now sets deliverPolicy from DB thread.status (cluster-wide) instead of pod-local isRunning, preventing missed chunks in multi-pod setups.
    • Corrected a stale dispatchRun reference in prepareRun error messaging to dispatchRunAndWait.

Written for commit 0cf93de. Summary will update on new commits. Review in cubic

Pre-PR-#3376, POST /messages dispatched runs directly, so a pod death
mid-run left orphan threads that only the next /attach could resurrect.
Now every user message lives inside a thread-gate DBOS workflow step,
and the recovery executor replays it on a healthy pod with the
streamBuffer wired in — chunks land back on the per-thread JetStream
subject and the existing /attach tail picks them up. The heartbeat
watcher in app.ts remains as a backstop. Also removes the now-dead
fire-and-forget dispatchRun export and threadStorage from DecopilotDeps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction Action
👍 Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

Release Options

Suggested: Patch (2.330.2) — based on refactor: prefix

React with an emoji to override the release type:

Reaction Type Next Version
👍 Prerelease 2.330.2-alpha.1
🎉 Patch 2.330.2
❤️ Minor 2.331.0
🚀 Major 3.0.0

Current version: 2.330.1

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: patch).

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/api/routes/decopilot/routes.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/routes.ts:403">
P2: `/attach` chooses `deliverPolicy` from pod-local `runRegistry`, so attaches handled by a different pod can miss buffered chunks of an in-flight run. Use thread state from `validateThreadAccess` (global DB view) when deciding `"all"` vs `"new"`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic

Comment thread apps/mesh/src/api/routes/decopilot/routes.ts Outdated
viktormarinho and others added 3 commits May 15, 2026 17:10
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
isRunning() is pod-local; a client attached to a non-owner pod (multi-pod
deployment, mid-deploy, or post-DBOS-replay rehome) would silently miss
chunks the owner had already pumped to the shared JetStream subject.
thread.status is set synchronously by run-reactor's claimRunStart, so
it's a cluster-wide signal. The buffer purges on terminal events, so
"all" only ever replays the current in-flight run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@viktormarinho viktormarinho merged commit 2b62743 into main May 17, 2026
16 checks passed
@viktormarinho viktormarinho deleted the viktormarinho/remove-orphan-resume branch May 17, 2026 18:00
viktormarinho added a commit that referenced this pull request May 17, 2026
)

* test(multi-pod): cross-pod /attach + mock-ai provider scaffolding

Builds out the LLM-dependent half of the multi-pod framework so we can
exercise the decopilot dispatch pipeline end-to-end without burning real
provider budget.

**Mock-ai service** (tests/multi-pod/mock-ai/) is a ~140-LOC Bun HTTP
server that speaks the OpenAI chat-completions wire protocol — enough
for mesh's `openai-compatible` adapter to drive `streamText`. Test-time
controls (chunk count + delay) come from the user message text because
mesh strips request headers before calling the provider; the mock
parses "slow:NxMS" / "many:N" hints out of the prompt.

**Setup helpers** (lib/setup.ts) gain `wireMockProvider`,
`createTestAgent`, `createTestThread` for the three calls every
dispatch scenario needs: register an openai-compatible credential
pointing at mock-ai, pin the org's "smart" tier to it, create a
virtual MCP, pre-create the thread row (without which /attach 404s
before the workflow's prepareRun gets to insert it).

**DB helper** (lib/db.ts) shells out to `docker compose exec postgres
psql` for the rare cases where a scenario needs internal state not
surfaced by the public API — currently used to look up the actual
dispatch owner from `threads.run_owner_pod` for targeted-kill
scenarios.

**POD_NAME wired into compose** so `run_owner_pod` matches the compose
service name ("mesh-1") and tests can map straight from a thread row
to a `docker compose kill` target.

**Scenarios:**

- `attach-cross-pod` (✅ passing) — the headline test that directly
  validates the deliverPolicy fix from PR #3387. POSTs on pod-1,
  attaches on pods 2 and 3, asserts both see the buffered prefix of a
  slowed-down run. Catches the bug regardless of which pod DBOS picks
  for dispatch.

- `pod-death-dbos-replay` (⚠️ skipped, fully wired) — would validate
  that SIGKILLing the run-owning pod still completes the run via DBOS
  replay. Surfaced a real architectural finding: DBOS replay on
  another pod fails `claimRunStart`'s strict CAS because the dead pod
  still owns `run_owner_pod`; the heartbeat watcher that DOES handle
  this (via `claimOrphanedRun`) never fires because the
  `KV_POD_HEARTBEATS` JetStream bucket is missing — bucket creation
  appears to fail silently at app.ts:867. Full repro + diagnostic
  notes in the test's docstring; remove `.skip` to verify once one of
  those is fixed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(multi-pod): address review fixes + revise pod-death finding

Three review-flagged issues:

- **dbQuery docstring/spawn cleanup**: psql was invoked with both
  `-A -F "|"` and `--csv`. `--csv` wins, so output was actually CSV
  (parser already split on `,`); the docstring's "pipe separator" warning
  would have misled callers picking column lists. Drop the dead flags and
  rewrite the warning around comma-safety.

- **attach-cross-pod header timing**: header claimed "300ms × 5 chunks ≈
  1.5s", actual hint is `slow:5x500` (≈ 2.5s). Sync the header.

- **mcpCall structuredContent short-circuit**: `if (sc) return sc` returned
  on `{}`, skipping the text-content fallback that some MCP tools rely
  on. Tighten to require a non-empty object so callers get the real
  payload (or a deliberate empty echo at the end of the chain) instead
  of a phantom `{}`.

Also revises the pod-death-dbos-replay docstring with the corrected
architectural diagnosis: original write-up blamed the heartbeat bucket
in isolation; deeper investigation shows three compounding gaps (no
cross-pod DBOS recovery scan because `executor_id="local"` for all
pods, unconditional `streamBuffer.purge()` on resume wiping survivor
buffers, and the heartbeat bucket reliability issue). Test stays
`.skip`ped pending the architectural fix.

Pulls in two helpers the pod-death scenario was already using:
- `lib/db.ts` for direct postgres inspection of `run_owner_pod`
- `lib/hooks.ts` auto-restores any pod left stopped by a previous
  scenario, so kill-style tests don't poison subsequent runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(multi-pod): replace racy sleep with JetStream-confirmation probe

The 500ms sleep between POST and the cross-pod attach was right at the
edge of the dispatch chain's typical latency. If DBOS dequeue + prepareRun
+ streamText init completed in under 500ms (the mock's per-chunk delay),
the test's attaches would open BEFORE chunk-1 was published, and the
`deliverPolicy: "new"` bug path would still deliver chunk-1 live — the
regression we claim to catch wouldn't manifest.

Fix: open a throwaway /attach on pod-1 first as a synchronization probe.
Only proceed to opening the real test watchers once the probe has actually
seen chunk-1 in its stream — at that point chunk-1 is provably already in
JetStream, so receiving it from a cross-pod attach requires the
`deliverPolicy: "all"` fix to work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(multi-pod): fail loudly when docker compose ps errors in hook

Before, a failing `docker compose ps -a` (daemon down, compose file
moved, permission issue) silently produced empty output. The hook then
treated it as "no pods need restoring" and handed control to waitReady,
which would time out 2 minutes later with a misleading "mesh-1 not
healthy" — turning the real diagnosis into a needle in a haystack.

Now we read stderr too and surface the actual compose error on a
non-zero exit code, so the developer sees the root cause immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant