feat(decopilot): per-thread DBOS gate for user messages #3376

Merged
viktormarinho merged 5 commits into main from viktormarinho/phase-2-thread-gate-and-cutover on May 15, 2026


@viktormarinho viktormarinho commented May 15, 2026

Summary

Adds a per-thread DBOS gate so agent runs on the same thread execute one at a time, while runs on different threads progress in parallel. Both user messages and automation fires now funnel through the same gate.

POST /:org/decopilot/threads/:threadId/messages enqueues onto a partitioned DBOS workflow (threadGateWorkflow, partition key = threadId, concurrency = 1) and returns 202 { taskId } in milliseconds. A second POST while a run is in flight queues behind it and dispatches only after the active run completes. Streaming continues through GET /:org/decopilot/attach/:threadId.
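
The ordering guarantee described above can be illustrated with a small in-memory analogue. This is only a sketch: `PerKeyGate`, `sleep`, and `demo` are hypothetical names, and the real gate is a partitioned DBOS queue whose state is durable, which this promise chain is not.

```typescript
// In-memory analogue of a partitioned queue with concurrency = 1 per key.
// NOT the DBOS implementation; it only illustrates the ordering guarantee.
class PerKeyGate {
  // Tail of the promise chain for each partition key (here: threadId).
  private tails = new Map<string, Promise<void>>();

  // Schedules `run` behind any earlier run with the same key.
  enqueue<T>(key: string, run: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    // Start only after the previous run on this key has settled.
    const result = previous.then(() => run());
    // The tail never rejects, so one failed run cannot block later runs.
    const tail: Promise<void> = result.then(
      () => {},
      () => {},
    );
    this.tails.set(key, tail);
    // Drop the entry once the chain is idle so the map stays small.
    tail.then(() => {
      if (this.tails.get(key) === tail) this.tails.delete(key);
    });
    return result;
  }
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Two runs on thread-a serialize; the run on thread-b proceeds in parallel.
async function demo(): Promise<string[]> {
  const gate = new PerKeyGate();
  const log: string[] = [];
  const run = (label: string, ms: number) => async () => {
    log.push(`start ${label}`);
    await sleep(ms);
    log.push(`end ${label}`);
  };
  await Promise.all([
    gate.enqueue("thread-a", run("a1", 20)),
    gate.enqueue("thread-a", run("a2", 5)),
    gate.enqueue("thread-b", run("b1", 5)),
  ]);
  return log;
}
```

In `demo`, `a2` starts only after `a1` has ended, while `b1` starts without waiting for either.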

Changes

apps/mesh/src/dispatch-queue/ (new)

  • threadGateWorkflow — single DBOS workflow that owns the per-thread serialization. Body is two steps:
    1. trackMessageStarted — emits chat_message_started PostHog event (skipped for automation sources). Wrapped in a step so idempotent retries that collapse onto an existing workflow ID don't double-count.
    2. dispatchRunAndWait — invokes the configured dispatch fn; constructs its own AbortController (since AbortSignal isn't serializable across workflow boundaries). If the step throws, trackMessageFailed emits with error_category: "setup" to balance the started event.
  • enqueueThreadRun(ctx, { workflowID? }) — fire-and-forget. Used by the user-message route.
  • awaitThreadRun(ctx, { workflowID? }) — enqueue + getResult(). Used by automation fires that need to block on the inner run's outcome.
  • Module-level setThreadGateRuntime() registry wires dispatchRunFn, meshContextFactory, and dispatch deps before DBOS.launch().
  • THREAD_GATE_QUEUE + THREAD_GATE_PARTITION_CONCURRENCY constants; queue registered in createApp.
  • Abort timer is opt-in: automations pass an explicit 5 min cap so a runaway cron can't pin a thread slot; user messages leave it unset because tool-using loops (Claude Code, deep research) routinely outlast any fixed cap and weren't bounded by the legacy fire-and-forget path.
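
A minimal sketch of the opt-in abort timer, assuming a hypothetical `timeoutMs` option and helper name `runWithOptionalAbort` (neither is taken from the PR). The controller is created inside the helper because a signal cannot be carried in serialized workflow input.

```typescript
interface RunOptions {
  // Hypothetical option name; leave it unset to run without any cap.
  timeoutMs?: number;
}

async function runWithOptionalAbort<T>(
  run: (signal: AbortSignal) => Promise<T>,
  options: RunOptions = {},
): Promise<T> {
  // Build the controller locally instead of receiving a signal from the caller.
  const controller = new AbortController();
  // Install a timer only when the caller asked for a cap.
  const timer =
    options.timeoutMs === undefined
      ? undefined
      : setTimeout(() => controller.abort(), options.timeoutMs);
  try {
    return await run(controller.signal);
  } finally {
    // Clear the timer so a finished run does not leave it pending.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Under this shape an automation caller would pass something like `{ timeoutMs: 5 * 60 * 1000 }`, and the user-message path would pass nothing.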

apps/mesh/src/api/routes/decopilot/routes.ts

  • POST /messages replaces the inline dispatchAndTrack call with enqueueThreadRun. Returns 202 { taskId }.
  • Idempotency: prefers X-Idempotency-Key header, falls back to the last message's id. Combined as workflowID = thread-run:<threadId>:<key>. Without either, retries get a fresh workflow ID (at-least-once).
  • dispatchAndTrack deleted; PostHog chat_message_started is now emitted from the workflow step.
  • Orphan-resume in /attach continues to call dispatchRun directly — going through the gate would deadlock against itself.

apps/mesh/src/automations/dbos-workflow.ts

  • fireAutomationWorkflowFn hands off to the thread gate via awaitThreadRun instead of calling dispatchRunFn directly. Existing per-automation (concurrency=3) and global (concurrency=5) gates remain layered above; the per-thread gate adds the third layer.
  • Split the dispatch into two phases so DBOS rules are respected:
    • buildDispatchRequest step — membership pre-check + buildStreamRequest, journaled so replays reuse the same crypto.randomUUID() message ids.
    • awaitThreadRun from the workflow body — DBOS forbids invoking a workflow from inside a step, so this lives in the workflow context. Failures are caught at the workflow level to preserve the FireAutomationOutcome {taskId, error} shape.
  • markRunFailed extracted into its own step so the side effect is recorded on the journal.
  • AutomationRuntime loses dispatchRunFn and deps (now owned by the thread-gate runtime).
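
The reason for journaling `buildDispatchRequest` can be shown with a toy journal. This is a sketch under stated assumptions: `Journal`, `nextId`, and `fireWorkflow` are hypothetical stand-ins, not the DBOS API, and `nextId` takes the place of `crypto.randomUUID()`.

```typescript
// Minimal stand-in for a durable step journal: results are recorded by step
// name and returned as-is on replay instead of re-running the step body.
class Journal {
  private recorded = new Map<string, unknown>();

  async runStep<T>(name: string, body: () => Promise<T>): Promise<T> {
    if (this.recorded.has(name)) return this.recorded.get(name) as T;
    const value = await body();
    this.recorded.set(name, value);
    return value;
  }
}

let counter = 0;
// Stand-in for crypto.randomUUID(): a different value on every call.
const nextId = () => `msg-${++counter}`;

// Hypothetical workflow body: ID generation lives inside a journaled step,
// while the hand-off to the inner run happens outside of any step.
async function fireWorkflow(
  journal: Journal,
  awaitRun: (messageId: string) => Promise<string>,
): Promise<string> {
  const request = await journal.runStep("buildDispatchRequest", async () => ({
    messageId: nextId(),
  }));
  return awaitRun(request.messageId);
}
```

Calling `fireWorkflow` twice with the same `Journal` models a replay: the second call reuses the recorded message id rather than generating a new one.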

apps/mesh/src/api/app.ts

  • Imports setThreadGateRuntime, THREAD_GATE_QUEUE, THREAD_GATE_PARTITION_CONCURRENCY from @/dispatch-queue.
  • Calls setThreadGateRuntime({ dispatchRunFn: dispatchRunAndWait, meshContextFactory, deps }) before DBOS.launch().
  • Registers the thread-gate queue with partitionQueue: true, concurrency: 1.

apps/mesh/src/automations/fire.ts

  • Removes the now-unused DispatchRunFn type and DispatchRunInput/DispatchRunDeps imports.

Telemetry semantics

  • chat_message_started — fires from inside the workflow body (post queue-admission), once per workflow. Idempotent retries collapse onto the same workflowID and do not double-count.
  • chat_message_failed — emitted from one of two places per failed run, never both:
    • Setup errors thrown by prepareRun before streamText starts → workflow's trackMessageFailed step with error_category: "setup".
    • Mid-stream errors → createUIMessageStream.onError inside dispatchRun with classifyStreamError(error) category.
    • Automation fires are excluded from message-send analytics.
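
The pairing of started and failed events can be sketched as follows. `gateBody`, `emit`, and `dispatch` are hypothetical stand-ins for the workflow body and its steps. The sketch assumes that mid-stream errors are reported inside `dispatch` and do not propagate, and that the workflow rethrows after recording a setup failure. The PR description confirms neither point.

```typescript
type Source = "user-message" | "automation";
type TelemetryEvent = { name: string; error_category?: string };

// Hypothetical workflow body; `emit` and `dispatch` stand in for the real steps.
async function gateBody(
  source: Source,
  emit: (event: TelemetryEvent) => void,
  dispatch: () => Promise<void>,
): Promise<void> {
  // Automation fires reuse the gate but are not counted as message sends.
  const tracked = source !== "automation";
  if (tracked) emit({ name: "chat_message_started" });
  try {
    await dispatch();
  } catch (error) {
    // Setup failure: balance the started event, then surface the error
    // (rethrowing is an assumption of this sketch).
    if (tracked) emit({ name: "chat_message_failed", error_category: "setup" });
    throw error;
  }
}
```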

Idempotency contract

  • Header: X-Idempotency-Key: <stable-key> (recommended for retrying clients).
  • Fallback: the last (user) message's id. Most chat clients already send a UUID per message.
  • Without either, retries are at-least-once. Clients that need exactly-once must send one of the two.
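  • Format: thread-run:<threadId>:<key>. A redelivered POST with the same key still gets 202, but inside DBOS it collapses onto the existing workflow (getResult()-equivalent semantics) instead of starting a second run.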
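
The fallback chain above can be sketched as a small pure function. `resolveWorkflowId` is a hypothetical helper name; only the `thread-run:<threadId>:<key>` format and the header-then-message-id order come from this PR. The sketch trims the header first so that an empty or whitespace-only value counts as missing.

```typescript
// Hypothetical helper; returns undefined when no stable key is available.
function resolveWorkflowId(
  threadId: string,
  idempotencyHeader: string | null | undefined,
  lastMessageId: string | undefined,
): string | undefined {
  // Trim first: `??` alone would keep "" and never reach the fallback.
  const headerKey = idempotencyHeader?.trim() || undefined;
  const key = headerKey ?? lastMessageId;
  // undefined means the caller gets a fresh workflow ID (at-least-once).
  return key ? `thread-run:${threadId}:${key}` : undefined;
}
```

For example, `resolveWorkflowId("t1", "   ", "m1")` falls back to the message id and yields `thread-run:t1:m1`.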

What's NOT in this PR

  • Inbox UI + GET/DELETE /threads/:id/queue endpoints — follow-up.
  • Orphan-resume continues to bypass the gate (recovery path).

Bug fixes included on the branch

  • DBOS step/workflow invariant (commit ff8914bb0): the initial automation routing crashed every fire with Invalid call to a 'workflow' function from within a 'step' or 'transaction' because awaitThreadRun was called from inside DBOS.runStep. Fixed by moving awaitThreadRun to the workflow body and splitting the prep work into a journaled step.

Test plan

  • bun run --cwd=apps/mesh check clean (TypeScript).
  • bun run fmt clean.
  • thread-gate-workflow.test.ts covers queue plumbing (constant exports + setThreadGateRuntime runtime shape).
  • Manual: send a user message; verify 202 { taskId } and stream attaches via /attach.
  • Manual: send two user messages back-to-back on the same thread; verify they serialize.
  • Manual: trigger an automation fire; verify it dispatches through the thread gate without the DBOS step/workflow error.
  • Manual: retry a POST with the same X-Idempotency-Key; verify no duplicate run / no duplicate chat_message_started.

🤖 Generated with Claude Code

@github-actions

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

| Reaction | Action |
| --- | --- |
| 👍 | Run quick benchmark (10 & 128 tools) |

Benchmark will run on the next push after you react.

github-actions Bot commented May 15, 2026

Release Options

Suggested: Minor (2.329.0) — based on feat: prefix

React with an emoji to override the release type:

| Reaction | Type | Next Version |
| --- | --- | --- |
| 👍 | Prerelease | 2.328.1-alpha.1 |
| 🎉 | Patch | 2.328.1 |
| ❤️ | Minor | 2.329.0 |
| 🚀 | Major | 3.0.0 |

Current version: 2.328.0

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: patch).

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 5 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/dispatch-queue/thread-gate-workflow.ts">

<violation number="1" location="apps/mesh/src/dispatch-queue/thread-gate-workflow.ts:127">
P1: Do not swallow dispatch exceptions in the workflow step; rethrow so failed runs are recorded as workflow failures (and can follow normal retry/failure handling) instead of succeeding with an ignored `{ error }` payload.</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/routes.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/routes.ts:328">
P2: `chat_message_started` is emitted unconditionally after enqueue, so idempotent POST retries can be double-counted even when they collapse to an existing workflow.</violation>
</file>


Comment thread apps/mesh/src/dispatch-queue/thread-gate-workflow.ts Outdated
Comment thread apps/mesh/src/api/routes/decopilot/routes.ts Outdated
@viktormarinho viktormarinho force-pushed the viktormarinho/phase-2-thread-gate-and-cutover branch from f2f6d16 to 96b817a Compare May 15, 2026 14:52
Base automatically changed from viktormarinho/phase-1-dispatch-refactor to main May 15, 2026 15:12
@viktormarinho viktormarinho force-pushed the viktormarinho/phase-2-thread-gate-and-cutover branch 2 times, most recently from 17fd03f to 7b0d010 Compare May 15, 2026 15:42
Adds `threadGateWorkflow` — a DBOS workflow with partitioned concurrency=1
per threadId — and cuts `POST /messages` over to enqueue on it instead of
calling `dispatchRun` directly. Holding the partition slot until
`dispatchRunAndWait` returns is what serializes messages on the same
thread: a second POST while a run is in-flight queues behind it and
dispatches once the active run finishes.

Idempotency: when the client supplies a request message id, the DBOS
workflow ID is derived from `<threadId>:<messageId>`, so a retried POST
collapses onto the existing handle.

Other paths unchanged:
- Orphan-resume in `/attach` still calls `dispatchRun` directly (recovery
  path; can't go through the queue because the run is already in flight).
- Automations still go through `fireAutomationWorkflow` and its existing
  per-automation / global gates (rerouting them through `threadGateWorkflow`
  is a follow-up PR).

Inbox UI and cancel endpoints land in a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@viktormarinho viktormarinho force-pushed the viktormarinho/phase-2-thread-gate-and-cutover branch from 7b0d010 to 32139cb Compare May 15, 2026 16:36
viktormarinho and others added 2 commits May 15, 2026 14:43
Phase 5 of the dispatch-queue unification. Automation fires now hand off
to `threadGateWorkflow` via `awaitThreadRun` instead of calling
`dispatchRunAndWait` directly. Practical effects:

- A user-message run and an automation run on the same thread now
  serialize through the same per-thread gate (concurrency=1). Previously
  they could race.
- Automation chunks still publish to the per-thread JetStream subject
  (already true after Phase 1), so `/attach` keeps surfacing them live.
- The per-automation (concurrency=3) and global (concurrency=5) gates
  remain — they layer above the new per-thread gate, not replaced.

API:
- New `awaitThreadRun(ctx, opts)` helper alongside `enqueueThreadRun`.
  Returns the workflow result; used by callers that hold an outer queue
  slot and need the outcome to advance (the automation fire step).
- `ThreadGateContext.source: "user-message" | "automation"` so the
  workflow body suppresses `chat_message_started` for automation fires —
  they reuse the gate but don't count as user message sends.

Cleanups:
- `AutomationRuntime.dispatchRunFn`, `.deps`, and the standalone
  `DispatchRunFn` type are gone — the thread-gate runtime owns dispatch
  now, automations only need `storage` + `meshContextFactory`.
- The automation workflow body still resolves with `{taskId, error}` on
  failure (preserves the `FireAutomationOutcome` contract callers rely
  on). The inner dispatch step *throws* so DBOS records step-level
  failure for observability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cept idempotency header

Three review findings on the per-thread gate:

1. `chat_message_started` was emitted before `dispatchRunAndWaitStep`, so
   setup failures from `prepareRun` (model-permission, agent-not-found,
   thread-ownership, async-research model-slot guard) produced orphan
   started events — `streamText.onError` only covers in-flight errors,
   not the pre-stream gap. Workflow now wraps the dispatch step in
   try/catch and emits `chat_message_failed` (in its own DBOS step, so
   replay-safe) when the step throws. `error_category: "setup"` keeps it
   distinguishable from runtime stream errors.

2. The 5-minute hard timeout was lifted from the automations gate and
   regressed user-chat runs (Claude Code, deep research, multi-step tool
   loops routinely exceed 5 min; the legacy fire-and-forget HTTP path had
   no timeout at all). Timeout is now opt-in: automations still pass an
   explicit cap (so cron runs are bounded), user messages leave it unset
   and no abort timer is installed.

3. Idempotency only worked when the client happened to send a message id,
   which is optional in the schema. Route now also accepts an explicit
   `X-Idempotency-Key` header (preferred over message id); docstring
   spells out the fallback chain and notes at-least-once semantics when
   neither is supplied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/api/routes/decopilot/routes.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/routes.ts:303">
P2: Empty `X-Idempotency-Key` values block fallback to message id, which can silently disable idempotency on retries.</violation>
</file>


Comment thread apps/mesh/src/api/routes/decopilot/routes.ts Outdated
viktormarinho and others added 2 commits May 15, 2026 15:30
DBOS forbids invoking workflows from inside a step, so the previous
dispatchRunAndWaitStep crashed with "Invalid call to a workflow function
from within a step or transaction" on every fire. Split into a
buildDispatchRequest step (membership pre-check + request assembly,
journaled) and call awaitThreadRun directly from fireAutomationWorkflowFn.
markRunFailed moves into its own step so the side effect is recorded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`??` only falls back on null/undefined, so a client sending the header
with an empty or whitespace-only value kept it as "", failed the truthy
check below, and silently dropped to at-least-once semantics. Trim the
header and treat an empty result as missing so the fallback to the last
message's id still applies.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@viktormarinho viktormarinho merged commit 4c00970 into main May 15, 2026
16 checks passed
@viktormarinho viktormarinho deleted the viktormarinho/phase-2-thread-gate-and-cutover branch May 15, 2026 19:25
viktormarinho added a commit that referenced this pull request May 17, 2026
#3387)

* refactor(decopilot): drop /attach orphan-resume; collapse to pure tail

Pre-PR-#3376, POST /messages dispatched runs directly, so a pod death
mid-run left orphan threads that only the next /attach could resurrect.
Now every user message lives inside a thread-gate DBOS workflow step,
and the recovery executor replays it on a healthy pod with the
streamBuffer wired in — chunks land back on the per-thread JetStream
subject and the existing /attach tail picks them up. The heartbeat
watcher in app.ts remains as a backstop. Also removes the now-dead
fire-and-forget dispatchRun export and threadStorage from DecopilotDeps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(decopilot): fix stale dispatchRun reference in prepareRun error

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(decopilot): use DB thread status for /attach deliverPolicy

isRunning() is pod-local; a client attached to a non-owner pod (multi-pod
deployment, mid-deploy, or post-DBOS-replay rehome) would silently miss
chunks the owner had already pumped to the shared JetStream subject.
thread.status is set synchronously by run-reactor's claimRunStart, so
it's a cluster-wide signal. The buffer purges on terminal events, so
"all" only ever replays the current in-flight run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(decopilot): fix stale doc references missed by rename

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>