
feat(supervisor): wide events + warm-start trace propagation#9

Open
deepshekhardas wants to merge 347 commits into main from pr/3669-supervisor-events


@deepshekhardas deepshekhardas commented May 20, 2026

Adds a wide events system to the supervisor for better event propagation across worker restarts.

Changes

  • New `wideEvents` module with emit, middleware, record, state modules
  • Traceparent propagation for warm-start runs
  • Socket lifecycle management with noisy-routes flag
  • Workload server integration for wide events
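
As a rough illustration of the traceparent propagation item above, here is a minimal sketch that validates a W3C Trace Context `traceparent` value before handing it to a warm-started run. The header layout (`version-traceId-parentId-flags`) comes from the W3C spec. The function names and the `TRACEPARENT` key are hypothetical and are not taken from this PR.

```ts
// W3C Trace Context `traceparent`: version-traceId-parentId-flags, lowercase hex.
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  flags: string;
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | undefined {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return undefined;
  const [, version = "", traceId = "", parentId = "", flags = ""] = match;
  // The spec treats version "ff" and all-zero IDs as invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return undefined;
  return { version, traceId, parentId, flags };
}

// Hypothetical helper: forward a valid trace context to the next run, drop anything malformed.
function warmStartEnv(incoming: string | undefined): Record<string, string> {
  const parsed = incoming ? parseTraceparent(incoming) : undefined;
  if (!parsed) return {};
  return {
    TRACEPARENT: `${parsed.version}-${parsed.traceId}-${parsed.parentId}-${parsed.flags}`,
  };
}
```

Dropping malformed values instead of forwarding them is a design choice of this sketch; the PR may handle them differently.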

Files changed (17)

  • apps/supervisor/src/index.ts
  • apps/supervisor/src/wideEvents/* (7 new files)
  • apps/supervisor/src/workloadServer/index.ts

Test

  • Added wideEvents tests: emit.test.ts, middleware.test.ts, new.test.ts, record.test.ts, traceparent.test.ts

Closes triggerdotdev#3669


Summary by cubic

Adds wide-event instrumentation to supervisor and propagates traceparent on warm-start runs so traces stay connected across worker restarts. Improves visibility of workload routes and run socket lifecycle; feature is disabled by default.

  • New Features
    • New wide-events pipeline (emit, middleware, record, state) for cross-restart observability.
    • Warm-start trace context propagation for continuous end-to-end tracing.
    • Socket lifecycle tracking with a “noisy routes” toggle.
    • Workload server integration to emit wide events.
    • Gated behind TRIGGER_WIDE_EVENTS_ENABLED (off by default).
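
To make the "wide event" idea concrete: fields are accumulated while a request is handled and emitted as one structured event at the end. The sketch below assumes that shape. Only the `TRIGGER_WIDE_EVENTS_ENABLED` flag name comes from the summary above; `createRecorder`, `isWideEventsEnabled`, and the accepted flag values are hypothetical.

```ts
type WideEvent = Record<string, string | number | boolean>;

// Assumption: "true" or "1" enables the feature; anything else leaves it off.
function isWideEventsEnabled(env: Record<string, string | undefined>): boolean {
  const raw = env["TRIGGER_WIDE_EVENTS_ENABLED"];
  return raw === "true" || raw === "1";
}

// Hypothetical recorder: collect fields during one request, emit a single event at the end.
function createRecorder(enabled: boolean, sink: (event: WideEvent) => void) {
  let fields: WideEvent = {};
  return {
    set(key: string, value: string | number | boolean): void {
      if (enabled) fields[key] = value;
    },
    emit(): void {
      if (!enabled) return; // disabled: no event is produced at all
      sink(fields);
      fields = {};
    },
  };
}
```

A route handler would call `set` as it learns things (route, run ID, status code) and `emit` once when the response is sent.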

Written for commit 671b137. Summary will update on new commits.

ericallam and others added 30 commits March 2, 2026 12:42
Also includes a claude.md audit workflow for PRs
…tdev#3109)

- New User onboarding questions added and stored in a new
`onboardingData` col
- Keeps the same Org creation screen and stores the data in the same
format in the same DB column
- New Org onboarding questions added and stored in a new
`onboardingData` col


https://github.com/user-attachments/assets/244e4bae-f74d-4ed4-a545-92c9b927e98b

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…#3146)

Input streams enable sending typed data to executing tasks from external
callers — backends, frontends, or other tasks. This unlocks interactive
use cases like approval UIs, cancel buttons, chat interfaces, and
human-in-the-loop AI workflows where the task needs to receive data
while running.

Three consumption patterns inside a task:

* `.wait()` — Suspend the task until data arrives (process freed, most
efficient)
* `.once()` — Wait for the next message (process stays alive)
* `.on()` — Subscribe to a continuous stream of messages

One send pattern from outside:

* `.send(runId, data)` — Send typed data to a specific run's input
stream

## User-facing API

### Define a typed input stream

```ts
import { streams, task } from "@trigger.dev/sdk";

const approval = streams.input<{ approved: boolean; reviewer: string }>({ id: "approval" });
```

### Consume inside a task

```ts
export const myTask = task({
  id: "my-task",
  run: async () => {
    // Pattern 1: Suspend until data arrives (most efficient — frees the process)
    const result = await approval.wait({ timeout: "5m" });

    // Pattern 2: Wait for next message (process stays alive)
    const data = await approval.once().unwrap();

    // Pattern 3: Subscribe to multiple messages
    approval.on((data) => { /* handle each message */ });
  },
});
```

### Send from outside

```ts
// From a backend (using secret API key)
await approval.send(runId, { approved: true, reviewer: "alice" });

// From a frontend (using public JWT token from trigger response)
const { send } = useInputStreamSend("approval", runId, { accessToken });
send({ approved: true, reviewer: "alice" });
```

---------

Co-authored-by: Claude <noreply@anthropic.com>
…ggerdotdev#3095)

Adds a "Create custom dashboard" button to the top right of the metrics
dashboard

 
<img width="3546" height="1934" alt="CleanShot 2026-02-19 at 11 25
12@2x"
src="https://github.com/user-attachments/assets/0bb46ade-47c9-4396-b62a-f4801d7d90b4"
/>
…evel (triggerdotdev#3166)

The global rate limiter was being applied at the FairQueue claim phase,
consuming 1 token per queue-claim-attempt rather than per item
processed. With many small queues (each batch is its own queue),
consumers burned through tokens on empty or single-item queues, causing
aggressive throttling well below the intended items/sec limit.

Changes:
- Move rate limiter from FairQueue claim phase to BatchQueue worker
  queue consumer loop (before blockingPop), so each token = 1 item
  processed
- Replace the FairQueue rate limiter with a worker queue depth cap to
  prevent unbounded growth that could cause visibility timeouts
- Add BATCH_QUEUE_WORKER_QUEUE_MAX_DEPTH env var (optional, disabled by
  default)
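
The difference between the two placements can be shown with a toy model. This is a sketch under simplifying assumptions: a fixed token budget stands in for the real rate limiter, each queue is reduced to its item count, and all function names are hypothetical.

```ts
// Fixed budget standing in for a real token-bucket limiter (no refill).
function createBudget(tokens: number) {
  let remaining = tokens;
  return {
    tryTake(): boolean {
      if (remaining <= 0) return false;
      remaining--;
      return true;
    },
  };
}

// Old placement: one token per claim attempt, even when the queue turns out to be empty.
function processWithClaimLimit(queueSizes: number[], tokens: number): number {
  const budget = createBudget(tokens);
  let processed = 0;
  for (const size of queueSizes) {
    if (!budget.tryTake()) break;
    processed += size;
  }
  return processed;
}

// New placement: one token per item actually handed to a consumer.
function processWithItemLimit(queueSizes: number[], tokens: number): number {
  const budget = createBudget(tokens);
  let processed = 0;
  for (const size of queueSizes) {
    for (let i = 0; i < size; i++) {
      if (!budget.tryTake()) return processed;
      processed++;
    }
  }
  return processed;
}
```

With queue sizes `[0, 0, 1, 0, 3]` and a budget of 4, the claim-phase version spends its tokens on the first four queues and processes 1 item, while the per-item version processes 4.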
# trigger.dev v4.4.2

## Summary
2 new features, 2 improvements, 8 bug fixes.

## Improvements
- Add input streams for bidirectional communication with running tasks.
Define typed input streams with `streams.input<T>({ id })`, then consume
inside tasks via `.wait()` (suspends the process), `.once()` (waits for
next message), or `.on()` (subscribes to a continuous stream). Send data
from backends with `.send(runId, data)` or from frontends with the new
`useInputStreamSend` React hook.
([triggerdotdev#3146](triggerdotdev#3146))
- Add PAYLOAD_TOO_LARGE error to handle graceful recovery of sending
batch trigger items with payloads that exceed the maximum payload size
([triggerdotdev#3137](triggerdotdev#3137))

## Bug fixes
- Fix slow batch queue processing by removing spurious cooloff on
concurrency blocks and fixing a race condition where retry attempt
counts were not atomically updated during message re-queue.
([triggerdotdev#3079](triggerdotdev#3079))
- fix(sdk): batch triggerAndWait variants now return correct
run.taskIdentifier instead of unknown
([triggerdotdev#3080](triggerdotdev#3080))

## Server changes

These changes affect the self-hosted Docker image and Trigger.dev Cloud:

- Two-level tenant dispatch architecture for batch queue processing.
Replaces the
single master queue with a two-level index: a dispatch index (tenant →
shard)
and per-tenant queue indexes (tenant → queues). This enables O(1) tenant
selection and fair scheduling across tenants regardless of queue count.
Improves batch queue processing performance.
([triggerdotdev#3133](triggerdotdev#3133))
- Add input streams with API routes for sending data to running tasks,
SSE reading, and waitpoint creation. Includes Redis cache for fast
`.send()` to `.wait()` bridging, dashboard span support for input stream
operations, and s2-lite support with configurable S2 endpoint, access
token skipping, and S2-Basin headers for self-hosted deployments. Adds
s2-lite to Docker Compose for local development.
([triggerdotdev#3146](triggerdotdev#3146))
- Speed up batch queue processing by disabling cooloff and increasing
the batch queue processing concurrency limits on the cloud:
  
  - Pro plan: increase to 50 from 10.
  - Hobby plan: increase to 10 from 5.
  - Free plan: increase to 5 from 1.
  ([triggerdotdev#3079](triggerdotdev#3079))
- Move batch queue global rate limiter from FairQueue claim phase to
BatchQueue worker queue consumer for accurate per-item rate limiting.
Add worker queue depth cap to prevent unbounded growth that could cause
visibility timeouts.
([triggerdotdev#3166](triggerdotdev#3166))
- Fix a race condition in the waitpoint system where a run could be
blocked by a completed waitpoint but never be resumed because of a
PostgreSQL MVCC issue. This was most likely to occur when creating a
waitpoint via `wait.forToken()` at the same moment as completing the
token with `wait.completeToken()`. Other types of waitpoints (timed,
child runs) were not affected.
([triggerdotdev#3075](triggerdotdev#3075))
- Fix metrics dashboard chart series colors going out of sync and
widgets not reloading stale data when scrolled back into view
([triggerdotdev#3126](triggerdotdev#3126))
- Gracefully handle oversized batch items instead of aborting the
stream.
  
When an NDJSON batch item exceeds the maximum size, the parser now emits
an error marker instead of throwing, allowing the batch to seal
normally. The oversized item becomes a pre-failed run with
`PAYLOAD_TOO_LARGE` error code, while other items in the batch process
successfully. This prevents `batchTriggerAndWait` from seeing connection
errors and retrying with exponential backoff.
  
Also fixes the NDJSON parser not consuming the remainder of an oversized
line split across multiple chunks, which caused "Invalid JSON" errors on
subsequent lines.
([triggerdotdev#3137](triggerdotdev#3137))
- Require the user is an admin during an impersonation session.
Previously only the impersonation cookie was checked; now the real
user's admin flag is verified on every request. If admin has been
revoked, the session falls back to the real user's ID.
([triggerdotdev#3078](triggerdotdev#3078))
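
The oversized-item handling described in the list above (triggerdotdev#3137) can be sketched as a chunked NDJSON parser that emits an error marker instead of throwing, then discards the rest of the oversized line even when it spans several chunks. This is an illustrative sketch, not the project's parser: it measures string length rather than bytes, and every name except the `PAYLOAD_TOO_LARGE` code is hypothetical.

```ts
type ParsedLine =
  | { kind: "item"; value: unknown }
  | { kind: "error"; code: "PAYLOAD_TOO_LARGE" };

function createNdjsonParser(maxLineLength: number) {
  let buffer = "";
  let skipping = false; // true while discarding the remainder of an oversized line

  return function push(chunk: string): ParsedLine[] {
    const out: ParsedLine[] = [];
    let rest = chunk;
    while (rest.length > 0) {
      const newline = rest.indexOf("\n");
      const piece = newline === -1 ? rest : rest.slice(0, newline);
      rest = newline === -1 ? "" : rest.slice(newline + 1);

      if (!skipping) {
        buffer += piece;
        if (buffer.length > maxLineLength) {
          // Emit a marker once, then ignore everything up to the next newline.
          out.push({ kind: "error", code: "PAYLOAD_TOO_LARGE" });
          buffer = "";
          skipping = true;
        }
      }
      if (newline !== -1) {
        if (!skipping && buffer.trim().length > 0) {
          out.push({ kind: "item", value: JSON.parse(buffer) });
        }
        buffer = "";
        skipping = false; // the next line starts clean
      }
    }
    return out;
  };
}
```

Because `skipping` survives across `push` calls, the tail of an oversized line arriving in a later chunk is not parsed as a new line, which is the failure mode behind the "Invalid JSON" errors mentioned above.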

<details>
<summary>Raw changeset output</summary>

# Releases
## @trigger.dev/build@4.4.2

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.2`

## trigger.dev@4.4.2

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/build@4.4.2`
    -   `@trigger.dev/core@4.4.2`
    -   `@trigger.dev/schema-to-json@4.4.2`

## @trigger.dev/python@4.4.2

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/sdk@4.4.2`
    -   `@trigger.dev/build@4.4.2`
    -   `@trigger.dev/core@4.4.2`

## @trigger.dev/react-hooks@4.4.2

### Patch Changes

- Add input streams for bidirectional communication with running tasks.
Define typed input streams with `streams.input<T>({ id })`, then consume
inside tasks via `.wait()` (suspends the process), `.once()` (waits for
next message), or `.on()` (subscribes to a continuous stream). Send data
from backends with `.send(runId, data)` or from frontends with the new
`useInputStreamSend` React hook.
([triggerdotdev#3146](triggerdotdev#3146))

Upgrade S2 SDK from 0.17 to 0.22 with support for custom endpoints
(s2-lite) via the new `endpoints` configuration, `AppendRecord.string()`
API, and `maxInflightBytes` session option.

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.2`

## @trigger.dev/redis-worker@4.4.2

### Patch Changes

- Fix slow batch queue processing by removing spurious cooloff on
concurrency blocks and fixing a race condition where retry attempt
counts were not atomically updated during message re-queue.
([triggerdotdev#3079](triggerdotdev#3079))
-   Updated dependencies:
    -   `@trigger.dev/core@4.4.2`

## @trigger.dev/rsc@4.4.2

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.2`

## @trigger.dev/schema-to-json@4.4.2

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.2`

## @trigger.dev/sdk@4.4.2

### Patch Changes

- Add input streams for bidirectional communication with running tasks.
Define typed input streams with `streams.input<T>({ id })`, then consume
inside tasks via `.wait()` (suspends the process), `.once()` (waits for
next message), or `.on()` (subscribes to a continuous stream). Send data
from backends with `.send(runId, data)` or from frontends with the new
`useInputStreamSend` React hook.
([triggerdotdev#3146](triggerdotdev#3146))

Upgrade S2 SDK from 0.17 to 0.22 with support for custom endpoints
(s2-lite) via the new `endpoints` configuration, `AppendRecord.string()`
API, and `maxInflightBytes` session option.

- fix(sdk): batch triggerAndWait variants now return correct
run.taskIdentifier instead of unknown
([triggerdotdev#3080](triggerdotdev#3080))

- Add PAYLOAD_TOO_LARGE error to handle graceful recovery of sending
batch trigger items with payloads that exceed the maximum payload size
([triggerdotdev#3137](triggerdotdev#3137))

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.2`

## @trigger.dev/core@4.4.2

</details>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
A top-level Errors page that aggregates errors from failed runs with
occurrence metrics.


https://github.com/user-attachments/assets/8f0ef55e-90dd-4faa-9051-59f4665181e4

Errors are “fingerprinted” so similar errors are grouped together (e.g.
errors whose messages differ only by an embedded ID).
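
One common way to fingerprint is to normalize the variable parts of a message before grouping. The sketch below assumes that approach. The normalization rules and function names are hypothetical and may not match what the Errors page actually does.

```ts
// Replace UUIDs and digit runs with placeholders so messages that differ
// only by an embedded ID collapse to the same key.
function fingerprint(errorName: string, message: string): string {
  const normalized = message
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "<uuid>")
    .replace(/\d+/g, "<n>");
  return `${errorName}:${normalized}`;
}

// Count occurrences per fingerprint.
function groupByFingerprint(errors: { name: string; message: string }[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const err of errors) {
    const key = fingerprint(err.name, err.message);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

A production version would likely hash the key and also take stack frames into account; both are left out here to keep the sketch short.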

Open an individual error to see a timeline of when it fired and the
affected runs, and to bulk replay them.
In theory this will make Log queries faster
…n payload schemas (triggerdotdev#3188)

<img width="2191" height="1023" alt="CleanShot 2026-03-06 at 13 36 53"
src="https://github.com/user-attachments/assets/4eba0d1a-1528-49a3-be5b-6bde89030193"
/>
<img width="411" height="1069" alt="CleanShot 2026-03-06 at 13 37 28"
src="https://github.com/user-attachments/assets/e5f7bb9c-c894-41cc-9ca6-96b43fcf6005"
/>

Add a tabbed sidebar to the Test page for standard tasks, reusing the
ClientTabs pattern from the Query page.

- Options tab: existing sidebar content (machine, version, queue, etc.)
- AI tab: AI-powered payload generation with streaming, supports JSON
  Schema, inferred schema from recent runs, and task source code lookup
  via tool calling for tasks without schemas
- Schema tab: displays payload JSON Schema (from schemaTask), inferred
  schema (from recent runs via @jsonhero/schema-infer), or empty state
  with schemaTask docs and example code

Data layer changes:
- Surface payloadSchema and inferredPayloadSchema from TestTaskPresenter
- Add payloadSchema and fileId to WorkerDeploymentWithWorkerTasks type
- Decompress zlib-deflated source files for AI context
### Fixes and improvements to the onboarding questions: 

**This change is worth double checking @matt-aitken** 
- Update to the Button.tsx file: it now takes `isLoading` that shows a
spinner in the middle of the button (replacing the button text and any
icons) and sets it to `disabled`. It does this nicely by keeping the
button width the same so there's no layout shift.

**Other fixes**
- Fixes an issue where if you type a custom option in the "What
technologies do you use" question, it doesn't check the list to see if
it matches. Now it checks the box if you've typed an option from that
list.
- When we randomize the list of onboarding question options, we now
store the position they appeared in the list
…iggerdotdev#3191)

When the dev CLI exits (e.g. ctrl+c via pnpm), runs that were
mid-execution previously stayed stuck in EXECUTING status for up to 5
minutes until the heartbeat timeout fired. Now they are cancelled within
seconds.

The dev CLI spawns a lightweight detached watchdog process at startup.
The watchdog monitors the CLI process ID and, when it detects the CLI
has exited, calls a new POST /engine/v1/dev/disconnect endpoint to
cancel all in-flight runs immediately (skipping PENDING_CANCEL since the
worker is known to be dead).

Watchdog design:
- Fully detached (detached: true, stdio: ignore, unref()) so it survives
  even when pnpm sends SIGKILL to the process tree
- Active run IDs maintained via atomic file write
(.trigger/active-runs.json)
- Single-instance guarantee via PID file (.trigger/watchdog.pid)
- Safety timeout: exits after 24 hours to prevent zombie processes
- On clean shutdown, the watchdog is killed (no disconnect needed)
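
The watchdog's decision on each poll can be sketched as a pure function with its side effects injected. This is a simplified illustration and not the CLI's code: `WatchdogDeps`, `watchdogTick`, and the synchronous `disconnect` callback are hypothetical. A real implementation would make an async HTTP call and could check liveness with Node's `process.kill(pid, 0)`, which throws when the process no longer exists.

```ts
interface WatchdogDeps {
  isParentAlive: () => boolean;          // e.g. wraps process.kill(pid, 0) in try/catch
  readActiveRunIds: () => string[];      // e.g. reads .trigger/active-runs.json
  disconnect: (runIds: string[]) => void; // e.g. POST /engine/v1/dev/disconnect
  now: () => number;                      // milliseconds
}

const MAX_LIFETIME_MS = 24 * 60 * 60 * 1000; // safety timeout from the design notes

type TickResult = "keep-watching" | "disconnected" | "timed-out";

function watchdogTick(startedAt: number, deps: WatchdogDeps): TickResult {
  // Safety timeout: stop watching rather than linger as a zombie process.
  if (deps.now() - startedAt >= MAX_LIFETIME_MS) return "timed-out";
  if (deps.isParentAlive()) return "keep-watching";

  // The CLI is gone: cancel whatever it was still running.
  const runIds = deps.readActiveRunIds();
  if (runIds.length > 0) deps.disconnect(runIds);
  return "disconnected";
}
```

The caller would run `watchdogTick` on an interval and exit the watchdog process on any result other than `"keep-watching"`.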

Disconnect endpoint:
- Rate-limited: 5 calls/min per environment
- Capped at 500 runs per call
- Small counts (<= 25): cancelled inline with pMap concurrency 10
- Large counts: delegated to the bulk action system
- Uses finalizeRun: true to skip PENDING_CANCEL and go straight to
FINISHED
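
The size-based routing in the endpoint can be sketched as a small planning function. The thresholds (500, 25, concurrency 10) come from the list above. `planDisconnect` and `DisconnectPlan` are hypothetical names, and truncating oversized requests rather than rejecting them is an assumption of this sketch.

```ts
const MAX_RUNS_PER_CALL = 500;
const INLINE_THRESHOLD = 25;
const INLINE_CONCURRENCY = 10;

type DisconnectPlan =
  | { mode: "inline"; runIds: string[]; concurrency: number }
  | { mode: "bulk"; runIds: string[] };

function planDisconnect(runIds: string[]): DisconnectPlan {
  // Assumption: extra IDs beyond the cap are dropped, not rejected.
  const capped = runIds.slice(0, MAX_RUNS_PER_CALL);
  if (capped.length <= INLINE_THRESHOLD) {
    return { mode: "inline", runIds: capped, concurrency: INLINE_CONCURRENCY };
  }
  return { mode: "bulk", runIds: capped };
}
```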

Run engine change:
- cancelRun() now respects finalizeRun when the run is in EXECUTING
  status, skipping the PENDING_CANCEL waiting state and going directly to
  FINISHED
## Summary
2 new features, 2 improvements.

## Improvements
- Add syncSupabaseEnvVars to pull database connection strings and save
them as trigger.dev environment variables
([triggerdotdev#3152](triggerdotdev#3152))
- Auto-cancel in-flight dev runs when the CLI exits, using a detached
watchdog process that survives pnpm SIGKILL
([triggerdotdev#3191](triggerdotdev#3191))

## Server changes

These changes affect the self-hosted Docker image and Trigger.dev Cloud:

- A new Errors page for viewing and tracking errors that cause runs to
fail
  
  - Errors are grouped using error fingerprinting
  - View top errors for a time period, filter by task, or search the text
  - View occurrences over time
  - View all the runs for an error and bulk replay them
  ([triggerdotdev#3172](triggerdotdev#3172))
- Add sidebar tabs (Options, AI, Schema) to the Test page for schemaTask
payload generation and schema viewing.
([triggerdotdev#3188](triggerdotdev#3188))

<details>
<summary>Raw changeset output</summary>

# Releases
## @trigger.dev/build@4.4.3

### Patch Changes

- Add syncSupabaseEnvVars to pull database connection strings and save
them as trigger.dev environment variables
([triggerdotdev#3152](triggerdotdev#3152))
-   Updated dependencies:
    -   `@trigger.dev/core@4.4.3`

## trigger.dev@4.4.3

### Patch Changes

- Auto-cancel in-flight dev runs when the CLI exits, using a detached
watchdog process that survives pnpm SIGKILL
([triggerdotdev#3191](triggerdotdev#3191))
-   Updated dependencies:
    -   `@trigger.dev/core@4.4.3`
    -   `@trigger.dev/build@4.4.3`
    -   `@trigger.dev/schema-to-json@4.4.3`

## @trigger.dev/core@4.4.3

### Patch Changes

- Auto-cancel in-flight dev runs when the CLI exits, using a detached
watchdog process that survives pnpm SIGKILL
([triggerdotdev#3191](triggerdotdev#3191))

## @trigger.dev/python@4.4.3

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.3`
    -   `@trigger.dev/build@4.4.3`
    -   `@trigger.dev/sdk@4.4.3`

## @trigger.dev/react-hooks@4.4.3

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.3`

## @trigger.dev/redis-worker@4.4.3

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.3`

## @trigger.dev/rsc@4.4.3

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.3`

## @trigger.dev/schema-to-json@4.4.3

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.3`

## @trigger.dev/sdk@4.4.3

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.3`

</details>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…riggerdotdev#3208)

Deprecates the syncVercelEnvVars build extension and adds warnings in
both the Vercel integration docs and the extension's own page to prevent
conflicts with the native env var sync
…tdev#3201)

## Adds 2 self serve features

### 1. self serve preview branches

- Copies the patterns of the self serve concurrency
- Self serve only available on Pro plan (otherwise you are linked to the
billing plans page)
- Global self serve branches limit: 180 (+20 for the Pro plan). It can
be overridden per Org
- You need to archive branches before reducing the number of extra
branches you're paying for
- Branches are removed immediately but remain billed until the end of
the billing cycle like extra concurrency

### 2. self serve team members

- Copies the patterns of the self serve concurrency
- Self serve only available on Pro plan (otherwise you are linked to the
billing plans page)
- The global number of self serve members is unlimited by default but can
be capped with the same env var quota and overridden per org if needed
- You need to remove team members before reducing the number of members
you pay for
- Team members are removed immediately but remain billed until the end
of the billing cycle like extra concurrency
…hards (triggerdotdev#3219)

Queues with concurrency keys now appear as a single entry in the master
queue instead of one entry per key. This prevents high-CK-count tenants
from consuming the entire `parentQueueLimit` window and starving other
tenants on the same shard.

A new per-queue **CK index** (sorted set) tracks active concurrency key
sub-queues. The master queue gets one `:ck:*` wildcard entry per base
queue. Dequeuing from that entry round-robins across sub-queues,
maintaining per-CK concurrency tracking and fairness.

All existing operations (enqueue, dequeue, ack, nack, DLQ, TTL expiry)
are CK-index-aware and keep the index consistent. Old-format entries
drain naturally during rollout — no migration step needed, single
deploy.
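
The round-robin behaviour can be illustrated with an in-memory model. This is a sketch under stated assumptions: a `Map`'s insertion order stands in for the sorted-set ordering of the CK index, per-key concurrency limits are ignored, and `createCkQueue` is a hypothetical name.

```ts
function createCkQueue<T>() {
  // One entry per active concurrency key; a key is removed when its sub-queue drains.
  const subQueues = new Map<string, T[]>();

  return {
    enqueue(ck: string, item: T): void {
      const queue = subQueues.get(ck);
      if (queue) queue.push(item);
      else subQueues.set(ck, [item]);
    },
    // Take one item from the key at the front, then rotate that key to the back.
    dequeue(): { ck: string; item: T } | undefined {
      for (const [ck, queue] of subQueues) {
        const item = queue.shift() as T; // sub-queues in the map are never empty
        subQueues.delete(ck);
        if (queue.length > 0) subQueues.set(ck, queue);
        return { ck, item };
      }
      return undefined;
    },
    activeKeys(): number {
      return subQueues.size;
    },
  };
}
```

From the outside the base queue is a single entry no matter how many keys it holds, and a key with a deep backlog cannot run ahead of keys that have only one message waiting.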
d-cs and others added 28 commits May 15, 2026 09:42
…erdotdev#3627)

## Summary

A "Google auth conflict" Sentry alert fires whenever a user signs in via
Google whose Google account is linked to one user row but whose
Google-provided email is now on a *different* user row. The handler in
`apps/webapp/app/models/user.server.ts:236` already does the right thing
— it returns the existing auth-linked user and skips the update path so
neither row gets mutated — but it logs the situation with
`logger.error`, which routes to Sentry as an exception and pages the
on-call channel.

There's no exception to chase here: the branch is the intended outcome
for a known data shape (user changed their email on one account after
originally signing up via Google on another). Downgrading the call to
`logger.warn` keeps the diagnostic record in our logs (with all the same
context fields — email, both user IDs, authIdentifier) but stops it
firing the production error alert.

## Change

- `logger.error` → `logger.warn` for the conflict branch in
`findOrCreateGoogleUser`. Context payload is unchanged.

## Test plan

- [x] Typecheck only — there's no behavioural change to test, the log
level is the entire diff.
…riggerdotdev#3625)

## Summary

The trigger-task hotpath used to early-return without a DB query when a
caller passed both a queue override and a per-trigger TTL, the hottest
configuration on the trigger API. Adding `triggerSource` to the resolver
so the runs-list "Source" filter could distinguish STANDARD / SCHEDULED /
AGENT runs removed those early-returns, costing +2 DB queries per
trigger on non-locked calls and +1 on locked calls.

This change caches `BackgroundWorkerTask` metadata (`ttl`,
`triggerSource`, `queueId`, `queueName`) in Redis so the resolver can
satisfy every caller configuration with a single `HGET` on the warm
path. PG fallback on miss back-fills the cache.

Follow-up to triggerdotdev#3542.

## Design

Two key spaces:

- `task-meta:env:{envId}`: the "current worker" view, refreshed at every
  deploy promotion. 24h safety TTL.
- `task-meta:by-worker:{workerId}`: used for `lockToVersion` triggers.
  Immutable post-create. 30d sliding TTL so historical workers age out.

Cache writes use Lua scripts via `defineCommand` so `DEL` + `HSET` +
`EXPIRE` land atomically — concurrent readers never see the empty
intermediate state of a naive pipeline. Read-path back-fill uses
single-field upserts so concurrent back-fills don't wipe each other's
siblings.

The cache lives behind its own `TASK_META_CACHE_REDIS_*` env-var prefix
that falls back to the default `REDIS_*` set, so operators can route the
cache to a dedicated Redis instance if they want.

The service/instance file split (`taskMetadataCache.server.ts` for the
pure class, `taskMetadataCacheInstance.server.ts` for the env-wired
singleton) mirrors the existing `runsReplicationService` /
`runsReplicationInstance` pattern.
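
The warm/cold read path can be sketched with an in-memory stand-in for the Redis hash. This is a simplified illustration and not the service's code: it is synchronous, ignores TTLs and the by-worker keyspace, and assumes each hash field is a task identifier holding a JSON value. `createTaskMetaCache` and `loadFromDb` are hypothetical names.

```ts
interface TaskMeta {
  ttl?: string;
  triggerSource: string;
  queueId: string;
  queueName: string;
}

function createTaskMetaCache(
  loadFromDb: (envId: string, taskId: string) => TaskMeta | undefined
) {
  // key -> field -> JSON string, mimicking one Redis hash per environment.
  const hashes = new Map<string, Map<string, string>>();
  let dbReads = 0;

  return {
    get(envId: string, taskId: string): TaskMeta | undefined {
      const key = `task-meta:env:${envId}`;
      const cached = hashes.get(key)?.get(taskId); // warm path: the single HGET
      if (cached !== undefined) return JSON.parse(cached) as TaskMeta;

      dbReads++; // cold path: fall back to Postgres
      const loaded = loadFromDb(envId, taskId);
      if (loaded) {
        // Back-fill one field only, so sibling fields written by others survive.
        const hash = hashes.get(key) ?? new Map<string, string>();
        hash.set(taskId, JSON.stringify(loaded));
        hashes.set(key, hash);
      }
      return loaded;
    },
    dbReads(): number {
      return dbReads;
    },
  };
}
```

The back-fill writes a single field rather than replacing the whole hash, which mirrors the single-field upsert described above for concurrent back-fills.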

## Test plan

- [ ] `pnpm run typecheck --filter webapp`
- [ ] `pnpm run test ./test/engine/triggerTask.test.ts --run` — 8
      existing tests untouched + 5 new tests covering warm cache, cold
      miss with back-fill, queue + ttl path, by-worker vs env keyspace,
      and the promotion cache write
- [ ] End-to-end against a dev worker: registering writes both keyspaces
      with the expected TTLs, and
      `redis-cli HGETALL "tr:task-meta:env:<envId>"`
      returns the cached entries


## Benchmark

Measured `DefaultQueueManager.resolveQueueProperties` against a real
Postgres + Redis (vitest `containerTest`, single-host docker). 500
sequential calls and 2,000 parallel calls (concurrency=50) per scenario,
request shaped as `{ taskId, queue: "bench-queue", ttl: "5m" }` — the
hot path this PR restores.

```
sequential (one in flight at a time):
[noop cache (baseline)]  n=500   mean=1.423ms  p50=1.394ms  p95=1.735ms  p99=2.629ms  max=11.100ms
[redis cache, cold   ]  n=500   mean=1.346ms  p50=1.283ms  p95=1.688ms  p99=2.463ms  max=5.058ms
[redis cache, warm   ]  n=500   mean=0.084ms  p50=0.078ms  p95=0.105ms  p99=0.156ms  max=1.129ms
speedup (warm vs baseline, sequential): 16.95x

parallel (concurrency=50):
[noop cache (baseline)]  n=2000  mean=10.069ms  p50=8.850ms  p95=14.718ms  p99=31.887ms  total=405ms  ops/s=4,940
[redis cache, warm   ]  n=2000  mean=0.614ms   p50=0.568ms  p95=1.189ms   p99=1.432ms   total=25ms   ops/s=80,389
throughput speedup (warm vs baseline, parallel): 16.27x
```

Read:

- **Warm cache cuts resolver latency 17×** at p50 — from ~1.4 ms to ~78
µs per call.
- **Cold cache is on par with baseline** — the extra `HGET` miss adds
<50 µs against the two Postgres queries that follow, so the worst case
is not worse than today.
- **Under burst load (50 concurrent triggers)**, the baseline's p99
jumps to ~32 ms as Postgres connections queue up; warm stays at ~1.4 ms.
The cache moves the saturation point from ~5k ops/s (PG pool) to ~80k
ops/s (single-client Redis pipelining).

Caveats: single-host docker, local Postgres + Redis, resolver-only
measurement (excludes the rest of the trigger transaction). Prod adds
region-local Redis RTT (~0.3–0.8 ms) which shifts warm absolute numbers
up but keeps the ratio intact.
…CP (triggerdotdev#3612)

## Summary

Adds a Region column and Region filter (under More filters) to the runs
list dashboard, the same filter on the public runs list API
(`filter[region]`), and a matching `region` input on the MCP `list_runs`
tool. Each run's executing region is also surfaced as a new optional
`region` field on the runs list and run retrieve responses, populated
from the worker instance group's `masterQueue` identifier.

Useful when you run tasks across multiple regions and want to slice the
runs list — or your existing run-querying scripts — by where the run
actually executed.

## Design

The filter value in the URL / API is the `masterQueue` identifier (the
same string already persisted on `TaskRun` and replicated to ClickHouse
as `worker_queue`), so the query just becomes `worker_queue IN (...)`
with no server-side translation. The Region dropdown options come from a
new resource loader backed by `RegionsPresenter`, which now also exposes
`masterQueue` alongside the existing region metadata.

```ts
// public API
const runs = await runs.list({ region: ["us-east-1", "eu-west-1"] });
// each item: { id, status, ..., region?: "us-east-1" }
```

```ts
// MCP
list_runs({ environment: "prod", region: "us-east-1" })
```
…dotdev#3628)

## Summary

Enables shipping `X.Y.Z-rc.N` prereleases of `@trigger.dev/*` via
changesets pre mode. RCs publish under the `rc` npm dist-tag, never
claim `latest`, and don't trigger marketing-site changelog PRs. The
plumbing is hyphen-in-version detection in `release.yml` — no separate
workflow, no opt-in flag at publish time.

Validated end-to-end against a sandbox repo (real npm publishes, Docker
builds, Helm chart pushes, GitHub releases) before porting back. Full RC
lifecycle tested: pre enter → rc.0 → iterate to rc.1 → pre exit →
stable. Plus interaction with the existing release-branch hotfix flow.

## What changes

### `release.yml`
- New `is_prerelease` output (hyphen-in-version)
- GitHub release adds `--prerelease` flag for RC publishes (Pre-release
badge, not Latest)
- `dispatch-changelog` job gated on `is_prerelease != 'true'` — no
marketing-site PR per RC

### Docker workflows
- Removes the `:v4-beta` floating tag entirely from `publish-webapp.yml`
and `publish-worker-v4.yml`. v4 is GA; the tag is a misnomer and is
already inconsistent with the npm side (npm `v4-beta` dist-tag was
frozen at 4.0.4 months ago while Docker `:v4-beta` kept bumping).
Self-hosters should pin to a versioned tag going forward — the last
value of `:v4-beta` stays frozen wherever it currently points.

### CLI version-check fix
(`packages/cli-v3/src/utilities/initialBanner.ts`)
Switches the "new version available" comparison from JavaScript
`localeCompare` to `semver.lt`. The old comparison handled `X.Y.Z-rc.N`
vs `X.Y.Z` incorrectly — a user on `4.5.0-rc.0` would never be prompted
to upgrade once `4.5.0` stable shipped (lex order put the prerelease
ahead of the bare version). Real semver gets this right.

Stable users were never affected: the check queries the `@latest`
dist-tag, which by convention never points at a prerelease.
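
The difference between the two comparisons is easy to reproduce. The sketch below uses a small hand-rolled precedence check in place of the `semver` package so it stays self-contained. `semverLt` is a hypothetical stand-in for `semver.lt` that ignores build metadata and does no input validation.

```ts
function parseVersion(version: string): { core: number[]; pre: string[] } {
  const dash = version.indexOf("-");
  const core = (dash === -1 ? version : version.slice(0, dash)).split(".").map(Number);
  const pre = dash === -1 ? [] : version.slice(dash + 1).split(".");
  return { core, pre };
}

// True when `a` has lower semver precedence than `b`.
function semverLt(a: string, b: string): boolean {
  const x = parseVersion(a);
  const y = parseVersion(b);
  for (let i = 0; i < 3; i++) {
    const diff = (x.core[i] ?? 0) - (y.core[i] ?? 0);
    if (diff !== 0) return diff < 0;
  }
  // Same X.Y.Z: a prerelease sorts below the plain release.
  if (x.pre.length === 0 || y.pre.length === 0) return x.pre.length > 0;
  for (let i = 0; i < Math.max(x.pre.length, y.pre.length); i++) {
    const p = x.pre[i];
    const q = y.pre[i];
    if (p === undefined) return true; // fewer identifiers sorts first
    if (q === undefined) return false;
    if (p === q) continue;
    const pNumeric = /^\d+$/.test(p);
    const qNumeric = /^\d+$/.test(q);
    if (pNumeric && qNumeric) return Number(p) < Number(q);
    if (pNumeric !== qNumeric) return pNumeric; // numeric identifiers sort before alphanumeric
    return p < q;
  }
  return false;
}

// String ordering puts the longer "4.5.0-rc.0" after "4.5.0", the opposite of semver.
const lexSaysRcIsNewer = "4.5.0-rc.0".localeCompare("4.5.0") > 0;
const semverSaysRcIsOlder = semverLt("4.5.0-rc.0", "4.5.0");
```

Both flags evaluate to `true`, which is the mismatch described above: string ordering treats the release candidate as the newer version, while semver precedence ranks it below the stable release.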

## How an RC actually publishes after this

1. `pnpm exec changeset pre enter rc` on main, push the `pre.json`
2. Bot regenerates the release PR as `chore: release v<X.Y.Z>-rc.0`
3. Merge → `release.yml` runs `changeset publish` which reads
`pre.json.tag` and publishes under `--tag rc`. GitHub release marked
Pre-release. No marketing-site dispatch.
4. Iterate by adding changesets normally; bot bumps to `rc.1`, `rc.2`, …
5. When ready: `pnpm exec changeset pre exit`, push, merge regenerated
PR → stable ships under `latest` and the marketing-site dispatch fires.
…#3629)

## Summary

Refocuses the v4.5.0 changeset and server-changes content on the
public-facing AI features story, replacing the pre-release-internal diff
framing that had accumulated in `.changeset/` and `.server-changes/`.
Pairs with the RC support PR — the next bot regeneration will pick up
this content.

## What's in here

### Changeset rewrites

- **`chat-agent.md` rewritten as the headline AI Agents entry** —
written from the `docs/ai-chat/` surface (not from internal pre-release
diffs). Covers useChat integration, multi-turn durability via Sessions,
lifecycle hooks, stop generation, tool approvals (HITL), pending
messages + background injection, actions, typed state primitives,
`chat.toStreamTextOptions()`, multi-tab coordination, network
resilience, and the first-turn fast path (`chat.headStart`).
- **New `ai-prompts.md`** — announces the Prompts feature publicly for
the first time. Code-defined templates, deploy-versioning, dashboard
overrides, AI SDK telemetry integration, `chat.agent` integration via
`chat.prompt.set()` + `chat.toStreamTextOptions()`, full management SDK.
- **`sessions-primitive.md` expanded** — calls out
`tasks.triggerAndSubscribe()` and `sessions.list` as standalone
primitives (not just chat.agent infrastructure).
- **`chat-agent-on-boot-hook.md` trimmed** — drops "if you previously…"
pre-release migration framing.
- **Deletes 4 changesets** that described pre-release-internal
migrations or were circular ("groundwork for the upcoming chat.agent" —
chat.agent ships in the same release).

### Server-changes rewrites (`.server-changes/`)

Five new entries for the dashboard surface of the AI feature set:
- Agents list page
- Agent Playground
- Sessions dashboard
- Prompts dashboard (list with usage sparklines + detail with template /
Generations / Metrics / Versions tabs + override UI)
- Models registry (provider-grouped catalog with cross-tenant usage
metrics)
- AI generation span inspector on run traces
- Runs list Task source filter (Standard / Scheduled / Agent)
- Run-detail Agent view (segmented control)

Each entry is 1–2 sentences, no bullets, no implementation file paths —
fits as a single bullet in a future changelog.

Three older `.server-changes/` files were merged or split into the
cleaner taxonomy above and deleted.

## Out of scope

Non-AI-feature server-changes (admin-tabs, queue-length-cap fix,
worker-deployment race, streamdown upgrade, etc.) and changesets
(idempotency-key cap, sigsegv retry, locals-key fix, plugin auth, region
filters, etc.) are untouched.
…3630)

## Summary

Adds `.changeset/pre.json` to put the repo into changesets pre mode with
dist-tag `rc`. After this merges, the changesets bot regenerates the
existing release PR as `chore: release v4.5.0-rc.0`. Merging that PR
publishes the first release candidate of 4.5.0 to npm under `@rc`.

The pre-mode plumbing landed in triggerdotdev#3628. The release content (chat.agent +
sessions + ai prompts + dashboard server-changes) landed in triggerdotdev#3629.

## What ships when the bot PR merges

Under dist-tag `rc`:
- `@trigger.dev/{sdk,core,build,react-hooks,redis-worker,plugins,python,rsc,schema-to-json}@4.5.0-rc.0`
- `trigger.dev@4.5.0-rc.0`

Plus:
- Docker image `ghcr.io/triggerdotdev/trigger.dev:v4.5.0-rc.0`
(immutable tag only — `:v4-beta` is not touched)
- Helm chart `oci://ghcr.io/triggerdotdev/charts/trigger.dev:4.5.0-rc.0`
- GitHub release `v4.5.0-rc.0` marked as Pre-release (no Latest badge)

What does NOT happen:
- npm `latest` stays at 4.4.6
- No marketing-site changelog PR (gated on `is_prerelease != 'true'`)
- Docker `:latest` not touched (we never push it anyway in this repo)

## Iteration

For subsequent rc.N: add a regular changeset to main, bot regenerates
the release PR as `v4.5.0-rc.N`. Merge to ship.

## Exiting pre mode

When ready to ship stable: `pnpm exec changeset pre exit`, push, merge
regenerated PR. That publishes `4.5.0` under `latest` and fires the
marketing-site dispatch.
…v#3631)

## Summary

Renumber `029_add_task_kind_to_task_runs_v2.sql` →
`031_add_task_kind_to_task_runs_v2.sql` to fix a deploy-blocking
out-of-order migration, and make the DDL idempotent with `ADD COLUMN IF
NOT EXISTS` / `DROP COLUMN IF EXISTS`.

## Root cause

- Migration `030_create_sessions_v1.sql` landed on main on 2026-04-28
(PR triggerdotdev#3417) and was applied to test cloud ClickHouse on a subsequent
deploy. Current goose version on test ClickHouse: **30**.
- Migration `029_add_task_kind_to_task_runs_v2.sql` was authored later
on 2026-05-10 as part of the Sessions primitive PR series (`be1a6cf8`).
- The next test cloud deploy failed because goose strict-mode refused to
apply a missing version *before* the current version:

```
goose run: error: found 1 missing migrations before current version 30:
  version 29: 029_add_task_kind_to_task_runs_v2.sql
```

## Fix

1. **Rename to `031_*`** (next available number after 030). Goose now
treats it as a new migration after 030 and applies it cleanly on
test/prod where the column does not yet exist.
2. **Make the DDL idempotent** (`ADD COLUMN IF NOT EXISTS`). The
original 029 may have been applied in environments that ran goose with
`--allow-missing` (e.g. some local dev databases) — those would have the
column already, and the rename causes goose to see 031 as new and
re-attempt the ADD. Idempotent DDL keeps that path safe. The `Down`
mirrors with `DROP COLUMN IF EXISTS`.
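
The ordering failure described under "Root cause" can be modelled in a few lines. This is a minimal sketch, assuming a strict-mode check that rejects any on-disk migration that is unapplied and numbered below the highest applied version. `findMissingBeforeCurrent` and the `Migration` type are hypothetical names for illustration, not goose's actual implementation, and only the applied version relevant to the incident is listed.

```typescript
// Hypothetical model of a strict-mode ordering check; this is not goose's code.
interface Migration {
  version: number;
  file: string;
}

// Returns migrations that exist on disk, were never applied, and sit below
// the highest applied version. Strict mode treats any such entry as an error.
function findMissingBeforeCurrent(
  onDisk: Migration[],
  appliedVersions: number[],
): Migration[] {
  const current = appliedVersions.length > 0 ? Math.max(...appliedVersions) : 0;
  return onDisk
    .filter((m) => m.version < current && appliedVersions.indexOf(m.version) === -1)
    .sort((a, b) => a.version - b.version);
}

// Test cloud had version 30 applied when 029 was authored (illustrative state).
const appliedOnTest = [30];

const beforeRename: Migration[] = [
  { version: 29, file: "029_add_task_kind_to_task_runs_v2.sql" },
  { version: 30, file: "030_create_sessions_v1.sql" },
];

const afterRename: Migration[] = [
  { version: 30, file: "030_create_sessions_v1.sql" },
  { version: 31, file: "031_add_task_kind_to_task_runs_v2.sql" },
];

// Before the rename the check reports version 29; after it, nothing is reported.
const blockedBefore = findMissingBeforeCurrent(beforeRename, appliedOnTest);
const blockedAfter = findMissingBeforeCurrent(afterRename, appliedOnTest);
```

Renumbering moves the file above the current version, so the check has nothing to report and the migration is applied as a normal pending one.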

## Test plan

- [ ] Test cloud deploy (after this lands) successfully runs the
ClickHouse migration step
- [ ] `task_kind` column shows up on `trigger_dev.task_runs_v2`
post-migration
- [ ] Local environments that had previously applied 029 do not error on
the next `goose up`
…dotdev#3633)

## Summary

Codify two rules for ClickHouse migration authors that came out of the
029/030 ordering incident on the TRI-9367 test cloud deploy:

1. **Number files to `max(existing) + 1`, never slot in below the
latest.** Goose runs in strict mode in the cloud deploy pipeline and
refuses to apply a missing version below the current version — slotting
a file in below an already-applied number blocks the next deploy.
2. **DDL must be idempotent** (`ADD COLUMN IF NOT EXISTS`, `DROP COLUMN
IF EXISTS`, `CREATE TABLE IF NOT EXISTS`, etc.) so a retry or
out-of-order apply (`goose up --allow-missing` for local recovery,
manual fixups) is a no-op rather than an error.
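
Both rules lend themselves to a mechanical check. The sketch below is one possible shape for such a check, assuming migration files are named `NNN_description.sql`. The helper names `nextMigrationPrefix` and `findNonIdempotentDdl` are hypothetical and are not part of the repo's tooling; the regular expressions only cover the three statement forms named above.

```typescript
// Hypothetical helpers for illustration; not part of the repo's tooling.

// Rule 1: the next file number is max(existing) + 1, zero-padded to three digits.
function nextMigrationPrefix(existingFiles: string[]): string {
  const versions = existingFiles
    .map((f) => /^(\d+)_/.exec(f))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => Number(m[1]));
  const next = (versions.length > 0 ? Math.max(...versions) : 0) + 1;
  return String(next).padStart(3, "0");
}

// Rule 2: flag DDL statements that omit the IF [NOT] EXISTS guard.
function findNonIdempotentDdl(sql: string): string[] {
  const checks: Array<[RegExp, string]> = [
    [/\bADD\s+COLUMN\b(?!\s+IF\s+NOT\s+EXISTS)/i, "ADD COLUMN without IF NOT EXISTS"],
    [/\bDROP\s+COLUMN\b(?!\s+IF\s+EXISTS)/i, "DROP COLUMN without IF EXISTS"],
    [/\bCREATE\s+TABLE\b(?!\s+IF\s+NOT\s+EXISTS)/i, "CREATE TABLE without IF NOT EXISTS"],
  ];
  return checks
    .filter(([pattern]) => pattern.test(sql))
    .map(([, message]) => message);
}

const nextPrefix = nextMigrationPrefix([
  "029_add_task_kind_to_task_runs_v2.sql",
  "030_create_sessions_v1.sql",
]);
```

A check like this could run in CI or in a review bot; it only inspects text, so it will not catch statements built dynamically.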

## Where the rules live

- `internal-packages/clickhouse/CLAUDE.md` — full rules + example for
migration authors (and AI agents writing migrations).
- `.claude/REVIEW.md` — added a 🔴 finding under "What makes a 🔴
Important finding" so PR reviewers flag either fault as blocking.

The existing migration files are left untouched; the idempotency
requirement applies going forward.

## Test plan

- [ ] Next ClickHouse migration PR uses `IF NOT EXISTS` / `IF EXISTS`
forms
- [ ] No new migration files numbered below an already-applied version
on test/prod
`dorny/paths-filter` defaults to OR semantics across the pattern array,
so the leading `**` matched every file and the `!...` excludes were
no-ops. The `code` filter has been returning `true` for every PR since
triggerdotdev#3615.

Split into two filter steps: `code` moves into its own step with
`predicate-quantifier: every` so excludes actually subtract. The two
re-include workflow files become a separate `typecheck_self` filter that
the `typecheck` job ORs into its `if:`.

Side effect: workflow-file-only PRs that don't touch `pr_checks.yml` or
`typecheck.yml` no longer trigger typecheck. Previously they did because
the filter was broken-true.
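
The difference between the two quantifiers can be seen in a toy model. This is a sketch under stated assumptions: `matchesPattern` is a deliberately tiny stand-in that understands only `**`, `dir/**`, exact paths, and a leading `!`, and the pattern list is illustrative rather than the repo's actual filter.

```typescript
// Toy model of the two predicate quantifiers. The matcher is a hypothetical
// stand-in, not the glob engine the action uses.
type Quantifier = "some" | "every";

function matchesPattern(pattern: string, file: string): boolean {
  if (pattern.startsWith("!")) return !matchesPattern(pattern.slice(1), file);
  if (pattern === "**") return true;
  if (pattern.endsWith("/**")) return file.startsWith(pattern.slice(0, -2));
  return pattern === file;
}

function filterMatches(patterns: string[], file: string, quantifier: Quantifier): boolean {
  return quantifier === "every"
    ? patterns.every((p) => matchesPattern(p, file))
    : patterns.some((p) => matchesPattern(p, file));
}

// Illustrative patterns: "everything except docs".
const codeFilter = ["**", "!docs/**"];

// With "some", the leading "**" already matches, so the exclude never matters.
const docsUnderSome = filterMatches(codeFilter, "docs/intro.md", "some");
// With "every", the negated pattern must also hold, so docs files drop out.
const docsUnderEvery = filterMatches(codeFilter, "docs/intro.md", "every");
```

Under `every`, a re-include pattern would have to match every file as well, which is why the two re-included workflow files need their own filter.
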
…dotdev#3641)

`TriggerChatTransport` had a single `baseURL` option covering both the
`.in/append` POSTs and the long-lived `.out` SSE subscription. Customers
wanting to route the SSE through a proxy (e.g. a Cloudflare worker
capturing JA4 fingerprints for bot detection) had to send every append
through the proxy too, adding a hop to every user message.

New optional `streamBaseURL` overrides the SSE base URL only; appends
keep using `baseURL`. Falls back to `baseURL` when unset, so existing
transports are unchanged.

```ts
const transport = new TriggerChatTransport({
  task: "ai-chat",
  baseURL: "https://api.trigger.dev",
  streamBaseURL: "https://chat-proxy.example.com",
  accessToken,
  startSession,
});
```

Verified with a new test in `chat.test.ts` that asserts `.in/append`
routes through `baseURL` and `.out` SSE routes through `streamBaseURL`.
All existing tests still pass.
… self-hosted builds (triggerdotdev#3618)

## Summary

Local self-hosted deploys (`trigger deploy --local-build --push
--builder orbstack` or any other buildx setup using the **docker**
driver) fail at the push step with:

```
ERROR: failed to build: failed to solve:
  exporter option "rewrite-timestamp" conflicts with "unpack"
```

The docker driver auto-enables `unpack=true` when pushing, and that's
incompatible with `rewrite-timestamp` (which the CLI sets for
reproducible-build hashing).

Adds a simple env-var opt-out so contributors can keep using their
default builder. The flag is only read by the local-build code path;
remote/cloud builds are unaffected.

```bash
TRIGGER_BUILD_SKIP_REWRITE_TIMESTAMP=1 \
  pnpm exec trigger deploy --profile default --local-build --push --builder orbstack
```

The trade-off: skipping `rewrite-timestamp` means layer timestamps
reflect actual build time, so two identical builds produce different
layer hashes. Fine for a local-dev registry; the only real consumer of
timestamp-stability is registry-layer cache hit rates.

## Test plan

- [x] Manual: ran `trigger deploy --profile default --local-build --push
--builder orbstack` against the localhost webapp + a local Docker
registry on port 5001 — first failed with the rewrite-timestamp/unpack
error, then succeeded after setting
`TRIGGER_BUILD_SKIP_REWRITE_TIMESTAMP=1`.
- [x] Full chat.agent smoke sweep (15 tests, including suspend/resume,
deepResearch subtask, AgentChat orchestrator) against the deployed image
— all pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
…riggerdotdev#3632)

## Summary

- Adds a `beforeSend` rule in `apps/webapp/sentry.server.ts` that
collapses Prisma `P1001` ("Can't reach database server") errors into a
single Sentry issue regardless of which call site threw, by setting
`event.fingerprint = ["prisma-p1001-db-unreachable"]` and tagging
`db_unreachable:true`.
- Matches both `err.code === "P1001"` (Prisma's `KnownRequestError` when
a connection drops mid-query) and `err.errorCode === "P1001"`
(`InitializationError` when the client fails to connect at startup).
- Implemented as a small extensible `FINGERPRINT_RULES` table so further
fan-out errors can be added with one entry.
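
A rule table of this kind might look roughly like the sketch below. It assumes minimal local types (`SentryLikeEvent`, `SentryLikeHint`) in place of the Sentry SDK's own event and hint types, and the `FingerprintRule` shape and `hasCode` helper are hypothetical. Only the fingerprint string, the tag, and the two matched properties come from the description above.

```typescript
// Minimal local types so the sketch is self-contained; the real code would
// use the event and hint types from the Sentry SDK.
interface SentryLikeEvent {
  fingerprint?: string[];
  tags?: Record<string, string>;
}

interface SentryLikeHint {
  originalException?: unknown;
}

// Hypothetical rule shape: one entry per error family to collapse.
interface FingerprintRule {
  matches: (err: unknown) => boolean;
  fingerprint: string[];
  tags: Record<string, string>;
}

// Matches either `code` or `errorCode` on the thrown value.
function hasCode(err: unknown, code: string): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { code?: unknown; errorCode?: unknown };
  return e.code === code || e.errorCode === code;
}

const FINGERPRINT_RULES: FingerprintRule[] = [
  {
    matches: (err) => hasCode(err, "P1001"),
    fingerprint: ["prisma-p1001-db-unreachable"],
    tags: { db_unreachable: "true" },
  },
];

function beforeSend(event: SentryLikeEvent, hint: SentryLikeHint): SentryLikeEvent {
  const rule = FINGERPRINT_RULES.find((r) => r.matches(hint.originalException));
  if (!rule) return event; // non-matching errors pass through untouched
  return {
    ...event,
    fingerprint: rule.fingerprint,
    tags: { ...event.tags, ...rule.tags },
  };
}
```

Adding another fan-out error is then a matter of appending one more entry to `FINGERPRINT_RULES`.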

## Verification

End-to-end verified locally with `debug: true` on the SDK:
- Real Prisma `P1001` thrown from a loader (DB stopped mid-request) is
captured by Sentry's Remix auto-instrumentation
- `beforeSend` fires with `originalException.code === "P1001"`, rule
matches
- `event.fingerprint = ["prisma-p1001-db-unreachable"]` and
`tags.db_unreachable = "true"` applied
- Event lands in Sentry under the new fingerprint

## Test plan

- [ ] Deploy to staging; confirm P1001 events appear under a single
`prisma-p1001-db-unreachable` issue rather than fanning out
- [ ] Confirm `db_unreachable:true` tag is filterable in Sentry
- [ ] Verify non-P1001 errors are unaffected (event passes through
`beforeSend` untouched)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gerdotdev#3653)

## Summary

The PUT handler at `/realtime/v1/streams/:runId/:target/:streamId` ran
`taskRun.update({ realtimeStreams: { push: streamId } })` on every call,
even when the `streamId` was already present. SDK call patterns that
re-initialize the same stream key on every chunk produce a per-write row
UPDATE, duplicate entries pile up in the array, and the row-lock + TOAST
rewrite cost grows unbounded on long-running stream sessions.

## Fix

Mirror the sibling append handler: read the array first and only push
when the `streamId` isn't already present. Identical behavior for
first-time stream creation; repeat creates short-circuit to a single
indexed read. The dashboard's per-run stream listing keeps working
because the first create still records the entry.
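
The read-then-push guard can be sketched against an in-memory stand-in. `FakeRunStore` and `registerStream` are hypothetical names; the real handler goes through Prisma and an HTTP route, neither of which is modelled here.

```typescript
// In-memory stand-in for the TaskRun row; names are hypothetical.
interface RunRow {
  realtimeStreams: string[];
}

class FakeRunStore {
  writes = 0;
  private rows = new Map<string, RunRow>();

  constructor(runIds: string[]) {
    for (const id of runIds) this.rows.set(id, { realtimeStreams: [] });
  }

  read(runId: string): RunRow | undefined {
    return this.rows.get(runId);
  }

  pushStream(runId: string, streamId: string): void {
    const row = this.rows.get(runId);
    if (!row) return;
    row.realtimeStreams.push(streamId);
    this.writes += 1; // counts row UPDATEs
  }
}

// Returns true when a write was issued, false when the call short-circuited.
function registerStream(store: FakeRunStore, runId: string, streamId: string): boolean {
  const row = store.read(runId);
  if (!row) throw new Error(`run ${runId} not found`);
  if (row.realtimeStreams.indexOf(streamId) !== -1) return false; // already recorded
  store.pushStream(runId, streamId);
  return true;
}

const store = new FakeRunStore(["run_1"]);
const firstCall = registerStream(store, "run_1", "stream_a");
const repeatCall = registerStream(store, "run_1", "stream_a");
```

Only the first call for a given `(run, streamId)` pair costs a write; every repeat is a read.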

## Test plan

- [ ] A fresh PUT for a new `(run, streamId)` adds the entry to the
array
- [ ] A repeat PUT for the same pair leaves the array unchanged
- [ ] 404 is returned when the run doesn't exist; 400 when the run is
completed
…erdotdev#3644)

## Summary

Long-running chat agents were filling `session.out` forever — every
`chat.agent` turn appended to the same S2 stream with no trim, and the
Sessions dashboard re-streamed the entire history from `seq_num=0` on
every page load. After this change the agent appends an S2 `trim`
command record after each `trigger:turn-complete`, pointing back at the
previous turn-complete's seq_num. `session.out` stays roughly one turn
long at steady state, regardless of session age.
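
The effect of the trim chain on stream length can be illustrated with a small simulation. This is a rough sketch, assuming a trim point means "the earliest sequence number to keep" and that the trim command itself occupies a record. `FakeStream` and `runTurn` are hypothetical; the simulated trim applies immediately, which a real stream may not do.

```typescript
// Hypothetical in-memory stream; it models sequence numbers and trimming only.
type RecordKind = "chunk" | "turn-complete" | "trim";

interface StreamRecord {
  seqNum: number;
  kind: RecordKind;
}

class FakeStream {
  private records: StreamRecord[] = [];
  private nextSeq = 0;

  append(kind: RecordKind): number {
    const seqNum = this.nextSeq++;
    this.records.push({ seqNum, kind });
    return seqNum;
  }

  // Appends a trim command record, then drops everything before `trimPoint`.
  trim(trimPoint: number): void {
    this.append("trim");
    this.records = this.records.filter((r) => r.seqNum >= trimPoint);
  }

  get length(): number {
    return this.records.length;
  }
}

// Writes one turn and trims back to the previous turn-complete, if there is one.
// Returns this turn's turn-complete sequence number for the next call.
function runTurn(stream: FakeStream, chunks: number, prevTurnComplete: number | null): number {
  for (let i = 0; i < chunks; i++) stream.append("chunk");
  const turnComplete = stream.append("turn-complete");
  if (prevTurnComplete !== null) stream.trim(prevTurnComplete);
  return turnComplete;
}

const simulated = new FakeStream();
let previous: number | null = null;
for (let turn = 0; turn < 50; turn++) {
  previous = runTurn(simulated, 5, previous);
}
// simulated.length now reflects one turn plus the retained boundary records,
// not 50 turns of history.
```

Because each trim points at the previous turn-complete rather than the current one, the prior boundary record is still readable when a client resumes across a single turn.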

`trigger:turn-complete` and `trigger:upgrade-required` move from
`chunk.type`-shaped data records into header-form control records under
a uniform `trigger-control` namespace. Built-in transports
(`TriggerChatTransport`, `AgentChat`, the dashboard's `AgentView`)
handle the new shape transparently. Custom transports need a one-line
filter on the `trigger-control` header — see the rewritten "Records on
session.out" section in the client-protocol docs.

The Sessions detail page in the dashboard fetches the agent's per-turn
S3 snapshot via a presigned URL and seeds the transcript view, then
SSE-tails from the snapshot's `lastOutEventId`. Bandwidth and
time-to-first-render scale with unread turns instead of session
lifetime.

Resume contract is now explicit: single-turn-boundary resume always
works (the prior turn-complete is still on the stream), the S2 trim is
eventually consistent over 10-60s, and multi-turn-away resume falls back
to a snapshot reload.
…iggerdotdev#3642)

## Summary

Two papercuts new contributors hit running this repo locally:

1. Fresh clones default to v1 (Redis-only) realtime streams, so Sessions
and `chat.agent` error with `"S2 configuration is missing"`, even though
the `s2` service is already in `docker/docker-compose.yml` and pre-seeds
a `trigger-local` basin. Wire `REALTIME_STREAMS_S2_*` to it in
`.env.example` so the new-contributor flow just works. (Also drop the s2
healthcheck: the image is distroless, so the `wget` check always reports
unhealthy.)

2. Two clones can't both run `pnpm run docker` because ports, project
name, and container names are all hardcoded. Parameterize every host
port as `${VAR:-default}`, drive the project name via
`COMPOSE_PROJECT_NAME` (with a top-level `name:` field as the default),
prefix container names with `${CONTAINER_PREFIX:-}`, and pass
`--env-file .env` so compose reads the same root `.env` the webapp does.
The "Running multiple instances side by side" block in `.env.example`
lists every overridable knob.

Also split the optional services (`electric-shard-1`, `ch-ui`,
`toxiproxy`, `nginx-h2`, `otel-collector`, `prometheus`, `grafana`) into
`docker-compose.extras.yml` behind a new `pnpm run docker:full` script.
The core stack keeps everything the webapp actually needs to boot:
postgres, redis, electric, minio, clickhouse + migrator, s2-lite.

Defaults match every previous hardcoded value, so existing setups keep
working without touching `.env`.

## Test plan

- [x] `pnpm run docker` on a clean clone brings up the core services on
the standard ports under the `triggerdotdev-docker` project name.
- [x] Setting `COMPOSE_PROJECT_NAME=triggerdotdev-docker-alt` + the
`*_HOST_PORT` overrides in `.env` brings up a second stack alongside the
default one with no port or container-name clashes.
- [x] Webapp boots cleanly against the default `.env.example` values;
`/healthcheck` returns 200, no S2 errors.
- [x] s2-lite basin `trigger-local` accepts an append + read via the
same REST endpoints the webapp uses.
- [x] `pnpm run docker:full` brings up the optional services alongside
the core ones in the same project.
…gerdotdev#3614)

## Summary
- Introduce the Mollifier: a Redis-backed buffer for `trigger()` API
calls during traffic spikes, with a per-env trip evaluator and a drainer
ack-loop.
- Phase 1 is dual-write monitoring — every mollified trigger is buffered
to Redis AND continues to `engine.trigger`. No customer-facing behaviour
change.
- Telemetry events: `mollifier.would_mollify`, `mollifier.buffered`,
`mollifier.drained`, plus the `mollifier.decisions` counter.
- Gated behind a feature flag (default off).

## Test plan
- [x] `pnpm run test --filter @trigger.dev/redis-worker`
- [x] `pnpm run test --filter webapp -- mollifier`
- [x] Manual: with flag off, no behaviour change vs main
- [x] Manual: with flag on + threshold lowered, observe
`mollifier.buffered` + `mollifier.drained` log pairs with matching
`runId`

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h on writer failure (triggerdotdev#3658)

## Summary

Hot-loop writers — `streams.writer` / `streams.pipe` on the run-scoped
side, `chat.response.write` / `chat.stream.*` on the session side — were
issuing a fresh `PUT` to mint S2 credentials for every chunk. On run
streams, each PUT also pushed the streamId onto `TaskRun.realtimeStreams`,
so a chat-agent turn writing N chunks produced N PUTs and N duplicate
array pushes against the same row.

The SDK now caches the initialize response per cache slot: `(runId, key)`
for run streams, the session id for session streams. First call PUTs as
before; subsequent calls reuse the cached promise. Hot-loop writers do
one PUT per slot for the lifetime of the cache.

S2 access tokens have a 1-day TTL. If a writer's `wait()` rejects (auth
error, expired token, network blip), the cache evicts the matching slot
so the next call re-PUTs and mints fresh credentials. The eviction is
identity-checked, so a concurrent caller's fresh promise isn't
accidentally cleared.
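
A per-slot promise cache with identity-checked eviction could be sketched as follows. `InitCache`, `getOrInit`, and `evictIfSame` are hypothetical names for illustration, not the SDK's internals; the slot key format and the fake `init` function are assumptions too.

```typescript
// Hypothetical cache shape; names are illustrative, not the SDK's internals.
class InitCache<T> {
  private slots = new Map<string, Promise<T>>();

  // Returns the cached initialize promise for `slot`, creating it on first use.
  getOrInit(slot: string, init: () => Promise<T>): Promise<T> {
    const cached = this.slots.get(slot);
    if (cached) return cached;
    const fresh = init();
    this.slots.set(slot, fresh);
    return fresh;
  }

  // Evicts `slot` only if it still holds `expected`, so a failure observed on
  // an old promise cannot clear a newer promise stored by a concurrent caller.
  evictIfSame(slot: string, expected: Promise<T>): boolean {
    if (this.slots.get(slot) !== expected) return false;
    this.slots.delete(slot);
    return true;
  }
}

// Stand-in for the initialize PUT; counts how many times it is issued.
let puts = 0;
const fakeInit = async (): Promise<string> => `creds-${++puts}`;

const cache = new InitCache<string>();
const slot = "run_123:my-stream"; // assumed "(runId, key)" encoding

const first = cache.getOrInit(slot, fakeInit);
const second = cache.getOrInit(slot, fakeInit); // reuses the cached promise

// A writer that failed while holding `first` evicts it, forcing a fresh PUT.
const evicted = cache.evictIfSame(slot, first);
const third = cache.getOrInit(slot, fakeInit);

// A late failure that still references `first` must not clear `third`.
const staleEvicted = cache.evictIfSame(slot, first);
```

Caching the promise rather than the resolved value means concurrent callers that arrive before the first PUT settles still share a single request.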

## chat.agent guardrail

`streams.pipe / writer / append / read` called inside a `chat.agent` run
now logs a one-time warning pointing at `chat.response.write` /
`chat.stream.*` — `streams.*` is run-scoped and isn't visible on the
chat session. The ai-chat docs are updated to drop the old guidance
toward run-scoped streams.
…riggerdotdev#3655)

## Summary

`TriggerChatTransport`, `AgentChat`, and `chat.createStartSessionAction`
now accept a string-or-function `baseURL` so callers can route per
endpoint — e.g. `.in/append` through a trusted edge proxy while keeping
`.out` SSE direct. The same surfaces add a `fetch` override for header
injection, custom retries, or proxy rewrites that go beyond URL routing.
SSE GETs are covered too via a new `fetchClient` option on
`SSEStreamSubscription`.

```ts
// TriggerChatTransport / AgentChat — endpoints: "in" | "out"
baseURL: ({ endpoint }) =>
  endpoint === "out" ? DIRECT : PROXY,

fetch: (url, init, ctx) => {
  init.headers = new Headers(init.headers);
  init.headers.set("traceparent", currentTraceparent());
  return globalThis.fetch(url, init);
},

// chat.createStartSessionAction — endpoints: "sessions" | "auth"
chat.createStartSessionAction("my-agent", {
  baseURL: ({ endpoint }) => (endpoint === "sessions" ? PROXY : DIRECT),
});
```

`streamBaseURL` on `TriggerChatTransport` is kept as a backwards-compat
alias and continues to win for the `"out"` endpoint when set.
Plain-string `baseURL` still applies to every endpoint, matching prior
behavior.
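
The precedence described above can be written out as a small resolver. This is a sketch only: `resolveBaseURL` and `RoutingOptions` are hypothetical names, and the URLs are placeholders. It encodes the stated order: `streamBaseURL` wins for `"out"` when set, then a function `baseURL` is asked per endpoint, and a plain string applies everywhere.

```typescript
// Hypothetical resolver illustrating the precedence; not the SDK's code.
type Endpoint = "in" | "out";
type BaseURL = string | ((ctx: { endpoint: Endpoint }) => string);

interface RoutingOptions {
  baseURL: BaseURL;
  streamBaseURL?: string;
}

function resolveBaseURL(options: RoutingOptions, endpoint: Endpoint): string {
  // Backwards-compat alias: takes priority for the SSE endpoint when set.
  if (endpoint === "out" && options.streamBaseURL !== undefined) {
    return options.streamBaseURL;
  }
  return typeof options.baseURL === "function"
    ? options.baseURL({ endpoint })
    : options.baseURL;
}

const DIRECT = "https://api.example.com"; // placeholder
const PROXY = "https://edge-proxy.example.com"; // placeholder

const perEndpoint: RoutingOptions = {
  baseURL: ({ endpoint }) => (endpoint === "out" ? DIRECT : PROXY),
};

const appendURL = resolveBaseURL(perEndpoint, "in");
const streamURL = resolveBaseURL(perEndpoint, "out");
```
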
Reject non-email strings at the magic link form instead of accepting any
string and proceeding through rate-limit / authenticator steps.
…riggerdotdev#3664)

## Summary

Companion to triggerdotdev#3536, which patched routes that already had a leaking
`catch (e) { return json({error: e.message}, 500) }`. That pattern can't
reach routes which have no catch in the first place — when those throw,
Remix's default error path serializes `error.message` into the response
body, and the SDK then wraps the leaked string as `TriggerApiError`.

Across 28 raw api.v1 loaders/actions plus one dashboard polling
endpoint, each handler now:

- Wraps its body in `try { ... } catch (error) { ... }`.
- Re-throws `Response` instances so auth helpers' `throw json(...)` /
`throw redirect(...)` pass through unchanged.
- Logs non-Response errors via `logger.error` so server-side visibility
is preserved.
- Returns a generic body — `{"error": "Internal Server Error"}` 500 for
raw API routes, or `{ changelogs: [] }` 200 for the polling widget
(degrade silently across transient blips; the consumer hook already
coped with empty payloads).
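
The four steps above can be sketched as a generic wrapper. To keep the example runnable without DOM typings, `HttpResponse` is a local stand-in for the Fetch API `Response` that the real handlers check with `instanceof`. `withSanitizedErrors` and the `logError` callback are hypothetical names; the actual routes inline the try/catch and call `logger.error`.

```typescript
// Local stand-in for the Fetch API Response (assumption for self-containment).
class HttpResponse {
  readonly status: number;
  readonly body: string;

  constructor(status: number, body: string) {
    this.status = status;
    this.body = body;
  }
}

type Handler<A> = (args: A) => Promise<HttpResponse>;

// Hypothetical wrapper: rethrows thrown responses, logs everything else, and
// returns a generic 500 body so error messages never reach the client.
function withSanitizedErrors<A>(
  handler: Handler<A>,
  logError: (message: string, error: unknown) => void,
): Handler<A> {
  return async (args) => {
    try {
      return await handler(args);
    } catch (error) {
      // Thrown responses (auth failures, redirects) are intentional control flow.
      if (error instanceof HttpResponse) throw error;
      logError("Unhandled route error", error);
      return new HttpResponse(500, JSON.stringify({ error: "Internal Server Error" }));
    }
  };
}

// Example: a handler whose failure message would otherwise leak.
const logged: unknown[] = [];
const leaky: Handler<void> = async () => {
  throw new Error("Can't reach database server at db.internal:5432");
};
const safe = withSanitizedErrors(leaky, (_message, error) => logged.push(error));
```

The polling widget variant would return its empty payload with a 200 instead of the generic 500 body.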

For six routes where triggerdotdev#3536 left an inner try/catch covering only a
service call (`alertChannels`, `batches.results`,
`deployments.finalize`, `deployments.background-workers`,
`deployments.promote`, `projects.background-workers`): an outer
try/catch is added so auth/parsing failures are also sanitized. Inner
typed-error handling (`ServiceValidationError` → 422 with message, etc.)
is preserved exactly.

For two routes whose existing catch returned 400 + `error.message`
(`api.v1.authorization-code`, `api.v1.orgs.$orgParam.projects` action):
the body is sanitized to a generic per-route string. **Status code stays
400** — clients that key on the 4xx/5xx distinction (and the SDK's
no-retry-on-4xx behavior) are unaffected.

## Test plan

- [x] `pnpm run typecheck --filter webapp`
- [x] Per-route synthetic-throw probe: inject `throw new
Error("SYNTHETIC ...")` at the top of each catch'd try, curl the route
with a dummy bearer, confirm the response body is the generic shape and
that the synthetic message lands server-side via `logger.error`. 29
routes verified.
- [x] Real-P1001 probe on the envvars loader: `docker stop database`
mid-flight, confirm response is generic 500 (not the leaked Prisma
message).
- [x] Sampled legitimate 4xx/2xx paths across each pattern variant
(naked-wrap, partial-expanded, 400-preserved) to confirm the wraps don't
interfere with normal control flow.
…gerdotdev#3665)

## Summary

The prerelease (snapshot) path of the release workflow fails immediately
whenever `main` carries an active `.changeset/pre.json` (i.e. during an
in-progress RC cycle, like the current v4 RC):

```
🦋 error Snapshot release is not allowed in pre mode
🦋 To resolve this exit the pre mode by running `changeset pre exit`
```

This blocks `chat-prerelease` snapshots from main even though the
snapshots are unrelated to the RC cycle.

Adds a conditional `changeset pre exit` step right before `Snapshot
version` in the prerelease job. The job runs on a checkout with
`persist-credentials: false`, so the `pre.json` deletion stays on the
runner's working tree — main's persisted pre-mode state is untouched,
and v4 RC publishes keep working normally.

## Test plan

- [ ] Re-run the `🦋 Changesets Release` workflow with `type=prerelease`,
`ref=main`, `prerelease_tag=chat-prerelease` and confirm it gets past
the snapshot step and publishes.
- [ ] Confirm `.changeset/pre.json` on `main` is unchanged after the
run.
…mic deployments (triggerdotdev#3666)

- Ask the user whether to remove `TRIGGER_VERSION` when they disable
atomic deployments, and explain what happens if they leave it in
place
- Install TRIGGER_SECRET keys as sensitive values in Vercel
<img width="1136" height="714" alt="image"
src="https://github.com/user-attachments/assets/a7351da1-5b2a-44e5-acdd-d30c9359f3ed"
/>
<img width="1136" height="714" alt="image"
src="https://github.com/user-attachments/assets/e773ede2-74cb-438e-811c-338f678d2f7d"
/>
<img width="1136" height="714" alt="image"
src="https://github.com/user-attachments/assets/c7b235a8-e06d-48d3-ac28-c5c9aacc6069"
/>
…dotdev#3668)

## Summary

The S2 access-token cache key was `${basin}:${streamPrefix}` — purely
server-derived but blind to the **scope/ops list** hardcoded one method
away. When the ops list changes in code (e.g. triggerdotdev#3644 added `trim` so
`chat.agent`'s per-turn trim chain can issue `AppendRecord.trim()`),
pre-deploy tokens still in cache get returned to SDK callers for up to
the token's TTL (24h default), surfacing as `Operation not permitted`
403s on any op outside the old scope.

## Fix

Lift the ops list to a module constant and fold its sorted-join
fingerprint into the cache key:

```ts
const S2_TOKEN_OPS = ["append", "create-stream", "trim"] as const;
const S2_TOKEN_OPS_FINGERPRINT = [...S2_TOKEN_OPS].sort().join(",");

// in getS2AccessToken
const cacheKey = `${this.basin}:${this.streamPrefix}:${S2_TOKEN_OPS_FINGERPRINT}`;

// in s2IssueAccessToken
scope: { /* ... */ ops: [...S2_TOKEN_OPS], /* ... */ }
```

The fingerprint is derived from the single source of truth, so any
future scope change auto-invalidates without anyone remembering to bump
a literal version. The Unkey L1 (in-memory LRU) and L2 (Redis) layers
share the same key derivation, so both reset together on the next deploy
with no manual cache busting.

## Test plan

- [ ] `pnpm run typecheck --filter webapp`
- [ ] Run a multi-turn `chat.agent` chat via `references/ai-chat` and
confirm no `chat.agent: trim failed; will retry next turn` warn span
fires across turn-completes.
id: release
uses: softprops/action-gh-release@v1
if: github.event_name == 'push'
uses: softprops/action-gh-release@b4309332981a82ec1c5618f44dd2e27cc8bfbfda # v3.0.0

@cubic-dev-ai cubic-dev-ai Bot left a comment

4 issues found across 1525 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name=".github/workflows/e2e-webapp.yml">

<violation number="1" location=".github/workflows/e2e-webapp.yml:67">
P2: Guard DockerHub login on both username and token so optional secrets don't cause a hard failure when only one is provided.</violation>
</file>

<file name=".github/workflows/publish-worker-v4.yml">

<violation number="1" location=".github/workflows/publish-worker-v4.yml:69">
P2: Semver releases no longer publish the additional `v4-beta` image tag, which regresses the previous tagging behavior.</violation>
</file>

<file name=".github/workflows/claude.yml">

<violation number="1" location=".github/workflows/claude.yml:22">
P1: This workflow now grants repository write permissions on `@claude` comment triggers without restricting who can invoke it, which creates an untrusted-to-write privilege escalation path.</violation>
</file>

<file name=".changeset/agent-skills.md">

<violation number="1" location=".changeset/agent-skills.md:1">
P2: Changeset describes 'Agent Skills for chat.agent' but the PR implements supervisor wide events and warm-start trace propagation. This changeset would create an inaccurate changelog entry and trigger patch bumps for packages that may not have corresponding code changes in this PR.</violation>
</file>

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
contents: read
pull-requests: read
issues: read
contents: write

P1: This workflow now grants repository write permissions on @claude comment triggers without restricting who can invoke it, which creates an untrusted-to-write privilege escalation path.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/claude.yml, line 22:

<comment>This workflow now grants repository write permissions on `@claude` comment triggers without restricting who can invoke it, which creates an untrusted-to-write privilege escalation path.</comment>

<file context>
@@ -19,24 +19,25 @@ jobs:
-      contents: read
-      pull-requests: read
-      issues: read
+      contents: write
+      pull-requests: write
+      issues: write
</file context>


# ..to avoid rate limits when pulling images
- name: 🐳 Login to DockerHub
if: ${{ env.DOCKERHUB_USERNAME }}

P2: Guard DockerHub login on both username and token so optional secrets don't cause a hard failure when only one is provided.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/e2e-webapp.yml, line 67:

<comment>Guard DockerHub login on both username and token so optional secrets don't cause a hard failure when only one is provided.</comment>

<file context>
@@ -0,0 +1,97 @@
+
+      # ..to avoid rate limits when pulling images
+      - name: 🐳 Login to DockerHub
+        if: ${{ env.DOCKERHUB_USERNAME }}
+        uses: docker/login-action@4907a6ddec9925e35a0a9e82d7399ccc52663121 # v4.1.0
+        with:
</file context>
Suggested change
if: ${{ env.DOCKERHUB_USERNAME }}
if: ${{ secrets.DOCKERHUB_USERNAME && secrets.DOCKERHUB_TOKEN }}

image_tags=$image_tags,$ref_without_tag:v4-beta
fi
ref_without_tag=ghcr.io/triggerdotdev/${STEPS_GET_REPOSITORY_OUTPUTS_REPO}
image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG}

P2: Semver releases no longer publish the additional v4-beta image tag, which regresses the previous tagging behavior.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/publish-worker-v4.yml, line 69:

<comment>Semver releases no longer publish the additional `v4-beta` image tag, which regresses the previous tagging behavior.</comment>

<file context>
@@ -62,26 +65,24 @@ jobs:
-            image_tags=$image_tags,$ref_without_tag:v4-beta
-          fi
+          ref_without_tag=ghcr.io/triggerdotdev/${STEPS_GET_REPOSITORY_OUTPUTS_REPO}
+          image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG}
 
           echo "image_tags=${image_tags}" >> "$GITHUB_OUTPUT"
</file context>
Suggested change
image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG}
image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG}
# if tag is a semver, also tag it as v4
if [[ "$STEPS_GET_TAG_OUTPUTS_IS_SEMVER" == true ]]; then
image_tags=$image_tags,$ref_without_tag:v4-beta
fi

@@ -0,0 +1,16 @@
---

P2: Changeset describes 'Agent Skills for chat.agent' but the PR implements supervisor wide events and warm-start trace propagation. This changeset would create an inaccurate changelog entry and trigger patch bumps for packages that may not have corresponding code changes in this PR.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .changeset/agent-skills.md:

<comment>Changeset describes 'Agent Skills for chat.agent' but the PR implements supervisor wide events and warm-start trace propagation. This changeset would create an inaccurate changelog entry and trigger patch bumps for packages that may not have corresponding code changes in this PR.</comment>

<file context>
@@ -0,0 +1,16 @@
+---
+"@trigger.dev/sdk": patch
+"@trigger.dev/core": patch
+"@trigger.dev/build": patch
+"trigger.dev": patch
+---
+
+Add Agent Skills for `chat.agent`. Drop a folder with a `SKILL.md` and any helper scripts/references next to your task code, register it with `skills.define({ id, path })`, and the CLI bundles it into the deploy image automatically — no `trigger.config.ts` changes. The agent gets a one-line summary in its system prompt and discovers full instructions on demand via `loadSkill`, with `bash` and `readFile` tools scoped per-skill (path-traversal guards, output caps, abort-signal propagation).
+
</file context>
