feat(supervisor): wide events + warm-start trace propagation #9
deepshekhardas wants to merge 347 commits into
Also includes a claude.md audit workflow for PRs
…tdev#3109)
- New User onboarding questions added and stored in a new `onboardingData` column
- Keeps the same Org creation screen and stores the data in the same format in the same DB column
- New Org onboarding questions added and stored in a new `onboardingData` column
https://github.com/user-attachments/assets/244e4bae-f74d-4ed4-a545-92c9b927e98b
---------
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
… version changelog draft PR (triggerdotdev#3162)
…#3146) Input streams enable sending typed data to executing tasks from external callers — backends, frontends, or other tasks. This unlocks interactive use cases like approval UIs, cancel buttons, chat interfaces, and human-in-the-loop AI workflows where the task needs to receive data while running. Three consumption patterns inside a task: * `.wait()` — Suspend the task until data arrives (process freed, most efficient) * `.once()` — Wait for the next message (process stays alive) * `.on()` — Subscribe to a continuous stream of messages One send pattern from outside: * `.send(runId, data)` — Send typed data to a specific run's input stream ## User-facing API ### Define a typed input stream ```ts import { streams, task } from "@trigger.dev/sdk"; const approval = streams.input<{ approved: boolean; reviewer: string }>({ id: "approval" }); ``` ### Consume inside a task ```ts export const myTask = task({ id: "my-task", run: async () => { // Pattern 1: Suspend until data arrives (most efficient — frees the process) const result = await approval.wait({ timeout: "5m" }); // Pattern 2: Wait for next message (process stays alive) const data = await approval.once().unwrap(); // Pattern 3: Subscribe to multiple messages approval.on((data) => { /* handle each message */ }); }, }); ``` ### Send from outside ```ts // From a backend (using secret API key) await approval.send(runId, { approved: true, reviewer: "alice" }); // From a frontend (using public JWT token from trigger response) const { send } = useInputStreamSend("approval", runId, { accessToken }); send({ approved: true, reviewer: "alice" }); ``` --------- Co-authored-by: Claude <noreply@anthropic.com>
The latest goose requires go version 1.25: https://github.com/pressly/goose/releases/tag/v3.27.0
…ggerdotdev#3095) Adds a "Create custom dashboard" button to the top right of the metrics dashboard <img width="3546" height="1934" alt="CleanShot 2026-02-19 at 11 25 12@2x" src="https://github.com/user-attachments/assets/0bb46ade-47c9-4396-b62a-f4801d7d90b4" />
…evel (triggerdotdev#3166) The global rate limiter was being applied at the FairQueue claim phase, consuming 1 token per queue-claim-attempt rather than per item processed. With many small queues (each batch is its own queue), consumers burned through tokens on empty or single-item queues, causing aggressive throttling well below the intended items/sec limit. Changes: - Move rate limiter from FairQueue claim phase to BatchQueue worker queue consumer loop (before blockingPop), so each token = 1 item processed - Replace the FairQueue rate limiter with a worker queue depth cap to prevent unbounded growth that could cause visibility timeouts - Add BATCH_QUEUE_WORKER_QUEUE_MAX_DEPTH env var (optional, disabled by default)
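The difference between charging a token per claim attempt and charging a token per item can be sketched with a toy token bucket. This is a simplified illustration, not the FairQueue or BatchQueue code: `TokenBucket`, `processChargingPerClaim`, and `processChargingPerItem` are hypothetical names, and refill over time is omitted.

```typescript
// Hypothetical illustration only; names and structure are assumptions.
class TokenBucket {
  private tokens: number;

  constructor(capacity: number) {
    this.tokens = capacity;
  }

  // Takes one token if available; returns whether the caller may proceed.
  tryTake(): boolean {
    if (this.tokens <= 0) return false;
    this.tokens -= 1;
    return true;
  }
}

// Old placement: a token is spent on every claim attempt, even when the
// claimed queue turns out to be empty.
function processChargingPerClaim(queues: string[][], bucket: TokenBucket): number {
  let processed = 0;
  for (const queue of queues) {
    if (!bucket.tryTake()) break;
    processed += queue.length;
  }
  return processed;
}

// New placement: a token is spent only when an item is about to be processed.
function processChargingPerItem(queues: string[][], bucket: TokenBucket): number {
  let processed = 0;
  for (const queue of queues) {
    for (let i = 0; i < queue.length; i++) {
      if (!bucket.tryTake()) return processed;
      processed += 1;
    }
  }
  return processed;
}

const queues: string[][] = [[], [], ["a"], [], ["b"]];
const perClaim = processChargingPerClaim(queues, new TokenBucket(3));
const perItem = processChargingPerItem(queues, new TokenBucket(3));
console.log({ perClaim, perItem });
```

With three tokens and mostly empty queues, the per-claim variant exhausts its budget after one item, while the per-item variant processes both items and still has a token left.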
# trigger.dev v4.4.2
## Summary
2 new features, 2 improvements, 8 bug fixes.
## Improvements
- Add input streams for bidirectional communication with running tasks.
Define typed input streams with `streams.input<T>({ id })`, then consume
inside tasks via `.wait()` (suspends the process), `.once()` (waits for
next message), or `.on()` (subscribes to a continuous stream). Send data
from backends with `.send(runId, data)` or from frontends with the new
`useInputStreamSend` React hook.
([triggerdotdev#3146](triggerdotdev#3146))
- Add a `PAYLOAD_TOO_LARGE` error so that batch trigger items whose
payloads exceed the maximum payload size fail gracefully instead of
aborting the batch
([triggerdotdev#3137](triggerdotdev#3137))
## Bug fixes
- Fix slow batch queue processing by removing spurious cooloff on
concurrency blocks and fixing a race condition where retry attempt
counts were not atomically updated during message re-queue.
([triggerdotdev#3079](triggerdotdev#3079))
- fix(sdk): batch triggerAndWait variants now return correct
run.taskIdentifier instead of unknown
([triggerdotdev#3080](triggerdotdev#3080))
## Server changes
These changes affect the self-hosted Docker image and Trigger.dev Cloud:
- Two-level tenant dispatch architecture for batch queue processing.
Replaces the single master queue with a two-level index: a dispatch index
(tenant → shard) and per-tenant queue indexes (tenant → queues). This
enables O(1) tenant selection and fair scheduling across tenants regardless
of queue count. Improves batch queue processing performance.
([triggerdotdev#3133](triggerdotdev#3133))
- Add input streams with API routes for sending data to running tasks,
SSE reading, and waitpoint creation. Includes Redis cache for fast
`.send()` to `.wait()` bridging, dashboard span support for input stream
operations, and s2-lite support with configurable S2 endpoint, access
token skipping, and S2-Basin headers for self-hosted deployments. Adds
s2-lite to Docker Compose for local development.
([triggerdotdev#3146](triggerdotdev#3146))
- Speed up batch queue processing by disabling cooloff and increasing
the batch queue processing concurrency limits on the cloud:
- Pro plan: increase to 50 from 10.
- Hobby plan: increase to 10 from 5.
- Free plan: increase to 5 from 1.
([triggerdotdev#3079](triggerdotdev#3079))
- Move batch queue global rate limiter from FairQueue claim phase to
BatchQueue worker queue consumer for accurate per-item rate limiting.
Add worker queue depth cap to prevent unbounded growth that could cause
visibility timeouts.
([triggerdotdev#3166](triggerdotdev#3166))
- Fix a race condition in the waitpoint system where a run could be
blocked by a completed waitpoint but never be resumed because of a
PostgreSQL MVCC issue. This was most likely to occur when creating a
waitpoint via `wait.forToken()` at the same moment as completing the
token with `wait.completeToken()`. Other types of waitpoints (timed,
child runs) were not affected.
([triggerdotdev#3075](triggerdotdev#3075))
- Fix metrics dashboard chart series colors going out of sync and
widgets not reloading stale data when scrolled back into view
([triggerdotdev#3126](triggerdotdev#3126))
- Gracefully handle oversized batch items instead of aborting the
stream.
When an NDJSON batch item exceeds the maximum size, the parser now emits
an error marker instead of throwing, allowing the batch to seal
normally. The oversized item becomes a pre-failed run with
`PAYLOAD_TOO_LARGE` error code, while other items in the batch process
successfully. This prevents `batchTriggerAndWait` from seeing connection
errors and retrying with exponential backoff.
Also fixes the NDJSON parser not consuming the remainder of an oversized
line split across multiple chunks, which caused "Invalid JSON" errors on
subsequent lines.
([triggerdotdev#3137](triggerdotdev#3137))
- Require the user is an admin during an impersonation session.
Previously only the impersonation cookie was checked; now the real
user's admin flag is verified on every request. If admin has been
revoked, the session falls back to the real user's ID.
([triggerdotdev#3078](triggerdotdev#3078))
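The oversized-item handling described in the batch items entry above can be sketched as a small chunked NDJSON parser. This is a minimal sketch under assumptions, not the server's parser: `NdjsonParser`, `ParsedLine`, and the character-based size limit are hypothetical. It shows the two behaviours the entry names: emitting an error marker instead of throwing, and discarding the remainder of an oversized line that spans several chunks.

```typescript
// Hypothetical sketch; the real parser and its limits are not shown in the notes.
type ParsedLine =
  | { ok: true; value: unknown }
  | { ok: false; error: "PAYLOAD_TOO_LARGE" };

class NdjsonParser {
  private readonly maxLineLength: number;
  private buffer = "";
  // True while the rest of an oversized line is being discarded.
  private skipping = false;

  constructor(maxLineLength: number) {
    this.maxLineLength = maxLineLength;
  }

  push(chunk: string): ParsedLine[] {
    const out: ParsedLine[] = [];
    let start = 0;
    while (start < chunk.length) {
      const nl = chunk.indexOf("\n", start);
      const end = nl === -1 ? chunk.length : nl;
      const piece = chunk.slice(start, end);
      start = nl === -1 ? chunk.length : nl + 1;

      if (!this.skipping) {
        this.buffer += piece;
        if (this.buffer.length > this.maxLineLength) {
          // Emit a marker instead of throwing so later lines still parse.
          out.push({ ok: false, error: "PAYLOAD_TOO_LARGE" });
          this.buffer = "";
          this.skipping = true;
        }
      }

      if (nl !== -1) {
        if (this.skipping) {
          // The newline ends the oversized line; resume normal parsing.
          this.skipping = false;
        } else if (this.buffer.length > 0) {
          out.push({ ok: true, value: JSON.parse(this.buffer) });
        }
        this.buffer = "";
      }
    }
    return out;
  }
}

const parser = new NdjsonParser(10);
const firstChunk = parser.push('{"a":1}\n{"big":"xxxx');
const secondChunk = parser.push('xxxxxxxx"}\n{"b":2}\n');
console.log(JSON.stringify(firstChunk), JSON.stringify(secondChunk));
```

The second chunk starts in the middle of the oversized line; because the parser keeps skipping until the newline, the following `{"b":2}` line parses cleanly rather than producing an invalid-JSON error.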
<details>
<summary>Raw changeset output</summary>
# Releases
## @trigger.dev/build@4.4.2
### Patch Changes
- Updated dependencies:
- `@trigger.dev/core@4.4.2`
## trigger.dev@4.4.2
### Patch Changes
- Updated dependencies:
- `@trigger.dev/build@4.4.2`
- `@trigger.dev/core@4.4.2`
- `@trigger.dev/schema-to-json@4.4.2`
## @trigger.dev/python@4.4.2
### Patch Changes
- Updated dependencies:
- `@trigger.dev/sdk@4.4.2`
- `@trigger.dev/build@4.4.2`
- `@trigger.dev/core@4.4.2`
## @trigger.dev/react-hooks@4.4.2
### Patch Changes
- Add input streams for bidirectional communication with running tasks.
Define typed input streams with `streams.input<T>({ id })`, then consume
inside tasks via `.wait()` (suspends the process), `.once()` (waits for
next message), or `.on()` (subscribes to a continuous stream). Send data
from backends with `.send(runId, data)` or from frontends with the new
`useInputStreamSend` React hook.
([triggerdotdev#3146](triggerdotdev#3146))
Upgrade S2 SDK from 0.17 to 0.22 with support for custom endpoints
(s2-lite) via the new `endpoints` configuration, `AppendRecord.string()`
API, and `maxInflightBytes` session option.
- Updated dependencies:
- `@trigger.dev/core@4.4.2`
## @trigger.dev/redis-worker@4.4.2
### Patch Changes
- Fix slow batch queue processing by removing spurious cooloff on
concurrency blocks and fixing a race condition where retry attempt
counts were not atomically updated during message re-queue.
([triggerdotdev#3079](triggerdotdev#3079))
- Updated dependencies:
- `@trigger.dev/core@4.4.2`
## @trigger.dev/rsc@4.4.2
### Patch Changes
- Updated dependencies:
- `@trigger.dev/core@4.4.2`
## @trigger.dev/schema-to-json@4.4.2
### Patch Changes
- Updated dependencies:
- `@trigger.dev/core@4.4.2`
## @trigger.dev/sdk@4.4.2
### Patch Changes
- Add input streams for bidirectional communication with running tasks.
Define typed input streams with `streams.input<T>({ id })`, then consume
inside tasks via `.wait()` (suspends the process), `.once()` (waits for
next message), or `.on()` (subscribes to a continuous stream). Send data
from backends with `.send(runId, data)` or from frontends with the new
`useInputStreamSend` React hook.
([triggerdotdev#3146](triggerdotdev#3146))
Upgrade S2 SDK from 0.17 to 0.22 with support for custom endpoints
(s2-lite) via the new `endpoints` configuration, `AppendRecord.string()`
API, and `maxInflightBytes` session option.
- fix(sdk): batch triggerAndWait variants now return correct
run.taskIdentifier instead of unknown
([triggerdotdev#3080](triggerdotdev#3080))
- Add PAYLOAD_TOO_LARGE error to handle graceful recovery of sending
batch trigger items with payloads that exceed the maximum payload size
([triggerdotdev#3137](triggerdotdev#3137))
- Updated dependencies:
- `@trigger.dev/core@4.4.2`
## @trigger.dev/core@4.4.2
</details>
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
A top-level Errors page that aggregates errors from failed runs with occurrence metrics. https://github.com/user-attachments/assets/8f0ef55e-90dd-4faa-9051-59f4665181e4 Errors are “fingerprinted” so similar errors are grouped together (e.g. errors whose messages differ only by an embedded ID). You can open an individual error to see a timeline of when it fired and its runs, and bulk replay those runs.
In theory this will make Log queries faster
…release PR description (triggerdotdev#3183)
…n payload schemas (triggerdotdev#3188) <img width="2191" height="1023" alt="CleanShot 2026-03-06 at 13 36 53" src="https://github.com/user-attachments/assets/4eba0d1a-1528-49a3-be5b-6bde89030193" /> <img width="411" height="1069" alt="CleanShot 2026-03-06 at 13 37 28" src="https://github.com/user-attachments/assets/e5f7bb9c-c894-41cc-9ca6-96b43fcf6005" /> Add a tabbed sidebar to the Test page for standard tasks, reusing the ClientTabs pattern from the Query page. - Options tab: existing sidebar content (machine, version, queue, etc.) - AI tab: AI-powered payload generation with streaming, supports JSON Schema, inferred schema from recent runs, and task source code lookup via tool calling for tasks without schemas - Schema tab: displays payload JSON Schema (from schemaTask), inferred schema (from recent runs via @jsonhero/schema-infer), or empty state with schemaTask docs and example code Data layer changes: - Surface payloadSchema and inferredPayloadSchema from TestTaskPresenter - Add payloadSchema and fileId to WorkerDeploymentWithWorkerTasks type - Decompress zlib-deflated source files for AI context
### Fixes and improvements to the onboarding questions: **This change is worth double checking @matt-aitken** - Update to the Button.tsx file: it now takes `isLoading` that shows a spinner in the middle of the button (replacing the button text and any icons) and sets it to `disabled`. It does this nicely by keeping the button width the same so there's no layout shift. **Other fixes** - Fixes an issue where if you type a custom option in the "What technologies do you use" question, it doesn't check the list to see if it matches. Now it checks the box if you've typed an option from that list. - When we randomize the list of onboarding question options, we now store the position they appeared in the list
…iggerdotdev#3191) When the dev CLI exits (e.g. ctrl+c via pnpm), runs that were mid-execution previously stayed stuck in EXECUTING status for up to 5 minutes until the heartbeat timeout fired. Now they are cancelled within seconds. The dev CLI spawns a lightweight detached watchdog process at startup. The watchdog monitors the CLI process ID and, when it detects the CLI has exited, calls a new POST /engine/v1/dev/disconnect endpoint to cancel all in-flight runs immediately (skipping PENDING_CANCEL since the worker is known to be dead). Watchdog design: - Fully detached (detached: true, stdio: ignore, unref()) so it survives even when pnpm sends SIGKILL to the process tree - Active run IDs maintained via atomic file write (.trigger/active-runs.json) - Single-instance guarantee via PID file (.trigger/watchdog.pid) - Safety timeout: exits after 24 hours to prevent zombie processes - On clean shutdown, the watchdog is killed (no disconnect needed) Disconnect endpoint: - Rate-limited: 5 calls/min per environment - Capped at 500 runs per call - Small counts (<= 25): cancelled inline with pMap concurrency 10 - Large counts: delegated to the bulk action system - Uses finalizeRun: true to skip PENDING_CANCEL and go straight to FINISHED Run engine change: - cancelRun() now respects finalizeRun when the run is in EXECUTING status, skipping the PENDING_CANCEL waiting state and going directly to FINISHED
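The sizing rules of the disconnect endpoint can be sketched as a small planning function. This is a hypothetical illustration: `planDisconnect` and `CancelPlan` are invented names, and the sketch assumes that run IDs beyond the 500 cap are dropped from the call rather than rejected, which the description does not state.

```typescript
// Hypothetical sketch of the sizing rules; not the endpoint's actual code.
const MAX_RUNS_PER_CALL = 500;
const INLINE_THRESHOLD = 25;
const INLINE_CONCURRENCY = 10;

type CancelPlan =
  | { mode: "inline"; runIds: string[]; concurrency: number }
  | { mode: "bulk"; runIds: string[] };

function planDisconnect(runIds: string[]): CancelPlan {
  // Assumption: excess run IDs are truncated, not treated as an error.
  const capped = runIds.slice(0, MAX_RUNS_PER_CALL);
  if (capped.length <= INLINE_THRESHOLD) {
    // Small counts are cancelled inline with bounded concurrency.
    return { mode: "inline", runIds: capped, concurrency: INLINE_CONCURRENCY };
  }
  // Larger counts are handed to a bulk mechanism.
  return { mode: "bulk", runIds: capped };
}

const makeIds = (n: number): string[] =>
  Array.from({ length: n }, (_, i) => `run_${i}`);

const small = planDisconnect(makeIds(25));
const medium = planDisconnect(makeIds(26));
const large = planDisconnect(makeIds(600));
console.log(small.mode, medium.mode, large.mode, large.runIds.length);
```

Keeping the decision in a pure function like this makes the thresholds easy to test separately from the cancellation side effects.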
## Summary
2 new features, 2 improvements.
## Improvements
- Add syncSupabaseEnvVars to pull database connection strings and save them as trigger.dev environment variables ([triggerdotdev#3152](triggerdotdev#3152))
- Auto-cancel in-flight dev runs when the CLI exits, using a detached watchdog process that survives pnpm SIGKILL ([triggerdotdev#3191](triggerdotdev#3191))
## Server changes
These changes affect the self-hosted Docker image and Trigger.dev Cloud:
- A new Errors page for viewing and tracking errors that cause runs to fail
  - Errors are grouped using error fingerprinting
  - View top errors for a time period, filter by task, or search the text
  - View occurrences over time
  - View all the runs for an error and bulk replay them
  ([triggerdotdev#3172](triggerdotdev#3172))
- Add sidebar tabs (Options, AI, Schema) to the Test page for schemaTask payload generation and schema viewing. ([triggerdotdev#3188](triggerdotdev#3188))
<details>
<summary>Raw changeset output</summary>
# Releases
## @trigger.dev/build@4.4.3
### Patch Changes
- Add syncSupabaseEnvVars to pull database connection strings and save them as trigger.dev environment variables ([triggerdotdev#3152](triggerdotdev#3152))
- Updated dependencies:
  - `@trigger.dev/core@4.4.3`
## trigger.dev@4.4.3
### Patch Changes
- Auto-cancel in-flight dev runs when the CLI exits, using a detached watchdog process that survives pnpm SIGKILL ([triggerdotdev#3191](triggerdotdev#3191))
- Updated dependencies:
  - `@trigger.dev/core@4.4.3`
  - `@trigger.dev/build@4.4.3`
  - `@trigger.dev/schema-to-json@4.4.3`
## @trigger.dev/core@4.4.3
### Patch Changes
- Auto-cancel in-flight dev runs when the CLI exits, using a detached watchdog process that survives pnpm SIGKILL ([triggerdotdev#3191](triggerdotdev#3191))
## @trigger.dev/python@4.4.3
### Patch Changes
- Updated dependencies:
  - `@trigger.dev/core@4.4.3`
  - `@trigger.dev/build@4.4.3`
  - `@trigger.dev/sdk@4.4.3`
## @trigger.dev/react-hooks@4.4.3
### Patch Changes
- Updated dependencies:
  - `@trigger.dev/core@4.4.3`
## @trigger.dev/redis-worker@4.4.3
### Patch Changes
- Updated dependencies:
  - `@trigger.dev/core@4.4.3`
## @trigger.dev/rsc@4.4.3
### Patch Changes
- Updated dependencies:
  - `@trigger.dev/core@4.4.3`
## @trigger.dev/schema-to-json@4.4.3
### Patch Changes
- Updated dependencies:
  - `@trigger.dev/core@4.4.3`
## @trigger.dev/sdk@4.4.3
### Patch Changes
- Updated dependencies:
  - `@trigger.dev/core@4.4.3`
</details>
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…riggerdotdev#3208) Deprecates the syncVercelEnvVars build extension and adds warnings in both the Vercel integration docs and the extension's own page to prevent conflicts with the native env var sync
…tdev#3201) ## Adds 2 self serve features ### 1. self serve preview branches - Copies the patterns of the self serve concurrency - Self serve only available on Pro plan (otherwise you are linked to the billing plans page) - Global self serve branches limit: 180 (+20 for the Pro plan). It can be overridden per Org - You need to archive branches before reducing the number of extra branches you're paying for - Branches are removed immediately but remain billed until the end of the billing cycle like extra concurrency ### 2. self serve team members - Copies the patterns of the self serve concurrency - Self serve only available on Pro plan (otherwise you are linked to the billing plans page) - Global self serve members is unlimited but can be limited with the same env var quota and overridden per org if needed - You need to remove team members before reducing the number of members you pay for - Team members are removed immediately but remain billed until the end of the billing cycle like extra concurrency
…Rs open in "ready to review" status (triggerdotdev#3218)
…hards (triggerdotdev#3219) Queues with concurrency keys now appear as a single entry in the master queue instead of one entry per key. This prevents high-CK-count tenants from consuming the entire `parentQueueLimit` window and starving other tenants on the same shard. A new per-queue **CK index** (sorted set) tracks active concurrency key sub-queues. The master queue gets one `:ck:*` wildcard entry per base queue. Dequeuing from that entry round-robins across sub-queues, maintaining per-CK concurrency tracking and fairness. All existing operations (enqueue, dequeue, ack, nack, DLQ, TTL expiry) are CK-index-aware and keep the index consistent. Old-format entries drain naturally during rollout — no migration step needed, single deploy.
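The round-robin behaviour across concurrency-key sub-queues can be sketched in memory. This is a simplified illustration, not the Redis implementation: `CkQueue` is a hypothetical name, an insertion-ordered `Map` stands in for the sorted-set CK index, and per-key concurrency tracking is left out.

```typescript
// Hypothetical in-memory sketch; the real index is a Redis sorted set.
class CkQueue {
  // One entry per active concurrency key; Map order is the rotation order.
  private subQueues = new Map<string, string[]>();

  enqueue(concurrencyKey: string, message: string): void {
    const queue = this.subQueues.get(concurrencyKey) ?? [];
    queue.push(message);
    this.subQueues.set(concurrencyKey, queue);
  }

  // Takes one message from the sub-queue at the front of the rotation,
  // then moves that sub-queue to the back if it still has messages.
  dequeue(): string | undefined {
    const next = this.subQueues.entries().next();
    if (next.done) return undefined;
    const [key, queue] = next.value;
    const message = queue.shift();
    this.subQueues.delete(key);
    if (queue.length > 0) this.subQueues.set(key, queue);
    return message;
  }
}

const ckQueue = new CkQueue();
ckQueue.enqueue("tenant-a", "a1");
ckQueue.enqueue("tenant-a", "a2");
ckQueue.enqueue("tenant-b", "b1");
ckQueue.enqueue("tenant-c", "c1");

const order: string[] = [];
for (let msg = ckQueue.dequeue(); msg !== undefined; msg = ckQueue.dequeue()) {
  order.push(msg);
}
console.log(order.join(","));
```

Even though `tenant-a` enqueued two messages first, its second message is served only after every other key has had a turn, which is the fairness property the single wildcard entry is meant to preserve.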
…erdotdev#3627) ## Summary A "Google auth conflict" Sentry alert fires whenever a user signs in via Google whose Google account is linked to one user row but whose Google-provided email is now on a *different* user row. The handler in `apps/webapp/app/models/user.server.ts:236` already does the right thing — it returns the existing auth-linked user and skips the update path so neither row gets mutated — but it logs the situation with `logger.error`, which routes to Sentry as an exception and pages the on-call channel. There's no exception to chase here: the branch is the intended outcome for a known data shape (user changed their email on one account after originally signing up via Google on another). Downgrading the call to `logger.warn` keeps the diagnostic record in our logs (with all the same context fields — email, both user IDs, authIdentifier) but stops it firing the production error alert. ## Change - `logger.error` → `logger.warn` for the conflict branch in `findOrCreateGoogleUser`. Context payload is unchanged. ## Test plan - [x] Typecheck only — there's no behavioural change to test, the log level is the entire diff.
…riggerdotdev#3625) ## Summary The trigger-task hotpath used to early-return without a DB query when a caller passed both a queue override and a per-trigger TTL — the hottest configuration on the trigger API. Adding `triggerSource` to the resolver so the runs-list "Source" filter could distinguish STANDARD / SCHEDULED / AGENT runs removed those early-returns, costing +2 DB queries per trigger on non-locked calls and +1 on locked calls. This change caches `BackgroundWorkerTask` metadata (`ttl`, `triggerSource`, `queueId`, `queueName`) in Redis so the resolver can satisfy every caller configuration with a single `HGET` on the warm path. PG fallback on miss back-fills the cache. Follow-up to triggerdotdev#3542. ## Design Two key spaces: - `task-meta:env:{envId}` — the "current worker" view, refreshed at every deploy promotion. 24h safety TTL. - `task-meta:by-worker:{workerId}` — used for `lockToVersion` triggers. Immutable post-create. 30d sliding TTL so historical workers age out. Cache writes use Lua scripts via `defineCommand` so `DEL` + `HSET` + `EXPIRE` land atomically — concurrent readers never see the empty intermediate state of a naive pipeline. Read-path back-fill uses single-field upserts so concurrent back-fills don't wipe each other's siblings. The cache lives behind its own `TASK_META_CACHE_REDIS_*` env-var prefix that falls back to the default `REDIS_*` set, so operators can route the cache to a dedicated Redis instance if they want. The service/instance file split (`taskMetadataCache.server.ts` for the pure class, `taskMetadataCacheInstance.server.ts` for the env-wired singleton) mirrors the existing `runsReplicationService` / `runsReplicationInstance` pattern. 
## Test plan
- [ ] `pnpm run typecheck --filter webapp`
- [ ] `pnpm run test ./test/engine/triggerTask.test.ts --run`: 8 existing tests untouched + 5 new tests covering warm cache, cold miss with back-fill, queue + ttl path, by-worker vs env keyspace, and the promotion cache write
- [ ] End-to-end against a dev worker: registering writes both keyspaces with the expected TTLs, and `redis-cli HGETALL "tr:task-meta:env:<envId>"` returns the cached entries
## Benchmark
Measured `DefaultQueueManager.resolveQueueProperties` against a real Postgres + Redis (vitest `containerTest`, single-host docker). 500 sequential calls and 2,000 parallel calls (concurrency=50) per scenario, request shaped as `{ taskId, queue: "bench-queue", ttl: "5m" }`, the hot path this PR restores.
```
sequential (one in flight at a time):
[noop cache (baseline)] n=500 mean=1.423ms p50=1.394ms p95=1.735ms p99=2.629ms max=11.100ms
[redis cache, cold ] n=500 mean=1.346ms p50=1.283ms p95=1.688ms p99=2.463ms max=5.058ms
[redis cache, warm ] n=500 mean=0.084ms p50=0.078ms p95=0.105ms p99=0.156ms max=1.129ms
speedup (warm vs baseline, sequential): 16.95x
parallel (concurrency=50):
[noop cache (baseline)] n=2000 mean=10.069ms p50=8.850ms p95=14.718ms p99=31.887ms total=405ms ops/s=4,940
[redis cache, warm ] n=2000 mean=0.614ms p50=0.568ms p95=1.189ms p99=1.432ms total=25ms ops/s=80,389
throughput speedup (warm vs baseline, parallel): 16.27x
```
Read:
- **Warm cache cuts resolver latency 17×** at p50, from ~1.4 ms to ~78 µs per call.
- **Cold cache is on par with baseline**: the extra `HGET` miss adds <50 µs against the two Postgres queries that follow, so the worst case is not worse than today.
- **Under burst load (50 concurrent triggers)**, the baseline's p99 jumps to ~32 ms as Postgres connections queue up; warm stays at ~1.4 ms. The cache moves the saturation point from ~5k ops/s (PG pool) to ~80k ops/s (single-client Redis pipelining).
Caveats: single-host docker, local Postgres + Redis, resolver-only measurement (excludes the rest of the trigger transaction). Prod adds region-local Redis RTT (~0.3–0.8 ms) which shifts warm absolute numbers up but keeps the ratio intact.
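The warm-path read with database fallback and back-fill can be sketched as a read-through cache. This is a minimal sketch under assumptions: `InMemoryTaskMetaCache` and `resolveTaskMeta` are hypothetical names, a nested `Map` stands in for the Redis hash, `loadFromDb` stands in for the Postgres lookup, and everything is synchronous for brevity where the real calls would be asynchronous.

```typescript
// Hypothetical sketch; not the taskMetadataCache implementation.
type TaskMeta = {
  ttl?: string;
  triggerSource: string;
  queueId: string;
  queueName: string;
};

class InMemoryTaskMetaCache {
  private hashes = new Map<string, Map<string, TaskMeta>>();

  get(envId: string, taskId: string): TaskMeta | undefined {
    return this.hashes.get(`task-meta:env:${envId}`)?.get(taskId);
  }

  // Single-field upsert: only this task's entry is written, so entries for
  // sibling tasks in the same hash are left untouched.
  upsert(envId: string, taskId: string, meta: TaskMeta): void {
    const key = `task-meta:env:${envId}`;
    const hash = this.hashes.get(key) ?? new Map<string, TaskMeta>();
    hash.set(taskId, meta);
    this.hashes.set(key, hash);
  }
}

function resolveTaskMeta(
  cache: InMemoryTaskMetaCache,
  envId: string,
  taskId: string,
  loadFromDb: (taskId: string) => TaskMeta | undefined
): TaskMeta | undefined {
  const cached = cache.get(envId, taskId);
  if (cached) return cached; // warm path: one hash lookup, no DB query
  const fromDb = loadFromDb(taskId); // cold path: fall back to the database
  if (fromDb) cache.upsert(envId, taskId, fromDb); // back-fill for next time
  return fromDb;
}

let dbCalls = 0;
const loadFromDb = (taskId: string): TaskMeta | undefined => {
  dbCalls += 1;
  return taskId === "my-task"
    ? { ttl: "5m", triggerSource: "STANDARD", queueId: "q_1", queueName: "bench-queue" }
    : undefined;
};

const metaCache = new InMemoryTaskMetaCache();
const coldResult = resolveTaskMeta(metaCache, "env_1", "my-task", loadFromDb);
const warmResult = resolveTaskMeta(metaCache, "env_1", "my-task", loadFromDb);
console.log(coldResult?.queueName, warmResult?.queueName, dbCalls);
```

The second call is served from the cache, so the loader runs only once; that single avoided round trip is what the warm rows in the benchmark measure.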
…CP (triggerdotdev#3612) ## Summary Adds a Region column and Region filter (under More filters) to the runs list dashboard, the same filter on the public runs list API (`filter[region]`), and a matching `region` input on the MCP `list_runs` tool. Each run's executing region is also surfaced as a new optional `region` field on the runs list and run retrieve responses, populated from the worker instance group's `masterQueue` identifier. Useful when you run tasks across multiple regions and want to slice the runs list — or your existing run-querying scripts — by where the run actually executed. ## Design The filter value in the URL / API is the `masterQueue` identifier (the same string already persisted on `TaskRun` and replicated to ClickHouse as `worker_queue`), so the query just becomes `worker_queue IN (...)` with no server-side translation. The Region dropdown options come from a new resource loader backed by `RegionsPresenter`, which now also exposes `masterQueue` alongside the existing region metadata. ```ts // public API const runs = await runs.list({ region: ["us-east-1", "eu-west-1"] }); // each item: { id, status, ..., region?: "us-east-1" } ``` ```ts // MCP list_runs({ environment: "prod", region: "us-east-1" }) ```
…dotdev#3628) ## Summary Enables shipping `X.Y.Z-rc.N` prereleases of `@trigger.dev/*` via changesets pre mode. RCs publish under the `rc` npm dist-tag, never claim `latest`, and don't trigger marketing-site changelog PRs. The plumbing is hyphen-in-version detection in `release.yml` — no separate workflow, no opt-in flag at publish time. Validated end-to-end against a sandbox repo (real npm publishes, Docker builds, Helm chart pushes, GitHub releases) before porting back. Full RC lifecycle tested: pre enter → rc.0 → iterate to rc.1 → pre exit → stable. Plus interaction with the existing release-branch hotfix flow. ## What changes ### `release.yml` - New `is_prerelease` output (hyphen-in-version) - GitHub release adds `--prerelease` flag for RC publishes (Pre-release badge, not Latest) - `dispatch-changelog` job gated on `is_prerelease != 'true'` — no marketing-site PR per RC ### Docker workflows - Removes the `:v4-beta` floating tag entirely from `publish-webapp.yml` and `publish-worker-v4.yml`. v4 is GA; the tag is a misnomer and is already inconsistent with the npm side (npm `v4-beta` dist-tag was frozen at 4.0.4 months ago while Docker `:v4-beta` kept bumping). Self-hosters should pin to a versioned tag going forward — the last value of `:v4-beta` stays frozen wherever it currently points. ### CLI version-check fix (`packages/cli-v3/src/utilities/initialBanner.ts`) Switches the "new version available" comparison from JavaScript `localeCompare` to `semver.lt`. The old comparison handled `X.Y.Z-rc.N` vs `X.Y.Z` incorrectly — a user on `4.5.0-rc.0` would never be prompted to upgrade once `4.5.0` stable shipped (lex order put the prerelease ahead of the bare version). Real semver gets this right. Stable users were never affected: the check queries the `@latest` dist-tag, which by convention never points at a prerelease. ## How an RC actually publishes after this 1. `pnpm exec changeset pre enter rc` on main, push the `pre.json` 2. 
Bot regenerates the release PR as `chore: release v<X.Y.Z>-rc.0` 3. Merge → `release.yml` runs `changeset publish` which reads `pre.json.tag` and publishes under `--tag rc`. GitHub release marked Pre-release. No marketing-site dispatch. 4. Iterate by adding changesets normally; bot bumps to `rc.1`, `rc.2`, … 5. When ready: `pnpm exec changeset pre exit`, push, merge regenerated PR → stable ships under `latest` and the marketing-site dispatch fires.
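The hyphen-in-version check and the reason lexical comparison mishandles release candidates can be sketched as follows. This is a simplified stand-in, not the `semver` package or the workflow's code: `isPrerelease` and `lessThan` are hypothetical helpers that handle only `X.Y.Z` and `X.Y.Z-pre.N` shapes, following the semver precedence rule that a prerelease sorts before its bare release.

```typescript
// Hypothetical helpers; the CLI itself uses `semver.lt` per the description.
function isPrerelease(version: string): boolean {
  // Any hyphen marks a prerelease such as 4.5.0-rc.0.
  return version.includes("-");
}

function parseVersion(version: string): { core: number[]; pre: string[] } {
  const dash = version.indexOf("-");
  const core = dash === -1 ? version : version.slice(0, dash);
  const pre = dash === -1 ? "" : version.slice(dash + 1);
  return {
    core: core.split(".").map(Number),
    pre: pre === "" ? [] : pre.split("."),
  };
}

// Returns true when `a` has lower precedence than `b`.
function lessThan(a: string, b: string): boolean {
  const va = parseVersion(a);
  const vb = parseVersion(b);
  for (let i = 0; i < 3; i++) {
    const x = va.core[i] ?? 0;
    const y = vb.core[i] ?? 0;
    if (x !== y) return x < y;
  }
  // Same core version: a prerelease sorts before the bare release.
  if (va.pre.length === 0 || vb.pre.length === 0) {
    return va.pre.length > 0 && vb.pre.length === 0;
  }
  const length = Math.max(va.pre.length, vb.pre.length);
  for (let i = 0; i < length; i++) {
    const x = va.pre[i];
    const y = vb.pre[i];
    if (x === undefined) return true; // fewer identifiers sort first
    if (y === undefined) return false;
    if (x === y) continue;
    const xNumeric = /^\d+$/.test(x);
    const yNumeric = /^\d+$/.test(y);
    if (xNumeric && yNumeric) return Number(x) < Number(y);
    if (xNumeric !== yNumeric) return xNumeric; // numeric sorts before text
    return x < y;
  }
  return false;
}

console.log(isPrerelease("4.5.0-rc.0"), lessThan("4.5.0-rc.0", "4.5.0"));
```

A plain string comparison treats `4.5.0-rc.0` as coming after `4.5.0` because the shorter string is a prefix of the longer one, which is why a user on the release candidate was never prompted to upgrade.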
…#3629) ## Summary Refocuses the v4.5.0 changeset and server-changes content on the public-facing AI features story, replacing the pre-release-internal diff framing that had accumulated in `.changeset/` and `.server-changes/`. Pairs with the RC support PR — the next bot regeneration will pick up this content. ## What's in here ### Changeset rewrites - **`chat-agent.md` rewritten as the headline AI Agents entry** — written from the `docs/ai-chat/` surface (not from internal pre-release diffs). Covers useChat integration, multi-turn durability via Sessions, lifecycle hooks, stop generation, tool approvals (HITL), pending messages + background injection, actions, typed state primitives, `chat.toStreamTextOptions()`, multi-tab coordination, network resilience, and the first-turn fast path (`chat.headStart`). - **New `ai-prompts.md`** — announces the Prompts feature publicly for the first time. Code-defined templates, deploy-versioning, dashboard overrides, AI SDK telemetry integration, `chat.agent` integration via `chat.prompt.set()` + `chat.toStreamTextOptions()`, full management SDK. - **`sessions-primitive.md` expanded** — calls out `tasks.triggerAndSubscribe()` and `sessions.list` as standalone primitives (not just chat.agent infrastructure). - **`chat-agent-on-boot-hook.md` trimmed** — drops "if you previously…" pre-release migration framing. - **Deletes 4 changesets** that described pre-release-internal migrations or were circular ("groundwork for the upcoming chat.agent" — chat.agent ships in the same release). 
### Server-changes rewrites (`.server-changes/`) Five new entries for the dashboard surface of the AI feature set: - Agents list page - Agent Playground - Sessions dashboard - Prompts dashboard (list with usage sparklines + detail with template / Generations / Metrics / Versions tabs + override UI) - Models registry (provider-grouped catalog with cross-tenant usage metrics) - AI generation span inspector on run traces - Runs list Task source filter (Standard / Scheduled / Agent) - Run-detail Agent view (segmented control) Each entry is 1–2 sentences, no bullets, no implementation file paths — fits as a single bullet in a future changelog. Three older `.server-changes/` files were merged or split into the cleaner taxonomy above and deleted. ## Out of scope Non-AI-feature server-changes (admin-tabs, queue-length-cap fix, worker-deployment race, streamdown upgrade, etc.) and changesets (idempotency-key cap, sigsegv retry, locals-key fix, plugin auth, region filters, etc.) are untouched.
…3630) ## Summary Adds `.changeset/pre.json` to put the repo into changesets pre mode with dist-tag `rc`. After this merges, the changesets bot regenerates the existing release PR as `chore: release v4.5.0-rc.0`. Merging that PR publishes the first release candidate of 4.5.0 to npm under `@rc`. The pre-mode plumbing landed in triggerdotdev#3628. The release content (chat.agent + sessions + ai prompts + dashboard server-changes) landed in triggerdotdev#3629. ## What ships when the bot PR merges Under dist-tag `rc`: - `@trigger.dev/{sdk,core,build,react-hooks,redis-worker,plugins,python,rsc,schema-to-json}@4.5.0-rc.0` - `trigger.dev@4.5.0-rc.0` Plus: - Docker image `ghcr.io/triggerdotdev/trigger.dev:v4.5.0-rc.0` (immutable tag only — `:v4-beta` is not touched) - Helm chart `oci://ghcr.io/triggerdotdev/charts/trigger.dev:4.5.0-rc.0` - GitHub release `v4.5.0-rc.0` marked as Pre-release (no Latest badge) What does NOT happen: - npm `latest` stays at 4.4.6 - No marketing-site changelog PR (gated on `is_prerelease != 'true'`) - Docker `:latest` not touched (we never push it anyway in this repo) ## Iteration For subsequent rc.N: add a regular changeset to main, bot regenerates the release PR as `v4.5.0-rc.N`. Merge to ship. ## Exiting pre mode When ready to ship stable: `pnpm exec changeset pre exit`, push, merge regenerated PR. That publishes `4.5.0` under `latest` and fires the marketing-site dispatch.
…v#3631) ## Summary Renumber `029_add_task_kind_to_task_runs_v2.sql` → `031_add_task_kind_to_task_runs_v2.sql` to fix a deploy-blocking out-of-order migration, and make the DDL idempotent with `ADD COLUMN IF NOT EXISTS` / `DROP COLUMN IF EXISTS`. ## Root cause - Migration `030_create_sessions_v1.sql` landed on main on 2026-04-28 (PR triggerdotdev#3417) and was applied to test cloud ClickHouse on a subsequent deploy. Current goose version on test ClickHouse: **30**. - Migration `029_add_task_kind_to_task_runs_v2.sql` was authored later on 2026-05-10 as part of the Sessions primitive PR series (`be1a6cf8`). - The next test cloud deploy failed because goose strict-mode refused to apply a missing version *before* the current version: ``` goose run: error: found 1 missing migrations before current version 30: version 29: 029_add_task_kind_to_task_runs_v2.sql ``` ## Fix 1. **Rename to `031_*`** (next available number after 030). Goose now treats it as a new migration after 030 and applies it cleanly on test/prod where the column does not yet exist. 2. **Make the DDL idempotent** (`ADD COLUMN IF NOT EXISTS`). The original 029 may have been applied in environments that ran goose with `--allow-missing` (e.g. some local dev databases) — those would have the column already, and the rename causes goose to see 031 as new and re-attempt the ADD. Idempotent DDL keeps that path safe. The `Down` mirrors with `DROP COLUMN IF EXISTS`. ## Test plan - [ ] Test cloud deploy (after this lands) successfully runs the ClickHouse migration step - [ ] `task_kind` column shows up on `trigger_dev.task_runs_v2` post-migration - [ ] Local environments that had previously applied 029 do not error on the next `goose up`
…dotdev#3633) ## Summary Codify two rules for ClickHouse migration authors that came out of the 029/030 ordering incident on the TRI-9367 test cloud deploy: 1. **Number files to `max(existing) + 1`, never slot in below the latest.** Goose runs in strict mode in the cloud deploy pipeline and refuses to apply a missing version below the current version — slotting a file in below an already-applied number blocks the next deploy. 2. **DDL must be idempotent** (`ADD COLUMN IF NOT EXISTS`, `DROP COLUMN IF EXISTS`, `CREATE TABLE IF NOT EXISTS`, etc.) so a retry or out-of-order apply (`goose up --allow-missing` for local recovery, manual fixups) is a no-op rather than an error. ## Where the rules live - `internal-packages/clickhouse/CLAUDE.md` — full rules + example for migration authors (and AI agents writing migrations). - `.claude/REVIEW.md` — added a 🔴 finding under "What makes a 🔴 Important finding" so PR reviewers flag either fault as blocking. The existing migration files are left untouched; the idempotency requirement applies going forward. ## Test plan - [ ] Next ClickHouse migration PR uses `IF NOT EXISTS` / `IF EXISTS` forms - [ ] No new migration files numbered below an already-applied version on test/prod
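A minimal sketch of the first rule (number new files to `max(existing) + 1`), assuming migration files follow the `<zero-padded number>_<slug>.sql` naming seen in 029/030/031. The helper `nextMigrationName` and the `028_example.sql` entry are hypothetical and exist only for illustration; they are not part of the repo.

```typescript
// Hypothetical helper: picks the next migration file name so a new file
// never slots in below an already-applied version.
function nextMigrationName(existing: string[], slug: string): string {
  const numbers = existing
    .map((fileName) => /^(\d+)_/.exec(fileName))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => Number(m[1]));
  // max(existing) + 1; an empty directory starts at 1.
  const next = (numbers.length > 0 ? Math.max(...numbers) : 0) + 1;
  return `${String(next).padStart(3, "0")}_${slug}.sql`;
}

const nextName = nextMigrationName(
  ["028_example.sql", "030_create_sessions_v1.sql"], // 028 is a made-up entry
  "add_task_kind_to_task_runs_v2"
);
console.log(nextName); // 031_add_task_kind_to_task_runs_v2.sql
```

The gap at 029 is deliberately ignored: the helper only looks at the highest number, which is the behaviour the rule asks for.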
`dorny/paths-filter` defaults to OR semantics across the pattern array, so the leading `**` matched every file and the `!...` excludes were no-ops. The `code` filter has been returning `true` for every PR since triggerdotdev#3615. Split into two filter steps: `code` moves into its own step with `predicate-quantifier: every` so excludes actually subtract. The two re-include workflow files become a separate `typecheck_self` filter that the `typecheck` job ORs into its `if:`. Side effect: workflow-file-only PRs that don't touch `pr_checks.yml` or `typecheck.yml` no longer trigger typecheck. Previously they did because the filter was broken-true.
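The difference between the two quantifiers can be sketched as follows. This is an illustrative model only, not the action's implementation: patterns are represented as plain predicates, and the `docs/` exclude is a hypothetical stand-in for the real `!...` entries.

```typescript
// A pattern is modeled as a predicate; a "!x" pattern becomes "does not match x".
type Pattern = (file: string) => boolean;

const anyFile: Pattern = () => true; // stands in for the leading "**"
const notDocs: Pattern = (file) => !file.startsWith("docs/"); // hypothetical exclude

const codePatterns: Pattern[] = [anyFile, notDocs];

// OR semantics: a file counts as soon as ANY pattern matches.
const matchesSome = (file: string): boolean => codePatterns.some((p) => p(file));
// predicate-quantifier: every -> a file counts only if ALL patterns match.
const matchesEvery = (file: string): boolean => codePatterns.every((p) => p(file));

console.log(matchesSome("docs/readme.md")); // true: "**" alone is enough
console.log(matchesEvery("docs/readme.md")); // false: the exclude now subtracts
console.log(matchesEvery("apps/webapp/app.ts")); // true
```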
…dotdev#3641) `TriggerChatTransport` had a single `baseURL` option covering both the `.in/append` POSTs and the long-lived `.out` SSE subscription. Customers wanting to route the SSE through a proxy (e.g. a Cloudflare worker capturing JA4 fingerprints for bot detection) had to send every append through the proxy too, adding a hop to every user message. New optional `streamBaseURL` overrides the SSE base URL only; appends keep using `baseURL`. Falls back to `baseURL` when unset, so existing transports are unchanged. ```ts const transport = new TriggerChatTransport({ task: "ai-chat", baseURL: "https://api.trigger.dev", streamBaseURL: "https://chat-proxy.example.com", accessToken, startSession, }); ``` Verified with a new test in `chat.test.ts` that asserts `.in/append` routes through `baseURL` and `.out` SSE routes through `streamBaseURL`. All existing tests still pass.
… self-hosted builds (triggerdotdev#3618) ## Summary Local self-hosted deploys (`trigger deploy --local-build --push --builder orbstack` or any other buildx setup using the **docker** driver) fail at the push step with: ``` ERROR: failed to build: failed to solve: exporter option "rewrite-timestamp" conflicts with "unpack" ``` The docker driver auto-enables `unpack=true` when pushing, and that's incompatible with `rewrite-timestamp` (which the CLI sets for reproducible-build hashing). Adds a simple env-var opt-out so contributors can keep using their default builder. The flag is only read by the local-build code path; remote/cloud builds are unaffected. ```bash TRIGGER_BUILD_SKIP_REWRITE_TIMESTAMP=1 \ pnpm exec trigger deploy --profile default --local-build --push --builder orbstack ``` The trade-off: skipping `rewrite-timestamp` means layer timestamps reflect actual build time, so two identical builds produce different layer hashes. Fine for a local-dev registry; the only real consumer of timestamp-stability is registry-layer cache hit rates. ## Test plan - [x] Manual: ran `trigger deploy --profile default --local-build --push --builder orbstack` against the localhost webapp + a local Docker registry on port 5001 — first failed with the rewrite-timestamp/unpack error, then succeeded after setting `TRIGGER_BUILD_SKIP_REWRITE_TIMESTAMP=1`. - [x] Full chat.agent smoke sweep (15 tests, including suspend/resume, deepResearch subtask, AgentChat orchestrator) against the deployed image — all pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
…riggerdotdev#3632) ## Summary - Adds a `beforeSend` rule in `apps/webapp/sentry.server.ts` that collapses Prisma `P1001` ("Can't reach database server") errors into a single Sentry issue regardless of which call site threw, by setting `event.fingerprint = ["prisma-p1001-db-unreachable"]` and tagging `db_unreachable:true`. - Matches both `err.code === "P1001"` (Prisma's `KnownRequestError` when a connection drops mid-query) and `err.errorCode === "P1001"` (`InitializationError` when the client fails to connect at startup). - Implemented as a small extensible `FINGERPRINT_RULES` table so further fan-out errors can be added with one entry. ## Verification End-to-end verified locally with `debug: true` on the SDK: - Real Prisma `P1001` thrown from a loader (DB stopped mid-request) is captured by Sentry's Remix auto-instrumentation - `beforeSend` fires with `originalException.code === "P1001"`, rule matches - `event.fingerprint = ["prisma-p1001-db-unreachable"]` and `tags.db_unreachable = "true"` applied - Event lands in Sentry under the new fingerprint ## Test plan - [ ] Deploy to staging; confirm P1001 events appear under a single `prisma-p1001-db-unreachable` issue rather than fanning out - [ ] Confirm `db_unreachable:true` tag is filterable in Sentry - [ ] Verify non-P1001 errors are unaffected (event passes through `beforeSend` untouched) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
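A rough sketch of how such a rule table might look, assuming simplified event and hint shapes. `SketchEvent`, `SketchHint`, `hasCode`, and `applyFingerprintRules` are hypothetical names introduced here; the real Sentry types and the actual code in `apps/webapp/sentry.server.ts` may differ.

```typescript
// Hypothetical minimal shapes; the real Sentry event and hint types are richer.
type SketchEvent = { fingerprint?: string[]; tags?: Record<string, string> };
type SketchHint = { originalException?: unknown };

type FingerprintRule = {
  matches: (err: unknown) => boolean;
  fingerprint: string[];
  tags: Record<string, string>;
};

// Matches both `err.code` (request errors) and `err.errorCode` (init errors).
function hasCode(err: unknown, code: string): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { code?: unknown; errorCode?: unknown };
  return e.code === code || e.errorCode === code;
}

// Adding another fan-out error is one more entry in this table.
const FINGERPRINT_RULES: FingerprintRule[] = [
  {
    matches: (err) => hasCode(err, "P1001"),
    fingerprint: ["prisma-p1001-db-unreachable"],
    tags: { db_unreachable: "true" },
  },
];

// Intended to be called from a beforeSend-style hook; first matching rule wins.
function applyFingerprintRules(sentryEvent: SketchEvent, hint: SketchHint): SketchEvent {
  for (const rule of FINGERPRINT_RULES) {
    if (rule.matches(hint.originalException)) {
      sentryEvent.fingerprint = rule.fingerprint;
      sentryEvent.tags = { ...sentryEvent.tags, ...rule.tags };
      break;
    }
  }
  return sentryEvent; // non-matching events pass through untouched
}
```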
…gerdotdev#3653) ## Summary The PUT handler at `/realtime/v1/streams/:runId/:target/:streamId` ran `taskRun.update({ realtimeStreams: { push: streamId } })` on every call, even when the `streamId` was already present. SDK call patterns that re-initialize the same stream key on every chunk produce a per-write row UPDATE, duplicate entries pile up in the array, and the row-lock + TOAST rewrite cost grows unbounded on long-running stream sessions. ## Fix Mirror the sibling append handler: read the array first and only push when the `streamId` isn't already present. Identical behavior for first-time stream creation; repeat creates short-circuit to a single indexed read. The dashboard's per-run stream listing keeps working because the first create still records the entry. ## Test plan - [ ] A fresh PUT for a new `(run, streamId)` adds the entry to the array - [ ] A repeat PUT for the same pair leaves the array unchanged - [ ] 404 is returned when the run doesn't exist; 400 when the run is completed
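The read-then-push guard described above can be sketched in isolation. This is an in-memory illustration under stated assumptions: `RunRow` and `registerStream` are hypothetical, and the real handler works against the database row rather than a plain object.

```typescript
// Hypothetical in-memory stand-in for the run row.
type RunRow = { realtimeStreams: string[] };

// Returns true only when a write was actually needed.
function registerStream(run: RunRow, streamId: string): boolean {
  // Repeat create: the read is enough, so skip the UPDATE entirely.
  if (run.realtimeStreams.includes(streamId)) return false;
  run.realtimeStreams.push(streamId);
  return true;
}

const run: RunRow = { realtimeStreams: [] };
console.log(registerStream(run, "chat")); // true: first create records the entry
console.log(registerStream(run, "chat")); // false: repeat create is a no-op
console.log(run.realtimeStreams.length); // 1
```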
…erdotdev#3644) ## Summary Long-running chat agents were filling `session.out` forever — every `chat.agent` turn appended to the same S2 stream with no trim, and the Sessions dashboard re-streamed the entire history from `seq_num=0` on every page load. After this change the agent appends an S2 `trim` command record after each `trigger:turn-complete`, pointing back at the previous turn-complete's seq_num. `session.out` stays roughly one turn long at steady state, regardless of session age. `trigger:turn-complete` and `trigger:upgrade-required` move from `chunk.type`-shaped data records into header-form control records under a uniform `trigger-control` namespace. Built-in transports (`TriggerChatTransport`, `AgentChat`, the dashboard's `AgentView`) handle the new shape transparently. Custom transports need a one-line filter on the `trigger-control` header — see the rewritten "Records on session.out" section in the client-protocol docs. The Sessions detail page in the dashboard fetches the agent's per-turn S3 snapshot via a presigned URL and seeds the transcript view, then SSE-tails from the snapshot's `lastOutEventId`. Bandwidth and time-to-first-render scale with unread turns instead of session lifetime. Resume contract is now explicit: single-turn-boundary resume always works (the prior turn-complete is still on the stream), the S2 trim is eventually consistent over 10-60s, and multi-turn-away resume falls back to a snapshot reload.
…iggerdotdev#3642) ## Summary Two papercuts new contributors hit running this repo locally: 1. Fresh clones default to v1 (Redis-only) realtime streams, so Sessions and `chat.agent` error with `"S2 configuration is missing"`, even though the `s2` service is already in `docker/docker-compose.yml` and pre-seeds a `trigger-local` basin. Wire `REALTIME_STREAMS_S2_*` to it in `.env.example` so the new-contributor flow just works. (Also drop the s2 healthcheck: the image is distroless, so the `wget` check always reports unhealthy.) 2. Two clones can't both run `pnpm run docker` because ports, project name, and container names are all hardcoded. Parameterize every host port as `${VAR:-default}`, drive the project name via `COMPOSE_PROJECT_NAME` (with a top-level `name:` field as the default), prefix container names with `${CONTAINER_PREFIX:-}`, and pass `--env-file .env` so compose reads the same root `.env` the webapp does. The "Running multiple instances side by side" block in `.env.example` lists every overridable knob. Also split the optional services (`electric-shard-1`, `ch-ui`, `toxiproxy`, `nginx-h2`, `otel-collector`, `prometheus`, `grafana`) into `docker-compose.extras.yml` behind a new `pnpm run docker:full` script. The core stack keeps everything the webapp actually needs to boot: postgres, redis, electric, minio, clickhouse + migrator, s2-lite. Defaults match every previous hardcoded value, so existing setups keep working without touching `.env`. ## Test plan - [x] `pnpm run docker` on a clean clone brings up the core services on the standard ports under the `triggerdotdev-docker` project name. - [x] Setting `COMPOSE_PROJECT_NAME=triggerdotdev-docker-alt` + the `*_HOST_PORT` overrides in `.env` brings up a second stack alongside the default one with no port or container-name clashes. - [x] Webapp boots cleanly against the default `.env.example` values; `/healthcheck` returns 200, no S2 errors. 
- [x] s2-lite basin `trigger-local` accepts an append + read via the same REST endpoints the webapp uses. - [x] `pnpm run docker:full` brings up the optional services alongside the core ones in the same project.
…gerdotdev#3614) ## Summary - Introduce the Mollifier: a Redis-backed buffer for `trigger()` API calls during traffic spikes, with a per-env trip evaluator and a drainer ack-loop. - Phase 1 is dual-write monitoring — every mollified trigger is buffered to Redis AND continues to `engine.trigger`. No customer-facing behaviour change. - Telemetry events: `mollifier.would_mollify`, `mollifier.buffered`, `mollifier.drained`, plus the `mollifier.decisions` counter. - Gated behind a feature flag (default off). ## Test plan - [x] `pnpm run test --filter @trigger.dev/redis-worker` - [x] `pnpm run test --filter webapp -- mollifier` - [x] Manual: with flag off, no behaviour change vs main - [x] Manual: with flag on + threshold lowered, observe `mollifier.buffered` + `mollifier.drained` log pairs with matching `runId` --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h on writer failure (triggerdotdev#3658) ## Summary Hot-loop writers — `streams.writer` / `streams.pipe` on the run-scoped side, `chat.response.write` / `chat.stream.*` on the session side — were issuing a fresh `PUT` to mint S2 credentials for every chunk. On run streams, each PUT also pushed the streamId onto `TaskRun.realtimeStreams`, so a chat-agent turn writing N chunks produced N PUTs and N duplicate array pushes against the same row. The SDK now caches the initialize response per cache slot: `(runId, key)` for run streams, the session id for session streams. First call PUTs as before; subsequent calls reuse the cached promise. Hot-loop writers do one PUT per slot for the lifetime of the cache. S2 access tokens have a 1-day TTL. If a writer's `wait()` rejects (auth error, expired token, network blip), the cache evicts the matching slot so the next call re-PUTs and mints fresh credentials, identity-checked so a concurrent caller's fresh promise isn't accidentally cleared. ## chat.agent guardrail `streams.pipe / writer / append / read` called inside a `chat.agent` run now logs a one-time warning pointing at `chat.response.write` / `chat.stream.*` — `streams.*` is run-scoped and isn't visible on the chat session. The ai-chat docs are updated to drop the old guidance toward run-scoped streams.
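The caching and identity-checked eviction described above might look roughly like this. It is a sketch, not the SDK's code: `InitCache`, `getOrCreate`, and `evictIfCurrent` are hypothetical names, and the `run_1:chat` slot key is made up.

```typescript
// Caches the initialize promise per slot so hot-loop writers share one PUT.
class InitCache<T> {
  private readonly slots = new Map<string, Promise<T>>();

  getOrCreate(key: string, init: () => Promise<T>): Promise<T> {
    const existing = this.slots.get(key);
    if (existing) return existing;
    const created = init();
    this.slots.set(key, created);
    return created;
  }

  // Evict only if the slot still holds the promise that failed, so a
  // concurrent caller's fresh promise is not cleared by a stale failure.
  evictIfCurrent(key: string, failed: Promise<T>): void {
    if (this.slots.get(key) === failed) this.slots.delete(key);
  }
}

let puts = 0;
const initCache = new InitCache<string>();
const init = (): Promise<string> => {
  puts += 1; // each call stands in for one PUT
  return Promise.resolve(`creds-${puts}`);
};

const first = initCache.getOrCreate("run_1:chat", init);
const second = initCache.getOrCreate("run_1:chat", init);
console.log(first === second, puts); // true 1

initCache.evictIfCurrent("run_1:chat", first); // writer failed: drop the slot
const third = initCache.getOrCreate("run_1:chat", init);
initCache.evictIfCurrent("run_1:chat", first); // stale eviction: no effect
console.log(initCache.getOrCreate("run_1:chat", init) === third, puts); // true 2
```

Caching the promise rather than the resolved value means concurrent first callers also share a single in-flight request.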
…riggerdotdev#3655) ## Summary `TriggerChatTransport`, `AgentChat`, and `chat.createStartSessionAction` now accept a string-or-function `baseURL` so callers can route per endpoint — e.g. `.in/append` through a trusted edge proxy while keeping `.out` SSE direct. The same surfaces add a `fetch` override for header injection, custom retries, or proxy rewrites that go beyond URL routing. SSE GETs are covered too via a new `fetchClient` option on `SSEStreamSubscription`. ```ts // TriggerChatTransport / AgentChat — endpoints: "in" | "out" baseURL: ({ endpoint }) => endpoint === "out" ? DIRECT : PROXY, fetch: (url, init, ctx) => { init.headers = new Headers(init.headers); init.headers.set("traceparent", currentTraceparent()); return globalThis.fetch(url, init); }, // chat.createStartSessionAction — endpoints: "sessions" | "auth" chat.createStartSessionAction("my-agent", { baseURL: ({ endpoint }) => (endpoint === "sessions" ? PROXY : DIRECT), }); ``` `streamBaseURL` on `TriggerChatTransport` is kept as a backwards-compat alias and continues to win for the `"out"` endpoint when set. Plain-string `baseURL` still applies to every endpoint, matching prior behavior.
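The precedence rules in the last paragraph can be summarized in a small resolver. This is a sketch under the assumptions stated in the description; `resolveBaseURL` is a hypothetical helper and the example hosts are placeholders.

```typescript
type ChatEndpoint = "in" | "out";
type BaseURLOption = string | ((ctx: { endpoint: ChatEndpoint }) => string);

// Hypothetical resolver mirroring the described precedence:
// 1. streamBaseURL wins for "out" when set (backwards-compat alias),
// 2. a function baseURL is asked per endpoint,
// 3. a plain string applies to every endpoint.
function resolveBaseURL(
  endpoint: ChatEndpoint,
  baseURL: BaseURLOption,
  streamBaseURL?: string
): string {
  if (endpoint === "out" && streamBaseURL) return streamBaseURL;
  return typeof baseURL === "function" ? baseURL({ endpoint }) : baseURL;
}

const DIRECT = "https://api.example.com"; // placeholder host
const PROXY = "https://proxy.example.com"; // placeholder host
const perEndpoint: BaseURLOption = ({ endpoint }) => (endpoint === "out" ? DIRECT : PROXY);

console.log(resolveBaseURL("in", perEndpoint)); // https://proxy.example.com
console.log(resolveBaseURL("out", perEndpoint)); // https://api.example.com
console.log(resolveBaseURL("out", perEndpoint, "https://sse.example.com")); // https://sse.example.com
```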
…ion (triggerdotdev#3659) Before fix: <img width="1264" height="987" alt="image" src="https://github.com/user-attachments/assets/24b8b85c-b89f-4109-9004-8d6af61d2849" /> After fix: <img width="1264" height="987" alt="image" src="https://github.com/user-attachments/assets/89bbc587-c50a-45ab-b203-dbe91028e918" />
Reject non-email strings at the magic link form instead of accepting any string and proceeding through rate-limit / authenticator steps.
…riggerdotdev#3664) ## Summary Companion to triggerdotdev#3536, which patched routes that already had a leaking `catch (e) { return json({error: e.message}, 500) }`. That pattern can't reach routes which have no catch in the first place — when those throw, Remix's default error path serializes `error.message` into the response body, and the SDK then wraps the leaked string as `TriggerApiError`. Across 28 raw api.v1 loaders/actions plus one dashboard polling endpoint, each handler now: - Wraps its body in `try { ... } catch (error) { ... }`. - Re-throws `Response` instances so auth helpers' `throw json(...)` / `throw redirect(...)` pass through unchanged. - Logs non-Response errors via `logger.error` so server-side visibility is preserved. - Returns a generic body — `{"error": "Internal Server Error"}` 500 for raw API routes, or `{ changelogs: [] }` 200 for the polling widget (degrade silently across transient blips; the consumer hook already coped with empty payloads). For six routes where triggerdotdev#3536 left an inner try/catch covering only a service call (`alertChannels`, `batches.results`, `deployments.finalize`, `deployments.background-workers`, `deployments.promote`, `projects.background-workers`): an outer try/catch is added so auth/parsing failures are also sanitized. Inner typed-error handling (`ServiceValidationError` → 422 with message, etc.) is preserved exactly. For two routes whose existing catch returned 400 + `error.message` (`api.v1.authorization-code`, `api.v1.orgs.\$orgParam.projects` action): the body is sanitized to a generic per-route string. **Status code stays 400** — clients that key on the 4xx/5xx distinction (and the SDK's no-retry-on-4xx behavior) are unaffected. 
## Test plan - [x] `pnpm run typecheck --filter webapp` - [x] Per-route synthetic-throw probe: inject `throw new Error("SYNTHETIC ...")` at the top of each catch'd try, curl the route with a dummy bearer, confirm the response body is the generic shape and that the synthetic message lands server-side via `logger.error`. 29 routes verified. - [x] Real-P1001 probe on the envvars loader: `docker stop database` mid-flight, confirm response is generic 500 (not the leaked Prisma message). - [x] Sampled legitimate 4xx/2xx paths across each pattern variant (naked-wrap, partial-expanded, 400-preserved) to confirm the wraps don't interfere with normal control flow.
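The per-handler pattern (wrap, re-throw thrown responses, log, return a generic body) can be sketched as a wrapper. This is an illustration only: `SketchResponse` stands in for the platform `Response`, `withSanitizedErrors` is a hypothetical helper, and the handler is kept synchronous for brevity even though real loaders are async.

```typescript
// Stand-in for the platform Response so the sketch has no runtime dependencies.
class SketchResponse {
  readonly body: string;
  readonly status: number;
  constructor(body: string, status: number) {
    this.body = body;
    this.status = status;
  }
}

function withSanitizedErrors(
  handler: () => SketchResponse,
  logError: (message: string, error: unknown) => void
): () => SketchResponse {
  return () => {
    try {
      return handler();
    } catch (error) {
      // Auth helpers throw responses on purpose; pass them through unchanged.
      if (error instanceof SketchResponse) throw error;
      // Keep server-side visibility, but never echo error.message to the client.
      logError("Unhandled route error", error);
      return new SketchResponse(JSON.stringify({ error: "Internal Server Error" }), 500);
    }
  };
}

const logLines: string[] = [];
const safeLoader = withSanitizedErrors(
  () => {
    throw new Error("SYNTHETIC leaked detail");
  },
  (message) => {
    logLines.push(message);
  }
);
const sanitized = safeLoader();
console.log(sanitized.status, sanitized.body); // 500 {"error":"Internal Server Error"}
```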
…gerdotdev#3665) ## Summary The prerelease (snapshot) path of the release workflow fails immediately whenever `main` carries an active `.changeset/pre.json` (i.e. during an in-progress RC cycle, like the current v4 RC): ``` 🦋 error Snapshot release is not allowed in pre mode 🦋 To resolve this exit the pre mode by running `changeset pre exit` ``` This blocks `chat-prerelease` snapshots from main even though the snapshots are unrelated to the RC cycle. Adds a conditional `changeset pre exit` step right before `Snapshot version` in the prerelease job. The job runs on a checkout with `persist-credentials: false`, so the `pre.json` deletion stays on the runner's working tree — main's persisted pre-mode state is untouched, and v4 RC publishes keep working normally. ## Test plan - [ ] Re-run the `🦋 Changesets Release` workflow with `type=prerelease`, `ref=main`, `prerelease_tag=chat-prerelease` and confirm it gets past the snapshot step and publishes. - [ ] Confirm `.changeset/pre.json` on `main` is unchanged after the run.
…mic deployments (triggerdotdev#3666) - Ask the user whether to remove TRIGGER_VERSION when they disable atomic deployments, and explain what happens if they leave it in place - Install TRIGGER_SECRET keys as sensitive values in Vercel <img width="1136" height="714" alt="image" src="https://github.com/user-attachments/assets/a7351da1-5b2a-44e5-acdd-d30c9359f3ed" /> <img width="1136" height="714" alt="image" src="https://github.com/user-attachments/assets/e773ede2-74cb-438e-811c-338f678d2f7d" /> <img width="1136" height="714" alt="image" src="https://github.com/user-attachments/assets/c7b235a8-e06d-48d3-ac28-c5c9aacc6069" />
…dotdev#3668) ## Summary The S2 access-token cache key was `${basin}:${streamPrefix}` — purely server-derived but blind to the **scope/ops list** hardcoded one method away. When the ops list changes in code (e.g. triggerdotdev#3644 added `trim` so `chat.agent`'s per-turn trim chain can issue `AppendRecord.trim()`), pre-deploy tokens still in cache get returned to SDK callers for up to the token's TTL (24h default), surfacing as `Operation not permitted` 403s on any op outside the old scope. ## Fix Lift the ops list to a module constant and fold its sorted-join fingerprint into the cache key: ```ts const S2_TOKEN_OPS = ["append", "create-stream", "trim"] as const; const S2_TOKEN_OPS_FINGERPRINT = [...S2_TOKEN_OPS].sort().join(","); // in getS2AccessToken const cacheKey = `${this.basin}:${this.streamPrefix}:${S2_TOKEN_OPS_FINGERPRINT}`; // in s2IssueAccessToken scope: { /* ... */ ops: [...S2_TOKEN_OPS], /* ... */ } ``` The fingerprint is derived from the single source of truth, so any future scope change auto-invalidates without anyone remembering to bump a literal version. The Unkey L1 (in-memory LRU) and L2 (Redis) layers share the same key derivation, so both reset together on the next deploy with no manual cache busting. ## Test plan - [ ] `pnpm run typecheck --filter webapp` - [ ] Run a multi-turn `chat.agent` chat via `references/ai-chat` and confirm no `chat.agent: trim failed; will retry next turn` warn span fires across turn-completes.
| id: release | ||
| uses: softprops/action-gh-release@v1 | ||
| if: github.event_name == 'push' | ||
| uses: softprops/action-gh-release@b4309332981a82ec1c5618f44dd2e27cc8bfbfda # v3.0.0 |
4 issues found across 1525 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".github/workflows/e2e-webapp.yml">
<violation number="1" location=".github/workflows/e2e-webapp.yml:67">
P2: Guard DockerHub login on both username and token so optional secrets don't cause a hard failure when only one is provided.</violation>
</file>
<file name=".github/workflows/publish-worker-v4.yml">
<violation number="1" location=".github/workflows/publish-worker-v4.yml:69">
P2: Semver releases no longer publish the additional `v4-beta` image tag, which regresses the previous tagging behavior.</violation>
</file>
<file name=".github/workflows/claude.yml">
<violation number="1" location=".github/workflows/claude.yml:22">
P1: This workflow now grants repository write permissions on `@claude` comment triggers without restricting who can invoke it, which creates an untrusted-to-write privilege escalation path.</violation>
</file>
<file name=".changeset/agent-skills.md">
<violation number="1" location=".changeset/agent-skills.md:1">
P2: Changeset describes 'Agent Skills for chat.agent' but the PR implements supervisor wide events and warm-start trace propagation. This changeset would create an inaccurate changelog entry and trigger patch bumps for packages that may not have corresponding code changes in this PR.</violation>
</file>
Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
| contents: read | ||
| pull-requests: read | ||
| issues: read | ||
| contents: write |
P1: This workflow now grants repository write permissions on @claude comment triggers without restricting who can invoke it, which creates an untrusted-to-write privilege escalation path.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/claude.yml, line 22:
<comment>This workflow now grants repository write permissions on `@claude` comment triggers without restricting who can invoke it, which creates an untrusted-to-write privilege escalation path.</comment>
<file context>
@@ -19,24 +19,25 @@ jobs:
- contents: read
- pull-requests: read
- issues: read
+ contents: write
+ pull-requests: write
+ issues: write
</file context>
| # ..to avoid rate limits when pulling images | ||
| - name: 🐳 Login to DockerHub | ||
| if: ${{ env.DOCKERHUB_USERNAME }} |
P2: Guard DockerHub login on both username and token so optional secrets don't cause a hard failure when only one is provided.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/e2e-webapp.yml, line 67:
<comment>Guard DockerHub login on both username and token so optional secrets don't cause a hard failure when only one is provided.</comment>
<file context>
@@ -0,0 +1,97 @@
+
+ # ..to avoid rate limits when pulling images
+ - name: 🐳 Login to DockerHub
+ if: ${{ env.DOCKERHUB_USERNAME }}
+ uses: docker/login-action@4907a6ddec9925e35a0a9e82d7399ccc52663121 # v4.1.0
+ with:
</file context>
| if: ${{ env.DOCKERHUB_USERNAME }} | |
| if: ${{ secrets.DOCKERHUB_USERNAME && secrets.DOCKERHUB_TOKEN }} |
| image_tags=$image_tags,$ref_without_tag:v4-beta | ||
| fi | ||
| ref_without_tag=ghcr.io/triggerdotdev/${STEPS_GET_REPOSITORY_OUTPUTS_REPO} | ||
| image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG} |
P2: Semver releases no longer publish the additional v4-beta image tag, which regresses the previous tagging behavior.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/publish-worker-v4.yml, line 69:
<comment>Semver releases no longer publish the additional `v4-beta` image tag, which regresses the previous tagging behavior.</comment>
<file context>
@@ -62,26 +65,24 @@ jobs:
- image_tags=$image_tags,$ref_without_tag:v4-beta
- fi
+ ref_without_tag=ghcr.io/triggerdotdev/${STEPS_GET_REPOSITORY_OUTPUTS_REPO}
+ image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG}
echo "image_tags=${image_tags}" >> "$GITHUB_OUTPUT"
</file context>
| image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG} | |
| image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG} | |
| # if tag is a semver, also tag it as v4 | |
| if [[ "$STEPS_GET_TAG_OUTPUTS_IS_SEMVER" == true ]]; then | |
| image_tags=$image_tags,$ref_without_tag:v4-beta | |
| fi |
| @@ -0,0 +1,16 @@ | |||
| --- | |||
P2: Changeset describes 'Agent Skills for chat.agent' but the PR implements supervisor wide events and warm-start trace propagation. This changeset would create an inaccurate changelog entry and trigger patch bumps for packages that may not have corresponding code changes in this PR.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .changeset/agent-skills.md:
<comment>Changeset describes 'Agent Skills for chat.agent' but the PR implements supervisor wide events and warm-start trace propagation. This changeset would create an inaccurate changelog entry and trigger patch bumps for packages that may not have corresponding code changes in this PR.</comment>
<file context>
@@ -0,0 +1,16 @@
+---
+"@trigger.dev/sdk": patch
+"@trigger.dev/core": patch
+"@trigger.dev/build": patch
+"trigger.dev": patch
+---
+
+Add Agent Skills for `chat.agent`. Drop a folder with a `SKILL.md` and any helper scripts/references next to your task code, register it with `skills.define({ id, path })`, and the CLI bundles it into the deploy image automatically — no `trigger.config.ts` changes. The agent gets a one-line summary in its system prompt and discovers full instructions on demand via `loadSkill`, with `bash` and `readFile` tools scoped per-skill (path-traversal guards, output caps, abort-signal propagation).
+
</file context>
Adds a wide events system to the supervisor for better event propagation across worker restarts.
Changes
Files changed (17)
Test
Closes triggerdotdev#3669
Summary by cubic
Adds wide-event instrumentation to `supervisor` and propagates `traceparent` on warm-start runs so traces stay connected across worker restarts. Improves visibility of workload routes and run socket lifecycle; feature is disabled by default.
- `TRIGGER_WIDE_EVENTS_ENABLED` (off by default).
Written for commit 671b137. Summary will update on new commits.