Brief → make media. Correct it → taste stays. Save it → run it again.
op0 is a creative studio where the agent generates images, scores, Remotion music videos, and fresh generative video clips from a brief. When you tell it what worked, it records the rule with your words attached. When you save a run, the sequence becomes a reusable rig. The studio changes itself on the record; nothing is a hidden prompt edit.
ctx is the substrate underneath: a Managed Agents harness where every durable change is a typed tool call with a cited reason. The digest agent is the first proof — a Discord summarizer coached from 86.0% to 93.6% relevance, including an audited retire-then-upsert moment. op0 shows the same pattern becoming a creative product.
Two execution paths, both auditable:
- Live MA runs — interactive; every event mirrored to Postgres so the studio timeline is durable and a refresh replays it from the DB.
- WDK rigs — saved or proposed flows executed via Vercel Workflow DevKit; durable, resumable mid-stream, and replay-safe because only capabilities marked safe for repeat execution are admitted.
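The replay-safety gate can be sketched as a simple filter over the capability registry. This is an illustrative sketch, not the repo's code — the `Capability` shape and `replaySafe` flag are assumptions standing in for whatever marker the real registry uses:

```typescript
// Hypothetical sketch of rig admission: only capabilities marked safe for
// repeat execution make it into a saved WDK rig. Names are illustrative.
interface Capability {
  name: string;
  replaySafe: boolean; // true only if re-running the step has no unwanted side effects
}

const registry: Capability[] = [
  { name: "compose_character", replaySafe: true },
  { name: "compose_score", replaySafe: true },
  { name: "mail.send", replaySafe: false }, // sending mail twice is not replay-safe
];

// A saved rig keeps only the steps WDK can safely re-execute on resume.
function admitForRig(steps: string[]): string[] {
  return steps.filter((name) =>
    registry.some((c) => c.name === name && c.replaySafe),
  );
}

console.log(admitForRig(["compose_character", "mail.send", "compose_score"]));
// → ["compose_character", "compose_score"]
```

Under this policy a resumed rig can never repeat a side-effecting step like `mail.send`, which is why mail stays live-agent-only.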
See docs/THESIS.md for the long form and docs/ARCHITECTURE.md for the layered diagram.
```bash
# 1. install + provision once
bun install
cp apps/op0/.env.example apps/op0/.env.local  # fill REPLICATE/ELEVENLABS/ANTHROPIC keys, DATABASE_URL, BLOB_READ_WRITE_TOKEN, BETTER_AUTH_SECRET
bun --filter op0 setup:agent                  # creates the op0-studio MA agent + environment

# 2. run
bun --filter op0 dev                          # http://localhost:3000

# 3. record this flow
# /studio?audit=open
# brief: "a 1970s editorial portrait of a daffodil. for future runs,
#         prefer warm film grain and sparse arrangements."
# -> instructions(upsert) records the taste rule with the user's words
# -> compose_character creates the image
# -> compose_score creates the score
# -> compose.video renders a Remotion MP4 and mirrors it to Blob
# -> (optional) compose_gen_video for a fresh generative video clip
# -> audit drawer shows the tool trail and reasons
# brief: "save this as the editorial rig"
# -> save_rig derives a reusable WDK workflow from the audited tool calls
# brief: "set up a rig for monochrome character studies"
# -> propose_rig designs a new reusable flow, behind confirmation
```

Both consumers write to the same `instructions(upsert/retire, payload, reason)` shape. The audit trail is uniform; the live agent loop is uniform; deterministic rigs reuse the same capability registry.
This repo ships two consumers on one substrate:
- `@ctx/harness` — MA-native primitives (`/ma`: pause/resume tool loop, session helpers, `memory_stores` upsert) + capability registry (`/registry`: Zod-validated dispatch with a built-in audit envelope) + bundled providers (`/audio`, `/image`). Lifted into a package so both apps share one implementation.
- `apps/ctx` — coachable digest-maintainer agent. Coached precision +13.7 pts (87.6% vs 73.9%), 100% critical recall, one retire moment, n=2 replicate aggregate. The substrate's existing evidence.
- `apps/op0` (op0.dev) — the creative multimedia studio. Two execution paths: the live MA agent loop (interactive; every event mirrored to Postgres `audit_events` so a refresh replays the timeline from the DB), and Vercel WDK rigs (deterministic; refresh-resumable mid-stream via the durable WDK `runId`). Saved rigs at `/rigs`; durable history at `/runs`.
- `apps/op0` — the creative studio (Next 16 · WDK · shadcn · json-render · Drizzle · Blob · Remotion)
- `apps/ctx` — the digest-agent proof + harness CLIs + replicate evals
- `apps/wav0-score` — score-CLI smoke tests for the audio capability layer
- `apps/web` — usectx.com landing
- `packages/harness` — MA primitives (`ma/`) · capability registry (`registry/`) · provider runtimes
Both apps speak the same audit envelope: the studio's tool dispatches and the digest agent's instructions(upsert/retire, payload, reason) calls land as MA custom-tool events with the user's own words attached.
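A minimal sketch of what one such envelope might look like. The field names here mirror the `(action, payload, reason)` shape described above but are otherwise illustrative — the real types live in the harness package:

```typescript
// Illustrative sketch of the shared audit envelope — approximates the
// instructions(upsert/retire, payload, reason) shape, not the exact types.
type InstructionAction = "upsert" | "retire";

interface AuditEnvelope {
  action: InstructionAction;
  payload: Record<string, unknown>;
  reason: string; // required: the user's own words, cited
}

function makeEnvelope(
  action: InstructionAction,
  payload: Record<string, unknown>,
  reason: string,
): AuditEnvelope {
  // The invariant both consumers share: no mutation without a cited reason.
  if (reason.trim().length === 0) {
    throw new Error("empty reason rejected — every mutation must cite why");
  }
  return { action, payload, reason };
}

const ev = makeEnvelope(
  "upsert",
  { rule: "prefer warm film grain and sparse arrangements" },
  "User said: 'for future runs, prefer warm film grain and sparse arrangements.'",
);
console.log(ev.action); // → upsert
```

The empty-reason rejection is the load-bearing detail: it is what makes every event in the audit trail traceable back to something the user actually said.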
op0.dev is the consumer that proves the harness generalizes past Discord coaching. Brief in; image, score, video, and mail out. Every coaching mutation flows through the same instructions(upsert/retire, payload, reason) envelope the digest agent uses. Every artifact flows through a typed Zod capability handler and is mirrored to Blob when a provider URL or local Remotion MP4 needs persistence. Live MA runs are interactive and audited; deterministic rigs run through Vercel Workflow DevKit.
| Capability | Implementation |
|---|---|
| `compose.character` | 4-provider image router (Imagen 4, flux-pro, flux-dev, flux-schnell) with content-mod retry chain. |
| `compose.score` | Audio router across Replicate and ElevenLabs providers; live outputs are mirrored to Blob for durable playback. |
| `compose.video` | Remotion 4 render path — stitches image + score into a music-video MP4, mirrored to Blob. Live MA runs only; excluded from rigs until artifact-binding lands. |
| `compose_gen_video` | Generative video models via Replicate: Veo 3 Fast, Kling 2.0, LTX, Wan. Prompt → clip; output mirrored to Blob. Live MA runs only. |
| `mail.list` / `mail.send` | AgentMail-backed inbox capability, exposed to the live studio agent. |
| `instructions` | Audited rule mutation. Companion `customize_studio` mutates the user's canvas (theme / grid / density / accent) live, persisted across sessions. |
| `propose_rig` / `save_rig` | Self-development primitive — the agent designs or saves a re-runnable flow, the user reviews, the rig lands in `/rigs`, and WDK executes it later through registry-safe capabilities. |
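The registry pattern these capabilities share — a name keyed to a validated handler, with an audit record per dispatch — can be sketched roughly like this. The real registry in `packages/harness` uses Zod schemas; this stdlib-only sketch fakes validation with a predicate, and the handler names are illustrative:

```typescript
// Rough sketch of a capability registry with an audit entry per dispatch.
// Not the repo's implementation — validation is a plain predicate here,
// where the real registry validates against Zod schemas.
type Handler = (input: Record<string, unknown>) => string;

interface RegisteredCapability {
  validate: (input: Record<string, unknown>) => boolean;
  handler: Handler;
}

const capabilities = new Map<string, RegisteredCapability>();
const auditLog: { tool: string; input: unknown; result: string }[] = [];

capabilities.set("compose.score", {
  validate: (i) => typeof i.prompt === "string" && i.prompt.length > 0,
  handler: (i) => `score for: ${i.prompt}`,
});

function dispatch(tool: string, input: Record<string, unknown>): string {
  const cap = capabilities.get(tool);
  if (!cap) throw new Error(`unknown capability: ${tool}`);
  if (!cap.validate(input)) throw new Error(`invalid input for ${tool}`);
  const result = cap.handler(input);
  auditLog.push({ tool, input, result }); // every dispatch leaves a trail
  return result;
}

console.log(dispatch("compose.score", { prompt: "sparse piano" }));
// → score for: sparse piano
```

Because every call funnels through `dispatch`, the audit trail is a side effect of using the registry at all, not something a capability can opt out of.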
What a judge clicks (signed-in flow):

1. `/sign-up` → make an account.
2. `/studio?audit=open` → drop a brief with a durable preference. The first tool call records the taste rule with a reason.
3. Music-video run → `compose_character` + `compose_score` + cost-gated `compose.video` (Remotion MP4); optionally `compose_gen_video` for a raw AI clip (Veo 3 Fast / Kling 2.0 / LTX / Wan).
4. Audit drawer → every tool call, result, reason, and copied JSON payload is visible.
5. Save as rig → the tool sequence becomes a reusable WDK workflow on `/rigs`.
6. Propose a new rig → the agent designs a future flow, the user approves, the rig lands in `/rigs`.
7. Inbox → optional AgentMail surface if provisioned; mail is live-agent-only, not a rig step.
Visual identity: a working object — paper, jade #1d5b3f, math-paper grid, fountain-pen ink — not chrome, not glassmorphism, not glowing orbs. The agent can swap themes (paper / dark / blueprint / kraft) live via customize_studio based on the brief.
The original `# ctx` framing follows — it's how the digest agent's claims (audit trail, retire moments, coached precision) are testable today on the locked fixture set. op0 sits on the same substrate: live MA runs are audited, and saved/proposed rigs run durably through WDK.
Where agents build themselves around your workflows.
ctx is a coachable Managed Agents harness with a working digest-maintainer agent inside it. You DM ctx in English — to coach the digest agent ("less gm/gn, surface @arth mentions first"), or to spawn a new agent for a different workflow ("create an agent that watches anthropic/claude-code for breaking changes"). Every mutation to the harness flows through one custom tool with a required reason string, captured as a session event in the MA event log. The digest agent is today's proof; instructions(create_agent, ...) is what the surface compounds into.
This README leads with the working digest agent because the audit-trail and retire-moment claims are testable today. The harness reframe is at The harness.
On the locked 50-message, 8-coaching-DM, 10-poll fixture set, single canonical run (committed artifacts):
| metric | baseline (no coaching) | coached | Δ |
|---|---|---|---|
| weighted precision | 73.9% | 87.6% | +13.7 |
| weighted recall | 100.0% | 100.0% | +0.0 |
| weighted F1 | 85.0% | 93.4% | +8.4 |
| relevance (geomean P×R, kept for continuity) | 86.0% | 93.6% | +7.6 |
| critical recall (P0 bugs, team msgs, @-mentions, breaking releases) | 100% | 100% | — |
| retire events | 0 | 1 | — |
Evidence: [artifacts/baseline/](./artifacts/baseline), [artifacts/coached/](./artifacts/coached), [eval/score.ts](./eval/score.ts).
Re-running the harness against the same fixture set produces variance the canonical-only metric hides. With 3 attempted replicates (2 succeeded, 1 timed out mid-coached) using independent MA agents per replicate:
| metric | baseline (n=2) | coached (n=2) | per-replicate F1 |
|---|---|---|---|
| precision (mean ± 95% CI) | 73.9% ± 0.0% | 87.6% ± 0.8% | r2: 93.4% r3: 89.0% |
| recall | 100.0% ± 0.0% | 95.3% ± wide | (n=2 → df=1 → CI dominated by Student's t) |
| F1 | 85.0% ± 0.0% | 91.2% ± wide | |
| critical recall | 100% (2/2 runs) | 100% in 1/2 runs | r3 missed m031 (team_message_daisy in #announcements at poll 6) |
| retire events | 0 / 0 | 1 in 2/2 runs | |
| coached relevance > baseline | — | 2/2 runs | |
Why coached F1 spreads 89.0–93.4%: Haiku resolves cross-cutting rules (rule_004 "flag @daisy" vs rule_005 "suppress #announcements") differently across runs when both apply to the same message — documented in LEARNINGS.md. The system prompt's "first matching rule by id" tie-breaker doesn't cleanly separate priority-surface rules from suppression rules; Haiku picks differently on retries.
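One Tier-2 fix would be to make the tie-breaker itself deterministic — for instance, partition rules into surface vs. suppress classes and let surface win outright, instead of asking the model to apply "first matching rule by id" consistently. A hypothetical sketch of that policy (not the repo's implementation; rule kinds and the ranking are assumptions):

```typescript
// Hypothetical deterministic conflict resolver. Encodes one possible policy:
// priority-surface rules beat suppression rules; ties within a class fall
// back to lowest rule id. Illustrative only — not the repo's code.
interface Rule {
  id: string; // e.g. "rule_004"
  kind: "surface" | "suppress";
}

function resolve(matching: Rule[]): Rule | undefined {
  const ranked = [...matching].sort((a, b) => {
    if (a.kind !== b.kind) return a.kind === "surface" ? -1 : 1; // surface wins
    return a.id.localeCompare(b.id); // then first by id
  });
  return ranked[0];
}

// rule_004 "flag @daisy" (surface) vs rule_005 "suppress #announcements":
const winner = resolve([
  { id: "rule_005", kind: "suppress" },
  { id: "rule_004", kind: "surface" },
]);
console.log(winner?.id); // → rule_004
```

With resolution pinned in code, retries cannot flip the outcome on messages where both rule classes apply, which is exactly the source of the 89.0–93.4% spread.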
Why 1/3 attempted runs timed out: the SDK call inside the renderer's poll-4 turn returned `The operation timed out.` after 378 s. This is the second failure mode documented in LEARNINGS.md (the first being Opus stalling on poll-7 retire detection in batch mode); both are real production-readiness considerations, not eval bugs.
Honest framing: the canonical 93.6% relevance / 100% critical recall / 1 retire is a realized trajectory, not an expected value. Two of three attempted replicates hit the gate; one critical-recall miss in the successful pair. Run the harness yourself: `USE_FIXTURES=true bun run replicates --n=3 && bun run eval:aggregate artifacts/replicates`. Tier 2 work to tighten the spread (held-out fixture, deterministic rule-conflict resolution, LLM-graded prose) is documented in LEARNINGS.md.
Digest #1 with no coaching (baseline):
```markdown
# Digest — Poll 1

## Signal
- [m001] [#general] @lurker_d: gm everyone ☀️
- [m002] [#general] @random_e: gm!
- [m003] [#help] @newbie_c: how do I install? the npm command in the README errors out with peer deps
- [m005] [#general] @contrib_a: anyone else seeing flaky CI on main? third failure this morning
- [m006] [#bugs] @contrib_a: opening issue for the CI flake — cache step intermittently fails

## Suppressed by ctx (1)
- (ctx default): docs_bot auto-reply — 1 message(s): m004
```
Digest #1 after one coaching DM ("less gm/gn in the digest please"):
```markdown
# Digest — Poll 1

## Signal
- [m003] [#help] @newbie_c: how do I install? the npm command in the README errors out with peer deps
- [m005] [#general] @contrib_a: anyone else seeing flaky CI on the main branch?
- [m006] [#bugs] @contrib_a: opening issue for the CI flake — cache step intermittently fails

## Suppressed by ctx (3)
- rule_001: gm/gn message — 2 message(s): m001, m002
- (ctx default): docs-bot auto-reply — 1 message(s): m004
```
One DM. Two messages filtered. The cited rule (rule_001) links back to a custom-tool event in the MA event log, which carries the owner's own reason string: "Owner said 'less gm/gn in the digest please' — explicit request to dial down gm/gn chatter." Every decision ctx makes is traceable to a DM the owner sent.
By digest #10, six rules are active and one has been retired. The filtering keeps up.
At poll 6 the owner DMs "#announcements is mostly noise, dial it back." ctx creates rule_005, a broad suppression rule for that channel. One poll later, seeing the cost of that suppression in the poll-6 digest, the owner reverses: "wait actually — release notes in #announcements ARE important, especially breaking changes. reverse what I said about dialing it back."
The reasoner reads this against its current rule set, detects the contradiction, and autonomously fires:
```json
{
  "action": "retire",
  "payload": { "rule_id": "rule_005" },
  "reason": "Owner said 'reverse what I said about dialing it back' — explicit retraction of the #announcements suppression rule."
}
```

Belief revision on internal state — the agent deleting its own prior belief through an audited tool, with the reason captured in the MA event log. Then an upsert for the replacement (`rule_006`: surface release notes and breaking changes from #announcements as important). Two tool calls, one coaching DM, a visibly different digest the next poll.
Verify end-to-end with `bun run events --poll=7`.
Managed Agents is built around four concepts — Agent, Environment, Session, Events (docs). ctx makes each legible in the chat you already use:
| Concept | ctx's use | Evidence |
|---|---|---|
| Agent | Opus 4.7 reasoner + Haiku 4.5 renderer + Opus 4.7 narrator. Wired with the `instructions(upsert/retire, payload, reason)` custom tool, the `preference-coach` + `observability-narrator` skills, and `web_fetch` from the built-in toolset. | |
| Environment | Each session's sandboxed container. The renderer calls `web_fetch` from inside the environment when a Discord message references a URL that IS the substance of the share — citations appear in the digest. | [artifacts/coached/digest-03.md](./artifacts/coached/digest-03.md) (m015 fetched summary) |
| Session | One session per poll, same agent across polls. `state/sessions.jsonl` indexes coached-run sessions; client-side `state/rules.json` is the durable rule store — re-injected as context per session. | [src/lib/state.ts](./src/lib/state.ts), [state/sessions.jsonl](./state) |
| Events | `/observability` narrates the session event log (custom tool calls, reasons, `agent.thinking` traces, rule deltas) via Opus re-reading its own history. No shadow logging. | [artifacts/observability-opus.md](./artifacts/observability-opus.md) vs [observability-haiku.md](./artifacts/observability-haiku.md) (A/B) |
Client-side state/rules.json is the rule store — not in Managed Agents itself. Mutation flows through the Agent's custom tool primitive; every change is a session event with a reason. The event log is the audit trail.
Discord today. Any chat surface tomorrow. Same substrate.
/observability re-reads the full event log (17 sessions, 174 events from the coached run) and produces a three-section readout. Same prompt, same data, two models.
Opus 4.7 found:
"The rule set is drifting from blunt suppression toward identity- and quality-gated surfacing (daisy, contributors-with-PRs, breaking changes). The blind spot: nothing yet governs #general, which is where @daisy's real asks (m022, m049) actually land — person-priority is carrying that channel alone."
Haiku 4.5 found:
"3 of 6 rules pivot on who sends the message rather than content. Owner is not just filtering noise—they're encoding a social trust model."
Both caught the themes and the retire event. Only Opus named a prescriptive gap — the absence of a rule for a specific channel where the owner's own team messages land. That's the move a cross-theme synthesis step earns over a theme-enumeration step.
Full side-by-side: [artifacts/observability-opus.md](./artifacts/observability-opus.md), [artifacts/observability-haiku.md](./artifacts/observability-haiku.md).
The digest agent is the first proof; it isn't the only thing ctx can do. The same instructions(...) custom tool that owns coaching-rule upsert/retire accepts a third action — create_agent — that registers a new agent intent in the harness when the owner asks ctx (by DM) to spawn an agent for a different workflow.
A creation DM looks like this:
"create an agent that watches anthropic/claude-code for new PRs and DMs me when there's a breaking change"
The reasoner classifies it as a creation request — distinct from coaching the current digest agent, which is a separate category in the preference-coach skill — and fires:
```json
{
  "action": "create_agent",
  "payload": {
    "name": "github-pr-watcher",
    "purpose": "Watch anthropic/claude-code for new PRs and surface breaking changes via owner DM."
  },
  "reason": "Owner asked: 'create an agent that watches anthropic/claude-code for new PRs and DMs me when there's a breaking change'."
}
```

The new AgentSpec lands in `state/agent-registry.json` with a sequential id (`spec_001`, `spec_002`, …). Same audit-trail invariant as upsert/retire: every spec carries a cited reason, every creation is a session event in the MA event log.
Provisioning — materializing each spec into a fully-wired Managed Agent (its own custom tools, its own skills, its own session lifecycle) — is a separate orchestration step the owner controls. v1 records the intent; the decision boundary, the audit, and the contract are the load-bearing pieces, and they ship today.
[fixtures/creation.jsonl](./apps/ctx/fixtures/creation.jsonl) carries the canonical creation DM. `bun run test:create-agent` exercises the 21-assertion contract — id allocation, default `skill_refs`, missing-name/purpose rejection, empty-reason rejection, and regression on the existing upsert/retire actions.
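The id-allocation and rejection behavior that contract describes can be sketched as follows. This is illustrative, not the repo's code — the `AgentSpec` fields are inferred from the JSON example above:

```typescript
// Illustrative sketch of sequential AgentSpec id allocation (spec_001,
// spec_002, …) with the rejection rules the contract tests. Not the
// repo's implementation.
interface AgentSpec {
  id: string;
  name: string;
  purpose: string;
  reason: string;
}

function allocateId(registry: AgentSpec[]): string {
  return `spec_${String(registry.length + 1).padStart(3, "0")}`;
}

function createAgent(
  registry: AgentSpec[],
  name: string,
  purpose: string,
  reason: string,
): AgentSpec {
  if (!name || !purpose) throw new Error("missing name/purpose rejected");
  if (!reason) throw new Error("empty reason rejected");
  const spec = { id: allocateId(registry), name, purpose, reason };
  registry.push(spec); // persisted to state/agent-registry.json in the real flow
  return spec;
}

const reg: AgentSpec[] = [];
const spec = createAgent(
  reg,
  "github-pr-watcher",
  "Watch anthropic/claude-code for breaking changes.",
  "Owner asked for a PR watcher.",
);
console.log(spec.id); // → spec_001
```
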
Same surface (DM → reasoner → audited custom tool → durable state) for every harness mutation. The digest agent earned the surface; create_agent is what the surface compounds into.
OSS maintenance is the first use case because the pain is visceral and universal. The architecture underneath is not Discord-specific:
- The chat surface sits behind a `ChatSurface` interface (Discord today; Slack, iMessage, email via Cloudflare Workers, voice via ElevenLabs are adapter swaps, not rewrites).
- The durable state (`state/rules.json` + `state/agent-registry.json`) is any curated-preference object — content triage rules today, task priorities or design-review criteria tomorrow, plus the registry of additional agents the owner has spawned.
- The multi-modal inputs (text, URLs via env sandbox, image and document content blocks via `src/lib/multimodal.ts`) cover most feed-like sources. The renderer accepts attachments as evidence for a message, not a reason to suppress it.
- The primitive surfaces (`/observability`, `/instructions`, `/skills`) are generic to any coachable agent in the harness.
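A sketch of what that adapter boundary might look like. The actual interface lives in `src/lib/discord.ts`; the method names and message shape here are assumptions:

```typescript
// Assumed shape for a ChatSurface adapter boundary — illustrative only.
// The real interface is in src/lib/discord.ts.
interface ChatMessage {
  id: string;
  channel: string;
  author: string;
  text: string;
}

interface ChatSurface {
  poll(): Promise<ChatMessage[]>;              // pull new messages
  sendDigest(markdown: string): Promise<void>; // post the digest back
}

// Fixture-mode adapter: deterministic, no network — the evals path.
class FixtureSurface implements ChatSurface {
  constructor(private fixtures: ChatMessage[]) {}
  async poll(): Promise<ChatMessage[]> {
    return this.fixtures;
  }
  async sendDigest(markdown: string): Promise<void> {
    console.log(markdown); // a Discord/Slack adapter would POST instead
  }
}

const surface: ChatSurface = new FixtureSurface([
  { id: "m001", channel: "#general", author: "lurker_d", text: "gm everyone" },
]);
surface.poll().then((msgs) => console.log(msgs.length)); // → 1
```

Swapping Discord for Slack then means writing one more class against the same interface, which is what makes adapters "swaps, not rewrites."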
Other use cases that fit this shape without architectural change:
- Music producer triage — you're in 10+ Discords (sample packs, collab servers, label communities). Coach an agent on what a collab ask looks like, what a critique-worthy WIP looks like, and what's noise.
- Design review triage — coach an agent that watches a Figma channel and surfaces review requests that match your current project, ignoring the rest.
- Podcast / long-form intake — coach an agent on what you actually want from a 2-hour recording.
- Investment deal flow — coach an agent on your thesis; it filters the inbound.
Build one product well; the substrate earns the next three by demonstration, not by pitch.
The guided replay needs only Bun. Live SDK commands also need an `ANTHROPIC_API_KEY`; copy `.env.example` to `.env` and fill in the key.
```bash
bun install

# 2-minute narrated walkthrough — no API calls, replays committed artifacts.
# Start here if you've never seen ctx before.
bun run demo:guided

# run all 4 SDK probes — bootstrap, custom tool pause/resume, event log, model routing
bun run probes

# single-poll smoke test (fixture mode)
USE_FIXTURES=true bun run poll --n=1 --fresh --output=artifacts/digest-01.md

# end-to-end batch: baseline (no coaching) + coached (10 polls)
USE_FIXTURES=true bun run demo --mode=baseline --output-dir=artifacts/baseline --fresh
USE_FIXTURES=true bun run demo --mode=coached --output-dir=artifacts/coached

# score the gate
bun run eval artifacts/baseline artifacts/coached

# the hero beat — Opus cross-theme synthesis, with a Haiku A/B
bun run observability

# surface tours
bun run instructions         # rule store, theme-clustered
bun run skills               # registered skills, load triggers
bun run events --poll=7      # event-log cutaway for the retire beat
bun run test:create-agent    # 21-assertion contract for the harness create_agent action
bun run live:one-shot        # guarded live Discord dry-run; --execute writes artifacts/live-*
bun run help
```

The coaching loop is the IP. Fixture mode is deterministic for evals; the live-Discord adapter is one file (`src/lib/discord.ts`). Live wiring is a feature, not a primitive — the interface is the substrate. `USE_FIXTURES=true` forces fixtures; set a `DISCORD_BOT_TOKEN` to point the same interface at a real discord-mcp server.
Only needed for live mode — fixture mode and demo:guided work without any Discord wiring.
Prerequisites (one-time, do by hand):
- Create a bot. Discord Developer Portal → New Application → Bot → Reset Token. Copy the token.
- Grab a guild ID. Enable Developer Mode in Discord (Settings → Advanced). Right-click your test server → Copy Server ID.
- Populate `apps/ctx/.env`:

```ini
DISCORD_TOKEN=<bot token from step 1>
DISCORD_GUILD_ID=<server id from step 2>
ANTHROPIC_API_KEY=<from console.anthropic.com>
```
Invite the bot with correct permissions (this is the annoying part — read carefully):
Discord has a quirk: clicking an OAuth URL with new permissions on a bot that's already in the server does not update its role permissions. It preserves the existing role. You must either kick-then-reinvite, or manually toggle permissions in Server Settings.
For a personal test server, the simplest path is Administrator:

```
https://discord.com/api/oauth2/authorize?client_id=<YOUR_APP_ID>&permissions=8&scope=bot
```

Replace `<YOUR_APP_ID>` with your bot's application ID (visible in the Dev Portal, or log one dry-run of `bun run setup:discord` — it prints both). `permissions=8` is Administrator; for a test server you own, that's fine. Never use Administrator on a shared/production server.
If the bot is already in the server with wrong perms: right-click → Kick → click the URL above → Authorize. Fresh role, correct perms.
For a least-privilege invite instead of Administrator, use `permissions=117776` (Manage Channels + Send Messages + Read Message History + View Channel + Embed Links + Attach Files). The same kick-first rule applies.
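Discord permission values are bit flags, so you can sanity-check that number yourself. The bit positions below come from Discord's permission documentation:

```typescript
// Sanity-check the least-privilege bitmask: 117776 should be exactly the
// six named Discord permissions OR'd together (bit values per Discord docs).
const PERMS = {
  MANAGE_CHANNELS: 1 << 4,       // 16
  VIEW_CHANNEL: 1 << 10,         // 1024
  SEND_MESSAGES: 1 << 11,        // 2048
  EMBED_LINKS: 1 << 14,          // 16384
  ATTACH_FILES: 1 << 15,         // 32768
  READ_MESSAGE_HISTORY: 1 << 16, // 65536
};

const mask = Object.values(PERMS).reduce((acc, bit) => acc | bit, 0);
console.log(mask); // → 117776
```
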
Run the setup script:
```bash
bun run setup:discord          # dry-run — prints the plan
bun run setup:discord --apply  # creates ctx-demo / ctx-coaching / ctx-digest,
                               # resolves owner ID via OAuth API, writes .env,
                               # emits .notes/seed-messages.txt
```

On a 403, the script prints the exact re-auth URL to click. Clean re-run after kick+reinvite.
Seed the test channels:
`.notes/seed-messages.txt` (generated by `--apply`) contains 12 fixture messages + 1 coaching DM formatted for pasting. Open `#ctx-demo` and paste the messages one per line from your own account (not the bot — `live-one-shot.ts` filters `m.author.bot`). Paste the coaching line into `#ctx-coaching`. You're now ready for:
```bash
bun run live:one-shot                    # dry-run plan, no network
bun run live:one-shot --execute          # one capped MA pass; no digest send
bun run live:one-shot --execute --send   # posts digest to #ctx-digest
```

Per LEARNINGS.md and THESIS.md scope discipline:
- Production live Discord over the existing `discord-mcp` server — the adapter interface is in `src/lib/discord.ts`; fixture mode is the deterministic evals path. `bun run live:one-shot` is a guarded REST sidecar for a single test-channel smoke, not the production adapter.
- Vision and PDF passthrough in the canonical eval set — the renderer accepts image and document content blocks via the MA user-message API (`src/lib/multimodal.ts`); `bun run live:one-shot` exercises both with assets in `fixtures/assets/`. The locked n=50 fixture is text-only by design, so the eval-gate metrics don't credit multimodal substrate work.
- Provisioner for `create_agent` specs — `state/agent-registry.json` records intents today; the provisioner that reads specs and calls `client.beta.agents.create` + wires session resources + vault is a separate post-submission step.
- Slack / iMessage / email adapters — interface shape only, no code.
- Dashboard UI — CLI-only surfaces. (Landing canvas in `apps/web/` is a separate target.)
Hackathon submission — Apr 26 8pm EST.
Built by Arth Tyagi. OSS background: wav0 (Vercel OSS Fall Cohort, ElevenLabs Startup Grant), devclad (Thiel Fellowship R1, Techstars Final Round), onloop (top 15 of 187 at Opencode Hackathon India).
MIT