
arthtyagi/ctx


op0 · ctx

Brief → make media. Correct it → taste stays. Save it → run it again.

op0 is a creative studio where the agent generates images, scores, Remotion music videos, and fresh generative video clips from a brief. When you tell it what worked, it records the rule with your words attached. When you save a run, the sequence becomes a reusable rig. The studio changes itself on the record; nothing is a hidden prompt edit.

ctx is the substrate underneath: a Managed Agents harness where every durable change is a typed tool call with a cited reason. The digest agent is the first proof: it coached a Discord summarizer from 86.0% to 93.6% relevance, including an audited retire-then-upsert moment. op0 shows the same pattern becoming a creative product.

Two execution paths, both auditable:

  • Live MA runs — interactive; every event mirrored to Postgres so the studio timeline is durable and a refresh replays it from the DB.
  • WDK rigs — saved or proposed flows executed via Vercel Workflow DevKit; durable, resumable mid-stream, and replay-safe because only capabilities marked safe for repeat execution are admitted.
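
The admission rule in the second bullet can be sketched as a registry filter. A minimal sketch — the capability names mirror ones described later in this README, but the flag name and registry shape are assumptions, not the repo's actual code:

```typescript
// Hypothetical registry shape: only capabilities flagged repeat-safe are
// admitted into a saved rig; live-agent-only capabilities are filtered out.
interface Capability {
  name: string;
  safeForRepeatExecution: boolean;
}

const registry: Capability[] = [
  { name: "compose.character", safeForRepeatExecution: true },
  { name: "compose.score", safeForRepeatExecution: true },
  { name: "mail.send", safeForRepeatExecution: false }, // live-agent-only
];

const rigSteps = registry
  .filter((c) => c.safeForRepeatExecution)
  .map((c) => c.name);
console.log(rigSteps); // ["compose.character", "compose.score"]
```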

See docs/THESIS.md for the long form and docs/ARCHITECTURE.md for the layered diagram.

The 90-second demo

# 1. install + provision once
bun install
cp apps/op0/.env.example apps/op0/.env.local   # fill REPLICATE/ELEVENLABS/ANTHROPIC keys, DATABASE_URL, BLOB_READ_WRITE_TOKEN, BETTER_AUTH_SECRET
bun --filter op0 setup:agent                   # creates the op0-studio MA agent + environment

# 2. run
bun --filter op0 dev                            # http://localhost:3000

# 3. record this flow
#    /studio?audit=open
#    brief: "a 1970s editorial portrait of a daffodil. for future runs,
#            prefer warm film grain and sparse arrangements."
#       -> instructions(upsert) records the taste rule with the user's words
#       -> compose_character creates the image
#       -> compose_score creates the score
#       -> compose.video renders a Remotion MP4 and mirrors it to Blob
#       -> (optional) compose_gen_video for a fresh generative video clip
#       -> audit drawer shows the tool trail and reasons
#    brief: "save this as the editorial rig"
#       -> save_rig derives a reusable WDK workflow from the audited tool calls
#    brief: "set up a rig for monochrome character studies"
#       -> propose_rig designs a new reusable flow, behind confirmation

Both consumers write to the same instructions(upsert/retire, payload, reason) shape. The audit trail is uniform; the live agent loop is uniform; deterministic rigs reuse the same capability registry.
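
A minimal sketch of that shape in plain TypeScript (the repo validates it with Zod; field details beyond action/payload/reason are assumptions):

```typescript
type InstructionsCall = {
  action: "upsert" | "retire";
  payload: Record<string, unknown>; // e.g. { rule_id: "rule_005" } for a retire
  reason: string;                   // required — the cited, human-readable reason
};

// Empty reasons are rejected, mirroring the audit-trail invariant.
function validate(call: InstructionsCall): InstructionsCall {
  if (!call.reason.trim()) throw new Error("empty reason rejected");
  return call;
}

const retire = validate({
  action: "retire",
  payload: { rule_id: "rule_005" },
  reason: "Owner said 'reverse what I said about dialing it back'",
});
console.log(retire.action); // "retire"
```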

This repo ships two consumers on one substrate:

  • @ctx/harness — MA-native primitives (/ma: pause/resume tool loop, session helpers, memory_stores upsert) + capability registry (/registry: Zod-validated dispatch with a built-in audit envelope) + bundled providers (/audio, /image). Lifted into a package so both apps share one implementation.
  • apps/ctx — coachable digest-maintainer agent. Coached precision +13.7 pts (87.6% vs 73.9%), 100% critical recall, one retire moment, n=2 replicate aggregate. The substrate's existing evidence.
  • apps/op0 (op0.dev) — the creative multimedia studio. Two execution paths: the live MA agent loop (interactive; every event mirrored to Postgres audit_event so a refresh replays the timeline from the DB), and Vercel WDK rigs (deterministic; refresh-resumable mid-stream via the durable WDK runId). Saved rigs at /rigs; durable history at /runs.
apps/op0          — the creative studio (Next 16 · WDK · shadcn · json-render · Drizzle · Blob · Remotion)
apps/ctx          — the digest-agent proof + harness CLIs + replicate evals
apps/wav0-score   — score-CLI smoke tests for the audio capability layer
apps/web          — usectx.com landing
packages/harness  — MA primitives (ma/) · capability registry (registry/) · provider runtimes

Both apps speak the same audit envelope: the studio's tool dispatches and the digest agent's instructions(upsert/retire, payload, reason) calls land as MA custom-tool events with the user's own words attached.

op0 — the creative studio

op0.dev is the consumer that proves the harness generalizes past Discord coaching. Brief in; image, score, video, and mail out. Every coaching mutation flows through the same instructions(upsert/retire, payload, reason) envelope the digest agent uses. Every artifact flows through a typed Zod capability handler and is mirrored to Blob when a provider URL or local Remotion MP4 needs persistence. Live MA runs are interactive and audited; deterministic rigs run through Vercel Workflow DevKit.

| Capability | Implementation |
|---|---|
| compose.character | 4-provider image router (Imagen 4, flux-pro, flux-dev, flux-schnell) with content-mod retry chain. |
| compose.score | Audio router across Replicate and ElevenLabs providers; live outputs are mirrored to Blob for durable playback. |
| compose.video | Remotion 4 render path — stitches image + score into a music-video MP4, mirrored to Blob. Live MA runs only; excluded from rigs until artifact binding lands. |
| compose_gen_video | Generative video models via Replicate: Veo 3 Fast, Kling 2.0, LTX, Wan. Prompt → clip; output mirrored to Blob. Live MA runs only. |
| mail.list / mail.send | AgentMail-backed inbox capability, exposed to the live studio agent. |
| instructions | Audited rule mutation. Companion customize_studio mutates the user's canvas (theme / grid / density / accent) live, persisted across sessions. |
| propose_rig / save_rig | Self-development primitive — the agent designs or saves a re-runnable flow, the user reviews, the rig lands in /rigs, and WDK executes it later through registry-safe capabilities. |

What a judge clicks (signed-in flow):

  1. /sign-up → make an account.
  2. /studio?audit=open → drop a brief with a durable preference. The first tool call records the taste rule with a reason.
  3. Music-video run — compose_character + compose_score + cost-gated compose.video (Remotion MP4); optionally compose_gen_video for a raw AI clip (Veo 3 Fast / Kling 2.0 / LTX / Wan).
  4. Audit drawer → every tool call, result, reason, and copied JSON payload is visible.
  5. Save as rig → the tool sequence becomes a reusable WDK workflow on /rigs.
  6. Propose a new rig → the agent designs a future flow, user approves, the rig lands in /rigs.
  7. Inbox → optional AgentMail surface if provisioned; mail is live-agent-only, not a rig step.

Visual identity: a working object — paper, jade #1d5b3f, math-paper grid, fountain-pen ink — not chrome, not glassmorphism, not glowing orbs. The agent can swap themes (paper / dark / blueprint / kraft) live via customize_studio based on the brief.


The original # ctx framing follows — it's how the digest agent's claims (audit trail, retire moments, coached precision) are testable today on the locked fixture set. op0 sits on the same substrate: live MA runs are audited, and saved/proposed rigs run durably through WDK.


ctx

Where agents build themselves around your workflows.

ctx is a coachable Managed Agents harness with a working digest-maintainer agent inside it. You DM ctx in English — to coach the digest agent ("less gm/gn, surface @arth mentions first"), or to spawn a new agent for a different workflow ("create an agent that watches anthropic/claude-code for breaking changes"). Every mutation to the harness flows through one custom tool with a required reason string, captured as a session event in the MA event log. The digest agent is today's proof; instructions(create_agent, ...) is what the surface compounds into.

This README leads with the working digest agent because the audit-trail and retire-moment claims are testable today. The harness reframe is at The harness.

The working digest agent

On the locked 50-message, 8-coaching-DM, 10-poll fixture set, single canonical run (committed artifacts):

| metric | baseline (no coaching) | coached | Δ |
|---|---|---|---|
| weighted precision | 73.9% | 87.6% | +13.7 |
| weighted recall | 100.0% | 100.0% | +0.0 |
| weighted F1 | 85.0% | 93.4% | +8.4 |
| relevance (geomean P×R, kept for continuity) | 86.0% | 93.6% | +7.6 |
| critical recall (P0 bugs, team msgs, @-mentions, breaking releases) | 100% | 100% | — |
| retire events | 0 | 1 | — |

Evidence: [artifacts/baseline/](./artifacts/baseline), [artifacts/coached/](./artifacts/coached), [eval/score.ts](./eval/score.ts).

Reproducibility — variance across replicates

Re-running the harness against the same fixture set produces variance the canonical-only metric hides. With 3 attempted replicates (2 succeeded, 1 timed out mid-coached) using independent MA agents per replicate:

| metric | baseline (n=2) | coached (n=2) | per-replicate F1 / notes |
|---|---|---|---|
| precision (mean ± 95% CI) | 73.9% ± 0.0% | 87.6% ± 0.8% | r2: 93.4% · r3: 89.0% |
| recall | 100.0% ± 0.0% | 95.3% ± wide | n=2 → df=1 → CI dominated by Student's t |
| F1 | 85.0% ± 0.0% | 91.2% ± wide | |
| critical recall | 100% (2/2 runs) | 100% in 1/2 runs | r3 missed m031 (team_message_daisy in #announcements at poll 6) |
| retire events | 0 / 0 | 1 in 2/2 runs | |
| coached relevance > baseline | — | — | 2/2 runs |

Why coached F1 spreads 89.0–93.4%: Haiku resolves cross-cutting rules (rule_004 "flag @daisy" vs rule_005 "suppress #announcements") differently across runs when both apply to the same message — documented in LEARNINGS.md. The system prompt's "first matching rule by id" tie-breaker doesn't cleanly separate priority-surface rules from suppression rules; Haiku picks differently on retries.
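
One deterministic resolution that LEARNINGS.md's Tier 2 list points toward can be sketched as a kind-aware tie-breaker. Illustrative only — the rule shape and the surface-beats-suppression policy are assumptions, not the shipped resolver:

```typescript
type Rule = { id: string; kind: "surface" | "suppress"; matches: (m: string) => boolean };

// When rules cross-cut, prefer priority-surface rules over suppression rules
// instead of "first matching rule by id".
function resolve(rules: Rule[], msg: string): Rule | undefined {
  const hits = rules.filter((r) => r.matches(msg));
  return hits.find((r) => r.kind === "surface") ?? hits[0];
}

const rules: Rule[] = [
  { id: "rule_004", kind: "surface",  matches: (m) => m.includes("@daisy") },
  { id: "rule_005", kind: "suppress", matches: (m) => m.includes("#announcements") },
];
console.log(resolve(rules, "@daisy posted in #announcements")?.id); // "rule_004"
```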

Why 1/3 attempted runs timed out: the SDK call inside the renderer's poll-4 turn returned The operation timed out. after 378s. This is the second-mode failure documented in LEARNINGS.md (the first being Opus stalling on poll 7 retire detection in batch mode); both are real production-readiness considerations, not eval bugs.

Honest framing: the canonical 93.6% relevance / 100% critical recall / 1 retire is a realized trajectory, not an expected value. Two of three attempted replicates hit the gate; one critical-recall miss in the successful pair. Run the harness yourself: USE_FIXTURES=true bun run replicates --n=3 && bun run eval:aggregate artifacts/replicates. Tier 2 work to tighten the spread (held-out fixture, deterministic rule-conflict resolution, LLM-graded prose) is documented in LEARNINGS.md.
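
For intuition on why the n=2 intervals read as "wide": with df = 1, the 95% two-sided Student's t critical value is ≈ 12.706, so any nonzero spread blows up the interval. A quick sketch of the arithmetic:

```typescript
// 95% CI for two replicates: mean ± t * sd / sqrt(n), with df = n - 1 = 1.
function ci95TwoSamples(a: number, b: number): [number, number] {
  const mean = (a + b) / 2;
  const sd = Math.abs(a - b) / Math.SQRT2; // sample sd when n = 2
  const half = (12.706 * sd) / Math.SQRT2; // t(0.975, df=1) ≈ 12.706
  return [mean - half, mean + half];
}

// The two coached F1 replicates from the table: mean 91.2%, but the interval
// spans roughly [63%, 119%].
const [lo, hi] = ci95TwoSamples(0.934, 0.89);
console.log(lo.toFixed(3), hi.toFixed(3)); // 0.632 1.192
```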

The feel — before vs after

Digest #1 with no coaching (baseline):

# Digest — Poll 1
## Signal
- [m001] [#general] @lurker_d: gm everyone ☀️
- [m002] [#general] @random_e: gm!
- [m003] [#help] @newbie_c: how do I install? the npm command in the README errors out with peer deps
- [m005] [#general] @contrib_a: anyone else seeing flaky CI on main? third failure this morning
- [m006] [#bugs] @contrib_a: opening issue for the CI flake — cache step intermittently fails
## Suppressed by ctx (1)
- (ctx default): docs_bot auto-reply — 1 message(s): m004

Digest #1 after one coaching DM ("less gm/gn in the digest please"):

# Digest — Poll 1
## Signal
- [m003] [#help] @newbie_c: how do I install? the npm command in the README errors out with peer deps
- [m005] [#general] @contrib_a: anyone else seeing flaky CI on the main branch?
- [m006] [#bugs] @contrib_a: opening issue for the CI flake — cache step intermittently fails
## Suppressed by ctx (3)
- rule_001: gm/gn message — 2 message(s): m001, m002
- (ctx default): docs-bot auto-reply — 1 message(s): m004

One DM. Two messages filtered. The cited rule (rule_001) links back to a custom-tool event in the MA event log, which carries the owner's own reason string: "Owner said 'less gm/gn in the digest please' — explicit request to dial down gm/gn chatter." Every decision ctx makes is traceable to a DM the owner sent.

By digest #10, six rules are active and one has been retired. The filtering keeps up.

The retire moment

At poll 6 the owner DMs "#announcements is mostly noise, dial it back." ctx creates rule_005, a broad suppression rule for that channel. One poll later, seeing the cost of that suppression in the poll-6 digest, the owner reverses: "wait actually — release notes in #announcements ARE important, especially breaking changes. reverse what I said about dialing it back."

The reasoner reads this against its current rule set, detects the contradiction, and autonomously fires:

{
  "action": "retire",
  "payload": { "rule_id": "rule_005" },
  "reason": "Owner said 'reverse what I said about dialing it back' — explicit retraction of the #announcements suppression rule."
}

Belief revision on internal state — the agent deleting its own prior belief through an audited tool, with the reason captured in the MA event log. Then an upsert for the replacement (rule_006: surface release notes and breaking changes from #announcements as important). Two tool calls, one coaching DM, a visibly different digest the next poll.

Verify end-to-end with bun run events --poll=7.

Under the hood

Managed Agents is built around four concepts — Agent, Environment, Session, Events (docs). ctx makes each legible in the chat you already use:

| Concept | ctx's use | Evidence |
|---|---|---|
| Agent | Opus 4.7 reasoner + Haiku 4.5 renderer + Opus 4.7 narrator. Wired with the `instructions(upsert/retire, payload, reason)` custom tool, the `preference-coach` + `observability-narrator` skills, and `web_fetch` from the built-in toolset. | |
| Environment | Each session's sandboxed container. The renderer calls web_fetch from inside the environment when a Discord message references a URL that IS the substance of the share — citations appear in the digest. | [artifacts/coached/digest-03.md](./artifacts/coached/digest-03.md) (m015 fetched summary) |
| Session | One session per poll, same agent across polls. state/sessions.jsonl indexes coached-run sessions; client-side state/rules.json is the durable rule store — re-injected as context per session. | [src/lib/state.ts](./src/lib/state.ts), [state/sessions.jsonl](./state) |
| Events | /observability narrates the session event log (custom tool calls, reasons, agent.thinking traces, rule deltas) via Opus re-reading its own history. No shadow logging. | [artifacts/observability-opus.md](./artifacts/observability-opus.md) vs [observability-haiku.md](./artifacts/observability-haiku.md) (A/B) |

Client-side state/rules.json is the rule store — not in Managed Agents itself. Mutation flows through the Agent's custom tool primitive; every change is a session event with a reason. The event log is the audit trail.
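
A hedged sketch of that re-injection step — the file path is from this README, but the rule fields and loader below are assumptions, not src/lib/state.ts:

```typescript
import { readFileSync } from "node:fs";

type Rule = { id: string; instruction: string; reason: string };

// Load the durable rule store and flatten it into a context block that is
// prepended to each new session.
function rulesAsContext(path = "state/rules.json"): string {
  const rules: Rule[] = JSON.parse(readFileSync(path, "utf8"));
  return rules.map((r) => `- [${r.id}] ${r.instruction}`).join("\n");
}
```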

Discord today. Any chat surface tomorrow. Same substrate.

Why Opus 4.7 (the A/B)

/observability re-reads the full event log (17 sessions, 174 events from the coached run) and produces a three-section readout. Same prompt, same data, two models.

Opus 4.7 found:

"The rule set is drifting from blunt suppression toward identity- and quality-gated surfacing (daisy, contributors-with-PRs, breaking changes). The blind spot: nothing yet governs #general, which is where @daisy's real asks (m022, m049) actually land — person-priority is carrying that channel alone."

Haiku 4.5 found:

"3 of 6 rules pivot on who sends the message rather than content. Owner is not just filtering noise—they're encoding a social trust model."

Both caught the themes and the retire event. Only Opus named a prescriptive gap — the absence of a rule for a specific channel where the owner's own team messages land. That's the move a cross-theme synthesis step earns over a theme-enumeration step.

Full side-by-side: [artifacts/observability-opus.md](./artifacts/observability-opus.md), [artifacts/observability-haiku.md](./artifacts/observability-haiku.md).

The harness

The digest agent is the first proof; it isn't the only thing ctx can do. The same instructions(...) custom tool that owns coaching-rule upsert/retire accepts a third action — create_agent — that registers a new agent intent in the harness when the owner asks ctx (by DM) to spawn an agent for a different workflow.

A creation DM looks like this:

"create an agent that watches anthropic/claude-code for new PRs and DMs me when there's a breaking change"

The reasoner classifies it as a creation request — distinct from coaching the current digest agent, which is a separate category in the preference-coach skill — and fires:

{
  "action": "create_agent",
  "payload": {
    "name": "github-pr-watcher",
    "purpose": "Watch anthropic/claude-code for new PRs and surface breaking changes via owner DM."
  },
  "reason": "Owner asked: 'create an agent that watches anthropic/claude-code for new PRs and DMs me when there's a breaking change'."
}

The new AgentSpec lands in state/agent-registry.json with a sequential id (spec_001, spec_002, …). Same audit-trail invariant as upsert/retire: every spec carries a cited reason, every creation is a session event in the MA event log.
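
The sequential-id scheme is simple enough to sketch (illustrative — the real allocator and AgentSpec fields live in the repo's registry code):

```typescript
type AgentSpec = { id: string; name: string; purpose: string; reason: string };

// Allocate spec_001, spec_002, … based on how many specs the registry holds.
function nextSpecId(registry: AgentSpec[]): string {
  return `spec_${String(registry.length + 1).padStart(3, "0")}`;
}

console.log(nextSpecId([])); // "spec_001"
```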

Provisioning — materializing each spec into a fully-wired Managed Agent (its own custom tools, its own skills, its own session lifecycle) — is a separate orchestration step the owner controls. v1 records the intent; the decision boundary, the audit, and the contract are the load-bearing pieces, and they ship today.

[fixtures/creation.jsonl](./apps/ctx/fixtures/creation.jsonl) carries the canonical creation DM. bun run test:create-agent exercises the 21-assertion contract — id allocation, default skill_refs, missing-name/purpose rejection, empty-reason rejection, regression on the existing upsert/retire actions.

Same surface (DM → reasoner → audited custom tool → durable state) for every harness mutation. The digest agent earned the surface; create_agent is what the surface compounds into.

Built to generalize

OSS maintenance is the first use case because the pain is visceral and universal. The architecture underneath is not Discord-specific:

  • The chat surface sits behind a ChatSurface interface (Discord today; Slack, iMessage, email via Cloudflare Workers, voice via ElevenLabs are adapter swaps, not rewrites).
  • The durable state (state/rules.json + state/agent-registry.json) is any curated-preference object — content triage rules today, task priorities or design-review criteria tomorrow, plus the registry of additional agents the owner has spawned.
  • The multi-modal inputs (text, URLs via env sandbox, image and document content blocks via src/lib/multimodal.ts) cover most feed-like sources. The renderer accepts attachments as evidence for a message, not a reason to suppress it.
  • The primitive surfaces (/observability, /instructions, /skills) are generic to any coachable agent in the harness.

Other use cases that fit this shape without architectural change:

  • Music producer triage — you're in 10+ Discords (sample packs, collab servers, label communities). Coach an agent on what a collab ask looks like, what a critique-worthy WIP looks like, and what's noise.
  • Design review triage — coach an agent that watches a Figma channel and surfaces review requests that match your current project, ignoring the rest.
  • Podcast / long-form intake — coach an agent on what you actually want from a 2-hour recording.
  • Investment deal flow — coach an agent on your thesis; it filters the inbound.

Build one product well; the substrate earns the next three by demonstration, not by pitch.

How to run

The guided replay needs only Bun. Live SDK commands also need an ANTHROPIC_API_KEY; copy .env.example to .env and fill in the key.

bun install

# 2-minute narrated walkthrough — no API calls, replays committed artifacts.
# Start here if you've never seen ctx before.
bun run demo:guided

# run all 4 SDK probes — bootstrap, custom tool pause/resume, event log, model routing
bun run probes

# single-poll smoke test (fixture mode)
USE_FIXTURES=true bun run poll --n=1 --fresh --output=artifacts/digest-01.md

# end-to-end batch: baseline (no coaching) + coached (10 polls)
USE_FIXTURES=true bun run demo --mode=baseline --output-dir=artifacts/baseline --fresh
USE_FIXTURES=true bun run demo --mode=coached  --output-dir=artifacts/coached

# score the gate
bun run eval artifacts/baseline artifacts/coached

# the hero beat — Opus cross-theme synthesis, with a Haiku A/B
bun run observability

# surface tours
bun run instructions          # rule store, theme-clustered
bun run skills                # registered skills, load triggers
bun run events --poll=7       # event-log cutaway for the retire beat
bun run test:create-agent     # 21-assertion contract for the harness create_agent action
bun run live:one-shot         # guarded live Discord dry-run; --execute writes artifacts/live-*
bun run help

The coaching loop is the IP. Fixture mode is deterministic for evals; the live-Discord adapter is one file (src/lib/discord.ts). Live wiring is a feature, not a primitive — the interface is the substrate. USE_FIXTURES=true forces fixtures; set a DISCORD_BOT_TOKEN to point the same interface at a real discord-mcp server.

Discord setup (for live:one-shot)

Only needed for live mode — fixture mode and demo:guided work without any Discord wiring.

Prerequisites (one-time, do by hand):

  1. Create a bot. Discord Developer Portal → New Application → Bot → Reset Token. Copy the token.
  2. Grab a guild ID. Enable Developer Mode in Discord (Settings → Advanced). Right-click your test server → Copy Server ID.
  3. Populate apps/ctx/.env:
 DISCORD_TOKEN=<bot token from step 1>
 DISCORD_GUILD_ID=<server id from step 2>
 ANTHROPIC_API_KEY=<from console.anthropic.com>

Invite the bot with correct permissions (this is the annoying part — read carefully):

Discord has a quirk: clicking an OAuth URL with new permissions on a bot that's already in the server does not update its role permissions. It preserves the existing role. You must either kick-then-reinvite, or manually toggle permissions in Server Settings.

For a personal test server, the simplest path is Administrator:

https://discord.com/api/oauth2/authorize?client_id=<YOUR_APP_ID>&permissions=8&scope=bot

Replace <YOUR_APP_ID> with your bot's application ID (visible in the Dev Portal, or log one dry-run of bun run setup:discord — it prints both). permissions=8 is Administrator; for a test server you own, that's fine. Never use Administrator on a shared/production server.

If the bot is already in the server with wrong perms: right-click → Kick → click the URL above → Authorize. Fresh role, correct perms.

For a least-privilege invite instead of Administrator, use permissions=117776 (Manage Channels + Send + Read History + View + Embed Links + Attach Files). Same kick-first rule applies.
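
The 117776 mask is just the bitwise OR of the named Discord permission flags (values from the Discord API permissions bitfield), which you can verify yourself:

```typescript
// Discord permission bitfield flags behind permissions=117776.
const MANAGE_CHANNELS      = 1 << 4;  // 16
const VIEW_CHANNEL         = 1 << 10; // 1024
const SEND_MESSAGES        = 1 << 11; // 2048
const EMBED_LINKS          = 1 << 14; // 16384
const ATTACH_FILES         = 1 << 15; // 32768
const READ_MESSAGE_HISTORY = 1 << 16; // 65536

const mask = MANAGE_CHANNELS | VIEW_CHANNEL | SEND_MESSAGES
  | EMBED_LINKS | ATTACH_FILES | READ_MESSAGE_HISTORY;
console.log(mask); // 117776
```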

Run the setup script:

bun run setup:discord           # dry-run — prints the plan
bun run setup:discord --apply   # creates ctx-demo / ctx-coaching / ctx-digest,
                                # resolves owner ID via OAuth API, writes .env,
                                # emits .notes/seed-messages.txt

On a 403, the script prints the exact re-auth URL to click. Clean re-run after kick+reinvite.

Seed the test channels:

.notes/seed-messages.txt (generated by --apply) contains 12 fixture messages + 1 coaching DM formatted for paste. Open #ctx-demo, paste the messages one per line from your own account (not the bot — live-one-shot.ts filters m.author.bot). Paste the coaching line into #ctx-coaching. You're now ready for:

bun run live:one-shot             # dry-run plan, no network
bun run live:one-shot --execute   # one capped MA pass; no digest send
bun run live:one-shot --execute --send   # posts digest to #ctx-digest

What didn't ship (honesty)

Per LEARNINGS.md and THESIS.md scope discipline:

  • Production live Discord over the existing discord-mcp server — the adapter interface is in src/lib/discord.ts; fixture mode is the deterministic evals path. bun run live:one-shot is a guarded REST sidecar for a single test-channel smoke, not the production adapter.
  • Vision and PDF passthrough in the canonical eval set — the renderer accepts image and document content blocks via the MA user-message API (src/lib/multimodal.ts); bun run live:one-shot exercises both with assets in fixtures/assets/. The locked n=50 fixture is text-only by design, so the eval-gate metrics don't credit multimodal substrate work.
  • Provisioner for create_agent specs — state/agent-registry.json records intents today; the provisioner that reads specs and calls client.beta.agents.create + wires session resources + vault is a separate post-submission step.
  • Slack / iMessage / email adapters — interface shape only, no code.
  • Dashboard UI — CLI-only surfaces. (Landing canvas in apps/web/ is a separate target.)

Status

Hackathon submission — Apr 26 8pm EST.

Author

Built by Arth Tyagi. OSS background: wav0 (Vercel OSS Fall Cohort, ElevenLabs Startup Grant), devclad (Thiel Fellowship R1, Techstars Final Round), onloop (top 15 of 187 at Opencode Hackathon India).

License

MIT
