Browser-based voice drill app that trains fast staff/system-design interview
reflexes. Implementation of the spec in LOCAL.md (gitignored).
The voice agent runs on OpenAI GPT Realtime over WebRTC. The backend owns the curriculum, rotation, attempts, grading, and weakness state — the model is the voice/interview surface, not the brain (LOCAL.md §18).
For onboarding (architecture, recipes, the test pyramid in detail), see
CONTRIBUTING.md. For the build history grouped by
LOCAL.md milestone, see CHANGELOG.md.
| LOCAL.md section | Status |
|---|---|
| §1 Goal · §2 Architecture | ✓ (SQLite swap for Postgres in MVP — schema is portable) |
| §3 Realtime WebRTC (Option A ephemeral token) | ✓ |
| §4 Session config (gpt-realtime-2, reasoning, voice, tools, prompt id) | ✓ |
| §5 Core product flow | ✓ |
| §6 Tool/function interface (get_next_drill, submit_answer_transcript, grade_attempt, save_generated_cards, get_user_skill_summary, end_session_summary) | ✓ all 6 wired + dispatched + smoke-verified |
| §7 Data model | ✓ |
| §8 Rotation engine | ✓ with separate mock_interview formula |
| §9 Question generation (Layer 1 YAML, Layer 2 templates, Layer 3 LLM drafts + activation flow) | ✓ |
| §10 Grading (rubric-first JSON, LLM + offline) | ✓ |
| §11 Voice session behavior | ✓ |
| §12 Backend endpoints | ✓ (see API table) |
| §13 Frontend screens (MVP + admin: drill browse, rubric editor, test-grade) | ✓ |
| §14 Prompt skeleton | ✓ (seeds/realtime-prompt.md) |
| §15 MVP 1 (50–100 drills, rotation, transcripts, grading, history) | ✓ 55 active drills, 17 topics |
| §15 MVP 2 (cards, weak dashboard, templates, Anki CSV, rubric editor) | ✓ |
| §15 MVP 3 (spaced repetition, skill graph, mock interview, pressure mode, compare attempts) | ✓ |
| §16 Seed format · §17 Engineering decisions · §18 Non-negotiable | ✓ (autonomy verified by smoke:realtime:loop) |
Explicitly out of MVP scope per LOCAL.md §15: payments, mobile app, Anki
sync, calendar scheduling, multi-user admin. Documented choice: SQLite
instead of Postgres for MVP — schema is portable, migration file is the
swap point. See docs/POSTGRES_MIGRATION.md
for the exact swap path (docker-compose.yml + migrations/postgres.sql).
apps/
  backend/                       Express + TypeScript + better-sqlite3
    src/
      server.ts                  entry; createApp() factory used by route tests
      config.ts                  env + paths
      drill-seed-schema.ts       Zod schema (LOCAL.md §16); shared by seed/import/verify
      verify-drills.ts           CLI lint: schema + duplicate ids + quality warnings
      gen-drills.ts              LLM Layer-3 draft generator
      seed.ts                    YAML → DB loader + importDrillsFromYaml
      seed-templates.ts          Layer-2 template expander
      db/
        migrations.ts            SQLite schema (mirrors LOCAL.md §7)
        repo.ts                  drills, sessions, attempts, events, cards,
                                 skillState, usageEvents
      engines/
        rotation.ts              LOCAL.md §8 scoring + mock_interview variant
        grading.ts               LOCAL.md §10 rubric grading (LLM + offline)
      services/
        realtime.ts              OpenAI Realtime client_secret + tool defs
        llm.ts                   OpenAI SDK wrapper
      resources/                 optional resource → drill draft pipeline (CLIs)
      routes/index.ts            REST API (LOCAL.md §12)
    migrations/postgres.sql      Postgres-flavored schema (doc; not the runtime)
    seeds/drills/*.yaml          canonical drill bank (LOCAL.md §16 format)
    seeds/templates/*.yaml       Layer-2 templates
    seeds/realtime-prompt.md     drill-coach agent prompt (paste into Playground)
  frontend/                      React + Vite + TypeScript
    src/
      App.tsx                    drill UI, history, events timeline, admin panels
      useRealtime.ts             WebRTC + tool dispatch (LOCAL.md §3 / §6)
      api.ts                     fetch wrapper for /api/* (typed clients)
scripts/
  smoke-all.mjs                  every-layer composite + pass/fail table
  realtime-webrtc-smoke.mjs      Playwright + fake-mic realtime smoke harness
  drill-loop-smoke.mjs           offline REST loop smoke
  drill-loop-browser-smoke.mjs   offline Playwright UI smoke
  dev-reset.mjs                  wipe + reseed local SQLite
  doctor.mjs                     pnpm dev:doctor: environment diagnostic
Requires Node 22+, pnpm 10+, and an OpenAI API key with Realtime access.
pnpm install
cp apps/backend/.env.example .env # or place .env at the repo root
# edit .env: set OPENAI_API_KEY
pnpm dev # starts backend (4000) + frontend (5173)
Open http://localhost:5173.
Single commands:
pnpm dev:backend # tsx watch on src/server.ts
pnpm dev:frontend # vite
pnpm --filter @drill/backend seed # re-seed drills from YAML
pnpm --filter @drill/backend seed:templates # expand Layer-2 templates
pnpm --filter @drill/backend gen:drills -- --topic X --count N # LLM Layer-3 drafts
pnpm dev:doctor # environment diagnostic (Node, pnpm, env, sqlite, ports…)
pnpm dev:reset # wipe local SQLite + reseed (refuses while dev is running)
pnpm verify:drills # lint seeds/drills/*.yaml against the schema
pnpm check # build + tests + offline smokes (legacy alias)
pnpm smoke:all # CI gate (with --offline-only) and pre-release every-layer check (~5 min with OPENAI_API_KEY)

| Variable | Default | Notes |
|---|---|---|
| PORT | 4000 | backend port |
| DATABASE_PATH | apps/backend/data/drill.db | SQLite file |
| OPENAI_API_KEY | (none) | required for realtime + LLM grading |
| OPENAI_REALTIME_MODEL | gpt-realtime-2 | LOCAL.md §17 default |
| OPENAI_REALTIME_TRANSCRIPTION_MODEL | gpt-4o-mini-transcribe | ASR model for user audio transcript events |
| OPENAI_REALTIME_TRANSCRIPTION_LANGUAGE | (none) | optional language hint, e.g. en |
| OPENAI_GRADING_MODEL | gpt-4.1-mini | text grading after attempt |
| OPENROUTER_API_KEY | (none) | optional; enables shadow grader benchmarking only |
| OPENROUTER_BASE_URL | https://openrouter.ai/api/v1 | OpenAI-compatible OpenRouter endpoint |
| OPENROUTER_MODEL_TTL_MS | 600000 | cache TTL for OpenRouter model list |
| OPENROUTER_COOLDOWN_MS | 600000 | temporary skip window for unavailable/rate-limited free models |
| OPENROUTER_TIMEOUT_MS | 20000 | per-model shadow grading timeout |
| OPENAI_REALTIME_PROMPT_ID | (none) | optional Playground prompt id; when set, the backend sends prompt: { id } instead of inlining DRILL_COACH_INSTRUCTIONS |
| OPENAI_REALTIME_PROMPT_VERSION | (none) | optional version pin for the Playground prompt above; bump after iterating on seeds/realtime-prompt.md |
| OPENAI_REALTIME_VOICE_SPEED | 1.25 | clamped to [0.25, 1.5]; higher = faster speech |
| OPENAI_REALTIME_TOKEN_ATTEMPTS | 3 | retry budget for the client_secrets mint on retryable upstream errors |
| REALTIME_VOICE | marin | voice id |
| FRONTEND_ORIGIN | http://localhost:5173 | CORS allowlist |
| USE_OFFLINE_GRADER | 0 | 1 → deterministic keyword grader (no API call) |
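
As a rough illustration of two rows in the table above, the sketch below parses OPENAI_REALTIME_VOICE_SPEED with its documented default and clamp range, and decides when the offline grader applies. The function names and the fall-back-on-invalid-input behavior are assumptions for this example, not taken from config.ts.

```typescript
// Hypothetical helper: read OPENAI_REALTIME_VOICE_SPEED, defaulting to 1.25
// and clamping to the documented range [0.25, 1.5].
function parseVoiceSpeed(raw: string | undefined): number {
  const DEFAULT_SPEED = 1.25;
  if (raw === undefined || raw.trim() === "") return DEFAULT_SPEED;
  const parsed = Number(raw);
  // Assumption: unparseable input falls back to the default.
  if (!Number.isFinite(parsed)) return DEFAULT_SPEED;
  return Math.min(1.5, Math.max(0.25, parsed));
}

// Hypothetical helper: the offline grader is selected by USE_OFFLINE_GRADER=1
// or by a missing OPENAI_API_KEY.
function shouldUseOfflineGrader(env: Record<string, string | undefined>): boolean {
  return env.USE_OFFLINE_GRADER === "1" || !env.OPENAI_API_KEY;
}
```

With these definitions, a value such as 3 would be clamped down to 1.5, and an empty environment would select the offline grader.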
- The browser asks the backend for an ephemeral token.
- The backend calls POST https://api.openai.com/v1/realtime/client_secrets with the drill-coach system instructions, voice, and reasoning effort.
- The browser builds an RTCPeerConnection, adds microphone audio, opens a data channel, and posts its SDP offer directly to POST /v1/realtime/calls?model=... with the ephemeral token. The OpenAI API key never leaves the backend.
- Audio streams over WebRTC; the model's instructions force the strict interview-drill style (LOCAL.md §11, §14).
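
The SDP step in the flow above can be sketched as a small pure helper that builds the request the browser would send. The helper name is hypothetical, and the Bearer header and application/sdp content type reflect my understanding of the Realtime WebRTC handshake rather than anything confirmed by useRealtime.ts.

```typescript
interface SdpOfferRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: describe the SDP offer POST without performing it,
// so the shape can be inspected or tested outside a browser.
function buildSdpOfferRequest(model: string, ephemeralToken: string, sdp: string): SdpOfferRequest {
  return {
    url: `https://api.openai.com/v1/realtime/calls?model=${encodeURIComponent(model)}`,
    init: {
      method: "POST",
      headers: {
        // Only the short-lived token is used here; the API key stays on the backend.
        Authorization: `Bearer ${ephemeralToken}`,
        "Content-Type": "application/sdp", // assumption: raw SDP body
      },
      body: sdp,
    },
  };
}

// In the browser this would sit between createOffer() and setRemoteDescription():
//   const offer = await pc.createOffer();
//   await pc.setLocalDescription(offer);
//   const { url, init } = buildSdpOfferRequest(model, token, offer.sdp ?? "");
//   const answerSdp = await (await fetch(url, init)).text();
//   await pc.setRemoteDescription({ type: "answer", sdp: answerSdp });
```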
- POST /api/drill-sessions → session id.
- POST /api/drill-sessions/:id/next runs the rotation engine and returns a drill plus a pre-created attempt_id.
- While voice is live the frontend pushes the drill text into the agent's conversation, and the agent asks it aloud.
- The user speaks; transcription comes back over the data channel.
- On Submit, the frontend sends the transcript + duration to POST /api/drill-attempts/:id/grade. The grader runs rubric-first scoring, persists the attempt, updates user_skill_state, and inserts generated_cards.
apps/backend/src/engines/rotation.ts implements the full weighted score:
0.35 * due
+ 0.25 * weakness
+ 0.15 * novelty
+ 0.10 * difficultyFit
+ 0.10 * topicBalance
+ 0.05 * trapDiversity
− 0.50 * recentRepeatPenalty
− 0.30 * exactRepeatPenalty
Top 5 candidates are weighted-random-picked so the app is not predictable.
mock_interview mode swaps the formula to prefer variety and high
difficulty over due/weakness:
0.40 * novelty + 0.20 * topicBalance + 0.20 * difficulty
+ 0.10 * weakness + 0.05 * due + 0.05 * trapDiversity
- 0.60 * recentRepeatPenalty - 0.40 * exactRepeatPenalty
and also pre-filters the pool to difficulty ≥ 3 plus drills the user hasn't attempted recently, so a "mock interview" session feels different from a study session.
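
The two formulas and the top-5 pick can be sketched as below. The weights come from the formulas above; the interface, the function names, the assumption that every signal is normalized to [0, 1], and the way scores are turned into pick weights are all illustrative guesses, not the contents of rotation.ts.

```typescript
// Assumption: every signal is already normalized to [0, 1].
interface DrillSignals {
  due: number; weakness: number; novelty: number;
  difficultyFit: number; difficulty: number;
  topicBalance: number; trapDiversity: number;
  recentRepeatPenalty: number; exactRepeatPenalty: number;
}

function studyScore(s: DrillSignals): number {
  return 0.35 * s.due + 0.25 * s.weakness + 0.15 * s.novelty
    + 0.10 * s.difficultyFit + 0.10 * s.topicBalance + 0.05 * s.trapDiversity
    - 0.50 * s.recentRepeatPenalty - 0.30 * s.exactRepeatPenalty;
}

function mockInterviewScore(s: DrillSignals): number {
  return 0.40 * s.novelty + 0.20 * s.topicBalance + 0.20 * s.difficulty
    + 0.10 * s.weakness + 0.05 * s.due + 0.05 * s.trapDiversity
    - 0.60 * s.recentRepeatPenalty - 0.40 * s.exactRepeatPenalty;
}

// Weighted-random pick among the top N candidates. Scores may be negative,
// so they are shifted to strictly positive weights (an assumption).
function pickWeighted<T>(
  candidates: { item: T; score: number }[],
  topN = 5,
  rand: () => number = Math.random,
): T {
  if (candidates.length === 0) throw new Error("no candidates");
  const top = [...candidates].sort((a, b) => b.score - a.score).slice(0, topN);
  const min = Math.min(...top.map((c) => c.score));
  const weighted = top.map((c) => ({ item: c.item, weight: c.score - min + 0.01 }));
  const total = weighted.reduce((sum, w) => sum + w.weight, 0);
  let r = rand() * total;
  let chosen: T | undefined;
  for (const w of weighted) {
    chosen = w.item;
    r -= w.weight;
    if (r < 0) break;
  }
  return chosen as T;
}
```

Injecting rand keeps the picker deterministic in tests while production code can rely on Math.random.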
apps/backend/src/engines/grading.ts has two graders:
- LLM grader (default): calls OPENAI_GRADING_MODEL with the rubric and transcript and parses JSON back into the score breakdown.
- Offline grader: deterministic keyword matching for tests / no-API environments. Triggered by USE_OFFLINE_GRADER=1 or absence of OPENAI_API_KEY.
Final score formula:
0.65 * must_have_coverage
+ 0.20 * answer_clarity
+ 0.10 * tradeoff_coverage
+ 0.05 * speed_score
− red_flag_penalty
Verdict: >= 0.80 pass, 0.60–0.79 borderline, < 0.60 fail.
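
A minimal sketch of the formula and verdict thresholds above. The field names and the clamp to [0, 1] are assumptions for illustration; grading.ts may structure this differently.

```typescript
interface ScoreBreakdown {
  mustHaveCoverage: number; // each component assumed to be in [0, 1]
  answerClarity: number;
  tradeoffCoverage: number;
  speedScore: number;
  redFlagPenalty: number;
}

type Verdict = "pass" | "borderline" | "fail";

function finalScore(b: ScoreBreakdown): number {
  const raw = 0.65 * b.mustHaveCoverage
    + 0.20 * b.answerClarity
    + 0.10 * b.tradeoffCoverage
    + 0.05 * b.speedScore
    - b.redFlagPenalty;
  // Assumption: keep the result inside [0, 1] after the penalty is applied.
  return Math.min(1, Math.max(0, raw));
}

function verdictFor(score: number): Verdict {
  if (score >= 0.8) return "pass";
  if (score >= 0.6) return "borderline";
  return "fail";
}
```

For example, full must-have coverage with middling clarity and no red flags would land at 0.65 + 0.20 * clarity + ..., so clarity and tradeoffs decide whether the attempt clears the 0.80 pass line.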
| Method | Path | Purpose |
|---|---|---|
| GET | /api/health | drill count + OpenAI configured flag |
| POST | /api/realtime/token | mint ephemeral Realtime client secret |
| POST | /api/drill-sessions | start a drill session |
| POST | /api/drill-sessions/:id/next | pick next drill via rotation |
| POST | /api/drill-sessions/:id/retry | force a fresh attempt on a specific drill (bypasses rotation) |
| POST | /api/drill-attempts/:id/transcript | save transcript + duration |
| POST | /api/drill-attempts/:id/grade | grade an attempt (LLM or offline) |
| GET | /api/drill-attempts/:id | full attempt detail (transcript, missed points, ideal answer, cards); owner-scoped |
| POST | /api/drill-attempts/:id/evaluate | run OpenRouter shadow grading; does not mutate live score/cards |
| GET | /api/drill-attempts/:id/evaluations | list stored shadow grader evaluations for an attempt |
| GET | /api/cards/due | due review cards + total/due stats |
| POST | /api/cards/:id/review | record SM-2-lite review (quality 0/1) |
| GET | /api/cards/export.csv | Anki-importable CSV (front,back,tags) |
| GET | /api/progress | per-topic weakness state |
| GET | /api/progress/drills | per-drill performance (attempts, avg/best/worst score, last verdict) sorted by avg ascending |
| GET | /api/drills | drill bank browse (active only) |
| GET | /api/drills/drafts | Layer-3 LLM drafts (is_active=false) |
| GET | /api/drills/export.yaml | dump active drills as YAML (seed format); ?include_drafts=1 to include drafts |
| POST | /api/drills/import | upsert drills from YAML body (or { yaml: "…" }); 207 with per-item errors when partial |
| GET | /api/stats | drill bank distribution: active vs drafts, by topic / difficulty / trap_type |
| GET | /api/sessions | recent sessions for the user, newest first, with rollup stats (?limit=N, default 25) |
| POST | /api/drills/:id/activate | promote a draft into the rotation pool |
| POST | /api/drills/:id/deactivate | pull a drill back out of the rotation pool (mirror of activate) |
| PATCH | /api/drills/:id | edit rubric / canonical answer / difficulty / question text |
| POST | /api/drills/:id/test-grade | dry-run grader against a sample answer (no persist) |
| DELETE | /api/drills/:id | delete a draft (active drills are protected) |
| POST | /api/realtime/tool-call | dispatch for the voice agent's tool calls |
| POST | /api/realtime/usage | record token usage from a realtime response (dedupes by response_id) |
| GET | /api/usage/summary | aggregated token usage for the user (current session + lifetime totals) |
| GET | /api/drill-sessions/:id/summary | per-session stats (attempts, scores, topics) |
| GET | /api/drill-sessions/:id/events | audit log (LOCAL.md §7 session_events) |
| GET | /api/admin/events | admin audit trail: drill imports, draft state changes, rubric edits. Optional ?type= (CSV of drill_imported,draft_activated,draft_deactivated,draft_discarded,rubric_edited), ?since= (ISO 8601), ?actor= (x-user-id value), ?limit=N (default 100, max 500). Every payload includes actor and (where relevant) drill_id / fields_changed. |
| POST | /api/drill-sessions/:id/end | mark ended + return summary |
YAML files in apps/backend/seeds/drills/ are loaded on every server start
(drill_items.upsert so edits are picked up). Schema follows LOCAL.md §16.
To add a drill: create or edit a YAML file, run pnpm --filter @drill/backend seed,
or just restart the backend.
Templates live in apps/backend/seeds/templates/*.yaml. Each declares a
template_text, a rubric_template, a canonical_answer_template, and a
list of variants with named vars. The expander interpolates the variables
into every field and upserts concrete drill_items rows.
pnpm --filter @drill/backend seed:templates
One composite-index template currently expands to 4 variants
(orders / events / messages / invoices), all tagged with tmpl:<id> so
template-derived drills are filterable.
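
The expansion described above might look roughly like this. The {{name}} placeholder syntax, the simplified rubric shape (a flat list of must-have points), the variant name field, and the generated id format are assumptions made for the sketch; only the field names template_text, rubric_template, canonical_answer_template, variants, vars, and the tmpl:<id> tag come from the description.

```typescript
interface DrillTemplate {
  id: string;
  template_text: string;
  canonical_answer_template: string;
  rubric_template: string[]; // assumption: simplified to must-have points
  variants: { name: string; vars: Record<string, string> }[];
}

interface ExpandedDrill {
  id: string;
  question: string;
  canonical_answer: string;
  must_have: string[];
  tags: string[];
}

// Assumption: placeholders are written as {{name}}.
function interpolate(text: string, vars: Record<string, string>): string {
  return text.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, name: string) => {
    const value = vars[name];
    if (value === undefined) throw new Error(`missing template variable: ${name}`);
    return value;
  });
}

// One concrete drill per variant, each tagged tmpl:<id> so it stays filterable.
function expandTemplate(t: DrillTemplate): ExpandedDrill[] {
  return t.variants.map((v) => ({
    id: `${t.id}-${v.name}`, // hypothetical id scheme
    question: interpolate(t.template_text, v.vars),
    canonical_answer: interpolate(t.canonical_answer_template, v.vars),
    must_have: t.rubric_template.map((point) => interpolate(point, v.vars)),
    tags: [`tmpl:${t.id}`],
  }));
}
```

Throwing on a missing variable is a deliberate choice in this sketch: a silently unexpanded placeholder would otherwise end up being read aloud by the coach.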
pnpm --filter @drill/backend gen:drills -- \
  --topic caching --subtopic eviction --count 3 --difficulty 3
Inserts drills as is_active=false drafts so the rotation engine never
serves them until a human flips the bit. Tagged with gen:llm for filtering.
Uses OPENAI_GRADING_MODEL (default gpt-4.1-mini).
pnpm --filter @drill/backend bench:grader --source historical --models free-pinned --limit 25
pnpm --filter @drill/backend bench:grader --source attempt --attempt-id <id>
Requires OPENROUTER_API_KEY. Results are stored in grading_evaluations and
never overwrite the live attempt score, cards, or skill state. The default
model policy only uses OpenRouter models currently reported as zero-cost.
Review and activate drafts from the UI: click Show drafts in the header
to see every is_active=false drill with rubric preview, then Activate
to promote into the rotation pool or Discard to delete.
Pulls Markdown from GitHub repos listed in the resource manifest
(see path below), splits sections, and emits draft drills
(is_active=false). Useful for bootstrapping a topic area from a
canonical reference doc. Skips the LLM round-trip if you want
deterministic drafts to review by hand. Same activation flow as Layer 3.
Set GITHUB_TOKEN in .env to raise the GitHub API rate limit from
60/hr (anonymous) to 5000/hr (authenticated). Any repo-scope or
public_repo-scope token is enough — the pipeline only reads public
files.
# Phases: assess → extract → generate-drills → all
pnpm extract:resources -- --phase all --resource system-design-primer
pnpm extract:resources -- --phase all --limit 5 --dry-run # preview, no writes
pnpm import:resource-drafts # latest run, all resources
pnpm import:resource-drafts -- --resource system-design-primer --run 20260520T123456Z
Resource manifest lives at .agents/skills/resource-extraction/resources.json.
Artifacts land under data/resources/<slug>/<run-id>/ (gitignored): an
assessment.json, documents.jsonl, and draft_drills.yaml round-trippable
through the same Zod schema as Layer-1 seeds. import:resource-drafts
defaults to --run latest, so the common case is just running it bare.
In the drill browse panel, expand any drill to see its rubric. Two admin surfaces are wired:
- Edit rubric: opens textareas for must-have / nice-to-have / red flags / canonical short answer plus a difficulty selector. Saving issues PATCH /api/drills/:id, validates with the same Zod schema as YAML seeds, and refreshes the browse list.
- Test grade: paste a sample answer, run the grader against the current rubric, and see score + verdict + missed-points count without writing an attempt or touching skill state. Great for tuning rubrics on newly activated Layer-3 drafts.
Header Pressure ON/off toggle. When on, every drill push appends an explicit "interrupt rambling after ~10s; snap 'Default answer now.'; force at least one pressure follow-up" clause to the agent's per-drill instruction. Lets the user dial the intensity from study-buddy to drill-instructor without re-minting the realtime session.
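
A small sketch of how the pressure clause could be appended to the per-drill instruction. The clause text follows the description above; the base instruction wording and the function name are hypothetical.

```typescript
// Clause text adapted from the description above.
const PRESSURE_CLAUSE =
  "Interrupt rambling after ~10s; snap 'Default answer now.'; " +
  "force at least one pressure follow-up.";

// Hypothetical builder: the base wording is invented for this example.
function buildDrillInstruction(questionText: string, pressureOn: boolean): string {
  const base = `Ask this drill question aloud, verbatim: ${questionText}`;
  return pressureOn ? `${base}\n${PRESSURE_CLAUSE}` : base;
}
```

Because the clause rides along with each drill push, toggling it only changes the next instruction string; nothing about the realtime session itself needs to be re-minted.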
Mapped to LOCAL.md §15:
- Postgres: MVP uses SQLite. Schema is portable; the migration file is the obvious place to swap when you need multi-writer or hosted infra.
- Card-review UI for the spaced-repetition slots already in the schema.
- Layer-2 template generator (drill_templates): schema exists, no generator yet.
- Admin/content editor, payments, Anki sync, per-user auth: not in MVP.
The Realtime agent has six backend tools attached to the session config:
get_next_drill, submit_answer_transcript, grade_attempt,
save_generated_cards, get_user_skill_summary, end_session_summary.
Tool calls flow over the data channel and are dispatched via a single
backend endpoint POST /api/realtime/tool-call. The frontend hook
(useRealtime) tracks (item_id → name) pairs across
response.output_item.added and response.function_call_arguments.done,
runs the registered handler, and sends back
conversation.item.create with function_call_output plus
response.create.
App.tsx mirrors agent-driven get_next_drill and grade_attempt results
into local state so the UI follows the agent.
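
The item_id → name bookkeeping described above can be sketched as two pure pieces: a tracker that pairs the two server events, and a builder for the reply events. The event shapes are trimmed to the fields used here and reflect my reading of the Realtime event format; the function names are hypothetical, and the real hook would run the handler (for example by calling POST /api/realtime/tool-call) between the two steps.

```typescript
type ServerEvent =
  | { type: "response.output_item.added"; item: { id: string; type: string; name?: string } }
  | { type: "response.function_call_arguments.done"; item_id: string; call_id: string; arguments: string };

type ClientEvent =
  | { type: "conversation.item.create"; item: { type: "function_call_output"; call_id: string; output: string } }
  | { type: "response.create" };

interface PendingToolCall { name: string; callId: string; args: unknown }

// Remembers which function each item_id belongs to, then emits a pending
// call once the arguments for that item are complete.
function createToolCallTracker(): (event: ServerEvent) => PendingToolCall | null {
  const namesByItemId = new Map<string, string>();
  return (event) => {
    if (event.type === "response.output_item.added") {
      if (event.item.type === "function_call" && event.item.name) {
        namesByItemId.set(event.item.id, event.item.name);
      }
      return null;
    }
    const name = namesByItemId.get(event.item_id);
    if (name === undefined) return null; // arguments for an item we never saw
    namesByItemId.delete(event.item_id);
    return { name, callId: event.call_id, args: JSON.parse(event.arguments || "{}") as unknown };
  };
}

// The two events sent back over the data channel after the handler finishes.
function buildToolReply(callId: string, result: unknown): ClientEvent[] {
  return [
    { type: "conversation.item.create", item: { type: "function_call_output", call_id: callId, output: JSON.stringify(result) } },
    { type: "response.create" },
  ];
}
```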
# health
curl -s localhost:4000/api/health | jq
# start session, pick a drill, grade an answer
SID=$(curl -s -X POST localhost:4000/api/drill-sessions \
-H 'content-type: application/json' \
-d '{"mode":"db_indexes"}' | jq -r .session.id)
ATT=$(curl -s -X POST localhost:4000/api/drill-sessions/$SID/next \
-H 'content-type: application/json' -d '{}' | jq -r .drill.attempt_id)
curl -s -X POST localhost:4000/api/drill-attempts/$ATT/grade \
-H 'content-type: application/json' \
  -d '{"transcript":"composite B-tree on (category_id, price), equality then order, verify with EXPLAIN ANALYZE","duration_seconds":45}' | jq
LOCAL.md §5 / §11 call for the agent to "ask the question aloud" and the session to feel like a fast back-and-forth. The frontend gives you four visual confirmations so you can tell at a glance whether the voice loop is healthy, even with audio output muted:
- Voice-first start: clicking Start session / Next drill chains startSession → nextDrill → realtime.start in one click. No separate "Start voice" step.
- Coach (audio) transcript (data-testid="agent-transcript"): the agent's spoken transcript appears above the answer textarea as it speaks. If your speakers fail, you still see exactly what the coach said.
- Voice state badge (data-testid="voice-state"): 🔊 Coach speaking (highlighted) vs 🎤 Listening (dim), derived from response.output_audio_buffer.started / response.done events on the data channel.
- Mic meter (data-testid="mic-meter"): 5-bar VU pulled from a Web Audio AnalyserNode on the local mic track, sampled at ~10 Hz. If the bars never light up, your mic is dead.
- Voice error banner (data-testid="voice-error-banner"): when the WebRTC handshake or the ephemeral token mint fails, a red banner appears with the message and a short troubleshooting hint (mic permission · OPENAI_API_KEY on backend · HTTPS/localhost).
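
For the mic meter, one plausible way to turn AnalyserNode samples into a 5-bar reading is an RMS level mapped onto bar counts. The function names and the gain factor are assumptions; the input is assumed to be the byte time-domain data an AnalyserNode provides, where 128 represents silence.

```typescript
// RMS of byte time-domain samples (0..255, centered on 128), scaled to 0..1.
function rmsLevel(samples: Uint8Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  samples.forEach((sample) => {
    const centered = (sample - 128) / 128;
    sumSquares += centered * centered;
  });
  return Math.sqrt(sumSquares / samples.length);
}

// Map a 0..1 level onto lit bars. The gain is an assumption: normal speech
// sits well below full scale, so the level is boosted before rounding.
function barsFromLevel(level: number, barCount = 5, gain = 4): number {
  return Math.min(barCount, Math.max(0, Math.round(level * gain * barCount)));
}

// In the browser, a ~10 Hz sampler could call these roughly as:
//   const buf = new Uint8Array(analyser.fftSize);
//   setInterval(() => {
//     analyser.getByteTimeDomainData(buf);
//     render(barsFromLevel(rmsLevel(buf)));
//   }, 100);
```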
The drill loop is keyboard-first so you can stay typing/talking without reaching for the mouse:
| Keys | Action |
|---|---|
| ⌘ / Ctrl + Enter | Submit the typed answer (works from inside the textarea) |
| n | Next drill |
| Shift + R | Retry the current drill (creates a fresh attempt on the same drill) |
| e | End session |
| p | Toggle pressure mode |
Single-key shortcuts (n, e, p) are suppressed while any input,
textarea, select, or contentEditable element has focus, so typing the
answer never triggers them. The hint row sits right under the action
buttons (data-testid="shortcuts-hint").
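
The suppression rule above can be sketched as a pure resolver over a key press and the focused element. The names and the structural target type are hypothetical, and treating Shift + R like the single-key shortcuts (suppressed while typing) is an assumption, since only n, e, and p are called out explicitly.

```typescript
type ShortcutAction = "submit" | "next" | "retry" | "end" | "togglePressure";

// Minimal structural stand-in for the focused DOM element.
interface FocusTargetLike { tagName?: string; isContentEditable?: boolean }

function isTypingTarget(target: FocusTargetLike | null): boolean {
  if (!target) return false;
  const tag = (target.tagName ?? "").toUpperCase();
  return tag === "INPUT" || tag === "TEXTAREA" || tag === "SELECT" || target.isContentEditable === true;
}

function resolveShortcut(
  key: string,
  mods: { shift: boolean; metaOrCtrl: boolean },
  target: FocusTargetLike | null,
): ShortcutAction | null {
  // Cmd/Ctrl + Enter submits even from inside the textarea.
  if (mods.metaOrCtrl && key === "Enter") return "submit";
  if (mods.metaOrCtrl || isTypingTarget(target)) return null;
  const lower = key.toLowerCase();
  if (mods.shift) return lower === "r" ? "retry" : null;
  if (lower === "n") return "next";
  if (lower === "e") return "end";
  if (lower === "p") return "togglePressure";
  return null;
}
```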
If clicking Start voice connects but you hear nothing, in order of likelihood:
- Browser autoplay is silently rejecting the <audio> element's play(). Click anywhere on the page after voice connects; that counts as a user gesture. If it still won't play, open dev tools and check for NotAllowedError.
- Your output device is muted or routed somewhere unexpected. The Coach (audio) transcript will still update; if it does, audio is arriving, you just can't hear it.
- Stale dev server. Use one Vite instance; if you have multiple tabs on different ports (5173, 5174, …) you may be running older code that pre-dates the autonomy nudge or voice-first flow.
Six layers, fastest to slowest:
| Layer | Command | What it proves |
|---|---|---|
| Unit + route tests | pnpm -r test | rotation engine, offline grader, and HTTP routes (session ownership, draft activation, dry-run grader, tool-call dispatch, rubric editing). 20 tests. Runs Express in-process on an ephemeral port, no network. |
| REST drill loop | pnpm smoke:drill-loop | end-to-end loop over HTTP for N drills with the offline grader; verifies rotation produces variety, weakness state moves, mixed verdicts. Boots its own backend on an isolated DB. |
| Browser drill loop | pnpm smoke:browser | exercises App.tsx in Chromium (no mic): Start → type answer → Submit → grade panel renders → Next drill → question changes. |
| Realtime WebRTC | pnpm smoke:realtime | full voice path: launches Chromium with --use-file-for-fake-audio-capture against a Mumbli WAV, asserts the model connects, ASR transcript appears, and at least 1 backend tool gets dispatched. Requires OPENAI_API_KEY. |
| Realtime multi-turn | pnpm smoke:realtime:multi | same harness, longer wait (~90 s), asserts ≥ 2 distinct tool calls; proves the agent runs the actual drill loop (e.g. submit_answer_transcript then grade_attempt) rather than just calling get_next_drill once and stopping. |
| Realtime autonomy | pnpm smoke:realtime:loop | strictest: waits up to ~2 min, asserts ≥ 3 total tool calls including get_next_drill. Proves the agent calls submit_answer_transcript → grade_attempt → get_next_drill autonomously. Verifies LOCAL.md §18 ("backend owns curriculum, model drives it"). |
Run everything offline in one shot — exactly what CI runs:
pnpm smoke:all --offline-only # doctor + verify:drills --strict + build + tests + REST smoke + browser smoke
Or run all smokes (offline + realtime) with a pass/fail summary:
pnpm smoke:all # 10 layers, ~5 minutes; needs OPENAI_API_KEY
pnpm smoke:all --offline-only # skip the 4 realtime smokes
Sample output:
▶ dev:doctor…✓ dev:doctor (0.8s)
▶ verify:drills --strict…✓ verify:drills --strict (0.5s)
▶ build…✓ build (1.4s)
▶ test (backend unit + route, frontend pure)…✓ test (backend unit + route, frontend pure) (2.0s)
▶ smoke:drill-loop…✓ smoke:drill-loop (1.3s)
▶ smoke:browser…✓ smoke:browser (2.7s)
▶ smoke:realtime…✓ smoke:realtime (62.0s)
▶ smoke:realtime:multi…✓ smoke:realtime:multi (67.8s)
▶ smoke:realtime:loop…✓ smoke:realtime:loop (73.9s)
▶ smoke:realtime:end…✓ smoke:realtime:end (77.2s)
10/10 passed
You can run the drill linter on its own:
pnpm verify:drills
# verify:drills OK — 51 drills validated across 21 files
Same command CI uses. The realtime smokes are separate because they need
OPENAI_API_KEY:
pnpm -r test # unit + route tests
pnpm smoke:drill-loop # offline REST loop
pnpm smoke:browser # offline browser loop
pnpm smoke:realtime # online realtime (>=1 tool call)
pnpm smoke:realtime:multi # online (>=2 distinct tool names)
pnpm smoke:realtime:loop # online (>=3 calls incl. get_next_drill)
pnpm smoke:realtime:end # online ("Stop" → end_session_summary)
# point smokes at an already-running stack instead of starting one
USE_EXISTING_BACKEND=1 USE_EXISTING_FRONTEND=1 pnpm smoke:browser
# specific Mumbli WAV (otherwise picks the latest >32 KB)
REALTIME_SMOKE_AUDIO="/absolute/path/sample.wav" pnpm smoke:realtime
# show the browser
HEADLESS=0 pnpm smoke:realtime
# don't fail the realtime smoke if the agent skips tool calls
REALTIME_SMOKE_REQUIRE_TOOL=0 pnpm smoke:realtime
# how long to wait for the agent's first tool call (default 20s)
REALTIME_SMOKE_TOOL_WAIT_MS=30000 pnpm smoke:realtime
The realtime smoke output includes a screenshot path and a tail of recent
Realtime server events so you can confirm response.output_audio.done,
input_audio_buffer.speech_stopped, and transcription deltas all arrived.