Narrated product-demo videos as code.
You write a script that drives the screen and emits beat-timestamped checkpoints. You write the narration as plain text, one sentence per beat. The tool clones a voice from a short reference clip, synthesizes each sentence, snaps each one to its beat (auto-spacing so sentences never overlap), muxes audio onto the recording, and trims/speeds the result.
The point: visuals are the source of truth for timing, and audio adapts. Re-recording a 60-second demo is cheap; re-scrubbing a timeline to fix one mispronounced word is not. Change one sentence in narration.txt, run build, and you have a fresh video in 10 seconds.
Other tools cover pieces of this. None cover the whole loop:
- VHS — terminal-only, no voice
- Remotion — synthetic video, can't drive a real app
- Descript — voice cloning + timeline, but record-first / scrub-to-edit
- Playwright video + ElevenLabs + ffmpeg — what `screencast` glues together
The novelty is the inversion: code-as-driver, narration follows, alignment is automatic. You never open a video editor.
    your driver script     narration.txt     voice ref clip
            │                    │                  │
            ▼                    ▼                  ▼
      ffmpeg capture       per-sentence TTS (cached, F5-TTS)
            │                    │
            │ beats.jsonl        │ segments/seg_N.wav
            ▼                    ▼
            └────────► no-overlap aligner ───────► narration.wav
                                                        │
                                                        └─► mux + trim + atempo ─► final.mov
- Driver prints `[t=12.34s] beat-name` lines to stderr while typing/clicking. Anything that can drive a UI works (Playwright, AppleScript, manual). Recording starts an `ffmpeg` `avfoundation` capture; beat lines are scraped into `beats.jsonl`.
- Synth synthesizes each sentence in `narration.txt` to its own WAV using the voice clone backend. Cached by sentence text, so unchanged sentences are not re-synthesized.
- Align computes start times per sentence: `actual_start = max(beat_time, previous_segment_end + gap)`. No overlap, ever. Output: one mixed-down WAV.
- Build muxes audio onto video without `-shortest` (so video is never truncated), then optionally trims a leading/trailing slice and applies `atempo` for a final time scaling.
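The Align rule above can be sketched in a few lines. This is an illustration of the stated formula only, not the tool's actual code: the function name `alignSegments`, the `{ beatTime, duration }` shape, and the default `gap` of 0.25 seconds are all assumptions made for the example.

```javascript
// Sketch of the no-overlap rule:
//   actual_start = max(beat_time, previous_segment_end + gap)
// `segments` is a hypothetical shape: [{ beatTime, duration }], in seconds,
// listed in narration order. The 0.25 s default gap is an assumed value.
function alignSegments(segments, gap = 0.25) {
  let previousEnd = -Infinity; // nothing has played before the first sentence
  return segments.map(({ beatTime, duration }) => {
    const start = Math.max(beatTime, previousEnd + gap);
    previousEnd = start + duration;
    return { start, end: previousEnd };
  });
}

const placed = alignSegments([
  { beatTime: 0.0, duration: 3.0 },
  { beatTime: 2.0, duration: 1.5 }, // beat fires while sentence 1 is still playing
  { beatTime: 10.0, duration: 2.0 },
]);
console.log(placed);
```

In this example the second sentence is pushed from 2.0 s to 3.25 s because the first one is still playing, while the third starts exactly on its beat at 10.0 s.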
git clone <this repo> screencast
cd screencast
npm install # for the driver runner (playwright peer dep optional)

You also need:
- ffmpeg with `avfoundation` (macOS) or `x11grab` (Linux) input
- F5-TTS (default voice backend); see `voice/README.md`
- cliclick if your driver uses `cliclick` for mouse motion (macOS)
# 1. Drop a 30-60s clean recording of yourself in voice/ref.wav and the
# transcript in voice/ref.txt
# 2. Write your driver — see examples/coffee-page/record.mjs
# 3. Write narration.txt — one line per sentence, format:
# beat-name | sentence text
# 4. Build
screencast record --driver examples/coffee-page/record.mjs --output demo.mov
screencast build --plan examples/coffee-page/narration.txt \
--video demo.mov \
--beats beats.jsonl \
--voice voice/ref.wav \
--voice-text voice/ref.txt \
--output final.mov \
--trim 1.5,4.0 --speed 1.08

// record.mjs: invoked by `screencast record`. Anything goes; just print
// `[t=X.XXs] beat-name` lines to stderr at each visual checkpoint.
const t0 = Date.now();
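const beat = (label) => console.error(`[t=${((Date.now() - t0) / 1000).toFixed(2)}s] ${label}`);
// Helper used below. `page` is assumed to be a Playwright Page created earlier in the script.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));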
await page.bringToFront();
beat("intro");
await sleep(2000);
beat("first-action");
await page.click("button.add-to-cart");
// ...

The framework doesn't care what library you use: Playwright, Puppeteer, robotjs, AppleScript via osascript, or a literal "press start, then press these keys" checklist for a human. All it needs is `[t=...]` lines on stderr while the screen recording runs.
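To make the stderr contract concrete, here is a minimal sketch of how beat lines could be scraped into one-object-per-line records. The regular expression, the `parseBeats` name, and the `{ t, name }` field names are assumptions for illustration; the real `beats.jsonl` layout may differ.

```javascript
// Matches lines such as "[t=12.34s] beat-name"; any other stderr output is ignored.
const BEAT_LINE = /^\[t=(\d+(?:\.\d+)?)s\]\s+(\S.*)$/;

// Hypothetical helper: turn captured stderr text into an array of beats.
function parseBeats(stderrText) {
  const beats = [];
  for (const line of stderrText.split(/\r?\n/)) {
    const match = BEAT_LINE.exec(line.trim());
    if (match) beats.push({ t: Number(match[1]), name: match[2].trim() });
  }
  return beats;
}

const sample = "[t=0.00s] intro\nsome unrelated log line\n[t=2.01s] first-action\n";
const beats = parseBeats(sample);

// One JSON object per line, in the spirit of beats.jsonl (field names are assumed).
const jsonl = beats.map((b) => JSON.stringify(b)).join("\n");
console.log(jsonl);
```

Ignoring non-matching lines means the driver can keep logging freely to stderr without corrupting the beat list.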
intro | This is Acme Coffee, the fastest way to order beans.
first-action | I tap "Add to cart" and the cart updates instantly.
shipping | Shipping address autofills from my saved profile.
checkout | Checkout is one tap.
beat-name matches the names your driver emitted. The aligner reads beats.jsonl and uses the timestamp of intro as the desired start of sentence 1, etc.
If you'd rather hand-pin times, write 12.5 | sentence instead — float seconds work too.
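The two line formats (beat name or pinned seconds) could be parsed along these lines. This is a sketch under assumptions: `parsePlan`, `resolveStarts`, and the `beatTimes` lookup object are hypothetical names, and the rule "a purely numeric key is a pinned time" is inferred from the description above rather than taken from the tool.

```javascript
// Parse "key | sentence" lines; blank lines are skipped.
function parsePlan(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const sep = line.indexOf("|"); // only the first "|" separates key from text
      if (sep === -1) throw new Error(`missing "|" in line: ${line}`);
      const key = line.slice(0, sep).trim();
      const sentence = line.slice(sep + 1).trim();
      // Assumption: a purely numeric key is a hand-pinned time in seconds.
      const pinned = /^\d+(\.\d+)?$/.test(key) ? Number(key) : null;
      return { key, sentence, pinned };
    });
}

// Resolve each entry to a desired start time using a name -> seconds lookup.
function resolveStarts(plan, beatTimes) {
  return plan.map((entry) => {
    const t = entry.pinned ?? beatTimes[entry.key];
    if (t === undefined) throw new Error(`no beat named "${entry.key}"`);
    return { sentence: entry.sentence, desiredStart: t };
  });
}

const plan = parsePlan("intro | This is Acme Coffee.\n12.5 | Checkout is one tap.\n");
const starts = resolveStarts(plan, { intro: 0.4 });
console.log(starts);
```

Failing loudly on an unknown beat name is a deliberate choice in this sketch: a typo in `narration.txt` would otherwise silently drop or misplace a sentence.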
See docs/design.md for the longer story:
- Why per-sentence synthesis beats slicing one big WAV
- The no-overlap solver
- Why caching by sentence text matters when iterating
- Why `-shortest` will burn you
MIT