screencast

Narrated product-demo videos as code.

You write a script that drives the screen and emits beat-timestamped checkpoints. You write the narration as plain text, one sentence per beat. The tool clones a voice from a short reference clip, synthesizes each sentence, snaps each one to its beat (auto-spacing so sentences never overlap), muxes audio onto the recording, and trims/speeds the result.

The point: visuals are the source of truth for timing, and the audio adapts. Re-recording a 60-second demo is cheap; re-scrubbing a timeline to fix one mispronounced word is not. Change one sentence in narration.txt, run screencast build, and you have a fresh video in about 10 seconds.

Why this exists

Other tools cover pieces of this. None cover the whole loop:

VHS — terminal-only, no voice
Remotion — synthetic video, can't drive a real app
Descript — voice cloning + timeline, but record-first / scrub-to-edit
Playwright video + ElevenLabs + ffmpeg — what screencast glues together

The novelty is the inversion: code-as-driver, narration follows, alignment is automatic. You never open a video editor.

How it works

your driver script        narration.txt          voice ref clip
       │                       │                       │
       ▼                       ▼                       ▼
   ffmpeg capture        per-sentence TTS       (cached, F5-TTS)
       │                       │
       │  beats.jsonl          │  segments/seg_N.wav
       ▼                       ▼
       └────────► no-overlap aligner ───────► narration.wav
                          │
                          └─► mux + trim + atempo ─► final.mov

Driver prints [t=12.34s] beat-name lines to stderr while typing/clicking. Anything that can drive a UI works (Playwright, AppleScript, manual). Recording starts an ffmpeg avfoundation capture; beat lines are scraped into beats.jsonl.
Synth synthesizes each sentence in narration.txt to its own WAV using the voice clone backend. Cached by sentence text — unchanged sentences are not re-synthesized.
Align computes start times per sentence: actual_start = max(beat_time, previous_segment_end + gap). No overlap, ever. Output: one mixed-down WAV.
Build muxes audio onto video without -shortest (so video is never truncated), then optionally trims a leading/trailing slice and applies atempo for a final time scaling.
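
The Align rule above is small enough to sketch directly. This is an illustration under assumptions, not the tool's actual solver: the function name alignSegments, the { beatTime, duration } shape, and the 0.25 s default gap are all hypothetical.

```javascript
// Hypothetical helper: names and data shapes are illustrative, not the tool's API.
// Applies actual_start = max(beat_time, previous_segment_end + gap) in order.
function alignSegments(segments, gap = 0.25) {
  let previousEnd = -Infinity; // nothing precedes the first sentence
  return segments.map(({ beatTime, duration }) => {
    const start = Math.max(beatTime, previousEnd + gap);
    previousEnd = start + duration;
    return { start, end: previousEnd };
  });
}

const aligned = alignSegments([
  { beatTime: 0.0, duration: 3.0 },  // starts on its beat
  { beatTime: 2.0, duration: 1.5 },  // beat falls inside sentence 1, so it is pushed back
  { beatTime: 10.0, duration: 2.0 }, // plenty of room, starts on its beat
]);
console.log(aligned.map((s) => s.start)); // [ 0, 3.25, 10 ]
```

A late sentence only delays its successors as far as needed: once a beat arrives after the previous sentence has ended, the narration is back on the visual timeline.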

Install

git clone <this repo> screencast
cd screencast
npm install        # for the driver runner (playwright peer dep optional)

You also need:

ffmpeg with avfoundation (macOS) or x11grab (Linux) input
F5-TTS (default voice backend) — see voice/README.md
cliclick if your driver uses cliclick for mouse motion (macOS)

Quickstart

# 1. Drop a 30-60s clean recording of yourself in voice/ref.wav and the
#    transcript in voice/ref.txt
# 2. Write your driver — see examples/coffee-page/record.mjs
# 3. Write narration.txt — one line per sentence, format:
#       beat-name | sentence text
# 4. Build
screencast record --driver examples/coffee-page/record.mjs --output demo.mov
screencast build --plan examples/coffee-page/narration.txt \
                 --video demo.mov \
                 --beats beats.jsonl \
                 --voice voice/ref.wav \
                 --voice-text voice/ref.txt \
                 --output final.mov \
                 --trim 1.5,4.0 --speed 1.08
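
To estimate what the last two flags do to the running time, the arithmetic can be sketched as below. This assumes --trim takes "lead,trail" as seconds removed from each end and --speed is a playback multiplier; the README does not spell out either, so check the CLI help before relying on it. The function name planTrimAndSpeed is hypothetical. setpts and atempo are standard ffmpeg filters, but whether the tool uses exactly these strings is also an assumption.

```javascript
// Assumption: --trim "lead,trail" removes that many seconds from each end,
// and --speed scales playback. Neither is confirmed by the README.
function planTrimAndSpeed(durationSec, trim = "0,0", speed = 1) {
  const [lead, trail] = trim.split(",").map(Number);
  const keptSec = durationSec - lead - trail;
  if (!(keptSec > 0)) throw new Error("Trim removes the whole video");
  return {
    keptSec,
    finalSec: keptSec / speed,
    videoFilter: `setpts=PTS/${speed}`, // ffmpeg: compress video timestamps
    audioFilter: `atempo=${speed}`, // ffmpeg: change audio tempo, keep pitch
  };
}

const trimPlan = planTrimAndSpeed(60, "1.5,4.0", 1.08);
console.log(trimPlan.keptSec, trimPlan.finalSec.toFixed(2));
```

Under those assumptions, a 60-second recording keeps 54.5 seconds after trimming and plays back in roughly 50.5 seconds.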

Anatomy of a driver

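// record.mjs, invoked by `screencast record`. Anything goes; just print
// `[t=X.XXs] beat-name` lines to stderr at each visual checkpoint.
import { chromium } from "playwright";
import { setTimeout as sleep } from "node:timers/promises";

const t0 = Date.now();
const beat = (label) => console.error(`[t=${((Date.now() - t0) / 1000).toFixed(2)}s] ${label}`);

// Placeholder setup: DEMO_URL and the localhost fallback are illustrative;
// point this at the app you are demoing.
const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();
await page.goto(process.env.DEMO_URL ?? "http://localhost:3000");

await page.bringToFront();
beat("intro");
await sleep(2000);

beat("first-action");
await page.click("button.add-to-cart");
// ...

await browser.close();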

The framework doesn't care what library you use: Playwright, Puppeteer, robotjs, AppleScript via osascript, or a literal "press start, then press these keys" checklist for a human operator. All it needs is [t=...] lines on stderr while the screen recording runs.
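
For a sense of how little the recorder needs from a driver, here is a minimal sketch of scraping beat lines out of stderr text. The regular expression follows the [t=X.XXs] beat-name format shown above; the { t, name } record shape is a guess at what beats.jsonl might hold, and parseBeatLine and toJsonl are hypothetical names.

```javascript
// Matches lines such as "[t=12.34s] beat-name"; anything else is ignored.
const BEAT_LINE = /^\[t=(\d+(?:\.\d+)?)s\]\s+(\S.*)$/;

// Hypothetical helper: returns { t, name } for a beat line, or null otherwise.
function parseBeatLine(line) {
  const match = BEAT_LINE.exec(line.trim());
  if (!match) return null;
  return { t: Number(match[1]), name: match[2] };
}

// Turns raw stderr text into JSON Lines, one beat per line.
// The field names are an assumption about the beats.jsonl schema.
function toJsonl(stderrText) {
  return stderrText
    .split("\n")
    .map(parseBeatLine)
    .filter((beat) => beat !== null)
    .map((beat) => JSON.stringify(beat))
    .join("\n");
}

const sampleStderr = "[t=0.00s] intro\nsome unrelated log line\n[t=2.01s] first-action";
console.log(toJsonl(sampleStderr));
```

Because unmatched lines are dropped, a driver can keep logging its own diagnostics to stderr without polluting the beat list.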

Anatomy of a narration plan

intro          | This is Acme Coffee, the fastest way to order beans.
first-action   | I tap "Add to cart" and the cart updates instantly.
shipping       | Shipping address autofills from my saved profile.
checkout       | Checkout is one tap.

beat-name matches the names your driver emitted. The aligner reads beats.jsonl and uses the timestamp of intro as the desired start of sentence 1, etc.

If you'd rather hand-pin times, write a time in seconds in place of the beat name, for example 12.5 | sentence. Whole or fractional seconds both work.
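
The plan format is simple enough to parse by hand. The sketch below shows one way to read it and resolve each line to a desired start time, treating a purely numeric key as pinned seconds. It is an illustration only: parsePlan, resolveTimes, and the entry shapes are hypothetical, and the real parser may handle edge cases differently.

```javascript
// Hypothetical parser for "key | sentence" lines. The key is either a beat
// name or a number of seconds; only the first "|" is treated as the separator.
function parsePlan(planText) {
  return planText
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const bar = line.indexOf("|");
      if (bar === -1) throw new Error(`Missing "|" in plan line: ${line}`);
      const key = line.slice(0, bar).trim();
      const text = line.slice(bar + 1).trim();
      const isTime = /^\d+(\.\d+)?$/.test(key);
      return isTime ? { time: Number(key), text } : { beat: key, text };
    });
}

// Looks up beat names in a Map of name -> seconds; pinned times pass through.
function resolveTimes(plan, beatTimes) {
  return plan.map((entry) => {
    if ("time" in entry) return { start: entry.time, text: entry.text };
    if (!beatTimes.has(entry.beat)) throw new Error(`Unknown beat: ${entry.beat}`);
    return { start: beatTimes.get(entry.beat), text: entry.text };
  });
}

const plan = parsePlan("intro | This is Acme Coffee.\n12.5 | Checkout is one tap.");
const resolved = resolveTimes(plan, new Map([["intro", 0.4]]));
console.log(resolved);
```

One consequence of this scheme is that a beat named with digits only, such as 42, would be read as a time, so descriptive beat names are the safer choice.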

Design notes

See docs/design.md for the longer story:

Why per-sentence synthesis beats slicing one big WAV
The no-overlap solver
Why caching by sentence text matters when iterating
Why -shortest will burn you
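
As a taste of the caching point, here is a minimal in-memory sketch of keying synthesis on sentence text, so repeated or unchanged sentences are synthesized once. The real cache presumably persists between runs; this version does not, and makeCachedSynth and the stand-in synthesize callback are hypothetical.

```javascript
// Hypothetical wrapper: `synthesize` stands in for whatever TTS call you use.
function makeCachedSynth(synthesize) {
  const cache = new Map(); // sentence text -> synthesized result
  return function synth(sentence) {
    const key = sentence.trim();
    if (!cache.has(key)) cache.set(key, synthesize(key));
    return cache.get(key);
  };
}

let synthCalls = 0;
const cachedSynth = makeCachedSynth((text) => {
  synthCalls += 1;
  return `wav-for:${text}`; // placeholder for real audio data
});

[
  "Checkout is one tap.",
  "Shipping address autofills from my saved profile.",
  "Checkout is one tap.", // unchanged sentence: served from the cache
].forEach((sentence) => cachedSynth(sentence));
console.log(synthCalls); // 2
```

Editing one sentence changes only that sentence's key, which is why a rebuild after a small narration fix stays fast.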

License

MIT
