
chaoxu/screencast


screencast

Narrated product-demo videos as code.

You write a script that drives the screen and emits beat-timestamped checkpoints. You write the narration as plain text, one sentence per beat. The tool clones a voice from a short reference clip, synthesizes each sentence, snaps each one to its beat (auto-spacing so sentences never overlap), muxes audio onto the recording, and trims/speeds the result.

The point: visuals are the source of truth for timing, and audio adapts to them. Re-recording a 60-second demo is cheap; re-scrubbing a timeline to fix one mispronounced word is not. Change one sentence in narration.txt, run build, and you have a fresh video in about 10 seconds.

Why this exists

Other tools cover pieces of this. None cover the whole loop:

  • VHS — terminal-only, no voice
  • Remotion — synthetic video, can't drive a real app
  • Descript — voice cloning + timeline, but record-first / scrub-to-edit
  • Playwright video + ElevenLabs + ffmpeg — what screencast glues together

The novelty is the inversion: code-as-driver, narration follows, alignment is automatic. You never open a video editor.

How it works

your driver script        narration.txt          voice ref clip
       │                       │                       │
       ▼                       ▼                       ▼
   ffmpeg capture        per-sentence TTS       (cached, F5-TTS)
       │                       │
       │  beats.jsonl          │  segments/seg_N.wav
       ▼                       ▼
       └────────► no-overlap aligner ───────► narration.wav
                          │
                          └─► mux + trim + atempo ─► final.mov
  1. Driver prints [t=12.34s] beat-name lines to stderr while typing/clicking. Anything that can drive a UI works (Playwright, AppleScript, manual). Recording starts an ffmpeg capture (avfoundation on macOS, x11grab on Linux); beat lines are scraped into beats.jsonl.
  2. Synth synthesizes each sentence in narration.txt to its own WAV using the voice clone backend. Cached by sentence text — unchanged sentences are not re-synthesized.
  3. Align computes start times per sentence: actual_start = max(beat_time, previous_segment_end + gap). No overlap, ever. Output: one mixed-down WAV.
  4. Build muxes audio onto video without -shortest (so video is never truncated), then optionally trims a leading/trailing slice and applies atempo for a final time scaling.
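The alignment rule in step 3 can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function name, the input shape, and the 0.25 s default gap are all assumptions made for the example.

```javascript
// Sketch of the no-overlap rule from step 3 (all names are hypothetical).
// segments: [{ beatTime, duration }] in narration order, times in seconds.
function alignSegments(segments, gap = 0.25) {
  let previousEnd = -Infinity;
  return segments.map(({ beatTime, duration }) => {
    // Never start before the beat, and never before the previous
    // sentence has finished plus the minimum gap.
    const start = Math.max(beatTime, previousEnd + gap);
    previousEnd = start + duration;
    return { start, end: previousEnd };
  });
}

const placed = alignSegments([
  { beatTime: 0.0, duration: 3.0 },
  { beatTime: 2.0, duration: 1.5 }, // beat fires while sentence 1 is still playing
  { beatTime: 10.0, duration: 2.0 },
], 0.25);
console.log(placed.map((s) => s.start)); // [ 0, 3.25, 10 ]
```

The second sentence is pushed from its beat at 2.0 s to 3.25 s because the first sentence is still playing; the third starts exactly on its beat because nothing is in its way.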

Install

git clone <this repo> screencast
cd screencast
npm install        # for the driver runner (playwright peer dep optional)

You also need:

  • ffmpeg with avfoundation (macOS) or x11grab (Linux) input
  • F5-TTS (default voice backend) — see voice/README.md
  • cliclick if your driver uses cliclick for mouse motion (macOS)

Quickstart

# 1. Drop a 30-60s clean recording of yourself in voice/ref.wav and the
#    transcript in voice/ref.txt
# 2. Write your driver — see examples/coffee-page/record.mjs
# 3. Write narration.txt — one line per sentence, format:
#       beat-name | sentence text
# 4. Record, then build
screencast record --driver examples/coffee-page/record.mjs --output demo.mov
screencast build --plan examples/coffee-page/narration.txt \
                 --video demo.mov \
                 --beats beats.jsonl \
                 --voice voice/ref.wav \
                 --voice-text voice/ref.txt \
                 --output final.mov \
                 --trim 1.5,4.0 --speed 1.08

Anatomy of a driver

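// record.mjs: invoked by `screencast record`. Anything goes; just print
// `[t=X.XXs] beat-name` lines to stderr at each visual checkpoint.
// This example assumes Playwright; the URL below is a placeholder.
import { chromium } from "playwright";
import { setTimeout as sleep } from "node:timers/promises";

const t0 = Date.now();
const beat = (label) => console.error(`[t=${((Date.now() - t0) / 1000).toFixed(2)}s] ${label}`);

const browser = await chromium.launch({ headless: false }); // headed, so the capture sees it
const page = await browser.newPage();
await page.goto("http://localhost:3000"); // placeholder: your demo page

await page.bringToFront();
beat("intro");
await sleep(2000);

beat("first-action");
await page.click("button.add-to-cart");
// ...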

The framework doesn't care what library you use: Playwright, Puppeteer, robotjs, AppleScript via osascript, or even a human following written instructions ("press start, then press these keys"). All it needs is [t=...] lines on stderr while the screen recording runs.
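To make the stderr contract concrete, here is a rough sketch of how [t=...] lines could be scraped into one-JSON-object-per-line records. The helper name and the field names (t, name) are assumptions for illustration; the real beats.jsonl schema may differ.

```javascript
// Hypothetical helper: turn captured stderr text into beat records.
// Matches lines such as "[t=12.34s] beat-name" and ignores everything else.
const BEAT_LINE = /^\[t=(\d+(?:\.\d+)?)s\]\s+(\S.*)$/;

function parseBeats(stderrText) {
  const beats = [];
  for (const line of stderrText.split(/\r?\n/)) {
    const match = BEAT_LINE.exec(line.trim());
    if (match) beats.push({ t: Number(match[1]), name: match[2].trim() });
  }
  return beats;
}

const sampleStderr = [
  "[t=0.00s] intro",
  "some unrelated log line",
  "[t=2.01s] first-action",
].join("\n");

// One JSON object per line, as in a .jsonl file (field names are assumed).
const beatsJsonl = parseBeats(sampleStderr).map((b) => JSON.stringify(b)).join("\n");
console.log(beatsJsonl);
```

Because non-matching lines are skipped, a driver can log freely to stderr without disturbing the beat stream.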

Anatomy of a narration plan

intro          | This is Acme Coffee, the fastest way to order beans.
first-action   | I tap "Add to cart" and the cart updates instantly.
shipping       | Shipping address autofills from my saved profile.
checkout       | Checkout is one tap.

Each beat-name must match a name your driver emitted. The aligner reads beats.jsonl and uses the timestamp of intro as the desired start of sentence 1, the timestamp of first-action for sentence 2, and so on.

If you'd rather hand-pin times, write 12.5 | sentence instead, with the time in seconds in place of the beat name (fractional values are fine).
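The plan format is simple enough to parse by hand. The sketch below shows one way to read it and resolve each line to a desired start time; the function names and the record shapes are hypothetical, and treating any purely numeric key as pinned seconds is an assumption about how the two forms are told apart.

```javascript
// Hypothetical parser for `key | sentence` lines. A numeric key is read as
// pinned seconds; anything else is treated as a beat name.
function parsePlan(planText) {
  const entries = [];
  for (const rawLine of planText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue;
    const bar = line.indexOf("|"); // split on the first bar only
    if (bar === -1) throw new Error(`Missing "|" in line: ${line}`);
    const key = line.slice(0, bar).trim();
    const text = line.slice(bar + 1).trim();
    const isNumber = /^\d+(\.\d+)?$/.test(key);
    entries.push(isNumber ? { time: Number(key), text } : { beat: key, text });
  }
  return entries;
}

// Resolve each entry to a desired start time using the recorded beats.
function desiredStarts(entries, beats) {
  const byName = new Map(beats.map((b) => [b.name, b.t]));
  return entries.map((e) => {
    if (e.time !== undefined) return e.time;
    if (!byName.has(e.beat)) throw new Error(`No beat named "${e.beat}"`);
    return byName.get(e.beat);
  });
}

const plan = parsePlan("intro | This is Acme Coffee.\n12.5 | Checkout is one tap.");
console.log(desiredStarts(plan, [{ name: "intro", t: 0.4 }])); // [ 0.4, 12.5 ]
```

Failing loudly on an unknown beat name catches a typo in the plan before any audio is synthesized.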

Design notes

See docs/design.md for the longer story:

  • Why per-sentence synthesis beats slicing one big WAV
  • The no-overlap solver
  • Why caching by sentence text matters when iterating
  • Why -shortest will burn you
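As a rough illustration of why caching by sentence text helps when iterating, the sketch below wraps a stand-in synthesis function in an in-memory cache. Everything here is hypothetical: fakeTts replaces the real voice backend, and an actual cache would presumably persist to disk and might also key on the voice reference.

```javascript
// Hypothetical cache wrapper: `synthesize` stands in for the real TTS call.
function makeCachedSynth(synthesize) {
  const cache = new Map(); // key: exact sentence text, value: synthesized result
  return (sentence) => {
    if (!cache.has(sentence)) cache.set(sentence, synthesize(sentence));
    return cache.get(sentence);
  };
}

let ttsCalls = 0;
const fakeTts = (sentence) => {
  ttsCalls += 1; // count how often the expensive step actually runs
  return `wav:${sentence.length}`;
};
const cachedSynth = makeCachedSynth(fakeTts);

const firstBuild = ["This is Acme Coffee.", "Checkout is one tap."];
const secondBuild = ["This is Acme Coffee.", "Checkout takes one tap."]; // one edit
firstBuild.forEach((s) => cachedSynth(s));
secondBuild.forEach((s) => cachedSynth(s));
console.log(ttsCalls); // 3: two for the first build, one for the edited sentence
```

Editing one sentence costs one synthesis call on the next build, which is what keeps the edit-and-rebuild loop short.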

License

MIT
