DemoVoice is an open-source CLI for re-recording the voice track of demo videos using AI text-to-speech while preserving the timing of the original narration.
It is built in Go with Cobra and Viper. It is not a general video editor, a SaaS product, or a GUI. The primary goal is timing-preserved AI re-voicing for software demos.
This is the first working version. It supports OpenAI for transcription, speech synthesis, and optional rewrite attempts when generated speech does not fit the original timing window. More providers can be added through the provider interfaces in internal/providers.
CI, GitHub Actions, release automation, Docker, and goreleaser are intentionally not included yet.
- Go 1.22+
- ffmpeg and ffprobe available on PATH
- BYOK: set provider API keys through environment variables
  - OPENAI_API_KEY for the OpenAI provider

    go install github.com/elasticclaw/demovoice/cmd/demovoice@latest

For local development:

    make build

Create a config:

    demovoice init

This creates .demovoice/demovoice.yaml and .demovoice/glossary.yaml. DemoVoice automatically loads .demovoice/demovoice.yaml from the current directory.
You can also point DemoVoice at another project directory:

    demovoice render demo.mp4 --dir ../my-project --output demo.demovoice.mp4

--dir accepts either the project root containing .demovoice/ or the .demovoice/ directory itself.
Set your provider key:

    export OPENAI_API_KEY=...

Inspect narration timing:

    demovoice inspect demo.mp4

Render replacement narration:

    demovoice render demo.mp4 --output demo.demovoice.mp4

Render only the first 20 seconds while tuning voice, timing, or glossary settings:

    demovoice render demo.mp4 --preview-duration 20s --output preview.mp4

Preview rendering avoids processing the full video, so it is the fastest way to test voice choices and timing parameters.
If transcription gets a sentence wrong, export an editable timed script:

    demovoice inspect demo.mp4 --output-script .demovoice/script.yaml

Edit the generated YAML:

    segments:
      - index: 0
        start: 4.38
        end: 8.46
        text: "The next thing I'm going to do is actually create the factory."

Then render from that corrected script:

    demovoice render demo.mp4 --script .demovoice/script.yaml --output demo.demovoice.mp4

The script controls the text and timing windows used for TTS. DemoVoice does not re-transcribe when --script is provided.
demovoice init creates:

    profile: default
    providers:
      stt:
        provider: openai
        model: whisper-1
      tts:
        provider: openai
        model: gpt-4o-mini-tts
        voice: cedar
    presets:
      - tech-demo
    glossaries:
      - glossary.yaml
    profiles:
      default:
        pace: original
        emotion: neutral
        voice_instructions: >-
          Use natural technical demo narration.
          Preserve conversational inflection.
          Emphasize product names lightly.
          Pause briefly after commas and section transitions.
          Do not rush.
        preserve_timing: true
        max_segment_stretch: 1.12
        max_segment_compress: 0.88
        max_tempo_delta: 0.12
        max_forced_tempo: 1.3
        min_segment_seconds: 2.5
        max_segment_seconds: 10.0
        max_phrase_seconds: 10.0
        silence_padding_ms: 650
        rewrite_max_retries: 6
        segment_concurrency: 4

These defaults are tuned for software demo narration: fewer tiny segments, more context for TTS, moderate timing correction, and the built-in tech-demo preset.

--config can point at another config file. --profile selects a profile. App-level settings use the DEMOVOICE_ environment prefix where applicable. Provider secrets are intentionally not stored in config.
The OpenAI TTS voice is selected under providers.tts.voice. For example:

    providers:
      tts:
        provider: openai
        model: gpt-4o-mini-tts
        voice: cedar

Use voice_instructions to control delivery:

    profiles:
      default:
        voice_instructions: >
          Sound like a calm, confident technical founder.
          Use a North American accent.
          Keep energy moderate and delivery clear.
          Do not sound salesy.

Run short previews while tuning:

    demovoice render demo.mp4 --preview-duration 20s --output preview.mp4

tech-demo is enabled by default. It provides common software-demo terms and prompt guidance for words like GitHub, repo, Linear, OpenAI, API, CLI, PR, OAuth, SDK, JSON, YAML, TypeScript, Kubernetes, Docker, Postgres, webhooks, frontend, backend, and CI/CD.
.demovoice/glossary.yaml is loaded automatically when present. Use it for product names, technical terms, pronunciations, and common transcription aliases:

    terms:
      - text: AmazeCRM
        pronunciation: "amaze C R M"
        aliases:
          - "Amaze CRM"
          - "Amazed CRM"
      - text: GitHub App
        pronunciation: "git hub app"
        aliases:
          - "github app"

Additional glossary files can be listed in config or passed with repeatable --glossary flags.
Built-in presets load first. Project glossaries load after presets and can override or extend preset terms.
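The load order can be pictured as layers applied one after another. The sketch below is illustrative only: the Term type mirrors the glossary YAML fields, while mergeGlossaries and the choice to match terms by exact text are assumptions, not DemoVoice's confirmed behavior.

```go
package main

import "fmt"

// Term mirrors one entry of the glossary YAML.
type Term struct {
	Text          string
	Pronunciation string
	Aliases       []string
}

// mergeGlossaries is a hypothetical helper. Layers are applied in order,
// so a later layer replaces an earlier term with the same Text and
// appends terms that are new. Matching on exact Text is an assumption.
func mergeGlossaries(layers ...[]Term) []Term {
	index := map[string]int{} // term text -> position in merged
	var merged []Term
	for _, layer := range layers {
		for _, t := range layer {
			if i, ok := index[t.Text]; ok {
				merged[i] = t // project entry overrides the preset entry
				continue
			}
			index[t.Text] = len(merged)
			merged = append(merged, t)
		}
	}
	return merged
}

func main() {
	preset := []Term{{Text: "GitHub", Pronunciation: "git hub"}}
	project := []Term{
		{Text: "GitHub", Pronunciation: "git-hub"},
		{Text: "AmazeCRM", Pronunciation: "amaze C R M", Aliases: []string{"Amaze CRM"}},
	}
	for _, t := range mergeGlossaries(preset, project) {
		fmt.Printf("%s -> %s\n", t.Text, t.Pronunciation)
	}
}
```

Passing the preset layer first and the project layer second reproduces the "override or extend" behavior described above.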
DemoVoice tries to preserve original timing without making speech sound unnaturally fast. The main controls are:

    profiles:
      default:
        min_segment_seconds: 2.5
        max_segment_seconds: 10.0
        max_phrase_seconds: 10.0
        silence_padding_ms: 650
        max_tempo_delta: 0.12
        max_forced_tempo: 1.3
        rewrite_max_retries: 6
        segment_concurrency: 4

- min_segment_seconds avoids tiny TTS windows.
- silence_padding_ms controls how much silence is needed before a new segment is created.
- max_phrase_seconds controls the initial phrase target before sentence-fragment repair.
- max_tempo_delta is the preferred tempo-change window before content fitting is attempted.
- max_forced_tempo caps worst-case tempo fallback so speech does not become unusably fast.
- rewrite_max_retries controls how many minimal text-fit attempts are made.
- segment_concurrency controls parallel TTS generation.
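To illustrate what a setting like segment_concurrency bounds, here is a minimal sketch of running synthesis calls in parallel with a fixed limit. The function synthesizeAll and the synth callback are hypothetical stand-ins; the real provider call and its return type are not shown in this README.

```go
package main

import (
	"fmt"
	"sync"
)

// synthesizeAll is a hypothetical helper. It calls synth for every text
// with at most limit calls in flight and returns results in input order.
func synthesizeAll(texts []string, limit int, synth func(string) (string, error)) ([]string, error) {
	if limit < 1 {
		limit = 1
	}
	results := make([]string, len(texts))
	errs := make([]error, len(texts))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i, text := range texts {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit calls are running
		go func(i int, text string) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i], errs[i] = synth(text)
		}(i, text)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}

func main() {
	// fakeSynth stands in for a TTS request.
	fakeSynth := func(text string) (string, error) { return "audio:" + text, nil }
	out, err := synthesizeAll([]string{"one", "two", "three"}, 4, fakeSynth)
	fmt.Println(out, err)
}
```

Each goroutine writes only to its own slice index, so results keep segment order without extra locking.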
render validates ffmpeg and ffprobe, extracts the source audio, transcribes it or reads a provided script, splits narration into timed sentence-like segments, synthesizes each segment, fits generated audio into the original timing window, assembles a silent base track with generated narration overlaid at original start times, and muxes the new audio with the original video stream.
Segmentation is sentence-oriented rather than raw silence-only splitting. DemoVoice uses word timestamps, punctuation, pause thresholds, minimum segment durations, and continuation-word repair to avoid fragments, such as "which will allow" being split away from "me access to the source code".
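A simplified sketch of this kind of segmentation follows. It covers only part of what is described: a split happens after a word when the segment is already long enough and the word either ends a sentence or is followed by a long pause. Continuation-word repair and the maximum-duration limits are left out, and the Word, Segment, and splitSegments names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Word is one transcribed word with timestamps in seconds.
type Word struct {
	Text       string
	Start, End float64
}

// Segment is a timed run of words.
type Segment struct {
	Start, End float64
	Text       string
}

// splitSegments is a hypothetical, simplified splitter. It closes a
// segment only when it has reached minSeconds and the current word ends
// a sentence or is followed by a gap of at least pauseSeconds.
func splitSegments(words []Word, minSeconds, pauseSeconds float64) []Segment {
	var segs []Segment
	var cur []Word
	flush := func() {
		if len(cur) == 0 {
			return
		}
		texts := make([]string, len(cur))
		for i, w := range cur {
			texts[i] = w.Text
		}
		segs = append(segs, Segment{
			Start: cur[0].Start,
			End:   cur[len(cur)-1].End,
			Text:  strings.Join(texts, " "),
		})
		cur = nil
	}
	for i, w := range words {
		cur = append(cur, w)
		if i == len(words)-1 {
			break
		}
		longEnough := w.End-cur[0].Start >= minSeconds
		sentenceEnd := strings.HasSuffix(w.Text, ".") ||
			strings.HasSuffix(w.Text, "?") ||
			strings.HasSuffix(w.Text, "!")
		pause := words[i+1].Start-w.End >= pauseSeconds
		if longEnough && (sentenceEnd || pause) {
			flush()
		}
	}
	flush()
	return segs
}

func main() {
	words := []Word{
		{"Open", 0, 0.4}, {"the", 0.4, 0.6}, {"repo.", 0.6, 1.0},
		{"Then", 1.2, 1.5}, {"create", 1.5, 2.0}, {"the", 2.0, 2.2},
		{"factory.", 2.2, 3.0}, {"Next", 4.0, 4.4}, {"step.", 4.4, 5.0},
	}
	// 2.5 s minimum and a 0.65 s pause match the default profile values.
	for _, s := range splitSegments(words, 2.5, 0.65) {
		fmt.Printf("%.2f-%.2f %q\n", s.Start, s.End, s.Text)
	}
}
```

In this example the short first sentence is not split off on its own, because the segment has not yet reached the minimum duration.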
When generated speech is outside the preferred tempo window, DemoVoice asks the provider to make the smallest possible text edit for the target duration and retries. If it still cannot fit after retries, it uses stronger tempo adjustment to avoid overlapping narration.
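The decision described above can be sketched as a small function. This is an illustration under stated assumptions, not the project's implementation: planFit and the FitAction values are hypothetical names, tempo is taken to be generated duration divided by window duration, and clamping the slow side to the edge of the preferred window is a choice made here because the README only describes a cap for the fast side.

```go
package main

import "fmt"

// FitAction is a hypothetical label for how one segment gets fitted.
type FitAction string

const (
	ActionTempo   FitAction = "tempo"        // a small tempo change is enough
	ActionRewrite FitAction = "rewrite"      // ask for a minimal text edit and retry
	ActionForce   FitAction = "forced-tempo" // retries exhausted, clamp the tempo
)

// planFit decides how to fit generated audio into the original window.
// A tempo above 1 means the audio has to be sped up.
func planFit(generated, window, maxDelta, maxForced float64, retriesLeft int) (FitAction, float64) {
	tempo := generated / window
	if tempo >= 1-maxDelta && tempo <= 1+maxDelta {
		return ActionTempo, tempo
	}
	if retriesLeft > 0 {
		return ActionRewrite, 1.0
	}
	// Assumption: cap speed-up at maxForced, and never slow down past
	// the lower edge of the preferred window.
	if tempo > maxForced {
		tempo = maxForced
	}
	if tempo < 1-maxDelta {
		tempo = 1 - maxDelta
	}
	return ActionForce, tempo
}

func main() {
	// Window of 4.08 s, as in a segment from 4.38 s to 8.46 s.
	fmt.Println(planFit(4.30, 4.08, 0.12, 1.3, 6)) // inside the preferred window
	fmt.Println(planFit(6.00, 4.08, 0.12, 1.3, 6)) // too long, retries remain
	fmt.Println(planFit(6.00, 4.08, 0.12, 1.3, 0)) // too long, no retries left
}
```

With max_forced_tempo at 1.3, a segment that is still too long after all retries would not be fully fitted by tempo alone in this sketch; how DemoVoice handles the remainder is not specified here.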
- First version supports OpenAI only.
- Tests do not make real OpenAI calls.
- Timing quality depends on source audio clarity, transcription quality, and TTS pacing.
- It preserves speech timing and silence gaps; it does not edit video content.