elasticclaw/demovoice

DemoVoice

DemoVoice is an open-source CLI for re-recording the voice track of demo videos using AI text-to-speech while preserving the timing of the original narration.

It is built in Go with Cobra and Viper. It is not a general video editor, a SaaS product, or a GUI. The primary goal is timing-preserved AI re-voicing for software demos.

Status

This is the first working version. It supports OpenAI for transcription, speech synthesis, and optional rewrite attempts when generated speech does not fit the original timing window. More providers can be added through the provider interfaces in internal/providers.

CI, GitHub Actions, release automation, Docker, and goreleaser are intentionally not included yet.

Requirements

  • Go 1.22+
  • ffmpeg and ffprobe available on PATH
  • Provider API keys supplied through environment variables (bring your own key): OPENAI_API_KEY for the OpenAI provider

Install

go install github.com/elasticclaw/demovoice/cmd/demovoice@latest

For local development:

make build

Usage

Create a config:

demovoice init

This creates .demovoice/demovoice.yaml and .demovoice/glossary.yaml. DemoVoice automatically loads .demovoice/demovoice.yaml from the current directory.

You can also point DemoVoice at another project directory:

demovoice render demo.mp4 --dir ../my-project --output demo.demovoice.mp4

--dir accepts either the project root containing .demovoice/ or the .demovoice/ directory itself.
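The two accepted forms of --dir can be illustrated with a small Go sketch. This is not DemoVoice's actual code: resolveConfigDir is a hypothetical helper, and the assumption is simply that a path ending in .demovoice is used as-is while any other path is treated as a project root.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveConfigDir returns the .demovoice directory for a --dir value.
// Hypothetical helper for illustration; the real CLI may resolve paths differently.
func resolveConfigDir(dir string) (string, error) {
	clean := filepath.Clean(dir)
	// Case 1: the caller pointed directly at the .demovoice directory.
	if filepath.Base(clean) == ".demovoice" {
		return clean, nil
	}
	// Case 2: the caller pointed at a project root that should contain .demovoice/.
	candidate := filepath.Join(clean, ".demovoice")
	info, err := os.Stat(candidate)
	if err != nil {
		return "", fmt.Errorf("no .demovoice directory under %s: %w", clean, err)
	}
	if !info.IsDir() {
		return "", fmt.Errorf("%s is not a directory", candidate)
	}
	return candidate, nil
}

func main() {
	dir, err := resolveConfigDir("../my-project")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(filepath.Join(dir, "demovoice.yaml"))
}
```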

Set your provider key:

export OPENAI_API_KEY=...

Inspect narration timing:

demovoice inspect demo.mp4

Render replacement narration:

demovoice render demo.mp4 --output demo.demovoice.mp4

Render only the first 20 seconds while tuning voice, timing, or glossary settings:

demovoice render demo.mp4 --preview-duration 20s --output preview.mp4

Preview rendering avoids processing the full video, so it is the fastest way to test voice choices and timing parameters.

Editable Scripts

If transcription gets a sentence wrong, export an editable timed script:

demovoice inspect demo.mp4 --output-script .demovoice/script.yaml

Edit the generated YAML:

segments:
  - index: 0
    start: 4.38
    end: 8.46
    text: "The next thing I'm going to do is actually create the factory."

Then render from that corrected script:

demovoice render demo.mp4 --script .demovoice/script.yaml --output demo.demovoice.mp4

The script controls the text and timing windows used for TTS. DemoVoice does not re-transcribe when --script is provided.

Config

demovoice init creates:

profile: default

providers:
  stt:
    provider: openai
    model: whisper-1
  tts:
    provider: openai
    model: gpt-4o-mini-tts
    voice: cedar

presets:
  - tech-demo

glossaries:
  - glossary.yaml

profiles:
  default:
    pace: original
    emotion: neutral
    voice_instructions: >-
      Use natural technical demo narration.
      Preserve conversational inflection.
      Emphasize product names lightly.
      Pause briefly after commas and section transitions.
      Do not rush.
    preserve_timing: true
    max_segment_stretch: 1.12
    max_segment_compress: 0.88
    max_tempo_delta: 0.12
    max_forced_tempo: 1.3
    min_segment_seconds: 2.5
    max_segment_seconds: 10.0
    max_phrase_seconds: 10.0
    silence_padding_ms: 650
    rewrite_max_retries: 6
    segment_concurrency: 4

These defaults are tuned for software demo narration: fewer tiny segments, more context for TTS, moderate timing correction, and the built-in tech-demo preset. --config can point at another config file. --profile selects a profile. App-level settings use the DEMOVOICE_ environment prefix where applicable. Provider secrets are intentionally not stored in config.

Voice Tuning

The OpenAI TTS voice is selected under providers.tts.voice. For example:

providers:
  tts:
    provider: openai
    model: gpt-4o-mini-tts
    voice: cedar

Use voice_instructions to control delivery:

profiles:
  default:
    voice_instructions: >
      Sound like a calm, confident technical founder.
      Use a North American accent.
      Keep energy moderate and delivery clear.
      Do not sound salesy.

Run short previews while tuning:

demovoice render demo.mp4 --preview-duration 20s --output preview.mp4

Glossary

tech-demo is enabled by default. It provides common software-demo terms and prompt guidance for words like GitHub, repo, Linear, OpenAI, API, CLI, PR, OAuth, SDK, JSON, YAML, TypeScript, Kubernetes, Docker, Postgres, webhooks, frontend, backend, and CI/CD.

.demovoice/glossary.yaml is loaded automatically when present. Use it for product names, technical terms, pronunciations, and common transcription aliases:

terms:
  - text: AmazeCRM
    pronunciation: "amaze C R M"
    aliases:
      - "Amaze CRM"
      - "Amazed CRM"

  - text: GitHub App
    pronunciation: "git hub app"
    aliases:
      - "github app"

Additional glossary files can be listed in config or passed with repeatable --glossary flags.

Built-in presets load first. Project glossaries load after presets and can override or extend preset terms.
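The load order above amounts to a layered merge. The following Go sketch shows one way that could work, assuming terms are matched by their exact text value; the Term type, mergeGlossaries, and the preset pronunciations in main are all hypothetical and are not taken from the DemoVoice source.

```go
package main

import "fmt"

// Term mirrors one glossary entry from the YAML format shown earlier.
type Term struct {
	Text          string
	Pronunciation string
	Aliases       []string
}

// mergeGlossaries applies layers in order. A later layer replaces an earlier
// term with the same Text (override) and appends terms it has not seen (extend).
// Exact, case-sensitive matching on Text is an assumption made for this sketch.
func mergeGlossaries(layers ...[]Term) []Term {
	index := map[string]int{}
	var merged []Term
	for _, layer := range layers {
		for _, t := range layer {
			if i, ok := index[t.Text]; ok {
				merged[i] = t
				continue
			}
			index[t.Text] = len(merged)
			merged = append(merged, t)
		}
	}
	return merged
}

func main() {
	// Placeholder preset entries; the real tech-demo preset contents are not shown here.
	preset := []Term{
		{Text: "GitHub", Pronunciation: "git hub"},
		{Text: "OAuth", Pronunciation: "oh auth"},
	}
	project := []Term{
		{Text: "OAuth", Pronunciation: "O auth"},
		{Text: "AmazeCRM", Pronunciation: "amaze C R M", Aliases: []string{"Amaze CRM"}},
	}
	for _, t := range mergeGlossaries(preset, project) {
		fmt.Printf("%s -> %s\n", t.Text, t.Pronunciation)
	}
}
```

Passing the preset layer first and the project layer second reproduces the documented order: the project entry for OAuth wins, and AmazeCRM is added.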

Timing Settings

DemoVoice tries to preserve original timing without making speech sound unnaturally fast. The main controls are:

profiles:
  default:
    min_segment_seconds: 2.5
    max_segment_seconds: 10.0
    max_phrase_seconds: 10.0
    silence_padding_ms: 650
    max_tempo_delta: 0.12
    max_forced_tempo: 1.3
    rewrite_max_retries: 6
    segment_concurrency: 4

  • min_segment_seconds avoids tiny TTS windows.
  • silence_padding_ms controls how much silence is needed before a new segment is created.
  • max_phrase_seconds controls the initial phrase target before sentence-fragment repair.
  • max_tempo_delta is the preferred tempo-change window before content fitting is attempted.
  • max_forced_tempo caps worst-case tempo fallback so speech does not become unusably fast.
  • rewrite_max_retries controls how many minimal text-fit attempts are made.
  • segment_concurrency controls parallel TTS generation.
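To make the two tempo settings concrete, here is a small Go sketch of the arithmetic. It assumes that tempo means generated duration divided by window duration, and that max_tempo_delta is a symmetric band around 1.0; both are readings of the descriptions above, not confirmed behavior, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Timing holds the two tempo limits from the profile (names are illustrative).
type Timing struct {
	MaxTempoDelta  float64 // preferred band around 1.0, e.g. 0.12
	MaxForcedTempo float64 // hard cap for the fallback, e.g. 1.3
}

// requiredTempo is the playback speed factor needed for the generated audio
// to fill the window exactly. Values above 1 mean the audio must be sped up.
func requiredTempo(generatedSeconds, windowSeconds float64) float64 {
	return generatedSeconds / windowSeconds
}

// withinPreferred reports whether the tempo change is small enough to apply
// without first trying to fit the text.
func (t Timing) withinPreferred(tempo float64) bool {
	return math.Abs(tempo-1.0) <= t.MaxTempoDelta
}

// forcedTempo caps the fallback speed-up at MaxForcedTempo.
func (t Timing) forcedTempo(tempo float64) float64 {
	return math.Min(tempo, t.MaxForcedTempo)
}

func main() {
	timing := Timing{MaxTempoDelta: 0.12, MaxForcedTempo: 1.3}
	window := 8.46 - 4.38 // the segment window from the script example
	for _, generated := range []float64{4.3, 5.0, 6.0} {
		tempo := requiredTempo(generated, window)
		fmt.Printf("generated=%.2fs tempo=%.3f preferred=%v forced=%.3f\n",
			generated, tempo, timing.withinPreferred(tempo), timing.forcedTempo(tempo))
	}
}
```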

How It Works

render validates ffmpeg and ffprobe, extracts the source audio, transcribes it or reads a provided script, splits narration into timed sentence-like segments, synthesizes each segment, fits generated audio into the original timing window, assembles a silent base track with generated narration overlaid at original start times, and muxes the new audio with the original video stream.
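As an illustration of the last step, the Go sketch below builds an ffmpeg argument list that copies the original video stream and swaps in a new audio track. The flags are standard ffmpeg options, but whether DemoVoice invokes ffmpeg with exactly these arguments is an assumption; muxArgs and the narration.wav intermediate file are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// muxArgs builds ffmpeg arguments that keep the video stream from the first
// input untouched and take the audio from the second input.
// Hypothetical helper; the real argument list may differ.
func muxArgs(video, narration, output string) []string {
	return []string{
		"-y",            // overwrite the output file if it exists
		"-i", video,     // input 0: original video
		"-i", narration, // input 1: assembled narration track
		"-map", "0:v:0", // first video stream of input 0
		"-map", "1:a:0", // first audio stream of input 1
		"-c:v", "copy",  // do not re-encode video
		"-c:a", "aac",   // encode the new audio as AAC
		output,
	}
}

func main() {
	args := muxArgs("demo.mp4", "narration.wav", "demo.demovoice.mp4")
	// Printed rather than executed so the sketch runs without ffmpeg installed.
	// To run it, pass args to exec.Command("ffmpeg", args...).
	fmt.Println("ffmpeg " + strings.Join(args, " "))
}
```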

Segmentation is sentence-oriented rather than raw silence-only splitting. DemoVoice uses word timestamps, punctuation, pause thresholds, minimum segment durations, and continuation-word repair to avoid fragments, such as "which will allow" being split away from "me access to the source code".
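A simplified Go sketch of this kind of segmentation follows. It is not the DemoVoice algorithm: it approximates continuation repair with a single rule (a pause only splits a segment when the preceding word carries clause punctuation), and the Word and Segment types and segmentWords are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Word is one transcribed word with timestamps in seconds.
type Word struct {
	Text       string
	Start, End float64
}

// Segment is a timed run of words that will be synthesized together.
type Segment struct {
	Start, End float64
	Text       string
}

// endsWithAny reports whether s ends with one of the ASCII bytes in chars.
func endsWithAny(s, chars string) bool {
	return s != "" && strings.IndexByte(chars, s[len(s)-1]) >= 0
}

// segmentWords splits after a word only when the segment is long enough and
// the word ends a sentence, or ends a clause and is followed by a pause.
// A bare pause never splits, which keeps mid-phrase fragments together.
// The final segment is always emitted, even if shorter than minSeconds.
func segmentWords(words []Word, minSeconds, pauseSeconds float64) []Segment {
	var out []Segment
	begin := 0
	for i, w := range words {
		split := i == len(words)-1
		if !split {
			longEnough := w.End-words[begin].Start >= minSeconds
			pause := words[i+1].Start-w.End >= pauseSeconds
			sentenceEnd := endsWithAny(w.Text, ".?!")
			clauseEnd := endsWithAny(w.Text, ",;:")
			split = longEnough && (sentenceEnd || (pause && clauseEnd))
		}
		if split {
			texts := make([]string, 0, i-begin+1)
			for _, x := range words[begin : i+1] {
				texts = append(texts, x.Text)
			}
			out = append(out, Segment{Start: words[begin].Start, End: w.End, Text: strings.Join(texts, " ")})
			begin = i + 1
		}
	}
	return out
}

func main() {
	words := []Word{
		{"which", 0.0, 0.8}, {"will", 0.8, 1.6}, {"allow", 1.6, 2.6},
		// 0.8s pause here, but "allow" has no punctuation, so no split.
		{"me", 3.4, 3.7}, {"access", 3.7, 4.3}, {"to", 4.3, 4.5},
		{"the", 4.5, 4.7}, {"source", 4.7, 5.2}, {"code.", 5.2, 5.8},
		{"Next,", 6.6, 7.0}, {"deploy.", 7.0, 7.6},
	}
	for _, s := range segmentWords(words, 2.5, 0.65) {
		fmt.Printf("%.2f-%.2f %q\n", s.Start, s.End, s.Text)
	}
}
```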

When generated speech falls outside the preferred tempo window (max_tempo_delta), DemoVoice asks the provider to make the smallest possible text edit for the target duration and retries, up to rewrite_max_retries times. If the speech still does not fit, it falls back to a stronger tempo adjustment, capped by max_forced_tempo, to avoid overlapping narration.
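The retry-then-fallback flow can be sketched in Go as below. The synth and rewrite callbacks stand in for provider calls and are injected so the sketch runs offline; fit, FitConfig, FitResult, and the toy helpers wordsDuration and dropLastWord are hypothetical, and dropping a trailing word is only a placeholder for a real minimal text edit.

```go
package main

import (
	"fmt"
	"strings"
)

// FitConfig carries the limits used by the fitting loop (illustrative names).
type FitConfig struct {
	MaxTempoDelta  float64
	MaxForcedTempo float64
	MaxRetries     int
}

// FitResult describes how a segment was finally fitted.
type FitResult struct {
	Text     string
	Tempo    float64
	Rewrites int
	Forced   bool
}

// fit synthesizes text, checks the tempo needed to fill the window, and asks
// for a rewrite while the tempo is outside the preferred band. After
// MaxRetries rewrites it gives up and caps the speed-up at MaxForcedTempo.
func fit(text string, window float64, cfg FitConfig,
	synth func(string) float64, rewrite func(string, float64) string) FitResult {
	for attempt := 0; ; attempt++ {
		tempo := synth(text) / window
		if tempo >= 1-cfg.MaxTempoDelta && tempo <= 1+cfg.MaxTempoDelta {
			return FitResult{Text: text, Tempo: tempo, Rewrites: attempt}
		}
		if attempt == cfg.MaxRetries {
			if tempo > cfg.MaxForcedTempo {
				tempo = cfg.MaxForcedTempo
			}
			return FitResult{Text: text, Tempo: tempo, Rewrites: attempt, Forced: true}
		}
		text = rewrite(text, window)
	}
}

// wordsDuration is a toy synthesizer: 0.4 seconds per word.
func wordsDuration(text string) float64 {
	return 0.4 * float64(len(strings.Fields(text)))
}

// dropLastWord is a toy rewrite that shortens the text by one word.
func dropLastWord(text string, _ float64) string {
	fields := strings.Fields(text)
	if len(fields) <= 1 {
		return text
	}
	return strings.Join(fields[:len(fields)-1], " ")
}

func main() {
	cfg := FitConfig{MaxTempoDelta: 0.12, MaxForcedTempo: 1.3, MaxRetries: 6}
	text := "The next thing I'm going to do is actually create the factory."
	r := fit(text, 8.46-4.38, cfg, wordsDuration, dropLastWord)
	fmt.Printf("rewrites=%d forced=%v tempo=%.3f text=%q\n", r.Rewrites, r.Forced, r.Tempo, r.Text)
}
```

Injecting the two callbacks keeps the control flow testable without network access, which matches the note below that tests do not make real OpenAI calls.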

Limitations

  • First version supports OpenAI only.
  • Tests do not make real OpenAI calls.
  • Timing quality depends on source audio clarity, transcription quality, and TTS pacing.
  • It preserves speech timing and silence gaps; it does not edit video content.
