A web-based voice extraction and editing tool. Collect audio from YouTube or system output, generate word-level transcripts with STT, edit with a synchronized waveform + text UI, cut and rearrange segments, and remove background music with AI — all in one place.
- YouTube Audio Extraction — Download audio from any YouTube URL (yt-dlp)
- System Audio Recording — Record system output via BlackHole (macOS)
- File Upload — Drag-and-drop local audio files
- Speech-to-Text — Word-level timestamps via faster-whisper
- Waveform + Text Sync — wavesurfer.js waveform with per-word highlighting and click-to-seek
- Inline Text Editing — Edit transcript text per segment
- Cut & Rearrange — Select regions on the waveform, cut into segments, drag to reorder
- Background Removal — Vocal/music separation with Demucs AI, switch between stems
- Export — Download as WAV/MP3 audio or TXT/SRT transcript
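As an illustration of the SRT transcript export listed above, the following is a minimal sketch of turning timed segments into SRT cues. The `Segment` dataclass and both function names are hypothetical and not taken from the project's code; only the SRT cue layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is standard.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical transcript segment; times are in seconds."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)
```

Passing the segments in their current timeline order would make the cue numbering follow any reordering done in the editor, although how the project itself handles reordered transcripts is not stated here.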
| Layer | Technology |
|---|---|
| Backend | Python + FastAPI + SQLite (SQLAlchemy async) |
| Frontend | React + TypeScript + Vite + TailwindCSS v4 |
| State | Zustand |
| Audio | yt-dlp, sounddevice, ffmpeg |
| STT | faster-whisper (word_timestamps) |
| Separation | Demucs (htdemucs) + torchcodec |
| Waveform | wavesurfer.js + RegionsPlugin |
| Drag & Drop | @dnd-kit/core + @dnd-kit/sortable |
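To make the STT row concrete, here is a rough sketch of how word-level timestamps can be obtained from faster-whisper with `word_timestamps=True`. The function names, the `"small"` model size, and the CPU/int8 settings are assumptions for illustration; the project's actual service code may be organized differently.

```python
from typing import Any, Iterable


def flatten_words(segments: Iterable[Any]) -> list[dict]:
    """Collect word-level timestamps from transcribed segments."""
    words = []
    for segment in segments:
        # segment.words is None when word timestamps were not requested.
        for word in segment.words or []:
            words.append(
                {"text": word.word.strip(), "start": word.start, "end": word.end}
            )
    return words


def transcribe_words(audio_path: str, model_size: str = "small") -> list[dict]:
    """Transcribe an audio file and return its words with start/end times."""
    # Imported lazily so the helper above can be used without the model installed.
    from faster_whisper import WhisperModel

    # Model size, device, and compute type are illustrative choices.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, word_timestamps=True)
    # segments is a lazy generator; iterating it runs the transcription.
    return flatten_words(segments)
```

A list like this is the kind of data a frontend needs for per-word highlighting and click-to-seek: each word carries the time range to highlight and the position to seek to.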
- Python 3.11–3.13 (recommended), or Python 3.14+ together with a separate Python 3.11–3.13 installation for Demucs
- Node.js 18+
- ffmpeg (macOS: brew install ffmpeg; Debian/Ubuntu: sudo apt install ffmpeg)
- BlackHole (for macOS system audio recording)
Python version note: With Python 3.11–3.13, all dependencies (including Demucs) are installed in a single venv. Python 3.14+ is incompatible with Demucs, so the setup script automatically creates a separate venv.
git clone https://github.com/chadingTV/voiceeditor.git
cd voiceeditor
./scripts/setup.sh
The setup script automatically:
- Checks prerequisites (python3, node, npm, ffmpeg)
- Detects Python version → single venv or separate Demucs venv
- Creates backend Python venv and installs dependencies
- (Python 3.14+ only) Creates Demucs venv with compatible Python
- Installs SwitchAudioSource (macOS, for system audio recording)
- Installs frontend npm packages
# Start both backend and frontend
./scripts/dev.sh
Open http://localhost:5173 in your browser.
Manual installation steps:
cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# If Python 3.11-3.13, also install demucs:
pip install -r requirements-demucs.txt
# Python 3.14+ only: create a separate Demucs venv (run from the repository root)
cd backend
python3.12 -m venv .venv-demucs # or python3.11, python3.13
source .venv-demucs/bin/activate
pip install -r requirements-demucs.txt
deactivate
# Frontend dependencies (run from the repository root)
cd frontend
npm install
# Backend
cd backend && source .venv/bin/activate && uvicorn main:app --reload --port 8000
# Frontend
cd frontend && npm run dev
voiceeditor/
├── backend/
│   ├── main.py                  # FastAPI app
│   ├── config.py                # Configuration
│   ├── requirements.txt         # Main backend dependencies
│   ├── requirements-demucs.txt  # Demucs-specific dependencies
│   ├── routers/                 # API routers
│   │   ├── projects.py          # Project CRUD
│   │   ├── audio.py             # Audio import (YouTube/upload/recording)
│   │   ├── transcription.py     # STT + text editing + TXT/SRT download
│   │   ├── separation.py        # Background removal (Demucs subprocess)
│   │   └── editor.py            # Segment editing & export
│   ├── services/                # Business logic
│   ├── models/                  # DB models & schemas
│   └── tasks/                   # Background task manager
├── frontend/
│   └── src/
│       ├── api/                 # API client modules
│       ├── stores/              # Zustand stores
│       ├── components/
│       │   ├── layout/          # AppShell, Header, Sidebar
│       │   ├── import/          # YouTube, upload, recording UI
│       │   └── editor/          # Waveform editor, transcript panel, segment timeline
│       ├── hooks/               # Custom hooks
│       └── types/               # TypeScript types
└── scripts/
    ├── setup.sh                 # Automated setup script
    └── dev.sh                   # Dev server launcher
- Create a project — Click "New Project" in the sidebar
- Import audio — Paste a YouTube URL, upload a file, or record system audio
- Generate transcript — Click "Generate STT" in the editor
- Review & edit text — Click the pencil icon on any segment to edit inline
- Cut segments — Drag-select a region on the waveform → "Cut Selection"
- Reorder — Drag segments in the timeline to rearrange
- Remove background — Click "Remove Background" → select Vocals/No Vocals stem
- Export — Download audio (WAV/MP3) or transcript (TXT/SRT)
Demucs always runs as a subprocess. The Python executable is auto-detected based on the environment:
| System Python | Demucs Strategy |
|---|---|
| 3.11–3.13 | Runs directly from the main venv (single venv) |
| 3.14+ | Runs from .venv-demucs with compatible Python (dual venv) |
Backend (separation.py)
│
├── _find_demucs_python() ← auto-detect
│ │
│ ├── .venv-demucs exists? → .venv-demucs/bin/python3
│ └── otherwise → try current python's demucs
│
└── subprocess.run([python, "-m", "demucs", ...])
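The detection flow in the diagram could look roughly like the sketch below. This is not the project's actual `_find_demucs_python()`: the function names, the `backend_dir` parameter, and the specific Demucs arguments are assumptions, since the real argument list is elided in the diagram. The flags used (`-n`, `--two-stems`, `-o`) are existing Demucs CLI options, picked here because they produce the vocals and no-vocals stems the editor exposes.

```python
import subprocess
import sys
from pathlib import Path


def find_demucs_python(backend_dir: Path) -> str:
    """Prefer the dedicated Demucs venv; otherwise use the current interpreter."""
    candidate = backend_dir / ".venv-demucs" / "bin" / "python3"
    if candidate.exists():
        return str(candidate)
    # Single-venv setup: Demucs is expected in the running environment.
    return sys.executable


def build_demucs_command(python: str, audio_path: Path, out_dir: Path) -> list[str]:
    """Assemble an illustrative Demucs invocation (arguments are assumptions)."""
    return [
        python, "-m", "demucs",
        "-n", "htdemucs",         # model named in the tech stack table
        "--two-stems", "vocals",  # write vocals and no_vocals stems only
        "-o", str(out_dir),
        str(audio_path),
    ]


def run_separation(backend_dir: Path, audio_path: Path, out_dir: Path) -> None:
    """Run Demucs in a subprocess and raise if it exits with an error."""
    python = find_demucs_python(backend_dir)
    subprocess.run(build_demucs_command(python, audio_path, out_dir), check=True)
```

Running Demucs through a separate interpreter is what lets the backend itself stay on Python 3.14+ while the separation step uses a compatible Python.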
- Stem selector reset — Switching between Original/Vocals/No Vocals no longer resets to Original
- Export wrong audio — Export now correctly uses the current audio file, not the first one in the project
- Export ignoring reorder — Exported audio now respects the drag-and-drop segment order
- Export ignoring active stem — Exporting in Vocals mode now exports vocals only, not the original
- pydub crash on Python 3.14 — Replaced pydub (broken audioop module) with direct ffmpeg subprocess
- Demucs torchcodec missing — Added torchcodec to demucs dependencies for audio saving
- System recording silence — Auto-switch to multi-output device when recording starts
- Audio output stuck on multi-output — Fallback to built-in speaker when previous output device is disconnected
- Audio output not restored on crash — Added atexit handler to restore output on server shutdown
- DndContext hijacking clicks — Added pointer distance threshold so buttons work alongside drag-and-drop
- Transcript edit not displaying — Edited text now correctly shown instead of original words
- Download encoding error — Fixed Korean filename encoding in Content-Disposition header (RFC 5987)
- STT infinite loading — Added error handling for failed background tasks
- Audio file rename (inline edit) and delete
- Transcript download in TXT and SRT formats
- Audio file download
- Automated setup script with Python version detection
- Cross-platform Demucs path auto-detection
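The Korean filename fix in the Content-Disposition header mentioned above relies on the RFC 5987 `filename*` parameter. A minimal sketch of that encoding, assuming a hypothetical helper name and an ASCII fallback strategy that may differ from what the project does:

```python
from urllib.parse import quote


def content_disposition(filename: str) -> str:
    """Build an attachment header with an ASCII fallback and a UTF-8 filename*."""
    # Fallback for clients that ignore filename*: non-ASCII characters become "?".
    fallback = filename.encode("ascii", "replace").decode("ascii")
    # Quotes and backslashes would break the quoted-string form.
    fallback = fallback.replace('"', "_").replace("\\", "_")
    # RFC 5987 form: charset, empty language tag, percent-encoded UTF-8 bytes.
    encoded = quote(filename, safe="")
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"
```

The resulting string contains only ASCII characters, so it can be set as a response header value without triggering an encoding error.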
This project is licensed under the MIT License.
If you redistribute or use this project in derivative works, please include the following attribution:
Original project: VoiceEditor by chadingTV https://github.com/chadingTV/voiceeditor