# Local Transcription + Diarization

This project runs local transcription + speaker diarization with WhisperX.
## Quick start

- Create `.env.local`:

  ```
  HF_TOKEN=hf_xxxxxxxxxxxxxxxxx
  ```

- Put your audio in this folder (the default input is `sample.mp3`).
- Run: `make run`
- Check the outputs: `sample.out` and `sample.chunks.json`
`make run` creates/updates `.venv` and installs dependencies automatically (it handles `.venv` setup and `pip install -r requirements.txt` for you), so manual activation is not required for normal use.

If you want manual control, run:
```sh
python3.10 -m venv .venv
.venv/bin/python -m pip install --upgrade pip
.venv/bin/pip install -r requirements.txt
.venv/bin/python transcribe_local.py sample.mp3 --env-file .env.local
```

## Requirements

- Python 3.10
- `make`
- `ffmpeg` (used for automatic pre-conversion)
- Hugging Face token in an env file (default `.env.local`)
## Hugging Face token

ASR runs locally, but diarization models are downloaded from gated Hugging Face repos.

- `HF_TOKEN` authenticates model download/access.
- Audio processing remains local.
- The first run downloads models to the cache; later runs reuse the cache.
- Missing repo access causes 403 errors during diarization model load.
Token source behavior:

- `HF_TOKEN` is read from `ENV_FILE` (default `.env.local`).
- Shell environment variables are not used as a token source.
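The env-file-only behavior can be sketched with a minimal parser like the one below. This is an illustration of the documented rule, not the script's actual code; the function name and error handling are assumptions.

```python
def read_hf_token(env_file: str = ".env.local") -> str:
    """Read HF_TOKEN strictly from the env file.

    The shell environment is deliberately ignored, matching the
    documented token-source behavior.
    """
    with open(env_file) as f:
        for line in f:
            line = line.strip()
            if line.startswith("HF_TOKEN="):
                # Split only on the first '=' so token values containing
                # '=' survive intact.
                return line.split("=", 1)[1]
    raise RuntimeError(f"HF_TOKEN not found in {env_file}")
```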
## What it does

- Transcribes audio with the Whisper `base` model (fixed).
- Runs diarization with exactly 2 speakers (fixed).
- Assigns speakers to transcript segments.
- Chunks the transcript into sentence-safe groups while preserving transcript tags.
- Writes:
  - `sample.out` (human-readable)
  - `sample.chunks.json` (structured)
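The sentence-safe grouping can be sketched roughly as follows. This is a minimal illustration: the segment shape, the `max_chars` budget, and the sentence-boundary regex are assumptions, not the script's actual parameters.

```python
import re

def sentence_safe_chunks(segments, max_chars=600):
    """Group transcript segments into chunks, closing a chunk only at a
    sentence-final segment so no sentence is split across chunks.

    Each segment is assumed to be a dict with 'tag' and 'text' keys
    (illustrative shape; the real script preserves transcript tags).
    """
    chunks, current, size = [], [], 0
    for seg in segments:
        current.append(seg)
        size += len(seg["text"])
        # Break only once over budget AND the segment ends a sentence.
        if size >= max_chars and re.search(r'[.!?]["\']?$', seg["text"].strip()):
            chunks.append(current)
            current, size = [], 0
    if current:
        chunks.append(current)
    return chunks
```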
## Usage

```sh
make run
make run DIARIZATION_MODEL=3.1
make run PROFILE=mac-arm-mps
make run AUDIO=meeting.wav OUTPUT_FILE=meeting.out CHUNKS_FILE=meeting.chunks.json
make run PRECONVERT=0
```

Variables:

- `AUDIO` (default `sample.mp3`): input audio path.
- `ENV_FILE` (default `.env.local`): env file containing `HF_TOKEN`.
- `CACHE_DIR` (default `~/.cache/whisperx_models`): model cache location.
- `OUTPUT_FILE` (default `sample.out`): transcript text output.
- `CHUNKS_FILE` (default `sample.chunks.json`): structured chunk output.
- `DIARIZATION_MODEL` (`community-1` or `3.1`): diarization pipeline model.
- `PROFILE` (`auto`, `mac-arm`, `mac-arm-mps`, `mac-intel`, `win-cuda`, `win-cpu-safe`): runtime preset.
- `ASR_DEVICE` (optional): override transcription device (`cpu`, `mps`, `cuda`).
- `DIARIZATION_DEVICE` (optional): override diarization device (`cpu`, `mps`, `cuda`).
- `PRECONVERT` (default `1`): automatic audio normalization.
  - `1`/`true`/`yes`: enable pre-conversion.
  - `0`/`false`/`no`: disable pre-conversion.
## Diarization models

`community-1`:

- Usually more stable quality.
- Often slower on CPU.
- Good quality-first default.

`3.1`:

- Often faster on CPU.
- Quality can vary more by audio.
- Requires access to both:
  - `pyannote/speaker-diarization-3.1`
  - `pyannote/segmentation-3.0`
## Profiles

- `auto`: detects the machine and picks defaults.
- `win-cuda`: falls back to `win-cpu-safe` if CUDA is unavailable.
- `mac-arm-mps`: uses MPS for diarization when available, otherwise falls back to CPU.

At startup the app prints the requested profile, the resolved profile, and the effective runtime settings.
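The requested-to-resolved mapping can be sketched like this. The `win-cuda` fallback is the documented behavior; the `auto` branch and the capability flags are assumptions about how detection might work, not the script's actual logic.

```python
def resolve_profile(requested: str, has_cuda: bool, has_mps: bool,
                    is_mac_arm: bool) -> str:
    """Resolve a requested runtime profile to an effective one."""
    if requested == "auto":
        # Hypothetical machine detection: prefer MPS on Apple Silicon,
        # CUDA elsewhere, with a CPU-safe fallback.
        if is_mac_arm:
            return "mac-arm-mps" if has_mps else "mac-arm"
        return "win-cuda" if has_cuda else "win-cpu-safe"
    if requested == "win-cuda" and not has_cuda:
        return "win-cpu-safe"  # documented fallback when CUDA is unavailable
    return requested
```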
## Audio pre-conversion

- If the input is already a mono 16 kHz WAV, conversion is skipped.
- Otherwise the input is converted to a temporary mono 16 kHz WAV via `ffmpeg`.
- ASR and diarization both run on that normalized temp file.
- The temp file is removed automatically at the end.
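The skip check and conversion step can be sketched as below. The function names and the way channel/sample-rate information reaches the check are illustrative assumptions; only the `ffmpeg` flags reflect the documented mono 16 kHz target.

```python
import os
import subprocess
import tempfile

def is_normalized(path: str, channels: int, sample_rate: int) -> bool:
    """True when the input is already a mono 16 kHz WAV, so conversion
    can be skipped."""
    return path.lower().endswith(".wav") and channels == 1 and sample_rate == 16000

def preconvert(path: str) -> str:
    """Convert the input to a temporary mono 16 kHz WAV via ffmpeg and
    return the temp file's path; the caller removes it after the run."""
    fd, tmp = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    subprocess.run(
        ["ffmpeg", "-y", "-i", path, "-ac", "1", "-ar", "16000", tmp],
        check=True,
    )
    return tmp
```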
## Pipeline

- Validate input files and settings.
- Resolve the runtime profile and device fallbacks.
- Load `HF_TOKEN` from the env file.
- Pre-convert audio when needed (`PRECONVERT=1`).
- Transcribe audio (`base`).
- Run the diarization model (`community-1` or `3.1`) with 2 speakers.
- Assign speakers to transcript segments.
- Build sentence-safe chunks and keep transcript tags.
- Write `sample.out` and `sample.chunks.json`.
## Output format

`sample.out` lines look like:

```
[MM:SS] Speaker: "Text"
```

`sample.chunks.json` contains:

- `summary`: `transcript_tag_count`, `calculated_chunk_count`, `planned_tags_per_chunk`
- `chunks[]`: `chunk_id`, `start`, `end`, `text`, `transcript_tags[]`, and `transcripts[]` with `id`, `tag`, `start`, `end`, `speaker`, `text`
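An illustrative sketch of the file's shape, assembled from the key list above. All values, the speaker label, and the exact nesting are assumptions; check a real output file for the authoritative structure.

```json
{
  "summary": {
    "transcript_tag_count": 12,
    "calculated_chunk_count": 3,
    "planned_tags_per_chunk": 4
  },
  "chunks": [
    {
      "chunk_id": 1,
      "start": 0.0,
      "end": 42.5,
      "text": "...",
      "transcript_tags": [1, 2, 3, 4],
      "transcripts": [
        {
          "id": 1,
          "tag": 1,
          "start": 0.0,
          "end": 3.2,
          "speaker": "SPEAKER_00",
          "text": "..."
        }
      ]
    }
  ]
}
```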
## Troubleshooting

If `DIARIZATION_MODEL=3.1` fails with a 403, verify that both gated repos have been granted to the same account as the token in `.env.local`:

- `pyannote/speaker-diarization-3.1`
- `pyannote/segmentation-3.0`