# Local Transcription + Diarization

This project runs local transcription + speaker diarization with WhisperX.
## Quick start

- Create `.env.local`:

  ```
  HF_TOKEN=hf_xxxxxxxxxxxxxxxxx
  ```

- Put your audio in this folder (the default input is `sample.mp3`).
- Run: `make run`
- Check the outputs: `sample.out` and `sample.chunks.json`
`make run` creates/updates `.venv` and installs dependencies automatically (it handles `.venv` setup and `pip install -r requirements.txt` for you), so manual activation is not required for normal use.

If you want manual control, run:
```sh
python3.10 -m venv .venv
.venv/bin/python -m pip install --upgrade pip
.venv/bin/pip install -r requirements.txt
.venv/bin/python transcribe_local.py sample.mp3 --env-file .env.local
```

## Requirements

- Python 3.10
- `make`
- `ffmpeg` (used for automatic pre-conversion)
- Hugging Face token in an env file (default `.env.local`)
## Hugging Face token

ASR runs locally, but diarization models are downloaded from gated Hugging Face repos.

- `HF_TOKEN` authenticates model download/access.
- Audio processing remains local.
- The first run downloads models to the cache; later runs reuse the cache.
- Missing repo access causes 403 errors during diarization model load.
Token source behavior:

- `HF_TOKEN` is read from `ENV_FILE` (default `.env.local`).
- Shell environment variables are not used as a token source.
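The env-file-only behavior can be sketched with a minimal parser like the one below. This is an illustration of the documented rule, not the script's actual code; the function name and error handling are assumptions.

```python
def read_hf_token(env_file: str = ".env.local") -> str:
    """Read HF_TOKEN strictly from the env file.

    The shell environment is deliberately ignored, matching the
    documented token-source behavior.
    """
    with open(env_file) as f:
        for line in f:
            line = line.strip()
            if line.startswith("HF_TOKEN="):
                # Split only on the first '=' so token values containing
                # '=' survive intact.
                return line.split("=", 1)[1]
    raise RuntimeError(f"HF_TOKEN not found in {env_file}")
```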
## What it does

- Transcribes audio with the Whisper `base` model (fixed).
- Runs diarization with exactly 2 speakers (fixed).
- Assigns speakers to transcript segments.
- Chunks the transcript into sentence-safe groups while preserving transcript tags.
- Writes:
  - `sample.out` (human-readable)
  - `sample.chunks.json` (structured)
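The sentence-safe grouping can be sketched roughly as follows. This is a minimal illustration: the segment shape, the `max_chars` budget, and the sentence-boundary regex are assumptions, not the script's actual parameters.

```python
import re

def sentence_safe_chunks(segments, max_chars=600):
    """Group transcript segments into chunks, closing a chunk only at a
    sentence-final segment so no sentence is split across chunks.

    Each segment is assumed to be a dict with 'tag' and 'text' keys
    (illustrative shape; the real script preserves transcript tags).
    """
    chunks, current, size = [], [], 0
    for seg in segments:
        current.append(seg)
        size += len(seg["text"])
        # Break only once over budget AND the segment ends a sentence.
        if size >= max_chars and re.search(r'[.!?]["\']?$', seg["text"].strip()):
            chunks.append(current)
            current, size = [], 0
    if current:
        chunks.append(current)
    return chunks
```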
## Usage

```sh
make run
make run DIARIZATION_MODEL=3.1
make run PROFILE=mac-arm-mps
make run AUDIO=meeting.wav OUTPUT_FILE=meeting.out CHUNKS_FILE=meeting.chunks.json
make run PRECONVERT=0
```

Variables:

- `AUDIO` (default `sample.mp3`): input audio path.
- `ENV_FILE` (default `.env.local`): env file containing `HF_TOKEN`.
- `CACHE_DIR` (default `~/.cache/whisperx_models`): model cache location.
- `OUTPUT_FILE` (default `sample.out`): transcript text output.
- `CHUNKS_FILE` (default `sample.chunks.json`): structured chunk output.
- `DIARIZATION_MODEL` (`community-1` or `3.1`): diarization pipeline model.
- `PROFILE` (`auto`, `mac-arm`, `mac-arm-mps`, `mac-intel`, `win-cuda`, `win-cpu-safe`): runtime preset.
- `ASR_DEVICE` (optional): override transcription device (`cpu`, `mps`, `cuda`).
- `DIARIZATION_DEVICE` (optional): override diarization device (`cpu`, `mps`, `cuda`).
- `PRECONVERT` (default `1`): automatic audio normalization.
  - `1`/`true`/`yes`: enable pre-conversion.
  - `0`/`false`/`no`: disable pre-conversion.
## Diarization models

`community-1`:

- Usually more stable quality.
- Often slower on CPU.
- Good quality-first default.

`3.1`:

- Often faster on CPU.
- Quality can vary more by audio.
- Requires access to both:
  - `pyannote/speaker-diarization-3.1`
  - `pyannote/segmentation-3.0`
## Profiles

- `auto`: detects the machine and picks defaults.
- `win-cuda`: falls back to `win-cpu-safe` if CUDA is unavailable.
- `mac-arm-mps`: uses MPS for diarization when available, otherwise falls back to CPU.

At startup the app prints the requested profile, the resolved profile, and the effective runtime settings.
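The requested-to-resolved mapping can be sketched like this. The `win-cuda` fallback is the documented behavior; the `auto` branch and the capability flags are assumptions about how detection might work, not the script's actual logic.

```python
def resolve_profile(requested: str, has_cuda: bool, has_mps: bool,
                    is_mac_arm: bool) -> str:
    """Resolve a requested runtime profile to an effective one."""
    if requested == "auto":
        # Hypothetical machine detection: prefer MPS on Apple Silicon,
        # CUDA elsewhere, with a CPU-safe fallback.
        if is_mac_arm:
            return "mac-arm-mps" if has_mps else "mac-arm"
        return "win-cuda" if has_cuda else "win-cpu-safe"
    if requested == "win-cuda" and not has_cuda:
        return "win-cpu-safe"  # documented fallback when CUDA is unavailable
    return requested
```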
## Audio pre-conversion

- If the input is already a mono 16 kHz WAV, conversion is skipped.
- Otherwise the input is converted to a temporary mono 16 kHz WAV via `ffmpeg`.
- ASR and diarization both run on that normalized temp file.
- The temp file is removed automatically at the end.
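The skip check and conversion step can be sketched as below. The function names and the way channel/sample-rate information reaches the check are illustrative assumptions; only the `ffmpeg` flags reflect the documented mono 16 kHz target.

```python
import os
import subprocess
import tempfile

def is_normalized(path: str, channels: int, sample_rate: int) -> bool:
    """True when the input is already a mono 16 kHz WAV, so conversion
    can be skipped."""
    return path.lower().endswith(".wav") and channels == 1 and sample_rate == 16000

def preconvert(path: str) -> str:
    """Convert the input to a temporary mono 16 kHz WAV via ffmpeg and
    return the temp file's path; the caller removes it after the run."""
    fd, tmp = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    subprocess.run(
        ["ffmpeg", "-y", "-i", path, "-ac", "1", "-ar", "16000", tmp],
        check=True,
    )
    return tmp
```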
## Pipeline

- Validate input files and settings.
- Resolve the runtime profile and device fallbacks.
- Load `HF_TOKEN` from the env file.
- Pre-convert audio when needed (`PRECONVERT=1`).
- Transcribe audio (`base`).
- Run the diarization model (`community-1` or `3.1`) with 2 speakers.
- Assign speakers to transcript segments.
- Build sentence-safe chunks and keep transcript tags.
- Write `sample.out` and `sample.chunks.json`.
## Output format

`sample.out` lines look like:

```
[MM:SS] Speaker: "Text"
```

`sample.chunks.json` contains:

- `summary`: `transcript_tag_count`, `calculated_chunk_count`, `planned_tags_per_chunk`
- `chunks[]`: `chunk_id`, `start`, `end`, `text`, `transcript_tags[]`, and `transcripts[]` with `id`, `tag`, `start`, `end`, `speaker`, `text`
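An illustrative sketch of the file's shape, assembled from the key list above. All values, the speaker label, and the exact nesting are assumptions; check a real output file for the authoritative structure.

```json
{
  "summary": {
    "transcript_tag_count": 12,
    "calculated_chunk_count": 3,
    "planned_tags_per_chunk": 4
  },
  "chunks": [
    {
      "chunk_id": 1,
      "start": 0.0,
      "end": 42.5,
      "text": "...",
      "transcript_tags": [1, 2, 3, 4],
      "transcripts": [
        {
          "id": 1,
          "tag": 1,
          "start": 0.0,
          "end": 3.2,
          "speaker": "SPEAKER_00",
          "text": "..."
        }
      ]
    }
  ]
}
```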
## Troubleshooting

If `DIARIZATION_MODEL=3.1` fails with a 403, verify that both gated repos have been granted to the same account as the token in `.env.local`:

- `pyannote/speaker-diarization-3.1`
- `pyannote/segmentation-3.0`