alesr/localt

localt

Local Transcription + Diarization

This project runs local transcription + speaker diarization with WhisperX.

Quick Start

  1. Create .env.local:
HF_TOKEN=hf_xxxxxxxxxxxxxxxxx
  2. Put your audio in this folder (default input is sample.mp3).
  3. Run:
make run
  4. Check outputs:
  • sample.out
  • sample.chunks.json

make run creates/updates .venv and installs dependencies automatically.

Python environment (.venv)

  • Manual activation is not required for normal use.
  • make run handles .venv setup and pip install -r requirements.txt for you.
  • If you want manual control, run:
python3.10 -m venv .venv
.venv/bin/python -m pip install --upgrade pip
.venv/bin/pip install -r requirements.txt
.venv/bin/python transcribe_local.py sample.mp3 --env-file .env.local

Requirements

  • Python 3.10
  • make
  • ffmpeg (used for automatic pre-conversion)
  • Hugging Face token in an env file (default .env.local)

Why HF_TOKEN is required

ASR runs locally, but diarization models are downloaded from gated Hugging Face repos.

  • HF_TOKEN authenticates model download/access.
  • Audio processing remains local.
  • First run downloads models to cache; next runs reuse cache.
  • Missing access causes 403 errors during diarization model load.

Token source behavior:

  • HF_TOKEN is read from ENV_FILE (default .env.local).
  • Shell env vars are not used as token source.
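A minimal sketch of how such an env-file lookup could work. The function name and parsing rules below are illustrative, not the project's actual code:

```python
def read_hf_token(env_file: str = ".env.local") -> str:
    """Read HF_TOKEN from a KEY=VALUE env file, skipping blanks and # comments."""
    with open(env_file) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            if key.strip() == "HF_TOKEN":
                return value.strip().strip('"')
    raise KeyError(f"HF_TOKEN not found in {env_file}")
```

Reading only from the file (and not the shell environment) keeps the token source explicit and reproducible across runs.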

What the program does

  • Transcribes audio with the Whisper base model (fixed).
  • Runs diarization with exactly 2 speakers (fixed).
  • Assigns speakers to transcript segments.
  • Chunks the transcript into sentence-safe groups while preserving transcript tags.
  • Writes:
    • sample.out (human-readable)
    • sample.chunks.json (structured)
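The sentence-safe chunking step can be sketched as grouping whole transcript segments so no segment is split across chunks. This is an illustrative simplification; the real chunking rules (group size, tag handling) may differ:

```python
def chunk_segments(segments, max_per_chunk=5):
    """Group transcript segments into chunks of at most max_per_chunk,
    never splitting a segment, while carrying each segment's tag along."""
    chunks = []
    for i in range(0, len(segments), max_per_chunk):
        group = segments[i:i + max_per_chunk]
        chunks.append({
            "chunk_id": len(chunks),
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(s["text"] for s in group),
            "transcript_tags": [s["tag"] for s in group],
        })
    return chunks
```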

Run examples

make run
make run DIARIZATION_MODEL=3.1
make run PROFILE=mac-arm-mps
make run AUDIO=meeting.wav OUTPUT_FILE=meeting.out CHUNKS_FILE=meeting.chunks.json
make run PRECONVERT=0

Settings explained

  • AUDIO (default sample.mp3): input audio path.
  • ENV_FILE (default .env.local): env file containing HF_TOKEN.
  • CACHE_DIR (default ~/.cache/whisperx_models): model cache location.
  • OUTPUT_FILE (default sample.out): transcript text output.
  • CHUNKS_FILE (default sample.chunks.json): structured chunk output.
  • DIARIZATION_MODEL (community-1 or 3.1): diarization pipeline model.
  • PROFILE (auto, mac-arm, mac-arm-mps, mac-intel, win-cuda, win-cpu-safe): runtime preset.
  • ASR_DEVICE (optional): override transcription device (cpu, mps, cuda).
  • DIARIZATION_DEVICE (optional): override diarization device (cpu, mps, cuda).
  • PRECONVERT (default 1): automatic audio normalization.
    • 1/true/yes: enable pre-conversion.
    • 0/false/no: disable pre-conversion.
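The PRECONVERT truthy/falsy values above can be interpreted with a small helper like this (a sketch; the helper name is hypothetical):

```python
def parse_flag(value: str) -> bool:
    """Interpret PRECONVERT-style values: 1/true/yes enable, 0/false/no disable."""
    v = value.strip().lower()
    if v in ("1", "true", "yes"):
        return True
    if v in ("0", "false", "no"):
        return False
    raise ValueError(f"unrecognized flag value: {value!r}")
```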

Diarization model differences

  • community-1
    • Usually more stable quality.
    • Often slower on CPU.
    • Good quality-first default.
  • 3.1
    • Often faster on CPU.
    • Quality can vary more by audio.
    • Requires access to both:
      • pyannote/speaker-diarization-3.1
      • pyannote/segmentation-3.0

Profile behavior

  • auto: detects machine and picks defaults.
  • win-cuda: falls back to win-cpu-safe if CUDA is unavailable.
  • mac-arm-mps: uses MPS for diarization when available, otherwise CPU fallback.

At startup the app prints requested profile, resolved profile, and effective runtime settings.
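One way the requested-vs-resolved device logic could look, with fallbacks to CPU when an accelerator is unavailable. Function and parameter names here are hypothetical, not the project's code:

```python
def resolve_device(requested: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Pick an available device for the requested one, falling back to CPU."""
    if requested == "cuda" and cuda_ok:
        return "cuda"
    if requested == "mps" and mps_ok:
        return "mps"
    return "cpu"
```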

Pre-conversion behavior (enabled by default)

  • If the input is already a mono 16 kHz WAV, conversion is skipped.
  • Otherwise the input is converted to a temporary mono 16 kHz WAV via ffmpeg.
  • ASR + diarization both run on that normalized temp file.
  • Temp file is removed automatically at the end.
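The normalization step above amounts to an ffmpeg invocation like the one sketched below. The helper names are hypothetical, and the real tool additionally skips conversion when the input already matches:

```python
import subprocess
import tempfile

def ffmpeg_cmd(src: str, dst: str) -> list:
    """Build the ffmpeg command for mono, 16 kHz WAV normalization."""
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst]

def preconvert(path: str) -> str:
    """Convert the input to a temporary mono 16 kHz WAV and return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.close()
    subprocess.run(ffmpeg_cmd(path, tmp.name), check=True)
    return tmp.name
```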

Pipeline steps

  1. Validate input files and settings.
  2. Resolve runtime profile and device fallbacks.
  3. Load HF_TOKEN from env file.
  4. Pre-convert audio when needed (PRECONVERT=1).
  5. Transcribe audio (base).
  6. Run diarization model (community-1 or 3.1) with 2 speakers.
  7. Assign speakers to transcript segments.
  8. Build sentence-safe chunks and keep transcript tags.
  9. Write sample.out and sample.chunks.json.

Output formats

sample.out

[MM:SS] Speaker: "Text"
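A sketch of how one such line can be produced from a segment's start time, speaker label, and text (the helper name is hypothetical):

```python
def format_line(start_sec: float, speaker: str, text: str) -> str:
    """Render one sample.out line as [MM:SS] Speaker: "Text"."""
    m, s = divmod(int(start_sec), 60)
    return f'[{m:02d}:{s:02d}] {speaker}: "{text}"'
```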

sample.chunks.json

Contains:

  • summary
    • transcript_tag_count
    • calculated_chunk_count
    • planned_tags_per_chunk
  • chunks[]
    • chunk_id, start, end, text
    • transcript_tags[]
    • transcripts[] with id, tag, start, end, speaker, text
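Given that layout, downstream tools can walk the structured output with a few lines of Python. This reader is a sketch assuming the field names listed above:

```python
import json

def load_chunks(path: str):
    """Load sample.chunks.json and yield (chunk_id, speaker-tagged lines)."""
    with open(path) as f:
        data = json.load(f)
    for chunk in data["chunks"]:
        lines = [f'{t["speaker"]}: {t["text"]}' for t in chunk["transcripts"]]
        yield chunk["chunk_id"], lines
```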

Troubleshooting

If DIARIZATION_MODEL=3.1 fails with 403, verify that access to both gated repos has been granted to the same account that issued the token in .env.local:

  • pyannote/speaker-diarization-3.1
  • pyannote/segmentation-3.0
