Self-hosted, open-source audio transcription server with speaker diarization. Powered by faster-whisper and pyannote-audio.
~75x realtime with speaker diarization, ~140x transcription-only on a single NVIDIA RTX 4090. At ~75x, a 2.5-hour podcast (9,000 s of audio) transcribes and diarizes in about 2 minutes.
A transcription API server that runs on your hardware. Send it an audio or video file, get back timestamped text with speaker labels. No data leaves your network.
Two ways to use it:
- API — POST files to the server from any language or tool (curl, Python, JS, etc.)
- Web UI (optional) — A Streamlit frontend for uploading files, pasting YouTube URLs, and viewing transcripts in your browser
You need Docker, an NVIDIA GPU with the NVIDIA Container Toolkit, and a HuggingFace token (free — needed to download pyannote diarization models).
git clone https://github.com/arbitrationcity/ARBI-TR.git
cd ARBI-TR
# Set your HuggingFace token
echo "HF_TOKEN=hf_your_token_here" > .env
# Start the server + frontend
docker compose up

That's it. The server is at http://localhost:8000 and the web UI is at http://localhost:8501.
Models download automatically on first start and are cached on the host (at ~/.cache/huggingface by default — override with HF_CACHE in .env). Subsequent starts are fast.
Interactive docs are at http://localhost:8000/docs once the server is running.
Submit a file, get a session ID, poll for results. This is the full pipeline — speech recognition + speaker identification.
# Submit
curl -X POST http://localhost:8000/transcribe/ \
-F "file=@meeting.wav" \
-F "task_str=transcribe" \
-F "size_of_model=large"
# Response:
# {"session_id": "abc-123", "message": "Your request is queued for processing", "queue_position": 1}
# Poll for results
curl http://localhost:8000/task_status/abc-123
# Response (when complete):
# {
# "status": "completed",
# "segments": [
# {"Start": "0:00:00", "End": "0:00:05", "Speaker": "SPEAKER_00", "Text": "Good morning everyone"},
# {"Start": "0:00:05", "End": "0:00:09", "Speaker": "SPEAKER_01", "Text": "Hi, thanks for joining"},
# ...
# ]
# }

Optional parameters:
- `source_language`: ISO language code (e.g. `en`, `fr`). Omit for auto-detection.
- `speaker_number`: Expected number of speakers (1-8). Omit or set to `0` for auto-detection.
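The submit-then-poll flow above can be sketched in Python. This is a minimal sketch, not the project's client: `build_form_fields`, `wait_for_result`, and `format_segments` are hypothetical helper names, and the loop assumes that any status other than `completed` or `failed` means the job is still pending. The HTTP call is injected as `fetch_status` so the loop can be tried without a running server.

```python
import time
from typing import Callable, Optional


def build_form_fields(
    task_str: str = "transcribe",
    size_of_model: str = "large",
    source_language: Optional[str] = None,
    speaker_number: Optional[int] = None,
) -> dict:
    """Collect the text form fields for POST /transcribe/ (the file is sent separately)."""
    fields = {"task_str": task_str, "size_of_model": size_of_model}
    if source_language:  # omitted: the server auto-detects the language
        fields["source_language"] = source_language
    if speaker_number is not None:
        if not 0 <= speaker_number <= 8:
            raise ValueError("speaker_number must be 0 (auto) or 1-8")
        fields["speaker_number"] = str(speaker_number)
    return fields


def wait_for_result(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status() until the job completes, fails, or the timeout passes."""
    deadline = clock() + timeout
    while True:
        payload = fetch_status()
        status = payload.get("status")
        if status == "completed":
            return payload
        if status == "failed":
            raise RuntimeError(f"transcription failed: {payload}")
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} s")
        sleep(interval)  # assumption: any other status means "not done yet"


def format_segments(segments: list) -> str:
    """Render each segment as one '[start - end] speaker: text' line."""
    return "\n".join(
        f"[{s['Start']} - {s['End']}] {s['Speaker']}: {s['Text']}" for s in segments
    )


# Possible wiring against a live server (not executed here):
#   import json, urllib.request
#   url = "http://localhost:8000/task_status/abc-123"
#   result = wait_for_result(lambda: json.load(urllib.request.urlopen(url)))
#   print(format_segments(result["segments"]))
```

Injecting `fetch_status` keeps the retry logic independent of the HTTP library, so the same loop works with `urllib`, `requests`, or a test double.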
Drop-in replacements for the OpenAI audio API. These are synchronous (response blocks until done) and return plain text without speaker labels.
# Transcribe
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F "file=@meeting.wav" \
-F "model=whisper-large-v3"
# {"text": "Good morning everyone. Hi, thanks for joining..."}
# Translate (any language to English)
curl -X POST http://localhost:8000/v1/audio/translations \
-F "file=@reunion.wav" \
-F "model=whisper-large-v3"
# {"text": "Good morning everyone..."}

These endpoints accept the same parameters as the OpenAI Audio API: model, language, prompt, response_format, temperature.
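For callers who prefer not to shell out to curl, the synchronous transcription call can be approximated with only the Python standard library. This is a sketch under assumptions: `encode_multipart` and `transcribe_sync` are hypothetical helpers, and the code assumes the default response is the JSON object with a `text` field shown above.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(fields: dict, file_field: str, filename: str, data: bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
            f'Content-Type: application/octet-stream\r\n\r\n'
        ).encode()
        + data
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe_sync(path: str, base_url: str = "http://localhost:8000", **params) -> str:
    """POST a file to /v1/audio/transcriptions and return the "text" field."""
    with open(path, "rb") as fh:
        data = fh.read()
    fields = {"model": "whisper-large-v3", **params}  # e.g. language="fr"
    body, content_type = encode_multipart(fields, "file", os.path.basename(path), data)
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # The call blocks until the server has finished transcribing.
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

Pointing the URL at `/v1/audio/translations` instead would exercise the translation endpoint with the same request shape.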
curl http://localhost:8000/health
# {"status": "ok", "queue_length": 0}

| Method | Endpoint | Mode | Description |
|---|---|---|---|
| `GET` | `/health` | - | Health check with queue length |
| `POST` | `/transcribe/` | async | Transcribe + diarize. Returns session ID to poll. |
| `GET` | `/task_status/{session_id}` | - | Poll job status (queued, completed, failed) |
| `POST` | `/v1/audio/transcriptions` | sync | OpenAI-compatible transcription (no diarization) |
| `POST` | `/v1/audio/translations` | sync | OpenAI-compatible translation to English |
Supported input formats: WAV, MP3, MP4, M4A, FLAC, and anything else FFmpeg can decode.
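Scripts that start the stack and immediately submit work may want to wait for `/health` first, since models download on first start. The sketch below retries the health check a fixed number of times. `fetch_health` and `wait_until_healthy` are hypothetical names, and whether `/health` answers before the models finish loading is an assumption this README does not confirm.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost:8000") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
        return json.load(response)


def wait_until_healthy(
    probe: Callable[[], dict] = fetch_health,
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry probe() until it reports status "ok"; raise after `attempts` tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            health = probe()
            if health.get("status") == "ok":
                return health
            last_error = RuntimeError(f"unexpected health payload: {health}")
        except (urllib.error.URLError, ConnectionError, TimeoutError) as exc:
            last_error = exc  # server is not accepting connections yet
        if attempt < attempts - 1:
            sleep(delay)
    raise RuntimeError(f"server not healthy after {attempts} attempts") from last_error
```

The returned dictionary also carries `queue_length`, which a caller could use to decide whether to submit now or back off.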
The Streamlit frontend runs alongside the API server. Open http://localhost:8501 in your browser.
Features:
- Upload audio/video files (WAV, MP3, MP4, M4A) for transcription
- Paste YouTube URLs to download and transcribe videos directly
- Choose between transcription and translation (any language to English)
- Configure model size (small for speed, large for accuracy) and speaker count
- View results as a table with start time, end time, speaker, and text
If you only need the API server without the web UI, use the test compose file:
docker compose -f docker-compose.test.yaml up

Set these in your .env file:

| Variable | Description | Default |
|---|---|---|
| `HF_TOKEN` | HuggingFace token for pyannote models | (required) |
| `HF_CACHE` | Host path for model cache (bind-mounted into containers) | `~/.cache/huggingface` |
| `WHISPER_MODEL_SIZE` | Model size: `tiny`, `base`, `small`, `medium`, `large-v3` | `large-v3` |
| `WHISPER_BEAM_SIZE` | Beam search width | `5` |
| `WHISPER_BATCH_SIZE` | Batched inference chunk count | `24` |
| `PYANNOTE_MODEL` | Pyannote pipeline model | `pyannote/speaker-diarization-community-1` |
| `PYANNOTE_SEG_BATCH` | Pyannote segmentation batch size | `32` |
| `PYANNOTE_EMB_BATCH` | Pyannote embedding batch size | `32` |
| `ENABLE_TF32` | Enable TF32 matmul on Ampere+ GPUs | `1` |
| `WHISPER_DEVICE` | GPU for whisper (e.g. `cuda:0`, `cpu`) | `auto` |
| `DIARIZE_DEVICE` | GPU for pyannote (e.g. `cuda:1`, `cpu`) | `auto` |
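To illustrate how these variables and defaults fit together, here is a hypothetical settings loader. It is not the project's actual configuration code: the `Settings` class and `load_settings` function are invented for this sketch, and treating only `1` as "enabled" for `ENABLE_TF32` is an assumption. `HF_CACHE` is left out because it is described as a host path used for the bind mount.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    hf_token: str
    whisper_model_size: str = "large-v3"
    whisper_beam_size: int = 5
    whisper_batch_size: int = 24
    pyannote_model: str = "pyannote/speaker-diarization-community-1"
    pyannote_seg_batch: int = 32
    pyannote_emb_batch: int = 32
    enable_tf32: bool = True
    whisper_device: str = "auto"
    diarize_device: str = "auto"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the documented defaults."""
    token = env.get("HF_TOKEN", "")
    if not token:
        raise ValueError("HF_TOKEN is required to download the pyannote models")
    return Settings(
        hf_token=token,
        whisper_model_size=env.get("WHISPER_MODEL_SIZE", "large-v3"),
        whisper_beam_size=int(env.get("WHISPER_BEAM_SIZE", "5")),
        whisper_batch_size=int(env.get("WHISPER_BATCH_SIZE", "24")),
        pyannote_model=env.get(
            "PYANNOTE_MODEL", "pyannote/speaker-diarization-community-1"
        ),
        pyannote_seg_batch=int(env.get("PYANNOTE_SEG_BATCH", "32")),
        pyannote_emb_batch=int(env.get("PYANNOTE_EMB_BATCH", "32")),
        # Assumption: only the literal "1" enables TF32.
        enable_tf32=env.get("ENABLE_TF32", "1") == "1",
        whisper_device=env.get("WHISPER_DEVICE", "auto"),
        diarize_device=env.get("DIARIZE_DEVICE", "auto"),
    )
```

For example, `load_settings({"HF_TOKEN": "hf_xxx", "WHISPER_DEVICE": "cuda:0", "DIARIZE_DEVICE": "cuda:1"})` would describe a two-GPU split with every other value left at its default.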
| File | What it runs |
|---|---|
| `docker-compose.yaml` | API server + web UI (default) |
| `docker-compose.test.yaml` | API server only (for testing / API-only deployments) |
# Backend
cd backend
uv sync
HF_TOKEN=hf_xxx uv run uvicorn main:app --host 0.0.0.0 --port 8000
# Frontend (separate terminal)
cd frontend
uv sync
API_ENDPOINT="http://localhost:8000" uv run streamlit run app.py --server.port 8501 --server.address 0.0.0.0

Requires Python 3.11+, uv, and NVIDIA CUDA 12.x+.
# Backend unit tests (no GPU needed, models mocked)
cd backend && uv run pytest tests/ -v
# Frontend unit tests
cd frontend && uv run pytest tests/ -v
# Integration tests (requires running server with GPU)
HF_TOKEN=hf_xxx ./scripts/run-integration-tests.sh

./scripts/benchmark.sh # bundled 1-min clip
./scripts/benchmark.sh /path/to/long.wav # custom file

This project uses Commitizen with Conventional Commits for automated versioning and changelog generation.
uv tool install commitizen
cz commit # interactive conventional commit
cz bump # bump version + update CHANGELOG.md

MIT. See LICENSE.
Built on faster-whisper, pyannote-audio, and HuggingFace Transformers.