Apple Silicon native · zero outbound traffic on the happy path · drop-in replacement for api.anthropic.com → Deepgram
89 ms warm • 16 backends • one env var • zero cloud calls
Claude Code's /voice mode streams every word you say to api.anthropic.com → Deepgram: two third parties, with no in-product opt-out. stt-switch flips one env var (`VOICE_STREAM_BASE_URL`) to redirect the audio to a local WebSocket proxy that speaks the same Anthropic protocol. Your speech is transcribed on-device by your choice of 16 backends, including Whisper-large-v3-turbo on the Apple Silicon GPU at 89 ms warm, faster than the network round-trip to Deepgram. One CLI command swaps backends. MIT licensed.
# Install
pip install "git+https://github.com/gitjfmd/stt-switch.git#egg=stt-switch[mlx,moonshine]"
# Wire it up (adds env var + launchd agent)
stt-switch wire
# Pick the engine
stt-switch mlx-whisper-large-v3-turbo
# Open a NEW terminal so the env var is picked up
claude # /voice as usual; audio never leaves your Mac

Revert any time with `stt-switch unwire`.
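To confirm that a new shell actually picked up the redirect, one option is a small check of `VOICE_STREAM_BASE_URL`. The sketch below is illustrative only: the helper name `voice_stream_target` is hypothetical, and the example value `ws://127.0.0.1:8113` is taken from the architecture diagram rather than from the output of `stt-switch wire`.

```python
import ipaddress
import os
from urllib.parse import urlsplit


def voice_stream_target(env=os.environ) -> str:
    """Describe where /voice audio would go, based on VOICE_STREAM_BASE_URL."""
    url = env.get("VOICE_STREAM_BASE_URL", "")
    if not url:
        return "unset (default cloud endpoint)"
    host = urlsplit(url).hostname or ""
    try:
        # Literal IPs: accept only loopback addresses (127.0.0.0/8, ::1).
        local = ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal: treat only the conventional name as local.
        local = host == "localhost"
    return f"local proxy at {url}" if local else f"non-local endpoint {url}"


if __name__ == "__main__":
    print(voice_stream_target())
```

Run it from the same terminal you will start `claude` in, since the env var is read from that shell's environment.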
flowchart LR
A([Mac mic<br/>AVFoundation]) -->|16 kHz linear16| B
B[ws://127.0.0.1:8113<br/>cc-voice-proxy] -->|TranscriptText<br/>TranscriptEndpoint| A
B --> C{stt-switch<br/>backend}
C -->|swap by env var| D1[MLX-Whisper<br/>Apple GPU<br/>89 ms]
C --> D2[Moonshine<br/>CPU<br/>17 ms]
C --> D3[Remote<br/>OpenAI-compat<br/>200 ms]
D1 -.no network.-> X[(privacy)]
D2 -.no network.-> X
D3 -.your server.-> Y[(your control)]
style A fill:#1f6feb,stroke:#fff,color:#fff
style B fill:#8957e5,stroke:#fff,color:#fff
style D1 fill:#1a7f37,stroke:#fff,color:#fff
style D2 fill:#1a7f37,stroke:#fff,color:#fff
style D3 fill:#fb8500,stroke:#fff,color:#fff
The proxy speaks Anthropic's decoded voice_stream protocol, not Deepgram's wire format (the trap most people fall into). Authentication is unchanged: Claude Code still sends its OAuth bearer token, and the proxy ignores it.
| Profile | Latency | Notes |
|---|---|---|
| mlx-whisper-base-en | 14 ms | fastest accurate |
| moonshine-onnx-tiny | 17 ms | |
| moonshine-onnx-base | 30 ms | |
| mlx-whisper-small-en | 40 ms | |
| moonshine-stream-tiny | 73 ms | |
| mlx-whisper-large-v3-turbo | 89 ms | recommended |
| mlx-whisper-distil-large-v3 | 90 ms | |
| mlx-chain | 93 ms | chain with fallback |
| mlx-whisper-large-v3 | 147 ms | |
| remote-moonshine | 208 ms | |
| remote-chain | 197 ms | |
| moonshine-stream-medium | 280 ms | |
| remote-speaches (cloud GPU) | 412 ms | |
| Reference: deepgram via api.anthropic.com (cloud) | ~300–500 ms typical | Anthropic + Deepgram + network |
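The headline figures are warm latencies, meaning the model is already loaded when the clock starts. A minimal sketch of how such a number could be measured is shown here. It is not the project's `stt-switch bench` implementation: `warm_latency_ms` and `fake_backend` are hypothetical names, and the stand-in backend only sleeps.

```python
import statistics
import time
from typing import Callable


def warm_latency_ms(transcribe: Callable[[bytes], str], audio: bytes,
                    warmup: int = 1, runs: int = 5) -> float:
    """Median wall-clock latency in ms, excluding the initial cold calls."""
    for _ in range(warmup):
        transcribe(audio)  # cold call: model load and caches are paid here
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_backend(audio: bytes) -> str:
    # Stand-in for a real backend (hypothetical); sleeps about 5 ms per call.
    time.sleep(0.005)
    return "hello"


one_second_of_silence = b"\x00\x00" * 16_000  # 16 kHz, 16-bit samples
latency = warm_latency_ms(fake_backend, one_second_of_silence)
print(f"{latency:.1f} ms")
```

Using the median rather than the mean keeps a single slow run (for example, a scheduler hiccup) from skewing the result.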
Most people should default to `mlx-chain`: MLX whisper-large-v3-turbo runs locally at ~89 ms; on the rare empty transcript it cascades to a remote `large-v3` for accuracy. Best privacy + best accuracy + best latency, all at once.
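The fallback idea behind a chain profile can be sketched in a few lines. This is an illustration under assumptions, not the project's code: `transcribe_chain`, `local_fast`, and `remote_strong` are hypothetical names, and each backend is modeled as a plain callable from audio bytes to text.

```python
from typing import Callable, Optional, Sequence

Backend = Callable[[bytes], str]


def transcribe_chain(audio: bytes, backends: Sequence[Backend]) -> str:
    """Return the first non-empty transcript, trying backends in order."""
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            text = backend(audio).strip()
        except Exception as exc:  # a failed stage falls through to the next
            last_error = exc
            continue
        if text:
            return text
    if last_error is not None:
        raise last_error  # nothing produced text and at least one stage failed
    return ""  # every stage ran but returned an empty transcript


def local_fast(audio: bytes) -> str:
    return ""  # simulates the rare empty transcript from the local model


def remote_strong(audio: bytes) -> str:
    return "open the README"  # simulates the stronger fallback model


result = transcribe_chain(b"\x00\x00", [local_fast, remote_strong])
print(result)
```

Ordering the list fast-and-local first means the slower stage is only paid for when the first one returns nothing or fails.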
| Backend | Best for | Profiles |
|---|---|---|
| MLX-Whisper: Apple's native ML framework, running on the Apple GPU. | Lowest latency on Apple Silicon. Whisper accuracy. | |
| Moonshine: UsefulSensors' tiny ASR, purpose-built for streaming. | Sub-30 ms commands. CPU-only, no Apple Silicon required. | |
| faster-whisper: CTranslate2-quantized Whisper, cross-platform. | CPU baseline if MLX isn't an option. | |
| Remote OpenAI-compatible: any OpenAI-compatible server. | Offload to your own GPU server. Configurable via env. | |
| Chains: try fast/local first; fall back to a stronger model on empty/error. | Best accuracy + latency + privacy combo. | |
# Most users: MLX backends + Moonshine
pip install "git+https://github.com/gitjfmd/stt-switch.git#egg=stt-switch[mlx,moonshine]"
# Everything (largest install, all backends ready)
pip install "git+https://github.com/gitjfmd/stt-switch.git#egg=stt-switch[all]"
# MLX only (smallest install, fastest path on Apple Silicon)
pip install "git+https://github.com/gitjfmd/stt-switch.git#egg=stt-switch[mlx]"

# From source (editable install)
git clone https://github.com/gitjfmd/stt-switch.git
cd stt-switch
pip install -e ".[all]"

# System dependencies (Homebrew)
brew install ffmpeg # required (audio conversion)
brew install espeak-ng # only if using TTS profiles (not for STT)

| Command | What it does |
|---|---|
| `stt-switch` | Interactive picker (current marked) |
| `stt-switch list` | Show every profile |
| `stt-switch current` | Show what's running now |
| `stt-switch <profile>` | Set + reload (e.g. `stt-switch mlx-chain`) |
| `stt-switch reload` | Reload without changing profile |
| `stt-switch bench` | Latency leaderboard against the catalog |
| `stt-switch wire` | Install + start the proxy, route Claude Code to it |
| `stt-switch unwire` | Full revert |
$ stt-switch
STT backend switcher
Current: mlx-chain
▸ 1. moonshine-onnx-base
2. moonshine-onnx-tiny
3. moonshine-stream-tiny
...
11. mlx-whisper-large-v3-turbo
12. ▸ mlx-chain
...
Pick a number (or blank to cancel):
Claude Code's /voice mode opens a WebSocket to api.anthropic.com/api/ws/speech_to_text/voice_stream. Anthropic wraps Deepgram in its own simpler JSON shape, so sending Deepgram-native messages back fails silently: the client ignores them and renders "no speech detected", which is what we kept hitting at first.
The decoded protocol was verified by reading the publicly leaked Claude Code source.
Full notes: docs/PROTOCOL.md (with credit to the reverse-engineering community).
- No private hostnames or IPs are baked in. Remote endpoints require explicit env vars (`STT_SWITCH_REMOTE_*_URL`); the repo ships none.
- Loopback bind only (`127.0.0.1`). The proxy never listens on an external network interface and never trips your firewall.
- Optional clone capture (off by default). Set `STT_SWITCH_LUXTTS_URL` to opt in.
- Audio never lands on disk unless you turn on capture, in which case it's posted to your own server.
When the proxy gets a non-empty transcript with sane duration (3–30 s by default), it can fire-and-forget the audio + transcript to a LuxTTS server you run, growing your voice clone library automatically as you talk to Claude Code.
# Enable
export STT_SWITCH_LUXTTS_URL=https://your-luxtts-host:17860
# Disable
export STT_SWITCH_CAPTURE_ENABLED=0

Files land at `voices/mac/cc/<timestamp>.wav` on your LuxTTS server.
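The capture rule (non-empty transcript, 3–30 s of audio) reduces to a small predicate. This sketch is illustrative only: `pcm_duration_s` and `should_capture` are hypothetical names, the 16 kHz linear16 format comes from the diagram, and mono audio is an assumption.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz, as in the architecture diagram
BYTES_PER_SAMPLE = 2     # linear16 = 16-bit PCM; mono is assumed here


def pcm_duration_s(pcm: bytes) -> float:
    """Duration of a raw PCM buffer in seconds."""
    return len(pcm) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def should_capture(transcript: str, pcm: bytes,
                   min_s: float = 3.0, max_s: float = 30.0) -> bool:
    """Capture only non-empty transcripts whose audio lasts min_s..max_s."""
    if not transcript.strip():
        return False
    return min_s <= pcm_duration_s(pcm) <= max_s


five_seconds = b"\x00\x00" * (SAMPLE_RATE_HZ * 5)
one_second = b"\x00\x00" * SAMPLE_RATE_HZ

print(should_capture("hello there", five_seconds))
print(should_capture("hi", one_second))
```

The lower bound filters out short commands that make poor voice samples, and the upper bound keeps any single upload small.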
- MLX requires Apple Silicon. Intel Macs use Moonshine ONNX or faster-whisper-cpu.
- Linux/Windows aren't first-class yet. The proxy itself is plain Python and runs fine; the wire/unwire scripts use macOS launchd. systemd-user PRs welcome.
- First model load downloads weights (large MLX ~1 GB; cached after). Subsequent loads are sub-second.
- Letter-acronyms can mishear ("STT" → "ETS"). Use `mlx-whisper-large-v3` or `remote-speaches` for absolute accuracy.
- Anthropic's voice gating still applies. This redirects where audio goes, not whether the binary will send it.
- Restart `claude` after `stt-switch wire`: the binary reads the env var once at startup.
This project rests on a stack of excellent open-source work:
- Anthropic: for shipping Claude Code with the `VOICE_STREAM_BASE_URL` env var that makes this whole thing possible
- MLX + mlx-whisper: Apple's ML framework + Whisper port
- useful-moonshine-onnx: UsefulSensors Moonshine ONNX
- faster-whisper: CTranslate2-quantized Whisper
- Speaches: OpenAI-compatible STT/TTS server (used for the `remote-speaches` profile)
- Claude Code reverse-engineering community: for decoding the voice_stream protocol from the publicly available source
Built by Dr. Junaid Farooq, M.D., a board-certified physician (Internal Medicine, Infectious Diseases) who codes, building at the intersection of medicine, AI infrastructure, and security.
He runs hospital medicine by day and ships open-source medical/AI tooling by night through his companies:
- OpenMedica (IntelMedica): "reclaiming the future of medicine" with open-source, physician-governed medical AI tools (747+ skills, FDA data, clinical guidelines, drug interactions, the Open Medical Skills marketplace)
- SECSOLS: MCP-native infrastructure for AI-agent payments and identity (AgenticPay, Ephemeral Cards, Agent Identity Protocol)
- RN Scribe: nursing documentation software for clinical environments
stt-switch was born out of the same impulse that drives the rest of his work: infrastructure for autonomous AI agents should be private-first, on-device when possible, and built for the people who actually use it. A physician dictating to a coding agent shouldn't have their voice routed through two third parties to get a transcript back.
Find him at:
- jfmd.dev: the personal site
- @jfmdid on X
- linkedin.com/in/jfmdid on LinkedIn
- @gitjfmd on GitHub
PRs welcome. Areas where help would land especially well:
- Linux systemd-user variant of `wire-cc-voice.sh` / `unwire-cc-voice.sh`
- Windows support (proxy already runs; needs equivalent of launchd registration)
- NVIDIA CUDA path for `whisper-faster-large-v3` and Moonshine
- Live partial transcripts (interim `TranscriptText` events during speech)
- PyPI publish + GitHub Action that runs `stt-bench.py` on every PR