gitjfmd/stt-switch

🎙️ stt-switch

Route Claude Code voice mode to a local STT engine of your choice

Apple Silicon native · zero outbound traffic on the happy path · drop-in replacement for api.anthropic.com → Deepgram

License: MIT · macOS · Python · MLX · Privacy · PRs Welcome

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│   89 ms warm  •  16 backends  •  one env var  •  zero cloud calls      │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

🚀 The pitch in one paragraph

Claude Code's /voice mode streams every word you say to api.anthropic.com → Deepgram: two third parties, with no in-product opt-out. stt-switch sets one env var (VOICE_STREAM_BASE_URL) to redirect the audio to a local WebSocket proxy that speaks the same Anthropic protocol. Your speech is transcribed on-device by your choice of 16 backends, including Whisper-large-v3-turbo on the Apple Silicon GPU at 89 ms warm, which is faster than the network round-trip to Deepgram. One CLI command swaps backends. MIT licensed. ✨


⚡ Quick demo

# Install
pip install "git+https://github.com/gitjfmd/stt-switch.git#egg=stt-switch[mlx,moonshine]"

# Wire it up (adds env var + launchd agent)
stt-switch wire

# Pick the engine
stt-switch mlx-whisper-large-v3-turbo

# Open a NEW terminal so the env var is picked up
claude        # /voice as usual: audio never leaves your Mac

Revert any time with stt-switch unwire.


🏗️ Architecture

flowchart LR
    A([🎙️ Mac mic<br/>AVFoundation]) -->|16 kHz linear16| B
    B[ws://127.0.0.1:8113<br/>cc-voice-proxy] -->|TranscriptText<br/>TranscriptEndpoint| A
    B --> C{stt-switch<br/>backend}
    C -->|swap by env var| D1[🍎 MLX-Whisper<br/>Apple GPU<br/>89 ms]
    C --> D2[🔊 Moonshine<br/>CPU<br/>17 ms]
    C --> D3[🌐 Remote<br/>OpenAI-compat<br/>200 ms]
    D1 -.no network.-> X[(privacy)]
    D2 -.no network.-> X
    D3 -.your server.-> Y[(your control)]

    style A fill:#1f6feb,stroke:#fff,color:#fff
    style B fill:#8957e5,stroke:#fff,color:#fff
    style D1 fill:#1a7f37,stroke:#fff,color:#fff
    style D2 fill:#1a7f37,stroke:#fff,color:#fff
    style D3 fill:#fb8500,stroke:#fff,color:#fff

The proxy speaks Anthropic's decoded voice_stream protocol, not Deepgram's wire format (that's the trap most people fall into). Authentication is unchanged: Claude Code still sends its OAuth bearer token, and the proxy ignores it.
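The diagram labels the microphone stream as 16 kHz linear16. As a rough illustration of what that means for a backend, the sketch below measures and decodes such a buffer. It assumes mono, little-endian samples, and the helper names are hypothetical rather than part of stt-switch.

```python
import array
import sys

SAMPLE_RATE_HZ = 16_000   # from the diagram: 16 kHz
BYTES_PER_SAMPLE = 2      # linear16 = signed 16-bit samples


def pcm_duration_seconds(pcm: bytes, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Length of a mono linear16 buffer in seconds."""
    return len(pcm) / (BYTES_PER_SAMPLE * sample_rate)


def pcm_to_floats(pcm: bytes) -> list[float]:
    """Decode little-endian linear16 bytes into floats in [-1.0, 1.0)."""
    samples = array.array("h")
    # Drop a trailing odd byte, if any, so the buffer holds whole samples.
    samples.frombytes(pcm[: len(pcm) - len(pcm) % BYTES_PER_SAMPLE])
    if sys.byteorder == "big":
        samples.byteswap()  # assumption: the stream is little-endian
    return [s / 32768.0 for s in samples]
```

Most speech models expect normalized float samples, so a conversion along these lines typically sits between the WebSocket frames and the model call.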


⏱️ Latency leaderboard (Apple M5 Max, 3-second clip)

mlx-whisper-base-en           ▏ 14 ms     ▏▏▏░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  πŸ† fastest accurate
moonshine-onnx-tiny           ▏ 17 ms     ▏▏▏▏░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  
moonshine-onnx-base           ▏ 30 ms     ▏▏▏▏▏▏▏░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  
mlx-whisper-small-en          ▏ 40 ms     ▏▏▏▏▏▏▏▏▏░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  
moonshine-stream-tiny         ▏ 73 ms     ▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏░░░░░░░░░░░░░░░░░░░░░░  
mlx-whisper-large-v3-turbo    ▏ 89 ms     ▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏░░░░░░░░░░░░░░░░░░░  ⭐ recommended
mlx-whisper-distil-large-v3   ▏ 90 ms     ▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏░░░░░░░░░░░░░░░░░░░  
mlx-chain                     ▏ 93 ms     ▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏░░░░░░░░░░░░░░░░░░  ⭐ chain w/ fallback
mlx-whisper-large-v3          ▏147 ms     ▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏░░░░░░  
remote-moonshine              ▏208 ms     ▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏  
remote-chain                  ▏197 ms     ▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏  
moonshine-stream-medium       ▏280 ms     ▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏ + cont
remote-speaches (cloud GPU)   ▏412 ms     ▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏ + cont

   ──── reference ────
deepgram via api.anthropic.com (cloud) ~ 300–500 ms typical (Anthropic + Deepgram + network)
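"Warm" in these numbers means the model is already loaded. One way to take a comparable measurement for any backend is sketched below. This is not the code behind stt-switch bench; `warm_latency_ms` and its `transcribe` callable are hypothetical stand-ins for whatever function turns a clip into text.

```python
import statistics
import time
from typing import Callable


def warm_latency_ms(transcribe: Callable[[bytes], str], clip: bytes,
                    warmup: int = 1, runs: int = 5) -> float:
    """Median wall-clock latency in milliseconds, excluding warm-up calls."""
    for _ in range(warmup):
        transcribe(clip)  # cold call: pays for model load and cache fill
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(clip)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

Taking the median rather than the mean keeps a single slow run (for example, a background process stealing the GPU) from skewing the figure.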

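Most people should default to mlx-chain: MLX whisper-large-v3-turbo runs locally at ~89 ms, and on the rare empty transcript the chain falls back to a remote large-v3 server that you configure. That combines the privacy and latency of local inference with the accuracy of the larger model. The fallback is the one case where audio leaves your Mac.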
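The chain idea can be sketched in a few lines. This is an illustration of the fallback pattern only, not the project's implementation: `transcribe_with_fallback` and the `Backend` alias are hypothetical names, and each backend is modeled as a plain callable from PCM bytes to text.

```python
from typing import Callable, Sequence

Backend = Callable[[bytes], str]


def transcribe_with_fallback(pcm: bytes, backends: Sequence[Backend]) -> str:
    """Return the first non-empty transcript, trying backends in order."""
    for backend in backends:
        try:
            text = backend(pcm).strip()
        except Exception:
            continue  # treat a backend error like an empty result
        if text:
            return text
    return ""  # every backend came back empty or failed
```

Putting the local model first means the remote model is only contacted when the local one returns nothing, which is why the common case stays on-device.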


🎛️ Backend catalog

| Backend | Best for | Profiles |
|---|---|---|
| 🍎 MLX-Whisper. Apple's native ML framework, running on the Apple Silicon GPU. | Lowest latency on Apple Silicon. Whisper accuracy. | `mlx-whisper-tiny-en`, `mlx-whisper-base-en`, `mlx-whisper-small-en`, `mlx-whisper-medium-en`, `mlx-whisper-medium`, `mlx-whisper-distil-large-v3`, `mlx-whisper-large-v3-turbo`, `mlx-whisper-large-v3` |
| 🔊 Moonshine. Useful Sensors' tiny ASR, purpose-built for streaming. | Sub-30 ms commands. CPU-only, no Apple Silicon required. | `moonshine-onnx-tiny`, `moonshine-onnx-base`, `moonshine-stream-tiny`, `moonshine-stream-medium`, `moonshine-hf-base` |
| 📚 faster-whisper. CTranslate2-quantized Whisper, cross-platform. | CPU baseline if MLX isn't an option. | `whisper-faster-base`, `whisper-faster-large-v3` |
| 🌐 Remote OpenAI-compatible. Any server with /v1/audio/transcriptions: Speaches, self-hosted Whisper, anything. | Offload to your own GPU server. Configurable via env. | `remote-moonshine`, `remote-speaches`, `remote-chain` |
| 🔗 Chains. Try fast/local first; fall back to a stronger model on empty/error. | Best accuracy + latency + privacy combo. | `mlx-chain` ⭐, `remote-chain` |
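For the remote OpenAI-compatible backends, the request is a multipart POST to /v1/audio/transcriptions with a `file` and a `model` field. The sketch below builds such a request with the standard library only. It is not taken from stt-switch: the function names, the `clip.wav` filename, and passing the base URL as an argument are assumptions for illustration.

```python
import io
import urllib.request
import uuid
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16_000) -> bytes:
    """Wrap mono linear16 PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def build_transcription_request(base_url: str, pcm: bytes,
                                model: str) -> urllib.request.Request:
    """Build (but do not send) a multipart transcription request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="model"\r\n\r\n',
        model.encode() + b"\r\n",
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="file"; filename="clip.wav"\r\n',
        b"Content-Type: audio/wav\r\n\r\n",
        pcm_to_wav(pcm) + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` and reading the transcript from the response is left out here, because response shapes can differ between servers; OpenAI's own API returns JSON with a `text` field by default.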


🔧 Installation

# Most users: MLX backends + Moonshine
pip install "git+https://github.com/gitjfmd/stt-switch.git#egg=stt-switch[mlx,moonshine]"

# Everything (largest install, all backends ready)
pip install "git+https://github.com/gitjfmd/stt-switch.git#egg=stt-switch[all]"

# MLX only (smallest install, fastest path on Apple Silicon)
pip install "git+https://github.com/gitjfmd/stt-switch.git#egg=stt-switch[mlx]"

From source

git clone https://github.com/gitjfmd/stt-switch.git
cd stt-switch
pip install -e ".[all]"

System dependencies

brew install ffmpeg                 # required (audio conversion)
brew install espeak-ng              # only if using TTS profiles (not for STT)

💻 CLI

| Command | What it does |
|---|---|
| `stt-switch` | Interactive picker (current marked) |
| `stt-switch list` | Show every profile |
| `stt-switch current` | Show what's running now |
| `stt-switch <profile>` | Set + reload (e.g. `stt-switch mlx-chain`) |
| `stt-switch reload` | Reload without changing profile |
| `stt-switch bench` | Latency leaderboard against the catalog |
| `stt-switch wire` | Install + start the proxy, route Claude Code to it |
| `stt-switch unwire` | Full revert |

$ stt-switch
STT backend switcher
Current: mlx-chain

  ▸  1. moonshine-onnx-base
     2. moonshine-onnx-tiny
     3. moonshine-stream-tiny
     ...
    11. mlx-whisper-large-v3-turbo
    12. ▸ mlx-chain
    ...
Pick a number (or blank to cancel):

🧪 How the protocol decode works

Claude Code's /voice mode opens a WebSocket to api.anthropic.com/api/ws/speech_to_text/voice_stream. Anthropic wraps Deepgram in its own simpler JSON shape, so sending Deepgram-native messages back fails silently: the client ignores them and renders "no speech detected", which is what we kept hitting at first.

The decoded protocol, verified by reading the publicly leaked Claude Code source:

// Server → client
{"type":"TranscriptText",      "data":"your transcribed words"}
{"type":"TranscriptEndpoint"}
{"type":"TranscriptError",     "error_code":"...", "description":"..."}

// Client → server
binary linear16 PCM frames
{"type":"KeepAlive"}                 // every ~8 s
{"type":"CloseStream"}               // on key release / finalize

Full notes: docs/PROTOCOL.md (with credit to the reverse-engineering community).
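To make the message shapes concrete, here is a minimal sketch of a per-connection handler that consumes the client frames listed above and produces the server replies. It is not the proxy's actual code: `VoiceSession` and its helpers are hypothetical, transport is left out entirely, and replying with TranscriptText followed by TranscriptEndpoint on CloseStream is an assumption about ordering.

```python
import json
from typing import Callable, Union


def transcript_text(text: str) -> str:
    return json.dumps({"type": "TranscriptText", "data": text})


def transcript_endpoint() -> str:
    return json.dumps({"type": "TranscriptEndpoint"})


class VoiceSession:
    """Buffers audio for one utterance and emits replies on CloseStream."""

    def __init__(self, transcribe: Callable[[bytes], str]) -> None:
        self._transcribe = transcribe
        self._pcm = bytearray()

    def handle(self, frame: Union[bytes, str]) -> list[str]:
        """Return the JSON messages to send back for one incoming frame."""
        if isinstance(frame, (bytes, bytearray)):
            self._pcm.extend(frame)  # binary frame: linear16 audio
            return []
        kind = json.loads(frame).get("type")
        if kind == "CloseStream":
            text = self._transcribe(bytes(self._pcm))
            self._pcm.clear()
            replies = [transcript_text(text)] if text else []
            return replies + [transcript_endpoint()]
        return []  # KeepAlive and unknown types need no reply
```

Keeping the handler free of any WebSocket library makes it easy to exercise with plain bytes and strings before wiring it to a real server.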


🛡️ Privacy posture

Default (mlx-chain)

    🎙️  →  ws://127.0.0.1:8113  →  🍎 MLX-Whisper  →  📝
                ↑                          ↑
       loopback only                 in-process

         ZERO OUTBOUND TRAFFIC ON THE HAPPY PATH

What Claude Code does without us

    🎙️  →  wss://api.anthropic.com  →  🌐 Deepgram  →  📝

            ✗ Anthropic sees audio
            ✗ Deepgram sees audio
            ✗ no in-product opt-out

  • No private hostnames or IPs are baked in. Remote endpoints require explicit env vars (STT_SWITCH_REMOTE_*_URL); the repo ships none.
  • Loopback bind only (127.0.0.1). The proxy never listens on an external network interface and never trips your firewall.
  • Optional clone capture (off by default). Set STT_SWITCH_LUXTTS_URL to opt in.
  • Audio never lands on disk unless you turn on capture, in which case it's posted to your own server.
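A quick way to check the loopback claim on your own machine is to inspect the URL that Claude Code is pointed at. The sketch below is a hypothetical helper, not part of stt-switch, and it assumes VOICE_STREAM_BASE_URL holds a ws:// or wss:// URL with an IP-literal host such as the ws://127.0.0.1:8113 address shown in the diagrams.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_ws_url(url: str) -> bool:
    """True if url is ws:// or wss:// and its host is a loopback IP literal."""
    parts = urlsplit(url)
    if parts.scheme not in ("ws", "wss") or not parts.hostname:
        return False
    try:
        return ipaddress.ip_address(parts.hostname).is_loopback
    except ValueError:
        return False  # hostnames such as "localhost" are not resolved here
```

For example, `is_loopback_ws_url(os.environ.get("VOICE_STREAM_BASE_URL", ""))` (after `import os`) returns False when the variable is unset or points at a remote host.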

📦 Optional: voice-clone capture

When the proxy gets a non-empty transcript with a sane duration (3–30 s by default), it can fire-and-forget the audio + transcript to a LuxTTS server you run, growing your voice-clone library automatically as you talk to Claude Code.

# Enable
export STT_SWITCH_LUXTTS_URL=https://your-luxtts-host:17860

# Disable
export STT_SWITCH_CAPTURE_ENABLED=0

Files land at voices/mac/cc/<timestamp>.wav on your LuxTTS server.
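The capture conditions described in this section can be written as a single predicate. This is a sketch under stated assumptions, not the proxy's code: `should_capture` is a hypothetical name, and the way the two environment variables combine (URL set means opted in, `STT_SWITCH_CAPTURE_ENABLED=0` overrides it) is inferred from the text.

```python
from typing import Mapping


def should_capture(transcript: str, duration_s: float, env: Mapping[str, str],
                   min_s: float = 3.0, max_s: float = 30.0) -> bool:
    """Decide whether a finished utterance qualifies for capture."""
    if not env.get("STT_SWITCH_LUXTTS_URL"):
        return False  # opt-in: no server URL, no capture
    if env.get("STT_SWITCH_CAPTURE_ENABLED") == "0":
        return False  # explicit off switch
    # Non-empty transcript with a duration inside the default 3-30 s window.
    return bool(transcript.strip()) and min_s <= duration_s <= max_s
```

Passing the environment in as a mapping (for example `os.environ`) keeps the check easy to test without touching real process state.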


⚠️ Caveats

🍎 MLX requires Apple Silicon. Intel Macs use Moonshine ONNX or faster-whisper-cpu.
🐧 Linux/Windows aren't first-class yet. Proxy itself is plain Python and runs fine; the wire/unwire scripts use macOS launchd. systemd-user PRs welcome.
πŸ“₯ First model load downloads weights (large MLX ~1 GB; cached after). Subsequent loads are sub-second.
πŸ”€ Letter-acronyms can mishear ("STT" β†’ "ETS"). Use mlx-whisper-large-v3 or remote-speaches for absolute accuracy.
πŸ”“ Anthropic's voice gating still applies. This redirects where audio goes, not whether the binary will send it.
πŸ”„ Restart claude after stt-switch wire β€” the binary reads the env var once at startup.

🙏 Credits

This project rests on a stack of excellent open-source work:


👤 About the author

Built by Dr. Junaid Farooq, M.D., a board-certified physician (Internal Medicine, Infectious Diseases) who codes, building at the intersection of medicine, AI infrastructure, and security.

He runs hospital medicine by day and ships open-source medical/AI tooling by night through his companies:

  • 🏥 OpenMedica (IntelMedica): "reclaiming the future of medicine" with open-source, physician-governed medical AI tools (747+ skills, FDA data, clinical guidelines, drug interactions, the Open Medical Skills marketplace)
  • 🔐 SECSOLS: MCP-native infrastructure for AI-agent payments and identity (AgenticPay, Ephemeral Cards, Agent Identity Protocol)
  • 🩺 RN Scribe: nursing documentation software for clinical environments

stt-switch was born out of the same impulse that drives the rest of his work: infrastructure for autonomous AI agents should be private-first, on-device when possible, and built for the people who actually use it. A physician dictating to a coding agent shouldn't have their voice routed through two third parties to get a transcript back.

Find him at:

Website X LinkedIn


🀝 Contributing

PRs welcome. Areas where help would land especially well:

  • 🐧 Linux systemd-user variant of wire-cc-voice.sh / unwire-cc-voice.sh
  • πŸͺŸ Windows support (proxy already runs; needs equivalent of launchd registration)
  • 🟒 NVIDIA CUDA path for whisper-faster-large-v3 and Moonshine
  • πŸŽ›οΈ Live partial transcripts (interim TranscriptText events during speech)
  • πŸ“Š PyPI publish + GitHub Action that runs stt-bench.py on every PR

MIT License · made on Apple Silicon · audio stays on device
