Local, open-weight, real-time speech-to-speech AI assistant. Fully offline conversational AI using state-of-the-art open models.
Earshot is a complete voice assistant stack:
┌─────────────────────────────────────────────────────────────────────┐
│                     RUST CLIENT (rust-earshot)                      │
│  Microphone → VAD → STT Client → LLM Client → TTS Client → Speaker  │
│                        + GUI + Notifications                        │
└─────────────────────────────────────────────────────────────────────┘
          │                     │                     │
          ▼                     ▼                     ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│    STT Server     │ │    TTS Server     │ │    LLM Server     │
│ (Parakeet/Whisper)│ │ (Soprano/Chatter) │ │ (LM Studio/Ollama)│
│   Python/NeMo     │ │  Python/PyTorch   │ │    Your choice    │
│     Port 8765     │ │     Port 8766     │ │     Port 1234     │
└───────────────────┘ └───────────────────┘ └───────────────────┘
Components:
- Rust Client (rust-earshot/): High-performance audio I/O, VAD, and pipeline coordination with GUI
- Python STT Server: Parakeet (NVIDIA NeMo) or faster-whisper for speech-to-text
- Python TTS Server: Soprano (streaming, 64ms latency) or Chatterbox for text-to-speech
- LLM: Any OpenAI-compatible API (LM Studio, Ollama, vLLM, etc.)
- Real-time VAD: Silero VAD with configurable thresholds
- Streaming TTS: ~64ms time-to-first-audio with Soprano
- Interrupt handling: Speak to interrupt AI mid-response
- GUI: Desktop application with audio level meters
- Notifications: HTTP endpoint for push notifications read aloud (see the example after this list)
- Cross-platform: Linux and Windows support
- Fully local: No cloud dependencies, runs on consumer hardware
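Notifications can be pushed from any local process. As a rough sketch, assuming the endpoint accepts a JSON POST (the port 8767, the /notify path, and the payload shape are illustrative guesses, not the documented API; see the Rust client docs for the real endpoint):

import json
import urllib.request

# Hypothetical notification push: port, path, and payload are assumptions.
payload = json.dumps({"text": "Build finished successfully"}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8767/notify",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)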
# 1. Clone and setup Python environment
git clone <repo-url> earshot
cd earshot
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# 2. Download models
python scripts/install_models.py
# 3. Start servers (in separate terminals)
bash scripts/run_whisper_server.sh # STT on port 8765
bash scripts/run_tts_server.sh # TTS on port 8766
# 4. Start an LLM server (e.g., LM Studio on port 1234)
# 5. Run the Rust client
cd rust-earshot
./run.sh
See rust-earshot/README.md for detailed Rust client documentation.
- Real-time VAD: Silero VAD for accurate speech detection with hysteresis
- Async STT: faster-whisper runs on a worker thread (never blocks audio)
- Multilingual: Supports 99 languages with auto-detection
- Smart segmentation: Pre/post padding, hangover, min duration filtering
- JSONL logging: Structured logs with UTC timestamps + word-level timing
- Network streaming: Stream audio from remote devices (laptop/phone to server)
- Soprano backend: Streaming output with ~64ms latency per chunk
- Chatterbox backend: High-quality batch synthesis
- Emoji filtering: Automatic removal of unspeakable characters
- WebSocket API: Binary audio streaming (sketched below)
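A minimal sketch of driving the TTS server from a script, assuming it accepts a plain text message and streams back binary audio frames (the framing and end-of-stream behavior are assumptions, not the documented protocol):

import asyncio
import websockets

async def speak(text: str, out_path: str = "out.pcm") -> None:
    # Assumed protocol: send text, collect binary frames until the stream
    # goes quiet. Check the server source for the real message format.
    async with websockets.connect("ws://localhost:8766") as ws:
        await ws.send(text)
        with open(out_path, "wb") as f:
            while True:
                try:
                    chunk = await asyncio.wait_for(ws.recv(), timeout=5.0)
                except asyncio.TimeoutError:
                    break  # assume the stream has ended
                if isinstance(chunk, bytes):
                    f.write(chunk)  # binary frame = raw audio

asyncio.run(speak("Hello from Earshot."))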
For local microphone mode:
- macOS (Apple Silicon or Intel)
- Python 3.9+ (python3 --version)
- Homebrew (optional, for system dependencies)
For network streaming mode (server without mic):
- Linux or macOS
- Python 3.9+
- No microphone required (receives audio over network)
# 1. Clone the repository
git clone <your-repo-url> vad-server
cd vad-server
# 2. Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# 3. Install Python dependencies (~2 min)
pip install -r requirements.txt
# 4. Download VAD model
python scripts/install_models.py
# 5. Download STT model (choose one)
# For multilingual (medium): ~1.5GB, best accuracy
python -c "from faster_whisper import WhisperModel; WhisperModel('medium', device='cpu', compute_type='int8')"
# For English only (base.en): ~145MB, faster
python -c "from faster_whisper import WhisperModel; WhisperModel('base.en', device='cpu', compute_type='int8')"
# 6. Configure (optional)
cp .env.example .env
# Edit config.yaml to change model, thresholds, etc.
# 7. Grant microphone access
# Run once to trigger permission prompt:
python scripts/test_mic.py
# 8. Test the pipeline
bash scripts/dev_run.sh
# Speak into your mic - transcriptions appear in terminal
# 9. In another terminal, watch logs
tail -f ~/stt_logs/asr.jsonl
# Test microphone capture (visual level meter)
python scripts/test_mic.py
# Test VAD detection (shows speech probability)
python scripts/test_vad.py
# Test STT directly (record then transcribe)
python scripts/test_stt.py
Run the STT pipeline on a server without a microphone by streaming audio from a remote client.
# 1. Use the server config (network audio source)
cp config.server.yaml config.yaml
# OR manually edit config.yaml:
# Set audio.source: "network"
# 2. Start the server
source .venv/bin/activate
python -m src.main
# Server will listen on ws://0.0.0.0:8765
On your laptop, phone, or any device with a microphone:
# 1. Install dependencies (only need sounddevice and websockets)
pip install sounddevice websockets numpy
# 2. Stream audio to server
python scripts/stream_client.py --server ws://SERVER_IP:8765
# Example: Stream to server at 192.168.1.100
python scripts/stream_client.py --server ws://192.168.1.100:8765
The client will:
- Stream your microphone audio to the server
- Receive and display transcriptions in real-time
- Show transcription latency for each segment (a condensed sketch of the protocol follows)
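For reference, here is a condensed sketch of such a client (scripts/stream_client.py is the real implementation). It assumes the server accepts raw 16 kHz mono int16 PCM as binary WebSocket messages and replies with JSON transcription events; the frame size and message shape are assumptions:

import asyncio
import json
import sounddevice as sd
import websockets

SAMPLE_RATE = 16000
FRAME_SAMPLES = 512  # assumed frame size

async def stream(server: str) -> None:
    loop = asyncio.get_running_loop()
    frames = asyncio.Queue()

    def on_audio(indata, _frames, _time, _status):
        # sounddevice invokes this on its own thread; hand off to asyncio
        loop.call_soon_threadsafe(frames.put_nowait, bytes(indata))

    async with websockets.connect(server) as ws:
        async def sender():
            while True:
                await ws.send(await frames.get())

        async def receiver():
            async for msg in ws:
                event = json.loads(msg)
                print(event.get("text", event))

        with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1,
                               dtype="int16", blocksize=FRAME_SAMPLES,
                               callback=on_audio):
            await asyncio.gather(sender(), receiver())

asyncio.run(stream("ws://192.168.1.100:8765"))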
Server (config.yaml or config.server.yaml):
audio:
  source: "network"      # Use network stream instead of microphone
  network:
    host: "0.0.0.0"      # Listen on all interfaces
    port: 8765           # WebSocket port
    auth_token: null     # Optional: set to "your-secret-token" for auth
Client:
# With authentication
python scripts/stream_client.py \
  --server ws://192.168.1.100:8765 \
  --auth-token your-secret-token
Edit config.yaml to adjust:
- Audio source (mic vs network)
- VAD thresholds and timing
- STT model size (tiny.en, base.en, small.en)
- Log rotation settings
- Heartbeat interval
- Network server settings (host, port, auth)
See TESTING.md for comprehensive testing procedures.
Run automated tests:
pytest tests/
Run at login automatically:
# 1. Update paths in plist if needed
# Edit launchd/com.local.stt.daemon.plist to match your install location
# 2. Copy plist to LaunchAgents
cp launchd/com.local.stt.daemon.plist ~/Library/LaunchAgents/
# 3. Load and start
launchctl load ~/Library/LaunchAgents/com.local.stt.daemon.plist
launchctl start com.local.stt.daemon
# 4. Check status
launchctl list | grep stt.daemon
# 5. View logs
tail -f /tmp/stt-daemon.err
tail -f ~/stt_logs/asr.jsonl
# To stop/unload
launchctl stop com.local.stt.daemon
launchctl unload ~/Library/LaunchAgents/com.local.stt.daemon.plist
Log format: each line of ~/stt_logs/asr.jsonl is a JSON object:
{
"type": "asr_segment",
"start_utc": "2025-10-06T07:59:12.184Z",
"end_utc": "2025-10-06T07:59:13.902Z",
"latency_s": 0.17,
"text": "okay let's kick off",
"tokens": [
{"w": "okay", "start_ms": 1696582752184, "end_ms": 1696582752500}
]
}
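Because the log is line-delimited JSON, post-processing is simple. For example, averaging transcription latency across all segments (field names match the record above):

import json
from pathlib import Path

# Tally latency over every asr_segment record in the default log location.
latencies = []
with open(Path.home() / "stt_logs" / "asr.jsonl") as f:
    for line in f:
        event = json.loads(line)
        if event.get("type") == "asr_segment":
            latencies.append(event["latency_s"])

if latencies:
    print(f"{len(latencies)} segments, "
          f"avg latency {sum(latencies) / len(latencies):.2f}s")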
Tuning the VAD:
- More sensitive: Lower threshold to 0.35-0.40 (catches quieter speech)
- Less sensitive: Raise threshold to 0.50-0.55 (fewer false positives)
- Longer utterances: Increase hang_ms to 800-1000 (holds through pauses)
- Shorter segments: Decrease hang_ms to 400-500 (cuts faster)
- Smoother detection: Lower ema_alpha to 0.20-0.25 (less reactive to spikes)
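To make the interplay of these knobs concrete, here is an illustrative sketch of EMA smoothing plus hangover segmentation. The parameter names mirror the config keys, but the default values and exact logic are assumptions, not the server's actual code:

def segment(probs, frame_ms=32, threshold=0.45, hang_ms=600, ema_alpha=0.30):
    """Yield (start_frame, end_frame) segments from per-frame VAD probabilities."""
    ema, speaking, start, silence_ms = 0.0, False, 0, 0
    for i, p in enumerate(probs):
        ema = ema_alpha * p + (1 - ema_alpha) * ema  # lower alpha = smoother
        if not speaking and ema >= threshold:
            speaking, start = True, i                # speech onset
        elif speaking:
            if ema < threshold:
                silence_ms += frame_ms
                if silence_ms >= hang_ms:            # hangover elapsed
                    yield (start, i)
                    speaking, silence_ms = False, 0
            else:
                silence_ms = 0                       # speech resumed within hangover
    if speaking:
        yield (start, len(probs))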
- tiny.en - Fastest, lowest accuracy (~75MB)
- base.en - Good balance (~145MB)
- small.en - Better accuracy (~466MB)
- medium - Best multilingual (~1.5GB) ← Current default
- distil-medium.en - 2x faster than medium, similar accuracy
Change model_size in config.yaml and restart.
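For reference, the same models can be driven directly with faster-whisper; the word timestamps correspond to the tokens field in the JSONL log (the audio file name is just an example):

from faster_whisper import WhisperModel

model = WhisperModel("base.en", device="cpu", compute_type="int8")
segments, info = model.transcribe("sample.wav", word_timestamps=True)
for seg in segments:
    print(f"[{seg.start:.2f}-{seg.end:.2f}] {seg.text}")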
Server not receiving audio:
- Check firewall allows port 8765 (or your configured port)
- Verify server IP address: ip addr (Linux) or ifconfig (macOS)
- Ensure both devices are on the same network (or port forwarding configured)
- Check server logs for connection messages
Client connection refused:
- Verify server is running and listening on correct port
- Test with: nc -zv SERVER_IP 8765 or telnet SERVER_IP 8765
- Check firewall on server allows incoming connections
Authentication failed:
- Ensure auth_token matches in both server config and client command
- Token is case-sensitive
No speech detected:
- Check microphone permissions: System Settings → Privacy & Security → Microphone
- Run python scripts/test_mic.py to verify audio capture
- Try lowering vad.threshold to 0.35
Speech cut off or segments dropped:
- Increase hang_ms to 800-1000 in config.yaml
- Check min_speech_ms isn't too high (try 500)
Slow transcription:
- Use a smaller model: tiny.en or base.en (see the timing sketch below)
- Check the model is quantized: compute_type: "int8"
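A quick way to compare model speeds on your own hardware (sample.wav stands in for any short test recording):

import time
from faster_whisper import WhisperModel

for name in ("tiny.en", "base.en"):
    model = WhisperModel(name, device="cpu", compute_type="int8")
    t0 = time.perf_counter()
    segments, _ = model.transcribe("sample.wav")
    list(segments)  # transcribe returns a generator; force the full decode
    print(f"{name}: {time.perf_counter() - t0:.2f}s")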
Model download issues:
- Manually download: python -c "from faster_whisper import WhisperModel; WhisperModel('medium')"
- Models cache to ~/.cache/huggingface/
Daemon won't start:
- Check /tmp/stt-daemon.err for errors
- Verify paths in launchd/com.local.stt.daemon.plist
- Ensure the venv exists: ls .venv/bin/python
Requirements:
- macOS (Apple Silicon or Intel)
- Python 3.9+
- ~2GB disk space (for models)
- Microphone access