Telepathy is silent dictation, AKA mouthreading: speech-to-text without audio, from lip movement alone. Allegedly what Q.ai, the AI company Apple bought for $2B, is solving.
A Tauri desktop app for macOS that works as a hold-to-record hotkey overlay: while the hotkey is held, webcam video is captured; on release, the clip is sent to a local Python server for visual speech (lipreading) inference and pasted as text.
- Install JS dependencies:

  ```sh
  npm install
  ```
- Create the single Python environment (uv-managed):

  ```sh
  ./scripts/setup_python_env.sh
  ```

  On macOS 26+, this script upgrades torch/torchvision/torchaudio from the PyTorch nightly index and then prints MPS availability, to handle newer macOS runtime compatibility before stable wheels catch up.
- Download/prepare AutoAVSR assets (repo + checkpoints):

  ```sh
  ./scripts/setup_autoavsr_assets.sh
  ```
- Run the Python server:

  ```sh
  .venv/bin/uvicorn server.main:app --reload
  ```
- Run the app bundle in debug mode (recommended for macOS permissions onboarding):

  ```sh
  npm run tauri:dev-app
  ```

  This builds `src-tauri/target/debug/bundle/macos/Telepathy.app` and opens it. If you want pure hot-reload webview iteration, you can still run:

  ```sh
  npm run tauri dev
  ```
On startup, Telepathy opens a small onboarding UI to verify macOS permissions:
- Camera (required to capture video)
- Accessibility (required to auto-paste text via Cmd+V)
After onboarding, use the global hotkey `CommandOrControl+Shift+Space`: press and hold to record, release to transcribe and paste.
A small overlay pill appears while recording/processing and auto-hides after completion.
The capture pipeline targets 25 FPS and the backend normalizes incoming clips to the model FPS.
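For intuition, one minimal way to normalize clip FPS is nearest-index frame resampling (duplicating or dropping frames). This is a hedged sketch; the backend's actual resampling strategy may differ:

```python
import numpy as np

def resample_frames(frames: np.ndarray, src_fps: float, dst_fps: float) -> np.ndarray:
    """Resample a (T, H, W[, C]) clip from src_fps to dst_fps by
    nearest-frame index mapping (duplicates or drops frames)."""
    if src_fps == dst_fps or len(frames) == 0:
        return frames
    duration = len(frames) / src_fps           # clip length in seconds
    n_out = max(1, round(duration * dst_fps))  # frame count at the target rate
    idx = np.minimum(
        np.floor(np.arange(n_out) * src_fps / dst_fps).astype(int),
        len(frames) - 1,
    )
    return frames[idx]
```

For example, a 50-frame clip captured at 50 FPS maps to 25 frames at the model's 25 FPS by keeping every second frame.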
The backend now supports an AutoAVSR transcriber adapter in `server/model.py`. It expects a local checkout of:

https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages

`server/model.py` auto-detects a checkout at `third_party/Visual_Speech_Recognition_for_Multiple_Languages`.
No extra environment variables are required for the default local setup.
Optional overrides:

```sh
export TELEPATHY_MODEL_BACKEND=autoavsr
export TELEPATHY_AUTOAVSR_REPO=/absolute/path/to/Visual_Speech_Recognition_for_Multiple_Languages
# Optional (defaults shown):
# export TELEPATHY_AUTOAVSR_CONFIG=configs/LRS3_V_WER19.1.ini
# export TELEPATHY_AUTOAVSR_DETECTOR=mediapipe
# export TELEPATHY_AUTOAVSR_DEVICE=mps   # default: mps; options: auto|mps|cpu|cuda:0
# export TELEPATHY_AUTOAVSR_GPU_IDX=-1
```

The server now loads AutoAVSR weights during startup and keeps the model warm in-process for subsequent requests.
The app sends an encoded `videoDataUrl` clip to `/transcribe`, and polls `/health` to wait until the model is ready.
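A minimal client sketch of that flow, assuming the default uvicorn address and response shapes like `{"ready": true}` from `/health` and `{"text": ...}` from `/transcribe` (the address and both JSON shapes are assumptions, not the app's exact contract):

```python
import base64
import json
import time
import urllib.request

SERVER = "http://127.0.0.1:8000"  # default uvicorn bind address (assumption)

def to_video_data_url(clip: bytes, mime: str = "video/webm") -> str:
    """Encode a recorded clip as the data URL posted to /transcribe."""
    return f"data:{mime};base64,{base64.b64encode(clip).decode('ascii')}"

def wait_until_ready(timeout_s: float = 60.0) -> None:
    """Poll /health until the model reports ready."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{SERVER}/health", timeout=2) as resp:
                if json.load(resp).get("ready"):
                    return
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(0.5)
    raise TimeoutError("model never became ready")

def transcribe(clip: bytes) -> str:
    body = json.dumps({"videoDataUrl": to_video_data_url(clip)}).encode()
    req = urllib.request.Request(
        f"{SERVER}/transcribe", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("text", "")
```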
- LipNet (GRID‑style sentence lipreading).
- AV‑HuBERT / fairseq (supports video‑only or AV fine‑tuning).
- Visual Speech Recognition for Multiple Languages (LRS2/LRS3/GRID models).
- deep_avsr (LRS2/LRS3 video‑only baselines).
- This starter sends short encoded video clips (`videoDataUrl`) to the backend.
- For production, enable a CSP instead of leaving it disabled in `src-tauri/tauri.conf.json`.
- If `navigator.mediaDevices.getUserMedia` is missing, rebuild/restart the Tauri app and allow camera access when prompted. On macOS, verify Camera permission in System Settings -> Privacy & Security -> Camera.
- If paste automation fails, verify Accessibility permission in System Settings -> Privacy & Security -> Accessibility.