Voice-to-text desktop app for Windows.
Press a shortcut, talk, press it again. The text appears wherever your cursor is.
Ditto lives in the system tray and shows a floating, draggable pill. Hit the global shortcut, speak, hit it again — Ditto transcribes locally with whisper.cpp and pastes the text into whatever app has focus.
- Local-only. Audio never leaves your machine.
- GPU-accelerated. Bundled CUDA build for NVIDIA GPUs (CPU fallback works too).
- Out of the way. Frameless pill with click-through, always on top, tray icon for control.
- Configurable. Custom shortcut, audio device, model, theme, and more.
Early personal project. Windows-first. macOS and Linux are not targeted yet, but where possible the code avoids Windows-only APIs that have no cross-platform fallback.
- Windows 10 or 11 (x64)
- Node.js 20+
- For GPU acceleration: NVIDIA GPU with recent drivers (CUDA Toolkit not required — cuDNN/cuBLAS DLLs are bundled with whisper.cpp)
```sh
git clone https://github.com/asantinos/ditto.git
cd ditto
npm install
npm run setup:whisper   # downloads whisper.cpp + base model
npm run dev
```

Default shortcut: Ctrl+Shift+Space (press to record, press again to stop and paste).
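The press-to-toggle behavior boils down to a tiny state machine. Below is an illustrative sketch, not Ditto's actual implementation: `startRecording` and `stopAndPaste` are hypothetical callback names, and in Electron the returned function would be wired to `globalShortcut.register`.

```typescript
// Sketch of the record/stop toggle driven by a single global shortcut.
// In Electron this would be registered via
// globalShortcut.register("CommandOrControl+Shift+Space", toggle).
// startRecording/stopAndPaste are hypothetical names.
type RecorderState = "idle" | "recording";

function makeToggle(
  startRecording: () => void,
  stopAndPaste: () => void
): () => RecorderState {
  let state: RecorderState = "idle";
  return () => {
    if (state === "idle") {
      startRecording(); // first press: begin capturing audio
      state = "recording";
    } else {
      stopAndPaste(); // second press: stop, transcribe, paste
      state = "idle";
    }
    return state;
  };
}
```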
Open settings from the tray icon (right-click → "Ajustes…" [Spanish for "Settings…"], or double-click).
- Electron 39 + React 19 + TypeScript (strict mode)
- electron-vite for bundling, ESM throughout
- whisper.cpp as an external binary, spawned from main (no native bindings)
- @nut-tree-fork/nut-js for simulating Ctrl+V into the active window
- electron-store for settings persistence
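The "typed IPC" approach in the stack above can be sketched as a shared contract module that both main and renderer import. This is an illustrative sketch under assumed names; the channel strings and payload shapes are guesses, not Ditto's real contract.

```typescript
// Sketch of a shared IPC contract: one source of truth for channel names
// and payload types, imported by both the main and renderer builds.
// All names here are hypothetical.
export const IpcChannels = {
  transcribeAudio: "audio:transcribe",
  transcriptReady: "transcript:ready",
} as const;

export interface TranscribeRequest {
  sampleRate: 16000;   // renderer resamples before sending
  pcm: ArrayBuffer;    // 16 kHz mono PCM samples
}

export interface TranscriptResult {
  text: string;        // whisper.cpp output, already trimmed
  durationMs: number;  // wall-clock transcription time
}

// The preload script would expose a narrow API built on these types, e.g.
// ipcRenderer.invoke(IpcChannels.transcribeAudio, request).
```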
| Command | What it does |
|---|---|
| `npm run dev` | Start in development mode |
| `npm run typecheck` | TypeScript on both main and renderer configs |
| `npm run lint` | ESLint with cache |
| `npm run test:transcribe` | One-shot transcription test on `jfk.wav` |
| `npm run build:unpack` | Build to `dist/win-unpacked/` (no installer) |
| `npm run build:win` | Build the full NSIS installer to `dist/` |
```
src/
  main/       Electron main process (ESM): windows, tray, IPC, whisper, lifecycle
  preload/    contextBridge API exposed to renderers (compiled to .mjs)
  renderer/   Two React apps: pill (index.html) and settings (settings.html)
  shared/     IPC contracts and types shared between main and renderer
resources/    Tray icons + (after setup) whisper.cpp binary and models
scripts/      Setup and test scripts
```
The renderer captures audio with getUserMedia + MediaRecorder, decodes and resamples it to 16 kHz mono PCM with the Web Audio API, and sends it to main over a typed IPC channel as an ArrayBuffer. Main writes a temporary WAV, spawns whisper-cli.exe, parses stdout, copies the result to the clipboard, and simulates Ctrl+V with nut-js.
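As a sketch of the two format conversions in that pipeline: Ditto resamples with the Web Audio API and main writes the WAV, but pure-function equivalents look roughly like this. Both function names are illustrative, and the resampler uses simple linear interpolation rather than whatever the Web Audio API does internally.

```typescript
// Illustrative sketches of the pipeline's format conversions (hypothetical names).

// Downsample arbitrary-rate mono audio to 16 kHz via linear interpolation.
function resampleTo16k(input: Float32Array, inputRate: number): Float32Array {
  const ratio = inputRate / 16000;
  const out = new Float32Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac; // linear interpolation
  }
  return out;
}

// Wrap float samples in a minimal 16-bit mono WAV container (44-byte RIFF header).
function toWav(samples: Float32Array, sampleRate = 16000): Uint8Array {
  const dataSize = samples.length * 2;
  const buf = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buf);
  const writeStr = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(off + i, s.charCodeAt(i));
  };
  writeStr(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);   // RIFF chunk size
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeStr(36, "data");
  view.setUint32(40, dataSize, true);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    view.setInt16(44 + i * 2, (s * 32767) | 0, true);
  }
  return new Uint8Array(buf);
}
```

The resulting buffer is what main would write to a temporary file before spawning `whisper-cli.exe` on it.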
MIT © Alex Santos
- whisper.cpp by Georgi Gerganov — the transcription engine
- OpenAI Whisper — the underlying model
- @nut-tree-fork/nut-js — keyboard simulation