# VoxInput

Privacy-first, offline voice dictation for Linux
Dictate text into any application using your voice. 100% offline. 100% private.
Quick Start • Features • Tech Stack • Architecture • Settings • Contributing
## Features

| Feature | Description |
|---|---|
| Privacy-First | All processing happens locally. No internet required. No data leaves your machine. |
| Real-Time Streaming | Text appears as you speak: Vosk delivers partial results with sub-200ms latency |
| Universal Injection | Works in any text field: browsers, terminals, editors, chat apps, IDEs |
| Global Hotkey | Toggle with Super+Shift+V from anywhere via pynput |
| Push-to-Talk | Hold a configurable key (default: Right Ctrl), speak, release to inject. Full utterance captured. |
| Dual ASR Engines | Vosk (fast, streaming) or OpenAI Whisper (accurate, GPU-accelerated) |
| Smart NLP Pipeline | Compound corrections + SymSpell + ASR rules + numbers + grammar + homophones |
| Custom Dictionary | SQLite DB of 1,400+ tech/AI/Linux terms, injected into SymSpell as correction targets |
| Compound Corrections | DB-driven multi-word ASR correction: "pie torch" → PyTorch, "engine next" → nginx (35 defaults, user-extensible) |
| Three Noise Engines | WebRTC AEC, RNNoise AI denoiser, or EasyEffects: pick your fighter |
| Voice Punctuation | Say "period", "comma", "new paragraph"; supports cross-batch buffering |
| Number Intelligence | "one hundred twenty three" → 123, "twenty first" → 21st |
| Live OSD | Floating waveform overlay shows dictation state in real time |
| C Extension | Native librms.so: zero-Python-overhead RMS + PCM → float32 conversion |
| Hardware Auto-Tune | Detects CPU/RAM/GPU at startup and auto-selects optimal engine settings |
| Flight Recorder | Enterprise SQLite black-box logger with TRACE level + crash artifacts |
| Tray App + Desktop | GTK3 system tray with full settings dialog, mic test, and desktop icon |
| Golden Test Suite | Record once, test forever: WER accuracy regression testing with 6 test paragraphs |
## Quick Start

```bash
# Ubuntu/Debian - required system packages
sudo apt install python3-venv python3-gi python3-gi-cairo \
    gir1.2-gtk-3.0 gir1.2-appindicator3-0.1 \
    portaudio19-dev xdotool

# Optional (for Wayland-native injection)
sudo apt install ydotool

# Optional (for RNNoise AI denoiser)
sudo apt install libladspa-ocaml-dev
# or install noise-suppression-for-voice from GitHub
```

```bash
git clone https://github.com/BigRigVibeCoder/VoxInput.git
cd VoxInput
bash install.sh
```

The installer:

- Creates a Python virtualenv and installs all dependencies (~50 packages)
- Compiles the C RMS extension (`librms.so`) with `-O3 -march=native`
- Downloads the Vosk English model (~50MB)
- Seeds the protected-words database with 1,400+ tech/AI/developer terms
- Installs a `.desktop` entry and tray icon system-wide
- Configures optional auto-start on login

```bash
python3 run.py   # CLI
# OR click the VoxInput icon in your app launcher / desktop
# OR toggle with Super+Shift+V
```

```bash
# Run the unit test suite
source venv/bin/activate
pytest tests/unit/ -v
```

## Tech Stack

**Speech recognition**

| Technology | Role | Details |
|---|---|---|
| Vosk | Primary ASR engine | Offline Kaldi-based, real-time streaming, ~50MB model |
| OpenAI Whisper | Alternate ASR engine | GPU-accelerated (CUDA float16/int8), auto-punctuation |
| SymSpell | Spell correction | 1M+ words/sec edit-distance lookup, frequency-ranked |
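The correction policy built on SymSpell can be illustrated with a toy, pure-Python stand-in: a frequency-ranked dictionary plus an edit-distance lookup that never rewrites protected words. This is a sketch of the principle only, not the symspellpy API or VoxInput's actual code; `DICTIONARY`, `PROTECTED`, and `correct` are all hypothetical names.

```python
# Toy frequency-ranked edit-distance correction (SymSpell computes the same
# result far faster via precomputed deletes).
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Custom/protected words get a high count (1,000,000 in the README) so
# corrections prefer them over common English.
DICTIONARY = {"the": 23135851162, "there": 3978517, "pytorch": 1_000_000}
PROTECTED = {"pytorch"}  # never rewritten, whatever the lookup suggests

def correct(word: str, max_edit: int = 2) -> str:
    if word in PROTECTED or word in DICTIONARY:
        return word
    # Rank candidates by edit distance first, then by descending frequency.
    dist, _, best = min((edit_distance(word, t), -count, t)
                        for t, count in DICTIONARY.items())
    return best if dist <= max_edit else word

print(correct("pytorh"))   # -> "pytorch" (edit distance 1 to a custom word)
print(correct("zzzzzz"))   # nothing within edit distance 2 -> unchanged
```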
**Audio pipeline**

| Technology | Role | Details |
|---|---|---|
| PyAudio | Audio capture | 16kHz mono, int16 PCM via PortAudio bindings |
| librms.so (C) | RMS + PCM conversion | Custom ctypes extension, zero Python overhead |
| PulseAudio / PipeWire | Device management | pactl for source enumeration, volume, default device |
| WebRTC AEC | Noise suppression | PulseAudio module-echo-cancel with 5 tunable sub-features |
| RNNoise | AI denoiser | LADSPA plugin via module-ladspa-source |
| EasyEffects | Advanced audio DSP | Optional GUI-based effects chain launcher |
**NLP pipeline**

| Technology | Role | Details |
|---|---|---|
| ASR Rules Engine | Artifact correction | gonna → going to, woulda → would have, 20+ substitution rules |
| Number Parser | Spoken → numeric | Handles cardinals, ordinals, scales (one hundred → 100, twenty first → 21st) |
| Homophone Resolver | Context-aware fixes | Regex-based: their/there/they're, to/too/two, its/it's, your/you're, then/than, affect/effect |
| Grammar Engine | Sentence structure | Auto-capitalization, cross-batch state tracking |
| SQLite | Protected words DB | In-memory set[str] for O(1) lookups, WAL mode, 1,400+ seed terms |
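The number parser's spoken-to-numeric conversion can be sketched in a few lines. This is a hypothetical, much-simplified version (cardinals plus scale words only; ordinals like "twenty first" are omitted), not the parser in VoxInput's source.

```python
# Minimal spoken-number parser: accumulate a running group, multiply on
# "hundred", and close out the group on larger scale words.
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1000, "million": 1_000_000}

def words_to_number(phrase: str) -> int:
    total, current = 0, 0
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue          # "two thousand and forty"
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        else:                 # thousand / million close out a group
            total += current * SCALES[word]
            current = 0
    return total + current

print(words_to_number("one hundred twenty three"))  # 123
print(words_to_number("two thousand and forty"))    # 2040
```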
**UI & system integration**

| Technology | Role | Details |
|---|---|---|
| GTK3 (gi) | UI framework | System tray, settings dialog, OSD waveform overlay |
| AppIndicator3 | Tray icon | Idle/active SVG state icons |
| pynput | Global hotkey | Super+Shift+V keyboard listener |
| xdotool / ydotool | Text injection | X11 and Wayland-native keystroke simulation |
| pynput (fallback) | Text injection | Pure-Python X11 fallback when neither tool is available |
| fcntl | Singleton lock | /tmp/voxinput.lock prevents duplicate instances |
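The injection fallback order in the table (ydotool on Wayland, xdotool on X11, pynput as last resort) could be selected roughly like this. A sketch under stated assumptions: `pick_injector` and `inject` are hypothetical names, not the API of `src/injection.py`.

```python
# Choose a text-injection backend based on session type and tool availability.
import os
import shutil
import subprocess

def pick_injector() -> str:
    """Prefer ydotool on Wayland, xdotool on X11, pynput as a last resort."""
    wayland = os.environ.get("XDG_SESSION_TYPE") == "wayland"
    if wayland and shutil.which("ydotool"):
        return "ydotool"
    if shutil.which("xdotool"):
        return "xdotool"
    return "pynput"

def inject(text: str) -> None:
    tool = pick_injector()
    if tool == "ydotool":
        subprocess.run(["ydotool", "type", text], check=True)
    elif tool == "xdotool":
        subprocess.run(["xdotool", "type", "--delay", "12", text], check=True)
    else:
        # Pure-Python fallback; only imported when actually needed.
        from pynput.keyboard import Controller
        Controller().type(text)

print(pick_injector())
```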
**Observability**

| Technology | Role | Details |
|---|---|---|
| SQLite | Flight recorder | logs/voxinput_logging.db: every event, batched writes, auto-trim |
| TRACE level (5) | High-frequency logging | Custom level below DEBUG for audio loop events |
| Crash artifacts | Post-mortem | crash_artifacts table with stack traces + system state snapshots |
| sys.excepthook | Root handler | Catches all unhandled exceptions, writes crash artifact before exit |
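The root-handler pattern above, a `sys.excepthook` that persists a crash artifact before the process dies, can be sketched as follows. The table layout and file name here are illustrative, not the schema in `src/logger.py`.

```python
# Install a sys.excepthook that writes unhandled exceptions to SQLite,
# then chains to the default handler so the traceback still prints.
import sqlite3
import sys
import time
import traceback

DB = "crash_demo.db"  # hypothetical path; VoxInput logs under logs/

def write_crash_artifact(exc_type, exc, tb) -> None:
    con = sqlite3.connect(DB)
    con.execute("""CREATE TABLE IF NOT EXISTS crash_artifacts (
                     ts REAL, exc_type TEXT, message TEXT, stack TEXT)""")
    con.execute("INSERT INTO crash_artifacts VALUES (?, ?, ?, ?)",
                (time.time(), exc_type.__name__, str(exc),
                 "".join(traceback.format_exception(exc_type, exc, tb))))
    con.commit()
    con.close()
    sys.__excepthook__(exc_type, exc, tb)  # keep the normal traceback output

sys.excepthook = write_crash_artifact
```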
**Hardware auto-tune**

| Component | Detection | Impact |
|---|---|---|
| CPU cores | psutil / os.cpu_count() | Vosk chunk size: 100ms (8+ cores), 150ms (4+), 200ms (2) |
| RAM | psutil / /proc/meminfo | Memory-aware model selection |
| GPU (CUDA) | torch.cuda / nvidia-smi | Whisper backend: cuda/float16 (≥4GB), cuda/int8 (≥2GB), cpu/int8 (fallback) |
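The chunk-size rule from the CPU row can be written as a tiny selection function. `vosk_chunk_ms` is a hypothetical name for illustration; the real logic lives in `src/hardware_profile.py`.

```python
# Map detected core count to the Vosk chunk sizes listed in the table.
import os

def vosk_chunk_ms(cores=None):
    """100ms for 8+ cores, 150ms for 4+, 200ms otherwise."""
    cores = cores or os.cpu_count() or 1
    if cores >= 8:
        return 100
    if cores >= 4:
        return 150
    return 200

print(vosk_chunk_ms(16))  # 100
print(vosk_chunk_ms(2))   # 200
```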
## Architecture

```
VoxInput/
├── run.py                    # Entry point, singleton lock, stale-process cleanup
├── src/
│   ├── main.py               # VoxInputApp orchestrator (4 threads)
│   ├── audio.py              # PyAudio capture (16kHz mono int16)
│   ├── recognizer.py         # Vosk / Whisper engine abstraction
│   ├── spell_corrector.py    # SymSpell + ASR rules + numbers + grammar
│   ├── homophones.py         # Context-aware homophone resolver (regex-based)
│   ├── word_db.py            # SQLite protected-words DB (in-memory set)
│   ├── injection.py          # xdotool / ydotool / pynput text injection
│   ├── mic_enhancer.py       # WebRTC / RNNoise / EasyEffects / auto-calibrate
│   ├── pulseaudio_helper.py  # PulseAudio/PipeWire source enumeration
│   ├── hardware_profile.py   # CPU / RAM / GPU auto-detection (singleton)
│   ├── ui.py                 # GTK3 tray + settings dialog + OSD overlay
│   ├── config.py             # App constants (paths, hotkey, sample rate)
│   ├── settings.py           # JSON settings manager
│   ├── logger.py             # Enterprise SQLite flight recorder
│   └── c_ext/
│       ├── rms.c             # Fast RMS + PCM→float32 (gcc -O3)
│       ├── librms.so         # Compiled shared library
│       └── __init__.py       # ctypes bindings
├── data/
│   ├── seed_words.py         # 1,400+ initial protected-word seed dataset
│   └── custom_words.db       # SQLite protected words (auto-created, gitignored)
├── assets/
│   ├── icon_idle.svg         # Tray icon: idle state
│   └── icon_active.svg       # Tray icon: listening state
├── bin/
│   ├── toggle.sh             # SIGUSR1 toggle script for hotkey binding
│   ├── gate_check.sh         # Pre-commit quality gate
│   └── ...                   # Benchmarking and packaging tools
├── tests/                    # Unit, integration, and E2E test suite
├── CODEX/                    # Project documentation (MANIFEST, GOV, BLU, SPR)
└── logs/                     # SQLite flight recorder database
```
```
Microphone
  ↓ PyAudio (16kHz, mono, int16)
  ↓ librms.so - C native RMS level measurement
MicEnhancer
  ├── WebRTC AEC (noise gate + AGC + VAD + high-pass)
  ├── RNNoise AI denoiser (LADSPA plugin)
  └── EasyEffects (external DSP chain)
  ↓
SpeechRecognizer
  ├── Vosk: real-time streaming, partial results every 100-200ms
  └── Whisper: batch mode, GPU-accelerated (CUDA float16/int8)
  ↓ raw words (PTT mode: buffered until key release)
VoicePunctuationBuffer
  ↓ cross-batch command assembly ("new" + "line" → "\n")
SpellCorrector
  ├── 0. Compound corrections (DB: "pie torch" → PyTorch, "engine next" → nginx)
  ├── 1. ASR artifact rules (gonna → going to, woulda → would have…)
  ├── 2. Number parser ("one hundred twenty three" → 123)
  ├── 3. WordDatabase check (O(1) set lookup: never correct protected words)
  ├── 4. SymSpell lookup (edit distance ≤ 2, custom words injected at 1M frequency)
  └── 5. Grammar engine (auto-capitalize, sentence tracking)
  ↓ corrected text
HomophoneResolver
  ↓ context-aware fixes (their/there/they're, to/too/two…)
TextInjector (ydotool → xdotool → pynput)
  ↓
Active window receives text ✓
```
| Thread | Responsibility |
|---|---|
| GTK Main | UI rendering, tray icon, settings dialog, OSD |
| Audio Capture | PyAudio callback → queue (real-time priority) |
| Process Loop | Recognizer → SpellCorrector → homophone → injection queue |
| Injection Loop | Drains queue → xdotool/ydotool keystroke simulation |
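The hand-off between the capture, process, and injection threads can be sketched with the standard library's thread-safe queues. A minimal illustration, not the code in `src/main.py`: `process_loop` and both queue names are hypothetical.

```python
# Queue-based producer/consumer hand-off between worker threads.
import queue
import threading

audio_q: "queue.Queue[bytes]" = queue.Queue()   # capture thread -> process loop
inject_q: "queue.Queue[str]" = queue.Queue()    # process loop -> injection loop

def process_loop(recognize, stop: threading.Event) -> None:
    """Drain audio chunks, run ASR, push corrected text for injection."""
    while not stop.is_set():
        try:
            chunk = audio_q.get(timeout=0.1)
        except queue.Empty:
            continue
        text = recognize(chunk)
        if text:
            inject_q.put(text)

stop = threading.Event()
# Stand-in recognizer: just decode the bytes.
t = threading.Thread(target=process_loop,
                     args=(lambda chunk: chunk.decode(), stop), daemon=True)
t.start()
audio_q.put(b"hello world")
print(inject_q.get(timeout=2))  # hello world
stop.set()
t.join()
```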
## Settings

Open with Super+Shift+V → tray icon → Settings, or right-click the tray icon.
**Mode**

| Setting | Description |
|---|---|
| Mode | Always On (toggle with hotkey) or Push-to-Talk (hold key to speak) |
| PTT Key | Configurable keybind: click "Record Key" and press any key/combo |
| PTT Behavior | Hold to record → release to process full utterance → inject text |
**Audio**

| Setting | Description |
|---|---|
| Input Device | PulseAudio/PipeWire source selection |
| Mic Test | Record + playback to verify input quality |
| Noise Suppression | WebRTC AEC with 5 sub-toggles (noise gate, HF filter, VAD, analog/digital gain) |
| RNNoise | AI-powered noise suppression via LADSPA plugin |
| EasyEffects | Launch external DSP effects chain |
| Gain | Input amplification (0.5-4.0×) |
| Auto-Calibrate | Sample ambient noise floor → set threshold + volume automatically |
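Auto-calibration boils down to measuring the ambient RMS level and placing the silence gate a margin above it. A pure-Python sketch of the idea (the real routine is in `src/mic_enhancer.py`, and the RMS itself is computed by `librms.so` in C); the margin value and helper names are hypothetical.

```python
# Compute RMS of int16 PCM and derive a silence threshold from ambient noise.
import math
import struct

def rms(pcm16: bytes) -> float:
    """Root-mean-square level of little-endian int16 PCM samples."""
    n = len(pcm16) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm16)
    return math.sqrt(sum(s * s for s in samples) / n)

def calibrate_threshold(ambient_chunks, margin=1.5):
    """Average the ambient noise floor and place the gate above it."""
    levels = [rms(c) for c in ambient_chunks]
    return (sum(levels) / len(levels)) * margin

quiet = struct.pack("<4h", 10, -12, 8, -9)   # a fake "silent room" chunk
print(calibrate_threshold([quiet, quiet]))
```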
**Engine**

| Setting | Description |
|---|---|
| Engine | Vosk (real-time streaming) or Whisper (accurate, GPU) |
| Vosk Model | Dropdown model selector with validation |
| Whisper Model | tiny / base / small / medium / large |
| Silence Threshold | Seconds of silence before finalizing a phrase |
| Speed Mode | `fast` skips spell correction for lowest latency |
**Processing**

| Setting | Description |
|---|---|
| Spell Correction | Enable/disable SymSpell post-processing |
| Voice Punctuation | Say "period", "comma", "new line" to insert punctuation |
| Homophone Correction | Context-aware their/there/they're, to/too/two fixes |
| Number Parsing | Convert spoken numbers to digits (one hundred → 100) |
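A regex-based, context-aware homophone rule like the ones above can be illustrated with a single pattern. One toy rule for demonstration, not the resolver in `src/homophones.py`.

```python
# "their is/are/was/were ..." is almost always "there is/are/was/were ...":
# a lookahead confirms the following word before substituting.
import re

THEIR_THERE = re.compile(r"\btheir\b(?=\s+(?:is|are|was|were)\b)",
                         re.IGNORECASE)

def fix_homophones(text: str) -> str:
    return THEIR_THERE.sub("there", text)

print(fix_homophones("their are two options"))  # there are two options
print(fix_homophones("their dog barked"))       # unchanged
```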
**Words**

Browse, search, add, and remove entries in the Protected Words database.

- Words in this list are never spell-corrected; they pass through exactly as spoken
- Custom words are injected into SymSpell as high-frequency correction targets (1M)
- Ships with 1,400+ seed words: tech abbreviations, AI/ML terms, Linux distros & tools, developer frameworks, brands, US places & names, Agile/Scrum vocabulary, futurist/emerging tech
- Search the list by word or category
- Add a word: choose a category, then press Enter or click Add
- Remove: select a row, then click Remove
- Changes take effect immediately (no restart needed)
- The database is stored in `data/custom_words.db` (SQLite, WAL mode, loaded into memory for O(1) lookups)
The spell corrector uses a multi-pass approach:

- Compound Corrections: DB-driven multi-word ASR correction (35 defaults, user-extensible)
- ASR Rules: substitution table for common speech-to-text artifacts
- Number Parser: converts spoken numbers to digits with ordinal support
- WordDatabase: SQLite-backed exclusion list loaded into a `set[str]` at startup
- SymSpell: ultra-fast edit-distance dictionary lookup (custom words injected at 1M frequency)
- Grammar: auto-capitalization and sentence state tracking

Words in the database are never corrected, regardless of what SymSpell suggests. Custom words are also injected into SymSpell so misspellings correct toward your dictionary terms.
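The pass ordering above amounts to composing text-to-text stages in sequence. A minimal sketch with two toy stages (compound corrections, then ASR rules); the protected-word check, SymSpell, and grammar passes are omitted, and every name here is hypothetical rather than VoxInput's API.

```python
# Each correction pass is a function str -> str, applied in a fixed order.
COMPOUND = {"pie torch": "PyTorch", "engine next": "nginx"}
ASR_RULES = {"gonna": "going to"}

def compound_pass(text: str) -> str:
    """Multi-word phrase fixes run first, before word-level passes."""
    for phrase, fix in COMPOUND.items():
        text = text.replace(phrase, fix)
    return text

def asr_rules_pass(text: str) -> str:
    """Word-level substitution table for common ASR artifacts."""
    return " ".join(ASR_RULES.get(w, w) for w in text.split())

PIPELINE = [compound_pass, asr_rules_pass]

def correct(text: str) -> str:
    for stage in PIPELINE:
        text = stage(text)
    return text

print(correct("gonna deploy pie torch"))  # going to deploy PyTorch
```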
Vosk often splits unknown tech terms into phonetically similar English words:
| Vosk Hears | Corrected To |
|---|---|
| cooper eighties | kubernetes |
| pie torch | PyTorch |
| engine next | nginx |
| tail scale | Tailscale |
| rough fauna | Grafana |
| pincer flow | TensorFlow |
| and symbol | Ansible |
| a p i | API |

These are stored in the `compound_corrections` table in `custom_words.db`.
Add your own via terminal:
```bash
python3 -c "
from src.word_db import WordDatabase
db = WordDatabase('data/custom_words.db')
db.add_compound_correction('my misheard phrase', 'CorrectWord')
"
```

**Seed word categories**

| Category | Examples |
|---|---|
| tech | api, cuda, ebpf, rag, llm, grpc, wasm |
| ai | pytorch, huggingface, ollama, vllm, qlora, dspy, langgraph |
| linux | systemd, ebpf, btrfs, hyprland, flatpak, pipewire, nftables |
| dev | pydantic, fastapi, tokio, duckdb, qdrant, prisma, drizzle |
| cloud | terraform, argocd, eks, gke, cloudflare, fly, hetzner |
| agile | scrum, kanban, retrospective, tdd, bdd, cqrs, asyncio |
| org | nasa, darpa, ietf, cncf, deepmind, openai |
| future | mamba, rwkv, lerobot, qiskit, neuromorphic, crewai |
| name | 200+ common American first names |
| place | All 50 US states + major cities |
| sports | All NFL, NBA, MLB teams + sports terms |
Via UI: Settings → Words tab → type word → select category → Add

Via terminal (bulk):

```bash
cd VoxInput && source venv/bin/activate
python3 -c "
from src.word_db import WordDatabase
db = WordDatabase('data/custom_words.db')
for word in ['mycompany', 'myproject', 'myname']:
    db.add_word(word, 'custom')
print(db.count(), 'words protected')
"
```

VoxInput uses `fcntl.flock()` on `/tmp/voxinput.lock` to prevent duplicate instances.
On launch, stale processes are detected via `pgrep` and cleaned up automatically.
A poll-wait mechanism ensures a clean handoff when GNOME fires a double-launch from the desktop icon.
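The `fcntl.flock()` singleton guard can be sketched as follows. Linux-only, and a simplified illustration of the same idea as `run.py`, not its actual code; the demo lock path is hypothetical so it does not collide with a running VoxInput.

```python
# Non-blocking exclusive flock: the second opener gets BlockingIOError.
import fcntl
import os
import sys

def acquire_singleton(path="/tmp/voxinput_demo.lock"):
    """Return the open lock file, or None if another instance holds it."""
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    fh.write(str(os.getpid()))  # record the owner for stale-lock cleanup
    fh.flush()
    return fh

lock = acquire_singleton()
if lock is None:
    sys.exit("VoxInput is already running")
print("lock acquired by pid", os.getpid())
```

Keeping the file handle alive for the process lifetime is what holds the lock; it is released automatically when the process exits, so crashes cannot leave the lock stuck.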
The desktop entry is installed to:

- `~/.local/share/applications/voxinput.desktop`
- `~/Desktop/voxinput.desktop`
| Date | Change |
|---|---|
| Feb 23 | Push-to-Talk mode: hold key to record, release to inject. Full utterance buffering. |
| Feb 23 | DB-driven compound corrections: 35 multi-word ASR correction rules in SQLite |
| Feb 23 | SymSpell dictionary injection: 1,437 custom words as correction targets |
| Feb 23 | Golden Paragraph F: dictionary test recording + WER regression testing |
| Feb 23 | Fix GNOME desktop-icon race condition in singleton lock |
| Feb 22 | RNNoise AI denoiser + EasyEffects launcher + Processing toggles |
| Feb 22 | Homophone resolver: their/there/they're, to/too/two, its/it's, your/you're |
| Feb 21 | WebRTC sub-feature toggles (5 individual controls) |
| Feb 20 | Enterprise SQLite flight recorder with TRACE level + crash artifacts |
| Feb 20 | Performance Overhaul v2.0: 10 improvements across speed, memory, quality |
| Feb 19 | Number intelligence: spoken → numeric conversion with ordinals |
| Feb 19 | Cross-batch voice punctuation buffering |
| Feb 18 | C extension librms.so for zero-overhead RMS computation |
| Feb 18 | Hardware auto-detection (CPU/RAM/CUDA) with engine tuning |
- Zero network calls: all ASR runs locally via Vosk/Whisper
- No telemetry: trace logs stay in `logs/voxinput_logging.db` on your machine
- `.env`: API keys (if any) stored locally, gitignored
- `settings.json`: gitignored; use `settings.example.json` as a template
- `data/custom_words.db`: gitignored; your word list stays local
## Contributing

See CONTRIBUTING.md.

```bash
# Dev setup
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt -r requirements-dev.txt
pytest tests/unit/ -v
```

Good first issues: additional seed words, new ASR correction rules, Wayland injection improvements, Whisper VAD integration, new homophone groups.
MIT Β© BigRigVibeCoder