VoxInput Logo

VoxInput

Privacy-first, offline voice dictation for Linux


Dictate text into any application using your voice. 100% offline. 100% private.

Quick Start • Features • Tech Stack • Architecture • Settings • Contributing


✨ Features

Feature Description
🔒 Privacy-First All processing happens locally. No internet required. No data leaves your machine.
⚡ Real-Time Streaming Text appears as you speak — Vosk delivers partial results with sub-200ms latency
🎯 Universal Injection Works in any text field — browsers, terminals, editors, chat apps, IDEs
⌨️ Global Hotkey Toggle with Super+Shift+V from anywhere via pynput
🎙️ Push-to-Talk Hold a configurable key (default: Right Ctrl) — speak — release to inject. Full utterance captured.
🔄 Dual ASR Engines Vosk (fast, streaming) or OpenAI Whisper (accurate, GPU-accelerated)
🧠 Smart NLP Pipeline Compound corrections + SymSpell + ASR rules + numbers + grammar + homophones
📖 Custom Dictionary SQLite DB of 1,400+ tech/AI/Linux terms — injected into SymSpell as correction targets
🔗 Compound Corrections DB-driven multi-word ASR correction: "pie torch"→PyTorch, "engine next"→nginx (35 defaults, user-extensible)
🎙️ Three Noise Engines WebRTC AEC, RNNoise AI denoiser, or EasyEffects — pick your fighter
🔊 Voice Punctuation Say "period", "comma", "new paragraph" — supports cross-batch buffering
🔢 Number Intelligence "one hundred twenty three" → 123, "twenty first" → 21st
📊 Live OSD Floating waveform overlay shows dictation state in real time
🏎️ C Extension Native librms.so — zero-Python-overhead RMS + PCM→float32 conversion
🖥️ Hardware Auto-Tune Detects CPU/RAM/GPU at startup and auto-selects optimal engine settings
🔍 Flight Recorder Enterprise SQLite black-box logger with TRACE level + crash artifacts
🖱️ Tray App + Desktop GTK3 system tray with full settings dialog, mic test, and desktop icon
🎯 Golden Test Suite Record once, test forever — WER accuracy regression testing with 6 test paragraphs

🚀 Quick Start

Prerequisites

# Ubuntu/Debian — required system packages
sudo apt install python3-venv python3-gi python3-gi-cairo \
                 gir1.2-gtk-3.0 gir1.2-appindicator3-0.1 \
                 portaudio19-dev xdotool

# Optional (for Wayland-native injection)
sudo apt install ydotool

# Optional (for RNNoise AI denoiser)
sudo apt install libladspa-ocaml-dev
# or install noise-suppression-for-voice from GitHub

Install

git clone https://github.com/BigRigVibeCoder/VoxInput.git
cd VoxInput
bash install.sh

The installer:

  1. Creates a Python virtualenv and installs all dependencies (~50 packages)
  2. Compiles the C RMS extension (librms.so) with -O3 -march=native
  3. Downloads the Vosk English model (~50MB)
  4. Seeds the protected-words database with 1,400+ tech/AI/developer terms
  5. Installs a .desktop entry and tray icon system-wide
  6. Configures optional auto-start on login

Launch

python3 run.py                   # CLI
# OR click the VoxInput icon in your app launcher / desktop
# OR toggle with Super+Shift+V

Verify Installation

# Run the unit test suite
source venv/bin/activate
pytest tests/unit/ -v

🔧 Technology Stack

Core Speech Engines

Technology Role Details
Vosk Primary ASR engine Offline Kaldi-based, real-time streaming, ~50MB model
OpenAI Whisper Alternate ASR engine GPU-accelerated (CUDA float16/int8), auto-punctuation
SymSpell Spell correction 1M+ words/sec edit-distance lookup, frequency-ranked

Audio Pipeline

Technology Role Details
PyAudio Audio capture 16kHz mono, int16 PCM via PortAudio bindings
librms.so (C) RMS + PCM conversion Custom ctypes extension — zero Python overhead
PulseAudio / PipeWire Device management pactl for source enumeration, volume, default device
WebRTC AEC Noise suppression PulseAudio module-echo-cancel with 5 tunable sub-features
RNNoise AI denoiser LADSPA plugin via module-ladspa-source
EasyEffects Advanced audio DSP Optional GUI-based effects chain launcher
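
The ctypes binding mentioned above can be sketched as below. The exported symbol name `compute_rms` and its signature are assumptions for illustration (check rms.c for the real interface); the pure-Python function is only a reference for what the native code computes.

```python
import ctypes
import math
import struct

def rms_int16(pcm: bytes) -> float:
    """Pure-Python reference: root-mean-square of a little-endian int16 PCM buffer.
    The C extension does the same work without Python-level loops."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[: 2 * n])
    return math.sqrt(sum(s * s for s in samples) / n)

def load_native(path="src/c_ext/librms.so"):
    """Bind the shared library via ctypes. `compute_rms` is a hypothetical
    symbol name used here for illustration."""
    lib = ctypes.CDLL(path)
    lib.compute_rms.argtypes = (ctypes.POINTER(ctypes.c_int16), ctypes.c_size_t)
    lib.compute_rms.restype = ctypes.c_double
    return lib
```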

Text Processing

Technology Role Details
ASR Rules Engine Artifact correction gonna→going to, woulda→would have, 20+ substitution rules
Number Parser Spoken→numeric Handles cardinals, ordinals, scales (one hundred→100, twenty first→21st)
Homophone Resolver Context-aware fixes Regex-based: their/there/they're, to/too/two, its/it's, your/you're, then/than, affect/effect
Grammar Engine Sentence structure Auto-capitalization, cross-batch state tracking
SQLite Protected words DB In-memory set[str] for O(1) lookups, WAL mode, 1,400+ seed terms
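
A regex-based homophone pass like the one in the table can be sketched as follows. The two rules here are illustrative stand-ins, not the project's actual rule set.

```python
import re

# Context-aware substitution rules: pattern -> replacement.
_RULES = [
    # "their is/are/was/were" is almost always "there ..."
    (re.compile(r"\btheir\s+(is|are|was|were)\b", re.IGNORECASE), r"there \1"),
    # "to" before certain adjectives is usually "too"
    (re.compile(r"\bto\s+(big|small|late|early|much|many)\b", re.IGNORECASE), r"too \1"),
]

def resolve_homophones(text: str) -> str:
    """Apply each contextual rule in order; non-matching text passes through."""
    for pattern, repl in _RULES:
        text = pattern.sub(repl, text)
    return text
```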

Desktop Integration

Technology Role Details
GTK3 (gi) UI framework System tray, settings dialog, OSD waveform overlay
AppIndicator3 Tray icon Idle/active SVG state icons
pynput Global hotkey Super+Shift+V keyboard listener
xdotool / ydotool Text injection X11 and Wayland-native keystroke simulation
pynput (fallback) Text injection Pure-Python X11 fallback when neither tool is available
fcntl Singleton lock /tmp/voxinput.lock — prevents duplicate instances

Observability

Technology Role Details
SQLite Flight recorder logs/voxinput_logging.db — every event, batched writes, auto-trim
TRACE level (5) High-frequency logging Custom level below DEBUG for audio loop events
Crash artifacts Post-mortem crash_artifacts table with stack traces + system state snapshots
sys.excepthook Root handler Catches all unhandled exceptions, writes crash artifact before exit
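
The custom TRACE level can be wired up with the standard logging module as sketched below (the SQLite sink and crash-artifact writer are elided; logger names here are illustrative).

```python
import logging

TRACE = 5  # below DEBUG (10), reserved for high-frequency audio-loop events
logging.addLevelName(TRACE, "TRACE")

def _trace(self, msg, *args, **kwargs):
    """Convenience method so callers can write log.trace(...)."""
    if self.isEnabledFor(TRACE):
        self._log(TRACE, msg, args, **kwargs)

logging.Logger.trace = _trace

log = logging.getLogger("voxinput.audio")  # hypothetical logger name
log.setLevel(TRACE)
log.trace("chunk rms=%.1f", 42.0)  # emitted only when TRACE is enabled
```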

Hardware Intelligence

Component Detection Impact
CPU cores psutil / os.cpu_count() Vosk chunk size: 100ms (8+ cores), 150ms (4+), 200ms (2)
RAM psutil / /proc/meminfo Memory-aware model selection
GPU (CUDA) torch.cuda / nvidia-smi Whisper backend: cuda/float16 (≥4GB), cuda/int8 (≥2GB), cpu/int8 (fallback)
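
The selection logic in this table reduces to a pair of threshold checks, sketched here. Detection itself (psutil, torch.cuda, nvidia-smi) is elided, and the function names are illustrative.

```python
def pick_whisper_backend(vram_gb=None):
    """Map detected VRAM to (device, compute type) per the table above.
    vram_gb is None when no CUDA GPU is present."""
    if vram_gb is not None and vram_gb >= 4:
        return ("cuda", "float16")
    if vram_gb is not None and vram_gb >= 2:
        return ("cuda", "int8")
    return ("cpu", "int8")

def pick_vosk_chunk_ms(cores: int) -> int:
    """Map CPU core count to the Vosk chunk size from the table above."""
    if cores >= 8:
        return 100
    if cores >= 4:
        return 150
    return 200
```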

πŸ—οΈ Architecture

VoxInput/
├── run.py                      # Entry point, singleton lock, stale-process cleanup
├── src/
│   ├── main.py                 # VoxInputApp orchestrator (4 threads)
│   ├── audio.py                # PyAudio capture (16kHz mono int16)
│   ├── recognizer.py           # Vosk / Whisper engine abstraction
│   ├── spell_corrector.py      # SymSpell + ASR rules + numbers + grammar
│   ├── homophones.py           # Context-aware homophone resolver (regex-based)
│   ├── word_db.py              # SQLite protected-words DB (in-memory set)
│   ├── injection.py            # xdotool / ydotool / pynput text injection
│   ├── mic_enhancer.py         # WebRTC / RNNoise / EasyEffects / auto-calibrate
│   ├── pulseaudio_helper.py    # PulseAudio/PipeWire source enumeration
│   ├── hardware_profile.py     # CPU / RAM / GPU auto-detection (singleton)
│   ├── ui.py                   # GTK3 tray + settings dialog + OSD overlay
│   ├── config.py               # App constants (paths, hotkey, sample rate)
│   ├── settings.py             # JSON settings manager
│   ├── logger.py               # Enterprise SQLite flight recorder
│   └── c_ext/
│       ├── rms.c               # Fast RMS + PCM→float32 (gcc -O3)
│       ├── librms.so           # Compiled shared library
│       └── __init__.py         # ctypes bindings
├── data/
│   ├── seed_words.py           # 1,400+ initial protected-word seed dataset
│   └── custom_words.db         # SQLite protected words (auto-created, gitignored)
├── assets/
│   ├── icon_idle.svg           # Tray icon: idle state
│   └── icon_active.svg         # Tray icon: listening state
├── bin/
│   ├── toggle.sh               # SIGUSR1 toggle script for hotkey binding
│   ├── gate_check.sh           # Pre-commit quality gate
│   └── ...                     # Benchmarking and packaging tools
├── tests/                      # Unit, integration, and E2E test suite
├── CODEX/                      # Project documentation (MANIFEST, GOV, BLU, SPR)
└── logs/                       # SQLite flight recorder database

Speech Pipeline

Microphone
    ↓ PyAudio (16kHz, mono, int16)
    ↓ librms.so — C native RMS level measurement
MicEnhancer
    ├── WebRTC AEC (noise gate + AGC + VAD + high-pass)
    ├── RNNoise AI denoiser (LADSPA plugin)
    └── EasyEffects (external DSP chain)
    ↓
SpeechRecognizer
    ├── Vosk: real-time streaming, partial results every 100–200ms
    └── Whisper: batch mode, GPU-accelerated (CUDA float16/int8)
    ↓ raw words  (PTT mode: buffered until key release)
VoicePunctuationBuffer
    ↓ cross-batch command assembly ("new" + "line" → "\n")
SpellCorrector
    ├── 0. Compound corrections (DB: "pie torch"→PyTorch, "engine next"→nginx)
    ├── 1. ASR artifact rules  (gonna→going to, woulda→would have…)
    ├── 2. Number parser       (one hundred twenty three → 123)
    ├── 3. WordDatabase check  (O(1) set lookup — never correct protected words)
    ├── 4. SymSpell lookup     (edit-distance ≤ 2, custom words injected at 1M frequency)
    └── 5. Grammar engine      (auto-capitalize, sentence tracking)
    ↓ corrected text
HomophoneResolver
    ↓ context-aware fixes (their/there/they're, to/too/two…)
TextInjector (ydotool → xdotool → pynput)
    ↓
Active window receives text ✓
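
The number-parser stage of this pipeline can be sketched as a small accumulator over number words. This condensed version handles cardinals up to the thousands plus ordinal units; the real module covers more scales and edge cases.

```python
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
         "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
ORD_UNITS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
             "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10}

def ordinal_suffix(n: int) -> str:
    """21 -> 'st', 22 -> 'nd', 11 -> 'th', ..."""
    if 10 <= n % 100 <= 20:
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")

def parse_spoken_number(phrase: str):
    """'one hundred twenty three' -> '123', 'twenty first' -> '21st'.
    Returns None if any word is not a recognized number word."""
    total = current = 0
    ordinal = False
    for w in phrase.lower().split():
        if w in UNITS:
            current += UNITS[w]
        elif w in TENS:
            current += TENS[w]
        elif w in ORD_UNITS:
            current += ORD_UNITS[w]
            ordinal = True
        elif w == "hundred":
            current = (current or 1) * 100
        elif w == "thousand":
            total += (current or 1) * 1000
            current = 0
        else:
            return None
    n = total + current
    return f"{n}{ordinal_suffix(n)}" if ordinal else str(n)
```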

Threading Model

Thread Responsibility
GTK Main UI rendering, tray icon, settings dialog, OSD
Audio Capture PyAudio callback β†’ queue (real-time priority)
Process Loop Recognizer β†’ SpellCorrector β†’ homophone β†’ injection queue
Injection Loop Drains queue β†’ xdotool/ydotool keystroke simulation
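
The queue hand-off between these threads can be sketched as a classic producer/consumer chain with a sentinel for shutdown. Names and the stand-in processing are illustrative, not VoxInput's actual classes.

```python
import queue
import threading

audio_q = queue.Queue()    # capture thread -> process loop
inject_q = queue.Queue()   # process loop  -> injection loop

def process_loop():
    """Drain audio chunks; None is the shutdown sentinel, forwarded downstream."""
    while (chunk := audio_q.get()) is not None:
        inject_q.put(f"recognized:{len(chunk)}")  # recognizer+corrector stand-in
    inject_q.put(None)

def injection_loop(out):
    """Drain corrected text; the real loop shells out to xdotool/ydotool."""
    while (text := inject_q.get()) is not None:
        out.append(text)

results = []
workers = [threading.Thread(target=process_loop),
           threading.Thread(target=injection_loop, args=(results,))]
for t in workers:
    t.start()
audio_q.put(b"\x00" * 3200)  # one 100 ms chunk at 16 kHz int16
audio_q.put(None)            # shutdown
for t in workers:
    t.join()
```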

βš™οΈ Settings Reference

Open via the tray icon → Settings, or right-click the tray icon.

🎀 Dictation Mode Tab

Setting Description
Mode Always On (toggle with hotkey) or Push-to-Talk (hold key to speak)
PTT Key Configurable keybind — click "Record Key" and press any key/combo
PTT Behavior Hold to record → release to process full utterance → inject text

🎀 Audio Tab

Setting Description
Input Device PulseAudio/PipeWire source selection
Mic Test Record + playback to verify input quality
Noise Suppression WebRTC AEC with 5 sub-toggles (noise gate, HF filter, VAD, analog/digital gain)
RNNoise AI-powered noise suppression via LADSPA plugin
EasyEffects Launch external DSP effects chain
Gain Input amplification (0.5–4.0×)
Auto-Calibrate Sample ambient noise floor → set threshold + volume automatically

🧠 Engine Tab

Setting Description
Engine Vosk (real-time streaming) or Whisper (accurate, GPU)
Vosk Model Dropdown model selector with validation
Whisper Model tiny / base / small / medium / large
Silence Threshold Seconds of silence before finalizing phrase
Speed Mode fast skips spell correction for lowest latency

✏️ Processing Tab

Setting Description
Spell Correction Enable/disable SymSpell post-processing
Voice Punctuation Say "period", "comma", "new line" to insert punctuation
Homophone Correction Context-aware their/there/they're, to/too/two fixes
Number Parsing Convert spoken numbers to digits (one hundred β†’ 100)
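
The Voice Punctuation toggle's cross-batch behavior can be sketched as a small buffer: a command like "new paragraph" may arrive split across two recognizer batches, so a dangling prefix is held until the next batch resolves it. The command table below is a simplified stand-in.

```python
COMMANDS = {("period",): ".", ("comma",): ",",
            ("new", "line"): "\n", ("new", "paragraph"): "\n\n"}
PREFIXES = {k[:-1] for k in COMMANDS if len(k) == 2}  # e.g. ("new",)

class PunctuationBuffer:
    def __init__(self):
        self.pending = []  # dangling command prefix from the previous batch

    def feed(self, batch_words):
        """Return output tokens for one batch, buffering trailing prefixes."""
        words, out = self.pending + batch_words, []
        self.pending = []
        i = 0
        while i < len(words):
            two, one = tuple(words[i:i + 2]), (words[i],)
            if two in COMMANDS:
                out.append(COMMANDS[two]); i += 2
            elif one in COMMANDS:
                out.append(COMMANDS[one]); i += 1
            elif one in PREFIXES and i == len(words) - 1:
                self.pending = [words[i]]; i += 1  # hold for next batch
            else:
                out.append(words[i]); i += 1
        return out
```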

📖 Words Tab

Browse, search, add, and remove entries in the Protected Words database.

  • Words in this list are never spell-corrected — passed through exactly as spoken
  • Custom words are injected into SymSpell as high-frequency correction targets (1M)
  • Ships with 1,400+ seed words: tech abbreviations, AI/ML terms, Linux distros & tools, developer frameworks, brands, US places & names, Agile/Scrum vocabulary, futurist/emerging tech
  • Search the list by word or category
  • Add a word → choose category → Enter or click ➕ Add
  • Remove — select a row → click 🗑️ Remove
  • Changes take effect immediately (no restart needed)
  • Database is stored in data/custom_words.db (SQLite, WAL mode, in-memory for O(1) lookups)

📖 Protected Words Database

The spell corrector uses a multi-pass approach:

  1. Compound Corrections — DB-driven multi-word ASR correction (35 defaults, user-extensible)
  2. ASR Rules — substitution table for common speech-to-text artifacts
  3. Number Parser — converts spoken numbers to digits with ordinal support
  4. WordDatabase — SQLite-backed exclusion list loaded into a set[str] at startup
  5. SymSpell — ultra-fast edit-distance dictionary lookup (custom words injected at 1M frequency)
  6. Grammar — auto-capitalization and sentence state tracking

Words in the database are never corrected, regardless of what SymSpell suggests. Custom words are also injected into SymSpell so misspellings correct toward your dictionary terms.
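The protect-then-correct interplay can be sketched as below. difflib stands in for SymSpell here so the example has no third-party dependency; the protected set and dictionary are tiny illustrative samples.

```python
import difflib

PROTECTED = {"pytorch", "nginx", "kubernetes", "systemd"}  # sample entries
DICTIONARY = sorted(PROTECTED | {"the", "quick", "brown"})  # correction targets

def correct_word(word: str) -> str:
    """Pass protected words through untouched; otherwise correct toward
    the dictionary (SymSpell in the real pipeline, difflib here)."""
    if word.lower() in PROTECTED:
        return word  # O(1) set lookup: never corrected
    match = difflib.get_close_matches(word.lower(), DICTIONARY, n=1, cutoff=0.8)
    return match[0] if match else word
```

Because custom words are also in the dictionary, near-misses like "pytorc" correct toward them rather than away.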

Compound Corrections

Vosk often splits unknown tech terms into phonetically similar English words:

Vosk Hears Corrected To
cooper eighties Kubernetes
pie torch PyTorch
engine next nginx
tail scale Tailscale
rough fauna Grafana
pincer flow TensorFlow
and symbol Ansible
a p i API

These are stored in the compound_corrections table in custom_words.db. Add your own via terminal:

python3 -c "
from src.word_db import WordDatabase
db = WordDatabase('data/custom_words.db')
db.add_compound_correction('my misheard phrase', 'CorrectWord')
"

Seed Categories

Category Examples
tech api, cuda, ebpf, rag, llm, grpc, wasm
ai pytorch, huggingface, ollama, vllm, qlora, dspy, langgraph
linux systemd, ebpf, btrfs, hyprland, flatpak, pipewire, nftables
dev pydantic, fastapi, tokio, duckdb, qdrant, prisma, drizzle
cloud terraform, argocd, eks, gke, cloudflare, fly, hetzner
agile scrum, kanban, retrospective, tdd, bdd, cqrs, asyncio
org nasa, darpa, ietf, cncf, deepmind, openai
future mamba, rwkv, lerobot, qiskit, neuromorphic, crewai
name 200+ common American first names
place All 50 US states + major cities
sports All NFL, NBA, MLB teams + sports terms

Adding Your Own Words

Via UI: Settings → 📖 Words tab → type word → select category → ➕ Add

Via terminal (bulk):

cd VoxInput && source venv/bin/activate
python3 -c "
from src.word_db import WordDatabase
db = WordDatabase('data/custom_words.db')
for word in ['mycompany', 'myproject', 'myname']:
    db.add_word(word, 'custom')
print(db.count(), 'words protected')
"

🔧 Singleton & Desktop Integration

VoxInput uses fcntl.flock() on /tmp/voxinput.lock to prevent duplicate instances. On launch, stale processes are detected via pgrep and cleaned up automatically. A poll-wait mechanism ensures clean handoff when GNOME fires a double-launch from the desktop icon.
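
The flock-based guard can be sketched as follows; the lock path below is a stand-in for /tmp/voxinput.lock, and the pgrep cleanup is elided.

```python
import fcntl
import os
import tempfile

LOCK_PATH = os.path.join(tempfile.gettempdir(), "voxinput.demo.lock")

def acquire_singleton(path=LOCK_PATH):
    """Take a non-blocking exclusive flock. Returns the fd on success
    (keep it open: the lock is released when the process exits), or
    None if another instance already holds it."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd

first = acquire_singleton()
second = acquire_singleton()  # second open of the same file is refused
```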

The desktop entry is installed to:

  • ~/.local/share/applications/voxinput.desktop
  • ~/Desktop/voxinput.desktop

📋 Recent Upgrades

Date Change
Feb 23 🎙️ Push-to-Talk mode — hold key to record, release to inject. Full utterance buffering.
Feb 23 🔗 DB-driven compound corrections — 35 multi-word ASR correction rules in SQLite
Feb 23 📊 SymSpell dictionary injection — 1,437 custom words as correction targets
Feb 23 🎯 Golden Paragraph F — dictionary test recording + WER regression testing
Feb 23 Fix GNOME desktop-icon race condition in singleton lock
Feb 22 RNNoise AI denoiser + EasyEffects launcher + Processing toggles
Feb 22 Homophone resolver: their/there/they're, to/too/two, its/it's, your/you're
Feb 21 WebRTC sub-feature toggles (5 individual controls)
Feb 20 Enterprise SQLite flight recorder with TRACE level + crash artifacts
Feb 20 Performance Overhaul v2.0 — 10 improvements across speed, memory, quality
Feb 19 Number intelligence: spoken→numeric conversion with ordinals
Feb 19 Cross-batch voice punctuation buffering
Feb 18 C extension librms.so for zero-overhead RMS computation
Feb 18 Hardware auto-detection (CPU/RAM/CUDA) with engine tuning

🔒 Privacy & Security

  • Zero network calls — all ASR runs locally via Vosk/Whisper
  • No telemetry — trace logs stay in logs/voxinput_logging.db on your machine
  • .env — API keys (if any) stored locally, gitignored
  • settings.json — gitignored; use settings.example.json as template
  • data/custom_words.db — gitignored; your word list stays local

🤝 Contributing

See CONTRIBUTING.md.

# Dev setup
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt -r requirements-dev.txt
pytest tests/unit/ -v

Good first issues: additional seed words, new ASR correction rules, Wayland injection improvements, Whisper VAD integration, new homophone groups.


📄 License

MIT © BigRigVibeCoder