# VoiceFlow

Lightning-fast voice-to-text for macOS

Hold Option+Space, speak, release to paste. It's that simple.
Features • Installation • Usage • Voice Commands • Building

## Features

- Push-to-talk — Hold `⌥ Space` to record, release to transcribe and paste
- 100% local — All processing on-device with Metal GPU acceleration
- Multiple STT engines — Moonshine (ONNX), Whisper (whisper.cpp), and Qwen3-ASR
- Multiple LLM backends — mistral.rs and llama.cpp for wide model coverage
- Smart formatting — Automatic punctuation, capitalization, em-dashes, bullet lists
- Voice commands — Say "new paragraph", "bullet point", "question mark" and more
- Filler word removal — "um", "uh", "hmm" removed automatically
- Customizable dictionary — Editable replacement file for technical terms and proper nouns
- Context-aware — Reads cursor context via Accessibility API for seamless spacing
- Application detection — Adapts formatting for email, Slack, code editors
- Menu bar app — Minimal footprint, no dock icon
## Architecture

VoiceFlow supports two pipeline modes:
STT + LLM (default) — Separate speech-to-text and formatting stages:
Audio → STT Engine (Moonshine/Whisper) → Prosody Processing → LLM Formatting → Output
Consolidated — A single model handles both transcription and formatting:
Audio → Qwen3-ASR (Python daemon) → Post-processing → Output
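The two modes above amount to a simple dispatch between stage chains. A minimal sketch (the stage functions are stand-in stubs with hypothetical names, not VoiceFlow's actual API):

```rust
// Illustrative sketch of the two pipeline modes; stubs, not real engines.

enum PipelineMode {
    SttPlusLlm,   // Audio → STT → Prosody → LLM → Output
    Consolidated, // Audio → Qwen3-ASR → Post-processing → Output
}

// Stub stages that just record the call order.
fn stt(a: &str) -> String { format!("stt({a})") }
fn prosody(t: &str) -> String { format!("prosody({t})") }
fn llm_format(t: &str) -> String { format!("llm({t})") }
fn qwen3_asr(a: &str) -> String { format!("qwen3({a})") }
fn post_process(t: &str) -> String { format!("post({t})") }

fn run(mode: &PipelineMode, audio: &str) -> String {
    match mode {
        PipelineMode::SttPlusLlm => llm_format(&prosody(&stt(audio))),
        PipelineMode::Consolidated => post_process(&qwen3_asr(audio)),
    }
}

fn main() {
    println!("{}", run(&PipelineMode::SttPlusLlm, "audio"));
    println!("{}", run(&PipelineMode::Consolidated, "audio"));
}
```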
```
┌──────────────────────────────────────────────────────┐
│                  macOS SwiftUI App                   │
│     Menu bar, hotkeys, audio recording, settings     │
└─────────────────┬────────────────────────────────────┘
                  │ FFI (C bindings)
┌─────────────────▼────────────────────────────────────┐
│                    voiceflow-ffi                     │
│            C-compatible API, panic safety            │
└─────────────────┬────────────────────────────────────┘
                  │
┌─────────────────▼────────────────────────────────────┐
│                   voiceflow-core                     │
│                                                      │
│   Pipeline: Audio → STT → Prosody → LLM → Output     │
│                                                      │
│   STT Engines:        LLM Backends:   Prosody:       │
│   - Moonshine (ONNX)  - mistral.rs    - Voice cmds   │
│   - Whisper (cpp)     - llama.cpp     - Pause det.   │
│   - Qwen3-ASR (ext.)                  - Pitch det.   │
│                                       - Filler rem.  │
│                                       - Replacements │
└──────────────────────────────────────────────────────┘
```
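The "panic safety" note on the FFI layer refers to a standard pattern: a Rust panic must never unwind across the C boundary (that is undefined behavior), so exported functions catch panics and return an error value instead. A sketch of the pattern with hypothetical symbol names, not voiceflow-ffi's actual API:

```rust
// Sketch of panic-safe C FFI exports (illustrative names).
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::panic;

#[no_mangle]
pub extern "C" fn vf_process(input: *const c_char) -> *mut c_char {
    // Any panic inside this closure is caught instead of unwinding into C.
    let result = panic::catch_unwind(|| {
        let text = unsafe { CStr::from_ptr(input) }.to_string_lossy().into_owned();
        format!("processed: {text}")
    });
    match result {
        Ok(s) => CString::new(s).unwrap().into_raw(),
        Err(_) => std::ptr::null_mut(), // null signals failure to the caller
    }
}

// The caller must hand the string back to Rust to free it.
#[no_mangle]
pub extern "C" fn vf_string_free(s: *mut c_char) {
    if !s.is_null() {
        unsafe { drop(CString::from_raw(s)) };
    }
}

fn main() {
    let input = CString::new("hello").unwrap();
    let out = vf_process(input.as_ptr());
    let text = unsafe { CStr::from_ptr(out) }.to_string_lossy().into_owned();
    println!("{text}"); // processed: hello
    vf_string_free(out);
}
```

Returning null on failure keeps the C signature simple; a richer API could return an error code plus an out-parameter instead.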
## Installation

Requirements:

- macOS 13.0 (Ventura) or later
- Apple Silicon (M1/M2/M3/M4) recommended for best performance
- Download the latest `VoiceFlow.dmg` from Releases
- Open the DMG and drag VoiceFlow to Applications
- Launch VoiceFlow from Applications
- Grant Microphone and Accessibility permissions when prompted
⚠️ Important: VoiceFlow is not yet notarized by Apple. macOS Gatekeeper will block the app on first launch. To open it:
Option A: Right-click `VoiceFlow.app` → click Open → click Open again in the dialog

Option B: Go to System Settings → Privacy & Security, scroll down, and click Open Anyway
This only needs to be done once. After that, the app opens normally.
## Building

Prerequisites:

- Rust 1.70+ — Install via rustup
- Xcode Command Line Tools — `xcode-select --install`
- Swift 5.9+ — Included with Xcode
- For Qwen3-ASR mode: Python 3.10+ with `pip install qwen-asr torch soundfile`
```sh
# Clone the repository
git clone https://github.com/Era-Laboratories/voiceflow.git
cd voiceflow

# Build the Rust library
cargo build --release

# Build the macOS app
cd VoiceFlowApp
./build.sh

# Run
open build/VoiceFlow.app
```

Download models:

```sh
# Download default models (Moonshine Base + Qwen3 1.7B)
cargo run -p voiceflow-cli -- setup

# Specify different models
cargo run -p voiceflow-cli -- setup --whisper base --llm qwen3-4b

# Download all models for benchmarking
cargo run -p voiceflow-cli -- setup --benchmark
```

## Usage

- Launch VoiceFlow — Look for the icon in your menu bar
- Hold `⌥ Space` — Start speaking
- Release `⌥ Space` — Text is transcribed, formatted, and pasted at your cursor
The menu bar icon changes color to show status:
| Color | State |
|---|---|
| Default | Ready |
| Red | Recording |
| Yellow | Processing |
Access settings from the menu bar icon. Configure your STT engine, LLM model, pipeline mode, and hotkey preferences.
The CLI provides full access to VoiceFlow's capabilities:
```sh
cargo run -p voiceflow-cli -- <command>
```

| Command | Description | Key Flags |
|---|---|---|
| `record` | Record from microphone and transcribe | `--clipboard`, `--context <type>`, `--raw` |
| `file <path>` | Transcribe an audio file | `--context <type>`, `--raw` |
| `setup` | Download required models | `--whisper <size>`, `--llm <model>`, `--benchmark` |
| `config show` | Show current configuration | |
| `config set-model <model>` | Set the LLM model | |
| `config set-whisper <size>` | Set the Whisper model size | |
| `config set-mode <mode>` | Set pipeline mode | `stt-plus-llm` or `consolidated` |
| `config set-consolidated-model <model>` | Set the consolidated model | `qwen3-asr-0.6b` or `qwen3-asr-1.7b` |
| `config add-word <word>` | Add to personal dictionary | |
| `config path` | Show config file path | |
| `bench` | Run performance benchmark | `--iterations <n>`, `--file <path>` |
| `eval` | Evaluate transcription quality (LibriSpeech) | `--limit <n>`, `--samples`, `--raw`, `--analyze`, `--stt <model>`, `--llm <model>`, `--benchmark` |
| `models` | List available models | |
All commands support `--verbose` for debug output and `--config <path>` for a custom config file.
## Voice Commands

Punctuation:
| Say | Output |
|---|---|
| "period" / "full stop" | . |
| "comma" | , |
| "question mark" | ? |
| "exclamation mark" / "bang" | ! |
| "colon" | : |
| "semicolon" | ; |
| "ellipsis" / "dot dot dot" | ... |
Structure and quotes:
| Say | Output |
|---|---|
| "new line" / "line break" | Line break |
| "new paragraph" | Paragraph break |
| "open quote" / "close quote" | " |
| "apostrophe" | ' |
| "dash" / "em dash" | — |
| "hyphen" | - |
Brackets:
| Say | Output |
|---|---|
| "open paren" / "close paren" | ( ) |
| "open bracket" / "close bracket" | [ ] |
| "open brace" / "close brace" | { } |
Symbols:
| Say | Output |
|---|---|
| "ampersand" | & |
| "at sign" | @ |
| "hashtag" / "hash" | # |
| "dollar sign" | $ |
| "percent" | % |
| "asterisk" / "star" | * |
| "underscore" | _ |
| "slash" / "forward slash" | / |
| "backslash" | \ |
Operators:
| Say | Output |
|---|---|
| "equals" | = |
| "plus" | + |
| "minus" | - |
| "greater than" | > |
| "less than" | < |
| "pipe" | \| |
| "tilde" | ~ |
| "caret" | ^ |
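The spoken-command tables above boil down to a phrase-to-symbol lookup. A minimal sketch (names are illustrative; VoiceFlow's real prosody module also handles spacing and word boundaries):

```rust
// Sketch of applying spoken commands as text replacements.
use std::collections::HashMap;

fn command_table() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("full stop", "."),
        ("period", "."),
        ("comma", ","),
        ("question mark", "?"),
        ("exclamation mark", "!"),
        ("new paragraph", "\n\n"),
        ("em dash", "\u{2014}"),
        ("open paren", "("),
        ("close paren", ")"),
    ])
}

fn apply_commands(input: &str, table: &HashMap<&str, &str>) -> String {
    // Replace longer phrases first so "question mark" is matched before
    // any shorter command it might contain.
    let mut cmds: Vec<_> = table.iter().collect();
    cmds.sort_by_key(|(k, _)| std::cmp::Reverse(k.len()));
    let mut out = input.to_string();
    for (spoken, symbol) in cmds {
        out = out.replace(*spoken, *symbol);
    }
    out
}

fn main() {
    let t = command_table();
    println!("{}", apply_commands("any questions question mark", &t));
}
```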
- Punctuation — Added automatically based on speech patterns
- Capitalization — Sentences capitalized after punctuation
- Lists — Enumerated items converted to bullet points
- Em-dashes — Mid-sentence pauses become `—` dashes
- Filler removal — "um", "uh", "ah", "hmm", "er" removed
- Spelled-out words — "S M O L L M" becomes "SMOLLM"
- Technical terms — Configurable replacement dictionary (e.g., "G P T" → "GPT")
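The spelled-out-word behavior above ("S M O L L M" → "SMOLLM") amounts to joining runs of consecutive single capital letters. A minimal sketch, not VoiceFlow's actual implementation:

```rust
// Collapse runs of two or more single uppercase letters into one token.
fn collapse_spelled(input: &str) -> String {
    let words: Vec<&str> = input.split_whitespace().collect();
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < words.len() {
        // Find the end of a run of single uppercase letters starting at i.
        let mut j = i;
        while j < words.len()
            && words[j].len() == 1
            && words[j].chars().all(|c| c.is_ascii_uppercase())
        {
            j += 1;
        }
        if j - i >= 2 {
            // Two or more consecutive single letters: join them.
            out.push(words[i..j].concat());
            i = j;
        } else {
            out.push(words[i].to_string());
            i += 1;
        }
    }
    out.join(" ")
}

fn main() {
    println!("{}", collapse_spelled("use S M O L L M today"));
}
```

Requiring a run of at least two letters keeps a lone "I" or "A" untouched.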
## Models

Moonshine (ONNX Runtime — default):
| Model | Parameters | Size | Notes |
|---|---|---|---|
| Moonshine Tiny | 27M | ~190 MB | Fastest |
| Moonshine Base | 62M | ~400 MB | Default, best balance |
Whisper (whisper.cpp):
| Model | Parameters | Size | Expected WER |
|---|---|---|---|
| Tiny | 39M | 75 MB | ~7.5% |
| Base | 74M | 142 MB | ~5.0% |
| Small | 244M | 466 MB | ~4.2% |
| Medium | 769M | 1.5 GB | ~3.5% |
| Large V3 | 1.5B | 3.0 GB | ~2.9% |
| Large V3 Turbo | 809M | 1.6 GB | ~3.0% |
| Distil-Large V3 | 756M | 1.5 GB | ~3.5% |
Qwen3-ASR (Python daemon, consolidated mode):
| Model | Parameters | Size |
|---|---|---|
| Qwen3-ASR 0.6B | 0.6B | ~1.2 GB |
| Qwen3-ASR 1.7B | 1.7B | ~3.4 GB |
LLM models (for formatting):
| Model | Backend | Size (Q4_K_M) | License |
|---|---|---|---|
| Qwen3 1.7B | mistral.rs | 1.3 GB | Apache 2.0 |
| Qwen3 4B | mistral.rs | 2.5 GB | Apache 2.0 |
| SmolLM3 3B | llama.cpp | 1.9 GB | Apache 2.0 |
| Gemma 2 2B | mistral.rs | 1.7 GB | Gemma |
| Gemma 3n E2B | llama.cpp | 1.8 GB | Gemma |
| Gemma 3n E4B | llama.cpp | 3.2 GB | Gemma |
| Phi-4 Mini 3.8B | llama.cpp | 2.4 GB | MIT |
| Phi-2 | mistral.rs | 1.6 GB | MIT |
Custom GGUF models are also supported via llama.cpp.
## Configuration

VoiceFlow stores its configuration in a TOML file:

`~/Library/Application Support/com.era-laboratories.voiceflow/config.toml`
```toml
# Pipeline mode: "stt-plus-llm" (default) or "consolidated"
pipeline_mode = "stt-plus-llm"

# STT engine: "moonshine" (default), "whisper", or "qwen3-asr"
stt_engine = "moonshine"
moonshine_model = "base"
whisper_model = "base"

# Consolidated mode model (used when pipeline_mode = "consolidated")
consolidated_model = "qwen3-asr-0-6b"

# LLM model for formatting
llm_model = "qwen3-1-7b"

# Default context for formatting
default_context = "default"

# Auto-copy to clipboard
auto_clipboard = true

# LLM generation parameters
[llm_options]
max_tokens = 512
temperature = 0.3
top_p = 0.9
n_gpu_layers = -1
enable_thinking = false

# Audio settings
[audio]
sample_rate = 44100
vad_threshold = 0.01
silence_duration_ms = 800
```

Formatting contexts:

| Context | Behavior |
|---|---|
| `default` | General dictation with standard formatting |
| `email` | Email-appropriate tone and structure |
| `slack` | Casual, chat-style formatting |
| `code` | Code-aware formatting, preserves technical terms |
All settings can be overridden via environment variables prefixed with `VOICEFLOW_`:

```sh
VOICEFLOW_STT_ENGINE=whisper
VOICEFLOW_LLM_MODEL=qwen3-4b
VOICEFLOW_PIPELINE_MODE=consolidated
VOICEFLOW_LLM_TEMPERATURE=0.5
```

File locations:

| Path | Contents |
|---|---|
| `~/Library/Application Support/com.era-laboratories.voiceflow/config.toml` | Configuration |
| `~/Library/Application Support/com.era-laboratories.voiceflow/models/` | Downloaded ML models |
| `~/Library/Application Support/com.era-laboratories.voiceflow/prompts/` | Custom prompt templates |
## Project Structure

```
voiceflow/
├── crates/
│   ├── voiceflow-core/           # Core Rust library
│   │   ├── src/
│   │   │   ├── config.rs         # Configuration management
│   │   │   ├── pipeline.rs       # Main processing pipeline
│   │   │   ├── llm/              # LLM inference (mistral.rs + llama.cpp backends)
│   │   │   ├── transcribe/       # STT engines (Whisper, Moonshine)
│   │   │   ├── prosody/          # Voice commands, filler removal, pauses, pitch
│   │   │   └── audio/            # Audio capture and resampling
│   │   └── Cargo.toml
│   ├── voiceflow-cli/            # Command-line interface
│   │   └── src/commands/         # record, file, setup, config, bench, eval
│   └── voiceflow-ffi/            # C FFI for Swift bindings
├── VoiceFlowApp/                 # macOS SwiftUI application
│   ├── Sources/VoiceFlowApp/     # Swift UI, audio recording, hotkeys
│   └── build.sh                  # App bundle build script
├── prompts/                      # LLM system prompts & replacement dictionary
│   ├── default.txt               # Default formatting prompt
│   └── replacements.toml         # Customizable word replacements
├── scripts/
│   └── qwen3_asr_daemon.py       # Python daemon for Qwen3-ASR
└── Cargo.toml                    # Workspace configuration
```
## Contributing

See CONTRIBUTING.md for development setup, architecture details, and guidelines.
## License

MIT License — Era Laboratories 2024. See LICENSE for details.
Made with care for fast typists who'd rather talk
