Release v0.3.9 — OmniVoice Studio · debpalash/OmniVoice-Studio

The dictation release — and a deep reliability pass driven by live-testing the entire app. Dictation is rebuilt end-to-end: instant feedback with a live waveform, words that commit about half a second after you stop speaking, clean punctuation, and text insertion that never lies about success. LLM providers get one-click connection testing with real diagnostics and model discovery, in all 21 languages. The app now always opens maximized, bottom buttons can't hide under the footer at small window sizes, and a wave of "out of memory / can't reach the backend / stuck at preparing" reports were traced to their real causes and fixed — including the silent VRAM crash on 8 GB cards, dead-IPC startup hangs after a Windows BSOD, and misleading error labels. Intel-Mac support status is now stated honestly, Confucius4-TTS is validated end-to-end, and Parakeet — roughly 20× faster than the default transcriber on CPU — is unlocked for every machine.

Added

Sponsor OmniVoice. A new SPONSORS.md (tiers, logo guidelines, how to sponsor), a README Sponsors section, and an in-app Sponsors area (Support page + a footer link) let people back the project — with a one-click "Become a sponsor" that opens a structured GitHub issue form, no account or token needed. Sponsorship is a thank-you, not a paywall: OmniVoice stays free and AGPL-3.0. (#923, #924)
OpenAPI reference in Settings. A new Settings → OpenAPI page embeds an interactive Scalar reference for OmniVoice's local backend API, with a one-click footer button. Fully local — Scalar is bundled, not loaded from a CDN, and phones home to nothing. (#928)
Engine Self-test. The Engines matrix gains a "Self-test" button for in-process TTS engines that runs a tiny real synthesis and reports duration + sample rate, proving that an engine actually makes audio rather than merely importing. Opt-in engines also get a copy-paste "export OMNIVOICE_*_DIR=…" setup line right in the "Why unavailable?" panel. (#930)
One canonical HuggingFace-token store + incomplete-download visibility. The Model Store token field now saves to and is cleared from the same encrypted store as Settings → Credentials (no more two-stores split), and a truncated model cache shows an "incomplete · N MB" state with one-click Repair and Delete instead of masquerading as "not installed". (#927)
Launchpad, reimagined as a deck of cards. The seven feature cards now fan out with animated waveform faces in each card's accent color; hover or keyboard-focus any card and it comes forward while the rest tuck underneath, and the layout stays usable down to the minimum window size. (#904)
See exactly what OmniVoice keeps on disk — and get warned before space runs out. Settings → Storage shows real usage for the model cache (with your largest models), app data, engine environments and temp files, plus a free-space gauge and low-disk / near-full-volume warnings with one-click paths to open folders or reclaim space. (#906)
A "What's new" changelog reader in Settings → Updates. The available update's real release notes now render in-app, alongside an offline changelog viewer and a one-time "what's new" note after each update. (#909)
Route each AI feature to its own LLM — or switch it off. A new Settings → LLM Skills panel lists every LLM-powered capability (Cinematic/Autofit translation, slot fitting, glossary auto-extract, direction parsing, dictation cleanup) with a per-skill toggle and provider picker, so sensitive work can stay on a local model while heavier jobs use a remote one. Disabled skills fall back to the exact non-LLM behavior. (#912)
A small thank-you moment, done right. After a successful export, dub, audiobook, or batch run, OmniVoice may — rarely — show a friendly, dismissible note by the footer heart about supporting development: never more than once a session, at most every 7 days, never for brand-new users, with a permanent "don't ask again". The logs bar also gained an icon and the footer icons now share one size. (#898)
Dictation, rebuilt. The dictation pill now shows a live waveform the moment the mic opens, streams words as you speak with real download/loading progress on first use, and commits your words after about half a second of silence instead of two and a half seconds. Transcripts come out properly capitalized and punctuated. Text insertion is now honest and safe: your clipboard is preserved and restored, failures show what to do (including a one-click jump to macOS Accessibility settings when permission is missing) instead of a false "Pasted", and Esc cancels cleanly at any point. The dictation model also pre-warms in the background after launch, so the first press of the hotkey no longer sits on a cold model load.
LLM Providers: one-click connection testing with real diagnostics. The Test button in Settings → LLM Providers now measures round-trip latency and turns failures into plain-language guidance — bad key (401/403), wrong model or URL (404), rate-limited (429), or unreachable server — instead of a raw exception dump. A new "Fetch models" button lists every model your key can access so you pick from real names instead of guessing. The whole panel is now translated into all 21 languages, provider error messages never echo your API key, and the settings API gained full test coverage.
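
The status-code guidance described for the Test button can be pictured as a small lookup. What follows is a minimal sketch, not OmniVoice's actual code: the names `explain_failure` and `scrub_key` and the wording of each message are hypothetical. Only the status groupings (401/403, 404, 429, unreachable server) and the rule that error messages never echo the API key come from the notes above.

```python
from typing import Optional

# Hypothetical message wording, for illustration only. The groupings
# (401/403, 404, 429) follow the release notes.
BAD_KEY = "The API key was rejected. Check that it is correct and active."
GUIDANCE = {
    401: BAD_KEY,
    403: BAD_KEY,
    404: "Model or URL not found. Check the model name and base URL.",
    429: "Rate-limited by the provider. Wait a moment and try again.",
}


def explain_failure(status: Optional[int]) -> str:
    """Turn an HTTP status into plain-language guidance.

    `status` is None when no response arrived at all (unreachable server).
    """
    if status is None:
        return "Server unreachable. Check the URL and your network connection."
    return GUIDANCE.get(status, f"Unexpected response from provider (HTTP {status}).")


def scrub_key(message: str, api_key: str) -> str:
    """Remove the API key from an error message before showing or logging it."""
    if not api_key:
        # str.replace with an empty needle would insert text between every
        # character, so an empty key is treated as "nothing to scrub".
        return message
    return message.replace(api_key, "[redacted]")
```

Keeping the mapping in one table means every surface that reports a provider failure can share the same wording, and scrubbing at the point where the message is built avoids relying on each caller to remember it.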

Changed

A "Get in touch" page that actually guides you. The Contact page is now a set of clearly labelled cards (report a bug, request a feature, get community help, support the project, report a security issue), each with a sentence on when to use it, instead of a flat link list. (#925)
Release titles are version-first. GitHub's release-list sidebar truncates the title, so "OmniVoice Studio v0.3.8" hid the version; releases are now named "vX.Y.Z — OmniVoice Studio" so the version is always visible. (#922)
Launchpad feature cards now fill the window. The seven cards (Voice Clone, Voice Design, Video Dubbing, Stories, Audiobook, Voice Gallery, Transcripts) span the full content width on a maximized display instead of a fixed ~780px fan, and reflow responsively (7→3→1 columns) down to the 900×600 minimum — driven by the shell's own width, keeping the animated card faces, hover/keyboard-focus raise, and reduced-motion fallback. (#915)
LLM Providers settings, de-confused. The old inline "LLM endpoint" box in Translation is gone — LLM Providers is now the one place that owns it. Fields pinned by an environment variable are shown disabled with an explainer instead of silently reverting, the make-active button explains when a provider is env-pinned, and the Cloudflare Account ID is remembered and editable. (#907)
Intel Macs: honestly unsupported for the local backend. PyTorch no longer ships Intel-Mac builds, so the backend cannot run there; instead of a cryptic dependency error, Intel users now get a clear explanation up front (with the remote-backend option), and the README/docs say so plainly. (#889, #891)
The app now always opens maximized (not fullscreen). Window size and position are no longer carried over from the previous session — one manual resize used to make every later launch reopen at that smaller size, overriding the intended maximized default. Same behavior on macOS (zoomed window, not a fullscreen Space), Windows, and Linux.

Fixed

Sherpa-ONNX "model not set" now reads as a setup problem, not out-of-memory. Selecting the sherpa-onnx engine without OMNIVOICE_SHERPA_MODEL configured used to fail with a misleading "ran out of memory — press Flush" 500; it now names the exact variable, points at Settings → Engines, and the engine is marked unavailable-with-a-reason in the picker (with a copy-paste setup line) instead of selectable-but-broken. Generalized so any env-gated engine surfaces actionable setup guidance. (#919)
Cinematic & Autofit now actually run on every translation engine. Picking Cinematic or Autofit on the default Argos engine (or NLLB) used to silently fall back to Fast with a success toast; it now runs the full LLM refine + fit pass, the Autofit fit pass is bounded by the same wall-clock budget as Cinematic, and provider errors are scrubbed of keys/user-ids. (#910)
Dictation no longer freezes on a slow or dead LLM. Transcript refinement is now hard-bounded (default 4 s): a placeholder key or unreachable endpoint falls back to clean unrefined text instead of stalling the paste for ~51 seconds. The dictation model is genuinely pre-warmed and reused across sessions, REST transcription is polished like live dictation, and Settings flags a configured-but-failing LLM. (#911)
Model installs fail loudly, not silently. Failed downloads keep their mirror-aware reason on the row with Retry/Dismiss instead of vanishing after a moment; installs check free disk space up front before overrunning it; in-progress installs get a Cancel button; and the HF-mirror setting only asks for a restart when it actually changed. (#908)
Engines settings, sharper and more honest. The Supertonic license "Accept" button works again (it had been inert since it shipped), the engine matrix refreshes the instant you pick an engine, picking a GPU engine that lands on CPU now warns you with the reason, CPU-only engines stop being mislabelled "CPU fallback", and an in-process "Test engine" pass reads as a dependency check instead of a fake "0 ms" latency. (#905)
Updates can no longer cost you data. Before any database migration runs on first launch of a new version, the database is snapshotted next to itself (newest three kept), and a failed migration stops with the backup path named instead of silently running on a half-upgraded database; the environment self-heal now verifies it's actually broken before rebuilding. (#909)
CUDA transcription now works on packaged NVIDIA installs: the cuDNN 8 compat libraries install automatically at launch. The install step only existed in the dev-loop scripts/setup.py, which isn't bundled into the packaged app, so real installs never got the libs and WhisperX / faster-whisper failed with "Could not locate cudnn_ops_infer64_8.dll". The Rust bootstrap now side-loads them on CUDA machines; CPU/AMD/ROCm boxes skip the download and cache the result so their launches stay instant. (#827, #869)
scripts/setup.py no longer fails with "No module named pip" when installing the cuDNN 8 libs in the dev loop. uv venv doesn't seed pip into the venv, so python -m pip install always broke; the script now uses uv pip install --python instead. (#869)
Generation timeouts now give device-honest advice. A CPU-only machine is no longer told the GPU is "VRAM-starved" or to "set the engine to CPU" — CPU hosts get compute-bound guidance (shorter text, the CPU-tuned GGUF/Supertonic-3 engines, the OMNIVOICE_GENERATE_TIMEOUT_S knob) while GPU hosts keep the VRAM-contention explanation. (#896)
Model-download failures now name the mirror that failed. When a Hugging Face mirror is configured and unreachable, every affected surface (generate, dub, Model Store installs) names the mirror and points at the exact setting instead of leaking a raw network error; auto-repair failures now say why the repair failed. (#874, #890)
No more infinite "preparing" after an unclean shutdown. If Windows corrupts the WebView cache (e.g. after a BSOD), the splash detects the dead IPC channel, proceeds via a direct backend health check, and — if truly stuck — offers a one-click "Repair and restart". (#879, #892)
"Out of memory" is no longer the default excuse. A failed model download mid-generation was mislabeled as OOM with useless "flush VRAM" advice; network failures are now classified honestly, only real OOM signatures get the OOM treatment, and first-use engine downloads retry once with a fresh connection. (#880, #893)
Hung transcriptions recover the same way everywhere. Chunked dub transcription now shares the same guarded-timeout + GPU-pool reset as the rest of the app, and repeated timeouts recommend the crash-isolated ASR engine — now properly selectable in Settings. (#730, #895)
A raw [Errno 22] transcribe error now tells you what to fix. When the OS rejects the temporary WAV write during dub transcription (a missing, read-only, or full temp directory, or antivirus interference), the stream used to dead-end as "Transcription produced no segments. [Errno 22] Invalid argument" with no next step; it now classifies the EINVAL and appends an actionable temp-dir/disk/AV hint — the same treatment the ffmpeg and compute-type failure classes already get. (#763)
Buttons can no longer hide under the logs footer on small windows. The bottom status/logs bar was a fixed overlay that pages had to compensate for with padding — any view that missed it (voice-card grids in Gallery and Community, bottom action rows) clipped under the bar at small window sizes, a class previously patched one page at a time (#476, #504). The footer is now a real row of the app shell, so content physically ends at its top edge at every window size, collapsed or expanded — guarded by a new layout test plus a 900×600 Playwright check at the app's minimum window size.
Confucius4-TTS is now validated end-to-end — and actually loads. The opt-in engine's first live run (Apple Silicon, CPU) caught three scaffold-era faults: the sidecar could never import confuciustts (upstream ships no packaging, so the documented pip install -e fails — the sidecar and bootstrap probe now put the clone on sys.path, like upstream's own example), the assumed 24 kHz sample rate was wrong (confirmed 22 050 Hz, now regression-tested), and the docs demanded an Amphion/MaskGCT install that doesn't exist (all weights auto-download from HuggingFace). CPU is ~17× realtime, so CUDA stays the recommended path; gpu_compat now advertises ("cuda", "cpu"). (#590)
Parakeet TDT transcription now works without an NVIDIA GPU. The nemo-parakeet ASR engine (parakeet-tdt-0.6b-v3, 25 languages, word timestamps) was hard-gated behind CUDA — but a live measurement on an Apple Silicon M2 shows it transcribing at ~10× realtime on CPU, roughly 20× faster than the default whisper-large-v3 on the same machine at equal accuracy. The false GPU gate is removed, so Mac and CPU-only users can now pick the dramatically faster engine in Settings → Engines.
8 GB GPUs: voice-clone/dub transcription no longer kills the backend. On cards where the TTS model already held most of the VRAM (e.g. RTX 4060 Ti 8 GB), loading whisper large-v3 in float16 for a reference-clip or dub transcription died as a native CUDA out-of-memory abort — the whole backend process vanished with no error logged, and the app showed "Can't reach the local OmniVoice backend." A new VRAM preflight re-checks free GPU memory right before the ASR load and steps down float16 → int8 → CPU instead of attempting a load that can't fit (opt-out: OMNIVOICE_ASR_VRAM_PREFLIGHT=0). (#723)
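
The float16 → int8 → CPU step-down can be sketched as a pure decision function. This is an illustration under stated assumptions, not the shipped preflight: the function names and both memory thresholds are hypothetical, and the way the app actually parses OMNIVOICE_ASR_VRAM_PREFLIGHT is not documented here (the sketch treats only the literal value "0" as the opt-out). In a real process the free-memory figure could come from a call such as `torch.cuda.mem_get_info()`, which reports free and total bytes for the current device.

```python
import os
from typing import Mapping, Optional

# Assumed memory needs in MiB, for illustration only. These are not
# measured values for whisper large-v3.
FLOAT16_NEED_MIB = 4500
INT8_NEED_MIB = 2500


def preflight_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """The preflight is on unless OMNIVOICE_ASR_VRAM_PREFLIGHT is set to "0"."""
    return env.get("OMNIVOICE_ASR_VRAM_PREFLIGHT", "1") != "0"


def pick_asr_step(free_vram_mib: Optional[int], enabled: bool = True) -> str:
    """Choose where and how to load the ASR model.

    `free_vram_mib` is the free GPU memory checked right before the load,
    or None when no CUDA device is present.
    """
    if free_vram_mib is None:
        return "cpu"
    if not enabled:
        # Opt-out: attempt the full-precision GPU load without checking.
        return "cuda-float16"
    if free_vram_mib >= FLOAT16_NEED_MIB:
        return "cuda-float16"
    if free_vram_mib >= INT8_NEED_MIB:
        return "cuda-int8"
    # Nothing fits on the GPU, so avoid a load that would abort natively.
    return "cpu"
```

The point of deciding before the load is that a native CUDA out-of-memory abort cannot be caught from Python once it happens, so the only safe place to handle it is ahead of time.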

CI

A migration can no longer silence the app's logs. Alembic's startup config was disabling every existing logger process-wide (a latent bug the new pre-migration backup logging exposed); fixed, and the migration-safety tests are now immune to full-suite ordering. (#909, #917)
Deterministically green tests + real install proof. Tests can no longer read the developer's real .env or app data (the order-dependent flake class, #878, #894), and a new cross-platform install-test workflow builds all four installers and proves a real first run — model download plus verified synthesis — on macOS, Windows, and Linux runners.

Windows x64 artifacts

38fb4e6a51dd351b6fdb2238099aa7b2fda225baffa145f65ca83eb4b8d4a818 *OmniVoice Studio_0.3.9_x64_en-US.msi
55627c5faadda152dfc92443eaf4f9c4c185fe5ec5ad58b68cd625879243bd02 *OmniVoice Studio_0.3.9_x64_en-US.msi.sig

Linux x64 artifacts

2f64c90e1c828899f2cab317b918990fcf90f3c3efe0314aaa9a958bee742a8b  OmniVoice Studio_0.3.9_amd64.AppImage
5a1f08360b4fc0529728c89959b2cea6f90dcb28ff73ea704fe32405015b2da7  OmniVoice Studio_0.3.9_amd64.AppImage.sig

macOS Intel artifacts

f8edc968e9ff2177933ccf2eead07722eff111572693573c5d4dec1ba1e1377f  OmniVoice Studio_0.3.9_x64.dmg
a1344e00310b969b2fbe235e1fd8ae2c5e40c1e8a59ffa9e83e3bca1f8d1bf5b  OmniVoice Studio.app.tar.gz
db62eb084ec12f4097ebed238fb03c50c395f3c383207fb2c0eeac89b3c03fb2  OmniVoice Studio.app.tar.gz.sig
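
The values listed above are 64-character hex digests in the usual SHA-256 checksum-list layout, so a download can be checked against them before installing. A minimal sketch using only the Python standard library follows; the helper names `sha256_of` and `matches` are made up for this example, and it assumes the file sits in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published checksum, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the Linux AppImage, for example, `matches(Path("OmniVoice Studio_0.3.9_amd64.AppImage"), "2f64c90e1c828899f2cab317b918990fcf90f3c3efe0314aaa9a958bee742a8b")` should return True for an intact download.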
