VoiceQA — Hallucination detection for voice agents

The independent audit layer for your STT, TTS, and full-duplex pipeline.

Voice agents fabricate content at every layer — and most teams only check one:

STT (Whisper, Deepgram, AssemblyAI, …): invents words during silence, substitutes rare/domain words with common phonetic neighbors (metoprolol → metoprolole, apuntado → atontado). Worse in non-English and accented audio.
TTS (ElevenLabs, Cartesia, Play.ht, OpenAI TTS, …): mispronounces medication names, drops syllables, garbles numbers. Pipeline logs show correct text was generated; audio rendered something different.
Full-duplex models (Claude voice, ChatGPT voice, Gemini Live): hallucinate conversational content with no separate STT/TTS boundary to instrument.

VoiceQA is the independent witness. It doesn't trust any vendor in your pipeline — it re-derives signal from the audio itself. In healthcare, finance, insurance, and legal voice agents, "approximately right" is a compliance violation, and vendor confidence scores are biased toward saying the vendor was right.

Mental model

VoiceQA is the audit layer between your voice pipeline and production:

Provide audio (+ optional expected script)
Run layered, deterministic checks — VoiceQA acts as an independent witness, not a vendor self-report
Return a structured verdict: PASS | REVIEW | FAIL | LOW_CONFIDENCE
Save reports locally and run curated suites to catch regressions over time
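The four verdicts above could be collapsed from per-check results roughly as follows. This is a minimal sketch: the `CheckResult` shape, the severity labels, and the 0.6 confidence floor are assumptions made for illustration, not VoiceQA's actual report schema or thresholds.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CheckResult:
    # Hypothetical shape; the real report schema may differ.
    name: str
    severity: str  # "ok", "warn", or "critical"


def aggregate_verdict(checks: Iterable[CheckResult],
                      transcription_confidence: float,
                      confidence_floor: float = 0.6) -> str:
    """Collapse per-check results into PASS | REVIEW | FAIL | LOW_CONFIDENCE."""
    # If transcription itself is unreliable, text-based checks cannot be trusted.
    if transcription_confidence < confidence_floor:
        return "LOW_CONFIDENCE"
    severities = {c.severity for c in checks}
    if "critical" in severities:
        return "FAIL"
    if "warn" in severities:
        return "REVIEW"
    return "PASS"
```

Checking the confidence gate first means a noisy transcript is never reported as a clean PASS or a definitive FAIL.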

Why this is different

VoiceQA is designed to fail loudly on the things vendors hide:

Vendor-agnostic: works with any STT (Whisper, Deepgram, …), any TTS (ElevenLabs, Cartesia, …), any full-duplex model — VoiceQA never trusts the vendor under test
Deterministic-first: entities/vitals/terms/pauses/artifacts don't depend on an LLM judge
Audio-side verification: forced alignment + phoneme checks re-derive what was actually said, independent of whatever transcript the vendor returned
Local-first: practical for sensitive workflows (use synthetic scripts for public repos; keep audio local)
Regression-oriented: curated suites + baselines, not one-off inspection
Graceful degradation: if Ollama or optional tooling isn't available, the system still runs end-to-end

Transcript accuracy is necessary, not sufficient. VoiceQA is built for the rest.
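Pause detection is one example of a check that a transcript cannot express. The sketch below finds silence gaps by scanning raw samples; the function name, the amplitude threshold, and the 0.3 s minimum are assumptions for illustration, and a production check would more likely use frame-level energy than per-sample amplitude.

```python
from typing import List, Sequence, Tuple


def find_pauses(samples: Sequence[float], sample_rate: int,
                silence_threshold: float = 0.01,
                min_pause_s: float = 0.3) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for each run of quiet samples >= min_pause_s."""
    pauses: List[Tuple[float, float]] = []
    run_start = None  # index where the current quiet run began, if any
    for i, sample in enumerate(samples):
        quiet = abs(sample) < silence_threshold
        if quiet and run_start is None:
            run_start = i
        elif not quiet and run_start is not None:
            if (i - run_start) / sample_rate >= min_pause_s:
                pauses.append((run_start / sample_rate, i / sample_rate))
            run_start = None
    # Close a quiet run that extends to the end of the clip.
    if run_start is not None and (len(samples) - run_start) / sample_rate >= min_pause_s:
        pauses.append((run_start / sample_rate, len(samples) / sample_rate))
    return pauses
```

Because the output is a list of timestamps computed only from the audio, the same clip always yields the same pauses, regardless of which vendor produced it.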

What it checks

VoiceQA accepts an audio file (plus the expected script, when available), runs a layered pipeline, and returns a structured quality report:

Transcription + confidence gate (Whisper)
Transcript accuracy — WER/MER/WIL + diff (jiwer)
Audio artifacts — clipping, DC offset, noise, pops/clicks (deterministic)
Pauses — silence gaps with timestamps (deterministic)
Pause naturalness — classify pauses as within‑phrase vs between‑phrase (deterministic)
Speaking rate (beta) — per-segment speed checks; current too-fast/too-slow thresholds are intentionally conservative and should be recalibrated per voice/domain
Prosody — F0 mean, jitter, shimmer, HNR (Praat/Parselmouth; graceful skip)
MOS prediction — DNSMOS via SpeechMOS (graceful skip)
Entity fidelity — numbers/codes/dates mismatches (deterministic)
Vitals fidelity — BP / SpO2 / temperature parsing + comparison (deterministic, suite-only)
Term fidelity — “must preserve” symptom/medication words with criticality weights (suite-only)
Name fidelity — proper noun/name mismatches (conservative extraction + fuzzy match)
Faithfulness (optional) — semantic-only LLM-as-judge (Ollama; advisory; low confidence)
History — local SQLite (voiceqa.db)
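The transcript-accuracy item in the list above reports WER through jiwer. To show what that number measures, here is a hand-rolled, standard-library sketch of word error rate as word-level edit distance divided by reference length. It is not jiwer's implementation, and it applies only lowercasing and whitespace splitting as normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word deleted
                curr[j - 1] + 1,     # extra hypothesis word inserted
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

A single substituted medication name in a four-word script gives a WER of 0.25, which looks minor as a percentage; that is why the entity, vitals, and term checks exist alongside it.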

Design principles

Deterministic first. High-signal checks (entities, vitals, audio artifacts) should not depend on an LLM.
LLM-as-judge is advisory. Faithfulness is low-confidence by default and should not be the only safety gate.
Curated eval > random corpora. Suite manifests are committed; audio is local. You iterate on your failure modes.
No PHI in public. Use synthetic scripts and voices for open-source eval sets.
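As an illustration of the "deterministic first" principle, clipping and DC offset can be flagged with plain arithmetic on the samples. This is a sketch under assumptions: the function name and all three thresholds are hypothetical, and samples are assumed to be floats normalized to [-1.0, 1.0].

```python
from statistics import fmean
from typing import Dict, Sequence


def artifact_flags(samples: Sequence[float],
                   clip_level: float = 0.999,
                   max_clipped_fraction: float = 0.001,
                   max_dc_offset: float = 0.01) -> Dict[str, bool]:
    """Flag clipping and DC offset in float samples normalized to [-1.0, 1.0]."""
    if not samples:
        raise ValueError("samples must be non-empty")
    # Clipping: too many samples pinned at (or near) full scale.
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    # DC offset: the waveform's mean sits noticeably away from zero.
    dc_offset = fmean(samples)
    return {
        "clipping": clipped / len(samples) > max_clipped_fraction,
        "dc_offset": abs(dc_offset) > max_dc_offset,
    }
```

No model is involved, so a flagged clip can be reproduced and explained from the numbers alone.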

Who this is for

Teams building voice agents in regulated industries — healthcare, finance, insurance, legal, collections — where a hallucinated medication, dosage, or disclosure is a compliance violation, not a UX bug
Teams shipping multilingual voice agents where STT phonetic substitution gets worse on non-English audio
Teams using full-duplex voice models (Claude voice, ChatGPT voice) who find that existing eval tools, built around a separate STT/TTS boundary, have nothing to instrument
Builders who need repeatable, local/private evaluation workflows without sending audio to a third-party cloud

Stack

Python 3.11, FastAPI, LangChain, Whisper, Ollama, scipy, soundfile, jiwer, praat-parselmouth, SpeechMOS, SQLite

Quick demo (recommended)

Run the UI and try an eval suite:

symptom-triage — “should pass” baseline suite (terms + vitals)
hallucination-demo — intentional additions/omissions/contradictions (uses audio_script)
prosody-demo — deterministic weird pauses/pacing (uses postprocess)

Setup

Create/activate a Python 3.11 venv (example: source ~/venvs/voiceqa/bin/activate)
pip install -r requirements.txt
Optional (recommended): install and run Ollama
- brew install ollama
- ollama serve
- ollama pull phi3.5
- ollama pull llama3.2
cp .env.example .env (or edit .env directly)
Run the API: uvicorn main:app --reload --port 8000

Quickstart (UI)

source ~/venvs/voiceqa/bin/activate
uvicorn main:app --reload --port 8000

Open http://127.0.0.1:8000/ui.

Curated eval suites (recommended)

Suites live in eval_set/suites/<suite_id>/manifest.jsonl.

Typical workflow:

Put scripts (and expected terms) in the manifest (committed)
Generate local TTS WAVs (not committed) or record real audio samples
Run the suite from the UI or /eval/run
Inspect flagged clips (audio playback + jump-to flags + JSON) and iterate

For local regression tracking, you can save a suite baseline from the UI and compare future runs. Baselines are stored as eval_set/suites/<suite_id>/baseline.local.json and are gitignored.
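Conceptually, comparing a run against a baseline means finding cases whose verdict got worse. The sketch below assumes each run can be reduced to a mapping from case id to verdict; that mapping, the function name, and the verdict ordering are assumptions for illustration, and the actual layout of baseline.local.json is not documented here.

```python
from typing import Dict, List

# Lower rank = better. This ordering is an assumption, not VoiceQA's rule.
_RANK = {"PASS": 0, "REVIEW": 1, "LOW_CONFIDENCE": 1, "FAIL": 2}


def find_regressions(baseline: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return sorted case ids whose verdict is worse now than in the baseline."""
    regressed = []
    for case_id, old_verdict in baseline.items():
        new_verdict = current.get(case_id)
        # Cases missing from the current run are skipped, not counted as regressions.
        if new_verdict is not None and _RANK[new_verdict] > _RANK[old_verdict]:
            regressed.append(case_id)
    return sorted(regressed)
```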

Demo: generate a suite with ElevenLabs

This is the fastest way to get a realistic eval loop without committing audio.

source ~/venvs/voiceqa/bin/activate
export ELEVENLABS_API_KEY="..."        # do not commit
export ELEVENLABS_VOICE_ID="..."       # see generator output for options
python tools/generate_eval_audio_elevenlabs.py --suite symptom-triage

Then run symptom-triage from http://127.0.0.1:8000/ui.

Endpoints

GET /health
GET /ui — web UI
POST /analyse — analyse one audio file
POST /analyse/batch — analyse many audio files in one request (paired audio_files[i] + expected_scripts[i])
GET /eval/suites — list available eval suites from eval_set/suites/
POST /eval/run — run an eval suite (server-side, on local audio files)
GET /eval/audio/{suite_id}/{path} — serve eval audio clips to the UI
POST /eval/baseline/save — save a local baseline snapshot
POST /eval/baseline/compare — compare current run to baseline

Usage

Single file:

curl -X POST http://localhost:8000/analyse \
  -F "audio=@your_tts_output.wav" \
  -F "expected_script=Your expected script here"

Batch:

curl -X POST http://localhost:8000/analyse/batch \
  -F "audio_files=@audio1.wav" -F "audio_files=@audio2.wav" \
  -F "expected_scripts=Expected for audio 1" -F "expected_scripts=Expected for audio 2"
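The single-file curl call can also be issued from Python. The sketch below builds the same multipart request with only the standard library; the helper name is hypothetical, the `audio` and `expected_script` field names are taken from the curl example, and the `audio/wav` content type is an assumption.

```python
import urllib.request
import uuid


def build_analyse_request(audio_bytes: bytes, filename: str, expected_script: str,
                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a multipart/form-data POST to /analyse without sending it."""
    boundary = uuid.uuid4().hex
    script_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="expected_script"\r\n\r\n'
        f"{expected_script}\r\n"
    ).encode("utf-8")
    audio_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = script_part + audio_header + audio_bytes + closing
    return urllib.request.Request(
        f"{base_url}/analyse",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(...)` while the API is running and read the response body; the report is presumably JSON, though its exact fields are not specified here.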

Recommended models

Transcription: Whisper small (default) or large-v3 for better accuracy (slower)
QA report + faithfulness: phi3.5 / llama3.2 via Ollama (see .env)

Notes for healthcare

Use synthetic test scripts in public repos (no PHI).
Treat expected_terms (rare symptoms + medication names) as “must preserve” and curate them into suites; each term can carry a criticality weight.
Vitals are checked deterministically so “one eighty over one ten” still matches 180/110.
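One way a spoken reading such as “one eighty over one ten” could be matched against 180/110 is to normalize both sides to integers before comparing. The sketch below is an assumption-heavy illustration, not VoiceQA's parser: the function names are hypothetical, and it handles only blood pressure and a narrow set of English phrasings.

```python
import re
from typing import List, Optional, Tuple

_DIGITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
           "six": 6, "seven": 7, "eight": 8, "nine": 9}
_SMALL = {**_DIGITS, "zero": 0, "oh": 0,
          "ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
          "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
          "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
          "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
_NUMBER_WORDS = set(_SMALL) | {"hundred"}


def _spoken_to_int(words: List[str]) -> int:
    """['one', 'eighty'] -> 180, ['one', 'hundred', 'ten'] -> 110, ['ninety'] -> 90."""
    rest = [w for w in words[1:] if w != "hundred"]
    if words[0] in _DIGITS and len(words) > 1:
        # A leading digit followed by more words is read as the hundreds place.
        return _DIGITS[words[0]] * 100 + sum(_SMALL[w] for w in rest)
    return sum(_SMALL[w] for w in words if w != "hundred")


def parse_blood_pressure(text: str) -> Optional[Tuple[int, int]]:
    """Extract (systolic, diastolic) from '180/110' or 'one eighty over one ten'."""
    numeric = re.search(r"(\d{2,3})\s*/\s*(\d{2,3})", text)
    if numeric:
        return int(numeric.group(1)), int(numeric.group(2))
    tokens = re.findall(r"[a-z]+", text.lower())
    if "over" not in tokens:
        return None
    pivot = tokens.index("over")
    left: List[str] = []
    for word in reversed(tokens[:pivot]):  # number words just before "over"
        if word not in _NUMBER_WORDS:
            break
        left.insert(0, word)
    right: List[str] = []
    for word in tokens[pivot + 1:]:        # number words just after "over"
        if word not in _NUMBER_WORDS:
            break
        right.append(word)
    if not left or not right:
        return None
    return _spoken_to_int(left), _spoken_to_int(right)


def blood_pressure_matches(expected: str, heard: str) -> bool:
    """True only when both sides parse and yield the same reading."""
    parsed = parse_blood_pressure(expected)
    return parsed is not None and parsed == parse_blood_pressure(heard)
```

Comparing parsed integers rather than strings is what lets a spoken transcript match a written script, while a wrong diastolic value still fails.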

Screenshots

The UI is intentionally lightweight and local. If you publish this repo, add a screenshot/GIF of:

an eval suite run showing per-case audio playback + jump-to flags
a hallucination-demo case being flagged
a prosody-demo case being flagged

Tests

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest tests -v

