Demo: Voice Agent — memos, meeting transcription, and voice queries (ASR + TTS + DatabaseMixin) #389

@kovtcharov

Summary

Create a Voice Agent that combines voice memo dictation, meeting transcription, and voice-based querying in a single domain agent. The agent handles all voice-first workflows: it records and transcribes audio, uses an LLM to clean up transcriptions automatically (similar to Wispr Flow), exports notes as markdown, auto-labels and categorizes entries, stores them in a database, answers questions about stored content, and provides a simple web UI for browsing notes. It uses Lemonade v9.4.1 streaming ASR, TTS, and reranking.


LLM-Powered Transcription Enhancement

Raw ASR output is noisy: filler words, missing punctuation, run-on sentences, misheard terms. The Voice Agent passes each transcription through an LLM post-processing step before storage (verbatim mode skips this step):

Enhancement Pipeline

Microphone → Lemonade ASR (raw transcript) → LLM Enhancement → Markdown Formatting → Database Storage + .md File Export

What the LLM Fixes

| Issue | Raw ASR | After LLM Enhancement |
|---|---|---|
| Filler words | "So um we decided to uh use the Flux model" | "We decided to use the Flux model" |
| Punctuation | "launch target is march 15th budget approved for two gpus" | "Launch target is March 15th. Budget approved for two GPUs." |
| Grammar | "me and the team was discussing" | "The team and I were discussing" |
| Proper nouns | "we're using lennon aid server" | "We're using Lemonade Server" |
| Technical terms | "the cue wen model" | "the Qwen model" |

Enhancement Modes

| Mode | Behavior | Use Case |
|---|---|---|
| clean (default) | Remove fillers, fix punctuation/grammar, preserve meaning exactly | Quick memos |
| structured | Clean + organize into sections/bullet points with headings | Meeting minutes |
| verbatim | No LLM processing, raw ASR output | Legal/compliance recording |
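The three modes can be sketched as a small dispatch function. This is only an illustration: `enhance` and the `llm` callable (taking the raw text and the mode, returning enhanced text) are hypothetical names, not part of GAIA or Lemonade.

```python
from typing import Callable

VALID_MODES = ("clean", "structured", "verbatim")


def enhance(raw_text: str, mode: str, llm: Callable[[str, str], str]) -> str:
    """Return the text to store for one transcription.

    `llm` is a hypothetical callable: llm(raw_text, mode) -> enhanced text.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown enhancement mode: {mode!r}")
    if mode == "verbatim":
        # Verbatim entries bypass the LLM entirely, so the raw ASR
        # output is stored unchanged.
        return raw_text
    # "clean" and "structured" both go through the LLM; the mode tells
    # the prompt how aggressively to restructure the text.
    return llm(raw_text, mode).strip()
```

Keeping the verbatim branch outside the LLM call means a legal/compliance recording can never be altered by a model response.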

Context-Aware Enhancement

  • Domain vocabulary: Custom word list (e.g., "Lemonade", "Qwen", "GAIA", "NPU") stored in vocabulary table
  • Previous entries: Recent entries provide context for ambiguous terms
  • User corrections: When the user manually corrects a transcription, the agent records the correction and reuses it in later transcriptions
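One simple way to apply stored vocabulary corrections is a case-insensitive phrase replacement before (or alongside) the LLM step. The sketch below assumes corrections are available as a mapping from misheard phrase to correct term; `apply_vocabulary` is a hypothetical helper, and the spec itself leaves the mechanism open.

```python
import re


def apply_vocabulary(text: str, corrections: dict[str, str]) -> str:
    """Replace known mishearings with their corrected terms.

    `corrections` maps a misheard phrase (e.g. "lennon aid") to the
    intended term (e.g. "Lemonade"). Matching is case-insensitive and
    limited to whole words.
    """
    # Longest phrases first, so a longer mishearing is not partially
    # rewritten by a shorter one that it contains.
    for heard in sorted(corrections, key=len, reverse=True):
        pattern = re.compile(rf"\b{re.escape(heard)}\b", re.IGNORECASE)
        # A function replacement avoids backslash interpretation in the
        # corrected term.
        text = pattern.sub(lambda _m, fixed=corrections[heard]: fixed, text)
    return text
```

A deterministic pass like this could complement the LLM: the model handles ambiguous cases from context, while known mishearings are fixed the same way every time.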

Command Mode

Voice-edit stored content:

User: [speaks] "Edit memo 12 — make it more concise"
Agent: [rewrites via LLM] "Updated memo #12."

User: [speaks] "Turn memo 13 into bullet points"
Agent: [reformats via LLM] "Done — memo #13 reformatted."

User: [speaks] "Fix the spelling of Lemonade in all my memos"
Agent: [batch-corrects] "Fixed 3 occurrences across memos #8, #11, and #12."
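For single-memo edits like the first two examples, the target memo and the instruction could be pulled out with a small parser before the LLM is asked to rewrite anything. This is a regex-based sketch, not the agent's actual command handling (which may rely on LLM tool calls instead); `EditCommand` and `parse_edit_command` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class EditCommand(NamedTuple):
    memo_id: int
    instruction: str


# Matches "edit memo 12 ..." or "turn memo #13 ...". Anything after the
# number, minus leading punctuation and spaces, is kept as the instruction.
_EDIT_PATTERN = re.compile(
    r"^\s*(?:edit|turn)\s+memo\s+#?(\d+)\W*(.*)$",
    re.IGNORECASE | re.DOTALL,
)


def parse_edit_command(utterance: str) -> Optional[EditCommand]:
    """Return the memo id and instruction, or None if the pattern fails."""
    match = _EDIT_PATTERN.match(utterance)
    if match is None:
        # Batch requests ("... in all my memos") and free-form commands
        # fall through here and would need LLM-based interpretation.
        return None
    return EditCommand(int(match.group(1)), match.group(2).strip())
```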

Markdown Note Export

All entries are stored in the database and also exported as markdown files for portability and readability.

Markdown Output Format

Memos (~/.gaia/voice/notes/memo_012.md):

---
id: 12
type: memo
title: Design Team Meeting
labels: [gpu, budget, launch, infrastructure]
category: engineering
created: 2026-02-27T14:05:00
enhancement: clean
---

# Design Team Meeting

Meeting with design team. Decided to use the Flux model for the image
pipeline. Launch target is March 15th. Budget approved for two additional GPUs.

**Tags:** gpu, budget, launch, infrastructure

Meetings (~/.gaia/voice/notes/meeting_007.md):

---
id: 7
type: meeting
title: Q2 Planning
labels: [roadmap, npu, gpu, budget, action-items]
category: planning
duration: 31 min
word_count: 4230
created: 2026-02-27T14:01:00
enhancement: structured
---

# Q2 Planning — Feb 27, 2026

## Attendees
(auto-detected from transcript if speaker identification available)

## Discussion

### NPU Optimization
- Work is ahead of schedule
- Performance targets exceeded on Ryzen AI 300 series

### Infrastructure Budget
- Approved two additional GPUs for inference cluster
- Sarah to handle procurement

## Action Items
- [ ] Sarah: Prepare customer demo by March 10th
- [ ] Team: Finalize Q2 milestones by March 3rd

**Tags:** roadmap, npu, gpu, budget, action-items

Export Behavior

  • Auto-export: Every entry automatically saved as .md in ~/.gaia/voice/notes/
  • Sync: Database is source of truth; markdown files regenerated on edit
  • Batch export: gaia voice --export ./my-notes/ exports all entries as markdown
  • Custom template: Users can override the markdown template
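A minimal sketch of how a memo could be rendered into the format above, assuming the entry is available as a plain dict with the frontmatter keys shown in the memo example. `render_memo` and `note_path` are hypothetical helpers, not the planned contents of `markdown.py`.

```python
from pathlib import Path


def render_memo(entry: dict) -> str:
    """Render one memo as markdown with YAML frontmatter."""
    labels = ", ".join(entry["labels"])
    # Values are written as plain YAML scalars; a title containing ": "
    # would need quoting, which this sketch does not handle.
    frontmatter = "\n".join([
        "---",
        f"id: {entry['id']}",
        f"type: {entry['type']}",
        f"title: {entry['title']}",
        f"labels: [{labels}]",
        f"category: {entry['category']}",
        f"created: {entry['created']}",
        f"enhancement: {entry['enhancement']}",
        "---",
    ])
    return (
        f"{frontmatter}\n\n# {entry['title']}\n\n"
        f"{entry['content']}\n\n**Tags:** {labels}\n"
    )


def note_path(notes_dir: Path, entry: dict) -> Path:
    """Build the export path, e.g. memo_012.md for memo #12."""
    return notes_dir / f"{entry['type']}_{entry['id']:03d}.md"
```

Because the database is the source of truth, regenerating a file on edit is just a matter of calling the renderer again and overwriting the file at `note_path`.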

Auto-Labeling and Categorization

The LLM automatically assigns labels and a category to each entry upon creation.

Label Generation

LLM prompt: "Given this note, generate 3-6 short labels (1-2 words each)
that capture the key topics. Return as a JSON array of strings."

Input: "Meeting with design team. Decided to use the Flux model..."
Output: ["gpu", "budget", "launch", "flux-model", "design-team", "infrastructure"]
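Model output does not always follow the requested format exactly, so the label step probably needs a tolerant parser. The sketch below accepts either a JSON array or a comma-separated list and normalizes labels to lowercase, hyphenated form; `parse_labels` is a hypothetical helper, and the normalization rules are assumptions.

```python
import json


def parse_labels(llm_output: str, max_labels: int = 6) -> list[str]:
    """Turn raw LLM output into a deduplicated list of normalized labels."""
    text = llm_output.strip()
    try:
        parsed = json.loads(text)
        candidates = parsed if isinstance(parsed, list) else [text]
    except json.JSONDecodeError:
        # Not JSON: fall back to a comma-separated list.
        candidates = text.split(",")

    labels: list[str] = []
    for item in candidates:
        # "Flux Model" -> "flux-model"; stray quotes and spaces dropped.
        label = "-".join(str(item).strip().strip('"').lower().split())
        if label and label not in labels:
            labels.append(label)
    # Enforce the upper bound; the caller decides what to do when fewer
    # than three labels come back.
    return labels[:max_labels]
```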

Category System

Predefined categories (LLM selects the best match):

| Category | Description | Example |
|---|---|---|
| engineering | Technical decisions, code, architecture | "Decided to use Flux model" |
| planning | Roadmaps, timelines, milestones | "Q2 planning meeting" |
| action-items | Tasks, to-dos, assignments | "Sarah to prepare demo" |
| ideas | Brainstorming, feature proposals | "What if we added voice to SD agent" |
| reference | Facts, specs, documentation notes | "NPU supports INT8 quantization" |
| personal | Personal notes, reminders | "Pick up groceries" |
| other | Anything else | Fallback |
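Since the LLM may return a category outside the predefined set, the result should be validated before it is stored. A small sketch, assuming the model returns a bare category name; `normalize_category` is a hypothetical helper.

```python
CATEGORIES = (
    "engineering", "planning", "action-items",
    "ideas", "reference", "personal", "other",
)


def normalize_category(llm_output: str) -> str:
    """Map raw LLM output onto a known category, falling back to 'other'."""
    candidate = llm_output.strip().strip(".\"'").lower().replace(" ", "-")
    return candidate if candidate in CATEGORIES else "other"
```

Falling back to `other` keeps the `category` column limited to the seven documented values even when the model improvises.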

Database Schema for Labels

CREATE TABLE labels (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL UNIQUE,
    color TEXT,              -- hex color for UI display
    entry_count INTEGER DEFAULT 0
);

CREATE TABLE entry_labels (
    entry_id INTEGER REFERENCES entries(id),
    label_id INTEGER REFERENCES labels(id),
    PRIMARY KEY (entry_id, label_id)
);

Querying by Label

$ gaia voice --search --label gpu
  #12  Design team meeting    [gpu, budget, launch]         Feb 27
  #7   Q2 Planning            [roadmap, npu, gpu, budget]   Feb 27

$ gaia voice --search --category planning
  #7   Q2 Planning            [roadmap, npu, gpu, budget]   Feb 27
  #3   Sprint Retrospective   [sprint, velocity, planning]  Feb 21
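The label filter maps naturally onto a join across the two label tables. The sketch below uses Python's standard `sqlite3` module directly rather than `DatabaseMixin`, and a trimmed `entries` table; `attach_label` and `entries_with_label` are hypothetical helper names.

```python
import sqlite3

# Label tables as specified above; `entries` is trimmed to two columns
# for the purposes of this example.
SCHEMA = """
CREATE TABLE entries (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE labels (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL UNIQUE,
    color TEXT,
    entry_count INTEGER DEFAULT 0
);
CREATE TABLE entry_labels (
    entry_id INTEGER REFERENCES entries(id),
    label_id INTEGER REFERENCES labels(id),
    PRIMARY KEY (entry_id, label_id)
);
"""


def attach_label(conn: sqlite3.Connection, entry_id: int, name: str) -> None:
    """Link an entry to a label, creating the label on first use."""
    conn.execute("INSERT OR IGNORE INTO labels (name) VALUES (?)", (name,))
    label_id = conn.execute(
        "SELECT id FROM labels WHERE name = ?", (name,)
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT OR IGNORE INTO entry_labels (entry_id, label_id) VALUES (?, ?)",
        (entry_id, label_id),
    )
    if cur.rowcount:  # bump the counter only when a new link was created
        conn.execute(
            "UPDATE labels SET entry_count = entry_count + 1 WHERE id = ?",
            (label_id,),
        )


def entries_with_label(conn: sqlite3.Connection, name: str) -> list[tuple]:
    """Return (id, title) for every entry carrying the label, newest id first."""
    return conn.execute(
        """
        SELECT e.id, e.title
        FROM entries e
        JOIN entry_labels el ON el.entry_id = e.id
        JOIN labels l ON l.id = el.label_id
        WHERE l.name = ?
        ORDER BY e.id DESC
        """,
        (name,),
    ).fetchall()
```

With entries #12 and #7 both labeled `gpu`, `entries_with_label(conn, "gpu")` returns them in the same order as the CLI output above. Note that `entry_count` is a denormalized counter, so deletes would need a matching decrement.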

Simple Web UI for Viewing Notes

A lightweight web viewer served by GAIA's existing FastAPI server, following the same HTML template pattern used by the summarize app (src/gaia/apps/summarize/templates/).

UI Features

┌──────────────────────────────────────────────────────────────────┐
│  GAIA Voice Notes                              🔍 Search...      │
├──────────────┬───────────────────────────────────────────────────┤
│              │                                                   │
│  CATEGORIES  │  # Design Team Meeting                           │
│  ───────────│  📅 Feb 27, 2026  ·  memo  ·  42 words           │
│  All (24)    │                                                   │
│  engineering │  Meeting with design team. Decided to use the    │
│  planning    │  Flux model for the image pipeline. Launch       │
│  action-items│  target is March 15th. Budget approved for two   │
│  ideas       │  additional GPUs.                                │
│  reference   │                                                   │
│              │  Labels: gpu  budget  launch  infrastructure      │
│  LABELS      │                                                   │
│  ───────────│  ┌─────────┐ ┌──────────┐ ┌────────┐            │
│  gpu (3)     │  │ ✏️ Edit  │ │ 📋 Copy MD│ │ 🗑 Del  │            │
│  budget (2)  │  └─────────┘ └──────────┘ └────────┘            │
│  launch (2)  │                                                   │
│  npu (1)     │───────────────────────────────────────────────── │
│              │                                                   │
│  RECENT      │  # Q2 Planning                                   │
│  ───────────│  📅 Feb 27, 2026  ·  meeting  ·  31 min          │
│  Meeting #7  │                                                   │
│  Memo #13    │  ## NPU Optimization                             │
│  Memo #12    │  - Work is ahead of schedule...                  │
│              │                                                   │
└──────────────┴───────────────────────────────────────────────────┘

Implementation

| File | Content |
|---|---|
| src/gaia/apps/voice/webui/index.html | Single-page app with sidebar (categories, labels, recent) + main content area (rendered markdown) |
| src/gaia/apps/voice/webui/style.css | Clean, minimal styling; dark/light mode |
| src/gaia/apps/voice/app.config.json | Electron app config (window size, dev server port) |
| src/gaia/api/voice_endpoints.py | FastAPI endpoints for the UI |

REST Endpoints (served by GAIA API)

| Method | Path | Description |
|---|---|---|
| GET | /api/voice/entries | List all entries (filterable by type, category, label) |
| GET | /api/voice/entries/{id} | Get single entry with rendered markdown |
| GET | /api/voice/entries/{id}/raw | Get raw markdown source |
| PUT | /api/voice/entries/{id} | Update entry (edit content, labels, category) |
| DELETE | /api/voice/entries/{id} | Delete entry |
| GET | /api/voice/labels | List all labels with counts |
| GET | /api/voice/categories | List categories with counts |
| GET | /api/voice/search?q=...&label=...&category=... | Search entries |
| GET | /api/voice/export | Export all entries as zip of markdown files |
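For the export endpoint, the zip archive can be built in memory with the standard library and returned as the response body. The sketch below covers only the archive step and assumes the notes are already available as a filename-to-markdown mapping; `build_export_zip` is a hypothetical helper, and the FastAPI wiring is omitted.

```python
import io
import zipfile


def build_export_zip(notes: dict[str, str]) -> bytes:
    """Pack markdown notes into a zip archive held in memory.

    `notes` maps a filename such as "memo_012.md" to its markdown source.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sorted order keeps the archive layout stable between exports.
        for filename, markdown in sorted(notes.items()):
            archive.writestr(filename, markdown)
    return buffer.getvalue()
```

Building the archive in memory avoids temporary files; for very large note collections a streamed response would likely be a better fit.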

UI Technology

  • Pure HTML/CSS/JS — no React/build step required (matches summarize app pattern)
  • Markdown rendered client-side via lightweight library (e.g., marked.js)
  • Responsive layout for desktop and mobile
  • YAML frontmatter displayed as metadata badges
  • Access via: gaia voice --ui opens in browser, or load in Electron via app config

Demo Scenarios

Voice Memos

$ gaia voice

User: [speaks] "New memo um meeting with design team so we decided to use
       the flux model for the image pipeline uh launch target is march 15th
       and budget approved for two additional gpus"
Agent: [transcribes → LLM cleans → auto-labels → saves to DB + markdown]
       "Saved memo #12 — Design team meeting
        Labels: gpu, budget, launch, infrastructure
        Category: engineering
        Exported to ~/.gaia/voice/notes/memo_012.md"

Meeting Transcription

$ gaia voice --meeting "Q2 Planning"

Agent: Recording... (live captions displayed)
  [14:01] "Welcome everyone. Today we're discussing the Q2 roadmap."
  [14:15] "Budget approved for two additional GPUs."
  [Ctrl+C to stop]

Agent: Meeting saved — 4,230 words, 31 minutes.
       Labels: roadmap, npu, gpu, budget, action-items
       Category: planning
       Exported to ~/.gaia/voice/notes/meeting_007.md

Browse UI

$ gaia voice --ui
Agent: Voice Notes UI running at http://localhost:8080/voice

List, Search & Edit

$ gaia voice --list
  TYPE      ID   TITLE                  CATEGORY      LABELS                        DATE
  meeting   #7   Q2 Planning            planning      roadmap, npu, gpu, budget     Feb 27
  memo      #13  Customer demo prep     action-items  demo, customer, preparation   Feb 27
  memo      #12  Design team meeting    engineering   gpu, budget, launch           Feb 27

$ gaia voice --search "GPU budget"
  Meeting #7 [14:15]: "Budget approved for two additional GPUs..."

$ gaia voice --search --label gpu
  #12  Design team meeting    [gpu, budget, launch]
  #7   Q2 Planning            [roadmap, npu, gpu, budget]

$ gaia voice --export ./my-notes/
  Exported 24 entries to ./my-notes/ (14 memos, 10 meetings)

Architecture

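from pathlib import Path

# Agent and DatabaseMixin are GAIA base classes; import them from the GAIA package.

class VoiceAgent(Agent, DatabaseMixin):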
    """Voice-first agent with LLM-enhanced transcription, auto-labeling,
    markdown export, database storage, and semantic search."""

    def __init__(self, db_path=".gaia/voice.db", notes_dir="~/.gaia/voice/notes", **kwargs):
        super().__init__(**kwargs)
        self.init_db(db_path)
        self.notes_dir = Path(notes_dir).expanduser()
        self.notes_dir.mkdir(parents=True, exist_ok=True)

        if not self.table_exists("entries"):
            self.execute("""
                CREATE TABLE entries (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    type TEXT NOT NULL,              -- 'memo' or 'meeting'
                    title TEXT,
                    content_raw TEXT NOT NULL,        -- original ASR output
                    content TEXT NOT NULL,            -- LLM-enhanced version
                    content_markdown TEXT NOT NULL,   -- full markdown with frontmatter
                    category TEXT DEFAULT 'other',
                    enhancement_mode TEXT DEFAULT 'clean',
                    duration_seconds INTEGER,
                    word_count INTEGER,
                    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
                    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
                )
            """)

        if not self.table_exists("segments"):
            self.execute("""
                CREATE TABLE segments (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    entry_id INTEGER REFERENCES entries(id),
                    timestamp_offset REAL,
                    text_raw TEXT NOT NULL,
                    text TEXT NOT NULL,
                    speaker TEXT
                )
            """)

        if not self.table_exists("labels"):
            self.execute("""
                CREATE TABLE labels (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    name TEXT NOT NULL UNIQUE,
                    color TEXT,
                    entry_count INTEGER DEFAULT 0
                )
            """)

        if not self.table_exists("entry_labels"):
            self.execute("""
                CREATE TABLE entry_labels (
                    entry_id INTEGER REFERENCES entries(id),
                    label_id INTEGER REFERENCES labels(id),
                    PRIMARY KEY (entry_id, label_id)
                )
            """)

        if not self.table_exists("vocabulary"):
            self.execute("""
                CREATE TABLE vocabulary (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    term TEXT NOT NULL UNIQUE,
                    correction TEXT,
                    context TEXT
                )
            """)

Agent Capabilities

| Capability | Description |
|---|---|
| Memo dictation | Voice notes → LLM-enhanced transcription → auto-title → auto-label → categorize → store + export md |
| Meeting recording | Long-form recording with live captions, timestamped segments, structured markdown output |
| LLM enhancement | Filler removal, punctuation, grammar, proper noun correction, formatting |
| Auto-labeling | LLM generates 3-6 topic labels per entry, stored in normalized label table |
| Auto-categorization | LLM assigns one category (engineering, planning, action-items, ideas, reference, personal, other) |
| Markdown export | Every entry exported as .md with YAML frontmatter; batch export supported |
| Command mode | Voice-edit stored content ("make this concise", "relabel this", "change category") |
| Custom vocabulary | Domain-specific word list for proper noun correction |
| Voice queries | Ask questions about stored content via reranking + LLM |
| File import | Transcribe pre-recorded WAV/audio files via REST endpoint |
| Web UI | Browse, search, filter by label/category, view rendered markdown |
| List & search | CLI: browse and search with label/category filters |

Demo Deliverables

| File | Content |
|---|---|
| src/gaia/agents/voice/agent.py | VoiceAgent(Agent, DatabaseMixin): full agent implementation |
| src/gaia/agents/voice/prompts.py | System prompts for enhancement, labeling, categorization |
| src/gaia/agents/voice/markdown.py | Markdown generation with YAML frontmatter |
| src/gaia/apps/voice/webui/index.html | Single-page note viewer (HTML/CSS/JS) |
| src/gaia/apps/voice/webui/style.css | UI styling (dark/light mode) |
| src/gaia/apps/voice/app.config.json | Electron app config |
| src/gaia/api/voice_endpoints.py | FastAPI REST endpoints for UI |
| src/gaia/cli.py | gaia voice subcommand with all flags |
| examples/voice_agent_demo.md | Walkthrough of all workflows |
| tests/unit/test_voice_agent.py | Unit tests (mocked ASR, LLM, database) |

What This Exercises

  • Streaming ASR (real-time transcription via WebSocket) — new v9.4.1
  • Streaming TTS (voice responses via audio/speech) — new v9.4.1
  • REST audio transcription (file import via /audio/transcriptions) — new v9.4.1
  • Reranking (accurate search over stored content) — new v9.4.1
  • LLM chat completions (enhancement, labeling, categorization, command mode) — existing
  • DatabaseMixin (structured storage with SQLite, normalized labels) — existing
  • FastAPI (REST endpoints for UI) — existing
  • Electron (optional desktop app wrapper) — existing
  • Auto-detection of Lemonade audio backends — new (Extend TalkSDK and AudioClient with Lemonade ASR+TTS auto-detection #386)

LLM Prompts

Enhancement Prompt

You are a transcription enhancer. Given raw speech-to-text output, produce
clean, well-formatted text.

Rules:
- Remove filler words (um, uh, like, you know, so)
- Add proper punctuation and capitalization
- Fix obvious grammar errors while preserving the speaker's intent
- Correct known terms from the vocabulary list: {vocabulary}
- Do NOT add information that wasn't spoken
- Do NOT change the meaning or omit substantive content

Enhancement mode: {mode}
- clean: Fix errors, preserve structure
- structured: Fix errors + organize into sections/bullets
- verbatim: Return as-is (no changes)

Raw transcription:
{raw_text}
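Filling the three placeholders can be done with plain string formatting. The sketch below embeds the prompt above as a template; `build_enhancement_prompt` is a hypothetical helper, and rendering an empty vocabulary as "(none)" is an assumption.

```python
ENHANCEMENT_TEMPLATE = """\
You are a transcription enhancer. Given raw speech-to-text output, produce
clean, well-formatted text.

Rules:
- Remove filler words (um, uh, like, you know, so)
- Add proper punctuation and capitalization
- Fix obvious grammar errors while preserving the speaker's intent
- Correct known terms from the vocabulary list: {vocabulary}
- Do NOT add information that wasn't spoken
- Do NOT change the meaning or omit substantive content

Enhancement mode: {mode}
- clean: Fix errors, preserve structure
- structured: Fix errors + organize into sections/bullets
- verbatim: Return as-is (no changes)

Raw transcription:
{raw_text}
"""


def build_enhancement_prompt(raw_text, mode="clean", vocabulary=()):
    """Fill the enhancement template for one transcription."""
    return ENHANCEMENT_TEMPLATE.format(
        # An empty vocabulary is rendered explicitly so the rule still
        # reads sensibly.
        vocabulary=", ".join(vocabulary) or "(none)",
        mode=mode,
        raw_text=raw_text,
    )
```

Because the transcript is passed as a format argument rather than spliced into the template, braces spoken or transcribed in `raw_text` do not interfere with formatting.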

Labeling Prompt

Given this note, generate 3-6 short labels (1-2 words each) that capture
the key topics. Return as a JSON array of strings.

Note: {content}
Output: ["label1", "label2", ...]

Categorization Prompt

Classify this note into exactly one category.
Categories: engineering, planning, action-items, ideas, reference, personal, other

Note: {content}
Output: category_name

Dependencies

Acceptance Criteria

  • Single VoiceAgent handles memos, meetings, editing, and queries
  • LLM enhancement pipeline cleans raw ASR output before storage
  • Both content_raw and content (enhanced) stored in database
  • Three enhancement modes: clean, structured, verbatim
  • Auto-labeling generates 3-6 topic labels per entry
  • Auto-categorization assigns one of 7 categories
  • Labels stored in normalized table with counts
  • Every entry exported as markdown with YAML frontmatter
  • Batch export via gaia voice --export
  • Command mode allows voice-editing stored content
  • Custom vocabulary table improves domain-specific transcription
  • Web UI serves via FastAPI, browsable at /voice
  • UI supports: list, search, filter by label/category, view rendered markdown
  • REST endpoints for CRUD on entries, labels, categories
  • CLI supports --list, --search, --label, --category, --export, --ui
  • Unit tests pass with mocked ASR, LLM, and database
  • Demo walkthrough documented

Labels

  • agent
  • audio: Audio (ASR/TTS) changes
  • demo: Demo/example showcasing capabilities
  • domain:multimodal: Voice (ASR/TTS), Vision (VLM), Image gen (SD), CUA
  • lemonade 🍋
  • p1: medium priority
  • rag: RAG system changes
  • talk: Talk agent changes
  • track:consumer-app: Hermes-competitor consumer product (mobile-first, voice + messaging + memory + skills)