feat: Gemini Live real-time voice, web image search, and follow-up fixes #35

Merged
frapeti merged 31 commits into master from feat/gemini-live-voice on Apr 1, 2026

Conversation

@frapeti (Collaborator) commented Mar 30, 2026

Summary

This branch adds Gemini Live real-time voice: WebSocket session, PCM capture and playback, LiveAgentLoop, Riverpod wiring, and chat UI (live voice button and overlay). It also includes follow-up fixes and related features merged on this branch (web image search, terminal/sandbox reliability, session and usage fixes, localization, and CI).

Base: master
Merge strategy: ordinary merge or squash per repo preference.

What changed

Gemini Live (core)

  • Models: supportsLive flag and Gemini Live entries in the model catalog.
  • Services: Gemini Live WebSocket client, PCM streaming, playback queue (in-memory WAV source), voice selection (settings UI, set_live_voice tool, config), connect/end SFX, high-sensitivity server-side speech detection where applicable.
  • Agent: LiveAgentLoop and hooks in AgentLoop; system prompt and transcript alignment with chat; buffered transcript persistence on stop/disconnect; instruction to speak before request_user_action.
  • Providers: Riverpod providers for live voice state and lifecycle.
  • UI: Chat input bar, live voice overlay (glass chrome, haptics), subtle border on the message stack during live voice, session/tool wiring.

Related features and fixes

  • Web: web_image_search tool (headless browser + Brave API fallback), markdown image extraction in fetch; agent guidance to use web_image_search for photo requests.
  • OpenAI provider: Request usage metadata in streamed chat completions.
  • Agent/session: Merge usage stats (sum cache read/write tokens); session_status accepts injected __session_key; invalidate model capability providers after config edits.
  • Terminal / sandbox: Android sandbox drains process output before closing streams; streamed shell output trimmed only on final CLEAR payload; terminal replaces buffer from CLEAR JSON and shows streaming wait state.
  • Localization & tooling: Live voice overlay strings, locale sweep, credentials/security/auth-related strings, CI checks.
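
The usage-merge fix under "Agent/session" above can be illustrated with a minimal sketch. This is not the branch's Dart code: the `Usage` class and its field names are hypothetical, and the only behavior taken from the description is that cache read/write token counts are summed when two usage records are merged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Hypothetical usage record; field names are illustrative only."""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0


def merge_usage(a: Usage, b: Usage) -> Usage:
    """Sum every counter, including the cache read/write tokens."""
    return Usage(
        input_tokens=a.input_tokens + b.input_tokens,
        output_tokens=a.output_tokens + b.output_tokens,
        cache_read_tokens=a.cache_read_tokens + b.cache_read_tokens,
        cache_write_tokens=a.cache_write_tokens + b.cache_write_tokens,
    )
```

Summing rather than overwriting keeps per-session totals correct when a turn produces several partial usage reports.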

How to test

  • flutter pub get && flutter analyze && flutter test
  • Manual: start a chat, enable live voice, verify mic/playback, disconnect/reconnect, and that transcripts persist as expected.
  • Optional: exercise web_image_search and terminal/sandbox flows if your review scope includes those paths.

Notes

  • Project-wide analyzer infos/warnings may still appear; this branch should not introduce new analyzer errors.
  • Large diff (~10k additions) is expected given new service layer, UI, and tools.

frapeti added 7 commits March 29, 2026 23:13
- Serialize WAV queue, dedicated AudioPlayer with load config, larger preroll/segments
- Suppress mic during assistant PCM and local playback; failsafe unmute; playback-end detection (idle/completed only counts as the end after a playing state has been seen)
- Exclude agent CRUD tools from Live API setup to avoid session instability
- reloadHistory after stopSession from ChatScreen/LiveVoiceOverlay (fix Riverpod cycle)
- ChatNotifier: stream live transcripts to chat list; session messageStream dedupe during calls
- l10n: liveVoiceBargeInHint; regenerate localizations
- Plus existing branch changes (gateway, catalog, onboarding, settings, etc.)

Made-with: Cursor
- Use buildSystemPromptForAgent for all Live sessions; voiceBootstrap only
  triggers bootstrap sendText
- Serialize full getContextMessages with tail-priority character budget
- Pass userLanguage from chat when starting a call
- Add live_session_transcript helper and unit tests

Made-with: Cursor
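
For reviewers, one way to read "tail-priority character budget" is sketched below, assuming it means the newest messages are kept first and older ones are dropped once the budget is spent. The function name, the `(role, text)` message shape, and the `role: text` line format are all hypothetical; the real helper lives in the Dart codebase and may differ.

```python
def serialize_with_tail_priority(
    messages: list[tuple[str, str]], budget: int
) -> str:
    """Serialize (role, text) pairs, keeping the most recent ones that fit.

    Walks the history from newest to oldest and stops at the first
    message that would exceed the remaining character budget.
    """
    kept: list[str] = []
    remaining = budget
    for role, text in reversed(messages):
        line = f"{role}: {text}"
        cost = len(line) + 1  # +1 accounts for the newline separator
        if cost > remaining:
            break
        kept.append(line)
        remaining -= cost
    # Restore chronological order for the final transcript.
    return "\n".join(reversed(kept))
```

Stopping at the first message that does not fit (instead of skipping it and continuing) keeps the retained context contiguous, which is one possible design choice, not necessarily the one made here.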
@frapeti force-pushed the feat/gemini-live-voice branch from 38edf23 to 672ed00 on March 30, 2026 02:13
frapeti and others added 22 commits March 30, 2026 01:30
- app_providers: call sessionManager.getOrCreate() before starting
  LiveAgentLoop so addMessage() no longer silently drops all transcripts
  and tool calls (was the root cause of session loss and missing tool pills)

- live_voice_overlay: add _networkTurnComplete + _playlistLoadedIntoPlayer
  flags to prevent premature pipeline reset when the player exhausts its
  preroll segments before the next segment arrives (fixes first-turn cut)

- live_voice_overlay: add _resumeFromGap() — seeks to the newly queued
  segment and calls play() without re-calling setAudioSource (which would
  restart from the beginning), eliminating mid-turn audio gaps

- live_voice_overlay: add onError handler to agentEvents.listen() so
  stream errors don't silently kill the overlay

- live_voice_overlay: disable automaticallyWaitsToMinimizeStalling on iOS,
  reduce Android bufferForPlaybackDuration to 250ms for near-instant start

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In voice sessions the model was silently calling web_browse/
request_user_action (CAPTCHA, login prompts) without first telling
the user what to do. Added an explicit rule to voiceNote so the
model always announces the required action aloud before opening
the browser.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eliminates WAV file I/O and ConcatenatingAudioSource segment boundaries
in favour of a single continuous StreamAudioSource (_LivePcmStreamSource)
that streams 24kHz PCM directly to just_audio.

Key changes:
- Remove dart:io / path_provider imports and all temp-file logic
- Drop _pcmBuffer, _livePlaylist, _kFlushBytes, _kStartThreshold,
  _flushSegment, _writeAndQueueWavInternal, _resumeFromGap, _buildWav
- Add _LivePcmStreamSource (StreamAudioSource) with streaming WAV header
  (0xFFFFFFFF sizes for unknown-length) and iOS range-request support
- _feedPcm buffers PCM in-memory and triggers _startStreamPlayback once
  _kPrerollBytes (1s) are available; _liveAudioGeneration prevents stale
  callbacks across back-to-back turns
- LiveTurnComplete seals the stream so the player reaches
  ProcessingState.completed naturally: no seek, no gap logic, no debounce races
- _armPlaybackEndListener simplified: no mid-turn gap/_networkTurnComplete
  needed; debounce 180ms on completion before re-enabling mic

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
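
The streaming WAV header described in the commit above can be sketched as follows. This is an illustration, not the Dart `_LivePcmStreamSource` code: the 24 kHz rate and the 0xFFFFFFFF size fields come from the commit message, while 16-bit mono samples and the function names are assumptions.

```python
import struct

SAMPLE_RATE = 24_000       # from the commit message
CHANNELS = 1               # assumption: mono
BITS_PER_SAMPLE = 16       # assumption: 16-bit PCM
UNKNOWN_SIZE = 0xFFFFFFFF  # placeholder size for an unknown-length stream


def streaming_wav_header(
    sample_rate: int = SAMPLE_RATE,
    channels: int = CHANNELS,
    bits: int = BITS_PER_SAMPLE,
) -> bytes:
    """Build a 44-byte RIFF/WAVE header whose size fields are 0xFFFFFFFF."""
    block_align = channels * bits // 8
    byte_rate = sample_rate * block_align
    fmt_chunk = struct.pack(
        "<IHHIIHH",
        16,           # fmt chunk size for plain PCM
        1,            # audio format 1 = PCM
        channels,
        sample_rate,
        byte_rate,
        block_align,
        bits,
    )
    return (
        b"RIFF" + struct.pack("<I", UNKNOWN_SIZE) + b"WAVE"
        + b"fmt " + fmt_chunk
        + b"data" + struct.pack("<I", UNKNOWN_SIZE)
    )


def preroll_bytes(
    seconds: float = 1.0,
    sample_rate: int = SAMPLE_RATE,
    channels: int = CHANNELS,
    bits: int = BITS_PER_SAMPLE,
) -> int:
    """Bytes of PCM needed to cover `seconds` of audio."""
    return int(seconds * sample_rate) * channels * bits // 8
```

Under these assumptions a 1 s preroll is 48,000 bytes; the actual value of `_kPrerollBytes` is not stated in this PR.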
- Add AgentsDefaults.liveVoiceName (default 'Puck', persisted as
  live_voice_name in JSON) with full fromJson/toJson/copyWith support
- Pass liveVoiceName to LiveSessionConfig in startSession() so the
  chosen voice is used on every call
- Add voice dropdown to Settings → Providers & Models → Voice call
  section (30 Gemini Live voices with personality labels)
- Add SetLiveVoiceTool (set_live_voice) so the model can change the
  voice on the user's behalf mid-conversation; takes effect next call
- Register SetLiveVoiceTool in toolRegistryProvider
- Add liveVoiceNameLabel l10n key (en + all 18 generated locales)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add missing/updated translations across locales, regenerate generated localizations, and improve the l10n audit script to catch untranslated strings.

Made-with: Cursor
Replace temp-file based live audio segment handling with an in-memory StreamAudioSource pipeline, improving playback continuity and mic suppression recovery during live turns.

Made-with: Cursor
Register WebImageSearchTool; default session_status to active session key;
invalidate live capability when credentials reload; prefer CLEAR chunk text for tool pills.

Made-with: Cursor
Add CallSfxService hooks, frosted header, accent bar, longer preroll segments,
pending-flush handling, and refined speaking/tool states.

Made-with: Cursor
Set PYTHONUNBUFFERED for sandbox shells; wait for reader futures before
closing pipes, with a forced close path on timeout.

Made-with: Cursor
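
The drain-before-close pattern from the sandbox commit above, as a rough sketch. The real fix is in the Android sandbox code; this asyncio version only illustrates the ordering (wait for the process, then wait for the reader with a timeout, then give up on the reader). The function names and the 5 s default are hypothetical.

```python
import asyncio
import os
import sys


async def run_and_drain(
    cmd: list[str], drain_timeout: float = 5.0
) -> tuple[int, str, bool]:
    """Run cmd and return (exit_code, output, forced).

    `forced` is True when the reader did not reach EOF within
    drain_timeout after the process exited and had to be abandoned.
    """
    # Ask Python children to flush output immediately, as the commit does.
    env = dict(os.environ, PYTHONUNBUFFERED="1")
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
        env=env,
    )
    chunks: list[bytes] = []

    async def pump() -> None:
        # Read until EOF so no trailing output is lost.
        while True:
            chunk = await proc.stdout.read(4096)
            if not chunk:
                break
            chunks.append(chunk)

    reader = asyncio.create_task(pump())
    code = await proc.wait()
    forced = False
    try:
        # Let the reader finish before anything is torn down.
        await asyncio.wait_for(reader, drain_timeout)
    except asyncio.TimeoutError:
        forced = True  # wait_for has already cancelled the reader task
    return code, b"".join(chunks).decode(errors="replace"), forced


def run_python_snippet(code: str) -> tuple[int, str, bool]:
    """Convenience wrapper: run a Python one-liner through run_and_drain."""
    return asyncio.run(run_and_drain([sys.executable, "-c", code]))
```

The timeout path matters when a grandchild process keeps the pipe open after the shell itself has exited.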
frapeti added 2 commits March 31, 2026 22:04
…ep, CI checks

- Add ARB keys for live voice HUD (status, badge, tooltip, fallback title) across locales.
- Localize credentials, security settings, Bedrock auth segments, QR scan, OAuth, browser overlay, and app init error.
- Add scripts/check_l10n_keys.py for ARB parity and report_l10n_untranslated.py for review.
- Add GitHub workflow to run key parity on l10n changes.
- Regenerate app_localizations.

Made-with: Cursor
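
For context, an ARB key parity check can be sketched in a few lines. This is not the contents of scripts/check_l10n_keys.py; the function names and report shape are hypothetical. It relies only on the ARB convention that keys starting with `@` are metadata rather than messages.

```python
import json
from pathlib import Path


def load_arb(path: Path) -> dict:
    """Read one .arb file (ARB files are plain JSON)."""
    return json.loads(path.read_text(encoding="utf-8"))


def message_keys(arb: dict) -> set[str]:
    """Message keys only; '@key' and '@@locale' entries are metadata."""
    return {key for key in arb if not key.startswith("@")}


def parity_report(
    template: dict, locales: dict[str, dict]
) -> dict[str, dict[str, list[str]]]:
    """Map each out-of-sync locale to its missing and extra keys."""
    base = message_keys(template)
    report: dict[str, dict[str, list[str]]] = {}
    for name, arb in locales.items():
        keys = message_keys(arb)
        missing = sorted(base - keys)
        extra = sorted(keys - base)
        if missing or extra:
            report[name] = {"missing": missing, "extra": extra}
    return report
```

A CI job could fail when the report is non-empty, which matches the intent of the key parity workflow described above.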
@frapeti changed the title from "feat: Gemini Live voice (real-time)" to "feat: Gemini Live real-time voice, web image search, and follow-up fixes" on Apr 1, 2026
@frapeti merged commit c7bc4ee into master on Apr 1, 2026
3 checks passed
@frapeti deleted the feat/gemini-live-voice branch on April 1, 2026 01:57