Implement Text-to-Speech (TTS) UI and Enhanced Audio Playback

@damianvtran damianvtran released this 09 Jun 03:57
· 194 commits to main since this release
9559fa8

What's Changed

This release introduces comprehensive Text-to-Speech (TTS) capabilities to the Local Operator UI, allowing users to generate speech from agent messages and selected text. It includes new UI components for audio playback, integrates with the backend's new speech API, and enhances user interaction with new keyboard shortcuts and controls.

  • What does this change address? This is a new feature that enables audio output for agent messages and improves spoken-language interaction within the UI.
  • What are the key improvements or modifications?
    • TTS Playback Controls: Added play/pause, stop, and replay functionality for agent messages, with caching of audio blobs.
    • Text Selection to Speech: Implemented a new control that appears on text selection, allowing users to generate and play speech for selected text.
    • New AudioAttachment Component: A dedicated React component for playing audio files, including progress, volume, and playback rate controls.
    • Keyboard Shortcuts: Introduced Cmd/Ctrl + Shift + S to start speech-to-text recording, and holding the spacebar to record.
    • Content Security Policy Update: Modified index.html to allow blob: sources in the media-src directive of the Content Security Policy, so that generated audio blobs can be played.
    • Platform Detection: Added platform detection to display correct keyboard shortcuts (Cmd for macOS, Ctrl for others).
    • Integration with Backend API: Connected the UI to the new /v1/tools/speech and /v1/agents/{agent_id}/speech endpoints.
    • Zustand Speech Store: Created a new Zustand store (useSpeechStore) to manage speech playback state, audio caching, and loading.
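
As an illustration of the audio caching mentioned above, here is a minimal sketch of a cache keyed by message ID that generates audio at most once per message, so that replay reuses the earlier result. The class name `SpeechAudioCache`, the `getOrCreate` method, and the idea of caching the in-flight promise are assumptions made for this example; the actual `useSpeechStore` implementation may differ.

```typescript
// Hypothetical sketch: not the project's useSpeechStore code.
// Caches one in-flight or completed generation per key (e.g. a message ID).
class SpeechAudioCache<V> {
  private readonly entries = new Map<string, Promise<V>>();

  // Returns the cached promise for `key`, or calls `generate` once and caches it.
  getOrCreate(key: string, generate: () => Promise<V>): Promise<V> {
    const cached = this.entries.get(key);
    if (cached !== undefined) {
      return cached;
    }
    const created = generate();
    this.entries.set(key, created);
    // Drop failed generations so a later call can retry.
    created.catch(() => {
      if (this.entries.get(key) === created) {
        this.entries.delete(key);
      }
    });
    return created;
  }

  // Forgets the cached audio for `key`, if any.
  clear(key: string): void {
    this.entries.delete(key);
  }
}
```

In a UI, `V` would typically be a `Blob` and `generate` would call the speech endpoint; caching the promise rather than the resolved value also prevents duplicate requests when play is clicked twice before the first response arrives.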
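
The platform-dependent shortcut label described above could be derived along these lines. This is only a sketch: the function names and the rule of matching "mac" in a platform string are assumptions, not the project's actual detection logic.

```typescript
// Hypothetical helpers: names and matching rule are assumptions for illustration.
type Platform = "mac" | "other";

// Classifies a platform string such as "MacIntel" or "Win32".
function detectPlatform(platformString: string): Platform {
  return /mac/i.test(platformString) ? "mac" : "other";
}

// Builds the label shown for the recording shortcut (Cmd on macOS, Ctrl elsewhere).
function recordShortcutLabel(platformString: string): string {
  const modifier = detectPlatform(platformString) === "mac" ? "Cmd" : "Ctrl";
  return `${modifier} + Shift + S`;
}
```

In a browser, the input could come from `navigator.platform` or `navigator.userAgent`; keeping the function pure makes it straightforward to test without a DOM.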

Impact

  • Does this change introduce any breaking changes? No breaking changes; it's additive and backward-compatible.
  • Are there any dependency updates? No new external dependencies.
  • Are there any performance or security implications? Performance depends on the backend API and network conditions; security considerations include proper handling of audio data and API key authentication for speech generation.

PRs

  • feat: Implement Text-to-Speech (TTS) UI and Enhanced Audio Playback by @damianvtran in #67

Full Changelog: v0.11.0...v0.11.1