@ammar-agent
Collaborator

Adds voice dictation capability to the chat input using OpenAI's gpt-4o-transcribe model.

Features

  • Voice recording via MediaRecorder API (webm/opus format)
  • OpenAI transcription via backend IPC (API key stays server-side)
  • Recording overlay replaces textarea with animated waveform visualization
  • Multiple shortcuts:
    • Space on empty input → start recording
    • Space during recording → stop and send immediately
    • Ctrl+D / Cmd+D → toggle recording anytime
    • Escape → cancel recording (discard audio)
  • Global keybinds during recording work regardless of focus
  • User education when the OpenAI key is not configured (disabled button with tooltip)
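
The shortcut rules above can be read as one decision function. The following is a minimal sketch, not the PR's code: `KeyInfo`, `VoiceAction`, and `resolveVoiceKey` are hypothetical names, and ignoring every key while transcribing is an assumption the description does not spell out.

```typescript
// Hypothetical sketch of the shortcut rules; none of these names come from the PR.
type VoiceState = "idle" | "recording" | "transcribing";

// The subset of KeyboardEvent fields the rules depend on.
interface KeyInfo {
  key: string; // KeyboardEvent.key, e.g. " " for Space, "Escape", "d"
  ctrlKey: boolean;
  metaKey: boolean;
}

type VoiceAction = "start" | "stop" | "stop-and-send" | "cancel" | null;

function resolveVoiceKey(
  e: KeyInfo,
  state: VoiceState,
  inputIsEmpty: boolean
): VoiceAction {
  // Ctrl+D / Cmd+D toggles recording regardless of input contents.
  if ((e.ctrlKey || e.metaKey) && e.key.toLowerCase() === "d") {
    if (state === "idle") return "start";
    if (state === "recording") return "stop"; // keeps the text in the input
    return null; // assumption: no toggle while transcribing
  }

  if (state === "recording") {
    if (e.key === " ") return "stop-and-send";
    if (e.key === "Escape") return "cancel"; // discard audio
    return null;
  }

  // Space starts recording only when the chat input is empty.
  if (state === "idle" && e.key === " " && inputIsEmpty) return "start";

  return null;
}
```

Keeping the mapping in a pure function like this would let a window-level `keydown` listener and the textarea handler share one set of rules.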

UI States

| State        | Appearance                                  |
|--------------|---------------------------------------------|
| Idle         | Subtle gray mic icon in textarea corner     |
| Recording    | Blue border overlay with animated waveform  |
| Transcribing | Amber border overlay, waiting for API       |
| No API Key   | Disabled mic with explanatory tooltip       |
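
One way to express these states is a lookup table keyed by the state enum, in the spirit of the STATE_CONFIG table mentioned in the commits. This is an illustrative sketch only: the `StateConfig` fields, the class names, the label strings, and the `micView` helper are all assumptions, not the PR's actual values.

```typescript
// Hypothetical sketch: field names, class strings, and labels are placeholders.
type VoiceState = "idle" | "recording" | "transcribing";

interface StateConfig {
  label: string;       // tooltip or overlay text
  borderClass: string; // utility classes for the overlay border (assumed)
  disabled: boolean;   // whether the button accepts clicks
}

const STATE_CONFIG: Record<VoiceState, StateConfig> = {
  idle: { label: "Start voice input", borderClass: "", disabled: false },
  recording: {
    label: "Recording",
    borderClass: "border border-blue-500",
    disabled: false,
  },
  // Disabled while transcribing so a double-click cannot restart recording.
  transcribing: {
    label: "Transcribing",
    borderClass: "border border-amber-500",
    disabled: true,
  },
};

// The "No API Key" row is not a recorder state; it overrides whatever state applies.
function micView(state: VoiceState, isApiKeySet: boolean): StateConfig {
  if (!isApiKeySet) {
    return {
      label: "Configure in Settings → Providers",
      borderClass: "",
      disabled: true,
    };
  }
  return STATE_CONFIG[state];
}
```

Using `Record<VoiceState, StateConfig>` means the compiler rejects the table if a state is ever added to the enum without a matching entry.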

Implementation

  • useVoiceInput hook with clean state enum (idle / recording / transcribing)
  • VoiceInputButton floating component
  • WaveformBars reusable animated component
  • IPC channel voice:transcribe for backend API calls
  • Hidden on mobile (native keyboards have built-in dictation)
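
The idle / recording / transcribing enum can be modelled as a small pure state machine. The sketch below is an assumption-laden illustration rather than the hook itself: `VoiceModel`, `VoiceEvent`, and `step` are hypothetical names, the real hook drives MediaRecorder and the voice:transcribe IPC channel instead of receiving events, and joining dictated text to existing input with a single space is a guess.

```typescript
// Hypothetical model of the voice-input flow; not the hook's real API.
type VoiceState = "idle" | "recording" | "transcribing";

type VoiceEvent =
  | { type: "start" }
  | { type: "stop"; send?: boolean }      // send: true means stop and send
  | { type: "cancel" }                    // discard audio, skip transcription
  | { type: "transcribed"; text: string } // backend returned text
  | { type: "failed" };                   // backend returned an error

interface VoiceModel {
  state: VoiceState;
  shouldSend: boolean; // flag set by stop({ send: true })
  input: string;       // current chat input text
  sent: string[];      // messages sent so far
}

function step(m: VoiceModel, ev: VoiceEvent): VoiceModel {
  switch (ev.type) {
    case "start":
      return m.state === "idle"
        ? { ...m, state: "recording", shouldSend: false }
        : m;
    case "stop":
      return m.state === "recording"
        ? { ...m, state: "transcribing", shouldSend: ev.send ?? false }
        : m;
    case "cancel":
      return m.state === "recording"
        ? { ...m, state: "idle", shouldSend: false }
        : m;
    case "transcribed": {
      if (m.state !== "transcribing") return m;
      // Append dictated text to whatever is already in the input.
      const input = m.input ? `${m.input} ${ev.text}` : ev.text;
      return m.shouldSend
        ? { state: "idle", shouldSend: false, input: "", sent: [...m.sent, input] }
        : { ...m, state: "idle", input };
    }
    case "failed":
      return m.state === "transcribing"
        ? { ...m, state: "idle", shouldSend: false }
        : m;
  }
}
```

Events that do not apply to the current state return the model unchanged, which is what makes a repeated click or keypress while transcribing harmless in this model.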

Files Changed

  • src/browser/hooks/useVoiceInput.ts - Core hook
  • src/browser/components/ChatInput/VoiceInputButton.tsx - Button component
  • src/browser/components/ChatInput/WaveformBars.tsx - Animation component
  • src/browser/components/ChatInput/index.tsx - Integration
  • src/node/services/ipcMain.ts - Backend transcription handler
  • src/common/constants/ipc-constants.ts - IPC channel
  • src/common/types/ipc.ts - Type definitions

Generated with mux

- Add Ctrl+D / Cmd+D keybind to toggle voice recording
- Add mic button next to send button (hidden on mobile or when no OpenAI key)
- Add command palette command for voice input toggle
- Record audio via MediaRecorder, transcribe via Whisper API
- Show recording indicator while capturing, spinner while transcribing
- Append dictated text to existing input
- Handle errors with user-friendly toast messages

Requires OpenAI API key to be configured in Settings > Providers.

- Show overlay during both recording AND transcribing states
  (prevents jarring snap-back to empty textarea when waiting for API)
- Change colors from red (error-like) to blue (recording) and amber (transcribing)
- Disable overlay button while transcribing to prevent double-clicks
- Space key during recording: stops and sends immediately
- Ctrl+D/Cmd+D: stops recording, keeps text in input (existing)
- Reduced border from border-2 to border (less crowded near controls)
- Updated overlay text to show both shortcuts
- Auto-focus the recording overlay button so spacebar works
- Add mb-1 margin to prevent border touching controls below
- Add onSend callback to useVoiceInput options
- Add stopListeningAndSend method that sets a flag before stopping
- When transcription completes, if flag was set, call onSend
- Use setTimeout(0) to let React flush state update before sending
- Simplifies ChatInput code by moving logic into the hook
- Reduce gap between control rows from gap-1 to gap-0.5
- Reduce vertical wrap gap from gap-y-2 to gap-y-1
- Reduce send button padding from px-2 py-1 to px-1.5 py-0.5
- Change ToggleGroup padding from px-2 py-1 to px-1.5 py-0.5
- Keeps mode selector and send button visually consistent
User education:
- Show mic button even without OpenAI key (disabled with tooltip)
- Tooltip explains: 'Configure in Settings → Providers'
- Toast error when trying to use keybind/command without key

DRY improvements:
- Extract WaveformBars component for reusable animated bars
- Remove unused Web Speech API error message mappings

Code quality:
- Add isApiKeySet to hook result for explicit checking
- shouldShowUI now only checks platform support, not API key
- Verified no race conditions in hook logic
- Replace isListening/isTranscribing booleans with single state enum
- Merge stopListening/stopListeningAndSend into stop(options?)
- Rename methods: startListening→start, toggleListening→toggle
- Consolidate callback refs into single callbacksRef object
- Move platform checks (isMobile, isSupported) to module scope
- Simplify VoiceInputButton with STATE_CONFIG lookup table
- Inline simple callbacks in ChatInput (no separate handlers)
- Pressing space on empty chat input starts voice recording
  (convenient alternative to Cmd+D)
- Pressing escape during recording cancels without transcribing
- Add cancel() method to voice hook that sets flag to skip transcribe
- Updated overlay text to show all shortcuts
- Add focus:outline-none to recording overlay button
- Update tooltip to document all shortcuts:
  - Space on empty input to start
  - Cmd+D anytime to toggle
  - Space during recording to send
  - Escape to cancel
- Add window-level keydown listener active only during recording
- Space and Escape work even if overlay button loses focus
- Removed redundant local onKeyDown and auto-focus from button
- Add providersConfig option to setupSimpleChatStory helper
- Add VoiceInputNoApiKey story showing disabled mic with tooltip
- Documents user education for missing OpenAI key
Ensures start() is a no-op on mobile even if somehow called
directly, complementing the UI-layer shouldShowUI guards.
Merge command palette toggle and global recording keybinds into
single useEffect with shared cleanup.
- Rename isMobile to HAS_TOUCH_DICTATION with clear doc comment
- Remove screen size check (iPads have dictation regardless of size)
- Add section headers for visual organization
- Extract releaseStream helper to reduce duplication
- Improve variable names (recorder, chunks, buffer)
- Add early returns to reduce nesting in transcribe()
- Rename refs for clarity (shouldSendRef, wasCancelledRef)
- Better comments explaining the state machine and logic
@ammario ammario linked an issue Dec 2, 2025 that may be closed by this pull request
@ammario ammario merged commit 2820a98 into main Dec 2, 2025
13 checks passed
@ammario ammario deleted the voice-input-mode branch December 2, 2025 06:27
Linked issue: Voice to text input

2 participants