
feat: voice-typing using external Whisper / ALM API #9264

Open

heimoshuiyu wants to merge 7 commits into anomalyco:dev from heimoshuiyu:voice-typing

Conversation

@heimoshuiyu

@heimoshuiyu heimoshuiyu commented Jan 18, 2026

This pull request implements a voice-typing feature. It uses an external Whisper API or a multimodal large model (such as gpt-4o or qwen3-omni) for voice input, and it uses the last message as a prompt (context) to improve contextual recognition accuracy (this is the key part!).

Related issues and PR:

This feature follows opencode's frontend/backend separation design:

  • TUI: uses tools such as ffmpeg, sox, rec, or arecord to record from the microphone (I have only tested ffmpeg on Linux, as I don't have other devices; see the sketch after this list)
  • APP: uses the browser's native microphone recording interface
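For illustration only, here is a minimal sketch (not the PR's actual code) of how the TUI side might shell out to ffmpeg on Linux to capture Whisper-friendly audio; the flags, the ALSA device, and the file name are assumptions:

  import { spawn } from "node:child_process"

  // Illustrative sketch: record 16 kHz mono WAV from the default ALSA device.
  // "-f alsa" and "default" are Linux-specific assumptions.
  const recorder = spawn("ffmpeg", [
    "-f", "alsa", "-i", "default", // capture source (platform-specific)
    "-ar", "16000", "-ac", "1",    // 16 kHz mono, a typical Whisper input format
    "-y", "voice-input.wav",       // overwrite the temp file if it exists
  ])

  // When the user stops recording, SIGINT lets ffmpeg finalize the WAV header.
  recorder.kill("SIGINT")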

Two types of speech-recognition service can be configured, with whisper as the default (a hypothetical alm configuration is sketched after this list):

  • whisper: compatible with the OpenAI /v1/audio/transcriptions interface
  • alm: uses the speech-input capability of a multimodal large model for transcription
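For reference, a hypothetical alm entry in opencode.json, modeled on the confirmed whisper example shared later in this thread; the field names, endpoint, and model for the alm type are assumptions, not taken from the PR:

  "voice": {
    "type": "alm",
    "alm": {
      "url": "https://api.example.com/v1/chat/completions",
      "apiKey": "<your-key>",
      "model": "gpt-4o-audio-preview"
    }
  }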

It uses the last assistant message in the current session plus the text already in the input box as the prompt for speech transcription. This contextual understanding lets you dictate special terms like code paths and variable names directly!
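Roughly, the request could look like the following sketch against a Whisper-compatible endpoint. This is not the PR's actual implementation: the URL, API key, and model are placeholders, and the 800-character truncation is an arbitrary choice here, since Whisper only honors a limited prompt length.

  // Illustrative sketch: POST audio plus recent context to an OpenAI-compatible
  // /v1/audio/transcriptions endpoint. URL, key, and model are placeholders.
  async function transcribe(audio: Blob, context: string): Promise<string> {
    const form = new FormData()
    form.append("file", audio, "recording.wav")
    form.append("model", "whisper-1")
    form.append("prompt", context.slice(-800)) // last assistant message + input text
    const res = await fetch("http://localhost:8000/v1/audio/transcriptions", {
      method: "POST",
      headers: { Authorization: "Bearer <api-key>" },
      body: form,
    })
    if (!res.ok) throw new Error(`transcription failed: ${res.status}`)
    const data = (await res.json()) as { text: string }
    return data.text
  }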

For testing convenience, I have set up a web frontend with voice input at https://d3ir6x3lfy3u68.cloudfront.net, and replaced the hardcoded https://app.opencode.ai in opencode (see the third commit, "Add web deploy skill and configurable web proxy").

Disclaimer: most of the code was vibe-coded and then roughly checked by me, so this may be a fairly rough implementation (or even just a POC). Suggestions for improvement are welcome, as is taking this idea and implementing it yourself.

External resources:

Here are the demos:

tui.mp4
app.mp4

@github-actions
Contributor

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

  1. Open an issue describing the bug/feature (if one doesn't exist)
  2. Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

@github-actions
Contributor

The following comment was generated by an LLM and may be inaccurate:

Potential Duplicate Found

PR #3827: Add voice-to-text transcription feature

Why it's related: This PR is already referenced in the current PR's description as a related issue. It appears to be a previous attempt or related work on voice-to-text transcription functionality. The current PR (9264) may be an updated implementation or continuation of this feature.

@calebdw

calebdw commented Jan 18, 2026

See also https://github.com/goodroot/hyprwhspr, which I've started using for system-wide STT, not just in opencode.

@heimoshuiyu
Author

@calebdw Great project. However, integrating voice input in OpenCode is necessary because it can capture context to significantly improve transcription accuracy.

@heimoshuiyu
Author

rebased, ready to merge

@telnet2

telnet2 commented Jan 22, 2026

I cannot wait to see it working

@Mikec78660

This works great. I set it up with a local speaches server and it works perfectly, using the model Systran/faster-distil-whisper-large-v3.

@heimoshuiyu
Author

@Mikec78660 Glad you got it working! Could you share what issues you encountered during configuration? We should show a toast message when configuration errors occur.

@Mikec78660

@heimoshuiyu I had https instead of http in the URL for my STT server, like an idiot. I just copied and pasted my server name into the URL field and didn't even notice. But it was weird: the Record "button" was just ghosted out and would not let me press it, as if I had not set up voice at all. I was expecting some sort of error, e.g. letting me hit the button and then reporting that the server was unreachable. But it was completely my mistake that I even had an issue.

Here is my entry in opencode.json

  "voice": {
    "type": "whisper",
    "whisper": {
      "url": "http://speaches.lan:8000/v1/audio/transcriptions",
      "apiKey": "1234",
      "model": "Systran/faster-distil-whisper-large-v3",
      "language": "en"
    }
  }

Mikec78660 pushed commits to Mikec78660/opencode that referenced this pull request Jan 22, 2026:

  • move whisper config into config
  • document whisper voice config
  • remove tui voice enabled
  • Fix voice error handling and whisper context

    When voice transcription completes, addPart now checks whether the current selection is within the prompt editor. If the selection is outside the editor (e.g., the user clicked on an assistant message during recording), it focuses the editor and restores the cursor position from prompt.cursor() before inserting the transcribed text. This prevents transcription results from being inserted into unintended locations such as assistant messages.

    Also fixes the cursor-position logic to prefer the real DOM position when the selection is inside the editor, only falling back to prompt.cursor() when the selection is outside.
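To make the described behavior concrete, here is a hypothetical sketch; the function and variable names are illustrative, and a plain textarea stands in for whatever editor the app actually uses:

  // Hypothetical sketch of the fix described above; `editor` and `savedCursor`
  // are illustrative stand-ins, not the PR's actual identifiers.
  function insertTranscribedText(editor: HTMLTextAreaElement, text: string, savedCursor: number) {
    if (document.activeElement !== editor) {
      // Focus left the editor during recording (e.g. a click on an assistant
      // message): refocus and restore the cursor saved before recording began.
      editor.focus()
      editor.setSelectionRange(savedCursor, savedCursor)
    }
    // Prefer the real DOM cursor position now that the editor is focused.
    const pos = editor.selectionStart
    editor.setRangeText(text, pos, pos, "end") // insert and place the caret after it
  }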

@mohamedbouddi7777-dev mohamedbouddi7777-dev left a comment


Okay, hello. In the name of God, the Most Gracious, the Most Merciful; we put our trust in God. Post it in the right place, I mean here; don't post these things in Gmail.

@heimoshuiyu
Author

@Mikec78660 Thank you for sharing. I just fixed an issue where clicking the gray RECORD button in the TUI was unresponsive.

Mikec78660 pushed a commit to Mikec78660/opencode that referenced this pull request Jan 22, 2026
- Your work (PR anomalyco#9264, agent sidebar) preserved
- Conflicts in prompt-input.tsx, server.ts, icon.tsx resolved in favor of our changes
@Mikec78660

@heimoshuiyu Have you considered adding wake word capability?

@heimoshuiyu
Author

@Mikec78660 That would probably be another PR. Whisper and ALM do not have wake-word functionality; supporting a wake word would likely require a separate, smaller model continuously monitoring the microphone.



Development

Successfully merging this pull request may close these issues.

  • [FEATURE]: Speech-to-Text Voice Input for Lazy People in OpenCode
  • feat: first party support for voice conversing

5 participants