feat: voice-typing using external Whisper / ALM API #9264
heimoshuiyu wants to merge 7 commits into anomalyco:dev
Conversation
|
Thanks for your contribution! This PR doesn't have a linked issue. All PRs must reference an existing issue. Please see CONTRIBUTING.md for details. |
|
The following comment was made by an LLM; it may be inaccurate: Potential Duplicate Found: PR #3827: Add voice-to-text transcription feature. Why it's related: This PR is already referenced in the current PR's description as a related issue. It appears to be a previous attempt or related work on voice-to-text transcription functionality. The current PR (#9264) may be an updated implementation or continuation of this feature. |
|
See also https://github.com/goodroot/hyprwhspr, which I've started using for system-wide STT, not just in opencode |
|
@calebdw Great project. However, integrating voice input in OpenCode is necessary because it can capture context to significantly improve transcription accuracy. |
Force-pushed from 596c1f5 to a3b6a2c
|
rebased, ready to merge |
Force-pushed from a3b6a2c to 7200cbc
|
I cannot wait to see it working |
|
This works great. I set it up with a local speaches server and it works perfectly, using the model Systran/faster-distil-whisper-large-v3. |
|
@Mikec78660 Glad you got it working! Could you share what issues you encountered during configuration? We should show a toast message when configuration errors occur. |
|
@heimoshuiyu I had https instead of http in the url for my STT server, like an idiot. I just copied and pasted my server name into the url field and didn't even notice. But it was weird because the Record "button" was just ghosted out and would not let me press it, as if I had not set up voice at all. I was expecting it would give me some sort of error, like letting me hit the button and then getting an error that the server was unreachable or something. But it was completely my mistake that I even had an issue. Here is my entry in opencode.json |
move whisper config into config
document whisper voice config
remove tui voice enabled
Fix voice error handling and whisper context
When voice transcription completes, addPart now checks if the current selection is within the prompt editor. If the selection is outside the editor (e.g., user clicked on an assistant message during recording), it focuses the editor and restores the cursor position from prompt.cursor() before inserting the transcribed text. This prevents transcription results from being inserted into unintended locations like assistant messages. Also fixes cursor position logic to prefer real DOM position when selection is inside the editor, only falling back to prompt.cursor() when selection is outside.
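A rough sketch of the behavior described above: only the names addPart and prompt.cursor() come from the commit message; promptEditor, restoreCursor, and insertAtCursor are hypothetical stand-ins for opencode's actual editor internals, not the PR's real code.

```ts
// Hypothetical sketch, not the actual implementation.
const promptEditor = document.getElementById("prompt-editor") as HTMLElement

const prompt = {
  // Last cursor offset recorded while the editor was focused (stubbed here).
  cursor: (): number => 0,
}

function restoreCursor(offset: number) {
  // Place the caret back at `offset` inside the editor (simplified).
  const range = document.createRange()
  range.setStart(promptEditor.firstChild ?? promptEditor, offset)
  range.collapse(true)
  const sel = window.getSelection()
  sel?.removeAllRanges()
  sel?.addRange(range)
}

function insertAtCursor(text: string) {
  const sel = window.getSelection()
  if (!sel || sel.rangeCount === 0) return
  sel.getRangeAt(0).insertNode(document.createTextNode(text))
}

function addPart(transcribedText: string) {
  const sel = window.getSelection()
  const insideEditor =
    sel?.anchorNode != null && promptEditor.contains(sel.anchorNode)

  if (!insideEditor) {
    // The user clicked elsewhere (e.g. an assistant message) while recording:
    // refocus the editor and restore the saved cursor position before inserting,
    // so the transcription never lands outside the prompt.
    promptEditor.focus()
    restoreCursor(prompt.cursor())
  }
  // When the selection is already inside the editor, the real DOM position is used.
  insertAtCursor(transcribedText)
}
```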
Force-pushed from 7200cbc to e271bb1
mohamedbouddi7777-dev left a comment
Okay, hello. In the name of God, the Most Gracious, the Most Merciful; we place our trust in God. Post it in the right place, meaning here; I mean, don't post these things in Gmail.
|
@Mikec78660 Thank you for sharing. I just fixed an issue where the grayed-out RECORD button in the TUI was unresponsive when clicked. |
- Your work (PR anomalyco#9264, agent sidebar) preserved
- Conflicts in prompt-input.tsx, server.ts, icon.tsx resolved in favor of our changes
|
@heimoshuiyu Have you considered adding wake word capability? |
|
@Mikec78660 That would probably be a separate PR. Whisper and ALM do not have wake-word functionality; to support a wake word, it may be necessary to use other, smaller models to continuously monitor the microphone. |
Force-pushed from 00637c0 to 71e0ba2
Force-pushed from f1ae801 to 08fa7f7
This pull request implements a voice-typing feature. It uses an external Whisper API or a multimodal large model (such as gpt-4o or qwen3-omni) for voice input, and it uses the last message as a prompt (context) to improve contextual recognition accuracy (key!).
Related issues and PR:
This feature follows the frontend-backend separation design of opencode.
Two types of speech recognition services can be configured, with whisper as the default: a Whisper-compatible transcription API, or a multimodal large model (ALM).
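As a rough illustration (not the actual opencode.json schema; the field names here are assumptions), the two variants might be modeled like this:

```ts
// Hypothetical shape of the two configuration variants; field names are
// illustrative assumptions, not the actual opencode config schema.
type VoiceConfig =
  | {
      provider: "whisper"  // default: Whisper-compatible transcription endpoint
      url: string          // e.g. "http://localhost:8000/v1/audio/transcriptions"
      model: string        // e.g. "Systran/faster-distil-whisper-large-v3"
      apiKey?: string
    }
  | {
      provider: "alm"      // multimodal large model that accepts audio input
      url: string          // OpenAI-compatible endpoint
      model: string        // e.g. "gpt-4o" or "Qwen/Qwen3-Omni-30B-A3B-Instruct"
      apiKey?: string
    }
```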
It uses the last assistant message in the current session plus the text in the input box as the prompt for speech transcription. This contextual understanding lets you dictate special terms such as code paths and variable names directly!
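A minimal sketch of what such a prompt-assisted call could look like, assuming an OpenAI/Whisper-compatible /v1/audio/transcriptions endpoint; the transcribe helper and the way the context string is assembled are illustrative assumptions, not the PR's exact code:

```ts
// Sketch of prompt-assisted transcription against a Whisper-compatible API.
async function transcribe(
  audio: Blob,
  lastAssistantMessage: string,
  currentInput: string,
  cfg: { url: string; model: string; apiKey?: string },
): Promise<string> {
  const form = new FormData()
  form.append("file", audio, "recording.wav")
  form.append("model", cfg.model)
  // Whisper's `prompt` field biases recognition toward the given vocabulary,
  // which is what helps code paths and variable names transcribe correctly.
  form.append("prompt", `${lastAssistantMessage}\n${currentInput}`)

  const res = await fetch(cfg.url, {
    method: "POST",
    headers: cfg.apiKey ? { Authorization: `Bearer ${cfg.apiKey}` } : {},
    body: form,
  })
  if (!res.ok) throw new Error(`transcription failed: ${res.status}`)
  const data = (await res.json()) as { text: string }
  return data.text
}
```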
For testing convenience, I have set up a web frontend with voice input at https://d3ir6x3lfy3u68.cloudfront.net, and replaced the hardcoded https://app.opencode.ai in opencode (in the third commit, "Add web deploy skill and configurable web proxy").
Disclaimer: most of the code was vibe-coded and then roughly checked by me, so this might be a relatively rough implementation (or even just a POC). Suggestions for improvement are welcome, or feel free to take this idea and implement it yourself.
External resources:
Qwen/Qwen3-Omni-30B-A3B-Instruct model on https://cloud.siliconflow.cn / https://cloud.siliconflow.com
Here is the demo:
tui.mp4
app.mp4