Description
When a user triggers dictation but changes their mind and stops the recording instead of canceling it, the resulting all-silence audio is sent to the transcription model. Whisper-based models hallucinate on pure silence, producing looping text (e.g. "he was a student" repeated endlessly) that gets typed into the active application.
This is a natural usage pattern. Hitting stop instead of cancel is an easy instinct, and it shouldn't produce phantom text.
Testing
| Scenario | Result |
| --- | --- |
| Pure silence, no speech | Hallucinated looping text |
| Leading silence → speech | Clean |
| Speech → trailing silence | Clean |
| Speech → long mid-speech pause → speech | Clean |
The issue only occurs when the recording contains no speech at all.
Proposed Solution
Add a speech detection check: if no speech is detected in the entire audio buffer, discard it and output nothing. This could be a basic energy/RMS threshold or a lightweight VAD. It doesn't need to be sophisticated since it only needs to distinguish "any speech" from "total silence."
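A minimal sketch of the energy/RMS variant, to make the idea concrete. Everything named here is an assumption for illustration, not existing project code: the function names, the normalized float sample format (values in [-1.0, 1.0]), the 480-sample frame size, and the 0.01 threshold, which would need tuning against real microphone noise. Note that an energy gate detects "any sufficiently loud sound", not speech specifically; a real VAD would be stricter.

```python
import math
from typing import Sequence

# Assumed values for illustration only; tune against real recordings.
SILENCE_RMS_THRESHOLD = 0.01  # RMS level on samples normalized to [-1.0, 1.0]
FRAME_SIZE = 480              # e.g. 30 ms at a 16 kHz sample rate
MIN_LOUD_FRAMES = 3           # ignore isolated clicks (key press, mic pop)


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame; 0.0 for an empty frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def contains_speech(
    samples: Sequence[float],
    frame_size: int = FRAME_SIZE,
    threshold: float = SILENCE_RMS_THRESHOLD,
    min_loud_frames: int = MIN_LOUD_FRAMES,
) -> bool:
    """Return True once enough frames exceed the RMS threshold."""
    loud_frames = 0
    for start in range(0, len(samples), frame_size):
        if frame_rms(samples[start:start + frame_size]) >= threshold:
            loud_frames += 1
            if loud_frames >= min_loud_frames:
                return True
    return False


# Synthetic check: one second of digital silence vs. a 220 Hz tone.
silence = [0.0] * 16000
tone = [0.1 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(16000)]

print(contains_speech(silence))  # False
print(contains_speech(tone))     # True
```

The check would run on the recorded buffer before it is handed to the transcription model; when it returns False, the buffer is dropped and nothing is typed. Scanning per frame rather than over the whole buffer keeps a short utterance inside a long silent recording from being averaged away.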
Current Behavior
All-silence recordings are sent to the model, which hallucinates text.
Expected Behavior
All-silence recordings produce no output.