Feat/realtime voice vad p2 #19
Root cause: after VAD silence, stopRecording() disconnected the ScriptProcessorNode entirely, so onaudioprocess stopped firing during AI speech. Barge-in detection never triggered because no audio was being captured to analyze.

Fix: separate "stop sending audio" from "stop hardware":
- Add monitoringOnly flag: mic stays open but frames are not sent
- pauseRecording(): sets monitoringOnly=true (called by the VAD silence callback instead of a full stopRecording)
- resumeRecording(): sets monitoringOnly=false + resets VAD state (called after TTS ends or barge-in detected)
- onaudioprocess: VAD + barge-in detection runs in BOTH recording and monitoring modes; audio is only sent when !monitoringOnly

Now the mic is always on during a call.

State transitions:
  RECORDING → VAD silence → MONITORING (mic on, not sending)
  MONITORING + AI speaking + user speaks → barge-in → RECORDING
  MONITORING + TTS ends → RECORDING
  stopAudioCall() → hardware stop
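The state transitions above can be sketched as a small gate that sees every captured frame but only forwards frames while recording. This is a minimal illustration, not the PR's code: `MicGate`, `onFrame`, `sent`, and the `userIsSpeaking` flag are hypothetical names, and the real implementation uses a `monitoringOnly` boolean inside `onaudioprocess` rather than a mode enum.

```typescript
// Illustrative only: MicGate and its members are hypothetical names.
type MicMode = 'stopped' | 'recording' | 'monitoring'

class MicGate {
  private mode: MicMode = 'stopped'
  private aiSpeaking = false
  // Frames that would have been sent to the server.
  readonly sent: Float32Array[] = []

  startAudioCall(): void { this.mode = 'recording' }
  stopAudioCall(): void { this.mode = 'stopped'; this.aiSpeaking = false }
  pauseRecording(): void { if (this.mode === 'recording') this.mode = 'monitoring' }
  resumeRecording(): void { if (this.mode === 'monitoring') this.mode = 'recording' }
  setAISpeaking(speaking: boolean): void { this.aiSpeaking = speaking }
  getMode(): MicMode { return this.mode }

  // Called for every captured frame, in both recording and monitoring modes.
  onFrame(frame: Float32Array, userIsSpeaking: boolean): void {
    if (this.mode === 'stopped') return
    // Barge-in: the user talks over the AI while we are only monitoring.
    if (this.mode === 'monitoring' && this.aiSpeaking && userIsSpeaking) {
      this.aiSpeaking = false
      this.resumeRecording()
    }
    // Audio is forwarded only in recording mode.
    if (this.mode === 'recording') this.sent.push(frame)
  }
}
```

Keeping frame analysis ahead of the send check is what lets barge-in fire while nothing is being transmitted.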
Three fixes for a smoother barge-in experience:
1. clearQueue() now immediately stops the currently-playing AudioBufferSourceNode via source.stop(), instead of just emptying the queue while the current chunk plays to completion. This eliminates the 200-500ms audio tail after barge-in.
2. Barge-in callback explicitly calls setAISpeaking(false) so the VAD state resets immediately, rather than waiting for the playback onended callback chain.
3. Added barge_in_ack message handler so the frontend logs server confirmation of successful barge-in (was silently dropped in the default case before).
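A rough sketch of the first fix, assuming the player keeps a reference to whatever is currently playing. `StoppableSource` is a hypothetical stand-in for `AudioBufferSourceNode` (only its `stop()` method is used), and `PlaybackQueue`, `setCurrentSource`, and `pendingCount` are illustrative names.

```typescript
// Hypothetical sketch: StoppableSource stands in for AudioBufferSourceNode.
interface StoppableSource { stop(): void }

class PlaybackQueue {
  private queue: ArrayBuffer[] = []
  private currentSource: StoppableSource | null = null

  enqueue(chunk: ArrayBuffer): void { this.queue.push(chunk) }
  setCurrentSource(source: StoppableSource | null): void { this.currentSource = source }
  pendingCount(): number { return this.queue.length }

  // Drop pending chunks AND cut off the chunk that is playing right now.
  clearQueue(): void {
    this.queue.length = 0
    const source = this.currentSource
    this.currentSource = null
    if (source !== null) {
      try {
        source.stop()
      } catch {
        // stop() can throw if the node was never started; nothing left to stop.
      }
    }
  }
}
```

Clearing the reference before calling `stop()` makes a second `clearQueue()` a no-op.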
After barge-in, resumeRecording() was called 3 times:
1. By the barge-in callback (correct)
2. By the playback state listener (300ms later, duplicate)
3. By handleProcessComplete (old pipeline's complete message)

Each call reset hasSpeechStarted=false, causing the user's ongoing speech to be forgotten and VAD to immediately re-trigger silence.

Fix: add a monitoringOnly guard to resumeRecording() so it only runs when actually in monitoring mode. The second and third calls are no-ops since monitoringOnly is already false after the first resume.
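The guard can be illustrated with a stripped-down state holder. `ResumeGuard` and the boolean return value are assumptions made for this sketch; only `monitoringOnly`, `hasSpeechStarted`, `pauseRecording()`, and `resumeRecording()` come from the commit message.

```typescript
// Illustrative sketch; ResumeGuard and the boolean return value are hypothetical.
class ResumeGuard {
  monitoringOnly = false
  hasSpeechStarted = false

  pauseRecording(): void {
    this.monitoringOnly = true
  }

  // Returns true only when a real monitoring -> recording transition happened.
  resumeRecording(): boolean {
    if (!this.monitoringOnly) return false // duplicate call: leave VAD state alone
    this.monitoringOnly = false
    this.hasSpeechStarted = false // reset VAD state exactly once per resume
    return true
  }
}
```

With the guard in place, the later calls from the playback listener and handleProcessComplete no longer wipe out speech that the VAD has already detected.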
…e timeout

audio_start was sent before getUserMedia and ScriptProcessorNode setup, causing the server pipeline to wait for audio that wouldn't arrive for ~500ms. With vad_eos=1000ms, the server-side VAD timed out and the pipeline completed before any audio was sent.

Fix: move wsClient.startAudioRecording() to AFTER audioManager.startRecording() returns successfully, ensuring the mic is capturing and the ScriptProcessorNode is connected before the server starts its STT pipeline.
…ce triggers

Root cause: ambient noise / mic self-noise exceeded SPEECH_THRESHOLD (0.015), causing hasSpeechStarted=true before the user spoke. Then silence was detected immediately → audio_end sent with no real speech.

Fixes:
- SPEECH_THRESHOLD: 0.015 → 0.02 (less sensitive to ambient noise)
- MIN_SPEECH_FRAMES: 2 → 5 (~640ms of genuine speech required)
- New VAD_GRACE_FRAMES=8: after recording starts or resumes, skip the first ~1s of VAD detection so the user has time to begin speaking
- Grace period also applied in resumeRecording() after TTS ends
…auses

The 768ms silence window was cutting users off mid-sentence during normal word/phrase pauses (200-500ms). Increased SILENCE_FRAMES_REQUIRED from 6 to 10 frames (~1.3s), which tolerates natural conversational pauses while still auto-submitting reasonably quickly after the user finishes speaking.
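The tuned constants can be exercised with a small frame-based detector. This is a sketch under stated assumptions, not the PR's detector: the constant values and the 128ms frame duration come from the commit messages and diff comments, while `SimpleVad`, `rms`, `FRAME_MS`, the cumulative (rather than consecutive) speech-frame count, and the handling of levels between the two thresholds are guesses made for illustration.

```typescript
// Constants as stated in the PR; FRAME_MS is taken from the "10 × 128ms" comment.
const SPEECH_THRESHOLD = 0.02
const SILENCE_THRESHOLD = 0.01
const MIN_SPEECH_FRAMES = 5        // 5 × 128ms = 640ms
const SILENCE_FRAMES_REQUIRED = 10 // 10 × 128ms = 1280ms (was 6 × 128ms = 768ms)
const VAD_GRACE_FRAMES = 8         // 8 × 128ms = 1024ms
const FRAME_MS = 128

// Root-mean-square level of one frame of samples.
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0
  const sumSquares = frame.reduce((acc, s) => acc + s * s, 0)
  return Math.sqrt(sumSquares / frame.length)
}

// Hypothetical detector: the counting rules are assumptions, not the PR's logic.
class SimpleVad {
  private graceRemaining = VAD_GRACE_FRAMES
  private speechFrames = 0
  private silenceFrames = 0
  private hasSpeechStarted = false

  // Returns true on the frame where end-of-speech silence is detected.
  process(frame: Float32Array): boolean {
    if (this.graceRemaining > 0) {
      this.graceRemaining-- // grace window: ignore everything right after start
      return false
    }
    const level = rms(frame)
    if (level > SPEECH_THRESHOLD) {
      this.speechFrames++
      this.silenceFrames = 0 // a pause shorter than the window is forgiven
      if (this.speechFrames >= MIN_SPEECH_FRAMES) this.hasSpeechStarted = true
    } else if (level < SILENCE_THRESHOLD && this.hasSpeechStarted) {
      this.silenceFrames++
      if (this.silenceFrames >= SILENCE_FRAMES_REQUIRED) {
        this.speechFrames = 0
        this.silenceFrames = 0
        this.hasSpeechStarted = false
        return true
      }
    }
    return false
  }
}
```

Under these assumptions, a 200-500ms pause spans at most four frames, well short of the ten silent frames needed to end the turn.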
Pull request overview
This PR iterates on realtime voice/VAD behavior to improve noise robustness and support “monitoring-only” mode for barge-in (interrupting TTS) without fully stopping the microphone pipeline.
Changes:
- Tuned VAD thresholds/frames and added a VAD grace window after recording starts.
- Introduced “monitoring-only” mode with pauseRecording()/resumeRecording() to stop sending audio while keeping the mic open for barge-in detection.
- Improved immediate TTS stop by tracking and stopping the current AudioBufferSourceNode in clearQueue().
| // Continuous mode: resume recording after TTS finishes (the mic stays on, so only a resume is needed) | ||
| if (!isPlaying && this.isAudioCallActive && this.isContinuousModeActive) { | ||
| setTimeout(() => { | ||
| if (this.isAudioCallActive && this.voiceState === 'idle') { | ||
| this.startRecording().catch(err => console.error('❌ 自动重启录音失败:', err)) | ||
| if (this.isAudioCallActive) { | ||
| this.audioManager.resumeRecording() // switch from monitoring mode back to sending mode | ||
| } |
In continuous mode you now call resumeRecording() after TTS ends, but the VAD silence handler sends audio_end to the server. Resuming local sending without also sending a new audio_start will likely leave the server with no active audio session for subsequent turns. Consider sending wsClient.startAudioRecording() immediately before/when resuming from monitoring mode (or making resumeRecording() coordinate this via a callback).
| setTimeout(() => { | ||
| if (this.isAudioCallActive && this.voiceState === 'idle') { | ||
| this.startRecording().catch(err => console.error('❌ complete后自动重启失败:', err)) | ||
| if (this.isAudioCallActive) { |
Same issue as the playback handler: handleProcessComplete() resumes local sending via audioManager.resumeRecording(), but after an audio_end the server likely requires a new audio_start before accepting binary audio again. Send startAudioRecording() before resuming (or centralize this logic in one place).
| if (this.isAudioCallActive) { | |
| if (this.isAudioCallActive) { | |
| this.startAudioRecording() |
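One way to centralize what the two review comments above ask for is a single helper that opens a new server-side audio session and then resumes local sending. This is a sketch only: `resumeTurn`, `TurnSignaller`, `MonitoringMic`, and `isMonitoring()` are hypothetical, and whether the server really needs a fresh audio_start after audio_end is the reviewer's assumption, not something confirmed here.

```typescript
// Hypothetical interfaces; only startAudioRecording() and resumeRecording()
// are names that appear in the PR discussion.
interface TurnSignaller { startAudioRecording(): void } // sends audio_start
interface MonitoringMic {
  isMonitoring(): boolean
  resumeRecording(): void
}

// Single entry point for "start the next user turn".
function resumeTurn(wsClient: TurnSignaller, mic: MonitoringMic): boolean {
  if (!mic.isMonitoring()) return false // already sending: avoid a duplicate audio_start
  wsClient.startAudioRecording()        // open the server session first...
  mic.resumeRecording()                 // ...then let frames flow again
  return true
}
```

Calling this from both the playback listener and handleProcessComplete would keep audio_start and the local resume paired, with later calls falling through as no-ops.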
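| const SPEECH_THRESHOLD = 0.02 // RMS above this → treated as speech (raised to reduce false triggers from noise) | ||
| const SILENCE_THRESHOLD = 0.01 // RMS below this → treated as silence | ||
| const MIN_SPEECH_FRAMES = 5 // at least 5 frames of real speech before VAD may trigger (~640ms, guards against ambient-noise false triggers) | ||
| const SILENCE_FRAMES_REQUIRED = 10 // auto-stop after 10 × 128ms ≈ 1.3s of silence (tolerates pauses between sentences) |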
The comment on SILENCE_FRAMES_REQUIRED says “静音后自动停止” ("auto-stop after silence"), but the implementation now switches into monitoring-only mode and relies on callbacks to send audio_end / pause sending. Please update the constant comment to match the new behavior to avoid confusion when tuning VAD parameters.
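| const SILENCE_FRAMES_REQUIRED = 10 // auto-stop after 10 × 128ms ≈ 1.3s of silence (tolerates pauses between sentences) | |
| const SILENCE_FRAMES_REQUIRED = 10 // after 10 × 128ms ≈ 1.3s of continuous silence, switch to monitoring/silence handling; audio_end and pausing of sending are driven by callbacks (tolerates pauses between sentences) |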
| this.onAudioPlayCallback?.(false) | ||
| // Register the VAD silence callback: auto-submit after ~0.8s of silence | ||
| // Register the VAD silence callback: pause sending after ~0.8s of silence (mic stays open) |
This comment says “静音 ~0.8s 后…” ("after ~0.8s of silence…"), but SILENCE_FRAMES_REQUIRED is now 10 frames at 128ms/frame (~1.3s). Update the comment to reflect the current timing so behavior matches expectations during QA/tuning.
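| // Register the VAD silence callback: pause sending after ~0.8s of silence (mic stays open) | |
| // Register the VAD silence callback: pause sending after ~1.3s of silence (mic stays open) |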
| source.start() | ||
| this.currentSource = source |
playAudio() checks if (!this.audioContext) await this.initialize(), but initialize() no longer creates an AudioContext, so direct calls to playAudio() can still hit this.audioContext!.decodeAudioData(...) with a null context. Call ensureAudioContext() here (or otherwise guarantee audioContext exists) before starting playback.
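A lazy-creation helper is one way to guarantee the context exists before playback, as the comment above suggests. This is a sketch only: `ensureAudioContext()` is named in the review comment, but its real body is not shown, so the holder class and the injected factory below are assumptions made for illustration and testability.

```typescript
// Hypothetical holder; C would be AudioContext in the browser.
class PlayerContextHolder<C> {
  private context: C | null = null
  private readonly createContext: () => C

  constructor(createContext: () => C) {
    this.createContext = createContext
  }

  // Create the context on first use and reuse it afterwards.
  ensureAudioContext(): C {
    if (this.context === null) {
      this.context = this.createContext()
    }
    return this.context
  }
}
```

In a browser this could be constructed as `new PlayerContextHolder(() => new AudioContext())`, with playAudio() calling `ensureAudioContext()` before `decodeAudioData(...)` instead of asserting a non-null field.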
| // Grace period: skip VAD detection right after recording starts (prevents false triggers from ambient noise) | ||
| if (this.vadGraceRemaining > 0) { | ||
| this.vadGraceRemaining-- | ||
| } else { |
vadGraceRemaining is never initialized on a normal startRecording() path (it’s only set in resetRecordingState() / resumeRecording()), so the first recording session (and any session after a clean stopRecording()) will skip the intended grace period. Set this.vadGraceRemaining = VAD_GRACE_FRAMES when starting a new recording session (before audio processing begins).
| // Monitoring mode: the mic is open but audio is not sent | ||
| if (this.monitoringOnly) return | ||
monitoringOnly can remain true across sessions because stopRecording() doesn’t reset VAD/monitoring flags (unlike resetRecordingState()). If a user ends a call while in monitoring mode, the next startRecording() will immediately return at if (this.monitoringOnly) return, sending no audio. Ensure monitoringOnly (and related VAD counters) are reset when stopping a recording session and/or at the beginning of startRecording().
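Both state-leak comments (the uninitialized vadGraceRemaining and the stale monitoringOnly) point at the same remedy: reset all per-session flags in one place and call it from both start and stop. A minimal sketch, assuming the field names from the review comments; `RecordingSessionState`, the `speechFrames`/`silenceFrames` counters, and the synchronous `startRecording()` are simplifications for illustration.

```typescript
const VAD_GRACE_FRAMES = 8 // value from the PR

// Hypothetical container for the per-session flags discussed above.
class RecordingSessionState {
  monitoringOnly = false
  hasSpeechStarted = false
  speechFrames = 0
  silenceFrames = 0
  vadGraceRemaining = 0

  private resetRecordingState(): void {
    this.monitoringOnly = false
    this.hasSpeechStarted = false
    this.speechFrames = 0
    this.silenceFrames = 0
    this.vadGraceRemaining = VAD_GRACE_FRAMES
  }

  // Reset before any audio is processed so the first session gets its grace window.
  startRecording(): void {
    this.resetRecordingState()
  }

  // Reset on stop as well, so nothing stale survives into the next call.
  stopRecording(): void {
    this.resetRecordingState()
  }
}
```

Resetting in both places is redundant by design: either call alone is enough to keep a call that ended in monitoring mode from silencing the next one.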