
Feat/realtime voice vad p2 #19

Merged
ailuckly merged 6 commits into develop from feat/realtime-voice-vad-p2 on Apr 18, 2026

@ailuckly
Owner

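📌 Changes

  • e.g. Add a user login API
  • e.g. Fix the error message shown when login fails

✅ Testing

  • Passes locally
  • Self-tested
  • CI pipeline passes

PR submission reminders:

  • Make sure the commit subject follows the Conventional Commits specification (feat/fix/docs/style/refactor/test/chore)
  • Make sure the code has passed local testing
  • Make sure no sensitive information (passwords, keys, etc.) has been committed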

Root cause: after VAD silence, stopRecording() disconnected the
ScriptProcessorNode entirely, so onaudioprocess stopped firing during
AI speech. Barge-in detection never triggered because there was no
audio being captured to analyze.

Fix: separate "stop sending audio" from "stop hardware":
- Add monitoringOnly flag: mic stays open but frames are not sent
- pauseRecording(): sets monitoringOnly=true (called by VAD silence
  callback instead of full stopRecording)
- resumeRecording(): sets monitoringOnly=false + resets VAD state
  (called after TTS ends or barge-in detected)
- onaudioprocess: VAD + barge-in detection runs in BOTH recording
  and monitoring modes; audio is only sent when !monitoringOnly

Now the mic is always on during a call. State transitions:
  RECORDING → VAD silence → MONITORING (mic on, not sending)
  MONITORING + AI speaking + user speaks → barge-in → RECORDING
  MONITORING + TTS ends → RECORDING
  stopAudioCall() → hardware stop
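
The transitions listed above can be sketched as a small state machine. This is only an illustration under stated assumptions, not the code from this PR: the `CallState` and `CallEvent` types, the `CallStateMachine` class, and the `STOPPED` state name are hypothetical; only the RECORDING/MONITORING states and the transitions between them come from the commit message.

```typescript
// Hypothetical sketch of the call states described in the commit message.
// STOPPED stands in for the "hardware stop" outcome of stopAudioCall().
type CallState = 'RECORDING' | 'MONITORING' | 'STOPPED';
type CallEvent = 'vadSilence' | 'bargeIn' | 'ttsEnded' | 'stopAudioCall';

class CallStateMachine {
  state: CallState = 'RECORDING';

  // Frames are forwarded to the server only while recording.
  get shouldSendAudio(): boolean {
    return this.state === 'RECORDING';
  }

  // The mic stays open (so VAD and barge-in detection keep running)
  // in both RECORDING and MONITORING.
  get micOpen(): boolean {
    return this.state !== 'STOPPED';
  }

  dispatch(event: CallEvent): CallState {
    if (this.state === 'STOPPED') return this.state; // terminal state
    switch (event) {
      case 'vadSilence':
        if (this.state === 'RECORDING') this.state = 'MONITORING';
        break;
      case 'bargeIn':
      case 'ttsEnded':
        if (this.state === 'MONITORING') this.state = 'RECORDING';
        break;
      case 'stopAudioCall':
        this.state = 'STOPPED';
        break;
    }
    return this.state;
  }
}
```

Modelling the mode as one value rather than several booleans makes it harder to reach a combination such as "hardware stopped but still sending", though the PR itself uses a `monitoringOnly` flag.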

Three fixes for a smoother barge-in experience:

1. clearQueue() now immediately stops the currently-playing
   AudioBufferSourceNode via source.stop(), instead of just emptying
   the queue while the current chunk plays to completion. This
   eliminates the 200-500ms audio tail after barge-in.

2. Barge-in callback explicitly calls setAISpeaking(false) so the
   VAD state resets immediately, rather than waiting for the playback
   onended callback chain.

3. Added barge_in_ack message handler so the frontend logs server
   confirmation of successful barge-in (was silently dropped in the
   default case before).
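
Fix 1 can be illustrated with a queue that remembers the node currently playing. This is a rough sketch, not the PR's implementation: `PlaybackQueue`, `StoppableSource`, `enqueue`, `setCurrent`, and `pending` are hypothetical names, and the interface only mirrors the `stop()` method of `AudioBufferSourceNode` so the example does not need a browser.

```typescript
// Minimal shape of the one AudioBufferSourceNode method this sketch uses.
interface StoppableSource {
  stop(): void;
}

class PlaybackQueue<T> {
  private queue: T[] = [];
  private currentSource: StoppableSource | null = null;

  enqueue(chunk: T): void {
    this.queue.push(chunk);
  }

  // Call right after source.start() so the playing node can be stopped later.
  setCurrent(source: StoppableSource): void {
    this.currentSource = source;
  }

  get pending(): number {
    return this.queue.length;
  }

  clearQueue(): void {
    // Drop everything that has not started playing yet.
    this.queue.length = 0;

    // Stop the chunk that is playing right now instead of letting it finish.
    const source = this.currentSource;
    this.currentSource = null;
    if (source) {
      try {
        source.stop();
      } catch {
        // A real AudioBufferSourceNode throws if stop() is called before
        // start(); there is nothing to cut off in that case.
      }
    }
  }
}
```

Clearing `currentSource` before calling `stop()` keeps a second `clearQueue()` call from stopping the same node twice.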

After barge-in, resumeRecording() was called three times:
1. By the barge-in callback (correct)
2. By the playback state listener (300ms later, duplicate)
3. By handleProcessComplete (old pipeline's complete message)

Each call reset hasSpeechStarted=false, causing the user's ongoing
speech to be forgotten and VAD to immediately re-trigger silence.

Fix: add monitoringOnly guard to resumeRecording() - only runs when
actually in monitoring mode. Second/third calls are no-ops since
monitoringOnly is already false after the first resume.
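
The guard makes `resumeRecording()` idempotent. A minimal sketch, assuming the flag names from the commit message; the `RecorderFlags` class and the boolean return value are hypothetical additions for illustration:

```typescript
class RecorderFlags {
  monitoringOnly = false;
  hasSpeechStarted = false;

  pauseRecording(): void {
    this.monitoringOnly = true;
  }

  // Returns true only when this call actually left monitoring mode.
  resumeRecording(): boolean {
    if (!this.monitoringOnly) return false; // duplicate call: leave VAD state alone
    this.monitoringOnly = false;
    this.hasSpeechStarted = false; // fresh VAD state for the new turn
    return true;
  }
}
```

With this shape, the second and third callers can no longer wipe `hasSpeechStarted` while the user is mid-sentence.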
…e timeout

audio_start was sent before getUserMedia and ScriptProcessorNode setup,
causing the server pipeline to wait for audio that wouldn't arrive for
~500ms. With vad_eos=1000ms, the server-side VAD timed out and the
pipeline completed before any audio was sent.

Fix: move wsClient.startAudioRecording() to AFTER audioManager.startRecording()
returns successfully, ensuring the mic is capturing and the ScriptProcessorNode
is connected before the server starts its STT pipeline.
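
The ordering fix amounts to awaiting local capture before announcing audio to the server. A hedged sketch: `beginTurn`, `MicLike`, and `SignalingLike` are hypothetical names, and only the two method names `startRecording()` and `startAudioRecording()` are taken from the commit message.

```typescript
interface MicLike {
  // Assumed to resolve once getUserMedia has succeeded and the processor is connected.
  startRecording(): Promise<void>;
}

interface SignalingLike {
  // Assumed to send audio_start to the server.
  startAudioRecording(): void;
}

async function beginTurn(mic: MicLike, ws: SignalingLike): Promise<boolean> {
  try {
    await mic.startRecording();
  } catch {
    // Capture failed (for example, permission denied): never announce
    // audio that will not arrive.
    return false;
  }
  // The server starts its STT pipeline only now, with frames already flowing.
  ws.startAudioRecording();
  return true;
}
```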
…ce triggers

Root cause: ambient noise / mic self-noise exceeded SPEECH_THRESHOLD
(0.015), causing hasSpeechStarted=true before the user spoke. Then
silence was detected immediately → audio_end sent with no real speech.

Fixes:
- SPEECH_THRESHOLD: 0.015 → 0.02 (less sensitive to ambient noise)
- MIN_SPEECH_FRAMES: 2 → 5 (~640ms of genuine speech required)
- New VAD_GRACE_FRAMES=8: after recording starts or resumes, skip
  the first ~1s of VAD detection so the user has time to begin speaking
- Grace period also applied in resumeRecording() after TTS ends
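
How the three parameters interact can be sketched as a small speech gate. The constants are the values quoted above; the `SpeechGate` class, the `rms` helper, and the choice to count speech frames cumulatively rather than consecutively are assumptions, since the commit message does not show the implementation.

```typescript
const SPEECH_THRESHOLD = 0.02; // RMS above this counts as speech
const MIN_SPEECH_FRAMES = 5;   // speech frames needed before speech "starts"
const VAD_GRACE_FRAMES = 8;    // frames skipped after start or resume

// Root-mean-square level of one audio frame.
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

class SpeechGate {
  private graceRemaining = VAD_GRACE_FRAMES;
  private speechFrames = 0;
  hasSpeechStarted = false;

  // Call when recording starts or resumes.
  reset(): void {
    this.graceRemaining = VAD_GRACE_FRAMES;
    this.speechFrames = 0;
    this.hasSpeechStarted = false;
  }

  // Feed one frame; returns whether speech is considered to have started.
  process(frame: Float32Array): boolean {
    if (this.graceRemaining > 0) {
      this.graceRemaining--; // still inside the grace window: ignore the frame
      return this.hasSpeechStarted;
    }
    if (rms(frame) > SPEECH_THRESHOLD) {
      this.speechFrames++;
      if (this.speechFrames >= MIN_SPEECH_FRAMES) this.hasSpeechStarted = true;
    }
    return this.hasSpeechStarted;
  }
}
```

A frame with an RMS of about 0.016 would have counted as speech under the old 0.015 threshold but is ignored at 0.02.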
…auses

768ms silence window was cutting users off mid-sentence during normal
word/phrase pauses (200-500ms). Increased SILENCE_FRAMES_REQUIRED
from 6 to 10 frames (~1.3s), which tolerates natural conversational
pauses while still auto-submitting reasonably quickly after the user
finishes speaking.
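
The window lengths follow from the 128ms frame duration used in this PR: 6 × 128ms = 768ms and 10 × 128ms = 1280ms. A small sketch of that arithmetic and of a silence counter follows; `SilenceCounter` and `windowMs` are hypothetical names, and treating the requirement as consecutive silent frames is an assumption.

```typescript
const FRAME_MS = 128; // per-frame duration quoted in this PR ("10 × 128ms")

// Length of a silence window expressed in milliseconds.
function windowMs(frames: number): number {
  return frames * FRAME_MS;
}

class SilenceCounter {
  private silentFrames = 0;
  private required: number;

  constructor(required: number) {
    this.required = required;
  }

  // Feed one frame; returns true on the frame that completes the window.
  push(isSilent: boolean): boolean {
    this.silentFrames = isSilent ? this.silentFrames + 1 : 0; // speech resets the run
    return this.silentFrames === this.required;
  }
}
```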
Copilot AI review requested due to automatic review settings April 18, 2026 16:00
@ailuckly ailuckly merged commit cfe3a00 into develop Apr 18, 2026
2 checks passed

Copilot AI left a comment

Pull request overview

This PR iterates on realtime voice/VAD behavior to improve noise robustness and support “monitoring-only” mode for barge-in (interrupting TTS) without fully stopping the microphone pipeline.

Changes:

  • Tuned VAD thresholds/frames and added a VAD grace window after recording starts.
  • Introduced “monitoring-only” mode with pauseRecording() / resumeRecording() to stop sending audio while keeping the mic open for barge-in detection.
  • Improved immediate TTS stop by tracking and stopping the current AudioBufferSourceNode in clearQueue().

Comment on lines +858 to 863
// Continuous mode: resume recording after TTS finishes (the mic stays open, so a resume is enough)
if (!isPlaying && this.isAudioCallActive && this.isContinuousModeActive) {
setTimeout(() => {
if (this.isAudioCallActive && this.voiceState === 'idle') {
this.startRecording().catch(err => console.error('❌ Failed to auto-restart recording:', err))
if (this.isAudioCallActive) {
this.audioManager.resumeRecording() // switch from monitoring mode back to sending mode
}

Copilot AI Apr 18, 2026

In continuous mode you now call resumeRecording() after TTS ends, but the VAD silence handler sends audio_end to the server. Resuming local sending without also sending a new audio_start will likely leave the server with no active audio session for subsequent turns. Consider sending wsClient.startAudioRecording() immediately before/when resuming from monitoring mode (or making resumeRecording() coordinate this via a callback).

setTimeout(() => {
if (this.isAudioCallActive && this.voiceState === 'idle') {
this.startRecording().catch(err => console.error('❌ Failed to auto-restart after complete:', err))
if (this.isAudioCallActive) {

Copilot AI Apr 18, 2026

Same issue as the playback handler: handleProcessComplete() resumes local sending via audioManager.resumeRecording(), but after an audio_end the server likely requires a new audio_start before accepting binary audio again. Send startAudioRecording() before resuming (or centralize this logic in one place).

Suggested change
if (this.isAudioCallActive) {
if (this.isAudioCallActive) {
this.startAudioRecording()
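
One way to centralize this, as the comment suggests, is a single helper that both handlers call. The sketch below is an assumption-laden illustration: `resumeTurn`, `AudioManagerLike`, `WsClientLike`, and the `isMonitoringOnly()` accessor are hypothetical and do not appear in the PR; the check exists so that duplicate resume calls do not send a second `audio_start`.

```typescript
interface AudioManagerLike {
  isMonitoringOnly(): boolean; // hypothetical accessor for the monitoringOnly flag
  resumeRecording(): void;
}

interface WsClientLike {
  startAudioRecording(): void; // assumed to send audio_start
}

// Single entry point for leaving monitoring mode.
function resumeTurn(audio: AudioManagerLike, ws: WsClientLike): boolean {
  if (!audio.isMonitoringOnly()) return false; // already sending: nothing to do
  ws.startAudioRecording();  // open a new server-side audio session first
  audio.resumeRecording();   // then let frames flow again
  return true;
}
```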

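const SPEECH_THRESHOLD = 0.02 // RMS above this value → treated as speech (raised to reduce false triggers from noise)
const SILENCE_THRESHOLD = 0.01 // RMS below this value → treated as silence
const MIN_SPEECH_FRAMES = 5 // VAD may trigger only after at least 5 frames of real speech (~640ms, guards against ambient-noise false triggers)
const SILENCE_FRAMES_REQUIRED = 10 // 10 × 128ms ≈ 1.3s; auto-stop after silence (tolerates pauses between sentences)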

Copilot AI Apr 18, 2026

The comment on SILENCE_FRAMES_REQUIRED says “静音后自动停止” ("auto-stop after silence"), but the implementation now switches into monitoring-only mode and relies on callbacks to send audio_end / pause sending. Please update the constant's comment to match the new behavior to avoid confusion when tuning VAD parameters.

Suggested change
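const SILENCE_FRAMES_REQUIRED = 10 // 10 × 128ms ≈ 1.3s; auto-stop after silence (tolerates pauses between sentences)
const SILENCE_FRAMES_REQUIRED = 10 // 10 × 128ms ≈ 1.3s; after continuous silence, switch to monitoring/silence handling; audio_end and pausing of sending are driven by callbacks (tolerates pauses between sentences)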

this.onAudioPlayCallback?.(false)

// Register the VAD silence callback: auto-submit after ~0.8s of silence
// Register the VAD silence callback: pause sending after ~0.8s of silence (mic stays open)

Copilot AI Apr 18, 2026

This comment says “静音 ~0.8s 后…” ("after ~0.8s of silence…"), but SILENCE_FRAMES_REQUIRED is now 10 frames at 128ms/frame (~1.3s). Update the comment to reflect the current timing so behavior matches expectations during QA/tuning.

Suggested change
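// Register the VAD silence callback: pause sending after ~0.8s of silence (mic stays open)
// Register the VAD silence callback: pause sending after ~1.3s of silence (mic stays open)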

Comment on lines 716 to +717
source.start()
this.currentSource = source

Copilot AI Apr 18, 2026

playAudio() checks if (!this.audioContext) await this.initialize(), but initialize() no longer creates an AudioContext, so direct calls to playAudio() can still hit this.audioContext!.decodeAudioData(...) with a null context. Call ensureAudioContext() here (or otherwise guarantee audioContext exists) before starting playback.
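
A lazily-created context removes the need for the non-null assertion. The following is a sketch under assumptions rather than the project's code: `Player`, `AudioContextLike`, and the injected factory are hypothetical, and the real signature of `ensureAudioContext()` is not shown in this PR. In a browser the factory could be `() => new AudioContext()`.

```typescript
// Only the part of AudioContext this sketch relies on.
interface AudioContextLike {
  decodeAudioData(data: ArrayBuffer): Promise<unknown>;
}

class Player {
  private audioContext: AudioContextLike | null = null;
  private createContext: () => AudioContextLike;

  constructor(createContext: () => AudioContextLike) {
    this.createContext = createContext;
  }

  // Create the context on first use and reuse it afterwards.
  private ensureAudioContext(): AudioContextLike {
    if (!this.audioContext) this.audioContext = this.createContext();
    return this.audioContext;
  }

  async playAudio(data: ArrayBuffer): Promise<unknown> {
    const ctx = this.ensureAudioContext(); // never null here
    return ctx.decodeAudioData(data);
  }
}
```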

Comment on lines +526 to +529
// Grace period: skip VAD detection right after recording starts (prevents false triggers from ambient noise)
if (this.vadGraceRemaining > 0) {
this.vadGraceRemaining--
} else {

Copilot AI Apr 18, 2026

vadGraceRemaining is never initialized on a normal startRecording() path (it’s only set in resetRecordingState() / resumeRecording()), so the first recording session (and any session after a clean stopRecording()) will skip the intended grace period. Set this.vadGraceRemaining = VAD_GRACE_FRAMES when starting a new recording session (before audio processing begins).

Comment on lines +558 to +560
// Monitoring mode: the mic is open but no audio is sent
if (this.monitoringOnly) return

Copilot AI Apr 18, 2026

monitoringOnly can remain true across sessions because stopRecording() doesn’t reset VAD/monitoring flags (unlike resetRecordingState()). If a user ends a call while in monitoring mode, the next startRecording() session will immediately return at if (this.monitoringOnly) return, sending no audio. Ensure monitoringOnly (and related VAD counters) are reset when stopping a recording session and/or at the beginning of startRecording().
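
Both this comment and the earlier one about `vadGraceRemaining` point at the same remedy: reset all per-session flags in one place and call it from `startRecording()` and `stopRecording()`. A minimal sketch, in which the `RecordingSession` class, `resetVadState()`, `silentFrames`, and `shouldSendFrame()` are hypothetical names and the hardware steps are omitted:

```typescript
const VAD_GRACE_FRAMES = 8;

class RecordingSession {
  monitoringOnly = false;
  hasSpeechStarted = false;
  silentFrames = 0;
  vadGraceRemaining = 0;

  // One place that puts every per-session flag back to its initial value.
  private resetVadState(): void {
    this.monitoringOnly = false;
    this.hasSpeechStarted = false;
    this.silentFrames = 0;
    this.vadGraceRemaining = VAD_GRACE_FRAMES;
  }

  startRecording(): void {
    this.resetVadState(); // a new session never inherits old flags
    // ...open the mic and connect the processor here...
  }

  pauseRecording(): void {
    this.monitoringOnly = true;
  }

  stopRecording(): void {
    // ...release the hardware here...
    this.resetVadState();
  }

  shouldSendFrame(): boolean {
    return !this.monitoringOnly;
  }
}
```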
