Skip to content

fix: cancel TTS on remote human speech in multi-participant huddles #332

Merged
tlongwell-block merged 3 commits into main from fix/remote-human-tts-interrupt
Apr 16, 2026

@tlongwell-block
Collaborator

Problem

In multi-participant huddles, when a remote human speaks while the local agent's TTS is playing, the TTS continues uninterrupted. The existing barge-in mechanism only handles local speech (via STT), so remote human speech from the audio relay is ignored. This creates an awkward experience where the agent talks over humans.

Approach

Client-side per-peer frame counting in the relay recv task. Each remote peer gets an independent frame counter that increments on each successfully decoded Opus audio frame. Counting is gated on tts_active, so frames only accumulate while TTS is playing. When any peer's counter crosses REMOTE_SPEECH_THRESHOLD (15 frames in a 500ms window), the shared tts_cancel atomic flag is set; the TTS worker picks it up and cancels playback.
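A minimal sketch of the flag handoff described above, assuming both flags are `Arc<AtomicBool>` values and using Acquire/Release ordering as the commit message mentions. The `TtsFlags` struct and its method names are hypothetical and only group the two flags for illustration; in the PR the flags are cloned from HuddleState.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical bundle of the two shared flags (not a type from the PR).
#[derive(Clone, Default)]
pub struct TtsFlags {
    /// True while the TTS worker is playing audio.
    pub tts_active: Arc<AtomicBool>,
    /// Set by any producer (local STT or the relay) to request a cancel.
    pub tts_cancel: Arc<AtomicBool>,
}

impl TtsFlags {
    /// Relay side: request a cancel, but only while TTS is playing.
    /// Returns true if the request was recorded.
    pub fn request_cancel(&self) -> bool {
        if self.tts_active.load(Ordering::Acquire) {
            self.tts_cancel.store(true, Ordering::Release);
            true
        } else {
            false
        }
    }

    /// TTS worker side: consume a pending request exactly once.
    /// `swap` clears the flag and reports whether it was set.
    pub fn take_cancel(&self) -> bool {
        self.tts_cancel.swap(false, Ordering::AcqRel)
    }
}
```

Because `take_cancel` uses a single atomic `swap`, a local and a remote cancel arriving at the same moment collapse into one consumed request instead of leaving a stale flag for the next TTS session.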

Key design decisions:

  • Per-peer isolation: DTX comfort noise (~2-3 frames/window) from silent peers doesn't accumulate with real speech from active peers
  • Gated accumulation: Counters only increment while tts_active is true, preventing false triggers
  • Session reset: On TTS session start (false→true transition), all counters clear — prevents stale pre-playback speech from tripping a cancel
  • Instant-based window: 500ms reset window uses std::time::Instant (starvation-proof, no async dependency)
  • Saturating arithmetic: saturating_add prevents overflow on the u16 counters
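The design decisions above can be sketched as a small state machine. This is an illustration under stated assumptions, not the code in relay_api.rs: `RemoteSpeechDetector`, `observe_tts_state`, `on_decoded_frame`, and the `WINDOW` constant are hypothetical names, and the threshold uses the 15-frame value from this description (the commit message cites 5).

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative value taken from the PR description.
pub const REMOTE_SPEECH_THRESHOLD: u16 = 15;
/// Assumed name for the 500ms reset window.
pub const WINDOW: Duration = Duration::from_millis(500);

/// Hypothetical helper; the PR keeps this state inline in the recv task.
pub struct RemoteSpeechDetector {
    frame_counts: HashMap<u8, u16>, // peer index -> frames in current window
    window_start: Instant,
    tts_was_active: bool,
}

impl RemoteSpeechDetector {
    pub fn new(now: Instant) -> Self {
        Self { frame_counts: HashMap::new(), window_start: now, tts_was_active: false }
    }

    /// Call for every binary frame, decoded or not, so a new TTS session
    /// (false -> true) clears stale counts even during DTX silence.
    pub fn observe_tts_state(&mut self, tts_active: bool, now: Instant) {
        if tts_active && !self.tts_was_active {
            self.frame_counts.clear();
            self.window_start = now;
        }
        self.tts_was_active = tts_active;
    }

    /// Call once per successfully decoded, non-empty frame.
    /// Returns true when the caller should set tts_cancel.
    pub fn on_decoded_frame(&mut self, peer: u8, tts_active: bool, now: Instant) -> bool {
        self.observe_tts_state(tts_active, now);
        if !tts_active {
            return false; // gated: nothing accumulates while TTS is idle
        }
        // Instant-based expiry: checked inline, so no timer task can starve it.
        if now.saturating_duration_since(self.window_start) >= WINDOW {
            self.frame_counts.clear();
            self.window_start = now;
        }
        let count = self.frame_counts.entry(peer).or_insert(0);
        *count = count.saturating_add(1);
        *count >= REMOTE_SPEECH_THRESHOLD
    }
}
```

Passing `now` in as a parameter keeps the window logic deterministic, so a test can simulate window expiry without sleeping.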

What Changed

  • relay_api.rs (production): Per-peer HashMap<u8, u16> frame counting in the recv task, REMOTE_SPEECH_THRESHOLD constant promoted to pub(crate) module level, selective Joined handler, Instant-based window reset
  • tts.rs (tests): 18 new tests covering threshold behavior, per-peer isolation, DTX filtering, TTS session transitions, cancel consumption lifecycle, concurrent local+remote cancel safety, and apply_fades edge cases
  • check-file-sizes.mjs (config): File size overrides for relay_api.rs (502 lines, +2 over limit) and tts.rs (1005 lines with test suite)

Reviews

Three independent approvals:

  • Hana: Architecture review — approved the per-peer frame counting approach
  • Clove: Code quality review — 9/10
  • Lep: Security review — 9/10

Prior art research (Levin) found no existing multi-participant barge-in implementations in the wild.

Thread tts_cancel and tts_active flags into the audio relay pipeline by
cloning them from HuddleState inside connect_audio_relay. Use per-peer
frame counting to distinguish real speech (~25 frames/500ms) from DTX
comfort noise (~1 frame/500ms). When any remote peer crosses
REMOTE_SPEECH_THRESHOLD (5 decoded audio frames) while TTS is active,
immediately set tts_cancel.

Key design choices:
- REMOTE_SPEECH_THRESHOLD promoted to pub(crate) module level so tests
  can import it directly instead of duplicating the value.
- Frame counting is gated on tts_active and happens after successful Opus
  decode (Ok(n) if n > 0). Corrupt frames and DTX silence are excluded.
- TTS state transitions tracked at binary frame level (tts_was_active)
  so session boundaries are detected even during DTX silence. Counters
  reset on new TTS session (false→true) and on Instant-based window
  expiry (starvation-proof). Uses saturating_add.
- The joined handler selectively resets decoder/counter/active_indices state
  only for new or reassigned peer indices. Existing peers keep their state.
- The left handler cleans up index_to_pubkey, frame_counts, and decoders.
- Acquire/Release ordering matches the existing tts.rs/stt.rs convention.
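
The joined/left handling in the list above might look roughly like the following. This is a sketch under assumptions: the real message shapes are not shown in this PR text, so `RelayPeers`, `on_joined`, `on_left`, and the stub `Decoder` are hypothetical, pubkeys are modelled as strings, and active_indices is omitted for brevity.

```rust
use std::collections::HashMap;

/// Stand-in for a per-peer Opus decoder (hypothetical; tracks only a counter).
#[derive(Default)]
pub struct Decoder {
    pub frames_decoded: u32,
}

/// Hypothetical grouping of the per-peer maps named in the commit message.
#[derive(Default)]
pub struct RelayPeers {
    pub index_to_pubkey: HashMap<u8, String>,
    pub frame_counts: HashMap<u8, u16>,
    pub decoders: HashMap<u8, Decoder>,
}

impl RelayPeers {
    /// joined: reset state only when the index is new or has been
    /// reassigned to a different pubkey. Existing peers keep their state.
    pub fn on_joined(&mut self, index: u8, pubkey: &str) {
        let unchanged =
            self.index_to_pubkey.get(&index).map(String::as_str) == Some(pubkey);
        if unchanged {
            return;
        }
        self.index_to_pubkey.insert(index, pubkey.to_owned());
        self.frame_counts.remove(&index);
        self.decoders.insert(index, Decoder::default());
    }

    /// left: drop every piece of state tied to the index.
    pub fn on_left(&mut self, index: u8) {
        self.index_to_pubkey.remove(&index);
        self.frame_counts.remove(&index);
        self.decoders.remove(&index);
    }
}
```

Resetting only changed indices means a peer who is mid-sentence does not lose decoder state or accumulated frame counts when someone else joins.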

Every remote peer on the audio WebSocket is a human (agents never connect
to the audio relay), so no peer filtering is needed.

Comprehensive test suite for the per-peer frame counting and cancel
mechanism. Covers threshold behavior, per-peer isolation, DTX comfort
noise filtering, TTS session transitions, cancel consumption lifecycle,
concurrent local+remote cancel safety, and apply_fades edge cases.

relay_api.rs is 2 lines over (502 vs 500) from the per-peer frame
counting logic. tts.rs grew to ~1005 lines with the comprehensive
remote interrupt test suite (18 tests, 564 lines).

tlongwell-block merged commit d99332e into main Apr 16, 2026
10 checks passed
tlongwell-block deleted the fix/remote-human-tts-interrupt branch April 16, 2026 01:23
tlongwell-block added a commit that referenced this pull request Apr 16, 2026
* origin/main:
  [codex] Fix authz, scope propagation, and shell-injection bugs (#320)
  feat(mobile): implement Activity tab with personalized feed (#337)
  feat(mobile): upgrade mobile_scanner to v7 (Apple Vision) (#336)
  feat(mobile): app branding — icon, name, launch screen (#335)
  fix: cancel TTS on remote human speech in multi-participant huddles (#332)
  feat(mobile): design refresh — navigation, search, reactions (#334)
  feat(desktop+acp+mcp): deterministic nested thread replies via persisted reply context (#322)
  feat(mobile): channel management — create, browse, join/leave, DMs, canvas (#331)
  fix: derive staging ports from worktree to avoid collisions (#329)
  fix: mentions survive copy/paste from chat into composer (#328)
  feat(home): add activity and agent feed sections with deep-linking (#330)