fix: cancel TTS on remote human speech in multi-participant huddles #332
Merged
tlongwell-block merged 3 commits into main on Apr 16, 2026
Thread tts_cancel and tts_active flags into the audio relay pipeline by cloning them from HuddleState inside connect_audio_relay. Use per-peer frame counting to distinguish real speech (~25 frames/500ms) from DTX comfort noise (~1 frame/500ms). When any remote peer crosses REMOTE_SPEECH_THRESHOLD (5 decoded audio frames) while TTS is active, immediately set tts_cancel.
Key design choices:
- REMOTE_SPEECH_THRESHOLD promoted to pub(crate) module level so tests can import it directly instead of duplicating the value.
- Frame counting is gated on tts_active and happens after successful Opus decode (Ok(n) if n > 0). Corrupt frames and DTX silence are excluded.
- TTS state transitions tracked at binary frame level (tts_was_active) so session boundaries are detected even during DTX silence. Counters reset on new TTS session (false→true) and on Instant-based window expiry (starvation-proof). Uses saturating_add.
- joined selectively resets decoder/counter/active_indices state only for new or reassigned peer indices. Existing peers keep their state.
- left cleans up index_to_pubkey, frame_counts, and decoders.
- Acquire/Release ordering matches the existing tts.rs/stt.rs convention.
Every remote peer on the audio WebSocket is a human (agents never connect to the audio relay), so no peer filtering is needed.
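The counting rules in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from relay_api.rs: the RemoteSpeechDetector type, its on_frame method, and the WINDOW constant are hypothetical names, the decode result is simplified to an Option, and treating "crosses" as "reaches or exceeds" the threshold is an assumption.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Decoded frames from one peer, within one window, that count as speech.
pub const REMOTE_SPEECH_THRESHOLD: u16 = 5;
/// Counting window (assumed to match the 500 ms figure quoted above).
const WINDOW: Duration = Duration::from_millis(500);

/// Hypothetical helper owning the per-peer counters of the recv task.
pub struct RemoteSpeechDetector {
    tts_active: Arc<AtomicBool>,
    tts_cancel: Arc<AtomicBool>,
    frame_counts: HashMap<u8, u16>,
    window_start: Instant,
    tts_was_active: bool,
}

impl RemoteSpeechDetector {
    pub fn new(tts_active: Arc<AtomicBool>, tts_cancel: Arc<AtomicBool>, now: Instant) -> Self {
        Self {
            tts_active,
            tts_cancel,
            frame_counts: HashMap::new(),
            window_start: now,
            tts_was_active: false,
        }
    }

    /// Call for every binary frame from `peer`. `decoded_samples` stands in
    /// for the Opus decode result: Some(n) on success, None if corrupt.
    /// Returns true if this frame caused tts_cancel to be set.
    pub fn on_frame(&mut self, peer: u8, decoded_samples: Option<usize>, now: Instant) -> bool {
        let active = self.tts_active.load(Ordering::Acquire);
        // A new TTS session (false -> true) starts from clean counters.
        if active && !self.tts_was_active {
            self.frame_counts.clear();
            self.window_start = now;
        }
        self.tts_was_active = active;
        if !active {
            return false;
        }
        // Wall-clock window expiry, independent of how often frames arrive.
        if now.duration_since(self.window_start) >= WINDOW {
            self.frame_counts.clear();
            self.window_start = now;
        }
        // Only successfully decoded, non-empty frames count as audio.
        match decoded_samples {
            Some(n) if n > 0 => {}
            _ => return false,
        }
        let count = self.frame_counts.entry(peer).or_insert(0);
        *count = count.saturating_add(1);
        if *count >= REMOTE_SPEECH_THRESHOLD {
            self.tts_cancel.store(true, Ordering::Release);
            return true;
        }
        false
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic under test. With roughly one DTX frame per 500ms window, a peer's counter is cleared before it can reach the threshold, while continuous speech at about 25 frames per window reaches it within the first few frames.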
Comprehensive test suite for the per-peer frame counting and cancel mechanism. Covers threshold behavior, per-peer isolation, DTX comfort noise filtering, TTS session transitions, cancel consumption lifecycle, concurrent local+remote cancel safety, and apply_fades edge cases.
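To make the "cancel consumption lifecycle" and "concurrent local+remote cancel safety" cases concrete, here is one way the worker side could consume the flag. This is only a sketch under assumptions: take_cancel and play_chunks are hypothetical names, and consuming the flag with an atomic swap is an assumed technique, not something this PR states about tts.rs.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical worker-side check. The atomic swap reads and clears the
/// flag in one step, so a cancel request is observed exactly once even if
/// the local STT path and the remote relay path both set it.
pub fn take_cancel(tts_cancel: &AtomicBool) -> bool {
    tts_cancel.swap(false, Ordering::AcqRel)
}

/// Hypothetical playback loop: marks TTS as active, "plays" up to `chunks`
/// chunks, and stops early when a cancel is consumed. Returns the number
/// of chunks played before stopping.
pub fn play_chunks(chunks: usize, tts_active: &AtomicBool, tts_cancel: &AtomicBool) -> usize {
    tts_active.store(true, Ordering::Release);
    let mut played = 0;
    for _ in 0..chunks {
        if take_cancel(tts_cancel) {
            break;
        }
        played += 1;
    }
    tts_active.store(false, Ordering::Release);
    played
}
```

Because both cancel sources only ever store true, a double set collapses into a single cancel, and the flag is false again once the worker has consumed it.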
relay_api.rs is 2 lines over (502 vs 500) from the per-peer frame counting logic. tts.rs grew to ~1005 lines with the comprehensive remote interrupt test suite (18 tests, 564 lines).
tlongwell-block added a commit that referenced this pull request on Apr 16, 2026
* origin/main:
  [codex] Fix authz, scope propagation, and shell-injection bugs (#320)
  feat(mobile): implement Activity tab with personalized feed (#337)
  feat(mobile): upgrade mobile_scanner to v7 (Apple Vision) (#336)
  feat(mobile): app branding — icon, name, launch screen (#335)
  fix: cancel TTS on remote human speech in multi-participant huddles (#332)
  feat(mobile): design refresh — navigation, search, reactions (#334)
  feat(desktop+acp+mcp): deterministic nested thread replies via persisted reply context (#322)
  feat(mobile): channel management — create, browse, join/leave, DMs, canvas (#331)
  fix: derive staging ports from worktree to avoid collisions (#329)
  fix: mentions survive copy/paste from chat into composer (#328)
  feat(home): add activity and agent feed sections with deep-linking (#330)
Problem
In multi-participant huddles, when a remote human speaks while the local agent's TTS is playing, the TTS continues uninterrupted. The existing barge-in mechanism only handles local speech (via STT), so remote human speech from the audio relay is ignored. This creates an awkward experience where the agent talks over humans.
Approach
Client-side per-peer frame counting in the relay recv task. Each remote peer gets an independent frame counter that increments on successfully decoded Opus audio frames. Counting is gated on tts_active: frames only accumulate while TTS is playing. When any peer's counter crosses REMOTE_SPEECH_THRESHOLD (5 frames in a 500ms window), the shared tts_cancel atomic flag is set; the TTS worker picks it up and cancels playback.
Key design decisions:
- Frames are counted only while tts_active is true, preventing false triggers
- Window expiry uses std::time::Instant (starvation-proof, no async dependency)
- saturating_add prevents overflow on the u16 counters
What Changed
- relay_api.rs (production): Per-peer HashMap<u8, u16> frame counting in the recv task, REMOTE_SPEECH_THRESHOLD constant promoted to pub(crate) module level, selective Joined handler, Instant-based window reset
- tts.rs (tests): 18 new tests covering threshold behavior, per-peer isolation, DTX filtering, TTS session transitions, cancel consumption lifecycle, concurrent local+remote cancel safety, and apply_fades edge cases
- check-file-sizes.mjs (config): File size overrides for relay_api.rs (502 lines, +2 over limit) and tts.rs (1005 lines with test suite)
Reviews
Three independent approvals:
Prior art research (Levin) confirmed no existing multi-participant barge-in implementations in the wild.