-
Notifications
You must be signed in to change notification settings - Fork 415
Binary diarization #1102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Binary diarization #1102
Conversation
📝 WalkthroughWalkthroughThis set of changes introduces support for dual audio channels throughout the codebase. It adds a new Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant ListenClientBuilder
participant ListenClient / ListenClientDual
participant WebSocketServer
Client->>ListenClientBuilder: build_single()/build_dual()
ListenClientBuilder->>ListenClient: returns ListenClient (single) or ListenClientDual (dual)
Client->>ListenClient: from_audio(audio_stream) (Single)
Client->>ListenClientDual: from_audio(mic_stream, speaker_stream) (Dual)
ListenClientDual->>ListenClientDual: zip mic and speaker streams
ListenClient / ListenClientDual->>WebSocketServer: send ListenInputChunk::Audio or ::DualAudio
WebSocketServer->>WebSocketServer: process input (mix if DualAudio)
WebSocketServer-->>Client: ListenOutputChunk (with meta)
Possibly related PRs
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Clippy (1.86.0)error: failed to load source for dependency Caused by: Caused by: Caused by: 📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (2)
🧰 Additional context used📓 Path-based instructions (1)`**/*.{js,ts,tsx,rs}`: 1. No error handling. 2. No unused imports, variables, or functions. 3. For comments, keep it minimal. It should be about "Why", not "What".
⚙️ Source: CodeRabbit Configuration File List of files the instruction was applied to:
🧬 Code Graph Analysis (1)plugins/local-stt/src/server.rs (4)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (5)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
crates/whisper-local/src/stream.rs (1)
36-38: Inefficient cloning on every metadata accessThis implementation clones the metadata on every call to
meta(), which could be costly for large JSON values.plugins/listener/src/client.rs (1)
60-88: No error handling in builder methodsBoth
build_singleandbuild_dualuseunwrap()on URI parsing, which could panic.
🧹 Nitpick comments (5)
apps/app/server/src/native/listen/realtime.rs (1)
45-47: Complete the dual audio implementation.The
todo!()placeholder should be replaced with proper dual audio processing logic. This is part of the staged implementation for dual audio support.Do you want me to help implement the dual audio processing logic based on the patterns used in other files, or should this be tracked as a separate task?
plugins/local-stt/src/server.rs (1)
213-220: Dual channel function needs implementationThe dual channel WebSocket handler is currently stubbed out. This is acceptable for the current stage but should be implemented to complete the dual audio support.
Do you want me to generate a basic implementation structure for the dual channel function or create an issue to track this implementation?
crates/whisper-local/src/stream.rs (2)
20-23: Performance concern: Returning owned values instead of referencesThe change from returning
&Option<serde_json::Value>toOption<serde_json::Value>forces a clone on every access. For large metadata objects, this could impact performance significantly.Consider keeping the reference return type if metadata is frequently accessed:
- fn meta(&self) -> Option<serde_json::Value>; + fn meta(&self) -> Option<&serde_json::Value>;Or if ownership is truly needed, consider using
Cow<'_, serde_json::Value>to allow both borrowed and owned values.
191-193: Consider optimizing for None metadata caseWhile
Option::clone()is cheap forNone, you could avoid the iteration entirely when there's no metadata to assign.- for segment in &mut segments { - segment.meta = meta.clone(); - } + if let Some(ref meta_value) = meta { + for segment in &mut segments { + segment.meta = Some(meta_value.clone()); + } + }plugins/listener/src/client.rs (1)
129-134: Performance: Avoid unnecessary byte conversionsConverting
BytestoVec<u8>withto_vec()creates unnecessary copies of audio data.If
ListenInputChunk::DualAudiocan acceptbytes::Bytesdirectly:- ListenInputChunk::DualAudio { - mic: data.0.to_vec(), - speaker: data.1.to_vec(), - } + ListenInputChunk::DualAudio { + mic: data.0, + speaker: data.1, + }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (17)
apps/app/server/src/native/listen/realtime.rs(1 hunks)crates/audio-utils/src/lib.rs(3 hunks)crates/stt/src/realtime/clova.rs(1 hunks)crates/stt/src/realtime/deepgram.rs(1 hunks)crates/stt/src/realtime/whisper.rs(1 hunks)crates/whisper-cloud/src/client.rs(1 hunks)crates/whisper-local/src/model.rs(2 hunks)crates/whisper-local/src/stream.rs(4 hunks)crates/ws-utils/Cargo.toml(1 hunks)crates/ws-utils/src/lib.rs(2 hunks)crates/ws/src/client.rs(2 hunks)plugins/listener-interface/Cargo.toml(1 hunks)plugins/listener-interface/src/lib.rs(4 hunks)plugins/listener/Cargo.toml(1 hunks)plugins/listener/src/client.rs(7 hunks)plugins/listener/src/fsm.rs(2 hunks)plugins/local-stt/src/server.rs(4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{js,ts,tsx,rs}`: 1. No error handling. 2. No unused imports, variables, or functions. 3. For comments, keep it minimal. It should be about "Why", not "What".
**/*.{js,ts,tsx,rs}: 1. No error handling.
2. No unused imports, variables, or functions.
3. For comments, keep it minimal. It should be about "Why", not "What".
⚙️ Source: CodeRabbit Configuration File
List of files the instruction was applied to:
crates/stt/src/realtime/whisper.rscrates/stt/src/realtime/clova.rsapps/app/server/src/native/listen/realtime.rscrates/whisper-cloud/src/client.rscrates/ws-utils/src/lib.rscrates/whisper-local/src/model.rsplugins/listener/src/fsm.rscrates/audio-utils/src/lib.rscrates/stt/src/realtime/deepgram.rscrates/ws/src/client.rsplugins/listener-interface/src/lib.rsplugins/local-stt/src/server.rscrates/whisper-local/src/stream.rsplugins/listener/src/client.rs
🧬 Code Graph Analysis (3)
crates/ws-utils/src/lib.rs (1)
crates/audio-utils/src/lib.rs (1)
bytes_to_f32_samples(51-58)
crates/whisper-local/src/model.rs (1)
crates/whisper-local/src/stream.rs (2)
meta(22-22)meta(36-38)
crates/ws/src/client.rs (2)
plugins/listener/src/client.rs (2)
to_input(101-105)to_input(129-134)crates/whisper-cloud/src/client.rs (1)
to_input(94-96)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: ci (macos, macos-latest)
- GitHub Check: ci (windows, windows-latest)
- GitHub Check: ci
🔇 Additional comments (33)
crates/audio-utils/src/lib.rs (3)
8-8: Good addition of named constant for maintainability.Replacing the magic number
32768.0with a named constant improves code readability and maintainability.
26-26: Consistent use of I16_SCALE constant.Good refactoring to use the named constant instead of hardcoded values across all conversion functions.
Also applies to: 37-37, 45-45
51-58: New utility function looks correct.The
bytes_to_f32_samplesfunction correctly converts little-endian i16 byte pairs to normalized f32 samples using the established scaling constant.crates/ws-utils/Cargo.toml (1)
7-7: Appropriate dependency addition.Adding the
hypr-audio-utilsdependency makes sense for utilizing the new audio conversion utilities.plugins/listener/Cargo.toml (1)
58-59: Dependencies added for dual audio support.The addition of
statigwith async features and reorganization of dependencies aligns with the dual audio functionality being introduced.crates/stt/src/realtime/clova.rs (1)
39-39: Good use of default initialization.Using
..Default::default()ensures all fields inListenOutputChunkare properly initialized, which is especially important as the struct is being extended with new fields.crates/stt/src/realtime/whisper.rs (1)
33-33: Consistent default initialization pattern.Using
..Default::default()maintains consistency with other STT implementations and ensures proper field initialization.plugins/listener-interface/Cargo.toml (1)
12-12: LGTM! Dependency additions support new dual audio functionality.The
strumdependency withderivefeature andserde_jsonfeature forspectaare necessary to support the newAudioModeenum and metadata serialization capabilities introduced in this PR.Also applies to: 17-17
crates/stt/src/realtime/deepgram.rs (1)
85-88: LGTM! Improved struct initialization with default values.The struct update syntax
..Default::default()ensures all fields are properly initialized with default values, which is especially important with the addition of the new optionalmetafield toListenOutputChunk.crates/whisper-cloud/src/client.rs (1)
90-90: LGTM! Trait generalization supports flexible input data types.The addition of the associated type
Dataand its usage in theto_inputmethod enables theWebSocketIOtrait to handle both single and dual audio modes. This maintains backward compatibility while supporting the new dual audio functionality.Also applies to: 94-94
plugins/listener/src/fsm.rs (2)
357-357: Metadata capture for future use.The
_metavariable captures metadata from stream results, which aligns with the unified metadata handling improvements in this PR. The underscore prefix indicates it's currently unused but available for future functionality.
463-463: LGTM! Explicit single audio mode specification.The change from
.build()to.build_single()explicitly distinguishes between single and dual audio modes, which is essential for the new dual audio support while maintaining existing functionality.crates/whisper-local/src/model.rs (2)
240-240: LGTM - Consistent field renamingThe field rename from
metadatatometaaligns with the broader refactoring across the codebase for metadata handling standardization.
264-266: LGTM - Consistent method signature changeThe method change from returning a reference to returning an owned value by cloning is consistent with the pattern established in
crates/whisper-local/src/stream.rsand maintains API consistency.crates/ws-utils/src/lib.rs (2)
4-4: LGTM - Good refactoring to use utility functionThe import of
bytes_to_f32_samplesfrom the audio utils crate is a good refactoring that eliminates code duplication.
37-37: LGTM - Clean refactoringReplacing the manual byte conversion with the utility function improves code maintainability and consistency.
crates/ws/src/client.rs (3)
10-10: LGTM - Well-designed trait generificationAdding the
Dataassociated type withSendconstraint is a good design that enables the trait to work with different data types while maintaining thread safety.
14-14: LGTM - Consistent method signature updateThe method signature change to accept
Self::Datainstead of fixedbytes::Bytesis consistent with the trait generification and enables flexible input handling.
30-30: LGTM - Properly updated generic constraintThe
from_audiomethod correctly uses the genericT::Datatype, enabling it to work with both single and dual audio stream clients as shown in the relevant code snippets.plugins/listener-interface/src/lib.rs (6)
19-19: LGTM - Useful Default deriveAdding
Defaultto theWordstruct reduces boilerplate and makes the API more convenient to use.
40-40: LGTM - Consistent Default deriveAdding
DefaulttoListenOutputChunkis consistent with the pattern and useful for initialization.
42-42: LGTM - Consistent metadata field additionThe
metafield addition aligns with the metadata handling standardization across the codebase.
55-61: LGTM - Well-structured dual audio variantThe
DualAudiovariant is well-structured with properserde_bytesserialization for bothmicandspeakerfields, maintaining consistency with the existingAudiovariant.
73-89: LGTM - Clean enum designThe
AudioModeenum is well-designed with appropriate serde and strum attributes, clear variant names, and a sensible default ofSingle.
94-94: LGTM - Consistent parameter additionAdding the
audio_modefield toListenParamsis consistent with the dual audio support and has an appropriate default value.plugins/local-stt/src/server.rs (5)
134-144: LGTM - Clean audio mode dispatchThe refactoring to split socket handling and dispatch based on audio mode is well-structured and maintains clear separation of concerns.
146-151: LGTM - Consistent function signatureThe function signature change to accept split socket streams is consistent with the new architecture and enables better control over WebSocket handling.
153-153: LGTM - Good use of fully qualified pathsUsing the fully qualified path for
WebSocketAudioSourceis good practice and avoids potential import conflicts.
160-160: LGTM - Cleaner initializationUsing
..Default::default()instead of explicit field initialization is cleaner and more maintainable.
175-187: LGTM - Proper metadata handlingThe metadata extraction and propagation is correctly implemented and consistent with the broader metadata handling changes.
crates/whisper-local/src/stream.rs (1)
154-164: LGTM! Clean separation of metadata extractionGood practice to extract the metadata before consuming the chunk, ensuring proper ownership handling.
plugins/listener/src/client.rs (2)
164-173: LGTM! Clean dual stream implementationGood use of
zipto combine the mic and speaker streams into a single dual stream.
32-58: Critical: No error handling for URL parsingThe method uses multiple
unwrap()calls that could panic if:
- The API base URL is malformed
- The URL doesn't have a host
- Setting the scheme fails
Replace unwrap() calls with proper error handling:
- fn build_uri(&self, audio_mode: hypr_listener_interface::AudioMode) -> String { - let mut url: url::Url = self.api_base.as_ref().unwrap().parse().unwrap(); + fn build_uri(&self, audio_mode: hypr_listener_interface::AudioMode) -> Result<String, url::ParseError> { + let api_base = self.api_base.as_ref().ok_or_else(|| url::ParseError::EmptyHost)?; + let mut url: url::Url = api_base.parse()?;Also consider using a more robust host detection:
- let host = url.host_str().unwrap(); + let host = url.host_str().ok_or_else(|| url::ParseError::EmptyHost)?;Likely an incorrect or invalid review comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (1)
crates/ws-utils/src/lib.rs (1)
37-46: Apply scaling factor to prevent audio clipping.The audio mixing adds samples without scaling, which can cause clipping. Consider applying a scaling factor as suggested in the previous review.
+/// Scale factor applied to mixed audio to prevent clipping. +const DEFAULT_MIX_SCALE: f32 = 0.9; + fn mix_audio_channels(mic: &[f32], speaker: &[f32]) -> Vec<f32> { let max_len = mic.len().max(speaker.len()); (0..max_len) .map(|i| { let mic_sample = mic.get(i).copied().unwrap_or(0.0); let speaker_sample = speaker.get(i).copied().unwrap_or(0.0); - (mic_sample + speaker_sample).clamp(-1.0, 1.0) + ((mic_sample + speaker_sample) * DEFAULT_MIX_SCALE).clamp(-1.0, 1.0) }) .collect() }
🧹 Nitpick comments (3)
plugins/local-stt/src/server.rs (1)
153-153: Extract the hardcoded buffer size as a constant.The buffer size
16 * 1000is used in multiple places. Consider extracting it as a named constant for better maintainability.+const AUDIO_BUFFER_SIZE: usize = 16 * 1000; + async fn websocket_single_channel( // ... ) { let mut stream = { - let audio_source = hypr_ws_utils::WebSocketAudioSource::new(ws_receiver, 16 * 1000); + let audio_source = hypr_ws_utils::WebSocketAudioSource::new(ws_receiver, AUDIO_BUFFER_SIZE); // ... }; // ... } async fn websocket_dual_channel( // ... ) { let (mic_source, speaker_source) = - hypr_ws_utils::split_dual_audio_sources(ws_receiver, 16 * 1000); + hypr_ws_utils::split_dual_audio_sources(ws_receiver, AUDIO_BUFFER_SIZE); // ... }Also applies to: 220-220
crates/ws-utils/src/lib.rs (1)
121-122: Consider using bounded channels to prevent memory issues.The unbounded channels could cause memory exhaustion if the consumer is slower than the producer.
- let (mic_tx, mic_rx) = unbounded_channel::<Vec<f32>>(); - let (speaker_tx, speaker_rx) = unbounded_channel::<Vec<f32>>(); + const CHANNEL_BUFFER_SIZE: usize = 256; + let (mic_tx, mic_rx) = tokio::sync::mpsc::channel::<Vec<f32>>(CHANNEL_BUFFER_SIZE); + let (speaker_tx, speaker_rx) = tokio::sync::mpsc::channel::<Vec<f32>>(CHANNEL_BUFFER_SIZE);plugins/listener/src/fsm.rs (1)
454-454: Remove or implement the unused metadata handling.The
_metavariable is extracted but never used. Either remove it or implement the intended functionality.- let _meta = result.meta.clone(); - // We don't have to do this, and inefficient. But this is what works at the moment.Or if metadata is needed for future features, add a TODO comment explaining the planned usage.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
crates/audio-utils/src/lib.rs(3 hunks)crates/ws-utils/Cargo.toml(1 hunks)crates/ws-utils/src/lib.rs(3 hunks)plugins/listener/src/client.rs(7 hunks)plugins/listener/src/fsm.rs(9 hunks)plugins/local-stt/src/server.rs(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/ws-utils/Cargo.toml
- crates/audio-utils/src/lib.rs
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{js,ts,tsx,rs}`: 1. No error handling. 2. No unused imports, variables, or functions. 3. For comments, keep it minimal. It should be about "Why", not "What".
**/*.{js,ts,tsx,rs}: 1. No error handling.
2. No unused imports, variables, or functions.
3. For comments, keep it minimal. It should be about "Why", not "What".
⚙️ Source: CodeRabbit Configuration File
List of files the instruction was applied to:
plugins/local-stt/src/server.rscrates/ws-utils/src/lib.rsplugins/listener/src/fsm.rsplugins/listener/src/client.rs
🧬 Code Graph Analysis (1)
plugins/listener/src/fsm.rs (1)
crates/audio-utils/src/lib.rs (1)
f32_to_i16_bytes(51-58)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: ci (windows, windows-latest)
- GitHub Check: ci (macos, macos-latest)
- GitHub Check: ci
🔇 Additional comments (6)
plugins/local-stt/src/server.rs (1)
134-144: LGTM! Clean separation of audio mode handling.The WebSocket split and dispatch pattern based on audio mode is well-structured.
crates/ws-utils/src/lib.rs (1)
127-130: Clarify the handling of single audio samples.When receiving single audio samples, the same data is sent to both mic and speaker channels. This duplicates the audio, which might not be the intended behavior.
Is this duplication intentional? If not, consider sending to only one channel or documenting why duplication is needed.
plugins/listener/src/client.rs (1)
32-58: LGTM! Well-structured URI builder.The method properly handles URL construction with appropriate scheme selection and parameter encoding.
plugins/listener/src/fsm.rs (3)
25-54: Excellent refactoring with AudioSaver!The AudioSaver struct provides a clean abstraction for WAV file operations, eliminating code duplication and improving maintainability.
75-75: Verify the buffer size reduction impact.The
CHUNK_BUFFER_SIZEhas been reduced from 256 to 64 (75% reduction). This could impact performance by causing more frequent channel operations.Has this change been tested under load? Consider documenting the rationale for this reduction or making it configurable.
353-364: Good optimization for recording logic!Computing the mixed audio only when recording is enabled is an excellent optimization that reduces unnecessary CPU usage.
No description provided.