Context
Post-meeting diarization currently runs Sortformer via soniqo/speech-swift and then post-processes with custom Swift code in src-tauri/swift-permissions/src/speech_bridge.swift. The Sortformer model itself is opaque to us, but the post-processing is 100% ours — and that's where most of the wins live.
User said it best: "user defines the speaker unless it's set as automatic." That means the system has two regimes:
- Automatic (`speakerCount: null`) — Sortformer infers the count, we trust it.
- User-specified count — Sortformer runs unconstrained, then `constrainDiarizedSegments` collapses excess clusters down to the requested count.
Regime 2 is where the current logic is weakest. This issue enumerates algorithmic weaknesses I found while reading the post-processing pipeline, ranked by expected impact on diarization error rate (DER).
Cross-meeting speaker identity (voice prints / speaker library) is a separate concern and tracked elsewhere.
Pipeline recap
- `DiarizationPipeline.diarizeAudioFile` (speech_bridge.swift:1121) loads audio at 16 kHz and calls `SortformerDiarizer.diarize()`.
- `constrainDiarizedSegments` (speech_bridge.swift:421) collapses excess clusters when `requestedSpeakerCount` is set.
- `_speech_diarize_audio_file` returns segments keyed by `speaker_N`.
- Later, for per-speaker embedding: `selectSpeakerEmbeddingSegments` (speech_bridge.swift:351) picks sample segments, `SpeakerEmbeddingPipeline` embeds them with WeSpeaker, and `normalizedEmbeddingCentroid` averages the results.
Hypotheses ranked by expected DER impact
H1 — `constrainDiarizedSegments` reassigns by temporal distance, not acoustic similarity 🔴
Location: `speech_bridge.swift:421-488`
When Sortformer returns more clusters than the user asked for, we retain the N clusters with the largest total duration and reassign every segment from a dropped cluster to the temporally nearest retained segment (`diarizedSegmentDistance` = gap in seconds).
This is a heuristic that ignores acoustic evidence. If Alice has a 1-second interjection at t=60 and Bob is speaking t=58-62, Alice's segment gets merged into Bob purely because of time adjacency — regardless of how different their voices sound.
Proposed fix: Use speaker embeddings to drive the reassignment. We already have `WeSpeakerModel` loaded for the speaker-library flow. For each segment in a dropped cluster:
- Embed the segment with WeSpeaker.
- Compute cosine similarity against the centroid of each retained cluster.
- Assign to the max-similarity cluster.
Fall back to the current temporal heuristic only when a segment is too short to embed reliably (<1 s).
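The reassignment step could look roughly like this. A minimal sketch: `Segment`, `reassignSpeaker`, and the `embed` closure are illustrative stand-ins, not the actual types in `speech_bridge.swift`; `embed` abstracts whatever wrapper we end up putting around the WeSpeaker call.

```swift
// Hypothetical segment shape; the real type lives in speech_bridge.swift.
struct Segment {
    let start: Double
    let end: Double
    var duration: Double { end - start }
}

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map(*).reduce(0, +)
    let na = (a.map { $0 * $0 }.reduce(0, +)).squareRoot()
    let nb = (b.map { $0 * $0 }.reduce(0, +)).squareRoot()
    return dot / max(na * nb, .leastNormalMagnitude)
}

/// Pick the retained cluster whose centroid is most similar to the
/// segment's embedding; fall back to the temporal heuristic when the
/// segment is too short to embed reliably.
func reassignSpeaker(
    for segment: Segment,
    centroids: [Int: [Float]],             // retained speakerId -> centroid
    embed: (Segment) -> [Float]?,          // stand-in for the WeSpeaker call
    temporalFallback: (Segment) -> Int     // current nearest-in-time logic
) -> Int {
    guard segment.duration >= 1.0, let e = embed(segment),
          let best = centroids.max(by: {
              cosineSimilarity(e, $0.value) < cosineSimilarity(e, $1.value)
          })
    else { return temporalFallback(segment) }
    return best.key
}
```

The 1 s gate and the choice of plain cosine (vs. a similarity threshold below which we still fall back) are exactly the knobs evo would tune.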
Why I rank this highest: it's isolated, measurable, and the current implementation is a placeholder. It also has the largest surface area for a tree-search optimizer (evo) to explore — similarity thresholds, min-duration gates, fallback order, centroid update rules.
H2 — No VAD pre-pass, so non-speech audio gets assigned to phantom speakers 🟠
Location: `DiarizationPipeline.diarizeAudioFile:1127`
Audio goes directly from `AudioFileLoader.load` into `SortformerDiarizer.diarize()`. No silence trimming, no voice-activity filtering.
`SpeechVAD` is already a dependency of `speech-swift` and is used in the live-transcription path. For offline diarization we ignore it. On recordings with long silences, music, typing, or background chatter, Sortformer tends to invent speakers for non-speech regions.
Proposed fix:
- Run `SpeechVAD` on the full audio first.
- Mask or drop non-speech regions before passing to Sortformer.
- After diarization, map segments back to original timestamps.
Architectural, not a hill-climb target — ship this by hand.
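The timestamp-mapping step is the only subtle part. Assuming we go the "drop and concatenate" route (rather than masking in place), a sketch of the inverse mapping; `SpeechRegion` is an assumed shape for the VAD output, not the library's actual type:

```swift
// Speech regions kept by VAD, expressed in original-recording time.
struct SpeechRegion {
    let start: Double
    let end: Double
}

/// Given a timestamp in the concatenated speech-only audio, recover the
/// corresponding timestamp in the original recording by walking the
/// kept regions and accounting for the dropped gaps between them.
func toOriginalTime(_ t: Double, regions: [SpeechRegion]) -> Double {
    var consumed = 0.0                      // speech-only seconds seen so far
    for r in regions {
        let len = r.end - r.start
        if t <= consumed + len {
            return r.start + (t - consumed) // falls inside this region
        }
        consumed += len
    }
    return regions.last.map { $0.end } ?? t // clamp past-the-end timestamps
}
```

Segment starts and ends each get mapped independently; a diarized segment can never straddle a dropped gap because the gap was removed before Sortformer saw the audio.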
H3 — Embedding samples are top-K longest, not diverse across the meeting 🟠
Location: `selectSpeakerEmbeddingSegments:351`
Samples are sorted by duration descending, then truncated to `limit`. If a speaker has one long monologue near the top of the call and many short contributions later, all K embedding samples come from the monologue. We lose voice variation (rested vs. excited vs. quiet).
Proposed fix: stratified sampling — pick the longest segment, then the earliest, latest, and middle-interval segments, within the min/max duration bounds. This matters more for cross-meeting identity than intra-meeting DER, but it's cheap and correct.
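A sketch of the stratified selection, under the stated assumptions: `Seg` and `stratifiedSamples` are illustrative names, and "middle" is taken as the median-by-start-time segment.

```swift
// Illustrative segment shape; the real one lives in speech_bridge.swift.
struct Seg {
    let start: Double
    let end: Double
    var duration: Double { end - start }
}

/// Pick up to `limit` segments: the longest one, plus the earliest,
/// middle, and latest eligible segments, then top up with the
/// next-longest until the limit is reached.
func stratifiedSamples(_ segs: [Seg], limit: Int,
                       minDuration: Double, maxDuration: Double) -> [Seg] {
    let eligible = segs.filter {
        $0.duration >= minDuration && $0.duration <= maxDuration
    }
    guard eligible.count > limit else { return eligible }
    let byTime = eligible.sorted { $0.start < $1.start }
    let candidates = [
        eligible.max { $0.duration < $1.duration }!,   // longest anchors the set
        byTime.first!,                                 // earliest
        byTime[byTime.count / 2],                      // middle of the meeting
        byTime.last!,                                  // latest
    ]
    var picked: [Seg] = []
    for c in candidates where picked.count < limit
        && !picked.contains(where: { $0.start == c.start }) {
        picked.append(c)
    }
    // Fill any remaining slots with the next-longest segments.
    for c in eligible.sorted(by: { $0.duration > $1.duration })
        where picked.count < limit
        && !picked.contains(where: { $0.start == c.start }) {
        picked.append(c)
    }
    return picked
}
```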
H4 — `normalizedEmbeddingCentroid` averages raw embeddings before L2-normalizing 🟡
Location: `speech_bridge.swift:399-419`
Standard practice for speaker embeddings is: L2-normalize each sample first, then average, then L2-normalize the result. The current code sums raw embeddings and normalizes only at the end, which biases the centroid toward high-magnitude samples.
Proposed fix: L2-normalize per sample before summing. One-line change.
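For concreteness, a sketch of the corrected centroid (function names are illustrative, not the exact signatures in `speech_bridge.swift`):

```swift
func l2Normalized(_ v: [Float]) -> [Float] {
    let norm = (v.map { $0 * $0 }.reduce(0, +)).squareRoot()
    guard norm > 0 else { return v }
    return v.map { $0 / norm }
}

/// L2-normalize each embedding first, then average, then L2-normalize
/// the mean — so every sample contributes equally regardless of its
/// raw magnitude.
func normalizedEmbeddingCentroid(_ embeddings: [[Float]]) -> [Float]? {
    guard let dim = embeddings.first?.count else { return nil }
    var sum = [Float](repeating: 0, count: dim)
    for e in embeddings.map(l2Normalized) {   // normalize per sample first
        for i in 0..<dim { sum[i] += e[i] }
    }
    return l2Normalized(sum)                  // then normalize the mean
}
```

With the current raw-sum order, a centroid of `[2, 0]` and `[0, 1]` leans toward the first sample; with per-sample normalization both directions contribute equally.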
H5 — No merge of adjacent same-speaker segments 🟡
Sortformer over-segments — same speaker frequently shows up as consecutive turns separated by <0.5 s. No step coalesces these. Mostly a UX issue (inflated speaker-turn counts), marginal DER impact.
Proposed fix: after `constrainDiarizedSegments`, merge adjacent segments with the same `speakerId` when the gap is below a small threshold (say 0.3 s).
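A minimal sketch of the coalescing pass; `Turn` and its field names are assumptions about the segment type, and 0.3 s is the placeholder threshold from above:

```swift
// Illustrative turn shape; field names are assumed.
struct Turn {
    var start: Double
    var end: Double
    let speakerId: Int
}

/// Merge consecutive turns from the same speaker when the silence
/// between them is at most `maxGap` seconds.
func mergeAdjacent(_ turns: [Turn], maxGap: Double = 0.3) -> [Turn] {
    var merged: [Turn] = []
    for t in turns.sorted(by: { $0.start < $1.start }) {
        if var last = merged.last,
           last.speakerId == t.speakerId,
           t.start - last.end <= maxGap {
            last.end = max(last.end, t.end)    // absorb into previous turn
            merged[merged.count - 1] = last
        } else {
            merged.append(t)
        }
    }
    return merged
}
```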
H6 — Redundant 1 s floor in `sliceAudio` path 🟢
`trimmedSpeakerEmbeddingSegment` enforces ≥1.5 s or ≥2.5 s depending on fallback tier. Then `SpeakerEmbeddingPipeline.analyzeAudioFile` adds a third `>= 16000` samples (1 s) guard. The earlier guards already subsume this. Clean up for clarity, no DER impact.
Plan
- Ship H2 and H4 as direct fixes — architectural (VAD pre-pass) and correctness (normalization order).
- Ship H1 as a concrete implementation, then hand the surface — `constrainDiarizedSegments` and siblings — to evo with a DER gate against AMI (or a held-out set of real Char recordings) to squeeze further gains.
- H3, H5, H6 fold in as free wins during the H1 work.
I'd prefer to do the AMI benchmark harness + DER gate setup on a macOS host (needs Sortformer + WeSpeaker model downloads, Apple Silicon). Before that, H2 and H4 are shippable today with the existing test suite as the only gate.
Not in scope for this issue: Sortformer itself, cross-meeting speaker identity / voice prints, real-time diarization.