Context
Post-meeting diarization currently runs Sortformer via soniqo/speech-swift and then post-processes with custom Swift code in src-tauri/swift-permissions/src/speech_bridge.swift. The Sortformer model itself is opaque to us, but the post-processing is 100% ours — and that's where most of the wins live.
User said it best: "user defines the speaker unless it's set as automatic." That means the system has two regimes:
- Automatic (`speakerCount: null`) — Sortformer infers the count, we trust it.
- User-specified count — Sortformer runs unconstrained, then `constrainDiarizedSegments` collapses excess clusters down to the requested count.
Regime 2 is where the current logic is weakest. This issue enumerates algorithmic weaknesses I found while reading the post-processing pipeline, ranked by expected impact on diarization error rate (DER).
Cross-meeting speaker identity (voice prints / speaker library) is a separate concern and tracked elsewhere.
Pipeline recap
- `DiarizationPipeline.diarizeAudioFile` (speech_bridge.swift:1121) loads audio at 16 kHz and calls `SortformerDiarizer.diarize()`.
- `constrainDiarizedSegments` (speech_bridge.swift:421) collapses excess clusters when `requestedSpeakerCount` is set.
- `_speech_diarize_audio_file` returns segments keyed by `speaker_N`.
- Later, for per-speaker embedding: `selectSpeakerEmbeddingSegments` (speech_bridge.swift:351) picks sample segments, `SpeakerEmbeddingPipeline` embeds them with WeSpeaker, and `normalizedEmbeddingCentroid` averages the results.
Hypotheses ranked by expected DER impact
H1 — `constrainDiarizedSegments` reassigns by temporal distance, not acoustic similarity 🔴
Location: `speech_bridge.swift:421-488`
When Sortformer returns more clusters than the user asked for, we retain the N clusters with the largest total duration and reassign every segment from a dropped cluster to the temporally nearest retained segment (`diarizedSegmentDistance` = gap in seconds).
This is a heuristic that ignores acoustic evidence. If Alice has a 1-second interjection at t=60 and Bob is speaking t=58-62, Alice's segment gets merged into Bob purely because of time adjacency — regardless of how different their voices sound.
Proposed fix: Use speaker embeddings to drive the reassignment. We already have `WeSpeakerModel` loaded for the speaker-library flow. For each segment in a dropped cluster:
- Embed the segment with WeSpeaker.
- Compute cosine similarity against the centroid of each retained cluster.
- Assign to the max-similarity cluster.
Fall back to the current temporal heuristic only when a segment is too short to embed reliably (<1 s).
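The reassignment step could look roughly like this. A minimal sketch: `Segment`, `reassignSpeaker`, and the `embed` closure are illustrative stand-ins, not the actual types in `speech_bridge.swift`; `embed` abstracts whatever wrapper we end up putting around the WeSpeaker call.

```swift
// Hypothetical segment shape; the real type lives in speech_bridge.swift.
struct Segment {
    let start: Double
    let end: Double
    var duration: Double { end - start }
}

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map(*).reduce(0, +)
    let na = (a.map { $0 * $0 }.reduce(0, +)).squareRoot()
    let nb = (b.map { $0 * $0 }.reduce(0, +)).squareRoot()
    return dot / max(na * nb, .leastNormalMagnitude)
}

/// Pick the retained cluster whose centroid is most similar to the
/// segment's embedding; fall back to the temporal heuristic when the
/// segment is too short to embed reliably.
func reassignSpeaker(
    for segment: Segment,
    centroids: [Int: [Float]],             // retained speakerId -> centroid
    embed: (Segment) -> [Float]?,          // stand-in for the WeSpeaker call
    temporalFallback: (Segment) -> Int     // current nearest-in-time logic
) -> Int {
    guard segment.duration >= 1.0, let e = embed(segment),
          let best = centroids.max(by: {
              cosineSimilarity(e, $0.value) < cosineSimilarity(e, $1.value)
          })
    else { return temporalFallback(segment) }
    return best.key
}
```

The 1 s gate and the choice of plain cosine (vs. a similarity threshold below which we still fall back) are exactly the knobs evo would tune.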
Why I rank this highest: it's isolated, measurable, and the current implementation is a placeholder. It also has the largest surface area for a tree-search optimizer (evo) to explore — similarity thresholds, min-duration gates, fallback order, centroid update rules.
H2 — No VAD pre-pass, so non-speech audio gets assigned to phantom speakers 🟠
Location: `DiarizationPipeline.diarizeAudioFile:1127`
Audio goes directly from `AudioFileLoader.load` into `SortformerDiarizer.diarize()`. No silence trimming, no voice-activity filtering.
`SpeechVAD` is already a dependency of `speech-swift` and is used in the live-transcription path. For offline diarization we ignore it. On recordings with long silences, music, typing, or background chatter, Sortformer tends to invent speakers for non-speech regions.
Proposed fix:
- Run `SpeechVAD` on the full audio first.
- Mask or drop non-speech regions before passing to Sortformer.
- After diarization, map segments back to original timestamps.
Architectural, not a hill-climb target — ship this by hand.
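The timestamp-mapping step is the only subtle part. Assuming we go the "drop and concatenate" route (rather than masking in place), a sketch of the inverse mapping; `SpeechRegion` is an assumed shape for the VAD output, not the library's actual type:

```swift
// Speech regions kept by VAD, expressed in original-recording time.
struct SpeechRegion {
    let start: Double
    let end: Double
}

/// Given a timestamp in the concatenated speech-only audio, recover the
/// corresponding timestamp in the original recording by walking the
/// kept regions and accounting for the dropped gaps between them.
func toOriginalTime(_ t: Double, regions: [SpeechRegion]) -> Double {
    var consumed = 0.0                      // speech-only seconds seen so far
    for r in regions {
        let len = r.end - r.start
        if t <= consumed + len {
            return r.start + (t - consumed) // falls inside this region
        }
        consumed += len
    }
    return regions.last.map { $0.end } ?? t // clamp past-the-end timestamps
}
```

Segment starts and ends each get mapped independently; a diarized segment can never straddle a dropped gap because the gap was removed before Sortformer saw the audio.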
H3 — Embedding samples are top-K longest, not diverse across the meeting 🟠
Location: `selectSpeakerEmbeddingSegments:351`
Samples are sorted by duration descending, then truncated to `limit`. If a speaker has one long monologue near the top of the call and many short contributions later, all K embedding samples come from the monologue. We lose voice variation (rested vs. excited vs. quiet).
Proposed fix: stratified sampling — pick the longest segment, then the earliest, latest, and middle-interval segments, within the min/max duration bounds. This matters more for cross-meeting identity than intra-meeting DER, but it's cheap and correct.
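A sketch of the stratified selection, under the stated assumptions: `Seg` and `stratifiedSamples` are illustrative names, and "middle" is taken as the median-by-start-time segment.

```swift
// Illustrative segment shape; the real one lives in speech_bridge.swift.
struct Seg {
    let start: Double
    let end: Double
    var duration: Double { end - start }
}

/// Pick up to `limit` segments: the longest one, plus the earliest,
/// middle, and latest eligible segments, then top up with the
/// next-longest until the limit is reached.
func stratifiedSamples(_ segs: [Seg], limit: Int,
                       minDuration: Double, maxDuration: Double) -> [Seg] {
    let eligible = segs.filter {
        $0.duration >= minDuration && $0.duration <= maxDuration
    }
    guard eligible.count > limit else { return eligible }
    let byTime = eligible.sorted { $0.start < $1.start }
    let candidates = [
        eligible.max { $0.duration < $1.duration }!,   // longest anchors the set
        byTime.first!,                                 // earliest
        byTime[byTime.count / 2],                      // middle of the meeting
        byTime.last!,                                  // latest
    ]
    var picked: [Seg] = []
    for c in candidates where picked.count < limit
        && !picked.contains(where: { $0.start == c.start }) {
        picked.append(c)
    }
    // Fill any remaining slots with the next-longest segments.
    for c in eligible.sorted(by: { $0.duration > $1.duration })
        where picked.count < limit
        && !picked.contains(where: { $0.start == c.start }) {
        picked.append(c)
    }
    return picked
}
```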
H4 — `normalizedEmbeddingCentroid` averages raw embeddings before L2-normalizing 🟡
Location: `speech_bridge.swift:399-419`
Standard practice for speaker embeddings is: L2-normalize each sample first, then average, then L2-normalize the result. The current code sums raw embeddings and normalizes only at the end, which biases the centroid toward high-magnitude samples.
Proposed fix: L2-normalize per sample before summing. One-line change.
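For concreteness, a sketch of the corrected centroid (function names are illustrative, not the exact signatures in `speech_bridge.swift`):

```swift
func l2Normalized(_ v: [Float]) -> [Float] {
    let norm = (v.map { $0 * $0 }.reduce(0, +)).squareRoot()
    guard norm > 0 else { return v }
    return v.map { $0 / norm }
}

/// L2-normalize each embedding first, then average, then L2-normalize
/// the mean — so every sample contributes equally regardless of its
/// raw magnitude.
func normalizedEmbeddingCentroid(_ embeddings: [[Float]]) -> [Float]? {
    guard let dim = embeddings.first?.count else { return nil }
    var sum = [Float](repeating: 0, count: dim)
    for e in embeddings.map(l2Normalized) {   // normalize per sample first
        for i in 0..<dim { sum[i] += e[i] }
    }
    return l2Normalized(sum)                  // then normalize the mean
}
```

With the current raw-sum order, a centroid of `[2, 0]` and `[0, 1]` leans toward the first sample; with per-sample normalization both directions contribute equally.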
H5 — No merge of adjacent same-speaker segments 🟡
Sortformer over-segments — same speaker frequently shows up as consecutive turns separated by <0.5 s. No step coalesces these. Mostly a UX issue (inflated speaker-turn counts), marginal DER impact.
Proposed fix: after `constrainDiarizedSegments`, merge adjacent segments with the same `speakerId` when the gap is below a small threshold (say 0.3 s).
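A minimal sketch of the coalescing pass; `Turn` and its field names are assumptions about the segment type, and 0.3 s is the placeholder threshold from above:

```swift
// Illustrative turn shape; field names are assumed.
struct Turn {
    var start: Double
    var end: Double
    let speakerId: Int
}

/// Merge consecutive turns from the same speaker when the silence
/// between them is at most `maxGap` seconds.
func mergeAdjacent(_ turns: [Turn], maxGap: Double = 0.3) -> [Turn] {
    var merged: [Turn] = []
    for t in turns.sorted(by: { $0.start < $1.start }) {
        if var last = merged.last,
           last.speakerId == t.speakerId,
           t.start - last.end <= maxGap {
            last.end = max(last.end, t.end)    // absorb into previous turn
            merged[merged.count - 1] = last
        } else {
            merged.append(t)
        }
    }
    return merged
}
```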
H6 — Redundant 1 s floor in `sliceAudio` path 🟢
`trimmedSpeakerEmbeddingSegment` enforces ≥1.5 s or ≥2.5 s depending on fallback tier. Then `SpeakerEmbeddingPipeline.analyzeAudioFile` adds a third `>= 16000` samples (1 s) guard. The earlier guards already subsume this. Clean up for clarity, no DER impact.
Plan
- Ship H2 and H4 as direct fixes — architectural (VAD pre-pass) and correctness (normalization order).
- Ship H1 as a concrete implementation, then hand the surface — `constrainDiarizedSegments` and siblings — to evo with a DER gate against AMI (or a held-out set of real Char recordings) to squeeze further gains.
- H3, H5, H6 fold in as free wins during the H1 work.
I'd prefer to do the AMI benchmark harness + DER gate setup on a macOS host (needs Sortformer + WeSpeaker model downloads, Apple Silicon). Before that, H2 and H4 are shippable today with the existing test suite as the only gate.
Not in scope for this issue: Sortformer itself, cross-meeting speaker identity / voice prints, real-time diarization.