
Improve post-meeting diarization: audit of Sortformer post-processing + hypotheses #4

@ComputelessComputer

Description


Context

Post-meeting diarization currently runs Sortformer via soniqo/speech-swift and then post-processes with custom Swift code in src-tauri/swift-permissions/src/speech_bridge.swift. The Sortformer model itself is opaque to us, but the post-processing is 100% ours — and that's where most of the wins live.

User said it best: "user defines the speaker unless it's set as automatic." That means the system has two regimes:

  1. Automatic (speakerCount: null) — Sortformer infers the count, we trust it.
  2. User-specified count — Sortformer runs unconstrained, then constrainDiarizedSegments collapses excess clusters down to the requested count.

Regime 2 is where the current logic is weakest. This issue enumerates algorithmic weaknesses I found while reading the post-processing pipeline, ranked by expected impact on diarization error rate (DER).

Cross-meeting speaker identity (voice prints / speaker library) is a separate concern and tracked elsewhere.

Pipeline recap

  1. DiarizationPipeline.diarizeAudioFile (speech_bridge.swift:1121) loads audio at 16kHz and calls SortformerDiarizer.diarize().
  2. constrainDiarizedSegments (speech_bridge.swift:421) collapses excess clusters when requestedSpeakerCount is set.
  3. _speech_diarize_audio_file returns segments keyed by speaker_N.
  4. Later, for per-speaker embedding: selectSpeakerEmbeddingSegments (speech_bridge.swift:351) picks sample segments, SpeakerEmbeddingPipeline embeds them with WeSpeaker, normalizedEmbeddingCentroid averages.

Hypotheses ranked by expected DER impact

H1 — constrainDiarizedSegments reassigns by temporal distance, not acoustic similarity 🔴

Location: speech_bridge.swift:421-488

When Sortformer returns more clusters than the user asked for, we retain the N clusters with the largest total duration and reassign every segment from a dropped cluster to the temporally nearest retained segment (diarizedSegmentDistance = gap in seconds).

This is a heuristic that ignores acoustic evidence. If Alice has a 1-second interjection at t=60 and Bob is speaking t=58-62, Alice's segment gets merged into Bob purely because of time adjacency — regardless of how different their voices sound.

Proposed fix: Use speaker embeddings to drive the reassignment. We already have WeSpeakerModel loaded for the speaker-library flow. For each segment in a dropped cluster:

  1. Embed the segment with WeSpeaker.
  2. Compute cosine similarity against the centroid of each retained cluster.
  3. Assign to the max-similarity cluster.

Fall back to the current temporal heuristic only when a segment is too short to embed reliably (<1 s).
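The reassignment step above can be sketched as follows. All types and helper names here are hypothetical; the real embedding call would go through WeSpeakerModel / SpeakerEmbeddingPipeline, and this only shows the decision logic:

```swift
// Hypothetical sketch of embedding-driven reassignment for constrainDiarizedSegments.
struct Segment { let start: Double; let end: Double; var duration: Double { end - start } }

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
    let na = a.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    let nb = b.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    return dot / max(na * nb, 1e-9)
}

/// Index of the retained cluster the segment should join, or nil when the
/// segment is too short to embed reliably and the temporal fallback applies.
func reassignCluster(segment: Segment,
                     embedding: [Float]?,          // nil when embedding failed
                     centroids: [[Float]],         // one centroid per retained cluster
                     minEmbedDuration: Double = 1.0) -> Int? {
    guard segment.duration >= minEmbedDuration, let emb = embedding else { return nil }
    return centroids.indices.max {
        cosineSimilarity(emb, centroids[$0]) < cosineSimilarity(emb, centroids[$1])
    }
}
```

Returning nil for short segments keeps the temporal heuristic as the explicit fallback path rather than burying it inside the similarity code.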

Why I rank this highest: it's isolated, measurable, and the current implementation is a placeholder. It also has the largest surface area for a tree-search optimizer (evo) to explore — similarity thresholds, min-duration gates, fallback order, centroid update rules.

H2 — No VAD pre-pass, so non-speech audio gets assigned to phantom speakers 🟠

Location: DiarizationPipeline.diarizeAudioFile:1127

Audio goes directly from AudioFileLoader.load into SortformerDiarizer.diarize(). No silence trimming, no voice-activity filtering.

SpeechVAD is already a dependency of speech-swift and is used in the live-transcription path. For offline diarization we ignore it. On recordings with long silences, music, typing, or background chatter, Sortformer tends to invent speakers for non-speech regions.

Proposed fix:

  1. Run SpeechVAD on the full audio first.
  2. Mask or drop non-speech regions before passing to Sortformer.
  3. After diarization, map segments back to original timestamps.

Architectural, not a hill-climb target — ship this by hand.
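Step 3 is the only fiddly part: diarization runs on concatenated speech-only audio, so its timestamps must be re-expanded. A minimal sketch, with hypothetical types:

```swift
// Hypothetical sketch: map a timestamp in the VAD-compacted audio back to the
// original timeline, given the speech regions that were kept (in order).
struct SpeechRegion { let start: Double; let end: Double }

/// `t` is a time in the concatenated speech-only audio; the result is the
/// corresponding time in the original recording.
func mapToOriginal(_ t: Double, speechRegions: [SpeechRegion]) -> Double {
    var consumed = 0.0                      // speech duration already walked past
    for r in speechRegions {
        let len = r.end - r.start
        if t <= consumed + len { return r.start + (t - consumed) }
        consumed += len
    }
    // Past the last region: clamp to the end of the final speech region.
    return speechRegions.last?.end ?? t
}
```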

H3 — Embedding samples are top-K longest, not diverse across the meeting 🟠

Location: selectSpeakerEmbeddingSegments:351

Samples are sorted by duration descending, then truncated to limit. If a speaker has one long monologue near the top of the call and many short contributions later, all K embedding samples come from the monologue. We lose voice variation (rested vs. excited vs. quiet).

Proposed fix: stratified sampling — pick the longest segment, then the earliest, latest, and middle-interval segments, within the min/max duration bounds. This matters more for cross-meeting identity than intra-meeting DER, but it's cheap and correct.
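One possible shape for the selection (names hypothetical; duration bounds omitted for brevity). It takes the longest segment, then the earliest, latest, and mid-meeting ones, de-duplicated:

```swift
// Sketch of stratified sampling to replace the duration-descending sort in
// selectSpeakerEmbeddingSegments. Types and names are hypothetical.
struct SpeakerSegment { let start: Double; let end: Double; var duration: Double { end - start } }

func stratifiedSampleIndices(_ segs: [SpeakerSegment], limit: Int) -> [Int] {
    guard !segs.isEmpty, limit > 0 else { return [] }
    var picked: [Int] = []
    func add(_ i: Int) {
        if !picked.contains(i) && picked.count < limit { picked.append(i) }
    }
    add(segs.indices.max { segs[$0].duration < segs[$1].duration }!)   // longest
    add(segs.indices.min { segs[$0].start < segs[$1].start }!)         // earliest
    add(segs.indices.max { segs[$0].start < segs[$1].start }!)         // latest
    let mid = (segs.map(\.start).min()! + segs.map(\.end).max()!) / 2
    add(segs.indices.min { abs(segs[$0].start - mid) < abs(segs[$1].start - mid) }!)
    return picked
}
```

In the real code any remaining slots up to `limit` would be filled by duration, as today; the sketch only shows the stratification.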

H4 — normalizedEmbeddingCentroid averages raw embeddings before L2-normalizing 🟡

Location: speech_bridge.swift:399-419

Standard practice for speaker embeddings is: L2-normalize each sample first, then average, then L2-normalize the result. The current code sums raw embeddings and normalizes only at the end, which biases the centroid toward high-magnitude samples.

Proposed fix: L2-normalize per sample before summing. One-line change.
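A sketch of the corrected ordering, assuming the same inputs the current function takes:

```swift
// Normalize-then-average centroid: L2-normalize each embedding first, average,
// then L2-normalize the mean, so high-magnitude samples no longer dominate.
func l2Normalized(_ v: [Float]) -> [Float] {
    let norm = v.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    return norm > 0 ? v.map { $0 / norm } : v
}

func normalizedEmbeddingCentroid(_ embeddings: [[Float]]) -> [Float] {
    guard let dim = embeddings.first?.count else { return [] }
    var sum = [Float](repeating: 0, count: dim)
    for e in embeddings.map(l2Normalized) {       // the one-line change: per-sample normalize
        for i in 0..<dim { sum[i] += e[i] }
    }
    return l2Normalized(sum)
}
```

With the current sum-then-normalize order, inputs like [3, 0] and [0, 4] yield a centroid tilted toward the larger-magnitude sample; normalizing first makes each sample contribute a unit vector.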

H5 — No merge of adjacent same-speaker segments 🟡

Sortformer over-segments — same speaker frequently shows up as consecutive turns separated by <0.5 s. No step coalesces these. Mostly a UX issue (inflated speaker-turn counts), marginal DER impact.

Proposed fix: after constrainDiarizedSegments, merge adjacent segments with the same speakerId when the gap is below a small threshold (say 0.3 s).
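The coalescing pass is a single fold over the time-sorted segments (types hypothetical):

```swift
// Merge consecutive same-speaker segments when the gap between them is below
// maxGap seconds. Would run after constrainDiarizedSegments.
struct Turn { var start: Double; var end: Double; let speakerId: Int }

func mergeAdjacentTurns(_ turns: [Turn], maxGap: Double = 0.3) -> [Turn] {
    var out: [Turn] = []
    for t in turns.sorted(by: { $0.start < $1.start }) {
        if let last = out.last, last.speakerId == t.speakerId, t.start - last.end <= maxGap {
            out[out.count - 1].end = max(last.end, t.end)   // absorb into previous turn
        } else {
            out.append(t)
        }
    }
    return out
}
```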

H6 — Redundant 1 s floor in sliceAudio path 🟢

trimmedSpeakerEmbeddingSegment enforces ≥ 1.5 s or ≥ 2.5 s depending on the fallback tier. SpeakerEmbeddingPipeline.analyzeAudioFile then adds a third guard of ≥ 16000 samples (1 s at 16 kHz), which the earlier guards already subsume. Clean up for clarity; no DER impact.

Plan

  1. Ship H2 and H4 as direct fixes — architectural (VAD pre-pass) and correctness (normalization order).
  2. Ship H1 as a concrete implementation, then hand the surface — constrainDiarizedSegments and siblings — to evo with a DER gate against AMI (or a held-out set of real Char recordings) to squeeze further gains.
  3. H3, H5, H6 fold in as free wins during the H1 work.

I'd prefer to do the AMI benchmark harness and DER-gate setup on a macOS host (it needs the Sortformer and WeSpeaker model downloads plus Apple Silicon). Until then, H2 and H4 are shippable today with the existing test suite as the only gate.

Not in scope for this issue: Sortformer itself, cross-meeting speaker identity / voice prints, real-time diarization.
