Problem
Current speaker identification relies on:
- Cosine similarity between speaker embeddings and stored profiles (`scoreSpeakerProfile` in `store.ts`)
- Hard-coded thresholds — 0.7 (≥3 samples) or 0.74 (<3 samples) minimum score, plus a 0.04 margin gate when score < 0.82
- Simple centroid + best-sample scoring — `Math.max(bestSampleScore, centroidScore)`
This works but the thresholds and scoring strategy were hand-tuned. There is no benchmark to measure how well speaker ID actually performs, and no systematic way to improve it.
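The scoring and gating described above can be sketched as follows. This is a simplified stand-in for illustration, not the actual `store.ts` implementation — the `Profile` shape and helper names are assumptions:

```typescript
// Simplified sketch of the current scoring path: cosine similarity,
// Math.max(bestSampleScore, centroidScore), and the hand-tuned
// threshold/margin gates described above.

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface Profile {
  samples: number[][]; // stored per-sample embeddings (hypothetical shape)
  centroid: number[];  // normalized centroid of those samples
}

function scoreProfile(embedding: number[], profile: Profile): number {
  const bestSampleScore = Math.max(
    ...profile.samples.map((s) => cosineSimilarity(embedding, s))
  );
  const centroidScore = cosineSimilarity(embedding, profile.centroid);
  return Math.max(bestSampleScore, centroidScore);
}

function passesThreshold(score: number, sampleCount: number, margin: number): boolean {
  // 0.7 minimum with >= 3 samples, 0.74 with fewer; when the score is
  // below 0.82, the best match must also beat the runner-up by >= 0.04.
  const minScore = sampleCount >= 3 ? 0.7 : 0.74;
  if (score < minScore) return false;
  if (score < 0.82 && margin < 0.04) return false;
  return true;
}
```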
Approach: Autoresearch loop (Evo-style)
Use evo or its approach — an autoresearch loop that discovers a benchmark, runs a baseline, then spawns parallel agents to beat it:
- Tree search rather than greedy hill-climbing — multiple forks can branch from any committed improvement
- N parallel agents in git worktrees — each tries a different hypothesis
- Shared failure traces — agents don't repeat each other's mistakes
- Regression gates — changes that break existing correct matches get discarded
What to measure
Build a benchmark dataset of meetings with ground-truth speaker labels. Metrics:
- Accuracy — % of speakers correctly identified against known profiles
- Precision/Recall — false matches vs missed matches
- Confidence calibration — does a 0.85 confidence actually mean 85% correct?
- Threshold sensitivity — how much do results change with threshold tweaks?
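A minimal harness for the first two metrics might look like this. The `BenchmarkCase` shape is an assumption, not an existing type in the codebase:

```typescript
// Hypothetical benchmark scoring: given ground-truth labels and the
// matcher's predictions (null = "no match"), compute accuracy,
// precision (false matches), and recall (missed matches).

interface BenchmarkCase {
  truth: string | null;     // ground-truth profile id, null if unknown speaker
  predicted: string | null; // matcher output, null if below threshold
}

function scoreBenchmark(cases: BenchmarkCase[]) {
  let correct = 0, predictedCount = 0, truthCount = 0;
  for (const c of cases) {
    if (c.predicted !== null) predictedCount++;
    if (c.truth !== null) truthCount++;
    if (c.predicted !== null && c.predicted === c.truth) correct++;
  }
  return {
    accuracy: correct / cases.length,
    precision: predictedCount ? correct / predictedCount : 1, // penalizes false matches
    recall: truthCount ? correct / truthCount : 1,            // penalizes missed matches
  };
}
```

Confidence calibration would sit on top of this: bucket predictions by reported confidence and compare each bucket's empirical accuracy against it.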
What to explore
The autoresearch loop should explore improvements across the full stack:
Scoring strategy (`store.ts`)
- Weighted combination of centroid + sample scores instead of simple `Math.max`
- Top-K sample averaging instead of single best sample
- Score normalization across profiles (relative ranking vs absolute threshold)
- Adaptive thresholds based on profile quality (sample count, embedding variance)
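The first two variants can be sketched as drop-in replacements for the current max. The weight `w` and `k` values here are hypothetical starting points for the loop to tune:

```typescript
// Weighted blend instead of Math.max(bestSampleScore, centroidScore).
function weightedScore(sampleScores: number[], centroidScore: number, w = 0.6): number {
  const best = Math.max(...sampleScores);
  return w * centroidScore + (1 - w) * best;
}

// Average of the top-K sample scores instead of the single best sample,
// which dampens the effect of one lucky outlier sample.
function topKScore(sampleScores: number[], k = 3): number {
  const top = [...sampleScores].sort((a, b) => b - a).slice(0, k);
  return top.reduce((sum, s) => sum + s, 0) / top.length;
}
```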
Embedding quality (Swift layer)
- Segment selection strategy — which diarization segments to embed (currently all)
- Minimum segment duration filtering
- Embedding aggregation — mean vs weighted mean vs attention-pooled centroids
- Per-sample quality scoring (reject noisy/short segments)
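One aggregation alternative from the list above, sketched in TypeScript for readability (the real centroid computation lives in the Swift layer): weight each segment's embedding by its duration, then L2-normalize:

```typescript
// Duration-weighted centroid: longer diarization segments contribute
// more, on the assumption that they yield more reliable embeddings.
function weightedMeanCentroid(embeddings: number[][], durations: number[]): number[] {
  const dim = embeddings[0].length;
  const centroid: number[] = new Array(dim).fill(0);
  const total = durations.reduce((a, b) => a + b, 0);
  for (let i = 0; i < embeddings.length; i++) {
    const w = durations[i] / total;
    for (let d = 0; d < dim; d++) centroid[d] += w * embeddings[i][d];
  }
  // L2-normalize so cosine similarity behaves like a normalized centroid.
  const norm = Math.sqrt(centroid.reduce((a, v) => a + v * v, 0));
  return centroid.map((v) => v / norm);
}
```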
Profile management
- Automatic outlier detection in stored samples
- Profile convergence metrics — when does a profile have "enough" samples?
- Cross-meeting consistency checks
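A possible outlier check for the first bullet: flag stored samples whose similarity to the profile centroid falls more than `cutoff` standard deviations below the mean. The thresholding scheme is an assumption for the loop to tune:

```typescript
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return indices of samples that sit unusually far from the centroid.
function findOutliers(samples: number[][], centroid: number[], cutoff = 2): number[] {
  const sims = samples.map((s) => cosine(s, centroid));
  const mean = sims.reduce((a, b) => a + b, 0) / sims.length;
  const variance = sims.reduce((a, s) => a + (s - mean) ** 2, 0) / sims.length;
  const std = Math.sqrt(variance);
  return sims
    .map((sim, i) => ({ sim, i }))
    .filter(({ sim }) => sim < mean - cutoff * std)
    .map(({ i }) => i);
}
```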
Matching logic
- Two-stage matching: fast centroid screen → detailed sample comparison
- Speaker verification (1:1) vs identification (1:N) distinction
- Temporal priors — if speaker A was in the last 3 meetings, they're likely in this one
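The two-stage idea can be sketched like this: a cheap centroid screen keeps only the top few candidate profiles, then the expensive per-sample comparison runs on that shortlist. Scoring functions are passed in so the loop can swap strategies; all names here are illustrative:

```typescript
interface Candidate {
  id: string;
  centroid: number[];
  samples: number[][];
}

function twoStageMatch(
  embedding: number[],
  profiles: Candidate[],
  centroidScore: (e: number[], c: number[]) => number,
  sampleScore: (e: number[], samples: number[][]) => number,
  shortlistSize = 3
): { id: string; score: number } | null {
  if (profiles.length === 0) return null;
  // Stage 1: rank all profiles by the cheap centroid score.
  const shortlist = profiles
    .map((p) => ({ p, s: centroidScore(embedding, p.centroid) }))
    .sort((a, b) => b.s - a.s)
    .slice(0, shortlistSize);
  // Stage 2: detailed per-sample scoring on the shortlist only.
  let best: { id: string; score: number } | null = null;
  for (const { p } of shortlist) {
    const s = sampleScore(embedding, p.samples);
    if (!best || s > best.score) best = { id: p.id, score: s };
  }
  return best;
}
```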
Current architecture reference
```
Diarization → Segments with speaker IDs (Swift/CoreML sortformer)
  ↓
Embedding extraction → Per-speaker embedding vectors (Swift speech_bridge)
  ↓
Profile matching → cosine similarity against stored profiles (TS store.ts)
  ↓
Suggestion → recommendSpeakerProfile() returns best match above threshold
```
Key files:
- `src/store.ts` — `cosineSimilarity()`, `scoreSpeakerProfile()`, `recommendSpeakerProfile()`, `normalizedEmbeddingCentroid()`
- `src-tauri/swift-permissions/src/speech_bridge.swift` — embedding extraction, diarization segment processing, centroid computation
- `src-tauri/src/lib.rs` — `StoredSpeakerProfile`, `analyze_speaker_embeddings` command
- `src-tauri/src/asr.rs` — `SpeakerEmbeddingPayload`, `FileSpeakerEmbeddingPayload`
Implementation plan
- Build benchmark harness — collect ground-truth labeled meetings, define metrics, run baseline
- Set up evo — point it at the speaker ID codebase, configure benchmark as the optimization target
- Run optimization loop — let parallel agents explore scoring, thresholds, embedding strategies
- Gate on regression — any change must not regress existing correct matches
- Ship the winner — commit the best-performing configuration
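The regression gate in step 4 could be as simple as a set-containment check over benchmark cases; the `RunResult` shape is a hypothetical harness type:

```typescript
// A candidate change is accepted only if it keeps every match the
// baseline already got right (and ideally gains new ones).
interface RunResult {
  correctIds: Set<string>; // benchmark case ids matched correctly
}

function passesRegressionGate(baseline: RunResult, candidate: RunResult): boolean {
  for (const id of baseline.correctIds) {
    if (!candidate.correctIds.has(id)) return false;
  }
  return true;
}
```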