Improve speaker identification using autoresearch/evo approach #1

@ComputelessComputer

Description

Problem

Current speaker identification relies on:

  1. Cosine similarity between speaker embeddings and stored profiles (scoreSpeakerProfile in store.ts)
  2. Hard-coded thresholds — 0.7 (≥3 samples) or 0.74 (<3 samples) minimum score, plus a 0.04 margin gate when score < 0.82
  3. Simple centroid + best-sample scoring — Math.max(bestSampleScore, centroidScore)

This works but the thresholds and scoring strategy were hand-tuned. There is no benchmark to measure how well speaker ID actually performs, and no systematic way to improve it.
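For concreteness, here is a minimal sketch of the scoring and gating described above. This is not the actual store.ts source: the profile shape and helper names (SpeakerProfile, scoreProfile, passesGates) are simplified stand-ins, and only the numeric gates come from the issue text.

```typescript
// Sketch of the current scoring/gating logic; shapes and names are simplified.
interface SpeakerProfile {
  sampleEmbeddings: number[][]; // stored per-sample embeddings
  centroid: number[];           // normalized centroid of the samples
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Simple centroid + best-sample scoring: take the max of the two signals.
function scoreProfile(embedding: number[], profile: SpeakerProfile): number {
  const bestSampleScore = Math.max(
    ...profile.sampleEmbeddings.map((s) => cosineSimilarity(embedding, s)),
  );
  const centroidScore = cosineSimilarity(embedding, profile.centroid);
  return Math.max(bestSampleScore, centroidScore);
}

// Hard-coded gates: 0.7 (>= 3 samples) or 0.74 (< 3 samples) minimum score,
// plus a 0.04 margin over the runner-up when the best score is below 0.82.
function passesGates(best: number, runnerUp: number, sampleCount: number): boolean {
  const minScore = sampleCount >= 3 ? 0.7 : 0.74;
  if (best < minScore) return false;
  if (best < 0.82 && best - runnerUp < 0.04) return false;
  return true;
}
```

Writing the gates out like this makes the tuning surface explicit: every constant above is a candidate parameter for the optimization loop.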

Approach: Autoresearch loop (Evo-style)

Use evo (or its approach) — an autoresearch loop that discovers a benchmark, runs a baseline, then spawns parallel agents to beat it:

  • Tree search over greedy hill-climb — multiple forks from any committed improvement
  • N parallel agents in git worktrees — each tries a different hypothesis
  • Shared failure traces — agents don't repeat each other's mistakes
  • Regression gates — changes that break existing correct matches get discarded

What to measure

Build a benchmark dataset of meetings with ground-truth speaker labels. Metrics:

  • Accuracy — % of speakers correctly identified against known profiles
  • Precision/Recall — false matches vs missed matches
  • Confidence calibration — does a 0.85 confidence actually mean 85% correct?
  • Threshold sensitivity — how much do results change with threshold tweaks?
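The first three metrics can be computed from a simple list of (predicted, actual, confidence) records. A hedged sketch of such a harness follows; the BenchResult shape and field names are hypothetical, not from the codebase.

```typescript
// Hypothetical benchmark record: what the matcher suggested vs ground truth.
interface BenchResult {
  predicted: string | null; // suggested profile ID, or null for "no match"
  actual: string | null;    // ground-truth profile ID, or null for unknown speaker
  confidence: number;       // score the matcher reported
}

function metrics(results: BenchResult[]) {
  let tp = 0, fp = 0, fn = 0, correct = 0;
  for (const r of results) {
    if (r.predicted === r.actual) correct++;
    if (r.predicted !== null && r.predicted === r.actual) tp++;
    if (r.predicted !== null && r.predicted !== r.actual) fp++;
    if (r.predicted === null && r.actual !== null) fn++;
  }
  return {
    accuracy: correct / results.length,
    precision: tp / (tp + fp || 1),
    recall: tp / (tp + fn || 1),
  };
}

// Calibration: bucket results by reported confidence, then compare each
// bucket's empirical accuracy against its nominal confidence range.
function calibrationBins(results: BenchResult[], binWidth = 0.1) {
  const bins = new Map<number, { n: number; correct: number }>();
  for (const r of results) {
    const bin = Math.min(
      Math.floor(r.confidence / binWidth),
      Math.ceil(1 / binWidth) - 1,
    );
    const b = bins.get(bin) ?? { n: 0, correct: 0 };
    b.n++;
    if (r.predicted === r.actual) b.correct++;
    bins.set(bin, b);
  }
  return bins;
}
```

If the 0.85-confidence bin's correct/n ratio is far from 0.85, the scores are miscalibrated, which is exactly what the calibration metric above is meant to catch.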

What to explore

The autoresearch loop should explore improvements across the full stack:

Scoring strategy (store.ts)

  • Weighted combination of centroid + sample scores instead of simple Math.max
  • Top-K sample averaging instead of single best sample
  • Score normalization across profiles (relative ranking vs absolute threshold)
  • Adaptive thresholds based on profile quality (sample count, embedding variance)
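Three of these variants are small enough to sketch directly. The weights and penalty constants below are illustrative starting points for the search, not tuned values.

```typescript
// Top-K sample averaging instead of a single best sample: more robust to one
// lucky (or noisy) stored sample dominating the score.
function topKAverage(sampleScores: number[], k = 3): number {
  const top = [...sampleScores].sort((a, b) => b - a).slice(0, k);
  return top.reduce((s, x) => s + x, 0) / top.length;
}

// Weighted centroid + sample combination instead of Math.max; the weight w
// is itself a parameter the autoresearch loop could tune.
function weightedScore(centroidScore: number, sampleScores: number[], w = 0.6): number {
  return w * centroidScore + (1 - w) * topKAverage(sampleScores);
}

// Adaptive threshold based on profile quality: demand more evidence from
// sparse or high-variance profiles instead of a single global cutoff.
function adaptiveThreshold(sampleCount: number, embeddingVariance: number): number {
  const base = 0.7;
  const sparsityPenalty = sampleCount < 3 ? 0.04 : 0;
  const variancePenalty = Math.min(embeddingVariance * 0.5, 0.05);
  return base + sparsityPenalty + variancePenalty;
}
```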

Embedding quality (Swift layer)

  • Segment selection strategy — which diarization segments to embed (currently all)
  • Minimum segment duration filtering
  • Embedding aggregation — mean vs weighted mean vs attention-pooled centroids
  • Per-sample quality scoring (reject noisy/short segments)
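This layer lives in Swift, but the aggregation idea is language-agnostic; the sketch below is in TypeScript for consistency with the rest of the issue. The duration cutoff and segment shape are assumptions, not values from speech_bridge.swift.

```typescript
// Duration-weighted centroid: a plain mean treats a 0.5 s segment and a 20 s
// segment equally, while weighting by duration lets long, presumably cleaner
// segments dominate.
interface Segment {
  embedding: number[];
  durationSec: number;
}

const MIN_SEGMENT_SEC = 1.0; // hypothetical minimum-duration filter

function weightedCentroid(segments: Segment[]): number[] {
  const kept = segments.filter((s) => s.durationSec >= MIN_SEGMENT_SEC);
  if (kept.length === 0) throw new Error("no usable segments");
  const dim = kept[0].embedding.length;
  const sum = new Array(dim).fill(0);
  let totalWeight = 0;
  for (const s of kept) {
    for (let i = 0; i < dim; i++) sum[i] += s.embedding[i] * s.durationSec;
    totalWeight += s.durationSec;
  }
  const centroid = sum.map((x) => x / totalWeight);
  // Normalize so cosine similarity against it behaves like a unit centroid.
  const norm = Math.sqrt(centroid.reduce((s, x) => s + x * x, 0));
  return centroid.map((x) => x / norm);
}
```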

Profile management

  • Automatic outlier detection in stored samples
  • Profile convergence metrics — when does a profile have "enough" samples?
  • Cross-meeting consistency checks
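A cheap form of the outlier detection above: flag stored samples whose similarity to the profile's own centroid falls well below the profile's typical self-similarity. The standard-deviation cutoff is an illustrative choice.

```typescript
// sampleScores: cosine similarity of each stored sample against the profile
// centroid. Returns indices of samples more than `stdDevs` standard deviations
// below the mean, which are candidates for pruning.
function outlierSamples(sampleScores: number[], stdDevs = 2): number[] {
  const mean = sampleScores.reduce((s, x) => s + x, 0) / sampleScores.length;
  const variance =
    sampleScores.reduce((s, x) => s + (x - mean) ** 2, 0) / sampleScores.length;
  const cutoff = mean - stdDevs * Math.sqrt(variance);
  return sampleScores
    .map((score, i) => ({ score, i }))
    .filter(({ score }) => score < cutoff)
    .map(({ i }) => i);
}
```

The same statistics could feed a convergence metric: a profile whose samples are tightly clustered (low variance) arguably has "enough" samples sooner than a noisy one.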

Matching logic

  • Two-stage matching: fast centroid screen → detailed sample comparison
  • Speaker verification (1:1) vs identification (1:N) distinction
  • Temporal priors — if speaker A was in the last 3 meetings, they're likely in this one
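The two-stage and temporal-prior ideas compose naturally: a cheap centroid screen narrows the field, the expensive per-sample comparison ranks the survivors, and a small recency boost breaks near-ties. Everything below (the Candidate shape, the prior size, the 0.7 floor) is an illustrative assumption.

```typescript
// Two-stage matching with a temporal prior, as a hedged sketch.
interface Candidate {
  id: string;
  centroidScore: number;       // cheap: one cosine similarity per profile
  detailedScore: () => number; // expensive: compare against every stored sample
  recentMeetings: number;      // appearances among the last few meetings
}

function twoStageMatch(candidates: Candidate[], screenTopN = 3): string | null {
  // Stage 1: keep only the top-N profiles by the cheap centroid score.
  const screened = [...candidates]
    .sort((a, b) => b.centroidScore - a.centroidScore)
    .slice(0, screenTopN);
  // Stage 2: detailed scoring, plus a small capped recency boost.
  let best: { id: string; score: number } | null = null;
  for (const c of screened) {
    const prior = Math.min(c.recentMeetings * 0.01, 0.03);
    const score = c.detailedScore() + prior;
    if (!best || score > best.score) best = { id: c.id, score };
  }
  return best && best.score >= 0.7 ? best.id : null;
}
```

Keeping the prior tiny matters: it should only break ties, never override the acoustic evidence, and the regression gate should catch it if it starts causing false matches.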

Current architecture reference

Diarization → Segments with speaker IDs (Swift/CoreML sortformer)
     ↓
Embedding extraction → Per-speaker embedding vectors (Swift speech_bridge)
     ↓
Profile matching → cosine similarity against stored profiles (TS store.ts)
     ↓
Suggestion → recommendSpeakerProfile() returns best match above threshold

Key files:

  • src/store.ts — cosineSimilarity(), scoreSpeakerProfile(), recommendSpeakerProfile(), normalizedEmbeddingCentroid()
  • src-tauri/swift-permissions/src/speech_bridge.swift — embedding extraction, diarization segment processing, centroid computation
  • src-tauri/src/lib.rs — StoredSpeakerProfile, analyze_speaker_embeddings command
  • src-tauri/src/asr.rs — SpeakerEmbeddingPayload, FileSpeakerEmbeddingPayload

Implementation plan

  1. Build benchmark harness — collect ground-truth labeled meetings, define metrics, run baseline
  2. Set up evo — point it at the speaker ID codebase, configure benchmark as the optimization target
  3. Run optimization loop — let parallel agents explore scoring, thresholds, embedding strategies
  4. Gate on regression — any change must not regress existing correct matches
  5. Ship the winner — commit the best-performing configuration
