Build the speaker ID benchmark dataset (dogfood 20 meetings) #3

@ComputelessComputer

Description

Follow-up to #1. Tracks building the dataset the autoresearch loop optimizes against.

Pivoted plan (Apr 17): dropped the synthetic + AMI + user opt-in tiered approach. Just label 20 of our own Char meetings. The labeler lands in #2.

Why the pivot

The tiered dataset plan was academically correct and practically wrong for where Char is today. VoxCeleb optimizes against celebrity interviews. LibriSpeech optimizes against audiobook narrators. AMI gets closer to meetings but still isn't Char audio — different mics, different languages, different acoustic environments.

The users whose experience we want to improve are us. We already record every meeting through Char. We already know who was in the room. Labeling ground truth is a checkbox exercise, not a data-collection project.

The actual plan

  1. Label 20 meetings this week using benchmarks/speaker-id/label.py from #2 (the dogfood benchmark harness for speaker ID)
  2. Land the Swift embedding extractor CLI so extract_embeddings.py can run end-to-end
  3. Commit baseline.json with current main-branch numbers on those 20 meetings
  4. Point Evo at it — start the autoresearch loop against the real benchmark

What we still need

Raw meeting export (blocking the labeler)

If speakers are already labeled in the app, the markdown export has human names instead of Speaker 1/Speaker 2. The matcher's guesses leak into ground truth and contaminate the benchmark.

Options:

  • Add uchar meetings export <id> --raw that dumps pre-labeling diarization JSON to stdout
  • Or only label fresh (never-touched) meetings

The CLI command is the right long-term answer — it also unlocks other tooling.
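Until the raw export exists, the "fresh meetings only" fallback needs a guard. A minimal sketch (the function and its use are hypothetical, not part of the labeler yet): a meeting is safe to put in the benchmark only while its transcript still uses generic diarization labels, because any human name means the matcher already touched it.

```python
import re

# Hypothetical guard: only admit meetings whose transcripts still use
# generic diarization labels ("Speaker 1", "Speaker 2", ...). If a human
# name appears as a speaker, the in-app matcher has already labeled the
# meeting, so its labels can't serve as independent ground truth.
GENERIC_SPEAKER = re.compile(r"^Speaker \d+$")

def is_unlabeled(speaker_names):
    """True if every speaker label is still a generic diarization tag."""
    return all(GENERIC_SPEAKER.match(name) for name in speaker_names)

print(is_unlabeled(["Speaker 1", "Speaker 2"]))  # True: safe to label
print(is_unlabeled(["Speaker 1", "Alice"]))      # False: contaminated
```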

Swift embedding extractor CLI

benchmarks/speaker-id/score.py expects embeddings.json per meeting, shaped like:

```json
{
  "speakers": {
    "Speaker 1": [[...embedding vector...], ...],
    "Speaker 2": [[...], ...]
  }
}
```

These come from the same Swift pipeline Char uses in-app (speech_embed_speaker_audio_file). Two approaches:

  • Reuse in-app code path — expose uchar meetings embed <id> that runs the existing Swift extractor against a meeting's audio + diarization and dumps the JSON
  • Standalone binary — a tiny Swift target that links the same embedding code, runs on an audio file + segment list, prints JSON

First option is cleaner — one code path, one model, no drift.
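Downstream of whichever extractor lands, the matcher presumably compares each turn's embedding against the enrolled speakers' vectors. A toy cosine-similarity sketch of that step (the mean-pooling and the 0.6 threshold are assumptions for illustration, not Char's actual matching logic):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def match_speaker(turn_embedding, enrolled, threshold=0.6):
    """Return the best-matching enrolled speaker, or None ("unknown")
    if no speaker's centroid clears the similarity threshold."""
    best_name, best_sim = None, threshold
    for name, vectors in enrolled.items():
        # Mean-pool each speaker's enrollment vectors into one centroid.
        dim = len(vectors[0])
        centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        sim = cosine(turn_embedding, centroid)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name
```

The threshold is exactly what the unknown_rejection / false_accept metrics below stress-test: too low and strangers get accepted, too high and enrolled speakers get rejected.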

Anti-goals

  • No synthetic corpora. Dropped for good.
  • No scraping public podcasts. Licensing + doesn't match Char conditions.
  • No user-contributed labeling flywheel yet. That's a post-launch product question, not a benchmark question.

Success looks like

A single command against our own labeled meetings that prints:

```
metric              value
n_enrolled_turns      312
n_stranger_turns      197
accuracy             0.XX
unknown_rejection    0.XX
false_accept         0.XX
calibration_error    0.XX
```

Any PR that regresses any number fails CI. That's the bar.
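For concreteness, a toy sketch of how those numbers could fall out of per-turn predictions. The metric definitions here are my reading of the table, not score.py's actual code, and calibration_error is omitted because it needs per-turn confidence scores:

```python
def score(predictions, ground_truth):
    """Compute headline benchmark numbers from per-turn predictions.

    predictions / ground_truth: dicts of turn_id -> speaker name, where
    None means "unknown". Strangers are turns whose true speaker is None.
    """
    enrolled = [t for t, name in ground_truth.items() if name is not None]
    strangers = [t for t, name in ground_truth.items() if name is None]
    correct = sum(predictions[t] == ground_truth[t] for t in enrolled)
    rejected = sum(predictions[t] is None for t in strangers)
    return {
        "n_enrolled_turns": len(enrolled),
        "n_stranger_turns": len(strangers),
        "accuracy": correct / len(enrolled),
        "unknown_rejection": rejected / len(strangers),
        # Every stranger turn is either rejected or falsely accepted.
        "false_accept": 1 - rejected / len(strangers),
    }
```

A CI gate is then a dict comparison against the committed baseline.json: any metric below baseline fails the PR.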
