Build the speaker ID benchmark dataset (dogfood 20 meetings) #3

@ComputelessComputer

Description

Follow-up to #1. Tracks building the dataset the autoresearch loop optimizes against.

Pivoted plan (Apr 17): dropped the synthetic + AMI + user opt-in tiered approach. Just label 20 of our own Char meetings. The labeler lands in #2.

Why the pivot

The tiered dataset plan was academically correct and practically wrong for where Char is today. VoxCeleb optimizes against celebrity interviews. LibriSpeech optimizes against audiobook narrators. AMI gets closer to meetings but still isn't Char audio — different mics, different languages, different acoustic environments.

The users whose experience we want to improve are us. We already record every meeting through Char. We already know who was in the room. Labeling ground truth is a checkbox exercise, not a data-collection project.

The actual plan

  1. Label 20 meetings this week using benchmarks/speaker-id/label.py from #2 (the dogfood benchmark harness for speaker ID)
  2. Land the Swift embedding extractor CLI so extract_embeddings.py can run end-to-end
  3. Commit baseline.json with current main-branch numbers on those 20 meetings
  4. Point Evo at it — start the autoresearch loop against the real benchmark

What we still need

Raw meeting export (blocking the labeler)

If speakers are already labeled in the app, the markdown export has human names instead of Speaker 1/Speaker 2. The matcher's guesses leak into ground truth and contaminate the benchmark.

Options:

  • Add uchar meetings export <id> --raw that dumps pre-labeling diarization JSON to stdout
  • Or only label fresh (never-touched) meetings

The CLI command is the right long-term answer — it also unlocks other tooling.
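Until the raw export exists, the "fresh meetings only" fallback needs a guard. A minimal sketch (the function and its use are hypothetical, not part of the labeler yet): a meeting is safe to put in the benchmark only while its transcript still uses generic diarization labels, because any human name means the matcher already touched it.

```python
import re

# Hypothetical guard: only admit meetings whose transcripts still use
# generic diarization labels ("Speaker 1", "Speaker 2", ...). If a human
# name appears as a speaker, the in-app matcher has already labeled the
# meeting, so its labels can't serve as independent ground truth.
GENERIC_SPEAKER = re.compile(r"^Speaker \d+$")

def is_unlabeled(speaker_names):
    """True if every speaker label is still a generic diarization tag."""
    return all(GENERIC_SPEAKER.match(name) for name in speaker_names)

print(is_unlabeled(["Speaker 1", "Speaker 2"]))  # True: safe to label
print(is_unlabeled(["Speaker 1", "Alice"]))      # False: contaminated
```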

Swift embedding extractor CLI

benchmarks/speaker-id/score.py expects embeddings.json per meeting, shaped like:

```json
{
  "speakers": {
    "Speaker 1": [[...embedding vector...], ...],
    "Speaker 2": [[...], ...]
  }
}
```

These come from the same Swift pipeline Char uses in-app (speech_embed_speaker_audio_file). Two approaches:

  • Reuse in-app code path — expose uchar meetings embed <id> that runs the existing Swift extractor against a meeting's audio + diarization and dumps the JSON
  • Standalone binary — a tiny Swift target that links the same embedding code, runs on an audio file + segment list, prints JSON

First option is cleaner — one code path, one model, no drift.
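Downstream of whichever extractor lands, the matcher presumably compares each turn's embedding against the enrolled speakers' vectors. A toy cosine-similarity sketch of that step (the mean-pooling and the 0.6 threshold are assumptions for illustration, not Char's actual matching logic):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def match_speaker(turn_embedding, enrolled, threshold=0.6):
    """Return the best-matching enrolled speaker, or None ("unknown")
    if no speaker's centroid clears the similarity threshold."""
    best_name, best_sim = None, threshold
    for name, vectors in enrolled.items():
        # Mean-pool each speaker's enrollment vectors into one centroid.
        dim = len(vectors[0])
        centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        sim = cosine(turn_embedding, centroid)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name
```

The threshold is exactly what the unknown_rejection / false_accept metrics below stress-test: too low and strangers get accepted, too high and enrolled speakers get rejected.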

Anti-goals

  • No synthetic corpora. Dropped for good.
  • No scraping public podcasts. Licensing + doesn't match Char conditions.
  • No user-contributed labeling flywheel yet. That's a post-launch product question, not a benchmark question.

Success looks like

A single command against our own labeled meetings that prints:

```
metric              value
n_enrolled_turns      312
n_stranger_turns      197
accuracy             0.XX
unknown_rejection    0.XX
false_accept         0.XX
calibration_error    0.XX
```

Any PR that regresses any number fails CI. That's the bar.
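For concreteness, a toy sketch of how those numbers could fall out of per-turn predictions. The metric definitions here are my reading of the table, not score.py's actual code, and calibration_error is omitted because it needs per-turn confidence scores:

```python
def score(predictions, ground_truth):
    """Compute headline benchmark numbers from per-turn predictions.

    predictions / ground_truth: dicts of turn_id -> speaker name, where
    None means "unknown". Strangers are turns whose true speaker is None.
    """
    enrolled = [t for t, name in ground_truth.items() if name is not None]
    strangers = [t for t, name in ground_truth.items() if name is None]
    correct = sum(predictions[t] == ground_truth[t] for t in enrolled)
    rejected = sum(predictions[t] is None for t in strangers)
    return {
        "n_enrolled_turns": len(enrolled),
        "n_stranger_turns": len(strangers),
        "accuracy": correct / len(enrolled),
        "unknown_rejection": rejected / len(strangers),
        # Every stranger turn is either rejected or falsely accepted.
        "false_accept": 1 - rejected / len(strangers),
    }
```

A CI gate is then a dict comparison against the committed baseline.json: any metric below baseline fails the PR.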
