Follow-up to #1. Tracks building the dataset the autoresearch loop optimizes against.
Pivoted plan (Apr 17): dropped the synthetic + AMI + user opt-in tiered approach. Just label 20 of our own Char meetings. The labeler lands in #2.
Why the pivot
The tiered dataset plan was academically correct and practically wrong for where Char is today. VoxCeleb optimizes against celebrity interviews. LibriSpeech optimizes against audiobook narrators. AMI gets closer to meetings but still isn't Char audio — different mics, different languages, different acoustic environments.
The users whose experience we want to improve are us. We already record every meeting through Char. We already know who was in the room. Labeling ground truth is a checkbox exercise, not a data-collection project.
The actual plan
Land `benchmarks/speaker-id/label.py` from Dogfood benchmark harness for speaker ID #2
Land the Swift embedding extractor CLI so `extract_embeddings.py` can run end-to-end
Commit `baseline.json` with current main-branch numbers on those 20 meetings
Point Evo at it — start the autoresearch loop against the real benchmark
What we still need
Raw meeting export (blocking the labeler)
If speakers are already labeled in the app, the markdown export has human names instead of Speaker 1/Speaker 2. The matcher's guesses leak into ground truth and contaminate the benchmark.
Options:
Add `uchar meetings export <id> --raw` that dumps pre-labeling diarization JSON to stdout
Or only label fresh (never-touched) meetings
The CLI command is the right long-term answer — it also unlocks other tooling.
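If the `--raw` flag lands, the labeler could consume it with a sketch like this. Only the CLI invocation comes from the option above; the JSON shape and the `speakers` field are assumptions:

```python
import json
import subprocess

def export_raw_diarization(meeting_id: str) -> dict:
    """Dump a meeting's pre-labeling diarization via the proposed
    `uchar meetings export <id> --raw` command and parse the JSON."""
    out = subprocess.run(
        ["uchar", "meetings", "export", meeting_id, "--raw"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def is_prelabel(diarization: dict) -> bool:
    # Ground truth is only clean if every speaker is still a generic
    # placeholder ("Speaker 1", "Speaker 2"), never a matcher guess.
    return all(s.startswith("Speaker ") for s in diarization.get("speakers", []))
```

A labeling run would refuse any meeting where `is_prelabel` returns False, which is exactly the contamination the raw export is meant to prevent.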
Swift embedding extractor CLI
`benchmarks/speaker-id/score.py` expects an `embeddings.json` per meeting, shaped like:
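A plausible sketch of that shape (every field name and the vector length here are assumptions, not the real schema from `score.py`):

```python
import json

# Hypothetical embeddings.json for one meeting. The real schema is
# whatever score.py reads; field names and vector length are guesses.
example = {
    "meeting_id": "example-meeting",
    "segments": [
        {
            "start": 0.0,
            "end": 4.2,
            "speaker": "Speaker 1",
            "embedding": [0.12, -0.03, 0.88],  # truncated for illustration
        },
    ],
}
print(json.dumps(example, indent=2))
```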
These come from the same Swift pipeline the unsigned Char build uses in-app (`speech_embed_speaker_audio_file`). Two approaches:
Reuse in-app code path — expose `uchar meetings embed <id>` that runs the existing Swift extractor against a meeting's audio + diarization and dumps the JSON
Standalone binary — a tiny Swift target that links the same embedding code, runs on an audio file + segment list, prints JSON
First option is cleaner — one code path, one model, no drift.
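Under the first option, `extract_embeddings.py` could stay a thin wrapper. A sketch, assuming a hypothetical `uchar meetings embed <id>` subcommand that prints the JSON on stdout:

```python
import json
import pathlib
import subprocess

def embeddings_path(out_dir: str, meeting_id: str) -> pathlib.Path:
    # One embeddings.json per meeting, as score.py expects.
    return pathlib.Path(out_dir) / meeting_id / "embeddings.json"

def dump_embeddings(meeting_id: str, out_dir: str) -> dict:
    """Run the in-app Swift extractor via the proposed CLI and persist
    its JSON output for the benchmark harness."""
    raw = subprocess.run(
        ["uchar", "meetings", "embed", meeting_id],
        capture_output=True, text=True, check=True,
    ).stdout
    path = embeddings_path(out_dir, meeting_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(raw)
    return json.loads(raw)
```

Because the wrapper only shells out, the embedding model never gets a second copy in Python, which is the whole point of the single-code-path choice.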
Anti-goals
No synthetic corpora. Dropped for good.
No scraping public podcasts. Licensing + doesn't match Char conditions.
No user-contributed labeling flywheel yet. That's a post-launch product question, not a benchmark question.
Success looks like
A single command against our own labeled meetings that prints every benchmark number.
Any PR that regresses any number fails CI. That's the bar.
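That gate can be sketched in a few lines. The metric names and the higher-is-better convention below are placeholders, since the real set lives in `score.py`:

```python
def regressions(current: dict, baseline: dict, tol: float = 0.0) -> list:
    """Names of metrics where the current run is worse than baseline.json.
    Assumes every metric is higher-is-better; flip signs as needed."""
    return sorted(
        name for name, base in baseline.items()
        if current.get(name, float("-inf")) < base - tol
    )

# Illustrative numbers only.
baseline = {"speaker_id_accuracy": 0.91, "diarization_f1": 0.84}
current = {"speaker_id_accuracy": 0.93, "diarization_f1": 0.83}
print(regressions(current, baseline))  # -> ['diarization_f1']
```

A CI job would load the committed `baseline.json`, run the scorer on the 20 labeled meetings, and fail whenever the returned list is non-empty.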