This is a minimal extension of OpenAI's Whisper models that adds speaker diarization to transcripts via special <|speakerturn|> tokens. It can be used as a drop-in replacement for whisper.transcribe, with the same API and no extra dependencies.
Simply run the original setup and use the small.en-tdrz model instead of small.en. That's it! 🎉
```
pip install -e .
whisper AUDIO --model small.en-tdrz SAME_CLI_ARGS
```
(the code will auto-download the finetuned checkpoint; see whisper.__init__ for details)
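The CLI above can also be driven from Python. A minimal sketch, assuming an audio file at "audio.wav" and that the tdrz transcript keeps the <|speakerturn|> markers inline in result["text"] (the function name and defaults here are illustrative, not part of the package):

```python
def transcribe_tdrz(audio_path="audio.wav", model_name="small.en-tdrz"):
    """Run tinyDiarize through the standard whisper Python API."""
    import whisper  # same package as stock Whisper; no extra dependencies

    # Drop-in replacement: same load_model/transcribe calls as for small.en.
    # The finetuned tdrz checkpoint is auto-downloaded on first use.
    model = whisper.load_model(model_name)
    return model.transcribe(audio_path)
```

Because the API is unchanged, any existing transcription pipeline should only need the model name swapped.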
You can try it out on videos from YouTube using this notebook
- Speaker diarization is the task of identifying who spoke when in an audio recording. Along with spoken content, it is a key part of creating who-spoke-what transcripts, such as those for podcasts.
- tinyDiarize aims to be a minimal, interpretable extension of original Whisper models (inspired by minGPT) that keeps extra dependencies to a minimum.
- By extending models with special <|speakerturn|> tokens [citations], a key part of the task can be solved cleanly, effectively, and at no extra cost. Stay tuned for details in an upcoming blog post! 📺
- The simplicity (same checkpoint structure, a few-line edit to the inference code) has the added benefit of easy integration into existing ports like whisper.cpp, which runs on MacBooks and iPhones.
- By also releasing reproducible finetuning code, we hope to enable others (or even OpenAI themselves!) to improve performance and extend support (multilingual models, speech translation, etc.).
- Whisper small.en checkpoints were finetuned using HuggingFace Transformers and Datasets. This could be done relatively cheaply with just 30 minutes of training on a single GPU :)
- Code will be released shortly for full reproducibility.
- This repo also contributes a scoring/analysis setup using revdotcom/fstalign that allows interpretable error inspection and side-by-side analysis.
- A blog post and accompanying Jupyter notebooks will be released soon with more details.
Note that this is still WIP and there are a few things to be aware of:
- This was done only for the small.en English model, mainly to demonstrate feasibility.
- Initial tests suggest it's possible to have minimal impact on the models' original accuracy (WER); we flag this as a gotcha here until a more thorough analysis is done.
- Only local diarization (detecting speaker turns) is handled so far. A (TBD) clustering step will be needed to group speaker turns into speakers A/B/C etc.
- Stuff is still quite hacky and subject to change, so bear with us until things are released! 🙏
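To make the local-diarization limitation concrete, here is a hypothetical post-processing sketch that splits a tdrz-style transcript on the <|speakerturn|> token into segments with alternating local labels A/B. The function name and the alternating-label heuristic are assumptions for illustration; a real clustering step would be needed to recover true speaker identities, and alternation breaks down with three or more speakers:

```python
TURN_TOKEN = "<|speakerturn|>"

def split_turns(text):
    """Split a transcript on <|speakerturn|> into (local_label, segment) pairs.

    Labels simply alternate A/B between consecutive turns; they are local
    placeholders, not clustered speaker identities.
    """
    parts = [p.strip() for p in text.split(TURN_TOKEN)]
    parts = [p for p in parts if p]  # drop empty segments at boundaries
    return [(("A", "B")[i % 2], p) for i, p in enumerate(parts)]
```

For example, split_turns("Hi there. <|speakerturn|> Hello!") yields [("A", "Hi there."), ("B", "Hello!")].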
[1] Joint Speech Recognition and Speaker Diarization via Sequence Transduction
[2] Serialized Output Training for End-to-End Overlapped Speech Recognition
[3] Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection
For information on the underlying Whisper model, please refer to the original documentation (release: 20230308).
Code and model weights are released under the MIT License. See LICENSE for further details.

