tinyDiarize 🐥🗣️

This is a minimal extension of OpenAI's Whisper models that adds speaker diarization to transcripts via special <|speakerturn|> tokens. It can be used as a drop-in replacement for whisper.transcribe with the same API, and requires no extra dependencies.


Quickstart

Simply run the original setup and use the small.en-tdrz model instead of small.en. That's it! 🎉

pip install -e .
whisper AUDIO --model small.en-tdrz SAME_CLI_ARGS

(The code will auto-download the finetuned checkpoint; see whisper.__init__ for details.)
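
For reference, here is a minimal sketch of the equivalent Python usage. It assumes the standard whisper Python API (load_model / transcribe) carries over unchanged, as the drop-in claim above suggests; audio.wav is a placeholder path.

import whisper

# Load the finetuned tinyDiarize checkpoint (auto-downloaded on first use).
model = whisper.load_model("small.en-tdrz")

# Transcribe as usual; speaker turns are marked with the special
# <|speakerturn|> token (exactly how they surface in the output may
# differ; see the repo for details).
result = model.transcribe("audio.wav")
print(result["text"])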

You can try it out on YouTube videos using the accompanying Colab notebook.

[Demo video: demo_video-trim.mp4]

Why do this?

  • Speaker diarization is the task of identifying who spoke when in an audio recording. Along with spoken content, it is a key part of creating who-spoke-what transcripts, such as those for podcasts.
  • tinyDiarize aims to be a minimal, interpretable extension of original Whisper models (inspired by minGPT) that keeps extra dependencies to a minimum.
  • By extending models with special <|speakerturn|> tokens [1–3], a key part of the task can be solved cleanly, effectively, and at no extra cost (see the sketch after this list). Stay tuned for details in an upcoming blog post! 📺
  • The simplicity (same checkpoint structure, only a few lines of inference code changed) has the added benefit of easy integration into existing ports like whisper.cpp, which runs on MacBooks and iPhones.
  • By also releasing reproducible finetuning code, we hope to enable others (or even OpenAI themselves!) to improve performance and extend support (multilingual models, speech translation, etc.).
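
As an illustration of how the special token can be consumed downstream, here is a minimal sketch that splits a transcript on the <|speakerturn|> marker to recover per-turn text. It assumes the marker appears verbatim in the transcript string (a simplifying assumption; the actual output format may differ), and the example transcript is hypothetical.

SPEAKER_TURN = "<|speakerturn|>"

def split_turns(transcript: str) -> list[str]:
    # Split the transcript at each speaker-turn marker and tidy whitespace.
    return [part.strip() for part in transcript.split(SPEAKER_TURN) if part.strip()]

# Example (hypothetical transcript):
text = "Hello, thanks for joining. <|speakerturn|> Happy to be here!"
print(split_turns(text))
# ['Hello, thanks for joining.', 'Happy to be here!']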

More info

  • Whisper small.en checkpoints were finetuned using HuggingFace Transformers and Datasets. This could be done relatively cheaply, with just ~30 minutes of training on a single GPU :)
  • Code will be released shortly for full reproducibility.
  • Another important contribution of this repo is a scoring/analysis setup using revdotcom/fstalign that allows interpretable error inspection and side-by-side analysis.
  • A blog post and accompanying Jupyter notebooks will be released soon with more details.

[metrics figure]

Gotchas

Note that this is still a work in progress, and there are a few things to be aware of:

  • This was done only for the small.en English model, mainly to demonstrate feasibility.
  • Initial tests show it is possible to have minimal impact on the original accuracy (WER) of the models; we list this under gotchas until a more thorough analysis is done.
  • Only local diarization (speaker turns) is handled so far. A (TBD) clustering step will be needed to group speaker turns into speaker A/B/C etc.; see the illustrative sketch after this list.
  • Stuff is still quite hacky and subject to change, so bear with us until things are released! 🙏
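
To illustrate what that grouping step would produce in the simplest possible case, here is a purely hypothetical sketch that assigns alternating turns to speakers A and B. A real clustering step would instead use speaker characteristics (e.g. embeddings) and handle any number of speakers; this is only meant to show the shape of the output.

from itertools import cycle

def label_turns(turns: list[str], speakers=("A", "B")) -> list[tuple[str, str]]:
    # Naively assume exactly two speakers who strictly alternate;
    # real diarization would cluster turns by speaker instead.
    return list(zip(cycle(speakers), turns))

turns = ["Hello, thanks for joining.", "Happy to be here!", "Let's get started."]
for speaker, text in label_turns(turns):
    print(f"Speaker {speaker}: {text}")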

References

[1] Joint Speech Recognition and Speaker Diarization via Sequence Transduction
[2] Serialized Output Training for End-to-End Overlapped Speech Recognition
[3] Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection

For information on the underlying Whisper model, please refer to the original documentation (release: 20230308).

License

Code and model weights are released under the MIT License. See LICENSE for further details.
