# WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper. Previously known as **spear-tts-pytorch**.

The general architecture is similar to [AudioLM](https://google-research.github.io/seanet/audiolm/examples/) and [SPEAR TTS](https://google-research.github.io/seanet/speartts/examples/) from Google and to [MusicGen](https://ai.honu.io/papers/musicgen/) from Meta. We avoided the NIH syndrome and built it on top of powerful Open Source models: [Whisper](https://github.com/openai/whisper) from OpenAI to generate semantic tokens and perform transcription, [EnCodec](https://github.com/facebookresearch/encodec) from Meta for acoustic modeling and [Vocos](https://github.com/charactr-platform/vocos) from Charactr Inc as the high-quality vocoder.

Currently the models are trained on the English LibriLight and LibriTTS datasets. Ultimately we want to target multiple languages (Whisper and EnCodec are both multilingual).

## Progress updates

**UPDATE 2023-07-14**: We have trained a new pair of models, added support for multiple speakers and integrated the
Vocos vocoder to deliver a big overall quality boost. This is not our last word: we are doing
hyperparameter tuning to train bigger, higher-quality models.

## Roadmap

- [x] [Extract acoustic tokens](https://github.com/collabora/spear-tts-pytorch/issues/2)
- [x] [Extract Whisper embeddings and quantize them to semantic tokens](https://github.com/collabora/spear-tts-pytorch/issues/3)
- [x] [Semantic token to acoustic token (S->A) model](https://github.com/collabora/spear-tts-pytorch/issues/4)
- [x] [Text token to semantic token (T->S) model](https://github.com/collabora/spear-tts-pytorch/issues/9)
- [x] [Improve the EnCodec speech quality](https://github.com/collabora/spear-tts-pytorch/issues/10)
- [ ] [Gather a bigger emotive speech dataset](https://github.com/collabora/spear-tts-pytorch/issues/11)
- [ ] [Train final high-quality models](https://github.com/collabora/spear-tts-pytorch/issues/12)

## Architecture

### Whisper for modeling semantic tokens

We use the OpenAI Whisper encoder block to generate embeddings, which we then quantize with a small 2-layer model to get semantic tokens.

If the language is already supported by Whisper, this process requires only audio files (no ground-truth transcriptions).

![Using Whisper for semantic token extraction diagram](whisper-block.png)
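
To make the quantization step more concrete, here is a minimal sketch of turning a sequence of embedding frames into discrete token ids. It is an illustration under loud assumptions, not our implementation: it swaps the small 2-layer quantizer model for plain nearest-neighbour lookup in a fixed codebook, and the names `nearest_code`, `quantize`, `codebook` and `frames` are hypothetical. The toy 4-dimensional vectors stand in for real Whisper encoder output.

```python
import math


def nearest_code(embedding, codebook):
    """Return the index of the codebook vector closest to `embedding`."""
    return min(range(len(codebook)), key=lambda i: math.dist(embedding, codebook[i]))


def quantize(embeddings, codebook):
    """Map every embedding frame to the id of its nearest codebook entry."""
    return [nearest_code(frame, codebook) for frame in embeddings]


# Toy codebook (assumption): 16 entries, entry i is the 4-bit binary code of i.
DIM = 4
codebook = [[float((i >> bit) & 1) for bit in range(DIM)] for i in range(2 ** DIM)]

# Toy "encoder output" (assumption): slightly shifted copies of entries 3, 3, 7, 0.
frames = [[value + 0.1 for value in codebook[i]] for i in (3, 3, 7, 0)]

semantic_tokens = quantize(frames, codebook)
print(semantic_tokens)  # [3, 3, 7, 0]
```

Whatever quantizer is used, the output has the same shape: one integer token per embedding frame, which is what the downstream models consume.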

### EnCodec for modeling acoustic tokens

We use EnCodec to model the audio waveform. Out of the box it delivers reasonable quality at 1.5 kbps and we can bring this up to high quality by using Vocos – a vocoder pretrained on EnCodec tokens.

![EnCodec block diagram](https://github.com/facebookresearch/encodec/raw/main/architecture.png)
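
As a rough check of where the 1.5 kbps figure comes from, the sketch below computes the bitrate of a residual codec from its frame rate, number of codebooks and codebook size. The values plugged in (75 frames per second, 2 codebooks, 1024 entries each) are an assumption that matches the published 24 kHz EnCodec model at its lowest bandwidth; the function name `codec_bitrate_bps` is hypothetical.

```python
import math


def codec_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second needed to store the acoustic tokens."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


# Assumed settings: 24 kHz EnCodec at its lowest bandwidth.
tokens_per_second = 75 * 2
bitrate = codec_bitrate_bps(frame_rate_hz=75, num_codebooks=2, codebook_size=1024)

print(tokens_per_second)  # 150
print(bitrate)            # 1500.0
```

Under these assumptions, one second of audio is represented by 150 acoustic tokens.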

## Appreciation

[<img height=80 src="https://user-images.githubusercontent.com/107984/229537027-a6d7462b-0c9c-4fd4-b69e-58e98c3ee63f.png" alt="Collabora logo">](https://www.collabora.com)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[<img height=80 src="https://user-images.githubusercontent.com/107984/229535036-c741d775-4a9b-4193-89a0-9ddb89ecd011.png" alt="LAION logo">](https://laion.ai)

This work would not be possible without the generous sponsorships from:

- [Collabora](https://www.collabora.com) – code development and model training
- [LAION](https://laion.ai) – community building and datasets

We are available to help you with both Open Source and proprietary AI projects. You can reach us via the Collabora website or on Discord ([![](https://dcbadge.vercel.app/api/shield/270267134960074762?style=flat)](https://discordapp.com/users/270267134960074762) and [![](https://dcbadge.vercel.app/api/shield/1088938086400016475?style=flat)](https://discordapp.com/users/1088938086400016475)).

## Citations

We rely on many amazing Open Source projects and research papers:

```bibtex
@article{SpearTTS,
  title = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
  url = {https://arxiv.org/abs/2302.03540},
  author = {Kharitonov, Eugene and Vincent, Damien and Borsos, Zalán and Marinier, Raphaël and Girgin, Sertan and Pietquin, Olivier and Sharifi, Matt and Tagliasacchi, Marco and Zeghidour, Neil},
  publisher = {arXiv},
  year = {2023},
}
```

```bibtex
@article{MusicGen,
  title = {Simple and Controllable Music Generation},
  url = {https://arxiv.org/abs/2306.05284},
  author = {Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
  publisher = {arXiv},
  year = {2023},
}
```

```bibtex
@article{Whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  publisher = {arXiv},
  year = {2022},
}
```

```bibtex
@article{EnCodec,
  title = {High Fidelity Neural Audio Compression},
  url = {https://arxiv.org/abs/2210.13438},
  author = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  publisher = {arXiv},
  year = {2022},
}
```

```bibtex
@article{Vocos,
  title = {Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  url = {https://arxiv.org/abs/2306.00814},
  author = {Hubert Siuzdak},
  publisher = {arXiv},
  year = {2023},
}
```