## Introduction
This notebook contains a record of my research and experimentation with [TorToiSe TTS](https://github.com/neonbjb/tortoise-tts), a Text To Speech Gen AI tool created by James Betker.

Tortoise-TTS borrows two approaches more commonly used in text and image generation models. It uses auto-regression (auto: "self", regressus: "to go back") to create the next output token based on previous outputs right-shifted in time. The output speech sample is then denoised using a Denoising Diffusion Probabilistic Model (DDPM). Auto-regression is most often associated with LLMs, and diffusion with image generation.
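To make the auto-regressive idea concrete, here is a minimal sketch of the generation loop. The `toy_model` function is a hypothetical stand-in for a real network and has nothing to do with Tortoise's actual model; it only shows that each new token is computed from the tokens produced so far.

```python
def generate(start_tokens, next_token_fn, max_new_tokens):
    """Autoregressive loop: each new token is computed from all tokens so far."""
    tokens = list(start_tokens)
    for _ in range(max_new_tokens):
        tokens.append(next_token_fn(tokens))
    return tokens


def toy_model(tokens):
    # Hypothetical stand-in for a trained model: sum of the last two tokens, mod 10.
    return (tokens[-1] + tokens[-2]) % 10


print(generate([1, 1], toy_model, 5))
```

A real model would replace `toy_model` with a forward pass that returns a sampled token, but the loop structure stays the same.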

As its name suggests, Tortoise was initially quite slow. High quality audio output would take tens of seconds to create a single word on a desktop graphics card. However, subsequent improvements in caching of keys and values in the decoder, adoption of Microsoft's DeepSpeed PyTorch optimisation library and reduced-precision options collectively improved performance by 5-6x.

## Setup
Tortoise was installed in a Python `v3.10` virtual environment on an Ubuntu `22.04.3` LTS machine (GNU/Linux 5.15.0-91-generic x86_64) with `96GB` DDR5 RAM, an Intel Core `i5-13600K`, Samsung `NVME` storage and an NVIDIA `RTX 4000` graphics card. Setting up the environment is omitted from this notebook.

## Installation of Tortoise-TTS
Install `tortoise-tts` using `pip`.

In [1]:
pip install git+https://github.com/neonbjb/tortoise-tts

Collecting git+https://github.com/neonbjb/tortoise-tts
  Cloning https://github.com/neonbjb/tortoise-tts to /tmp/pip-req-build-464da0zd
  Running command git clone --filter=blob:none --quiet https://github.com/neonbjb/tortoise-tts /tmp/pip-req-build-464da0zd
  Resolved https://github.com/neonbjb/tortoise-tts to commit 1a3b014d5c5b14feaa416145004411a7ee2a3970
  Preparing metadata (setup.py) ... done
Collecting einops
  Using cached einops-0.7.0-py3-none-any.whl (44 kB)
Collecting inflect
  Using cached inflect-7.2.0-py3-none-any.whl (34 kB)
Collecting librosa
  Using cached librosa-0.10.1-py3-none-any.whl (253 kB)
Collecting progressbar
  Using cached progressbar-2.5.tar.gz (10 kB)
  Preparing metadata (setup.py) ... done
Collecting rotary_embedding_torch
  Using cached rotary_embedding_torch-0.5.3-py3-none-any.whl (5.3 kB)
Collecting scipy
  Using cached scipy-1.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.6 MB)
Collecting tokenizers
  Usi

Confirm the installation was successful.

In [3]:
!tortoise_tts.py --help

Traceback (most recent call last):
  File "/home/djh/Development/python/tortoise-tts/venv/bin/tortoise_tts.py", line 10, in <module>
    import torchaudio
ModuleNotFoundError: No module named 'torchaudio'


The `torchaudio` package appears to be missing from `requirements.txt`. Let's install it.

In [4]:
pip install torchaudio

Collecting torchaudio
  Using cached torchaudio-2.2.2-cp310-cp310-manylinux1_x86_64.whl (3.3 MB)
Installing collected packages: torchaudio
Successfully installed torchaudio-2.2.2
Note: you may need to restart the kernel to use updated packages.
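
A missing dependency like this can also be spotted before running the CLI. The sketch below uses the standard library's `importlib.util.find_spec` to report which top-level modules cannot be found in the current environment. The helper name `find_missing` and the list of modules checked are my own choices for illustration.

```python
import importlib.util


def find_missing(module_names):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Modules this notebook ends up needing; adjust the list as required.
print(find_missing(["torch", "torchaudio", "deepspeed"]))
```

An empty list means every module was found; it does not guarantee that the installed versions are compatible with each other.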


In [5]:
!tortoise_tts.py --help

usage: tortoise_tts.py [-h] [-v, --voice VOICE] [-V, --voices-dir VOICES_DIR]
                       [-p, --preset {ultra_fast,fast,standard,high_quality}]
                       [-q, --quiet]
                       (-l, --list-voices | -P, --play | -o, --output OUTPUT | -O, --output-dir OUTPUT_DIR)
                       [--candidates CANDIDATES] [--regenerate REGENERATE]
                       [--skip-existing] [--produce-debug-state] [--seed SEED]
                       [--models-dir MODELS_DIR] [--text-split TEXT_SPLIT]
                       [--disable-redaction] [--device DEVICE]
                       [--batch-size BATCH_SIZE]
                       [--num-autoregressive-samples NUM_AUTOREGRESSIVE_SAMPLES]
                       [--temperature TEMPERATURE]
                       [--length-penalty LENGTH_PENALTY]
                       [--repetition-penalty REPETITION_PENALTY]
                       [--top-p TOP_P] [--max-mel-tokens MAX_MEL_TOKENS]
                       [--cvvp-

Tortoise-tts comes with a selection of built-in voices. Let's see what voices we can choose from.

In [6]:
!tortoise_tts.py --list-voices

usage: tortoise_tts.py [-h] [-v, --voice VOICE] [-V, --voices-dir VOICES_DIR]
                       [-p, --preset {ultra_fast,fast,standard,high_quality}]
                       [-q, --quiet]
                       (-l, --list-voices | -P, --play | -o, --output OUTPUT | -O, --output-dir OUTPUT_DIR)
                       [--candidates CANDIDATES] [--regenerate REGENERATE]
                       [--skip-existing] [--produce-debug-state] [--seed SEED]
                       [--models-dir MODELS_DIR] [--text-split TEXT_SPLIT]
                       [--disable-redaction] [--device DEVICE]
                       [--batch-size BATCH_SIZE]
                       [--num-autoregressive-samples NUM_AUTOREGRESSIVE_SAMPLES]
                       [--temperature TEMPERATURE]
                       [--length-penalty LENGTH_PENALTY]
                       [--repetition-penalty REPETITION_PENALTY]
                       [--top-p TOP_P] [--max-mel-tokens MAX_MEL_TOKENS]
                       [--cvvp-

That doesn't seem to be a valid option. Let's try using the short `-l` option instead.

In [7]:
!tortoise_tts.py -l

angie
applejack
cond_latent_example
daniel
deniro
emma
freeman
geralt
halle
jlaw
lj
mol
myself
pat
pat2
rainbow
snakes
tim_reynolds
tom
train_atkins
train_daws
train_dotrice
train_dreams
train_empire
train_grace
train_kennard
train_lescault
train_mouse
weaver
william


That worked as expected. Let's now create a test speech sample using the prerecorded voice `tom` with a quality preset of `ultra_fast`. This should be a baseline for the fastest, lowest fidelity output without using any of the subsequently added caching and library optimisations.

In [8]:
!time tortoise_tts.py --seed 42 -p ultra_fast -v tom -o tom-ultra-fast-text01.wav "Testing, testing, one two, one two three."

Loading tts...
Some weights of the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametriz

The recording sound quality isn't great, but it is the `ultra_fast` preset option. Let's see how good the `high_quality` preset is.

In [9]:
!time tortoise_tts.py --seed 42 -p high_quality -v tom -o tom-high-quality-text01.wav "Testing, testing, one two, one two three."

Loading tts...
Some weights of the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametriz

The quality has improved but at the cost of 6x the GPU processing compared with using the `ultra_fast` preset. It requires approximately 30 seconds of GPU time per word to create a high fidelity output. As the author mentions, it's called 🐢 tortoise for a good reason. Let's see if we can improve performance without losing too much quality.

Let's first listen to the audio quality of the `standard` preset.

In [22]:
!time tortoise_tts.py --seed 42 -p standard -v tom -o tom-standard-text01.wav "Testing, testing, one two, one two three."

Loading tts...
Some weights of the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametriz

I cannot hear a significant difference in the quality of the `standard` vs `high_quality` presets. The GPU time required by each is similar for this short test. Let's explore different options, starting with caching.
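
The `!time` figures in the cells above can also be collected from Python, which makes it easier to tabulate presets side by side. This is a minimal sketch: `time_command` is a hypothetical helper, and it measures wall-clock time for the whole process (including model loading), not GPU time alone.

```python
import subprocess
import sys
import time


def time_command(args):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(args, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder command; swap in the tortoise_tts.py invocation for each preset.
    elapsed = time_command([sys.executable, "-c", "print('hello')"])
    print(f"{elapsed:.2f} s")
```

Running each preset several times and averaging would give a fairer comparison than a single run.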

## Improving Efficiency While Maintaining Performance


### Key Value Caching
Tortoise-tts includes support for `kv_cache`. At each decoding step the attention layers need a key matrix and a value matrix covering all the previous context the decoder should pay attention to; the output is a weighted sum over the values. Instead of recomputing these matrices for the whole context at every step, we can cache them. In a causal decoder the keys and values of tokens already generated do not change as we output new tokens, so each step only needs to add one new row. Refer to this [video](https://youtu.be/80bIUggRJf4?si=ceCOeyQlhFY9AiCD) for a helpful explanation.
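
The idea can be sketched with a single attention head in NumPy. This is a toy illustration under my own assumptions (random weights, one layer, one head) and is not Tortoise's implementation. It compares recomputing the keys and values for the whole prefix at every step against appending one row per step to a cache.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attend(q, K, V):
    """Single-head attention for one query vector over all available positions."""
    scores = q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V


rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
xs = rng.normal(size=(5, d))  # five token embeddings arriving one at a time

# Without a cache: recompute K and V for the whole prefix at every step.
no_cache = [attend(xs[t] @ Wq, xs[: t + 1] @ Wk, xs[: t + 1] @ Wv) for t in range(len(xs))]

# With a cache: project only the newest token and append its row.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
with_cache = []
for x in xs:
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    with_cache.append(attend(x @ Wq, K_cache, V_cache))

print(np.allclose(no_cache, with_cache))
```

In this toy setting the two approaches agree to floating-point tolerance, while the cached version does far less work per step as the sequence grows.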

Unfortunately, the command line interface doesn't support a kv_cache input option so let's update `tortoise_tts.py` to enable it. Change line 209 of `venv/bin/tortoise_tts.py` to:

```python
tts = TextToSpeech(kv_cache=True, models_dir=args.models_dir, enable_redaction=not args.disable_redaction,
                   device=args.device, autoregressive_batch_size=args.batch_size)
```

and rerun the earlier test using the `high_quality` preset.

In [1]:
!time tortoise_tts.py --seed 42 -p high_quality -v tom -o tom-high-quality-kvcache-text01.wav "Testing, testing, one two, one two three."

Loading tts...
Some weights of the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos_conv_embed.conv.parametriz

Enabling `kv_cache` resulted in a significant improvement in performance. We've reduced the time taken by a factor of four but the audio fidelity sounds different. This is unexpected. I used the same `seed` value across tests so if `kv_cache` is indeed avoiding unnecessary computations of k and v values, I wouldn't expect a difference in the output wav file generated. It's difficult to objectively say that the new audio is worse without scientifically comparing the two segments, but qualitatively the latest recording doesn't sound as good.
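
One simple way to put a number on the difference is the RMS of the sample-wise difference relative to the reference signal. The sketch below works on NumPy arrays of samples and uses synthetic signals as stand-ins; loading the actual wav files is left out, and `rms_difference_db` is a hypothetical helper of my own. It only makes sense when the two clips are time-aligned, so it says little if the two runs produced different timing.

```python
import numpy as np


def rms_difference_db(a, b):
    """RMS of (a - b) relative to the RMS of a, in dB; more negative means more similar."""
    n = min(len(a), len(b))  # compare only the overlapping part
    a = np.asarray(a[:n], dtype=np.float64)
    b = np.asarray(b[:n], dtype=np.float64)
    rms_a = np.sqrt(np.mean(a ** 2))
    rms_diff = np.sqrt(np.mean((a - b) ** 2))
    if rms_diff == 0:
        return float("-inf")  # identical samples
    return 20 * np.log10(rms_diff / rms_a)


# Synthetic stand-ins for two decoded recordings.
t = np.linspace(0, 1, 24000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.01 * np.random.default_rng(0).normal(size=t.size)
print(rms_difference_db(clean, clean), rms_difference_db(clean, noisy))
```

A perceptual metric would be a better judge of "sounds worse", but even this crude measure would confirm whether the two files are bit-for-bit identical.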

It looks like this is a known issue with the original source code. It's been addressed in this derivative repo [tortoise-tts-fast](https://github.com/152334H/tortoise-tts-fast). I'll explore this later, as the derivative code base uses a new sampler, `dpm++2m`.

### DeepSpeed Optimisation Library
Tortoise-tts also supports the `DeepSpeed` library from Microsoft. One of the four innovation pillars of its development focusses on the inference of LLMs. Once again we'll need to enable its use by modifying `tortoise_tts.py`. Change line 209 of `venv/bin/tortoise_tts.py` to:

```python
tts = TextToSpeech(use_deepspeed=True, models_dir=args.models_dir, enable_redaction=not args.disable_redaction,
                   device=args.device, autoregressive_batch_size=args.batch_size)
```

and rerun the earlier test using the `high_quality` preset.

> [!NOTE]  
> I've removed the earlier setting of the argument `kv_cache` so we are only changing one variable at a time.

In [2]:
!time tortoise_tts.py --seed 42 -p high_quality -v tom -o tom-high-quality-deepspeed-text01.wav "Testing, testing, one two, one two three."

Loading tts...
Some weights of the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametriz

The `deepspeed` module is missing so we'll need to install it and try again.

In [10]:
pip install deepspeed

Collecting deepspeed
  Using cached deepspeed-0.14.1-py3-none-any.whl
Collecting pynvml
  Using cached pynvml-11.5.0-py3-none-any.whl (53 kB)
Collecting py-cpuinfo
  Using cached py_cpuinfo-9.0.0-py3-none-any.whl (22 kB)
Collecting hjson
  Using cached hjson-3.1.0-py3-none-any.whl (54 kB)
Collecting ninja
  Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Collecting pydantic
  Using cached pydantic-2.7.0-py3-none-any.whl (407 kB)
Collecting annotated-types>=0.4.0
  Using cached annotated_types-0.6.0-py3-none-any.whl (12 kB)
Collecting pydantic-core==2.18.1
  Using cached pydantic_core-2.18.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Installing collected packages: py-cpuinfo, ninja, hjson, pynvml, pydantic-core, annotated-types, pydantic, deepspeed
Successfully installed annotated-types-0.6.0 deepspeed-0.14.1 hjson-3.1.0 ninja-1.11.1.1 py-cpuinfo-9.0.0 pydantic-2.7.0 pydantic-core-2.18.1 pynvml-11.5.0
Note: you may ne

In [8]:
!time tortoise_tts.py --seed 42 -p high_quality -v tom -o tom-high-quality-deepspeed-text01.wav "Testing, testing, one two, one two three."

Loading tts...
Some weights of the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos_conv_embed.conv.parametriz

If this fails and complains of missing header files, you may need to install the Python development package for the version of Python you are using in your virtual environment.

```bash
sudo apt-get install python3.10-dev
```

The deepspeed module reduced the GPU time by 50% with no audible reduction in the fidelity of the output.

The last performance option we'll explore is using half-precision 16-bit floating point numbers to store model weights and intermediate activations in the neural network. This will allow us to make more efficient use of the available GPU memory. Again we need to modify the code:

```python
tts = TextToSpeech(half=True, models_dir=args.models_dir, enable_redaction=not args.disable_redaction,
                   device=args.device, autoregressive_batch_size=args.batch_size)
```

In [10]:
!time tortoise_tts.py --seed 42 -p high_quality -v tom -o tom-high-quality-halfp-text01.wav "Testing, testing, one two, one two three."

Loading tts...
Some weights of the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos_conv_embed.conv.parametriz

Similar to using the deepspeed module, using half precision resulted in a 50% reduction in GPU time. While there's no obvious reduction in audio fidelity, I would expect a quantitative comparison to show there is a difference.
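
The trade-off can be illustrated with NumPy, independently of Tortoise. This sketch stores the same random values as 32-bit and 16-bit floats, then compares the memory used and the rounding error introduced. The array is a made-up stand-in for model weights.

```python
import numpy as np

# Hypothetical stand-in for a block of model weights.
weights32 = np.random.default_rng(0).normal(size=1_000_000).astype(np.float32)
weights16 = weights32.astype(np.float16)

# Bytes used by each copy: half precision needs half the storage.
print(weights32.nbytes, weights16.nbytes)

# Largest rounding error introduced by the conversion: small but non-zero.
max_err = np.max(np.abs(weights32 - weights16.astype(np.float32)))
print(max_err)
```

Whether errors of this size are audible depends on how they accumulate through the network, which is why a quantitative comparison of the output audio would still be worthwhile.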

### Combining Caching, DeepSpeed and Reduced Precision

We can enable all three options at once, thereby hopefully resulting in a significant improvement in performance. Again we modify the code:

```python
tts = TextToSpeech(kv_cache=True, use_deepspeed=True, half=True, models_dir=args.models_dir, enable_redaction=not args.disable_redaction,
                   device=args.device, autoregressive_batch_size=args.batch_size)
```

and rerun the test.

In [12]:
!time tortoise_tts.py --seed 42 -p high_quality -v tom -o tom-high-quality-halfp-deep-cache-text01.wav "Testing, testing, one two, one two three."

Loading tts...
Some weights of the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos_conv_embed.conv.parametriz

We've reduced GPU time by approximately 85% without any notable reduction in the audible fidelity.
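
Percentage reductions and speed-up factors are easy to mix up, so here is the conversion as a small sketch. The helper name is my own; the figures are the approximate reductions reported in this notebook.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional time reduction (0.85 means 85%) into a speed-up factor."""
    return 1.0 / (1.0 - reduction)


for label, reduction in [("deepspeed or half alone", 0.50), ("all three combined", 0.85)]:
    print(f"{label}: {speedup_from_reduction(reduction):.1f}x")
```

An 85% reduction works out to a little under 7x, which is broadly in line with the 5-6x improvement mentioned in the introduction.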

I've included each of the synthetically generated text-to-speech audio files under the folder `./output-audio-samples` should you wish to listen and compare them.
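
Rather than editing line 209 for every experiment, the three options could be exposed as command line flags. The sketch below is hypothetical: the flag names `--kv-cache`, `--use-deepspeed` and `--half` are not part of `tortoise_tts.py`, and the code only builds the keyword arguments that would be passed to `TextToSpeech`, without importing Tortoise.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical speed-up flags")
    # These three flag names are illustrative; they are not part of tortoise_tts.py.
    parser.add_argument("--kv-cache", action="store_true")
    parser.add_argument("--use-deepspeed", action="store_true")
    parser.add_argument("--half", action="store_true")
    return parser


def tts_kwargs(args):
    """Map parsed flags onto the keyword arguments used in the manual edits."""
    return {"kv_cache": args.kv_cache, "use_deepspeed": args.use_deepspeed, "half": args.half}


args = build_parser().parse_args(["--kv-cache", "--half"])
print(tts_kwargs(args))
```

The resulting dictionary could then be unpacked into the constructor call with `**tts_kwargs(args)` alongside the existing arguments.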

Let's now revisit the earlier fork of the original repo that also reported that `kv_cache` changed the output recording.

- https://github.com/neonbjb/tortoise-tts ↩
    - https://github.com/152334H/tortoise-tts-fast/forks ↩
        - https://github.com/manmay-nakhashi/tortoise-tts-fastest

Active development seems to have stopped, or at least that is how it appears based on when the forks were last updated. The developer has chosen to develop on Python v3.8. While the code suggests that Python v3.10 should be a valid version, I found compatibility issues. I suspect getting it running on v3.10 or later could turn into a not insignificant side project. Let's halt progress along this path and get back to the original repo `tortoise-tts`.


## Creating a Custom Voice

Tortoise-tts supports custom voices. The process for creating a new voice appears to be quite simple, so let's do so. 

First we capture five to ten examples of the voice we wish to copy. Each recording should be no more than ten seconds long. I asked Google's Gemini for ten training texts. It provided the following suggestions:

1. **Historical Fact:** The Great Wall of China is the longest man-made structure in the world, stretching over 21,000 kilometers. (9 seconds)
2. **Instructional:** When crossing the street, look both ways before stepping off the curb. (8 seconds)
3. **Technical Description:** A laptop computer is a portable personal computer designed for mobile use. (9 seconds)
4. **News Report:** Local libraries will be offering a series of free workshops on creative writing throughout the summer. (10 seconds)
5. **Restaurant Order:** I would like the chicken stir-fry with brown rice and a side salad, please. (8 seconds)
6. **Travel Description:** The bustling marketplace was filled with vendors selling colorful fabrics and handcrafted souvenirs. (10 seconds)
7. **Animal Fact:** The average lifespan of a domestic cat is around 15 years. (8 seconds)
8. **Book Description:** This science fiction novel explores the concept of faster-than-light travel. (9 seconds)
9. **Daily Routine:** I usually wake up at 7 am, exercise for 30 minutes, and then head to work. (9 seconds)
10. **Movie Summary:** The heartwarming film follows a group of misfits who come together to achieve a common goal. (9 seconds)
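
Since each recording should be no more than ten seconds long, it's worth checking the clip lengths after recording. This is a minimal sketch using the standard library `wave` module, which only reads PCM-encoded wav files; `clip_durations` is a hypothetical helper, and the folder path is the one used for the samples in this notebook.

```python
import wave
from pathlib import Path


def clip_durations(folder):
    """Return {filename: seconds} for every PCM .wav file in a folder."""
    durations = {}
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            durations[path.name] = wav.getnframes() / wav.getframerate()
    return durations


if __name__ == "__main__":
    for name, seconds in clip_durations("./input-audio-samples/djh").items():
        flag = "" if seconds <= 10 else "  <-- longer than ten seconds"
        print(f"{name}: {seconds:.1f} s{flag}")
```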

While I recorded each of the ten examples using a good quality lavalier microphone in a quiet environment, the first attempt to synthesise my voice produced poor results. Small amounts of background noise and hiss were amplified and created strange ethereal artefacts in the recording. 

Converting the input voice samples from stereo to mono and using Audacity's noise removal filter (Effect > Noise Remover and Reduction) significantly improved the quality of the resulting custom voice. I feel there's still plenty of scope to improve this further, but for now let's work with what we have.
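
I did the conversion in Audacity, but the stereo to mono step itself is simple enough to sketch in NumPy. This assumes the samples are already loaded into an array of shape `(num_frames, num_channels)`; `stereo_to_mono` is a hypothetical helper and simply averages the channels.

```python
import numpy as np


def stereo_to_mono(samples):
    """Average the channels of a (num_frames, num_channels) array into one channel."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.ndim == 1:
        return samples  # already mono
    return samples.mean(axis=1)


stereo = np.array([[0.2, 0.4], [-1.0, 1.0], [0.5, 0.5]])
print(stereo_to_mono(stereo))
```

Note that averaging cancels out content that is out of phase between the two channels, as in the second frame of the example.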

Place a copy of the training samples in a training folder `./input-audio-samples/djh`.

Then reference this folder using a combination of the `-V` and `-v` options.

I experimented with the seed value to find a custom voice that was most similar to my voice.

In [3]:
!time tortoise_tts.py --seed 88 -p high_quality -V ./input-audio-samples -v djh -o djh-high-quality-fullp-deep-cache-text01.wav "Testing, testing, one two, one two three."

Loading tts...
Some weights of the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos_conv_embed.conv.parametriz

## Conclusion
That concludes this short adventure into one of the seminal open source projects on Generative AI Text To Speech (TTS). The project is now over two years old but is still able to create realistic, high fidelity audio using only a small amount of training data. It could be a useful tool for TTS provided you invest the time to create high quality training data with a tone and cadence similar to the audio you want to create. It can also blend together multiple voices, which may allow us to inject mood into the spoken text.

Text to Speech, like so many other areas of Generative AI, provides a rabbit warren of options to explore. Tortoise-tts is one notable but aging project that still produces good results. In a future research session I may look into more recent open and closed source projects including Tacotron 2, coqui-TTS, Google's TTS and Amazon's BASE TTS.