# Training

For training, I created a container that essentially packages the code from notebooks I found online:

* YourTTS: 
* VITS for one speaker: https://colab.research.google.com/drive/1wAuG-TcZeAUYhff0f6ZiG-so9KT-sBIE?usp=sharing#scrollTo=psssJnAtRp4-
* For multi-speaker VITS, I merged the two aforementioned notebooks myself.

I have also added some background operations:
* CoquiTTS keeps only the checkpoints (the last 4 states of the model) and the "best" models according to its metrics. Sometimes hours of training produce no new best model, and the checkpoints from that period are deleted. So I added stashing of these intermediate states.
* Inference from the best and stashed models, so we can hear how these models sound.
* Drawing a plot from the training logs.
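
A minimal sketch of what such a stashing service could look like. This is not the code from the container: the names `stash_new_checkpoints` and `CheckpointStasher` are hypothetical, and the `checkpoint_*.pth` file name pattern is an assumption about how the trainer names its checkpoints.

```python
import shutil
import threading
from pathlib import Path
from typing import List


def stash_new_checkpoints(
    training_output: Path,
    stash_folder: Path,
    pattern: str = "checkpoint_*.pth",  # assumed naming scheme, adjust if needed
) -> List[Path]:
    """Copy every checkpoint that is not stashed yet; return the new copies."""
    stash_folder.mkdir(parents=True, exist_ok=True)
    copied = []
    for checkpoint in sorted(Path(training_output).rglob(pattern)):
        target = stash_folder / checkpoint.name
        if not target.exists():
            # Copy instead of move: the trainer still needs the original
            # file to resume, and will delete it on its own schedule.
            shutil.copy2(checkpoint, target)
            copied.append(target)
    return copied


class CheckpointStasher(threading.Thread):
    """Hypothetical background thread that polls the output folder."""

    def __init__(self, training_output, stash_folder, poll_seconds=60.0):
        super().__init__(daemon=True)
        self.training_output = Path(training_output)
        self.stash_folder = Path(stash_folder)
        self.poll_seconds = poll_seconds
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            stash_new_checkpoints(self.training_output, self.stash_folder)
            # wait() returns early as soon as stop() is called
            self._stop_event.wait(self.poll_seconds)

    def stop(self):
        self._stop_event.set()
```

Note that this sketch stashes every checkpoint it sees, which can fill the disk quickly, and it may copy a file that is still being written. A real service would probably stash only at some interval and check that the file is complete.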

## Why a Docker container?

Compared with Colab notebooks, containers have the following advantages:

* Notebooks require step-by-step execution, which is not what I want: I want a process I can start from Python or command line and forget about it until it's done.
* You can run containers on any machine; you only need SSH and Docker for this. A Colab notebook you can only run, well, on Colab. I believe the Docker solution helps to avoid vendor lock-in.
* It's easier to distribute: to deliver the container to a remote machine, I only need docker push and pull operations; no interaction with the server's GUI is required.
* I don't much trust code written in notebooks: for instance, it's impossible to test it properly. Some of the background operations I've written have unit tests, which wouldn't be possible with notebooks.
* In general, I recommend using notebooks for what they were designed for: experiments. I don't want experiments in this area, I want ready code that works.

On the local machine, there was also an option to run this program under WSL (running it without a VM wasn't an option because of espeak problems under Windows). **Avoid this option**. Although WSL and Docker actually use the same Windows subsystem, WSL works far worse and far less predictably. There were numerous problems with it:
* The first time I ran it, it failed after 5 hours.
* I "fixed" it with `kaia.infra.rebooter`, which monitors the log file and restarts the system on inactivity. Effectively, it restarted every 10 minutes. This proved bad for the training process itself: although I resumed the training with CoquiTTS's own means, I had the impression that it was kind of reset each time.
* After several days of experiments, I found out that running the training _in a different console than the rebooter's console_ solves the problem. How exactly sharing a console prevents WSL from working properly, I don't know.
* After packaging the solution in a Docker container, I've never had any problems with this process again.



## Setup

The training container follows the same architecture as the docker-based deciders in BrainBox. There are `CoquiTrainingContainerSettings` and `CoquiTrainingContainerInstaller`.

First, we need to create the container. This will take a while. Check the progress in the console from which Jupyter Lab was run.

In [1]:
from kaia.ml.voice_cloning.coqui_training_container import CoquiTrainingContainerSettings, CoquiTrainingContainerInstaller, SupportedModels

settings = CoquiTrainingContainerSettings()
installer = CoquiTrainingContainerInstaller(settings)
installer.install()

Then, we need to place a model inside the working folder. The models are downloaded for the `CoquiTTS` decider, so this is where we can take them from:

In [2]:
import os
from kaia.brainbox.deciders.docker_based import CoquiTTSSettings

coqui_tts_settings = CoquiTTSSettings()
source_folder = coqui_tts_settings.resource_folder/'builtin'
models = os.listdir(source_folder)
models

['tts_models--en--vctk--vits',
 'tts_models--multilingual--multi-dataset--xtts_v2',
 'tts_models--multilingual--multi-dataset--your_tts']

In [3]:
import shutil

model = [m for m in models if 'vits' in m][0]
dst_folder = settings.resource_folder/'model'
shutil.rmtree(dst_folder, ignore_errors=True)
shutil.copytree(source_folder/model, dst_folder)
os.listdir(dst_folder)

['config.json', 'model_file.pth', 'speaker_ids.json']

Now, we also need a media library to train on. We can't create and annotate a library with TortoiseTTS here, it's too laborious. Instead, we will generate samples with a couple of the voices from the VITS model, and then train the VITS model on them. This is generally meaningless, because we basically train the model on its own output, but it will help us see that the container works normally.

Run BrainBox and then execute the following cell:

In [4]:
from kaia.ml.voice_cloning.data_prep.task_generator import generate_tasks
from kaia.brainbox import BrainBoxTask, BrainBox
from kaia.infra import FileIO

def create_coqui_task(voice, text):
    return BrainBoxTask(
        id=BrainBoxTask.safe_id(),
        decider='CoquiTTS',
        decider_parameters='tts_models/en/vctk/vits',
        decider_method='dub',
        arguments=dict(voice=voice, text=text),
    )

sentences = [item['text'] for item in FileIO.read_json('files/golden_set.json')]
voices = ['p225','p229']

tasks = generate_tasks(sentences,voices,None,create_coqui_task, dict(selected=True))
api = BrainBox().create_api('127.0.0.1')
#result = api.execute(tasks); api.download(result,settings.resource_folder/'test_dataset.zip')

In [5]:
os.listdir(settings.resource_folder)

['dataset.zip', 'model', 'test_dataset.zip', 'training']

Now, we are ready to start the training.

In [6]:
#installer.create_train_process(SupportedModels.VitsMultispeaker, 'model/model_file.pth', 'test_dataset.zip').run()

The following folders are located inside `settings.resource_folder/'training'`:

* Created by data transformation:
  * `data` contains the unzipped dataset, transformed into the format required by VITS/YourTTS (the formats differ in the recipes I used, although I think they can be unified)
  * `temp` is not important
* Created by CoquiTTS itself:
  * `training_output` contains everything important from the training, including
    * Training log
    * Best models: the model's states that are selected by CoquiTTS
    * Checkpoints: last states of the model that are required to restart
    * Stashed models: my add-on. Sometimes there are huge gaps between the best states, and I wanted to evaluate the model in between as well. So I wrote a threaded service that stashes checkpoints.
  * `model` contains a JSON config that the Coqui process creates; it's not important
* Created by sampler:
  * I thought it would be good to be able to listen to the models, so `samples` contains the voiceover of one sentence by all the best and stashed models. You can evaluate the models "by ear" and decide when to stop the training.
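
To quickly check that a run has produced the layout described above, something like the following sketch can help. The helper `summarize_training_folder` is hypothetical and not part of the `kaia` package; it only uses the subfolder names listed above and counts the files in each of them.

```python
from pathlib import Path
from typing import Dict, Optional

# Subfolder names as listed above
EXPECTED_SUBFOLDERS = ("data", "temp", "training_output", "model", "samples")


def summarize_training_folder(training_folder) -> Dict[str, Optional[int]]:
    """Map each expected subfolder to its file count, or None if it is missing."""
    summary = {}
    for name in EXPECTED_SUBFOLDERS:
        folder = Path(training_folder) / name
        if folder.is_dir():
            # Count files recursively, ignoring directories themselves
            summary[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary
```

In this notebook it would be called as `summarize_training_folder(settings.resource_folder/'training')`. A `None` for `training_output` or `samples` suggests the corresponding stage has not run yet.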
 
You may use the script from `kaia.ml.voice_cloning.data_prep.quality_assurance` to concatenate the sounds into one video with indices. However, this script doesn't always work, for some obscure FFMPEG-related reasons.
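
If FFMPEG refuses to cooperate, an audio-only fallback is possible with Python's standard `wave` module. The sketch below is not the script mentioned above: `concatenate_wavs` is a hypothetical helper, it assumes the samples are uncompressed WAV files with identical channel count, sample width and sample rate, and it produces no video and no indices, so you need to keep track of the file order yourself.

```python
import wave
from pathlib import Path


def concatenate_wavs(sources, destination):
    """Join several WAV files with identical audio parameters into one file."""
    sources = [Path(s) for s in sources]
    if not sources:
        raise ValueError("No WAV files to concatenate")

    # The first file defines the expected audio format
    with wave.open(str(sources[0]), "rb") as first:
        expected = (first.getnchannels(), first.getsampwidth(), first.getframerate())

    with wave.open(str(destination), "wb") as out:
        out.setnchannels(expected[0])
        out.setsampwidth(expected[1])
        out.setframerate(expected[2])
        for source in sources:
            with wave.open(str(source), "rb") as wav:
                actual = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
                if actual != expected:
                    raise ValueError(f"{source} has a different audio format")
                out.writeframes(wav.readframes(wav.getnframes()))
```

Passing the files in sorted order, for example `concatenate_wavs(sorted(folder.glob('*.wav')), 'all_samples.wav')`, keeps the result reproducible; `folder` here stands for wherever the samples are stored.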

Most importantly, you may export the resulting model to the `CoquiTTS` decider and use it for inference:

In [7]:
from kaia.ml.voice_cloning.coqui_training_container.container.utils import get_working_folder
from kaia.ml.voice_cloning.coqui_training_container.container.model_file_info import ModelFileInfo

training_output = settings.resource_folder/'training/training_output/'
working_folder = get_working_folder(training_output)
models = ModelFileInfo.parse_folder(working_folder)
best_model_path = [m for m in models if m.type==ModelFileInfo.Type.BestModel][0].path

model_name = coqui_tts_settings.export_model_from_training(best_model_path, 'test')
model_name

'custom_models/test.pth'

In [8]:
from kaia.brainbox.deciders.docker_based import CoquiTTSInstaller

installer = CoquiTTSInstaller(coqui_tts_settings)
coqui_tts_api = installer.run_in_any_case_and_return_api()
model_info = coqui_tts_api.load_model(model_name)
voiceover = coqui_tts_api.dub(model_info, 'Hello, world!', model_info['speakers'][0])
with open('files/hello_world.wav','wb') as file:
    file.write(voiceover)

I have saved the resulting audio so you can enjoy it without running the whole thing:

In [10]:
from ipywidgets import Audio
with open('files/hello_world.wav','rb') as file:
    voiceover = file.read()
Audio(value=voiceover, autoplay=False)

Audio(value=b'RIFFD4\x01\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00"V\x00\x00D\xac\x00\x00\x02\x00\x10\x00dat…

The quality is not the best. However, it will greatly improve with more samples, and if the samples come from TortoiseTTS rather than from CoquiTTS, as in this example. What is clear, however, is that the training container works, and you may now create fast voiceovers of okay-ish quality for your favourite characters!