# osms short demo

## Install osms

This Jupyter Notebook should be launched in the folder **notebooks**.

We suggest creating a new virtual environment for this demonstration.

Run the cell below to create a new venv called demo_venv and activate it.

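The venv isolates all newly installed packages from the system's Python interpreter, which avoids conflicts with globally installed packages and their versions.

You can run the cell below, or run the same commands (without the leading `!`) in your terminal. Note that each `!` line runs in its own subshell, so the activation only persists when the commands are run in a terminal.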

In [1]:
!python3 -m venv demo_venv
!source demo_venv/bin/activate
!python3 -m pip install --upgrade pip
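
To confirm which interpreter the notebook kernel is actually using, a small check like the one below can help. It is a minimal sketch that relies only on the standard library; the helper name `in_virtualenv` is introduced here for illustration and is not part of `osms`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Running inside a virtual environment: {in_virtualenv()}")
```

If the second line prints `False`, packages installed from the notebook go into the kernel's own environment rather than into `demo_venv`.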



Next, change the current working directory to the root of the OSM-one-shot-multispeaker repository.

In [1]:
import os
import warnings
warnings.filterwarnings("ignore")

print(f"Initial working directory: {os.getcwd()}")
os.chdir('../')
print(f"Current working directory: {os.getcwd()}")

Initial working directory: /Users/kolya/Documents/Skoltech_PhD/courses/Theoretical Foundations of Data Science/project/OSM-one-shot-multispeaker/notebooks
Current working directory: /Users/kolya/Documents/Skoltech_PhD/courses/Theoretical Foundations of Data Science/project/OSM-one-shot-multispeaker


Now install the `osms` package.

In [2]:
!pip3 install --upgrade .

Processing /Users/kolya/Documents/Skoltech_PhD/courses/Theoretical Foundations of Data Science/project/OSM-one-shot-multispeaker
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.


Using legacy 'setup.py install' for osms, since package 'wheel' is not installed.
Installing collected packages: osms
  Attempting uninstall: osms
    Found existing installation: osms 1.0.0
    Uninstalling osms-1.0.0:
      Successfully uninstalled osms-1.0.0
    Running setup.py install for osms ... done
Successfully installed osms-1.0.0


## Speaker Encoder training demo

This section shows how to prepare a dataset and how to train a speaker encoder.

In [4]:
if not os.path.exists('dataset'):
    os.mkdir('dataset')

Download the [LibriSpeech](https://www.openslr.org/resources/12/train-clean-100.tar.gz) dataset and unpack it into the `dataset` directory.
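
The download and unpacking can also be done from Python. The sketch below uses only the standard library; the helper `download_and_unpack` is a hypothetical name introduced for illustration and is not part of `osms`. It assumes the archive should be stored next to the extracted files inside `dataset`.

```python
import tarfile
import urllib.request
from pathlib import Path

LIBRISPEECH_URL = "https://www.openslr.org/resources/12/train-clean-100.tar.gz"


def download_and_unpack(url: str, target_dir: str = "dataset") -> Path:
    """Download a .tar.gz archive into target_dir and extract it there."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Name the local archive after the last component of the URL.
    archive_path = target / url.rsplit("/", 1)[-1]
    if not archive_path.exists():
        # Skip the download when the archive is already present.
        urllib.request.urlretrieve(url, str(archive_path))
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return target


# The archive is several gigabytes, so the call is left commented out:
# download_and_unpack(LIBRISPEECH_URL)
```

Only extract archives from sources you trust, since `extractall` writes whatever paths the archive contains.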

Import classes and functions for dataset preprocessing:

In [5]:
from osms.common.configs import get_default_main_configs
from osms.tts_modules.encoder.configs import get_default_encoder_config
from osms.tts_modules.encoder.data.wav2mel import StandardWav2MelTransform
from osms.tts_modules.encoder.data.wav_preprocessing import StandardAudioPreprocessor
from osms.tts_modules.encoder.data.dataset import PreprocessLibriSpeechDataset

All configs in `osms` are instances of `CfgNode` from the `yacs` library. The default values are defined in the corresponding functions inside `osms`. They can always be overridden from a custom \*.yaml configuration file; in this demo, however, we use only the default configurations.

`main_configs` contains some general configurations which are available across the whole package.

`encoder_config` contains all configurations required by the speaker encoder module and its methods and classes.

In [6]:
main_configs = get_default_main_configs()
encoder_config = get_default_encoder_config()

preprocessor = StandardAudioPreprocessor(encoder_config)
wav2mel = StandardWav2MelTransform(encoder_config)

data_preprocesser = PreprocessLibriSpeechDataset(encoder_config, preprocessor, wav2mel)

Before training, the dataset has to be preprocessed by calling the method `preprocess_dataset` of `PreprocessLibriSpeechDataset`.

To speed up the demonstration, limit the number of preprocessed speakers to 10. Processing all speakers takes a long time.

In [7]:
data_preprocesser.preprocess_dataset(n_speakers=10)

10 speakers were preprocessed.


The data from the 10 preprocessed speakers can now be used to train the Speaker Encoder.

Our encoder is the d-vec model (`DVecModel`).

For testing purposes, we also reduce the number of training steps, just to check that training runs correctly. A full training run takes a long time.

In [8]:
import torch
from osms.tts_modules.encoder import SpeakerEncoderManager
from osms.tts_modules.encoder.data.dataset import SpeakerEncoderDataLoader, SpeakerEncoderDataset
from osms.tts_modules.encoder.models import DVecModel

In [9]:
dataset = SpeakerEncoderDataset(encoder_config)
dataloader = SpeakerEncoderDataLoader(encoder_config, dataset, 'train')

model = DVecModel(encoder_config)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

encoder_manager = SpeakerEncoderManager(main_configs, model, 
                                        preprocessor, wav2mel, 
                                        dataloader, dataloader, 
                                        optimizer)

Trainable Parameters in dVecModel: 1.424M


In [10]:
encoder_manager.train_session(number_steps=8, each_n_print_steps=2)

Starting the training from scratch.
Step 1. Train loss value: 4.157451629638672
Saving the model (step 2)
Step 3. Train loss value: 4.1539411544799805
Step 5. Train loss value: 4.152585029602051
Saving the model (step 7)
Step 7. Train loss value: 4.139642238616943
Step 9. Train loss value: 4.148777961730957
Stopping Training Session at step #8


As one can see, the loss trends downward overall, although not monotonically: it rises slightly at the last printed step. The resulting checkpoints are saved in the `train_output/checkpoints` folder.
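
To check the trend without reading the numbers by eye, the printed loss values can be parsed from the log. The sketch below uses only the log lines shown above; the regular expression is an assumption based on that output format and is not an `osms` utility.

```python
import re

# Log lines copied from the training output above.
log = """\
Step 1. Train loss value: 4.157451629638672
Step 3. Train loss value: 4.1539411544799805
Step 5. Train loss value: 4.152585029602051
Step 7. Train loss value: 4.139642238616943
Step 9. Train loss value: 4.148777961730957
"""

# Capture the step number and the loss value from each line.
pattern = re.compile(r"Step (\d+)\. Train loss value: ([\d.]+)")
losses = [(int(step), float(value)) for step, value in pattern.findall(log)]

first_loss = losses[0][1]
last_loss = losses[-1][1]
# True only if every printed loss is lower than the one before it.
is_monotonic = all(b[1] < a[1] for a, b in zip(losses, losses[1:]))

print(f"Change from first to last printed step: {last_loss - first_loss:+.4f}")
print(f"Strictly decreasing at every printed step: {is_monotonic}")
```

With only a handful of steps, small fluctuations like these are expected and say little about convergence.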

## One-Shot Learning Example

Run the cell below to import the required functions and create the working folders.

In [3]:
from osms.common.configs import update_config, get_default_main_configs
from osms import MultispeakerManager 

def mkdir(folder_name):
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
        print(f'Folder {folder_name} is created!')


mkdir('audio_samples')
mkdir('texts')
mkdir('result_speech')

Create a 5-second .wav file with someone speaking English and put it into the folder **audio_samples**.

We suggest using the app [Audio Recorder](https://apps.apple.com/us/app/audio-recorder-wav-m4a/id1454488895) to record the voice. Set the sample rate to 16 kHz there.

Create a .txt file with some sentences written in English and put it into the **texts** folder.
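
Before using a recording, its sample rate and duration can be checked with the standard-library `wave` module. This is a minimal sketch: `describe_wav` is a hypothetical helper introduced for illustration, and `wave` only reads uncompressed PCM files, so other encodings raise `wave.Error`.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return the sample rate, channel count and duration of a PCM .wav file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
        return {
            "sample_rate": sample_rate,
            "channels": wav_file.getnchannels(),
            # Duration in seconds = number of frames / frames per second.
            "duration_s": n_frames / sample_rate,
        }


# Example (the file name is a placeholder for your own recording):
# info = describe_wav("audio_samples/my_voice.wav")
# print(info)  # look for sample_rate == 16000 and duration_s close to 5
```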


Example files are already present in these folders. To process your own wav and txt files, write their names in the cell below.

In [9]:
wav_file_name = 'google_test.wav'
text_file_name = 'test1.txt'
output_wav_file_name = 'google_test1.wav'

In [10]:
main_configs = get_default_main_configs()
update_list = ['SPEAKER_SPEECH_PATH', os.path.join('audio_samples', wav_file_name),
               'INPUT_TEXTS_PATH', os.path.join('texts', text_file_name),
               'OUTPUT_AUDIO_FILE_NAME', output_wav_file_name]
main_configs = update_config(main_configs, update_list=update_list)
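
The `update_list` above is a flat list in which keys and values alternate. This is the same layout that `merge_from_list` in `yacs` accepts; whether `update_config` forwards the list to that method is an assumption here. The standard-library sketch below only illustrates how such a list pairs up and does not touch the real config.

```python
import os

# Same flat layout as update_list above: key, value, key, value, ...
update_list = [
    "SPEAKER_SPEECH_PATH", os.path.join("audio_samples", "google_test.wav"),
    "INPUT_TEXTS_PATH", os.path.join("texts", "test1.txt"),
    "OUTPUT_AUDIO_FILE_NAME", "google_test1.wav",
]

if len(update_list) % 2 != 0:
    raise ValueError("update_list must contain an even number of items")

# Even positions hold keys, odd positions hold the corresponding values.
overrides = dict(zip(update_list[0::2], update_list[1::2]))

for key, value in overrides.items():
    print(f"{key} -> {value}")
```

A list with an odd number of items means a key is missing its value, which is a common mistake when editing such lists by hand.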

The pretrained model weights are downloaded automatically the first time you create an object of the `MultispeakerManager` class. They are saved in the **checkpoints** folder, which is created in the working directory. The next time you run the program, the pretrained weights are loaded from this folder instead of being downloaded from the Internet.

The method `inference()` of `MultispeakerManager` launches the whole multispeaker TTS pipeline. All the steps are done sequentially.

In [11]:
multispeaker_manager = MultispeakerManager(main_configs=main_configs)
multispeaker_manager.inference()

Trainable Parameters in dVecModel: 1.424M
Loading DVecModel checkpoint from checkpoints/encoder.pt
Trainable Parameters in Tacotron: 30.870M
Loading Tacotron checkpoint from checkpoints/synthesizer.pt
Trainable Parameters in dVecModel: 1.424M
Loading DVecModel checkpoint from checkpoints/encoder.pt
Trainable Parameters in WaveRNN: 4.481M
Loading WaveRNN checkpoint from checkpoints/vocoder.pt


array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -5.77767793e-08, -9.76717447e-09,  0.00000000e+00])

The resulting \*.wav file is stored in the **result_speech** folder.