# One-Shot Learning Example

The Jupyter Notebook should be launched in the folder **notebooks**.

In [2]:
import os
os.chdir('../src')
from osms.common.multispeaker import MultispeakerManager
import torch
import yaml
import warnings
warnings.filterwarnings("ignore")


Create a 5-second .wav file with someone speaking English and put it into the folder **audio_samples**.
Set the path to your .wav file in the attribute `SPEAKER_SPEECH_PATH` in `src/tts_modules/common/configs/main_config.yaml`.
We suggest using the [Audio Recorder](https://apps.apple.com/us/app/audio-recorder-wav-m4a/id1454488895) app to record the voice. Set the sample rate to 16 kHz there.
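
Before pointing the config at a recording, it can help to confirm its sample rate and length. The following is a minimal sketch using only the standard library, assuming the suggested rate is 16 kHz (16000 Hz) and the file is uncompressed PCM. The helper names `describe_wav` and `write_test_tone` are hypothetical and not part of `osms`; the synthetic tone only stands in for a real recording.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def describe_wav(path):
    """Return (sample_rate_hz, duration_seconds, n_channels) for a PCM .wav file."""
    with wave.open(str(path), "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
        n_channels = wav_file.getnchannels()
    return sample_rate, n_frames / sample_rate, n_channels


def write_test_tone(path, sample_rate=16000, seconds=5.0, freq_hz=440.0):
    """Write a mono 16-bit sine tone; a stand-in for a real recording."""
    n_frames = int(sample_rate * seconds)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_frames)
    )
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"".join(struct.pack("<h", s) for s in samples))


# Demonstrate on a synthetic file; replace demo_path with your own recording.
demo_path = Path(tempfile.mkdtemp()) / "demo.wav"
write_test_tone(demo_path)
rate, duration, channels = describe_wav(demo_path)
print(f"{rate} Hz, {duration:.1f} s, {channels} channel(s)")
```

If the reported rate differs from 16000 Hz, re-record or resample the file before running inference.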

Create a .txt file with some sentences written in English and put it into the **texts** folder. Set the path to your .txt file in the attribute `INPUT_TEXTS_PATH` in `src/tts_modules/common/configs/main_config.yaml`.


The examples are already present in these folders.

In [2]:
with open(os.path.join(os.getcwd(), 'osms/tts_modules/common/configs/main_config.yaml'), "r") as ymlfile:
    # safe_load avoids the explicit Loader argument that yaml.load() requires in recent PyYAML
    main_config = yaml.safe_load(ymlfile)
    
# Create the input and output folders if they do not exist yet.
SPEAKER_SPEECH_PATH = "../audio_samples"
INPUT_TEXTS_PATH = "../texts"
OUTPUT_AUDIO_DIR = "../result_speech"
for folder in (SPEAKER_SPEECH_PATH, INPUT_TEXTS_PATH, OUTPUT_AUDIO_DIR):
    os.makedirs(folder, exist_ok=True)

In [2]:
main_config = None

In [3]:
multispeaker_manager = MultispeakerManager(main_configs=main_config)
multispeaker_manager.inference()

Trainable Parameters in dVecModel: 1.424M
Loading DVecModel checkpoint from checkpoints/encoder.pt
Trainable Parameters in Tacotron: 30.870M
Loading Tacotron checkpoint from checkpoints/synthesizer/synthesizer.pt
Trainable Parameters in WaveRNN: 4.481M
Loading WaveRNN checkpoint from checkpoints/vocoder/vocoder.pt


array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
       5.21322577e-07, 7.94726051e-08, 0.00000000e+00])

The results will be available in the folder `result_speech`. The name of the file will be **result.wav**.
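
The array printed above looks like a waveform of floating-point samples. As a rough illustration of how such samples map to a .wav file, here is a sketch that stores floats as mono 16-bit PCM. It assumes samples in the range [-1, 1] and a 16000 Hz sample rate; neither is confirmed by the source, and this is not necessarily how `osms` writes **result.wav**. The helper `save_float_waveform` is hypothetical.

```python
import struct
import tempfile
import wave
from pathlib import Path


def save_float_waveform(samples, path, sample_rate=16000):
    """Clip float samples to [-1, 1] and store them as mono 16-bit PCM."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm += struct.pack("<h", round(clipped * 32767))
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))


# Write a few samples (two of them out of range) and read them back.
out_path = Path(tempfile.mkdtemp()) / "sketch.wav"
save_float_waveform([0.0, 0.5, -0.5, 1.2, -1.2], out_path)

with wave.open(str(out_path), "rb") as wav_file:
    stored = struct.unpack("<5h", wav_file.readframes(5))
print(stored)
```

Out-of-range samples are clipped rather than wrapped, which avoids loud artifacts if the model overshoots slightly.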

The usability will be further improved.

In [1]:
import os
os.chdir('../src')
import torch
from osms.tts_modules.encoder import SpeakerEncoderManager
from osms.common.configs import get_default_main_configs
from osms.tts_modules.encoder.configs import get_default_encoder_config
from osms.tts_modules.encoder.data.wav2mel import StandardWav2MelTransform
from osms.tts_modules.encoder.data.wav_preprocessing import StandardAudioPreprocessor
from osms.tts_modules.encoder.data.dataset import SpeakerEncoderDataLoader, SpeakerEncoderDataset, PreprocessLibriSpeechDataset
from osms.tts_modules.encoder.models import DVecModel


In [2]:
main_configs = get_default_main_configs()
encoder_config = get_default_encoder_config()

preprocessor = StandardAudioPreprocessor(encoder_config)
wav2mel = StandardWav2MelTransform(encoder_config)

data_preprocesser = PreprocessLibriSpeechDataset(encoder_config, preprocessor, wav2mel)

In [3]:
data_preprocesser.preprocess_dataset(n_speakers=10)

10 speakers were preprocessed.


In [3]:
dataset = SpeakerEncoderDataset(encoder_config)
dataloader = SpeakerEncoderDataLoader(encoder_config, dataset, 'train')

model = DVecModel(encoder_config)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

encoder_manager = SpeakerEncoderManager(main_configs, model, 
                                        preprocessor, wav2mel, 
                                        dataloader, dataloader, 
                                        optimizer)


Trainable Parameters in dVecModel: 1.424M


In [4]:
encoder_manager.train_session(number_steps=3)

Starting the training from scratch.
Saving the model (step 2)
Stopping Training Session


# Install osms

We suggest creating a new virtual environment for this demonstration.

Run the cell below to create a new venv called demo_venv and upgrade pip inside it.

A venv isolates newly installed packages from the system's Python interpreter, which avoids conflicts with globally installed packages and their versions.

In [1]:
!python3 -m venv demo_venv
# Each "!" line runs in its own subshell, so `source demo_venv/bin/activate`
# would not carry over to later lines; call the venv's interpreter directly.
!demo_venv/bin/python -m pip install --upgrade pip
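
Because shell commands in a notebook do not change the interpreter that the kernel itself runs on, it can be useful to check which Python is active. The sketch below relies on the standard-library behaviour that `sys.prefix` differs from `sys.base_prefix` inside a venv; the helper name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```

If this prints `False`, packages installed from the notebook go into the kernel's own environment rather than into demo_venv.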



Next, change the current working directory to the root of OSM-one-shot-multispeaker:

In [2]:
import os

print(f"Initial working directory: {os.getcwd()}")
os.chdir('../')
print(f"Current working directory: {os.getcwd()}")

Initial working directory: /Users/kolya/Documents/Skoltech_PhD/courses/Theoretical Foundations of Data Science/project/OSM-one-shot-multispeaker/notebooks
Current working directory: /Users/kolya/Documents/Skoltech_PhD/courses/Theoretical Foundations of Data Science/project/OSM-one-shot-multispeaker
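
A relative `os.chdir('../')` only works when the notebook is launched from the **notebooks** folder. One more robust alternative, sketched here, walks up the directory tree until it finds a marker file. It assumes that `setup.py` sits in the repository root, which the install log in this notebook suggests; the helper `find_project_root` is hypothetical, and the demonstration runs on a throwaway directory tree rather than the real repository.

```python
import os
import tempfile
from pathlib import Path


def find_project_root(start, marker="setup.py"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} above {start}")


# Demonstration on a temporary tree that mimics <root>/notebooks.
demo_root = Path(tempfile.mkdtemp()).resolve()
(demo_root / "setup.py").write_text("# placeholder\n")
(demo_root / "notebooks").mkdir()

root = find_project_root(demo_root / "notebooks")
os.chdir(root)
print(f"Current working directory: {os.getcwd()}")
```

In the real notebook, `find_project_root(os.getcwd())` would replace the demonstration lines.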


Now the osms package can be installed:

In [3]:
!pip3 install .

Processing /Users/kolya/Documents/Skoltech_PhD/courses/Theoretical Foundations of Data Science/project/OSM-one-shot-multispeaker
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.


Using legacy 'setup.py install' for osms, since package 'wheel' is not installed.
Installing collected packages: osms
    Running setup.py install for osms ... done
Successfully installed osms-1.0.0


# Speaker Encoder training demo

This section describes how to prepare a dataset and train a speaker encoder.

In [4]:
if not os.path.exists('dataset'):
    os.mkdir('dataset')

Download the [LibriSpeech](https://www.openslr.org/resources/12/train-clean-100.tar.gz) dataset and unpack it into the `dataset` directory.
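
The download and unpacking can also be done from Python. The following is a sketch using only the standard library; the helpers `download_file` and `unpack_archive` are hypothetical and not part of `osms`, and the local archive name is an assumption. The actual download is left commented out because the archive is several gigabytes.

```python
import tarfile
import urllib.request
from pathlib import Path

LIBRISPEECH_URL = "https://www.openslr.org/resources/12/train-clean-100.tar.gz"


def download_file(url, destination):
    """Download `url` to `destination` unless the file already exists."""
    destination = Path(destination)
    if not destination.exists():
        destination.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(destination))
    return destination


def unpack_archive(archive_path, target_dir):
    """Extract a .tar.gz archive into `target_dir` and return the entries found there."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as archive:
        archive.extractall(str(target_dir))
    return sorted(entry.name for entry in target_dir.iterdir())


# Usage (commented out because of the download size):
# archive = download_file(LIBRISPEECH_URL, "dataset/train-clean-100.tar.gz")
# print(unpack_archive(archive, "dataset"))
```

Only extract archives from sources you trust, since tar files can contain unexpected paths.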

Import classes and functions for dataset preprocessing:

In [5]:
from osms.common.configs import get_default_main_configs
from osms.tts_modules.encoder.configs import get_default_encoder_config
from osms.tts_modules.encoder.data.wav2mel import StandardWav2MelTransform
from osms.tts_modules.encoder.data.wav_preprocessing import StandardAudioPreprocessor
from osms.tts_modules.encoder.data.dataset import PreprocessLibriSpeechDataset

All configs in `osms` are instances of `CfgNode` from the `yacs` library. Default values are defined in the corresponding functions inside `osms`. They can be overridden from a custom \*.yaml configuration file, but this demo uses the defaults.

`main_configs` contains general settings that are available across the whole package.

`encoder_config` contains all settings required by the speaker encoder module and its methods and classes.

In [None]:
main_configs = get_default_main_configs()
encoder_config = get_default_encoder_config()

preprocessor = StandardAudioPreprocessor(encoder_config)
wav2mel = StandardWav2MelTransform(encoder_config)

data_preprocesser = PreprocessLibriSpeechDataset(encoder_config, preprocessor, wav2mel)

In [None]:
from osms.tts_modules.encoder import SpeakerEncoderManager
from osms.tts_modules.encoder.data.dataset import SpeakerEncoderDataLoader, SpeakerEncoderDataset
from osms.tts_modules.encoder.models import DVecModel