
Fixed bug related to yourtts speaker embeddings issue #2234

Merged
merged 12 commits into coqui-ai:dev on Jan 2, 2023

Conversation

iamkhalidbashir
Contributor

iamkhalidbashir commented Dec 22, 2022

Fixes: #2236

@CLAassistant

CLAassistant commented Dec 22, 2022

CLA assistant check
All committers have signed the CLA.

@erogol
Member

erogol commented Dec 22, 2022

@Edresson 👀

Review comment on TTS/tts/utils/speakers.py (outdated, resolved)
@iamkhalidbashir
Contributor Author

I reverted the changes made in base_tts.py so we can merge this typo fix for now.
The ideal solution, though, would be to make d_vector_file a List[str], as @Edresson suggested.

@Edresson
Contributor

I fixed the issue with the following changes:

  • Changed the type of d_vector_file to List[str].
  • Updated the YourTTS config to match the new list type.
  • Updated ModelManager._update_path to deal with list attributes.

In addition, I added the speaker encoder model and config paths to the YourTTS recipe as defaults, to make zero-shot inference easy. This way the user will not need to edit config.json manually to set these paths, and any model trained with the YourTTS recipe will be able to run inference using "--speaker_idx" and "--speaker_wav".
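
For illustration, a minimal sketch of the list-aware path update (a hypothetical update_path helper, not the actual ModelManager._update_path, which operates on the model's config.json):

from typing import Any, Dict

def update_path(config: Dict[str, Any], field_name: str, new_path: str) -> None:
    """Point a path-valued config field at a new location, preserving its type:
    list-typed fields such as d_vector_file receive a single-element list."""
    if isinstance(config.get(field_name), list):
        # d_vector_file is now List[str], so wrap the single path in a list
        config[field_name] = [new_path]
    else:
        config[field_name] = new_path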

Edresson requested a review from erogol on Dec 22, 2022
@erogol
Member

erogol commented Dec 22, 2022

@Edresson is it possible to add a test case to prevent this from happening again?

@Edresson
Contributor

@Edresson is it possible to add a test case to prevent this from happening again?

I don't think so. The only long-term fix I can see is on the coqpit side: it should raise an error when the type of an argument does not match. Currently, it overrides the value with None, which is hard to debug.
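
For example, a minimal sketch of such a check (a hypothetical validate_types helper, not coqpit's actual API):

from dataclasses import dataclass
from typing import get_type_hints

@dataclass
class DemoConfig:
    # list-typed field, mirroring d_vector_file after this PR
    d_vector_file: list = None

def validate_types(cfg) -> None:
    """Raise on a type mismatch instead of silently overriding the value with None."""
    for name, expected in get_type_hints(type(cfg)).items():
        value = getattr(cfg, name)
        if value is not None and isinstance(expected, type) and not isinstance(value, expected):
            raise TypeError(f"{name}: expected {expected.__name__}, got {type(value).__name__}")

validate_types(DemoConfig(d_vector_file="speakers.pth"))  # raises TypeError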

@Edresson
Contributor

@erogol Everything looks OK to me. Could you merge it, please?

@iamkhalidbashir
Contributor Author

I am definitely getting this error when running the YourTTS training recipe from a restore path; the code is identical to the recipe:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/trainer/trainer.py", line 1591, in fit
    self._fit()
  File "/usr/local/lib/python3.9/dist-packages/trainer/trainer.py", line 1544, in _fit
    self.train_epoch()
  File "/usr/local/lib/python3.9/dist-packages/trainer/trainer.py", line 1292, in train_epoch
    self.train_loader = self.get_train_dataloader(
  File "/usr/local/lib/python3.9/dist-packages/trainer/trainer.py", line 803, in get_train_dataloader
    return self._get_loader(
  File "/usr/local/lib/python3.9/dist-packages/trainer/trainer.py", line 767, in _get_loader
    loader = model.get_data_loader(
  File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1621, in get_data_loader
    sampler = self.get_sampler(config, dataset, num_gpus)
  File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1554, in get_sampler
    multi_dict = config.weighted_sampler_multipliers.get(attr_name, None)
AttributeError: 'NoneType' object has no attribute 'get'

The recipe code plus my small change to the restore path:

import os

import torch
from trainer import Trainer, TrainerArgs

from TTS.bin.compute_embeddings import compute_embeddings
from TTS.bin.resample import resample_files
from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import CharactersConfig, Vits, VitsArgs, VitsAudioConfig
from TTS.utils.downloaders import download_vctk

torch.set_num_threads(24)

# pylint: disable=W0105
"""
    This recipe replicates the first experiment proposed in the YourTTS paper (https://arxiv.org/abs/2112.02418).
    YourTTS model is based on the VITS model however it uses external speaker embeddings extracted from a pre-trained speaker encoder and has small architecture changes.
    In addition, YourTTS can be trained in multilingual data, however, this recipe replicates the single language training using the VCTK dataset.
    If you are interested in multilingual training, we have commented on parameters on the VitsArgs class instance that should be enabled for multilingual training.
    In addition, you will need to add the extra datasets following the VCTK as an example.
"""
# NOTE: base_path is defined earlier in the reporter's script (not shown here)
CURRENT_PATH = base_path

# Name of the run for the Trainer
RUN_NAME = "Test"

# Path where you want to save the model outputs (configs, checkpoints and tensorboard logs)
OUT_PATH = base_path + "/output"  # "/raid/coqui/Checkpoints/original-YourTTS/"

# If you want to do transfer learning and speed up your training, you can set here the path to the original YourTTS model
RESTORE_PATH = "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"  # "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"

# This parameter is useful for debugging: it skips the training epochs and just runs the evaluation and produces the test sentences
SKIP_TRAIN_EPOCH = False

# Set here the batch size to be used in training and evaluation
BATCH_SIZE = 14

# Training sampling rate and the target sampling rate for resampling the downloaded dataset (Note: if you change this you might need to redownload the dataset!!)
# Note: if you add new datasets, please make sure that the dataset sampling rate and this parameter match, otherwise resample your audio
SAMPLE_RATE = 16000

# Max audio length in seconds to be used in training (every audio longer than this will be ignored)
MAX_AUDIO_LEN_IN_SECONDS = 10

### Download VCTK dataset
VCTK_DOWNLOAD_PATH = os.path.join(CURRENT_PATH, "VCTK")
# Define the number of threads used during the audio resampling
NUM_RESAMPLE_THREADS = 10
# Download the VCTK dataset if it is not already present
if not os.path.exists(VCTK_DOWNLOAD_PATH):
    print(">>> Downloading VCTK dataset:")
    download_vctk(VCTK_DOWNLOAD_PATH)
    resample_files(VCTK_DOWNLOAD_PATH, SAMPLE_RATE, file_ext="flac", n_jobs=NUM_RESAMPLE_THREADS)

# init configs
vctk_config = BaseDatasetConfig(
    formatter="vctk",
    dataset_name="vctk",
    meta_file_train="",
    meta_file_val="",
    path=VCTK_DOWNLOAD_PATH,
    language="en",
    ignored_speakers=[
        "p261",
        "p225",
        "p294",
        "p347",
        "p238",
        "p234",
        "p248",
        "p335",
        "p245",
        "p326",
        "p302",
    ],  # Ignore the test speakers to fully replicate the paper experiment
)

# Add all dataset configs here. In our case we just want to train with the VCTK dataset, so we only add VCTK. Note: if you want to add new datasets, just add them here and the speaker embeddings (d-vectors) will be computed automatically for them :)
DATASETS_CONFIG_LIST = [vctk_config]

### Extract speaker embeddings
SPEAKER_ENCODER_CHECKPOINT_PATH = (
    "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar"
)
SPEAKER_ENCODER_CONFIG_PATH = "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json"

D_VECTOR_FILES = []  # List of speaker embeddings/d-vectors to be used during the training

# Iterate over all the dataset configs and compute the speaker embeddings if they are missing
for dataset_conf in DATASETS_CONFIG_LIST:
    # Check if the embeddings were already computed; if not, compute them
    embeddings_file = os.path.join(dataset_conf.path, "speakers.pth")
    if not os.path.isfile(embeddings_file):
        print(f">>> Computing the speaker embeddings for the {dataset_conf.dataset_name} dataset")
        compute_embeddings(
            SPEAKER_ENCODER_CHECKPOINT_PATH,
            SPEAKER_ENCODER_CONFIG_PATH,
            embeddings_file,
            old_spakers_file=None,
            config_dataset_path=None,
            formatter_name=dataset_conf.formatter,
            dataset_name=dataset_conf.dataset_name,
            dataset_path=dataset_conf.path,
            meta_file_train=dataset_conf.meta_file_train,
            meta_file_val=dataset_conf.meta_file_val,
            disable_cuda=False,
            no_eval=False,
        )
    D_VECTOR_FILES.append(embeddings_file)


# Audio config used in training.
audio_config = VitsAudioConfig(
    sample_rate=SAMPLE_RATE,
    hop_length=256,
    win_length=1024,
    fft_size=1024,
    mel_fmin=0.0,
    mel_fmax=None,
    num_mels=80,
)

# Init VitsArgs, setting the arguments that are needed for the YourTTS model
model_args = VitsArgs(
    d_vector_file=D_VECTOR_FILES,
    use_d_vector_file=True,
    d_vector_dim=512,
    num_layers_text_encoder=10,
    speaker_encoder_model_path=SPEAKER_ENCODER_CHECKPOINT_PATH,
    speaker_encoder_config_path=SPEAKER_ENCODER_CONFIG_PATH,
    resblock_type_decoder="2",  # In the paper, we accidentally trained YourTTS using ResNet blocks type 2; if you like, you can use ResNet blocks type 1 like the VITS model
    # Useful parameters to enable the Speaker Consistency Loss (SCL) described in the paper
    # use_speaker_encoder_as_loss=True,
    # Useful parameters to enable multilingual training
    # use_language_embedding=True,
    # embedded_language_dim=4,
)

# General training config; here you can change the batch size and other useful parameters
config = VitsConfig(
    output_path=OUT_PATH,
    model_args=model_args,
    run_name=RUN_NAME,
    project_name="YourTTS",
    run_description="""
            - Original YourTTS trained using VCTK dataset
        """,
    dashboard_logger="tensorboard",
    logger_uri=None,
    audio=audio_config,
    batch_size=BATCH_SIZE,
    batch_group_size=5,
    eval_batch_size=BATCH_SIZE,
    num_loader_workers=8,
    eval_split_max_size=256,
    print_step=50,
    plot_step=100,
    log_model_step=1000,
    save_step=500,
    save_n_checkpoints=2,
    save_checkpoints=True,
    target_loss="loss_1",
    print_eval=True,
    use_phonemes=False,
    phonemizer="espeak",
    phoneme_language="en",
    compute_input_seq_cache=True,
    add_blank=True,
    text_cleaner="multilingual_cleaners",
    characters=CharactersConfig(
        characters_class="TTS.tts.models.vits.VitsCharacters",
        pad="_",
        eos="&",
        bos="*",
        blank=None,
        characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        punctuations="!'(),-.:;? ",
        phonemes="",
        is_unique=True,
        is_sorted=True,
    ),
    phoneme_cache_path=None,
    precompute_num_workers=12,
    start_by_longest=True,
    datasets=DATASETS_CONFIG_LIST,
    cudnn_benchmark=False,
    max_audio_len=SAMPLE_RATE * MAX_AUDIO_LEN_IN_SECONDS,
    mixed_precision=False,
    test_sentences=[
        [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "VCTK_p277",
            None,
            "en",
        ],
        [
            "Be a voice, not an echo.",
            "VCTK_p239",
            None,
            "en",
        ],
        [
            "I'm sorry Dave. I'm afraid I can't do that.",
            "VCTK_p258",
            None,
            "en",
        ],
        [
            "This cake is great. It's so delicious and moist.",
            "VCTK_p244",
            None,
            "en",
        ],
        [
            "Prior to November 22, 1963.",
            "VCTK_p305",
            None,
            "en",
        ],
    ],
    # Enable the weighted sampler
    use_weighted_sampler=True,
    # Ensures that all speakers are seen in the training batch equally, no matter how many samples each speaker has
    weighted_sampler_attrs={"speaker_name": 1.0},
    weighted_sampler_multipliers={},
    # Set the Speaker Consistency Loss (SCL) α to 9 as in the paper
    speaker_encoder_loss_alpha=9.0,
)

# Load all the dataset samples and split training and evaluation sets
train_samples, eval_samples = load_tts_samples(
    config.datasets,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# Init the model
model = Vits.init_from_config(config)
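
The pasted snippet stops here; presumably the script continues like the upstream recipe, initializing the Trainer with the restore path and calling fit(), roughly:

# Init the trainer and start training (sketch based on the upstream recipe)
trainer = Trainer(
    TrainerArgs(restore_path=RESTORE_PATH, skip_train_epoch=SKIP_TRAIN_EPOCH),
    config,
    output_path=OUT_PATH,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()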

@erogol
Member

erogol commented Jan 2, 2023

@iamkhalidbashir is the error from this PR?

@iamkhalidbashir
Contributor Author

@iamkhalidbashir is the error from this PR?

Yes, I took the recipe from this PR.

@erogol
Member

erogol commented Jan 2, 2023

But in the PR, weighted_sampler_multipliers is an empty dict, not None. Did you use the latest version of the PR?

@iamkhalidbashir
Contributor Author

Yes, I did use the latest. @Edresson doesn't seem to be able to reproduce this on his end; I'm not sure if it's related to my Python version. I will debug where the issue lies.

@erogol erogol merged commit 42afad5 into coqui-ai:dev Jan 2, 2023
@karynaur

karynaur commented Jan 3, 2023

How do I make sure I don't get the error? multi_dict = config.weighted_sampler_multipliers.get(attr_name, None) AttributeError: 'NoneType' object has no attribute 'get'

@iamkhalidbashir
Contributor Author

How do I make sure I don't get the error? multi_dict = config.weighted_sampler_multipliers.get(attr_name, None) AttributeError: 'NoneType' object has no attribute 'get'

I do this:

# Enable the weighted sampler
use_weighted_sampler=False,
# Ensures that all speakers are seen in the training batch equally no matter how many samples each speaker has
weighted_sampler_attrs={"speaker_name": 1.0},
weighted_sampler_multipliers={"speaker_name": {}},

then the error disappears.
The important line is the last one:

weighted_sampler_multipliers={"speaker_name": {}},
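
An alternative workaround would be to guard the failing line itself against a None attribute (a sketch, not the code merged in this PR):

# in Vits.get_sampler: fall back to an empty dict when the attribute is None
multi_dict = (config.weighted_sampler_multipliers or {}).get(attr_name, None)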

@karynaur

karynaur commented Jan 3, 2023

Thank you so much! That worked. Also, how do I change the number of epochs? It seems to be 1000 by default.
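
For reference, the epoch count is a plain field on the training config (assuming the standard epochs field inherited from Coqui's base training config, which defaults to 1000):

config = VitsConfig(
    output_path=OUT_PATH,
    epochs=500,  # train for 500 epochs instead of the default 1000
    # ... the rest of the arguments from the recipe above
)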

@iamkhalidbashir
Contributor Author

iamkhalidbashir commented Jan 3, 2023 via email

Successfully merging this pull request may close these issues.

train_yourtts speaker embeddings does not generate audio