How to Compute Speaker Embeddings in 24K? #2552

chigkim · 2023-04-23T18:29:26Z

I set SAMPLE_RATE = 24000 in recipes/vctk/yourtts/train_yourtts.py.
However, it still loads AudioProcessor at 16 kHz and computes the speaker embeddings at 16 kHz.
Where is this 16 kHz sample rate hard-coded?
Thanks!

pivolan · 2023-05-11T15:55:01Z

You can find this in the speaker encoder config file:
https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json

I already tried changing these params, but then the model fails to load, because its parameters differ from what you set in the config.
I have now started experiments using it with a 22050 Hz dataset, but I don't think it will work properly. I also don't know whether it is possible to compute embeddings from 16000 Hz audio while training the model on 22050 Hz audio.
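
To check which sample rate the encoder config declares, something along these lines may help. It is only a minimal sketch: it assumes the rate is stored under `audio` / `sample_rate` in the JSON (which I believe is the case for config_se.json), `encoder_sample_rate` is a helper name made up for this example, and the inline JSON string stands in for the downloaded file.

```python
import json


def encoder_sample_rate(config_text):
    """Return the sample rate declared in a speaker encoder config (JSON text)."""
    config = json.loads(config_text)
    # Assumption: the rate lives at config["audio"]["sample_rate"].
    return config["audio"]["sample_rate"]


# Stand-in for the contents of a downloaded config_se.json.
example_config = '{"audio": {"sample_rate": 16000}}'
print(encoder_sample_rate(example_config))
```

With the real file, you would pass the text read from your local copy of config_se.json instead of `example_config`.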

p0p4k · 2023-05-17T12:39:16Z

The speaker encoder model (which creates the speaker embeddings) was trained on 16 kHz audio. You will have to resample the dataset to 16 kHz to create the embeddings. You can still train the TTS model at 22 kHz.

3 replies

p0p4k May 17, 2023

or 24 kHz in your case

KaikeWesleyReis Apr 15, 2024

@p0p4k would this mismatch be a problem: 16k embeddings with 22k audio samples for VITS? I'm asking because I'm fine-tuning VITS with:

Embeddings at 16k, as in the code below
Audio samples at 22k for VITS

p0p4k Apr 15, 2024

Resample the wavs to 16k for the speaker embeddings, and use 22k for VITS.
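
A rough sketch of that resampling step, assuming librosa and soundfile are installed. The helper names `mirrored_path` and `resample_dir` are made up for this example, and the folder layout is whatever you choose; the original files stay untouched for TTS training.

```python
from pathlib import Path


def mirrored_path(src_file, src_dir, dst_dir):
    """Map a file under src_dir to the same relative location under dst_dir."""
    return Path(dst_dir) / Path(src_file).relative_to(src_dir)


def resample_dir(src_dir, dst_dir, target_sr=16000):
    """Write a target_sr copy of every .wav under src_dir into dst_dir."""
    # Third-party imports are kept local so the path helper works without them.
    import librosa
    import soundfile as sf

    for src_file in sorted(Path(src_dir).rglob("*.wav")):
        out_file = mirrored_path(src_file, src_dir, dst_dir)
        out_file.parent.mkdir(parents=True, exist_ok=True)
        # librosa.load resamples to `sr` while loading (and downmixes to mono).
        audio, _ = librosa.load(str(src_file), sr=target_sr)
        sf.write(str(out_file), audio, target_sr)
```

For example, `resample_dir("/dataset/wavs", "/dataset/16khz")` would create 16 kHz copies for the speaker encoder while the 22 kHz (or 24 kHz) files remain in place for VITS.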

pivolan · 2023-06-13T22:08:43Z

If you want to use different sample rates, you should use your own custom formatter functions in two places. One for load_tts_samples:

train_samples, eval_samples = load_tts_samples(
    config.datasets,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
    formatter=custom_formatter
)

and one for computing the embeddings:

for dataset_conf in DATASETS_CONFIG_LIST:
    # Check if the embeddings weren't already computed, if not compute it
    embeddings_file = os.path.join(dataset_conf.path, "speakers.pth")
    if not os.path.isfile(embeddings_file):
        print(f">>> Computing the speaker embeddings for the {dataset_conf.dataset_name} dataset")
        compute_embeddings(
            SPEAKER_ENCODER_CHECKPOINT_PATH,
            SPEAKER_ENCODER_CONFIG_PATH,
            embeddings_file,
            old_spakers_file=None,
            config_dataset_path=None,
            formatter_name=dataset_conf.formatter,
            dataset_name=dataset_conf.dataset_name,
            dataset_path=dataset_conf.path,
            meta_file_train=dataset_conf.meta_file_train,
            meta_file_val=dataset_conf.meta_file_val,
            disable_cuda=False,
            no_eval=False,
            formatter=custom_formatter_16khz
        )
    D_VECTOR_FILES.append(embeddings_file)

But that is not all: you also have to change the mapping that links each audio file's embedding to its absolute file path.
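
One way to picture that remapping is the sketch below. It rests on several assumptions: that the embeddings file holds a dict keyed by something like `<dataset>#<folder>/<clip>` (the exact key format depends on the TTS version, and may be an absolute path instead), that the 16 kHz copies sit in a folder named 16khz beside the training folder wavs (the layout described further down), and `remap_embedding_keys` is a hypothetical helper, not part of the library.

```python
def remap_embedding_keys(embeddings, emb_dir="16khz", train_dir="wavs"):
    """Return a copy whose keys point at the training folder instead of the 16 kHz one."""
    remapped = {}
    for key, value in embeddings.items():
        # Assumed key shape: "<dataset>#<folder>/<clip>".
        dataset, sep, rel_path = key.partition("#")
        if sep and rel_path.startswith(emb_dir + "/"):
            # Swap only the leading folder name; the clip path is unchanged.
            rel_path = train_dir + rel_path[len(emb_dir):]
        remapped[dataset + sep + rel_path] = value
    return remapped


# Toy data standing in for a loaded speakers.pth dictionary.
example = {"thorsten#16khz/clip1": {"name": "thorsten", "embedding": [0.1, 0.2]}}
print(list(remap_embedding_keys(example)))
```

Keys that do not match the assumed shape are passed through unchanged, so it should be safe to inspect the result before saving it back.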

So my method was:
I created two folders with wavs:
/dataset/wavs/.wav
/dataset/16khz/.wav
The first holds the 22050 Hz audio files and the second the 16000 Hz ones.
I created two custom formatters and rewrote the compute_embeddings method (I just copy-pasted it from the sources and added one more parameter, formatter).

So this is an example of my script for starting DE training with thorsten, using custom formatters with different sample rates for training and for embedding.
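
For illustration, the two formatters could look roughly like this. It is a sketch under assumptions, not the actual script: it assumes a `file_id|text` metadata file as used by the thorsten dataset, the usual Coqui formatter convention of returning dicts with `text`, `audio_file`, `speaker_name` and `root_path`, and the two folder names from above. `_thorsten_style_items` is a helper name invented here.

```python
import os


def _thorsten_style_items(root_path, meta_file, wav_folder):
    """Parse a 'file_id|text' metadata file into Coqui-style sample dicts."""
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as meta:
        for line in meta:
            cols = line.strip().split("|")
            if len(cols) < 2:
                continue  # skip blank or malformed lines
            items.append({
                "text": cols[1],
                "audio_file": os.path.join(root_path, wav_folder, cols[0] + ".wav"),
                "speaker_name": "thorsten",
                "root_path": root_path,
            })
    return items


def custom_formatter(root_path, meta_file, **kwargs):
    """Formatter for TTS training: points at the 22050 Hz files."""
    return _thorsten_style_items(root_path, meta_file, "wavs")


def custom_formatter_16khz(root_path, meta_file, **kwargs):
    """Formatter for embedding computation: points at the 16000 Hz copies."""
    return _thorsten_style_items(root_path, meta_file, "16khz")
```

Both formatters read the same metadata, so every clip gets the same text and speaker name and only the audio path differs.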
