## Multi-Accent and Multi-Lingual Voice Clone Demo with MeloTTS

In [1]:
import os
import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

ckpt_converter = 'checkpoints_v2/converter'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
output_dir = 'outputs_v2'

tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

os.makedirs(output_dir, exist_ok=True)

Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.

Loaded checkpoint 'checkpoints_v2/converter/checkpoint.pth'
missing/unexpected keys: [] []
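The cell above assumes that `checkpoints_v2/converter` already contains `config.json` and `checkpoint.pth`. A minimal sketch of a pre-flight check, using only the standard library, might look like the following. The helper name `missing_checkpoint_files` is hypothetical and not part of OpenVoice.

```python
from pathlib import Path


def missing_checkpoint_files(ckpt_dir, required=("config.json", "checkpoint.pth")):
    """Return the required file names that are absent from ckpt_dir.

    Hypothetical helper for illustration; it only inspects the filesystem
    and does not validate the contents of the files.
    """
    root = Path(ckpt_dir)
    return [name for name in required if not (root / name).is_file()]


# Same directory the notebook passes to ToneColorConverter.
missing = missing_checkpoint_files("checkpoints_v2/converter")
if missing:
    print(f"Missing converter files: {missing}")
```

Running a check like this before constructing the converter can surface a missing download with a readable message rather than a stack trace.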


### Obtain Tone Color Embedding
We extract a tone color embedding only for the target speaker. The source tone color embeddings for the base speakers can be loaded directly from the `checkpoints_v2/base_speakers/ses` folder.

In [2]:

#reference_speaker = 'resources/example_reference.mp3' # This is the voice you want to clone
reference_speaker = 'resources/training_enrique.mp3' # This is the voice you want to clone
target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, vad=False)

OpenVoice version: v2


Estimating duration from bitrate, this may be inaccurate
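To see which precomputed source embeddings are available before running the synthesis loop, one option is to list the `.pth` files in the embeddings folder. This is a sketch under the assumption that the folder is `checkpoints_v2/base_speakers/ses`, as used later in this notebook; the helper name `list_source_embeddings` is hypothetical.

```python
from pathlib import Path


def list_source_embeddings(ses_dir):
    """Return the sorted base names of all .pth files in ses_dir.

    Hypothetical helper for illustration. Returns an empty list when the
    directory does not exist, so it is safe to call before downloading.
    """
    root = Path(ses_dir)
    if not root.is_dir():
        return []
    return sorted(p.stem for p in root.glob("*.pth"))


print(list_source_embeddings("checkpoints_v2/base_speakers/ses"))
```

An empty list here usually means the checkpoints have not been unpacked into the expected location.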


#### Use MeloTTS as Base Speakers

MeloTTS is a high-quality multi-lingual text-to-speech library by MyShell.ai. It supports English (American, British, Indian, Australian, and Default accents), Spanish, French, Chinese, Japanese, and Korean. The following example uses the MeloTTS models as the base speakers.

In [4]:
from melo.api import TTS

texts = {
    'EN_NEWEST': """Don’t be an asshole! Miguel, this audio has been created from text using a sample of 1 minute of my voice.
     I used to train a bad micro, so the audio itself sucks, but we can improve it! Stay tuned!
    """,
    'EN': """Don’t be an asshole! Miguel, this audio has been created from text using a sample of 1 minute of my voice.
     I used to train a bad micro, so the audio itself sucks, but we can improve it! Stay tuned!
    """,
    'FR': """Ne sois pas un connard ! Miguel, cet audio a été créé à partir de texte en utilisant un échantillon d'une minute de ma voix.
      J'avais l'habitude de former un mauvais micro, donc l'audio en lui-même est nul, mais nous pouvons l'améliorer ! Restez à l'écoute!
    """,
    'ES': """¡No seas idiota! Miguel, este audio ha sido creado a partir de texto usando una muestra de 1 minuto de mi voz.
      Solía entrenar un micro malo, por lo que el audio en sí apesta, ¡pero podemos mejorarlo! ¡Manténganse al tanto!
    """
}


src_path = f'{output_dir}/tmp.wav'

# Speed is adjustable
speed = 1.0

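# Each MeloTTS language model exposes one or more base speakers
for language, text in texts.items():
    model = TTS(language=language, device=device)
    speaker_ids = model.hps.data.spk2id

    for speaker_key in speaker_ids.keys():
        speaker_id = speaker_ids[speaker_key]
        # Embedding files use lowercase, hyphen-separated names
        speaker_key = speaker_key.lower().replace('_', '-')

        # Load the base speaker's embedding, then synthesize to a temporary file
        # (src_path is overwritten on every iteration)
        source_se = torch.load(f'checkpoints_v2/base_speakers/ses/{speaker_key}.pth', map_location=device)
        model.tts_to_file(text, speaker_id, src_path, speed=speed)
        save_path = f'{output_dir}/output_v2_{speaker_key}.wav'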

        # Run the tone color converter
        encode_message = "@MyShell"
        tone_color_converter.convert(
            audio_src_path=src_path, 
            src_se=source_se, 
            tgt_se=target_se, 
            output_path=save_path,
            message=encode_message)

 > Text split to sentences.
Don't be an asshole! Miguel, this audio has been created from text using a sample of 1 minute of my voice. I used to train a bad micro, so the audio itself sucks, but we can improve it! Stay tuned!


100%|██████████| 1/1 [00:04<00:00,  4.25s/it]


 > Text split to sentences.
Don't be an asshole! Miguel, this audio has been created from text using a sample of 1 minute of my voice. I used to train a bad micro, so the audio itself sucks, but we can improve it! Stay tuned!


100%|██████████| 1/1 [00:05<00:00,  5.57s/it]


 > Text split to sentences.
Don't be an asshole! Miguel, this audio has been created from text using a sample of 1 minute of my voice. I used to train a bad micro, so the audio itself sucks, but we can improve it! Stay tuned!


100%|██████████| 1/1 [00:06<00:00,  6.29s/it]


 > Text split to sentences.
Don't be an asshole! Miguel, this audio has been created from text using a sample of 1 minute of my voice. I used to train a bad micro, so the audio itself sucks, but we can improve it! Stay tuned!


100%|██████████| 1/1 [00:06<00:00,  6.33s/it]


 > Text split to sentences.
Don't be an asshole! Miguel, this audio has been created from text using a sample of 1 minute of my voice. I used to train a bad micro, so the audio itself sucks, but we can improve it! Stay tuned!


100%|██████████| 1/1 [00:06<00:00,  6.24s/it]


 > Text split to sentences.
Don't be an asshole! Miguel, this audio has been created from text using a sample of 1 minute of my voice. I used to train a bad micro, so the audio itself sucks, but we can improve it! Stay tuned!


100%|██████████| 1/1 [00:05<00:00,  5.37s/it]


 > Text split to sentences.
Ne sois pas un connard ! Miguel, cet audio a été créé à partir de texte en utilisant un échantillon d'une minute de ma voix. J'avais l'habitude de former un mauvais micro, donc l'audio en lui-même est nul, mais nous pouvons l'améliorer ! Restez à l'écoute!


100%|██████████| 1/1 [00:08<00:00,  8.44s/it]


 > Text split to sentences.
¡No seas idiota! Miguel, este audio ha sido creado a partir de texto usando una muestra de 1 minuto de mi voz. Solía entrenar un micro malo, por lo que el audio en sí apesta, ¡pero podemos mejorarlo! ¡Manténganse al tanto!


100%|██████████| 1/1 [00:08<00:00,  8.62s/it]
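
The loop above derives both the embedding path and the output path from a normalized speaker key. The sketch below isolates that string handling so it can be checked without loading any model. The helper name `speaker_paths` is hypothetical, and the sample keys are illustrative inputs rather than a confirmed list of MeloTTS speaker names.

```python
def speaker_paths(speaker_key,
                  output_dir="outputs_v2",
                  ses_dir="checkpoints_v2/base_speakers/ses"):
    """Return (embedding_path, output_path) for a speaker key.

    Hypothetical helper: applies the same normalization as the notebook
    (lowercase, underscores replaced by hyphens).
    """
    normalized = speaker_key.lower().replace("_", "-")
    embedding_path = f"{ses_dir}/{normalized}.pth"
    output_path = f"{output_dir}/output_v2_{normalized}.wav"
    return embedding_path, output_path


# Illustrative keys only; the real keys come from model.hps.data.spk2id.
for key in ("EN_INDIA", "FR", "ES"):
    print(key, *speaker_paths(key))
```

Because each output file name includes the normalized key, every base speaker writes to its own file, while the intermediate `tmp.wav` is reused.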
