## Multi-Accent and Multi-Lingual Voice Clone Demo with MeloTTS

In [1]:
import os
import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.



### Initialization

In this example, we use the checkpoints from OpenVoiceV2. OpenVoiceV2 is trained with more aggressive augmentations and thus demonstrates better robustness in some cases.
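Before loading the converter, it can help to confirm that the checkpoint files are actually on disk, since a missing file otherwise surfaces as a less obvious error during loading. The following is a minimal sketch: the two file names come from the loading cell in this section, while the helper `missing_converter_files` and the forward-slash path are assumptions made for illustration.

```python
from pathlib import Path

# File names taken from the loading cell; the helper itself is hypothetical.
REQUIRED_CONVERTER_FILES = ("config.json", "checkpoint.pth")


def missing_converter_files(ckpt_dir, required=REQUIRED_CONVERTER_FILES):
    """Return the required file names that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in required if not (root / name).is_file()]


# Assumed location; adjust to wherever the OpenVoiceV2 checkpoints were unpacked.
missing = missing_converter_files("checkpoints_v2/converter")
if missing:
    print(f"Missing converter files: {missing}")
else:
    print("Converter checkpoint files found")
```

An empty list means both files are present and the loading cell can be run.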

In [2]:
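# Windows-style relative paths (raw strings). On Linux/macOS, use forward slashes instead.
ckpt_converter = r".\checkpoints_v2\converter"
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # use the first GPU when available
output_dir = r'.\checkpoints_v2\outputs_v2'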

tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

os.makedirs(output_dir, exist_ok=True)

Loaded checkpoint '.\checkpoints_v2\converter/checkpoint.pth'
missing/unexpected keys: [] []


### Obtain Tone Color Embedding
We extract a tone color embedding only for the target speaker. The source tone color embeddings can be loaded directly from the `checkpoints_v2/base_speakers/ses` folder.
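The synthesis cell later in this section finds each source embedding by lowercasing the MeloTTS speaker key, replacing underscores with hyphens, and looking for a `.pth` file of that name under `checkpoints_v2/base_speakers/ses`. A minimal sketch of that mapping follows; the helper name `source_se_path` is hypothetical, and `EN-Default` is used only as an illustrative key.

```python
from pathlib import PurePosixPath

# Folder used by the synthesis cell in this section.
SES_DIR = PurePosixPath("checkpoints_v2/base_speakers/ses")


def source_se_path(speaker_key, ses_dir=SES_DIR):
    """Map a speaker key such as 'EN_NEWEST' to its embedding file path."""
    normalized = speaker_key.lower().replace("_", "-")
    return ses_dir / f"{normalized}.pth"


print(source_se_path("EN_NEWEST"))
print(source_se_path("EN-Default"))  # illustrative key, not taken from the logs
```

If the resulting file does not exist, the synthesis cell prints "File not found" and skips that speaker.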

In [None]:
# Specify the reference speaker file you want to use
voice_file = "kursche_voice.mp3"
reference_speaker = f'./{voice_file}' # This is the voice you want to clone
target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, vad=True)

OpenVoice version: v2
[(1.294, 12.242), (12.558, 37.138), (37.23, 48.818), (49.07, 60.8391875)]
after vad: dur = 58.884965986394555
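As a quick sanity check on the output above, the kept duration can be recomputed from the printed segments. This sketch assumes each tuple is a `(start, end)` pair in seconds, which the log does not state explicitly.

```python
# Segments copied from the extraction log; assumed to be (start, end) in seconds.
segments = [(1.294, 12.242), (12.558, 37.138), (37.23, 48.818), (49.07, 60.8391875)]

# Total speech kept after voice activity detection removes the silent gaps.
total = sum(end - start for start, end in segments)
print(f"kept speech: {total:.3f} s")
```

The sum comes to about 58.885 s, which agrees with the logged `dur` value to within a millisecond.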



#### Use MeloTTS as Base Speakers

MeloTTS is a high-quality multi-lingual text-to-speech library by @MyShell.ai that supports English (American, British, Indian, Australian, Default), Spanish, French, Chinese, Japanese, and Korean. The following example uses the MeloTTS models as the base speakers.

In [None]:
import nltk
from melo.api import TTS


# Download required NLTK data first
print("Checking NLTK data...")
try:
    nltk.data.find('tokenizers/punkt')
    print("✓ Punkt tokenizer found")
except LookupError:
    print("Downloading Punkt tokenizer...")
    nltk.download('punkt', quiet=True)

try:
    nltk.data.find('taggers/averaged_perceptron_tagger')
    print("✓ Averaged perceptron tagger found")
except LookupError:
    print("Downloading averaged perceptron tagger...")
    nltk.download('averaged_perceptron_tagger', quiet=True)

# Texts to synthesize; each text must be in the language of its TTS model.
# Add more languages/models as needed.
texts = {
    'EN_NEWEST': "Hello students from FerienAkademie. How are you doing? It's beautiful weather today, I hope you are ready for a long hike. To tell more about myself, I love reading books and listening to music."
}

src_path = f'{output_dir}/tmp.wav'
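# Experiment with these parameters (suggested ranges in parentheses).
# Note: only `speed` is passed to tts_to_file below; the other three are
# defined here but not used in this cell.
speed = 1.0  # Speaking rate; 1.0 is normal, lower values are slower (0.8-1.2)
noise_scale = 0.667  # More variation (0.5-0.8)
noise_scale_w = 0.8  # More expressiveness (0.6-1.0)
sdp_ratio = 0.5  # Balance between styles (0.0-1.0)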

# Check if required variables exist
if 'target_se' not in locals() and 'target_se' not in globals():
    print("ERROR: target_se is not defined. You need to run the tone color extraction first!")
elif 'tone_color_converter' not in locals() and 'tone_color_converter' not in globals():
    print("ERROR: tone_color_converter is not defined. You need to initialize it first!")
else:
    for language, text in texts.items():
        print(f"Processing {language}...")
        model = TTS(language=language, device=device)
        speaker_ids = model.hps.data.spk2id
        
        for speaker_key in speaker_ids.keys():
            speaker_id = speaker_ids[speaker_key]
            speaker_key = speaker_key.lower().replace('_', '-')
            
            ses_file_path = f'./checkpoints_v2/base_speakers/ses/{speaker_key}.pth'
            
            if not os.path.exists(ses_file_path):
                print(f"File not found: {ses_file_path}")
                continue
                
            try:
                source_se = torch.load(ses_file_path, map_location=device)
                
                if torch.backends.mps.is_available() and device == 'cpu':
                    torch.backends.mps.is_available = lambda: False
                    
                print(f"Generating audio for {speaker_key}...")
                model.tts_to_file(text, speaker_id, src_path, speed=speed)
                save_path = f'{output_dir}/output_v2_{language}_{speaker_key}.wav'

                encode_message = "@MyShell"
                tone_color_converter.convert(
                    audio_src_path=src_path, 
                    src_se=source_se, 
                    tgt_se=target_se, 
                    output_path=save_path,
                    message=encode_message)
                    
                print(f"✓ Saved: {save_path}")
                    
            except Exception as e:
                print(f"Error processing {speaker_key}: {e}")

Checking NLTK data...
✓ Punkt tokenizer found
✓ Averaged perceptron tagger found
Processing EN_NEWEST...


Generating audio for en-newest...
 > Text split to sentences.
Hello students from FerienAkademie. How are you doing? It's beautiful weather today, I hope you are ready for a long hike. To tell more about myself, I love reading books and listening to music.


  0%|          | 0/1 [00:00<?, ?it/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 1/1 [00:14<00:00, 14.64s/it]


✓ Saved: .\checkpoints_v2\outputs_v2/output_v2_EN_NEWEST_en-newest.wav
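The cell above invites experimenting with the synthesis parameters, but it writes every run for a given speaker to the same file name. A minimal sketch for planning a sweep over `speed`, so that each run gets its own output file, is shown below. The helper `speed_sweep`, the `_speed…` file-name suffix, and the `outputs_v2` folder are assumptions; each `(speed, save_path)` pair would stand in for the `speed` and `save_path` values used inside the loop.

```python
def speed_sweep(output_dir, language, speaker_key, speeds):
    """Return (speed, save_path) pairs, one distinct output file per speed."""
    return [
        (speed, f"{output_dir}/output_v2_{language}_{speaker_key}_speed{speed:.2f}.wav")
        for speed in speeds
    ]


# Illustrative values within the range suggested by the cell's comments.
for speed, save_path in speed_sweep("outputs_v2", "EN_NEWEST", "en-newest", [0.8, 1.0, 1.2]):
    print(speed, save_path)
```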
