# Voice tone cloning with OpenVoice and OpenVINO



OpenVoice is a versatile instant voice cloning approach that transfers a speaker's tone color and generates speech in various languages from just a brief audio snippet of the source speaker. OpenVoice has three main features: (i) high-quality tone color replication across multiple languages and accents; (ii) fine-grained control over voice styles, including emotion and accent, as well as other parameters such as rhythm, pauses, and intonation; (iii) zero-shot cross-lingual voice cloning, which eliminates the need for the generated speech and the reference speech to be part of a massive-speaker multilingual training dataset.

![image](https://github.com/openvinotoolkit/openvino_notebooks/assets/5703039/ca7eab80-148d-45b0-84e8-a5a279846b51)

More details about the model can be found on the [project web page](https://research.myshell.ai/open-voice), in the [paper](https://arxiv.org/abs/2312.01479), and in the official [repository](https://github.com/myshell-ai/OpenVoice).

This notebook shows how to convert the [PyTorch OpenVoice model](https://github.com/myshell-ai/OpenVoice) to OpenVINO IR and run it with OpenVINO.

#### Table of contents:

- [Clone repository and install requirements](#Clone-repository-and-install-requirements)
- [Download checkpoints and load PyTorch model](#Download-checkpoints-and-load-PyTorch-model)
- [Convert models to OpenVINO IR](#Convert-models-to-OpenVINO-IR)
- [Inference](#Inference)
    - [Select inference device](#Select-inference-device)
    - [Select reference tone](#Select-reference-tone)
    - [Run inference](#Run-inference)
- [Run OpenVoice Gradio interactive demo](#Run-OpenVoice-Gradio-interactive-demo)
- [Cleanup](#Cleanup)


### Installation Instructions

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).

## Clone repository and install requirements
[back to top ⬆️](#Table-of-contents:)

In [1]:
# Fetch `notebook_utils` module
import requests
from pathlib import Path

if not Path("notebook_utils.py").exists():

    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
    )
    open("notebook_utils.py", "w").write(r.text)

if not Path("cmd_helper.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/cmd_helper.py",
    )
    open("cmd_helper.py", "w").write(r.text)

if not Path("pip_helper.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/pip_helper.py",
    )
    open("pip_helper.py", "w").write(r.text)

# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry
from notebook_utils import collect_telemetry

collect_telemetry("openvoice.ipynb")

In [None]:
from pathlib import Path

from cmd_helper import clone_repo
from pip_helper import pip_install
import platform


repo_dir = Path("OpenVoice")

clone_repo("https://github.com/myshell-ai/OpenVoice")
orig_english_path = Path("OpenVoice/openvoice/text/_orig_english.py")
english_path = Path("OpenVoice/openvoice/text/english.py")

# Replace `unidecode` with `anyascii` in english.py, keeping a backup of the original file
if not orig_english_path.exists():
    english_path.rename(orig_english_path)

    with orig_english_path.open("r") as f:
        data = f.read()
    data = data.replace("unidecode", "anyascii")
    with english_path.open("w") as out_f:
        out_f.write(data)


# fix a problem with silero downloading and installing
with Path("OpenVoice/openvoice/se_extractor.py").open("r") as orig_file:
    data = orig_file.read()
    data = data.replace('method="silero"', 'method="silero:3.0"')
    with Path("OpenVoice/openvoice/se_extractor.py").open("w") as out_f:
        out_f.write(data)

# Clone MeloTTS
clone_repo("https://github.com/myshell-ai/MeloTTS")

# Install pinned dependencies for OpenVoice V2 and MeloTTS
pip_install("librosa==0.9.1", "pydub==0.25.1", "tqdm", "inflect==7.0.0", "pypinyin==0.50.0", "openvino>=2025.0", "gradio",)
pip_install(
    "--extra-index-url",
    "https://download.pytorch.org/whl/cpu",
    "wavmark>=0.0.3",
    "faster-whisper>=0.9.0",
    "eng_to_ipa==0.0.2",
    "cn2an==0.5.22",
    "jieba==0.42.1",
    "langid==1.1.6",
    "ipywebrtc",
    "anyascii==0.3.2",
    "torch>=2.1",
    "torchaudio",
    "cached_path",
    "transformers==4.27.4",
    "num2words==0.5.12",
    "unidic_lite==1.0.8",
    "unidic==1.1.0",
    "mecab-python3==1.0.9",
    "pykakasi==2.2.1",
    "fugashi==1.3.0",
    "g2p_en==2.1.0",
    "jamo==0.4.1",
    "gruut[de,es,fr]==2.2.3",
    "g2pkk>=0.1.1",
    "dtw-python",
    "more-itertools",
    "tiktoken",
    "tensorboard==2.16.2",
    "loguru==0.7.2",
    "nncf"
)
pip_install("--no-deps", "whisper-timestamped>=1.14.2", "openai-whisper")

if platform.system() == "Darwin":
    pip_install("numpy<2.0")

# install unidic
#!python -m unidic download


## Download checkpoints and load PyTorch model
[back to top ⬆️](#Table-of-contents:)

In [4]:
import os
import torch
import openvino as ov
import ipywidgets as widgets
from IPython.display import Audio
from notebook_utils import download_file, device_widget

core = ov.Core()

from openvoice.api import ToneColorConverter, OpenVoiceBaseClass
from openvoice.api import spectrogram_torch
import openvoice.se_extractor as se_extractor
from melo.api import TTS

Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.





In [None]:

base_speakers_suffix =  Path("base_speakers/ses")
converter_suffix = Path("converter")

To keep the notebook lightweight, only the newest English model and the Chinese model are used by default. The `use_only_english_newest_and_chinese` flag below reflects this choice.

In [7]:
use_only_english_newest_and_chinese = True

In [None]:

def download_from_hf_hub(repo_id, filename, local_dir="./"):
    from huggingface_hub import hf_hub_download
    local_path = Path(local_dir)
    local_path.mkdir(exist_ok=True)
    hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_path)

#Download OpenVoice2
download_from_hf_hub("myshell-ai/OpenVoiceV2", f"{converter_suffix.as_posix()}/checkpoint.pth")
download_from_hf_hub("myshell-ai/OpenVoiceV2", f"{converter_suffix.as_posix()}/config.json")

# Download the source tone color embeddings that are loaded later from `base_speakers/ses`
if use_only_english_newest_and_chinese:
    base_speaker_names = ["en-newest", "zh"]
else:
    base_speaker_names = ["en-default", "en-au", "en-br", "en-india", "en-newest", "en-us", "zh", "es", "fr", "jp", "kr"]

for name in base_speaker_names:
    download_from_hf_hub("myshell-ai/OpenVoiceV2", f"{base_speakers_suffix.as_posix()}/{name}.pth")

#Download MeloTTS
download_from_hf_hub("myshell-ai/MeloTTS-Chinese", "checkpoint.pth","MeloTTS-Chinese")
download_from_hf_hub("myshell-ai/MeloTTS-Chinese", "config.json","MeloTTS-Chinese")
download_from_hf_hub("myshell-ai/MeloTTS-English-v3", "checkpoint.pth","MeloTTS-English-v3")
download_from_hf_hub("myshell-ai/MeloTTS-English-v3", "config.json","MeloTTS-English-v3")

   

checkpoint.pth:   0%|          | 0.00/208M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.30k [00:00<?, ?B/s]

In [12]:
pt_device = "cpu"

melo_tts_en_newest = TTS("EN_NEWEST", pt_device, use_hf=False, config_path = Path("MeloTTS-English-v3")/"config.json", ckpt_path = Path("MeloTTS-English-v3")/"checkpoint.pth")
melo_tts_zh = TTS("ZH", pt_device, use_hf=False, config_path = Path("MeloTTS-Chinese")/"config.json", ckpt_path = Path("MeloTTS-Chinese")/"checkpoint.pth")

tone_color_converter = ToneColorConverter(converter_suffix / "config.json", device=pt_device)
tone_color_converter.load_ckpt(converter_suffix / "checkpoint.pth")
print(f"ToneColorConverter version: {tone_color_converter.version}")
#if enable_no_english_lang:
#     zh_base_speaker_tts = BaseSpeakerTTS(zh_suffix / "config.json", device=pt_device)
#     zh_base_speaker_tts.load_ckpt(zh_suffix / "checkpoint.pth")
# else:
#     zh_base_speaker_tts = None

Loaded checkpoint 'converter/checkpoint.pth'
missing/unexpected keys: [] []
ToneColorConverter version: v2


## Convert models to OpenVINO IR
[back to top ⬆️](#Table-of-contents:)

The pipeline contains two kinds of models: a text-to-speech model that generates the base speech (`BaseSpeakerTTS` in OpenVoice; in this notebook, the MeloTTS models) and `ToneColorConverter`, which imposes an arbitrary voice tone on that speech. To convert them to OpenVINO IR, we first need a suitable `torch.nn.Module` object. Instead of `self.forward`, the TTS model and the tone color converter use custom `infer` and `voice_conversion` methods, respectively, as their main entry points, so we wrap each of them in a custom class that inherits from `torch.nn.Module`.
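
The wrapping idea can be sketched without PyTorch. In the minimal example below, `LegacyModel` stands in for a model whose entry point is a custom `infer` method, and `ForwardWrapper` exposes it through `forward` with a fixed positional argument order, which is the shape a tracer needs. All names here are hypothetical and are not part of OpenVoice, MeloTTS, or OpenVINO.

```python
class LegacyModel:
    """Hypothetical stand-in for a model whose entry point is `infer`, not `forward`."""

    def infer(self, x, *, noise_scale=0.5, length_scale=1.0):
        # Toy computation so the result is easy to check by hand.
        return [v * length_scale + noise_scale for v in x]


class ForwardWrapper:
    """Expose `infer` through `forward` with a fixed positional signature."""

    def __init__(self, model):
        self.model = model

    def forward(self, x, noise_scale, length_scale):
        # Map positional inputs onto the keyword arguments that `infer` expects.
        return self.model.infer(x, noise_scale=noise_scale, length_scale=length_scale)

    def get_example_input(self):
        # Example inputs in the same order as the `forward` parameters.
        return ([1.0, 2.0], 0.5, 2.0)


wrapper = ForwardWrapper(LegacyModel())
result = wrapper.forward(*wrapper.get_example_input())
print(result)
```

Keeping the example inputs in the same order as the `forward` parameters matters because the converted model later receives its inputs positionally.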

In [None]:
class OVOpenVoiceBase(torch.nn.Module):
    """
    Base class for both TTS and voice tone conversion model: constructor is same for both of them.
    """

    def __init__(self, voice_model: OpenVoiceBaseClass):
        super().__init__()
        self.voice_model = voice_model
        for par in voice_model.model.parameters():
            par.requires_grad = False


class OVSynthesizerTTSWrapper(torch.nn.Module):
    """
    Wrapper for SynthesizerTrn model to make it compatible with Torch-style inference.
    """

    def __init__(self, model, language):
        super().__init__()
        self.model = model
        self.language = language

    def forward(self, x, x_lengths, sid, tone, language, bert, ja_bert, noise_scale, length_scale, noise_scale_w, sdp_ratio):
        """
        Forward call to the underlying SynthesizerTrn model. Passes the inputs to the
        model's `infer` method, mapping the scalar controls to keyword arguments.
        """
        return self.model.infer(
            x,
            x_lengths,
            sid,
            tone,
            language,
            bert,
            ja_bert,
            sdp_ratio=sdp_ratio,
            noise_scale=noise_scale,
            noise_scale_w=noise_scale_w,
            length_scale=length_scale,
        )

    def get_example_input(self):
        """
        Return a tuple of example inputs for tracing/ONNX exporting or debugging.
        """
        if self.language == "EN_NEWEST":
            x_tst = torch.tensor([[  0,   0,   0,  34,   0,  59,   0,  34,   0, 110,   0, 103,   0,  39,
            0,  14,   0,  43,   0,  49,   0,  59,   0,  85,   0,  23,   0,  45,
            0,  80,   0,  68,   0,  89,   0,  44,   0,  70,   0,  23,   0,  30,
            0,  28,   0,  89,   0,  23,   0,  67,   0,  29,   0,  23,   0,  73,
            0,  89,   0,  89,   0,  43,   0,  89,   0,  23,   0,  70,   0, 209,
            0,   0,   0]],dtype=torch.int64).to(pt_device)

            x_tst_lengths = torch.tensor([73],dtype=torch.int64).to(pt_device)
            speakers = torch.tensor([0],dtype=torch.int64).to(pt_device)
            tones = torch.tensor([[0, 7, 0, 7, 0, 9, 0, 7, 0, 7, 0, 9, 0, 9, 0, 7, 0, 8, 0, 7, 0, 9, 0, 7,
            0, 8, 0, 7, 0, 9, 0, 7, 0, 7, 0, 9, 0, 7, 0, 8, 0, 7, 0, 9, 0, 7, 0, 8,
            0, 7, 0, 9, 0, 8, 0, 7, 0, 7, 0, 7, 0, 9, 0, 7, 0, 8, 0, 7, 0, 7, 0, 7,
            0]],dtype=torch.int64).to(pt_device)
            lang_ids = torch.tensor([[0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
            0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
            0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
            0]],dtype=torch.int64).to(pt_device)
            bert = torch.zeros((1, 1024, 73), dtype=torch.float32).to(pt_device)
            ja_bert = torch.randn(1, 768, 73).float().to(pt_device)
            sdp_ratio = torch.tensor(0.2).to(pt_device)
            noise_scale = torch.tensor(0.6).to(pt_device)
            noise_scale_w = torch.tensor(0.8).to(pt_device)
            length_scale = torch.tensor(1.0).to(pt_device)
        elif self.language == "ZH":
            x_tst = torch.tensor([[  0,   0,   0, 100,   0,  13,   0, 101,   0,  26,   0,  21,   0,  41,
           0,   7,   0,  33,   0,  57,   0,  33,   0,  77,   0,  12,   0,  62,
           0, 101,   0,  67,   0, 106,   0,   0,   0]],dtype=torch.int64).to(pt_device)

            x_tst_lengths = torch.tensor([37],dtype=torch.int64).to(pt_device)
            speakers = torch.tensor([1],dtype=torch.int64).to(pt_device)
            tones = torch.tensor([[0, 0, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 7, 0, 8, 0, 7, 0, 9, 0, 7,
            0, 8, 0, 7, 0, 1, 0, 1, 0, 0, 0, 0, 0]],dtype=torch.int64).to(pt_device)
            lang_ids = torch.tensor([[0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3,
            0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0]],dtype=torch.int64).to(pt_device)
            bert = torch.zeros((1, 1024, 37), dtype=torch.float32).to(pt_device)
            ja_bert = torch.randn(1, 768, 37).float().to(pt_device)
            sdp_ratio = torch.tensor(0.2).to(pt_device)
            noise_scale = torch.tensor(0.6).to(pt_device)
            noise_scale_w = torch.tensor(0.8).to(pt_device)
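            length_scale = torch.tensor(1.0).to(pt_device)
        else:
            # Fail early instead of hitting undefined variables in the return statement
            raise ValueError(f"Unsupported language: {self.language}")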

        return (
            x_tst,
            x_tst_lengths,
            speakers,
            tones,
            lang_ids,
            bert,
            ja_bert,
            noise_scale,
            length_scale,
            noise_scale_w,
            sdp_ratio,
        )


class OVOpenVoiceConverter(OVOpenVoiceBase):
    """
    The constructor of this class accepts a ToneColorConverter object for voice tone conversion and wraps its 'voice_conversion' method with forward.
    """

    def get_example_input(self):
        y = torch.randn([1, 513, 238], dtype=torch.float32)
        y_lengths = torch.LongTensor([y.size(-1)])
        target_se = torch.randn(*(1, 256, 1))
        source_se = torch.randn(*(1, 256, 1))
        tau = torch.tensor(0.3)
        return (y, y_lengths, source_se, target_se, tau)

    def forward(self, y, y_lengths, sid_src, sid_tgt, tau):
        return self.voice_model.model.voice_conversion(y, y_lengths, sid_src, sid_tgt, tau)

Convert the models to OpenVINO IR and save them to the `IRS_PATH` folder for future use. If the IRs already exist, skip the conversion and read them directly.
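
The convert-once, load-afterwards pattern can be illustrated with the standard library alone. The sketch below uses a JSON file as a stand-in for an IR file; `load_or_build` and `expensive_conversion` are hypothetical names and do not correspond to any OpenVINO API.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(path, build_fn):
    """Return the cached artifact at `path`, building and saving it first if it is missing."""
    if path.exists():
        return json.loads(path.read_text()), "loaded"
    artifact = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(artifact))
    return artifact, "built"


def expensive_conversion():
    # Placeholder for a slow model conversion step.
    return {"layers": 3}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "irs" / "model.json"
    first, first_status = load_or_build(cache_path, expensive_conversion)
    second, second_status = load_or_build(cache_path, expensive_conversion)

print(first_status, second_status)
```

The second call finds the file on disk and skips the conversion, which is why rerunning the notebook is much faster than the first run.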

To reduce memory consumption, weight compression can be applied using [NNCF](https://github.com/openvinotoolkit/nncf). Weight compression aims to reduce the memory footprint of a model. Models that require extensive memory to store their weights during inference can benefit from weight compression in the following ways:

* enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;

* improving the inference performance of the models by reducing the latency of the memory access when computing the operations with weights, for example, Linear layers.

[Neural Network Compression Framework (NNCF)](https://github.com/openvinotoolkit/nncf) provides 4-bit / 8-bit mixed weight quantization as a compression method. The main difference between weight compression and full model quantization (post-training quantization) is that activations remain floating-point in the case of weight compression, which leads to better accuracy. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.

The `nncf.compress_weights` function performs weight compression. It accepts an OpenVINO model and optional compression parameters.

More details about weights compression can be found in [OpenVINO documentation](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/weight-compression.html).
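
To make the idea concrete, here is a deliberately simplified sketch of 8-bit weight quantization in plain Python. It uses one symmetric scale for the whole tensor, which is an assumption chosen for brevity; NNCF's real implementation differs in its details and is not reproduced here. The function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Restore approximate floating-point weights at inference time."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.4, -0.33]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest integer bounds the round-trip error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)
print(f"max round-trip error: {max_error:.6f}, half step: {scale / 2:.6f}")
```

Each weight is stored in one byte instead of the four bytes of a float32 value, while the activations computed from the restored weights stay in floating point.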

In [121]:
import nncf


IRS_PATH = Path("openvino_irs/")
EN_TTS_IR = IRS_PATH / "melo_tts_en_newest.xml"
ZH_TTS_IR = IRS_PATH / "melo_tts_zh.xml"
VOICE_CONVERTER_IR = IRS_PATH / "openvoice2_tone_conversion.xml"

paths = [EN_TTS_IR, ZH_TTS_IR, VOICE_CONVERTER_IR]
models = [
    OVSynthesizerTTSWrapper(melo_tts_en_newest.model,'EN_NEWEST'),
    OVSynthesizerTTSWrapper(melo_tts_zh.model, 'ZH'),
    OVOpenVoiceConverter(tone_color_converter),
]

ov_models = []

for model, path in zip(models, paths):
    if not path.exists():
        ov_model = ov.convert_model(model, example_input=model.get_example_input())
        ov_model = nncf.compress_weights(ov_model)
        ov.save_model(ov_model, path)
    else:
        ov_model = core.read_model(path)
    ov_models.append(ov_model)

ov_en_tts, ov_zh_tts, ov_voice_conversion = ov_models

## Inference
[back to top ⬆️](#Table-of-contents:)

### Select inference device
[back to top ⬆️](#Table-of-contents:)

In [28]:
core = ov.Core()

device = device_widget()
device

Dropdown(description='Device:', index=3, options=('CPU', 'GPU', 'NPU', 'AUTO'), value='AUTO')

### Select reference tone
[back to top ⬆️](#Table-of-contents:)

First, select the reference voice whose tone will be applied to the generated speech: you can choose one of the existing samples, record your own by selecting `record_manually`, or upload your own file by selecting `load_manually`.

In [29]:
REFERENCE_VOICES_PATH = f"{repo_dir}/resources/"
reference_speakers = [
    *[path for path in os.listdir(REFERENCE_VOICES_PATH) if os.path.splitext(path)[-1] == ".mp3"],
    "record_manually",
    "load_manually",
]

ref_speaker = widgets.Dropdown(
    options=reference_speakers,
    value=reference_speakers[0],
    description="reference voice from which tone color will be copied",
    disabled=False,
)

ref_speaker

Dropdown(description='reference voice from which tone color will be copied', options=('example_reference.mp3',…

In [30]:
OUTPUT_DIR = Path("outputs/")
OUTPUT_DIR.mkdir(exist_ok=True)

In [31]:
ref_speaker_path = f"{REFERENCE_VOICES_PATH}/{ref_speaker.value}"
allowed_audio_types = ".mp4,.mp3,.wav,.wma,.aac,.m4a,.m4b,.webm"

if ref_speaker.value == "record_manually":
    ref_speaker_path = OUTPUT_DIR / "custom_example_sample.webm"
    from ipywebrtc import AudioRecorder, CameraStream

    camera = CameraStream(constraints={"audio": True, "video": False})
    recorder = AudioRecorder(stream=camera, filename=ref_speaker_path, autosave=True)
    display(recorder)
elif ref_speaker.value == "load_manually":
    upload_ref = widgets.FileUpload(
        accept=allowed_audio_types,
        multiple=False,
        description="Select audio with reference voice",
    )
    display(upload_ref)

Play the reference voice sample before cloning its tone onto another speech sample.

In [32]:
def save_audio(voice_source: widgets.FileUpload, out_path: str):
    with open(out_path, "wb") as output_file:
        assert len(voice_source.value) > 0, "Please select audio file"
        output_file.write(voice_source.value[0]["content"])


if ref_speaker.value == "load_manually":
    ref_speaker_path = f"{OUTPUT_DIR}/{upload_ref.value[0].name}"
    save_audio(upload_ref, ref_speaker_path)

In [33]:
Audio(ref_speaker_path)

In [34]:
torch_hub_local = Path("torch_hub_local/")
%env TORCH_HOME={str(torch_hub_local.absolute())}

env: TORCH_HOME=/home/gta/qiu/openvino_notebooks/notebooks/openvoice/torch_hub_local


Load speaker embeddings

In [35]:
# second step to fix a problem with silero downloading and installing
import os
import zipfile

url = "https://github.com/snakers4/silero-vad/zipball/v3.0"

torch_hub_dir = torch_hub_local / "hub"
torch.hub.set_dir(torch_hub_dir.as_posix())

zip_filename = "v3.0.zip"
output_path = torch_hub_dir / "v3.0"
if not (torch_hub_dir / zip_filename).exists():
    download_file(url, directory=torch_hub_dir, filename=zip_filename)
    zip_ref = zipfile.ZipFile((torch_hub_dir / zip_filename).as_posix(), "r")
    zip_ref.extractall(path=output_path.as_posix())
    zip_ref.close()

v3_dirs = [d for d in output_path.iterdir() if "snakers4-silero-vad" in d.as_posix()]
silero_hub_dir = torch_hub_dir / "snakers4_silero-vad_v3.0"
if len(v3_dirs) > 0 and not silero_hub_dir.exists():
    # Move the extracted sources into the torch hub directory
    os.rename(str(v3_dirs[0]), silero_hub_dir.as_posix())

In [40]:
en_source_newest_se = torch.load(f"{base_speakers_suffix}/en-newest.pth")
zh_source_se = torch.load(f"{base_speakers_suffix}/zh.pth")

target_se, audio_name = se_extractor.get_se(ref_speaker_path, tone_color_converter, target_dir=OUTPUT_DIR, vad=True)

OpenVoice version: v2


[(0.0, 2.066), (2.572, 12.722), (15.436, 17.138), (17.548, 21.65), (21.772, 26.354), (30.028, 40.37), (41.068, 48.53), (48.844, 54.002), (55.756, 57.362), (59.692, 61.202), (65.932, 67.538), (74.284, 78.386), (80.716, 86.45), (86.764, 91.090875)]
after vad: dur = 64.44804988662132


Replace the original `infer` and `voice_conversion` methods of the PyTorch models with OpenVINO inference.

Some pre- and post-processing steps are not traceable and cannot be offloaded to OpenVINO. Instead of reimplementing them, we rely on the existing ones and replace only the `infer` and `voice_conversion` functions of the underlying models, so that the most computationally expensive part runs in OpenVINO.
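
The patching technique itself is independent of OpenVINO. The sketch below uses hypothetical stand-in classes (`Backend`, `Pipeline`) to show how assigning a closure to an instance attribute swaps the expensive call while the surrounding pre- and post-processing keeps running unchanged.

```python
class Backend:
    """Hypothetical stand-in for the heavy model that does the expensive computation."""

    def infer(self, values):
        return [v * 2 for v in values]


class Pipeline:
    """Hypothetical stand-in for a class that wraps `model.infer` in pre/post-processing."""

    def __init__(self):
        self.model = Backend()

    def run(self, text):
        values = [ord(ch) for ch in text]  # pre-processing stays untouched
        outputs = self.model.infer(values)  # the only call we want to replace
        return sum(outputs)  # post-processing stays untouched


def make_patched_infer(calls):
    """Build a replacement for `infer`; `calls` records each invocation."""

    def infer_impl(values):
        calls.append(len(values))
        return [v + v for v in values]  # same result, different implementation

    return infer_impl


pipeline = Pipeline()
before = pipeline.run("ab")

calls = []
# Assigning on the instance shadows the class method for this object only.
pipeline.model.infer = make_patched_infer(calls)
after = pipeline.run("ab")

print(before, after, calls)
```

Because the replacement is a plain function stored on the instance, it is called without `self`, which is why the patched functions in this notebook take only the model inputs.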

In [50]:
def get_patched_infer(ov_model: ov.Model, device: str) -> callable:
    compiled_model = core.compile_model(ov_model, device)

    def infer_impl(x, x_lengths, sid, tone, language, bert, ja_bert, noise_scale, length_scale, noise_scale_w, max_len=None, sdp_ratio=1.0, y=None, g=None):
        # The input order must match the `forward` signature of OVSynthesizerTTSWrapper
        ov_output = compiled_model((x, x_lengths, sid, tone, language, bert, ja_bert, noise_scale, length_scale, noise_scale_w, sdp_ratio))
        return (torch.tensor(ov_output[0]),)

    return infer_impl


def get_patched_voice_conversion(ov_model: ov.Model, device: str) -> callable:
    compiled_model = core.compile_model(ov_model, device)

    def voice_conversion_impl(y, y_lengths, sid_src, sid_tgt, tau):
        ov_output = compiled_model((y, y_lengths, sid_src, sid_tgt, tau))
        return (torch.tensor(ov_output[0]),)

    return voice_conversion_impl


melo_tts_en_newest.model.infer = get_patched_infer(ov_en_tts, device.value)
melo_tts_zh.model.infer = get_patched_infer(ov_zh_tts, device.value)
tone_color_converter.model.voice_conversion = get_patched_voice_conversion(ov_voice_conversion, device.value)

### Run inference
[back to top ⬆️](#Table-of-contents:)

In [38]:
voice_source = widgets.Dropdown(
    options=["use TTS", "choose_manually"],
    value="use TTS",
    description="Voice source",
    disabled=False,
)

voice_source

Dropdown(description='Voice source', options=('use TTS', 'choose_manually'), value='use TTS')

In [39]:
if voice_source.value == "choose_manually":
    upload_orig_voice = widgets.FileUpload(
        accept=allowed_audio_types,
        multiple=False,
        description="audio whose tone will be replaced",
    )
    display(upload_orig_voice)

In [123]:
if voice_source.value == "choose_manually":
    orig_voice_path = f"{OUTPUT_DIR}/{upload_orig_voice.value[0].name}"
    save_audio(upload_orig_voice, orig_voice_path)
    source_se, _ = se_extractor.get_se(orig_voice_path, tone_color_converter, target_dir=OUTPUT_DIR, vad=True)
else:
    en_text = """
    OpenVINO toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve 
    a variety of tasks including emulation of human vision, automatic speech recognition, natural language processing, 
    recommendation systems, and many others.
    """
    #source_se = en_source_newest_se
    orig_voice_path = OUTPUT_DIR / "output_ov_en-newest.wav"
    print("use output_ov_en-newest.wav")
    speaker_id = 0
    melo_tts_en_newest.tts_to_file(en_text, speaker_id, orig_voice_path, speed = 1.0)
    zh_text = """
    OpenVINO 是一个全面的开发工具集，旨在快速开发和部署各类应用程序及解决方案，可用于模仿人类视觉、自动语音识别、自然语言处理、
    推荐系统等多种任务。
    """
    #source_se = zh_source_se
    orig_voice_path = OUTPUT_DIR / "output_ov_zh.wav"
    speaker_id = 1 #Choose the first speaker
    melo_tts_zh.tts_to_file(zh_text, speaker_id, orig_voice_path, speed = 1.0)


use output_ov_en-newest.wav
 > Text split to sentences.
OpenVINO toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve a variety of tasks including emulation of human vision, automatic speech recognition, natural language processing, recommendation systems, and many others.


100%|██████████| 1/1 [00:04<00:00,  4.99s/it]


 > Text split to sentences.
OpenVINO 是一个全面的开发工具集,
旨在快速开发和部署各类应用程序及解决方案,
可用于模仿人类视觉、自动语音识别、自然语言处理、 推荐系统等多种任务.


100%|██████████| 3/3 [00:04<00:00,  1.41s/it]


Finally, run voice tone conversion with the OpenVINO-optimized model.

In [53]:
tau_slider = widgets.FloatSlider(
    value=0.3,
    min=0.01,
    max=2.0,
    step=0.01,
    description="tau",
    disabled=False,
    readout_format=".2f",
)
tau_slider

FloatSlider(value=0.3, description='tau', max=2.0, min=0.01, step=0.01)

In [125]:
en_resulting_voice_path = OUTPUT_DIR / "output_ov_en-newest_cloned.wav"
en_orig_voice_path = OUTPUT_DIR / "output_ov_en-newest.wav"
zh_resulting_voice_path = OUTPUT_DIR / "output_ov_zh_cloned.wav"
zh_orig_voice_path = OUTPUT_DIR / "output_ov_zh.wav"

tone_color_converter.convert(
    audio_src_path=en_orig_voice_path,
    src_se=en_source_newest_se,
    tgt_se=target_se,
    output_path=en_resulting_voice_path,
    tau=tau_slider.value,
    message="@MyShell",
)
tone_color_converter.convert(
    audio_src_path=zh_orig_voice_path,
    src_se=zh_source_se,
    tgt_se=target_se,
    output_path=zh_resulting_voice_path,
    tau=tau_slider.value,
    message="@MyShell",
)

In [126]:
Audio(zh_orig_voice_path)

In [127]:
Audio(zh_resulting_voice_path)

## Run OpenVoice Gradio interactive demo
[back to top ⬆️](#Table-of-contents:)

We can also use a [Gradio](https://www.gradio.app/) app to run TTS and voice tone conversion interactively.

In [None]:
import gradio as gr
import langid

supported_languages = ["zh", "en"]


def predict_impl(
    prompt,
    audio_file_pth,
    agree,
    output_dir,
    tone_color_converter,
    en_tts_model,
    zh_tts_model,
    en_source_se,
    zh_source_se,
):
    text_hint = ""
    if not agree:
        text_hint += "[ERROR] Please accept the Terms & Condition!\n"
        gr.Warning("Please accept the Terms & Condition!")
        return (
            text_hint,
            None,
            None,
        )

    language_predicted = langid.classify(prompt)[0].strip()
    print(f"Detected language:{language_predicted}")

    if language_predicted not in supported_languages:
        text_hint += f"[ERROR] The detected language {language_predicted} for your input text is not in our Supported Languages: {supported_languages}\n"
        gr.Warning(f"The detected language {language_predicted} for your input text is not in our Supported Languages: {supported_languages}")
        return (text_hint, None, None)

    if language_predicted == "zh":
        tts_model = zh_tts_model
        if zh_tts_model is None:
            gr.Warning("TTS model for Chinese language was not loaded")
            return (text_hint, None, None)
        source_se = zh_source_se
        speaker_id = 1

    elif language_predicted == "en":
        tts_model = en_tts_model
        if en_tts_model is None:
            gr.Warning("TTS model for English language was not loaded")
            return (text_hint, None, None)
        source_se = en_source_se
        speaker_id = 0

    speaker_wav = audio_file_pth

    if len(prompt) < 2:
        text_hint += "[ERROR] Please give a longer prompt text \n"
        gr.Warning("Please give a longer prompt text")
        return (text_hint, None, None)
    if len(prompt) > 200:
        text_hint += (
            "[ERROR] Text length limited to 200 characters for this demo, please try shorter text. You can clone our open-source repo and try for your usage \n"
        )
        gr.Warning("Text length limited to 200 characters for this demo, please try shorter text. You can clone our open-source repo for your usage")
        return (text_hint, None, None)

    # Extract the target tone color embedding from the reference audio
    try:
        target_se, audio_name = se_extractor.get_se(speaker_wav, tone_color_converter, target_dir=output_dir, vad=True)
    except Exception as e:
        text_hint += f"[ERROR] Get target tone color error {str(e)} \n"
        gr.Warning(f"[ERROR] Get target tone color error {str(e)} \n")
        return (text_hint, None, None)

    # Synthesize the base speech that will be converted to the target tone
    src_path = f"{output_dir}/tmp.wav"
    print(f"Synthesizing base speech: language={language_predicted}, speaker_id={speaker_id}, path={src_path}")
    tts_model.tts_to_file(prompt, speaker_id, src_path, speed=1.0)

    save_path = f"{output_dir}/output.wav"
    encode_message = "@MyShell"
    tone_color_converter.convert(
        audio_src_path=src_path,
        src_se=source_se,
        tgt_se=target_se,
        output_path=save_path,
        message=encode_message,
    )

    text_hint += "Get response successfully \n"

    return (
        text_hint,
        src_path,
        save_path,
    )

In [119]:
from functools import partial


predict = partial(
    predict_impl,
    output_dir=OUTPUT_DIR,
    tone_color_converter=tone_color_converter,
    en_tts_model=melo_tts_en_newest,
    zh_tts_model=melo_tts_zh,
    en_source_se=en_source_newest_se,
    zh_source_se=zh_source_se,
)

In [120]:
import sys

# Drop a cached copy of the module so that edits to gradio_helper.py are picked up
if "gradio_helper" in sys.modules:
    print("clean")
    del sys.modules["gradio_helper"]
# Assumes gradio_helper.py is located in the current working directory
sys.path.insert(0, str(Path.cwd()))
from gradio_helper import make_demo

demo = make_demo(fn=predict)

try:
    demo.queue(max_size=2).launch(debug=True, height=1000)
except Exception:
    demo.queue(max_size=2).launch(share=True, debug=True, height=1000)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/

In [82]:
# run this cell to stop the Gradio interface
demo.close()

Closing server running on port: 7861


## Cleanup
[back to top ⬆️](#Table-of-contents:)

In [26]:
# import shutil
# for path in (converter_suffix, base_speakers_suffix.parent, Path("MeloTTS-Chinese"), Path("MeloTTS-English-v3"), IRS_PATH, OUTPUT_DIR):
#     shutil.rmtree(path, ignore_errors=True)