# VALL-E X

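VALL-E X is a speech synthesis model developed by Microsoft. The paper comes from Microsoft, but the model introduced here is not one released by Microsoft (Microsoft's official model does not appear to be publicly available). Its main features are speech synthesis in English, Japanese, and Chinese, and the ability to generate synthetic speech that mimics a voice from an audio file only a few seconds long.
- GitHub: https://github.com/Plachtaa/VALL-E-X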

<a href="https://colab.research.google.com/github/fuyu-quant/data-science-wiki/blob/main/multimodal/text_to_speech/vall_e_x.ipynb" target="_blank" rel="noopener noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
%%capture
!git clone https://github.com/Plachtaa/VALL-E-X.git


In [None]:
%cd VALL-E-X


In [5]:
%%capture
!pip install -r requirements.txt


In [6]:
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio


### Download the models

In [None]:
preload_models()


### Speech generation

In [None]:
text_prompt = """
    Quantum computers are advanced computational devices that use principles of quantum mechanics to process information. Unlike classical computers which use bits as 0s or 1s, quantum computers use qubits, which can be in a superposition of both states. This allows them to solve certain problems much faster than classical computers, especially in areas like cryptography and optimization.
"""
audio_array = generate_audio(text_prompt)

write_wav("/content/english.wav", SAMPLE_RATE, audio_array)
Audio(audio_array, rate=SAMPLE_RATE)
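
SciPy's `write_wav` stores the array with whatever dtype it receives, so a floating-point `audio_array` is saved as a float WAV, which not every tool reads (Python's standard `wave` module, for example, only handles PCM data). The following is a minimal sketch of saving 16-bit PCM instead, using only the standard library. It assumes the samples are mono floats in the range [-1.0, 1.0], which is an assumption about the output of `generate_audio` rather than something stated in the repository; `float_to_pcm16` and `write_pcm16_wav` are hypothetical helper names.

```python
import struct
import wave


def float_to_pcm16(samples):
    """Clip float samples to [-1.0, 1.0] and scale them to signed 16-bit integers."""
    pcm = []
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # assumption: valid range is [-1.0, 1.0]
        pcm.append(int(round(s * 32767)))
    return pcm


def write_pcm16_wav(path, sample_rate, samples):
    """Write mono float samples to `path` as a 16-bit PCM WAV file."""
    pcm = float_to_pcm16(samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        # Pack all samples as little-endian signed 16-bit integers.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

In the notebook this could be called as `write_pcm16_wav("/content/english_pcm16.wav", SAMPLE_RATE, audio_array)`.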


### Japanese

In [None]:
text_prompt = """
    ひき肉でーーーす．
"""
audio_array = generate_audio(text_prompt, prompt="cafe")

write_wav("/content/japanese.wav", SAMPLE_RATE, audio_array)
Audio(audio_array, rate=SAMPLE_RATE)


### Multiple languages

In [None]:
text_prompt = """
    [EN]Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.[EN]
    [ZH]神经网络（Neural Network）是模拟人脑工作机制的算法，用于识别模式和处理数据。它包括多个层，每个层都有许多神经元。通过训练数据，神经网络可以不断调整其内部权重，以优化其预测和分类能力。[ZH]
"""
audio_array = generate_audio(text_prompt, language='mix')

write_wav("/content/multilingual.wav", SAMPLE_RATE, audio_array)
Audio(audio_array, rate=SAMPLE_RATE)
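
The prompt above wraps each segment in a matching pair of language tags (`[EN]…[EN]`, `[ZH]…[ZH]`). The following is a minimal sketch of building such a prompt from (language, text) pairs. `build_mixed_prompt` is a hypothetical helper, not part of the repository; only the `EN` and `ZH` tags appear in this notebook, and `JA` is included on the assumption that Japanese follows the same pattern.

```python
# Tags seen in this notebook: EN, ZH. JA is an assumption (same pattern).
SUPPORTED_TAGS = ("EN", "ZH", "JA")


def build_mixed_prompt(segments):
    """Wrap each (language, text) pair in matching tags and join with newlines."""
    lines = []
    for lang, text in segments:
        tag = lang.upper()
        if tag not in SUPPORTED_TAGS:
            raise ValueError(f"unsupported language tag: {lang!r}")
        lines.append(f"[{tag}]{text.strip()}[{tag}]")
    return "\n".join(lines)


prompt = build_mixed_prompt([
    ("en", "Machine learning is a branch of artificial intelligence."),
    ("zh", "神经网络是模拟人脑工作机制的算法。"),
])
print(prompt)
```

The resulting string can then be passed to `generate_audio(prompt, language='mix')` in the same way as the hand-written prompt.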


### Try different voices
* https://github.com/Plachtaa/VALL-E-X/tree/master/presets

In [None]:
text_prompt = """
I'll get serious starting tomorrow
"""

audio_array = generate_audio(text_prompt, prompt="cafe")

write_wav("/content/sample.wav", SAMPLE_RATE, audio_array)
Audio(audio_array, rate=SAMPLE_RATE)
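
The value passed as `prompt=` selects one of the voices in the `presets` directory linked above. The following is a minimal sketch of listing the available names from the cloned repository. It assumes each preset is stored as a `<name>.npz` file in that directory, which is an assumption about the repository layout; `list_presets` is a hypothetical helper.

```python
from pathlib import Path


def list_presets(presets_dir="presets"):
    """Return the sorted preset names (file stems) found in `presets_dir`."""
    directory = Path(presets_dir)
    if not directory.is_dir():
        return []
    # Assumption: one `<name>.npz` file per preset voice.
    return sorted(p.stem for p in directory.glob("*.npz"))


print(list_presets())
```

Each returned name can be tried in turn as `generate_audio(text_prompt, prompt=name)`.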


### Speech synthesis from a custom voice prompt

In [None]:
from utils.prompt_making import make_prompt

# Use the audio generated in the previous cell
make_prompt(name="sample", audio_prompt_path="/content/sample.wav",
                transcript="I'll get serious starting tomorrow.")

# When Whisper is available, the audio is transcribed automatically, so the transcript does not have to be passed explicitly
#make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav")


text_prompt = """
Procrastination is the thief of time
"""
audio_array = generate_audio(text_prompt, prompt="sample")

write_wav("/content/sample2.wav", SAMPLE_RATE, audio_array)
Audio(audio_array, rate=SAMPLE_RATE)
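
Because the voice is imitated from a recording only a few seconds long, it can help to check the length of a clip before passing it to `make_prompt`. The following is a minimal sketch that derives the duration from the sample count. `prompt_duration_ok` is a hypothetical helper, and the 3 to 10 second bounds are illustrative defaults, not values taken from this notebook.

```python
def prompt_duration_ok(num_samples, sample_rate, min_seconds=3.0, max_seconds=10.0):
    """Return (is_within_bounds, duration_in_seconds) for a mono clip."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration = num_samples / sample_rate
    # The default bounds are assumptions for illustration only.
    return min_seconds <= duration <= max_seconds, duration
```

For the clip generated earlier, this could be called as `ok, seconds = prompt_duration_ok(len(audio_array), SAMPLE_RATE)`.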
