# Beatrice 2.0.0-alpha (Python API)

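## About Beatrice 2.0

Beatrice is a **completely free** AI voice changer VST characterized by **ultra-low latency, low CPU load, and a small footprint**. Its first major release, version 1.0.0, came out on September 10, 2023 as a pioneer among low-latency, low-load AI voice changers.

Beatrice 2 rethinks the entire architecture of Beatrice, aiming for **substantially better performance** and **improved usability**.

### Goals of Beatrice 2

* **Let users sing comfortably while listening to their own converted voice**
* **Accurately reflect the intonation of the input voice in the converted audio, enabling more delicate expression**
* **Higher naturalness and clarity of the converted audio**
* **A wider variety of target speakers**
* Latency of about 50 ms
  * Comparable to previous versions of Beatrice
  * This is the value measured with external recording equipment. Excluding device-induced latency, the figure depends on how it is calculated, but about 35 ms is a reasonable guide.
* A load low enough to achieve RTF < 0.25 when running single-threaded on the developer's laptop (Intel Core i7-1165G7)
  * Comparable to or lower than previous versions of Beatrice
* A size of 30 MB or less
  * Comparable to or smaller than previous versions of Beatrice
* (Comfortable development)
  * Internal matters such as fewer data dependencies, simpler source code, and reproducible experiments
* Others (secret)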
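
To make the load target concrete: the real-time factor (RTF) is the processing time divided by the duration of the audio processed, so RTF < 0.25 leaves at most 2.5 ms of compute per 10 ms block. The following is a minimal sketch of that arithmetic. The function names are hypothetical and not part of the Beatrice API, and the 16 kHz sample rate is only an example value.

```python
def real_time_factor(elapsed_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Processing time divided by the duration of the processed audio."""
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def per_block_budget_ms(block_ms: float, target_rtf: float) -> float:
    """Maximum processing time per block that still meets the target RTF."""
    return block_ms * target_rtf


# Example: 3 s of 16 kHz audio converted in 0.6 s of compute time.
rtf = real_time_factor(0.6, 48_000, 16_000)
budget = per_block_budget_ms(10.0, 0.25)
print(f"RTF: {rtf:.3f}, budget per 10 ms block: {budget:.2f} ms")
```

An RTF below 1.0 is the minimum for real-time use; the lower target leaves headroom for the host application and other plugins.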

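## About this demo

The API for real-time conversion has been implemented, but building a client will still take time, so the API and a usage example are being published to show progress.

The model is still under development, and its quality, pitch tracking, noise robustness, and so on are expected to improve considerably.
<small>Computing resources are not keeping up with the ideas for improvement.</small>

Note that bugs have not been sufficiently removed.
For example, loading an unsuitable parameter file crashes the whole Python process without raising an exception.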
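
Because such a crash cannot be caught with `try`/`except`, one possible workaround is to attempt the load in a child process first and inspect its exit code. The following is only a sketch under that assumption: `runs_without_crashing` is a hypothetical helper, not part of the Beatrice API, and a clean exit only shows that loading did not crash, not that the file is the right one for the model.

```python
import subprocess
import sys


def runs_without_crashing(snippet: str, timeout: float = 60.0) -> bool:
    """Run a Python snippet in a child process and report whether it exited cleanly."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Hypothetical usage with a loader from this notebook:
# ok = runs_without_crashing(
#     "from beatrice import PhoneExtractor; "
#     "PhoneExtractor().read_parameters('phone_extractor_003b_checkpoint_03000000.bin')"
# )
clean_exit = runs_without_crashing("pass")
failed_exit = runs_without_crashing("import sys; sys.exit(3)")
print(clean_exit, failed_exit)
```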

In addition, Project Beatrice and its affiliates accept no responsibility for any disadvantage to users arising from the use of this demo.

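## License

### Weights and output audio

Use the weights (.bin) and the converted output audio in accordance with the licenses of the JVS corpus and JVS-MuSiC.
You do not need to state that you are using Beatrice 2.0.0-alpha.

### Library and this notebook

The alpha version of the library (.so file) and the Python code in this usage example may also be reused freely, but please state that you are using Beatrice 2.0.0-alpha (Python API).
<small>(That said, since there is no telling where the bugs are, it will probably be hard to use it seriously.)</small>
Please also check the licenses of the libraries used, listed below.

This Python API is built as a wrapper around a C API (Windows/Linux), and for a client that squeezes latency to the minimum, an implementation in a language without a garbage collector, such as C or C++, is likely to have the advantage.
If you would like to use the C API, please contact Project Beatrice.
<small>Or rather, if you can write C/C++, help with building the official client would be very welcome.</small>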

## Resources

* JVS corpus: https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus
* JVS-MuSiC: https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music
* PocketFFT: https://gitlab.mpcdf.mpg.de/mtr/pocketfft/-/tree/cpp
  * Copyright 2010-2018 Max-Planck-Society
  * BSD-3-Clause license
  * https://opensource.org/license/bsd-3-clause/
* fmath: https://github.com/herumi/fmath
  * Copyright 2009 MITSUNARI Shigeo
  * BSD-3-Clause license
  * https://opensource.org/license/bsd-3-clause/
* NumPy: https://github.com/numpy/numpy
  * Copyright 2005-2023 NumPy Developers
  * BSD-3-Clause license
  * https://opensource.org/license/bsd-3-clause/
* Python: https://github.com/python/cpython
  * Copyright 2001-2023 Python Software Foundation
  * Copyright 2000 BeOpen.com
  * Copyright 1995-2000 Corporation for National Research Initiatives
  * Copyright 1991-1995 Stichting Mathematisch Centrum
  * PSF License
  * https://docs.python.org/3.10/license.html#psf-license

In [None]:
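# @title Setup

# @markdown Run the cells in order from the top; the conversion is performed in the last cell.
# @markdown If you change a parameter, re-run the changed cell and every cell after it, in order.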

!git clone -b alpha https://huggingface.co/fierce-cats/beatrice-2.0.0-alpha
!cp beatrice-2.0.0-alpha/alpha/* .
!pip install pyworld

import math
import random
from base64 import b64decode
from time import perf_counter

import librosa
import matplotlib.pyplot as plt
import numpy as np
import pyworld
from google.colab import output
from IPython.display import Audio, Javascript, display
from tqdm.auto import tqdm

from beatrice import (
    IN_HOP_LENGTH,
    OUT_HOP_LENGTH,
    IN_SAMPLE_RATE,
    OUT_SAMPLE_RATE,
    PITCH_BINS,
    PITCH_BINS_PER_OCTAVE,
    PhoneExtractor,
    PitchEstimator,
    WaveformGenerator,
    read_speaker_embeddings,
)


print(f"{IN_SAMPLE_RATE=}")
print(f"{OUT_SAMPLE_RATE=}")

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time));
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader();
  reader.onloadend = e => resolve(e.srcElement.result);
  reader.readAsDataURL(blob);
});

var record = time => new Promise(async resolve => {
  const div = document.createElement('div');
  const button = document.createElement('button');

  button.textContent = "Start Recording";
  button.onclick = async function(){
    button.disabled = true;
    button.textContent = "Recording...";
    stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    recorder = new MediaRecorder(stream);
    chunks = [];
    recorder.ondataavailable = e => chunks.push(e.data);
    recorder.start();
    await sleep(time);
    recorder.onstop = async ()=>{
      blob = new Blob(chunks);
      text = await b2text(blob);
      resolve(text);
      button.textContent = "Recording stopped.";
    }
    recorder.stop();
  }
  div.appendChild(button);

  document.querySelector("#output-area").appendChild(div);
});
"""


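def record(sec=3):
    """Record `sec` seconds from the browser microphone and save them to recording.wav."""
    # adapted from https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be
    display(Javascript(RECORD))
    s = output.eval_js(f"record({sec * 1000})")
    # The JavaScript side returns a data URL; the part after the comma is the base64 payload.
    b = b64decode(s.split(",")[1])
    with open("recording.wav", "wb") as f:
        f.write(b)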


def slerp(t, v0, v1):
    """Spherically interpolate from v0 (t=0) to v1 (t=1); nearly parallel inputs fall back to lerp."""
    # adapted from https://gist.github.com/dvschultz/3af50c40df002da3b751efab1daddf2c
    v0_copy = v0
    v1_copy = v1
    v0 = v0 / np.linalg.norm(v0)
    v1 = v1 / np.linalg.norm(v1)
    dot = np.sum(v0 * v1)
    if np.abs(dot) > 0.9995:
        return v0_copy * (1.0 - t) + v1_copy * t
    theta_0 = np.arccos(dot)
    sin_theta_0 = np.sin(theta_0)
    theta_t = theta_0 * t
    sin_theta_t = np.sin(theta_t)
    s0 = np.sin(theta_0 - theta_t) / sin_theta_0
    s1 = sin_theta_t / sin_theta_0
    v2 = s0 * v0_copy + s1 * v1_copy
    return v2


target_speaker_names = [f"jvs{i:03d}" for i in range(1, 101)]
target_speaker_mean_pitches = [146, 216, 154, 248, 140, 100, 208, 228, 122, 278, 156, 119, 148, 274, 256, 199, 213, 217, 233, 137, 109, 120, 125, 248, 224, 257, 238, 117, 220, 250, 123, 171, 144, 119, 212, 250, 109, 227, 244, 230, 144, 131, 212, 131, 137, 136, 138, 110, 133, 132, 241, 151, 227, 138, 230, 217, 212, 220, 226, 229, 247, 215, 232, 209, 243, 247, 246, 126, 225, 133, 112, 231, 142, 117, 151, 144, 154, 110, 134, 154, 132, 217, 246, 218, 239, 132, 151, 130, 122, 264, 215, 203, 267, 218, 225, 242, 163, 178, 158, 125]

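# Lookup table for the "autotune" bonus: maps each pitch bin to the nearest bin on the C major scale.
# The step pattern assumes 8 bins per semitone, with bin 0 at 55 Hz (A1) as in the monotone calculation.
nearest_c_major_scale = np.zeros(PITCH_BINS, dtype=np.int32)
i = 0
while i < PITCH_BINS:
    for s in [2, 1, 2, 2, 1, 2, 2]:
        i += s * 8
        if i >= PITCH_BINS:
            break
        nearest_c_major_scale[i] = i
for _ in range(20):
    # Read from a snapshot so each pass grows every note by one bin in both directions;
    # updating in place would let the lower note sweep across the whole gap in a single pass.
    previous = nearest_c_major_scale.copy()
    for i in range(1, PITCH_BINS - 1):
        nearest_c_major_scale[i] = previous[i] or previous[i - 1] or previous[i + 1]
nearest_c_major_scale[-1] = nearest_c_major_scale[-2]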

phone_extractor_parameter_file = "phone_extractor_003b_checkpoint_03000000.bin"
pitch_estimator_parameter_file = "pitch_estimator_008_1_checkpoint_00300000.bin"
speaker_embeddings_file = "speaker_embeddings_017_checkpoint_01300000.bin"
waveform_generator_parameter_file = "waveform_generator_017_checkpoint_01300000.bin"

phone_extractor = PhoneExtractor()
phone_extractor.read_parameters(phone_extractor_parameter_file)
pitch_estimator = PitchEstimator()
pitch_estimator.read_parameters(pitch_estimator_parameter_file)
speaker_embeddings = read_speaker_embeddings(speaker_embeddings_file)
waveform_generator = WaveformGenerator()
waveform_generator.read_parameters(waveform_generator_parameter_file)
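
The `slerp` helper defined in this cell is what blends two speaker embeddings. As a rough illustration of why spherical interpolation is used rather than plain averaging, the standalone sketch below re-implements the same formula and checks it on random unit vectors. The vectors are stand-ins, not real speaker embeddings, and the dimension 8 is arbitrary.

```python
import numpy as np


def slerp(t, v0, v1):
    """Spherical linear interpolation from v0 (t=0) to v1 (t=1)."""
    u0 = v0 / np.linalg.norm(v0)
    u1 = v1 / np.linalg.norm(v1)
    dot = np.sum(u0 * u1)
    if np.abs(dot) > 0.9995:
        # Nearly parallel vectors: fall back to linear interpolation.
        return v0 * (1.0 - t) + v1 * t
    theta_0 = np.arccos(dot)
    theta_t = theta_0 * t
    s0 = np.sin(theta_0 - theta_t) / np.sin(theta_0)
    s1 = np.sin(theta_t) / np.sin(theta_0)
    return s0 * v0 + s1 * v1


rng = np.random.default_rng(0)
a = rng.standard_normal(8)
b = rng.standard_normal(8)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

mid = slerp(0.5, a, b)
print(np.allclose(slerp(0.0, a, b), a), np.allclose(slerp(1.0, a, b), b))
print(np.linalg.norm(mid))            # stays on the unit sphere for unit inputs
print(np.linalg.norm(0.5 * (a + b)))  # plain averaging shrinks the norm
```

Note the argument order: the interpolation weight `t` comes first, followed by the two vectors.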

In [None]:
# @title Input audio

# @markdown Select `record_your_voice` and run the cell to display a record button.

in_filename = "jvs003_16k.wav"  # @param ["jvs001_16k.wav", "jvs002_16k.wav", "jvs003_16k.wav", "jvs004_16k.wav", "record_your_voice"] {allow-input: true}
if in_filename == "record_your_voice":
    record(6)
    in_filename = "recording.wav"

source_wav, _ = librosa.load(in_filename, sr=IN_SAMPLE_RATE)
assert source_wav.ndim == 1
source_wav = source_wav.astype(np.float32)
display(Audio(source_wav, rate=IN_SAMPLE_RATE))

In [None]:
# @title Conversion settings

target_speaker = "random"  # @param ["random", "jvs001", "jvs002", "jvs003", "jvs004", "jvs005", "jvs006", "jvs007", "jvs008", "jvs009", "jvs010", "jvs011", "jvs012", "jvs013", "jvs014", "jvs015", "jvs016", "jvs017", "jvs018", "jvs019", "jvs020", "jvs021", "jvs022", "jvs023", "jvs024", "jvs025", "jvs026", "jvs027", "jvs028", "jvs029", "jvs030", "jvs031", "jvs032", "jvs033", "jvs034", "jvs035", "jvs036", "jvs037", "jvs038", "jvs039", "jvs040", "jvs041", "jvs042", "jvs043", "jvs044", "jvs045", "jvs046", "jvs047", "jvs048", "jvs049", "jvs050", "jvs051", "jvs052", "jvs053", "jvs054", "jvs055", "jvs056", "jvs057", "jvs058", "jvs059", "jvs060", "jvs061", "jvs062", "jvs063", "jvs064", "jvs065", "jvs066", "jvs067", "jvs068", "jvs069", "jvs070", "jvs071", "jvs072", "jvs073", "jvs074", "jvs075", "jvs076", "jvs077", "jvs078", "jvs079", "jvs080", "jvs081", "jvs082", "jvs083", "jvs084", "jvs085", "jvs086", "jvs087", "jvs088", "jvs089", "jvs090", "jvs091", "jvs092", "jvs093", "jvs094", "jvs095", "jvs096", "jvs097", "jvs098", "jvs099", "jvs100"]
pitch_shift_semitones_ = 'auto'  # @param ["'auto'", "-12", "-10", "-8", "-6", "-4", "-2", "0", "2", "4", "6", "8", "10", "12"] {type:"raw", allow-input: true}
second_target_speaker = "none"  # @param ["none", "random", "jvs001", "jvs002", "jvs003", "jvs004", "jvs005", "jvs006", "jvs007", "jvs008", "jvs009", "jvs010", "jvs011", "jvs012", "jvs013", "jvs014", "jvs015", "jvs016", "jvs017", "jvs018", "jvs019", "jvs020", "jvs021", "jvs022", "jvs023", "jvs024", "jvs025", "jvs026", "jvs027", "jvs028", "jvs029", "jvs030", "jvs031", "jvs032", "jvs033", "jvs034", "jvs035", "jvs036", "jvs037", "jvs038", "jvs039", "jvs040", "jvs041", "jvs042", "jvs043", "jvs044", "jvs045", "jvs046", "jvs047", "jvs048", "jvs049", "jvs050", "jvs051", "jvs052", "jvs053", "jvs054", "jvs055", "jvs056", "jvs057", "jvs058", "jvs059", "jvs060", "jvs061", "jvs062", "jvs063", "jvs064", "jvs065", "jvs066", "jvs067", "jvs068", "jvs069", "jvs070", "jvs071", "jvs072", "jvs073", "jvs074", "jvs075", "jvs076", "jvs077", "jvs078", "jvs079", "jvs080", "jvs081", "jvs082", "jvs083", "jvs084", "jvs085", "jvs086", "jvs087", "jvs088", "jvs089", "jvs090", "jvs091", "jvs092", "jvs093", "jvs094", "jvs095", "jvs096", "jvs097", "jvs098", "jvs099", "jvs100"]
second_target_mode = "morph"  # @param ["merge", "morph"]
bonus = "none"  # @param ["none", "vibrato", "autotune", "monotone"]
block_size = 1  # Currently only 1 (10 ms), which gives the lowest latency, is supported

# The code below preprocesses the settings so they can be passed to the API

if target_speaker == "random":
    target_speaker_id = random.randint(0, 99)
else:
    target_speaker_id = target_speaker_names.index(target_speaker)
print(f"{target_speaker_id=}")
print(f"target speaker name: {target_speaker_names[target_speaker_id]}")
if second_target_speaker == "none":
    second_target_speaker_id = target_speaker_id
elif second_target_speaker == "random":
    second_target_speaker_id = random.randint(0, 99)
else:
    second_target_speaker_id = target_speaker_names.index(second_target_speaker)
print(f"{second_target_speaker_id=}")
print(f"second target speaker name: {target_speaker_names[second_target_speaker_id]}")

# The mean pitches are looked up unconditionally because the "monotone" bonus also needs them.
target_speaker_mean_pitch = target_speaker_mean_pitches[target_speaker_id]
second_target_speaker_mean_pitch = target_speaker_mean_pitches[second_target_speaker_id]
if pitch_shift_semitones_ == "auto":
    tmp_source_pitch, _ = pyworld.harvest(source_wav.astype(np.float64), IN_SAMPLE_RATE)
    tmp_source_pitch = tmp_source_pitch[tmp_source_pitch > 0]
    if len(tmp_source_pitch) == 0:
        pitch_shift_semitones = 0.0
        second_pitch_shift_semitones = 0.0
    else:
        source_speaker_mean_pitch = math.exp(float(np.log(tmp_source_pitch).mean()))
        pitch_shift_semitones = math.log2(target_speaker_mean_pitch / source_speaker_mean_pitch) * 12.0
        second_pitch_shift_semitones = math.log2(second_target_speaker_mean_pitch / source_speaker_mean_pitch) * 12.0
else:
    pitch_shift_semitones = float(pitch_shift_semitones_)
    second_pitch_shift_semitones = pitch_shift_semitones
pitch_shift_semitones = round(pitch_shift_semitones, 1)
second_pitch_shift_semitones = round(second_pitch_shift_semitones, 1)
print(f"{pitch_shift_semitones=}")
print(f"{second_pitch_shift_semitones=}")

if second_target_mode == "merge":
    # slerp takes the interpolation weight first, then the two embeddings.
    target_speaker_embeddings = slerp(
        0.5,
        speaker_embeddings[target_speaker_id],
        speaker_embeddings[second_target_speaker_id],
    )
    target_speaker_embeddings = np.stack([target_speaker_embeddings] * 400)
    quantized_pitch_shift = np.array([int(round(
        (pitch_shift_semitones + second_pitch_shift_semitones) * 0.5 * PITCH_BINS_PER_OCTAVE / 12.0
    ))] * 400, dtype=np.int32)
else:
    target_speaker_embeddings = []
    quantized_pitch_shift = []
    target_speaker_embeddings += [speaker_embeddings[target_speaker_id]] * 100
    quantized_pitch_shift += [int(round(pitch_shift_semitones * PITCH_BINS_PER_OCTAVE / 12.0))] * 100
    for i in range(100):
        target_speaker_embeddings.append(slerp(
            i / 100.0,
            speaker_embeddings[target_speaker_id],
            speaker_embeddings[second_target_speaker_id],
        ))
        quantized_pitch_shift.append(int(round(
            (pitch_shift_semitones * (100 - i) + second_pitch_shift_semitones * i) / 100.0
            * PITCH_BINS_PER_OCTAVE / 12.0
        )))
    target_speaker_embeddings += [speaker_embeddings[second_target_speaker_id]] * 100
    quantized_pitch_shift += [int(round(second_pitch_shift_semitones * PITCH_BINS_PER_OCTAVE / 12.0))] * 100
    for i in range(100):
        target_speaker_embeddings.append(slerp(
            i / 100.0,
            speaker_embeddings[second_target_speaker_id],
            speaker_embeddings[target_speaker_id],
        ))
        quantized_pitch_shift.append(int(round(
            (second_pitch_shift_semitones * (100 - i) + pitch_shift_semitones * i) / 100.0
            * PITCH_BINS_PER_OCTAVE / 12.0
        )))
    target_speaker_embeddings = np.stack(target_speaker_embeddings)
    quantized_pitch_shift = np.array(quantized_pitch_shift, dtype=np.int32)
print(f"{target_speaker_embeddings.shape=}")
print(f"{quantized_pitch_shift=}")

if bonus == "vibrato":
    vibrato_rate_hz = 4.0
    vibrato_depth_semitones = 3.0
    vibrato_wave = np.sin(np.linspace(
        0, 2.0 * math.pi * vibrato_rate_hz * 400 / 100, 400, endpoint=False
    )) * (vibrato_depth_semitones * PITCH_BINS_PER_OCTAVE / 12.0)
    quantized_pitch_shift += vibrato_wave.round().astype(np.int32)
    print(f"{quantized_pitch_shift=}")
elif bonus == "autotune":
    print(f"{nearest_c_major_scale=}")
elif bonus == "monotone":
    quantized_target_pitch = int(round(
        math.log2(math.sqrt(target_speaker_mean_pitch * second_target_speaker_mean_pitch) / 55.0) * PITCH_BINS_PER_OCTAVE
    ))
    print(f"{quantized_target_pitch=}")
elif bonus == "whisper":
    pass
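
The `'auto'` pitch shift in this cell boils down to three steps: take the geometric mean of the voiced source F0, convert the ratio to the target speaker's mean pitch into semitones, and quantize that to pitch bins. The following is a minimal standalone sketch of that arithmetic. The helper names are hypothetical, and `BINS_PER_OCTAVE = 96` is an assumed value (the notebook takes `PITCH_BINS_PER_OCTAVE` from `beatrice`).

```python
import math

import numpy as np

BINS_PER_OCTAVE = 96  # assumed value; the notebook imports PITCH_BINS_PER_OCTAVE from beatrice


def geometric_mean_hz(f0_hz: np.ndarray) -> float:
    """Mean pitch in the log domain over voiced frames (f0 > 0)."""
    voiced = f0_hz[f0_hz > 0]
    return math.exp(float(np.log(voiced).mean()))


def semitone_shift(source_hz: float, target_hz: float) -> float:
    """Shift, in semitones, that moves source_hz onto target_hz."""
    return 12.0 * math.log2(target_hz / source_hz)


def semitones_to_bins(semitones: float) -> int:
    """Quantize a semitone shift to pitch bins."""
    return int(round(semitones * BINS_PER_OCTAVE / 12.0))


f0 = np.array([0.0, 100.0, 200.0, 0.0, 100.0, 200.0])  # toy contour; 0 marks unvoiced frames
source_mean = geometric_mean_hz(f0)
shift = semitone_shift(source_mean, 2.0 * source_mean)  # target one octave above the source
print(source_mean, shift, semitones_to_bins(shift))
```

Averaging in the log domain matches how pitch is perceived: the toy contour alternates between two notes an octave apart, and the geometric mean lands midway between them in semitones.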



In [None]:
# @title Conversion

assert block_size == 1
phone_ctx = phone_extractor.new_context(block_size)
pitch_ctx = pitch_estimator.new_context(block_size)
waveform_ctx = waveform_generator.new_context(block_size)

block_size_in_input_sample_rate = block_size * IN_HOP_LENGTH

converted_wav_segments = []
t0 = perf_counter()

# Convert block_size * 10 ms at a time
for left in tqdm(range(0, len(source_wav), block_size_in_input_sample_rate)):
    source_wav_segment = source_wav[left : left + block_size_in_input_sample_rate]
    if len(source_wav_segment) < block_size_in_input_sample_rate:
        source_wav_segment = np.pad(
            source_wav_segment,
            (0, block_size_in_input_sample_rate - len(source_wav_segment)),
        )
    phone = phone_extractor(source_wav_segment, phone_ctx)
    quantized_pitch, pitch_features = pitch_estimator(source_wav_segment, pitch_ctx)
    l = left // block_size_in_input_sample_rate % 400
    if bonus in {"none", "vibrato", "autotune"}:
        quantized_pitch = (
            quantized_pitch
            + quantized_pitch_shift[l : l + block_size]
        ).clip(1, PITCH_BINS - 1)
        if bonus == "autotune":
            quantized_pitch = nearest_c_major_scale[quantized_pitch]
    elif bonus == "monotone":
        quantized_pitch[:] = quantized_target_pitch
    else:
        raise ValueError(f"unsupported bonus: {bonus!r}")
    speaker_embedding = target_speaker_embeddings[l : l + block_size]
    converted_wav_segment = waveform_generator(
        phone, quantized_pitch, pitch_features, speaker_embedding, waveform_ctx
    )
    converted_wav_segments.append(converted_wav_segment)
elapsed_time = perf_counter() - t0
rtf = elapsed_time / len(source_wav) * IN_SAMPLE_RATE
print(f"Elapsed time: {elapsed_time:.3f}s")
print(f"RTF: {rtf:.3f}")  # The Xeon is way too slow!

converted_wav = np.concatenate(converted_wav_segments)
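# Apply a short linear fade-in; the length is capped so very short inputs do not raise a shape error.
fade_length = min(2000, len(converted_wav))
converted_wav[:fade_length] *= np.linspace(0.0, 1.0, fade_length)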

plt.figure(figsize=(16, 1))
plt.plot(np.arange(len(source_wav)) / IN_SAMPLE_RATE, source_wav + 0.1, label="source")
plt.plot(np.arange(len(converted_wav)) / OUT_SAMPLE_RATE, converted_wav - 0.1, label="converted")
plt.legend()
plt.grid()
plt.show()
display(Audio(converted_wav, rate=OUT_SAMPLE_RATE))
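
The conversion loop above feeds the models one fixed-length block at a time and zero-pads the final, shorter block. That slicing pattern can be isolated as in the sketch below. `iter_blocks` is a hypothetical helper, and the block length of 160 samples assumes 10 ms at 16 kHz; the notebook itself uses `IN_HOP_LENGTH` from `beatrice`.

```python
import numpy as np


def iter_blocks(wav: np.ndarray, block_len: int):
    """Yield consecutive fixed-length blocks, zero-padding the final one."""
    for left in range(0, len(wav), block_len):
        block = wav[left : left + block_len]
        if len(block) < block_len:
            block = np.pad(block, (0, block_len - len(block)))
        yield block


wav = np.arange(350, dtype=np.float32)  # stand-in for real audio samples
blocks = list(iter_blocks(wav, 160))    # 160 samples = 10 ms at 16 kHz (assumed hop length)
print(len(blocks), [len(b) for b in blocks], blocks[-1][-1])
```

In a real-time client the same blocks would arrive from an audio callback instead of a pre-loaded array, while the per-model context objects carry state from one block to the next.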
