## VC-SD: Demonstration

This script demonstrates the voice conversion, voice design, and controllability features of the VC-SD framework. Please note that it is for demonstration purposes only: final models, training schemes, and practical implementation details are not provided.

## Install dependencies

In [1]:
#@title Setup and Imports
!pip install torch librosa numpy ipywidgets descript-audiotools torchfcpe
print("All libraries installed successfully!")

All libraries installed successfully!
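
Note that the `print` call in the install cell runs whether or not `pip` succeeded. As a minimal sketch, the check below uses only the standard library to confirm that each package can actually be found. The module names are partly assumptions: `audiotools` is the import name used later in this notebook for `descript-audiotools`, while `torchfcpe` is assumed to match its package name. `missing_modules` is a hypothetical helper, not part of the repository.

```python
from importlib.util import find_spec

# Import names for the packages installed above (partly assumed, see note).
REQUIRED_MODULES = ["torch", "librosa", "numpy", "ipywidgets", "audiotools", "torchfcpe"]

def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported missing, re-run the install cell and check the `pip` output for errors.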


## Import dependencies

In [2]:
#@title Import

!git clone https://github.com/abargum/vc-sd-reproduction.git

import os
os.chdir('vc-sd-reproduction')

import torch
import librosa
import numpy as np
from utils.demo_utils import *
from audiotools import transforms as tfm
import ipywidgets as widgets
from IPython.display import display

Cloning into 'vc-sd-reproduction'...
remote: Enumerating objects: 81, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 81 (delta 7), reused 15 (delta 4), pack-reused 59 (from 2)
Receiving objects: 100% (81/81), 63.22 MiB | 20.79 MiB/s, done.
Resolving deltas: 100% (16/16), done.


## 🎙️ Voice Design

In the next cell, you'll be able to transform a voice using simple, interactive controls.

Run the cell, then use the sliders to design a new voice profile for your input audio. Feel free to experiment; even small adjustments can noticeably change the result.

---

### 🎛️ Controls

- **Audio File**: Select the audio file you want to convert. To use a different file, simply change the file path.

- **Gender**: Adjusts the perceived timbre of the voice: **-1.72** → more typically masculine, **1.94** → more typically feminine.

- **Age**: Changes the perceived age of the output. Because age cues are subtle, this control is best thought of as a timbre variation.

- **Tremble**: Adds a tremble (vibrato-like effect) to the voice: **0** → no tremble, **12** → strong tremble.

- **Ambitus**: Controls how expressive the voice sounds: **0.5** → flatter, more robotic, **1.5** → wider pitch range, more emotional.

- **Pitch**: Shifts the overall pitch of the converted voice up or down.

---

💡 **Tip:** Try adjusting one slider at a time to clearly hear what each parameter changes, then combine them to craft unique voice styles.
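
To build intuition for the **Pitch** and **Ambitus** controls, here is a minimal sketch on a toy F0 contour. The semitone-to-ratio formula is the same one the notebook uses. The ambitus scaling (stretching log-F0 deviations around their mean) is only an assumption about what such a control could do, not a description of VC-SD's internals, and `scale_ambitus` is a hypothetical helper.

```python
import numpy as np

def semitones_to_ratio(semitones):
    """Frequency ratio for a shift in semitones (12 semitones = one octave)."""
    return 2.0 ** (semitones / 12.0)

def scale_ambitus(f0_hz, ambitus):
    """Illustrative only: scale log-F0 deviations around their mean.

    ambitus < 1 narrows the pitch range, ambitus > 1 widens it.
    """
    log_f0 = np.log2(f0_hz)
    center = log_f0.mean()
    return 2.0 ** (center + ambitus * (log_f0 - center))

f0 = np.array([100.0, 200.0, 400.0])   # toy contour spanning two octaves

flat = scale_ambitus(f0, 0.5)          # range shrinks to one octave
wide = scale_ambitus(f0, 1.5)          # range grows to three octaves
shifted = f0 * semitones_to_ratio(12)  # whole contour moves up one octave

print("flat range ratio:", flat[-1] / flat[0])
print("wide range ratio:", wide[-1] / wide[0])
print("shifted:", shifted)
```

In this sketch, pitch shifting moves the whole contour while preserving its shape, whereas ambitus changes how far the contour strays from its center.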

In [6]:
#@title Create Audio

def linear_map(x, src_min, src_max, dst_min, dst_max):
    x = np.clip(x, src_min, src_max)
    return dst_min + (x - src_min) * (dst_max - dst_min) / (src_max - src_min)

def years_to_age_param(years):
    """Convert years back to age parameter"""
    return linear_map(
        years,
        src_min=15,
        src_max=90,
        dst_min=-0.75,
        dst_max=3.5,
    )

def semitones_to_pitch(semitones):
    """Convert semitones to a pitch multiplier (frequency ratio; +12 semitones = 2.0)"""
    return 2 ** (semitones / 12.0)

def gender_param_to_label(gender_param):
    """Convert gender parameter to label"""
    if gender_param < 0:
        return "Male"
    else:
        return "Female"

transform = tfm.Compose(
            tfm.VolumeNorm(),
            tfm.RescaleAudio())
vc_model = torch.jit.load("pretrained/model-nc.ts")
vc_model = vc_model.eval()

audio_path = widgets.Text(
    value='audio/librispeech2.wav',
    placeholder='Enter audio file path',
    description='Audio File:',
    continuous_update=False
)

gender_slider = widgets.FloatSlider(value=-0.1, min=-1.72, max=1.94, step=0.01, description='Gender:', continuous_update=False)
age_slider = widgets.IntSlider(value=35, min=15, max=90, step=1, description='Age (years):', continuous_update=False)
tremble_slider = widgets.FloatSlider(value=0.0, min=0.0, max=12.0, step=0.1, description='Tremble:', continuous_update=False)
ambitus_slider = widgets.FloatSlider(value=1.0, min=0.5, max=1.5, step=0.01, description='Ambitus:', continuous_update=False)
pitch_slider = widgets.IntSlider(value=0, min=-12, max=12, step=1, description='Pitch (semitones):', continuous_update=False)

gender_label = widgets.Label(value='Male (-1.72) → Female (1.94)')
age_label = widgets.Label(value='Age in years')
tremble_label = widgets.Label(value='Tremble Amount')
ambitus_label = widgets.Label(value='Pitch Variance')
pitch_label = widgets.Label(value='-12 to +12 semitones')

process_button = widgets.Button(description='Process Audio', button_style='success')
output = widgets.Output()

def process_audio(b):
    with output:
        output.clear_output()

        gender = gender_slider.value
        age_years = age_slider.value
        age = years_to_age_param(age_years)
        tremble = tremble_slider.value
        ambitus = ambitus_slider.value
        semitones = pitch_slider.value
        pitch = semitones_to_pitch(semitones)

        print(f"Audio file: {audio_path.value}")
        print(f"Gender: {gender_param_to_label(gender)}")
        print(f"Age: {age_years} years")
        print(f"Tremble: {tremble}")
        print(f"Ambitus: {ambitus}")
        print(f"Pitch: {semitones:+d} semitones")
        print()
        print("Processing, may take a few seconds...")
        print()

        try:
            x, sr = librosa.load(audio_path.value, sr=16000, mono=True)
            x = torch.tensor(x, dtype=torch.float32).unsqueeze(0).unsqueeze(0)

            speaker_gender = torch.tensor([gender], dtype=torch.float32)
            speaker_age = torch.tensor([age], dtype=torch.float32)
            speaker_tremble = torch.tensor([tremble], dtype=torch.float32)
            speaker_ambitus = torch.tensor([ambitus], dtype=torch.float32)
            speaker_pitch = torch.tensor([pitch], dtype=torch.float32)

            with torch.no_grad():
                vc_model.reset_pitch()
                vc_model.set_new_speaker(speaker_gender, speaker_age)
                vc_model.set_tremble_depth(speaker_tremble)
                vc_model.set_ambitus_scaler(speaker_ambitus)
                vc_model.set_pitch_mult(speaker_pitch)

            out = vc_model(x)

            display_audios([("INPUT", x, sr), ("CONVERTED", out, sr)])
        except Exception as e:
            print(f"Error: {e}")

process_button.on_click(process_audio)

display(widgets.VBox([
    audio_path,
    widgets.HBox([gender_slider, gender_label]),
    widgets.HBox([age_slider, age_label]),
    widgets.HBox([tremble_slider, tremble_label]),
    widgets.HBox([ambitus_slider, ambitus_label]),
    widgets.HBox([pitch_slider, pitch_label]),
    process_button,
    output
]))

VBox(children=(Text(value='audio/librispeech2.wav', continuous_update=False, description='Audio File:', placeh…
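
As a quick worked check of the age mapping used in the cell above, the sketch below re-declares `linear_map` and `years_to_age_param` with the same constants so that it runs on its own. It shows how slider values in years translate to the model's age parameter, and that out-of-range inputs are clipped.

```python
import numpy as np

def linear_map(x, src_min, src_max, dst_min, dst_max):
    """Clip x to the source range, then map it linearly to the target range."""
    x = np.clip(x, src_min, src_max)
    return dst_min + (x - src_min) * (dst_max - dst_min) / (src_max - src_min)

def years_to_age_param(years):
    """Map 15..90 years onto the age parameter range -0.75..3.5."""
    return linear_map(years, 15, 90, -0.75, 3.5)

for years in (15, 35, 90, 120):
    print(years, "->", round(float(years_to_age_param(years)), 3))
# 15 -> -0.75
# 35 -> 0.383
# 90 -> 3.5
# 120 -> 3.5  (clipped to the upper bound)
```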

## 🎧 Convert by Audio Reference

Instead of designing a voice with sliders, you can also **convert your input to match a reference recording**.

Simply provide a **target audio file**, and the system will analyze its vocal characteristics, such as timbre and tone, and apply them to your input audio.

**In short**: Input content + reference voice = your message, delivered in a new vocal style.

In [7]:
#@title Create Audio
input_audio_path = widgets.Text(
    value='audio/librispeech2.wav',
    placeholder='Enter input audio file path',
    description='Input Audio:',
    continuous_update=False
)

target_audio_path = widgets.Text(
    value='targets/p228_004.wav',
    placeholder='Enter target audio file path',
    description='Target Audio:',
    continuous_update=False
)

target_start_sample = widgets.IntText(
    value=8000,
    description='Target Start (in samples):',
    continuous_update=False
)

process_button = widgets.Button(description='Process Audio', button_style='success')
output = widgets.Output()

def process_audio(b):
    with output:
        output.clear_output()
        try:
            x, sr = librosa.load(input_audio_path.value, sr=16000, mono=True)
            x = torch.tensor(x, dtype=torch.float32).unsqueeze(0).unsqueeze(0)

            t, sr = librosa.load(target_audio_path.value, sr=16000, mono=True)
            start_idx = target_start_sample.value
            t = torch.tensor(t[start_idx:], dtype=torch.float32).unsqueeze(0).unsqueeze(0)

            print(f"Input audio: {input_audio_path.value}")
            print(f"Target audio: {target_audio_path.value}")
            print(f"Start sample: {start_idx}")
            print()

            with torch.no_grad():
                vc_model.reset_pitch()
                vc_model.set_embedding_from_audio(t)
                vc_model.set_tremble_depth(torch.zeros(1, dtype=torch.float32))
                vc_model.set_ambitus_scaler(torch.ones(1, dtype=torch.float32))
                vc_model.set_pitch_mult(torch.ones(1, dtype=torch.float32))

            out = vc_model(normalize(x, transform))
            display_audios([("INPUT", x, sr), ("TARGET", t, sr), ("CONVERTED", out, sr)])

        except Exception as e:
            print(f"Error: {e}")

process_button.on_click(process_audio)

display(widgets.VBox([
    input_audio_path,
    target_audio_path,
    target_start_sample,
    process_button,
    output
]))

VBox(children=(Text(value='audio/librispeech2.wav', continuous_update=False, description='Input Audio:', place…
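
The cell above takes the target start offset in samples, and audio is loaded at 16 kHz, so the default of 8000 corresponds to 0.5 seconds. The sketch below shows one way to convert from seconds and to guard against an offset that would leave no audio (a plain slice past the end silently returns an empty array). `seconds_to_samples` and `trim_start` are hypothetical helpers, not part of the repository.

```python
import numpy as np

SAMPLE_RATE = 16000  # matches sr=16000 used when loading audio in this notebook

def seconds_to_samples(seconds, sample_rate=SAMPLE_RATE):
    """Convert a time offset in seconds to a sample index."""
    return int(round(seconds * sample_rate))

def trim_start(waveform, start_sample):
    """Drop everything before start_sample; reject offsets that leave no audio."""
    if not 0 <= start_sample < len(waveform):
        raise ValueError(
            f"start_sample {start_sample} is outside the audio (length {len(waveform)})"
        )
    return waveform[start_sample:]

# Two seconds of silence as a stand-in for a loaded target recording.
target = np.zeros(2 * SAMPLE_RATE, dtype=np.float32)

start = seconds_to_samples(0.5)
trimmed = trim_start(target, start)
print(start, len(trimmed))  # 8000 24000
```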

## 🎧 Convert by Predefined Library

You can also **convert your input to match a predefined speaker ID**, in this case one from the VCTK dataset.

Simply provide a **speaker ID** (p225 to p360), and the system will apply that speaker's vocal characteristics to your input.
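
Since the speaker ID is typed in by hand, a small format check can catch typos before the lookup runs. The sketch below only tests that the ID looks like `pNNN` within the range stated above; it does not guarantee that the ID exists in the speaker dictionary, whose contents are not shown here. `is_plausible_vctk_id` is a hypothetical helper.

```python
import re

# "p" followed by exactly three digits, e.g. "p231".
SPEAKER_ID_PATTERN = re.compile(r"p(\d{3})")

def is_plausible_vctk_id(speaker_id, low=225, high=360):
    """Return True if speaker_id has the form pNNN with low <= NNN <= high."""
    match = SPEAKER_ID_PATTERN.fullmatch(speaker_id.strip())
    return bool(match) and low <= int(match.group(1)) <= high

for candidate in ("p231", "p999", "231", " p225 "):
    print(repr(candidate), is_plausible_vctk_id(candidate))
```

An ID that passes this check can still be missing from the JSON file, so the lookup itself should remain inside the `try` block as in the cell below.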

In [8]:
#@title Create Audio
input_audio_path = widgets.Text(
    value='audio/librispeech2.wav',
    placeholder='Enter input audio file path',
    description='Input Audio:',
    continuous_update=False
)

speaker_id = widgets.Text(
    value='p231',
    placeholder='Enter VCTK speaker ID',
    description='Speaker ID:',
    continuous_update=False
)

json_path = widgets.Text(
    value='utils/speaker_dict.json',
    placeholder='Enter JSON path',
    description='JSON Path:',
    continuous_update=False
)

process_button = widgets.Button(description='Process Audio', button_style='success')
output = widgets.Output()

def process_audio(b):
    with output:
        output.clear_output()
        try:
            x, sr = librosa.load(input_audio_path.value, sr=16000, mono=True)
            x = torch.tensor(x, dtype=torch.float32).unsqueeze(0).unsqueeze(0)

            target = [speaker_id.value]
            speaker_embedding_avg, speaker_embedding_one, speaker_mean = get_speaker_embeddings_json(target, json_path.value)

            print(f"Speaker ID: {target[0]}")
            print(f"F0 Mean: {speaker_mean[0]:.2f}")
            print()

            speaker_mean = torch.tensor([speaker_mean[0]], dtype=torch.float32)
            speaker_embedding_avg = speaker_embedding_avg[0]

            with torch.no_grad():
                vc_model.reset_pitch()
                vc_model.set_new_speaker_from_embedding(speaker_mean, speaker_embedding_avg)
                vc_model.set_tremble_depth(torch.zeros(1, dtype=torch.float32))
                vc_model.set_ambitus_scaler(torch.ones(1, dtype=torch.float32))
                vc_model.set_pitch_mult(torch.ones(1, dtype=torch.float32))

            out = vc_model(normalize(x, transform))
            display_audios([("INPUT", x, sr), ("CONVERTED", out, sr)])

        except Exception as e:
            print(f"Error: {e}")

process_button.on_click(process_audio)

display(widgets.VBox([
    input_audio_path,
    speaker_id,
    json_path,  # defined above and used in process_audio, so show it for editing
    process_button,
    output
]))

VBox(children=(Text(value='audio/librispeech2.wav', continuous_update=False, description='Input Audio:', place…