

# Submitted to [Listed Careers](https://listedinc.notion.site/About-Us-Listed-Inc-c158f2e78d7948a2abae6033e56920e8)
Create a voice cloning model that can generate a synthetic voice that sounds like a specific person. The model
should be able to generate speech from text input, and it should be able to reproduce the unique vocal
characteristics of the target speaker.

# **Voice cloning model: To generate a synthetic voice**


Bark is a transformer-based text-to-audio model developed by Suno. It generates human-like speech from input text, and when it is conditioned on a short reference recording of a speaker, the generated speech can imitate that speaker's voice.

This Colab notebook builds on the original Bark model, which is hosted in the GitHub repository [bark text-to-speech](https://github.com/suno-ai/bark). It also uses the [bark-voice-cloning-HuBERT-quantizer](https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/) repository, cloned in Step 1, to turn the reference recording into a speaker prompt.

Modified by: [Abdullah Firdowsi](https://www.linkedin.com/in/abdullahfirdowsi/)

In [1]:
#@title <h1>Step 1: Clone the repository</h1>
!rm -rf /content/github-clone-repo
!rm -rf /content/sample_data
!mkdir /content/github-clone-repo
%cd /content/github-clone-repo
!git clone https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/
%cd bark-voice-cloning-HuBERT-quantizer

/content/github-clone-repo
Cloning into 'bark-voice-cloning-HuBERT-quantizer'...
remote: Enumerating objects: 1866, done.
remote: Counting objects: 100% (231/231), done.
remote: Compressing objects: 100% (109/109), done.
remote: Total 1866 (delta 133), reused 208 (delta 117), pack-reused 1635
Receiving objects: 100% (1866/1866), 319.75 MiB | 23.38 MiB/s, done.
Resolving deltas: 100% (134/134), done.
/content/github-clone-repo/bark-voice-cloning-HuBERT-quantizer


In [2]:
#@title <h1>Step 2: Install Packages</h1>
#@markdown * Installing requirements
#@markdown * Installing torch torchvision torchaudio
%pip install -r requirements.txt
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
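# Install Suno's Bark from source; the PyPI package named "bark" is an unrelated project.
%pip install git+https://github.com/suno-ai/bark.git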

Ignoring soundfile: markers 'platform_system == "Windows"' don't match your environment
Collecting audiolm-pytorch==1.1.4 (from -r requirements.txt (line 1))
  Downloading audiolm_pytorch-1.1.4-py3-none-any.whl (37 kB)
Collecting fairseq (from -r requirements.txt (line 2))
  Downloading fairseq-0.12.2.tar.gz (9.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.6/9.6 MB 15.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub (from -r requirements.txt (line 3))
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 268.8/268.8 kB 25.1 MB/s eta 0:00:00
Collecting sentencepiece (from -r requirements.txt (line 4))
  Downloading se

In [3]:
#@title <h1>Step 3: Loading the models </h1>
large_quant_model = False  # Use the larger pretrained model
device = 'cuda'  # 'cuda', 'cpu', 'cuda:0', 0, -1, torch.device('cuda')

import numpy as np
import IPython.display
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from bark_hubert_quantizer.hubert_manager import HuBERTManager
from bark_hubert_quantizer.pre_kmeans_hubert import CustomHubert
from bark_hubert_quantizer.customtokenizer import CustomTokenizer

model = ('quantifier_V1_hubert_base_ls960_23.pth', 'tokenizer_large.pth') if large_quant_model else ('quantifier_hubert_base_ls960_14.pth', 'tokenizer.pth')

print('Loading HuBERT...')
hubert_model = CustomHubert(HuBERTManager.make_sure_hubert_installed(), device=device)
print('Loading Quantizer...')
quant_model = CustomTokenizer.load_from_checkpoint(HuBERTManager.make_sure_tokenizer_installed(model=model[0], local_file=model[1]), device)
print('Loading Encodec...')
encodec_model = EncodecModel.encodec_model_24khz()
encodec_model.set_target_bandwidth(6.0)
encodec_model.to(device)
print('Downloaded HuBERT Quantizer Encodec')

from bark.api import generate_audio
from transformers import BertTokenizer
from bark.generation import SAMPLE_RATE, preload_models, codec_decode, generate_coarse, generate_fine, generate_text_semantic

#@title <h1>Load the selected models</h1>
#@markdown #**Selection of models to download**

text_use_gpu = True #@param {type:"boolean"}
text_use_small=False #@param {type:"boolean"}
coarse_use_gpu=True #@param {type:"boolean"}
coarse_use_small=False #@param {type:"boolean"}
fine_use_gpu=True #@param {type:"boolean"}
fine_use_small=False #@param {type:"boolean"}
codec_use_gpu=True #@param {type:"boolean"}
force_reload=False #@param {type:"boolean"}
print('Loading Preload...')
preload_models(
    text_use_gpu,
    text_use_small,
    coarse_use_gpu,
    coarse_use_small,
    fine_use_gpu,
    fine_use_small,
    codec_use_gpu,
    force_reload,
    # path="models"
)

print('Downloaded and loaded all models!')

Loading HuBERT...
Downloading HuBERT base model
Downloaded HuBERT
Loading Quantizer...
Downloading HuBERT custom tokenizer


Downloading (…)rt_base_ls960_14.pth:   0%|          | 0.00/104M [00:00<?, ?B/s]

Downloaded tokenizer
Loading Encodec...


Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /root/.cache/torch/hub/checkpoints/encodec_24khz-d7cc33bc.th
100%|██████████| 88.9M/88.9M [00:01<00:00, 83.9MB/s]


Downloaded HuBERT Quantizer Encodec
Loading Preload...


Downloading text_2.pt:   0%|          | 0.00/5.35G [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading coarse_2.pt:   0%|          | 0.00/3.93G [00:00<?, ?B/s]

Downloading fine_2.pt:   0%|          | 0.00/3.74G [00:00<?, ?B/s]

Downloaded and loaded all models!
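
The log above shows three Bark checkpoints of 5.35 GB, 3.93 GB and 3.74 GB, so the large models need roughly 13 GB of disk space. The following is a minimal sketch of a free-space check that could be run before `preload_models`. The helper `has_room_for_models`, the choice of the home directory as the cache location, and the 1 GB safety margin are all assumptions made for illustration; the real cache path depends on the environment.

```python
import shutil
from pathlib import Path

# Checkpoint sizes in GB, read off the download log above.
CHECKPOINT_SIZES_GB = {"text_2.pt": 5.35, "coarse_2.pt": 3.93, "fine_2.pt": 3.74}


def required_gb(sizes=CHECKPOINT_SIZES_GB):
    """Total download size in GB for the listed checkpoints."""
    return sum(sizes.values())


def has_room_for_models(cache_dir=Path.home(), margin_gb=1.0):
    """Return True if the filesystem holding cache_dir has enough free space.

    cache_dir and margin_gb are illustrative defaults, not Bark settings.
    """
    free_gb = shutil.disk_usage(cache_dir).free / 1e9
    return free_gb >= required_gb() + margin_gb


print(f"Need about {required_gb():.2f} GB; enough free space: {has_room_for_models()}")
```

If space is tight, setting the `*_use_small` options in the cell above to `True` selects smaller checkpoints instead.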


# **Quick guide**
1. Record or prepare an audio file of the target speaker and name it `input_audio.wav`.
2. Upload the `.wav` file so the notebook can extract a speaker prompt from it.
3. Enter the text you want the cloned voice to speak.
4. Run the model to generate the cloned voice.
5. Play the audio and save it to your local computer.
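
Before uploading, it can help to confirm that the recording really is a readable WAV file and to see its channel count, sample rate and duration. The sketch below uses only Python's standard `wave` module, which handles uncompressed PCM files. `describe_wav` and `write_silence` are hypothetical helpers, and the silent demo file stands in for a real recording.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration


def write_silence(path, seconds=1.0, sample_rate=24_000):
    """Write a mono 16-bit silent WAV, used here only as demo input."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))


# Demo on a generated file; replace demo_path with your own recording.
demo_path = os.path.join(tempfile.mkdtemp(), "input_audio.wav")
write_silence(demo_path)
print(describe_wav(demo_path))
```

A stereo file is not a problem for the notebook, since the upload cell averages the two channels into one.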



# **Now let's try!**

In [9]:
#@title 1. Give the audio file as input

!rm -rf /content/user_data
!mkdir /content/user_data
%cd /content/user_data
from google.colab import files
audio_input = files.upload()

wav_file = 'input_audio.wav'  # Path to the reference recording of the speaker to clone; must match the uploaded file's name.

wav, sr = torchaudio.load(wav_file)

wav_hubert = wav.to(device)

if wav_hubert.shape[0] == 2:  # Stereo to mono if needed
    wav_hubert = wav_hubert.mean(0, keepdim=True)

print('Extracting semantics...')
semantic_vectors = hubert_model.forward(wav_hubert, input_sample_hz=sr)
print('Tokenizing semantics...')
semantic_tokens = quant_model.get_token(semantic_vectors)
print('Creating coarse and fine prompts...')
wav = convert_audio(wav, sr, encodec_model.sample_rate, 1).unsqueeze(0)

wav = wav.to(device)

with torch.no_grad():
    encoded_frames = encodec_model.encode(wav)
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()

codes = codes.cpu()
semantic_tokens = semantic_tokens.cpu()

!rm -rf /content/generated_data
!mkdir /content/generated_data

history_prompt = '/content/generated_data/output.npz'
np.savez(history_prompt, fine_prompt=codes, coarse_prompt=codes[:2, :], semantic_prompt=semantic_tokens)
print('Done!')

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
/content/user_data


Saving input_audio.wav to input_audio.wav
Extracting semantics...
Tokenizing semantics...
Creating coarse and fine prompts...
Done!
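
The saved `output.npz` holds three arrays: the full set of Encodec codes as `fine_prompt`, the first two codebook rows as `coarse_prompt`, and the HuBERT-derived tokens as `semantic_prompt`. The sketch below checks that layout on dummy data. It assumes that Encodec at the 6.0 kbps bandwidth set earlier yields 8 codebooks; the random integers and the helper `check_history_prompt` are purely illustrative.

```python
import os
import tempfile

import numpy as np

N_FINE_CODEBOOKS = 8    # assumed: Encodec 24 kHz at 6 kbps
N_COARSE_CODEBOOKS = 2  # matches codes[:2, :] in the cell above


def check_history_prompt(path):
    """Load an .npz prompt and verify that its arrays relate as expected."""
    with np.load(path) as data:
        fine = data["fine_prompt"]
        coarse = data["coarse_prompt"]
        semantic = data["semantic_prompt"]
    assert fine.shape[0] == N_FINE_CODEBOOKS, "fine prompt should have 8 codebooks"
    assert coarse.shape == (N_COARSE_CODEBOOKS, fine.shape[1])
    assert np.array_equal(coarse, fine[:N_COARSE_CODEBOOKS])
    assert semantic.ndim == 1
    return fine.shape, coarse.shape, semantic.shape


# Dummy stand-ins for the real tensors (random integers, not real audio codes).
rng = np.random.default_rng(0)
codes = rng.integers(0, 1024, size=(N_FINE_CODEBOOKS, 150))
semantic_tokens = rng.integers(0, 10_000, size=100)

demo_path = os.path.join(tempfile.mkdtemp(), "output.npz")
np.savez(demo_path, fine_prompt=codes, coarse_prompt=codes[:2, :],
         semantic_prompt=semantic_tokens)
shapes = check_history_prompt(demo_path)
print(shapes)
```

Running the same check on the real `/content/generated_data/output.npz` is a quick way to catch a prompt that was saved with the wrong bandwidth or with an extra batch dimension.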


In [5]:
#@title 2. Give the text prompt as input

# Enter your prompt and speaker here
text_prompt = input("Enter your text prompt : ")
# text_prompt = "Generative AI is a type of artificial intelligence that can create new things like art, music, or stories on its own. It's like having a machine that can be creative and make unique stuff"

Enter your text prompt : Generative AI is a type of artificial intelligence that can create new things like art, music, or stories on its own. It's like having a machine that can be creative and make unique stuff.


In [10]:
#@title 3. Generate the cloned voice
x_semantic = generate_text_semantic(
    text_prompt,
    history_prompt,
    temp=0.7,
    top_k=50,
    top_p=0.95,
)

x_coarse_gen = generate_coarse(
    x_semantic,
    history_prompt,
    temp=0.7,
    top_k=50,
    top_p=0.95,
)
x_fine_gen = generate_fine(
    x_coarse_gen,
    history_prompt,
    temp=0.5,
)
audio_array = codec_decode(x_fine_gen)

100%|██████████| 100/100 [01:28<00:00,  1.13it/s]
100%|██████████| 37/37 [06:59<00:00, 11.34s/it]
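
The `temp`, `top_k` and `top_p` arguments in the cell above control how each token is sampled. The sketch below shows one common way these three settings are combined, using NumPy only. It is a generic illustration and not Bark's actual sampling code; `filter_logits` is a hypothetical helper.

```python
import numpy as np


def filter_logits(logits, temp=0.7, top_k=50, top_p=0.95):
    """Return sampling probabilities after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    logits = np.asarray(logits, dtype=np.float64) / temp

    # Top-k: keep only the k largest logits.
    if top_k is not None and top_k < logits.size:
        kth_largest = np.sort(logits)[-top_k]
        logits = np.where(logits < kth_largest, -np.inf, logits)

    # Softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p.
    if top_p is not None and top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        mask = np.zeros(probs.shape, dtype=bool)
        mask[order[:cutoff]] = True
        probs = np.where(mask, probs, 0.0)
        probs /= probs.sum()

    return probs


rng = np.random.default_rng(0)
probs = filter_logits([2.0, 1.0, 0.0, -1.0], temp=0.7, top_k=3, top_p=0.95)
token = rng.choice(len(probs), p=probs)
print(probs.round(3), token)
```

Lower values of all three settings make the output more predictable, while higher values allow more variation between runs.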


In [11]:
#@title 4. Play the synthetic voice
from IPython.display import Audio
# play audio
Audio(audio_array, rate=SAMPLE_RATE)


In [12]:
#@title 5. Download the audio "output.wav" to local computer
from scipy.io.wavfile import write as write_wav
# save audio
filepath = '/content/generated_data/output.wav' # change this to your desired output path
write_wav(filepath, SAMPLE_RATE, audio_array)
files.download(filepath)
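
`audio_array` is a floating-point array, so the cell above writes a floating-point WAV file, which some players do not open. If that happens, one option is to convert to 16-bit PCM first. The sketch below does this with NumPy and the standard `wave` module. It assumes a sample rate of 24 kHz to match `SAMPLE_RATE`, uses a generated tone in place of `audio_array`, and the helpers `float_to_int16` and `write_wav_int16` are hypothetical.

```python
import os
import tempfile
import wave

import numpy as np

SAMPLE_RATE = 24_000  # assumed to match bark.generation.SAMPLE_RATE


def float_to_int16(audio):
    """Clip float audio to [-1, 1] and scale it to little-endian 16-bit integers."""
    clipped = np.clip(np.asarray(audio, dtype=np.float32), -1.0, 1.0)
    return (clipped * 32767).astype("<i2")


def write_wav_int16(path, audio, sample_rate=SAMPLE_RATE):
    """Write mono float audio as a 16-bit PCM WAV file."""
    pcm = float_to_int16(audio)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


# Stand-in for audio_array: one second of a 440 Hz tone at half amplitude.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
demo_audio = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
demo_path = os.path.join(tempfile.mkdtemp(), "output.wav")
write_wav_int16(demo_path, demo_audio)
```

In the notebook, calling `write_wav_int16(filepath, audio_array)` in place of `write_wav` would produce a file at the same path that `files.download` can then send to the browser.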
