# Voice Cloning with SpeechT5
This notebook walks through the implementation of a voice cloning algorithm using the SpeechT5 model.

# Importing Libraries and Preparing the Project

This section imports the required libraries and modules, sets up a logger to display runtime information, and creates the directories (`original_samples` and `cloned_samples`) that store the original and cloned audio samples. It also improves reproducibility by enabling deterministic cuDNN operations and fixing the random seed to 42.

In [None]:
# Imports
import logging
import os
import pandas as pd
import torch
from datasets import load_dataset, Dataset
from tqdm import tqdm
from scipy.io.wavfile import write
from typing import Tuple
import numpy as np
from transformers import set_seed
from speechbrain.pretrained.interfaces import EncoderClassifier
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

# Set up determinism
torch.backends.cudnn.deterministic = True
set_seed(42)

# Set up logger
logger = logging.getLogger(__name__)
logging.basicConfig(format='%(levelname)s | %(asctime)s | %(message)s', level=logging.INFO)

# Create required directories
for directory in ('original_samples', 'cloned_samples'):
    if not os.path.isdir(directory):
        os.makedirs(directory, exist_ok=True)
        logger.info(f'Created directory: {directory}')

# Set Up Dataset and Prompts

This section loads the `test.clean` split of the LibriSpeech dataset in streaming mode. It selects the computation device and defines the dataset column names used later in the notebook. It then reads a CSV file of text prompts that maps each audio sample ID to the text to synthesize and to the ID of a reference sample used for comparison.

In [None]:
# Specifies the computation device: GPU ('cuda:0') if available, otherwise CPU.
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

# Defines column names in the dataset:
audio_column = 'audio'        # Contains audio data.
text_column = 'text'          # Contains text prompts.
speaker_column = 'speaker_id' # Contains unique speaker identifiers.
id_column = 'id'              # Contains unique sample identifiers.

# Loads the LibriSpeech dataset in streaming mode, focusing on the 'test.clean' split. The `trust_remote_code` flag enables custom dataset scripts.
dataset = load_dataset('openslr/librispeech_asr', split='test.clean', streaming=True, trust_remote_code=True)

# Reads a CSV file that maps sample IDs to their cloning text and reference IDs for comparison.
prompts = pd.read_csv('../data/text_prompts/ls-test-clean.csv')
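
The loop later in this notebook reads three columns from the prompts table: `id_sample_to_clone`, `text`, and `id_sample_to_compare`. As a minimal sketch of that lookup, the example below builds a tiny in-memory table and retrieves the prompt and reference ID for one sample. The rows, the IDs, and the helper name `lookup_prompt` are hypothetical placeholders, not values from the real CSV.

```python
import pandas as pd

# Hypothetical rows; the real data lives in ../data/text_prompts/ls-test-clean.csv
example_prompts = pd.DataFrame({
    'id_sample_to_clone': ['spk1-chap1-0000', 'spk2-chap1-0000'],
    'text': ['First example prompt.', 'Second example prompt.'],
    'id_sample_to_compare': ['spk1-chap1-0001', 'spk2-chap1-0001'],
})


def lookup_prompt(prompts_table, sample_id):
    """Return (text, id_sample_to_compare) for the given sample ID."""
    rows = prompts_table.loc[prompts_table['id_sample_to_clone'] == sample_id]
    if rows.empty:
        # Fail with a clear message instead of an IndexError on .values[0]
        raise KeyError(f'No prompt found for sample {sample_id}')
    row = rows.iloc[0]
    return row['text'], row['id_sample_to_compare']


text, reference_id = lookup_prompt(example_prompts, 'spk1-chap1-0000')
print(text, reference_id)
```

Checking for an empty match first makes a missing ID easier to diagnose than the bare `.values[0]` indexing would.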

# Initialize Models and Tools

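This section initializes the models and tools needed for voice cloning: a SpeechBrain x-vector speaker encoder that extracts speaker embeddings, the SpeechT5 processor and text-to-speech model, and the HiFi-GAN vocoder that converts the generated spectrogram into a waveform.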

In [None]:
# Speaker classifier for embedding extraction
classifier = EncoderClassifier.from_hparams(
    source='speechbrain/spkrec-xvect-voxceleb',
    run_opts={"device": device},
    savedir=os.path.join('/tmp', 'speechbrain/spkrec-xvect-voxceleb')
)
# Processor for handling inputs to the model
processor = SpeechT5Processor.from_pretrained('microsoft/speecht5_tts')
# Text-to-speech model
model = SpeechT5ForTextToSpeech.from_pretrained('microsoft/speecht5_tts').to(device)
# HiFi-GAN vocoder that converts the generated spectrogram into a waveform
vocoder = SpeechT5HifiGan.from_pretrained('microsoft/speecht5_hifigan').to(device)

# Define Cloning Function

This section defines the `clone_speecht5` function, which synthesizes a text prompt in the voice of a provided audio sample. It extracts normalized speaker embeddings from the audio, tokenizes the text prompt, and generates synthetic audio using the SpeechT5 model and vocoder. The function returns the cloned audio and its sampling rate.

In [None]:
def clone_speecht5(model_input_audio, model_input_text_prompt):
    """Synthesize the text prompt in the voice of the input audio; return (audio, sampling_rate)."""
    # Extract speaker embeddings
    speaker_embeddings = classifier.encode_batch(model_input_audio)
    speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
    speaker_embeddings = speaker_embeddings[0].view(1, -1)

    # Process the text prompt
    inputs = processor(text=model_input_text_prompt, return_tensors='pt').to(device)

    # Generate synthetic audio
    cloned_audio = model.generate_speech(inputs['input_ids'], speaker_embeddings, vocoder=vocoder)
    cloned_audio = cloned_audio.view(-1).cpu().numpy()
    cloned_sampling_rate = 16000  # SpeechT5 with this HiFi-GAN vocoder produces 16 kHz audio
    return cloned_audio, cloned_sampling_rate
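
The embedding steps in the function above can be checked in isolation. The sketch below replaces the speaker classifier with a random tensor, so it illustrates only the normalization and reshaping. The shape `(1, 1, 512)` is an assumption about what `encode_batch` returns for a single utterance with the x-vector model, and the helper name `prepare_speaker_embedding` is hypothetical.

```python
import torch

# Stand-in for classifier.encode_batch output, assumed shape (batch, 1, embedding_dim)
fake_embeddings = torch.randn(1, 1, 512)


def prepare_speaker_embedding(embeddings):
    """L2-normalize along the embedding axis and reshape to (1, embedding_dim)."""
    normalized = torch.nn.functional.normalize(embeddings, dim=2)
    return normalized[0].view(1, -1)


speaker_embedding = prepare_speaker_embedding(fake_embeddings)
print(speaker_embedding.shape)          # (1, 512) for the assumed input shape
print(speaker_embedding.norm().item())  # close to 1.0 after normalization
```

Normalizing to unit length removes the effect of the embedding's magnitude, so only its direction conditions the text-to-speech model.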

# Iterate Over Dataset Samples

This section processes each sample from the dataset by extracting the audio and the associated text prompt for cloning. It retrieves a reference sample for comparison based on the `id_sample_to_compare` field and saves the reference audio in the `original_samples` directory. The `clone_speecht5` function then generates the cloned audio, which is saved in the `cloned_samples` directory. A log entry confirms each successfully cloned sample. The loop stops after five samples to limit runtime during testing.

In [None]:
# Tracks the current loop iteration
current_iteration = 0

for sample in tqdm(dataset):
    # Prepare file name for saving the cloned audio
    filename = f'{sample[speaker_column]}_{sample[id_column]}.wav'
    
    # Get the text prompt associated with the current sample
    model_input_text_prompt = prompts.loc[
        prompts['id_sample_to_clone'] == sample[id_column],
        'text'
    ].values[0]
    
    # Extract the audio data to be cloned and move it to the appropriate device
    model_input_audio = torch.tensor(sample[audio_column]['array']).to(device)
    
    # Get the ID of the sample to compare against for reference audio
    id_sample_to_compare = prompts.loc[
        prompts['id_sample_to_clone'] == sample[id_column],
        'id_sample_to_compare'
    ].values[0]

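    # Retrieve the reference sample by scanning the stream for the comparison ID
    sample_to_compare = None
    for proposed_sample_to_compare in dataset:
        if proposed_sample_to_compare[id_column] == id_sample_to_compare:
            sample_to_compare = proposed_sample_to_compare
            break  # Stop scanning once the reference sample is found
    if sample_to_compare is None:
        raise ValueError(f'Reference sample {id_sample_to_compare} not found in the dataset')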
    
    # Extract the reference audio and its sampling rate
    audio_to_compare = sample_to_compare[audio_column]['array']
    sampling_rate_to_compare = sample_to_compare[audio_column]['sampling_rate']
    
    # Save the reference audio to the 'original_samples' directory
    write(f'original_samples/{filename}', sampling_rate_to_compare, audio_to_compare)

    # Clone the audio using the text prompt and save it to the 'cloned_samples' directory
    cloned_audio, cloned_sampling_rate = clone_speecht5(model_input_audio, model_input_text_prompt)
    write(f'cloned_samples/{filename}', cloned_sampling_rate, cloned_audio)

    # Log that the sample was cloned successfully
    logger.info(f'Sample {filename} cloned successfully')

    # Ends the loop when iteration reaches 5
    current_iteration += 1
    if current_iteration == 5:
        break
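
Scanning the streamed dataset once per sample to find its reference means the stream is re-read on every iteration. One possible alternative, sketched below on toy dictionaries rather than the real LibriSpeech stream, is to collect the needed reference samples in a single pass before cloning. The helper name `collect_references` and the toy records are hypothetical.

```python
def collect_references(samples, wanted_ids, id_column='id'):
    """Scan `samples` once and return {id: sample} for every wanted ID found."""
    remaining = set(wanted_ids)
    found = {}
    for sample in samples:
        sample_id = sample[id_column]
        if sample_id in remaining:
            found[sample_id] = sample
            remaining.discard(sample_id)
            if not remaining:
                break  # All references collected; stop reading the stream
    return found


# Toy records that mimic the structure of a dataset sample
toy_stream = [
    {'id': 'a-0', 'audio': {'array': [0.0, 0.1], 'sampling_rate': 16000}},
    {'id': 'a-1', 'audio': {'array': [0.2, 0.3], 'sampling_rate': 16000}},
    {'id': 'b-0', 'audio': {'array': [0.4, 0.5], 'sampling_rate': 16000}},
]

references = collect_references(iter(toy_stream), ['a-1', 'b-0'])
print(sorted(references))
```

The trade-off is memory: the collected reference audio stays in RAM until it is written out, which is acceptable for a handful of samples but may not be for the full split.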