<a href="https://colab.research.google.com/github/dxvsh/LearningPytorch/blob/main/Week8/DLP_GA8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Instructions**

**Step 1: Setup**

- **Install Required Libraries**: Ensure that transformers, datasets, soundfile, speechbrain, and accelerate are installed. This can typically be done using a pip install command in the terminal or notebook.
- **Disable Weights & Biases Logging**: To avoid automatic logging during this assignment, set up an environment variable to disable logging.

**Step 2: Load and Prepare Dataset**

- **Load the Dataset**: Use the “VoxPopuli” dataset available on Hugging Face, specifically the “it” (Italian) subset, and load only the training split.
- **Create a Subset of the Dataset**: Shuffle the dataset and take a random quarter (with seed=42) of the entries. This smaller subset will reduce processing time, making it easier to handle on limited resources.
- **Convert Audio Sampling Rate**: Convert the audio samples in the dataset to a 16 kHz sampling rate, as this is compatible with the model you’ll be using.
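
As a rough illustration of the shuffle-and-select step, the sketch below draws a reproducible random quarter of row indices using only the standard library. The helper name `take_random_fraction` is hypothetical, and the permutation it produces is not the one `datasets` would generate for the same seed; only the subset size and the reproducibility carry over.

```python
import random

def take_random_fraction(num_rows, denominator=4, seed=42):
    """Return row indices for a reproducible random 1/denominator subset."""
    indices = list(range(num_rows))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    subset_size = num_rows // denominator
    return indices[:subset_size]

# 22576 is the size of the Italian train split reported later in this notebook.
subset_indices = take_random_fraction(22576)
print(len(subset_indices))
```

Calling the helper twice with the same seed returns the same indices, which is the property the fixed seed is meant to provide.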

**Step 3: Load the Model and Tokenizer**

- **Initialize the TTS Model and Tokenizer**: Use the SpeechT5 model and tokenizer, pre-trained on English, from the Hugging Face library.
- **Tokenize and Generate Audio Output**: Create a test sentence and tokenize it. Use the model to generate an audio waveform from this text to confirm the setup is working.

**Step 4: Preprocess the Dataset Text**

- **Extract Vocabulary from Dataset**: Extract unique characters from the “normalized_text” column in the dataset to build a vocabulary of characters used in the data.
- **Identify Unsupported Characters**: Compare the characters in the dataset vocabulary with the tokenizer’s vocabulary. Determine if there are any characters in the dataset that are not supported by the tokenizer.
- **Replace Unsupported Characters**: Define replacements for unsupported characters (such as accented letters) with simpler versions. Apply these replacements to clean up the text data.
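
The three steps above can be sketched with plain Python sets. The texts and the tokenizer vocabulary below are toy stand-ins, not the real data, and the assumption that the tokenizer stores the space under the symbol `▁` is mine; that would explain why a plain space can show up as "unsupported" without needing a replacement.

```python
# Toy stand-ins; the real values come from the dataset and the tokenizer.
texts = ["la prova è che la proposta verrà", "più o meno così"]
tokenizer_vocab = set("abcdefghijklmnopqrstuvwxyz'.") | {"▁"}  # assumed vocabulary

# 1. Build the character vocabulary of the dataset.
dataset_vocab = set(" ".join(texts))

# 2. Characters the tokenizer does not know.
unsupported = dataset_vocab - tokenizer_vocab

# 3. Map each accented letter to an unaccented fallback and clean the text.
replacements = {"à": "a", "è": "e", "ì": "i", "ï": "i", "ò": "o", "ó": "o", "ù": "u"}
table = str.maketrans(replacements)
cleaned = [text.translate(table) for text in texts]

# After cleanup only the space is left outside the toy tokenizer vocabulary.
remaining = set(" ".join(cleaned)) - tokenizer_vocab
print(sorted(unsupported), sorted(remaining))
```

Re-running the set difference after cleanup is a cheap way to confirm that no accented character was missed.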

**Step 5: Speaker Analysis**

- **Analyze Speaker Distribution**: Count the number of examples per speaker in the dataset to understand the distribution of samples across speakers.
- **Visualize Speaker Data**: Plot a histogram of the number of examples per speaker. This will help you understand which speakers are well-represented and which have fewer examples.
- **Filter Speaker Data**: Filter the dataset to include only those speakers who have between 100 and 400 examples to ensure a balanced distribution.
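
A minimal sketch of the count-then-filter logic, using `collections.Counter` on hypothetical speaker IDs. The IDs and counts are invented for illustration; the bounds of 100 and 400 are the ones given above.

```python
from collections import Counter

# Hypothetical speaker IDs, one entry per dataset example.
speaker_ids = ["a"] * 120 + ["b"] * 450 + ["c"] * 30 + ["d"] * 100

# Number of examples per speaker.
counts = Counter(speaker_ids)

def keep_speaker(speaker_id, low=100, high=400):
    """Keep speakers whose example count lies in [low, high], bounds included."""
    return low <= counts[speaker_id] <= high

kept = [s for s in speaker_ids if keep_speaker(s)]
print(sorted(set(kept)), len(kept))
```

Both bounds are inclusive here, so a speaker with exactly 100 examples is kept.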

**Step 6: Create Speaker Embeddings**

- **Load the Speaker Model**: Use a pre-trained x-vector EncoderClassifier model from SpeechBrain to generate speaker embeddings. Ensure that the model is compatible with your system (CUDA if available).
- **Generate Speaker Embeddings**: Define a function that generates and normalizes speaker embeddings for each audio sample in the dataset. These embeddings capture the unique characteristics of each speaker’s voice.
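
To make "normalizes" concrete: the sketch below applies L2 normalization to a single vector in plain Python, dividing by the Euclidean norm so the result has unit length. It is a stand-in for a tensor library's normalize function, not a replacement for it, and the `eps` guard against division by zero is an assumption about a sensible default.

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against a zero norm)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

embedding = [3.0, 4.0]            # toy 2-dimensional "embedding"
unit = l2_normalize(embedding)
print(unit, math.sqrt(sum(x * x for x in unit)))
```

After normalization, embeddings differ only in direction, which keeps loud and quiet recordings of the same voice on a comparable scale.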

**Step 7: Process the Dataset for Model Input**

- **Define a Preprocessing Function**: Create a function that prepares each dataset example. The function should:
    - Tokenize the text.
    - Convert audio to log-mel spectrograms.
    - Add the speaker embeddings.
- **Apply the Preprocessing Function**: Apply the preprocessing function to each example in the dataset. This may take some time, depending on the dataset size.
- **Filter Out Long Texts**: To ensure efficient training, filter out examples where the tokenized text length exceeds 200 tokens.
- **Split the Dataset**: Split the dataset into training and test sets with a 90:10 ratio.

**Step 8: Define a Custom Data Collator**

- **Define the Data Collator Class**: Create a custom data collator that:
    - Pads input sequences to the same length.
    - Pads spectrogram labels with a special value to ignore during loss computation.
    - Adjusts labels to fit the model’s reduction factor.
    - Adds speaker embeddings to each batch.
- **Instantiate the Data Collator**: Initialize the data collator with the processor object used for padding inputs and labels.
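
The idea of padding labels with a value that the loss ignores can be sketched without any tensor library. Both helpers below are hypothetical, and the mean absolute error is used purely for illustration; it is not claimed to be the loss the model actually uses.

```python
PAD_VALUE = -100.0  # sentinel marking padded label entries

def pad_spectrograms(spectrograms, pad_value=PAD_VALUE):
    """Pad variable-length lists of frames to the longest one in the batch."""
    max_len = max(len(frames) for frames in spectrograms)
    num_bins = len(spectrograms[0][0])
    return [
        frames + [[pad_value] * num_bins for _ in range(max_len - len(frames))]
        for frames in spectrograms
    ]

def masked_mean_abs_error(predicted, target, pad_value=PAD_VALUE):
    """Average |predicted - target| over entries that are not padding."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(predicted, target):
        for p, t in zip(pred_frame, target_frame):
            if t != pad_value:
                total += abs(p - t)
                count += 1
    return total / count

batch = pad_spectrograms([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # 3 frames, 2 mel bins
    [[1.0, 1.0]],                          # 1 frame, 2 mel bins
])
zeros = [[0.0, 0.0] for _ in range(3)]
print(batch[1])
print(masked_mean_abs_error(zeros, batch[1]))
```

Only the single real frame of the second example contributes to the error; the two padded frames are skipped.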

**Step 9: Configure Training Parameters**

- **Set Training Arguments**: Define training arguments for Seq2SeqTrainingArguments, including:
    - Batch size, learning rate, and warmup steps.
    - Enable gradient checkpointing to save memory.
    - Configure mixed precision and evaluation steps.
    - Enable **load_best_model_at_end** for best model tracking.

    ```
    training_args = Seq2SeqTrainingArguments(
        output_dir = "speecht5_finetuned_voxpopuli_it",
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 4,
        learning_rate = 1e-4,
        warmup_steps = 200,
        max_steps = 1000,
        gradient_checkpointing = True,
        fp16 = True,
        evaluation_strategy = "steps",
        per_device_eval_batch_size = 4,
        save_steps = 500,
        eval_steps = 500,
        logging_steps = 50,
        load_best_model_at_end = True,
        greater_is_better = False,
        label_names = ["labels"],
        push_to_hub = False
    )
    ```
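
To see how `learning_rate`, `warmup_steps`, and `max_steps` interact, the sketch below computes the learning rate at a given step, assuming a linear warmup followed by a linear decay to zero. No scheduler is named in the arguments above, so this shape is an assumption, and `linear_schedule_lr` is a hypothetical helper.

```python
def linear_schedule_lr(step, base_lr=1e-4, warmup_steps=200, max_steps=1000):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warmup down to 0 at max_steps.
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)

for step in (0, 100, 200, 600, 1000):
    print(step, linear_schedule_lr(step))
```

Under this assumption the peak rate of 1e-4 is reached at step 200 and has already halved again by step 600.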

**Step 10: Train the Model**

- **Initialize the Trainer**: Use the Seq2SeqTrainer class to set up the training loop with:
    - The model, training arguments, dataset, data collator, and tokenizer.
- **Start Training**: Train the model. If you encounter a CUDA “out-of-memory” error, gradually reduce the batch size and increase the gradient accumulation steps to compensate.
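
The compensation mentioned above keeps the effective batch size constant: each time the per-device batch size is halved, the number of accumulation steps is doubled. A small sketch of that arithmetic, with hypothetical helper names and a single device assumed:

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

def rebalance(per_device_batch_size, gradient_accumulation_steps):
    """Halve the batch size and double accumulation, preserving their product."""
    if per_device_batch_size % 2 != 0:
        raise ValueError("batch size must be even to halve it exactly")
    return per_device_batch_size // 2, gradient_accumulation_steps * 2

settings = (8, 4)  # values from the training arguments in Step 9
for _ in range(3):
    print(settings, effective_batch_size(*settings))
    settings = rebalance(*settings)
```

Each printed setting has the same effective batch size of 32, so the optimization behaves similarly while peak memory per step drops.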

**Step 11: Perform Inference**

- **Load the Fine-tuned Model**: After training, load the fine-tuned model.
- **Prepare Inference Input**: Select a speaker embedding and create a test sentence in Italian.
- **Generate Speech Output**: Use a vocoder model (such as SpeechT5HifiGan) to convert the model’s output to audio. Listen to the generated audio to evaluate the results.

**Step 12: Evaluation Tips**

- **Consider Embedding Quality**: Remember that the model may produce better results if trained with speaker embeddings more closely aligned with the target language.
- **Experiment with Configuration**: To improve quality, consider adjusting the model’s configuration (e.g., setting the reduction factor to 1) and training for a longer duration.

In [2]:
!pip install transformers datasets soundfile speechbrain==0.5.16 accelerate > /dev/null

In [3]:
# Disable wandb logging
import os
os.environ['WANDB_DISABLED'] = 'true'

In [None]:
from datasets import load_dataset

# Load the train dataset
dataset = load_dataset("facebook/voxpopuli", "it", split="train")

# Take a random subset of 1/4th of the dataset
subset_size = len(dataset) // 4
subset = dataset.shuffle(seed=42).select(range(subset_size))

# Verify the subset size
print(f"Original dataset size: {len(dataset)}")
print(f"Subset size: {len(subset)}")

**Q1.** What is the original size of the train split of the "it" (Italian) configuration of "facebook/voxpopuli"?

In [5]:
dataset

Dataset({
    features: ['audio_id', 'language', 'audio', 'raw_text', 'normalized_text', 'gender', 'speaker_id', 'is_gold_transcript', 'accent'],
    num_rows: 22576
})

The size of the Italian train split is: **22576**

**Q2.** What is the sampling rate of the original audio?

In [6]:
dataset.features['audio']

Audio(sampling_rate=16000, mono=True, decode=True, id=None)

The sampling rate of the original audio is **16000 Hz** (16 kHz).

**Q3.** How many unique characters are in the dataset?

Use the subset as the working dataset from here on, and look at a random sample from it:

In [7]:
dataset = subset

In [8]:
dataset[100]

{'audio_id': '20130523-0900-PLENARY-4-it_20130523-09:02:07_7',
 'language': 5,
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/2626274a4bca92b7aade3c49f47a5dfff1670fd5f60ba758cb4ea48368f35184/train_part_0/20130523-0900-PLENARY-4-it_20130523-09:02:07_7.wav',
  'array': array([ 0.24359131,  0.19784546,  0.19006348, ..., -0.01724243,
          0.00827026,  0.01028442]),
  'sampling_rate': 16000},
 'raw_text': "la prova è che la proposta verrà dall'Italia, dalla Francia, anche dai livelli nazionali, quindi no a strumentalizzazioni.",
 'normalized_text': "la prova è che la proposta verrà dall'italia dalla francia anche dai livelli nazionali quindi no a strumentalizzazioni.",
 'gender': 'female',
 'speaker_id': '28340',
 'is_gold_transcript': True,
 'accent': 'None'}

In [11]:
from transformers import SpeechT5Processor

checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)

tokenizer = processor.tokenizer
print(tokenizer)

SpeechT5Tokenizer(name_or_path='microsoft/speecht5_tts', vocab_size=79, model_max_length=600, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	79: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True, special=True),
	80: AddedToken("<ctc_blank>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
}


In [12]:
def extract_all_chars(batch):
    all_text = " ".join(batch['normalized_text'])
    vocab = list(set(all_text))
    return {'vocab': [vocab], 'all_text': [all_text]}

vocabs = dataset.map(
    extract_all_chars,
    batched=True,
    batch_size=-1,
    keep_in_memory=True,
    remove_columns=dataset.column_names,
)

dataset_vocab = set(vocabs['vocab'][0])
tokenizer_vocab = {k for k, _ in tokenizer.get_vocab().items()}

Map:   0%|          | 0/5644 [00:00<?, ? examples/s]

In [14]:
len(dataset_vocab)

40

So, there are **40** unique characters in the smaller dataset (subset).

In [22]:
dataset_vocab - tokenizer_vocab

{' ', 'à', 'è', 'ì', 'ï', 'ò', 'ó', 'ù'}

In [23]:
# Cover every accented character reported as unsupported above
replacements = [
    ('à', 'a'),
    ('è', 'e'),
    ('ì', 'i'),
    ('ï', 'i'),
    ('ò', 'o'),
    ('ó', 'o'),
    ('ù', 'u'),
]


def cleanup_text(inputs):
    for src, dst in replacements:
        inputs['normalized_text'] = inputs['normalized_text'].replace(src, dst)
    return inputs

dataset = dataset.map(cleanup_text)

Map:   0%|          | 0/5644 [00:00<?, ? examples/s]

**Q4.** How many tokens are in the "microsoft/speechT5" tokenizer?

In [16]:
len(tokenizer_vocab)

81

**Q5.** Are all the unique characters in the Italian train split present in the token list of microsoft/speechT5? (true/false)

In [17]:
dataset_vocab.issubset(tokenizer_vocab)

False

**Q6**. What is the need for normalized text in TTS training?
1. It removes variations in text representation, making it easier for the model to learn consistent pronunciation and intonation.
2. It makes the text appear more formal, which increases the model’s efficiency.
3. It simplifies the dataset by removing unnecessary words, leading to a smaller model size.
4. It allows the model to skip processing the text altogether, speeding up training.


Option 1.

**Q7.** How many speakers have less than or equal to 100 samples?

In [18]:
from collections import defaultdict

speaker_counts = defaultdict(int)

for speaker_id in dataset['speaker_id']:
    speaker_counts[speaker_id] += 1

In [20]:
num_speakers = 0
for speaker_id, count in speaker_counts.items():
    if count <= 100:
        num_speakers += 1

In [21]:
num_speakers

119

**Q8.** What is the length of the dataset after removing speakers with fewer than 100 or more than 400 samples?

In [24]:
def select_speaker(speaker_id):
    return 100 <= speaker_counts[speaker_id] <= 400

dataset = dataset.filter(select_speaker, input_columns=['speaker_id'])

Filter:   0%|          | 0/5644 [00:00<?, ? examples/s]

In [25]:
dataset

Dataset({
    features: ['audio_id', 'language', 'audio', 'raw_text', 'normalized_text', 'gender', 'speaker_id', 'is_gold_transcript', 'accent'],
    num_rows: 2570
})

Length of the dataset after removing speakers with fewer than 100 or more than 400 samples: **2570**

**Q9.** What are the target labels for a TTS task?
1. Frames of the spectrogram
2. Data points of the 16kHz audio
3. Characters of the language


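- [x] 1
- [ ] 2
- [ ] 3

The model is trained to predict log-mel spectrogram frames (see Steps 7 and 8); a vocoder converts them to a waveform only at inference time.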


**Q10.** After removing every speaker with fewer than 100 or more than 400 samples, and every text with more than 200 tokens, how many samples are left?

In [27]:
dataset_token_lengths = dataset.map(
    lambda x: {'length' : len(tokenizer(x['normalized_text']).input_ids)},
)

dataset_token_lengths.filter(lambda x: x['length'] <= 200)

Map:   0%|          | 0/2570 [00:00<?, ? examples/s]

Filter:   0%|          | 0/2570 [00:00<?, ? examples/s]

Dataset({
    features: ['audio_id', 'language', 'audio', 'raw_text', 'normalized_text', 'gender', 'speaker_id', 'is_gold_transcript', 'accent', 'length'],
    num_rows: 1418
})

There are 1418 samples left.

**Q11.** What is the training loss after 1000 steps?

In [28]:
import os
import torch
from speechbrain.pretrained import EncoderClassifier

spk_model_name = "speechbrain/spkrec-xvect-voxceleb"

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# run_opts places the speaker model on the selected device
speaker_model = EncoderClassifier.from_hparams(
    source=spk_model_name,
    run_opts={'device': device},
    savedir=os.path.join('/tmp', spk_model_name)
)

def create_speaker_embedding(waveform):
    with torch.no_grad():
        speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
        speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
    return speaker_embeddings


In [29]:
def prepare_dataset(example):
    audio = example['audio']

    example = processor(
        text=example['normalized_text'],
        audio_target=audio['array'],
        sampling_rate=audio['sampling_rate'],
        return_attention_mask=False,
    )

    # strip off the batch dimension
    example['labels'] = example['labels'][0]

    # use speechbrain to obtain x-vector
    example['speaker_embeddings'] = create_speaker_embedding(audio['array'])

    return example

In [30]:
processed_example = prepare_dataset(dataset[0])
list(processed_example.keys())

['input_ids', 'labels', 'speaker_embeddings']

In [31]:
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)

Map:   0%|          | 0/2570 [00:00<?, ? examples/s]

In [32]:
def is_not_too_long(input_ids):
    input_length = len(input_ids)
    return input_length < 200

dataset = dataset.filter(is_not_too_long, input_columns=["input_ids"])
len(dataset)

Filter:   0%|          | 0/2570 [00:00<?, ? examples/s]

1413
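
This cell keeps 1413 examples with a strict `< 200` test, while the Q10 cell kept 1418 with `<= 200`. Assuming both cells tokenize the same text in the same way, the gap of 5 would correspond to examples with exactly 200 tokens. The sketch below shows the boundary effect on hypothetical token lengths.

```python
# Hypothetical token lengths; two of them sit exactly on the boundary.
token_lengths = [150, 199, 200, 200, 201, 350]

kept_inclusive = [n for n in token_lengths if n <= 200]  # "does not exceed 200"
kept_strict = [n for n in token_lengths if n < 200]      # "shorter than 200"

# The difference is the number of examples with exactly 200 tokens.
exactly_200 = len(kept_inclusive) - len(kept_strict)
print(len(kept_inclusive), len(kept_strict), exactly_200)
```

Which comparison is right depends on how "exceeds 200 tokens" is read; the inclusive form matches the wording of Step 7.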

In [33]:
dataset = dataset.train_test_split(test_size=0.1)

In [34]:

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class TTSDataCollatorWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        input_ids = [{"input_ids": feature["input_ids"]} for feature in features]
        label_features = [{"input_values": feature["labels"]} for feature in features]
        speaker_features = [feature["speaker_embeddings"] for feature in features]

        # collate the inputs and targets into a batch
        batch = self.processor.pad(
            input_ids=input_ids, labels=label_features, return_tensors="pt"
        )

        # replace padding with -100 to ignore loss correctly
        batch["labels"] = batch["labels"].masked_fill(
            batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100
        )

        # not used during fine-tuning
        del batch["decoder_attention_mask"]

        # round down target lengths to multiple of reduction factor
        if model.config.reduction_factor > 1:
            target_lengths = torch.tensor(
                [len(feature["input_values"]) for feature in label_features]
            )
            target_lengths = target_lengths.new(
                [
                    length - length % model.config.reduction_factor
                    for length in target_lengths
                ]
            )
            max_length = max(target_lengths)
            batch["labels"] = batch["labels"][:, :max_length]

        # also add in the speaker embeddings
        batch["speaker_embeddings"] = torch.tensor(speaker_features)

        return batch
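
The rounding step in the collator can be checked in isolation. The sketch below uses plain integers instead of tensors and assumes a reduction factor of 2; the actual value comes from `model.config.reduction_factor`.

```python
def round_down_to_multiple(length, reduction_factor):
    """Largest multiple of reduction_factor that does not exceed length."""
    return length - length % reduction_factor

# Hypothetical spectrogram lengths (in frames) for one batch.
target_lengths = [101, 64, 87]
reduction_factor = 2  # assumed value for illustration

rounded = [round_down_to_multiple(n, reduction_factor) for n in target_lengths]
max_length = max(rounded)  # labels are truncated to this many frames
print(rounded, max_length)
```

Truncating the padded labels to `max_length` drops at most `reduction_factor - 1` trailing frames from the longest example.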

In [35]:
data_collator = TTSDataCollatorWithPadding(processor=processor)

In [36]:
from transformers import SpeechT5ForTextToSpeech

model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)

In [37]:
from functools import partial

# disable cache during training since it's incompatible with gradient checkpointing
model.config.use_cache = False

# re-enable the cache for generation (inference only)
model.generate = partial(model.generate, use_cache=True)

In [38]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir = "speecht5_finetuned_voxpopuli_it",
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 4,
    learning_rate = 1e-4,
    warmup_steps = 200,
    max_steps = 1000,
    gradient_checkpointing = True,
    fp16 = True,
    evaluation_strategy = "steps",
    per_device_eval_batch_size = 4,
    save_steps = 500,
    eval_steps = 500,
    logging_steps = 50,
    load_best_model_at_end = True,
    greater_is_better = False,
    label_names = ["labels"],
    push_to_hub = False
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [39]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    tokenizer=processor,
)

max_steps is given, it will override any value given in num_train_epochs


In [40]:
trainer.train()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 0.4988        | 0.48401         |
| 1000 | 0.4661        | 0.482094        |




TrainOutput(global_step=1000, training_loss=0.5201288890838623, metrics={'train_runtime': 2036.1059, 'train_samples_per_second': 15.716, 'train_steps_per_second': 0.491, 'total_flos': 5035925941895376.0, 'train_loss': 0.5201288890838623, 'epoch': 25.157232704402517})
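
The reported `epoch` of about 25.16 can be reconstructed from the numbers in this notebook. The sketch below is a back-of-the-envelope calculation, not the trainer's internal bookkeeping. It assumes that the 10% test split is rounded up to a whole example and that the last, smaller batch of each epoch is kept.

```python
import math

num_examples = 1413              # dataset size after the length filter
test_fraction = 0.1              # from train_test_split(test_size=0.1)
per_device_batch_size = 8
gradient_accumulation_steps = 4
max_steps = 1000

num_test = math.ceil(num_examples * test_fraction)  # assumed rounding
num_train = num_examples - num_test

# Batches per pass over the training set, counting the final partial batch.
batches_per_epoch = math.ceil(num_train / per_device_batch_size)

# One optimizer update consumes gradient_accumulation_steps batches.
updates_per_epoch = batches_per_epoch / gradient_accumulation_steps
epochs = max_steps / updates_per_epoch
print(num_train, batches_per_epoch, round(epochs, 4))
```

Under these assumptions the estimate agrees with the logged value to four decimal places, so 1000 steps amount to roughly 25 passes over the training examples.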