# Introduction to Language and Speech Technology - ReMA (RU)
*Seminar 10*

Last update: 2024/11/18

Aditya Kamlesh Parikh - @aditya.parikh@ru.nl


In this tutorial, we will learn how to fine-tune [Wav2Vec2-BERT](https://huggingface.co/facebook/w2v-bert-2.0), a self-supervised speech model released by Meta and a recent member of the Wav2Vec2 family, for automatic speech recognition (ASR). We will use a small dataset from Hugging Face to fine-tune the model to convert speech to text. By the end, you will understand how to load, prepare, fine-tune, and evaluate a model for speech recognition.

The first step is to install the required libraries:

In [None]:
%%capture
!pip install transformers datasets torchaudio evaluate jiwer accelerate
!apt install git-lfs ##to upload your model on huggingface

Would you like to upload this model to Hugging Face 🤗?
Then log in to the Hugging Face Hub first.

In [None]:
from huggingface_hub import notebook_login
notebook_login()


Give a repository name:

In [1]:
repo_name = "wav2vec2-bert-speechocean-762"

# 1. Load Dataset

In this tutorial we are using a dataset from HuggingFace 🤗 "***speechocean762: A non-native English corpus for pronunciation scoring task***"  

We will load the dataset from 🤗 and use it for the finetuning.

In [47]:
from datasets import load_dataset
speechocean = load_dataset("mispeech/speechocean762")
print(speechocean)

DatasetDict({
    train: Dataset({
        features: ['accuracy', 'completeness', 'fluency', 'prosodic', 'text', 'total', 'words', 'speaker', 'gender', 'age', 'audio'],
        num_rows: 2500
    })
    test: Dataset({
        features: ['accuracy', 'completeness', 'fluency', 'prosodic', 'text', 'total', 'words', 'speaker', 'gender', 'age', 'audio'],
        num_rows: 2500
    })
})


Now we will look more into the dataset.

In [48]:
speechocean

DatasetDict({
    train: Dataset({
        features: ['accuracy', 'completeness', 'fluency', 'prosodic', 'text', 'total', 'words', 'speaker', 'gender', 'age', 'audio'],
        num_rows: 2500
    })
    test: Dataset({
        features: ['accuracy', 'completeness', 'fluency', 'prosodic', 'text', 'total', 'words', 'speaker', 'gender', 'age', 'audio'],
        num_rows: 2500
    })
})

In [49]:
speechocean['train']

Dataset({
    features: ['accuracy', 'completeness', 'fluency', 'prosodic', 'text', 'total', 'words', 'speaker', 'gender', 'age', 'audio'],
    num_rows: 2500
})

A Dataset contains columns of data, and each column can be a different type of data. The index, or axis label, is used to access examples from the dataset. For example, indexing by the row returns a dictionary of an example from the dataset:

In [50]:
speechocean["train"][15]

{'accuracy': 8,
 'completeness': 10.0,
 'fluency': 9,
 'prosodic': 9,
 'text': 'DORA IS NOT A CLEANER',
 'total': 8,
 'words': [{'accuracy': 10,
   'phones': ['D', 'AO1', 'R', 'AH0'],
   'phones-accuracy': [2.0, 1.8, 2.0, 2.0],
   'stress': 10,
   'text': 'DORA',
   'total': 10,
   'mispronunciations': []},
  {'accuracy': 10,
   'phones': ['IH0', 'Z'],
   'phones-accuracy': [2.0, 2.0],
   'stress': 10,
   'text': 'IS',
   'total': 10,
   'mispronunciations': []},
  {'accuracy': 10,
   'phones': ['N', 'AA0', 'T'],
   'phones-accuracy': [2.0, 1.6, 2.0],
   'stress': 10,
   'text': 'NOT',
   'total': 10,
   'mispronunciations': []},
  {'accuracy': 10,
   'phones': ['AH0'],
   'phones-accuracy': [2.0],
   'stress': 10,
   'text': 'A',
   'total': 10,
   'mispronunciations': []},
  {'accuracy': 10,
   'phones': ['K', 'L', 'IY1', 'N', 'ER0'],
   'phones-accuracy': [2.0, 2.0, 2.0, 2.0, 2.0],
   'stress': 10,
   'text': 'CLEANER',
   'total': 10,
   'mispronunciations': []}],
 'speaker': '0001

It is equally important that you understand the 🤗 `datasets` library and how to use it when you have your own dataset in `tsv`, `csv`, `json`, or `arrow` format. You can then convert your data into a 🤗 `DatasetDict` and work with it quickly and efficiently.
I recommend going through this page: https://huggingface.co/docs/datasets/en/load_hub


Sometimes you need to upsample or downsample the audio. In this dataset the `sampling_rate` is 16000, which is what the model expects, so nothing needs to change. Other datasets (Common Voice, for example) can have a different sampling rate. In that case, use the `cast_column()` function and set the `sampling_rate` parameter of the `Audio` feature to resample the audio signal. [This guide](https://huggingface.co/docs/datasets/en/audio_process) is also very useful.

One more thing: sometimes you need to prepare your data to make it more usable. 🤗 Datasets lets you apply arbitrary changes to every example with the `map()` function.

Can you add another column to the dataset, named `phonetic_transcription`, by joining the phonemes of all the words with a single space between them? Take it as a task. (Optional!)

For example:


```
Orthographic_transcription: THEN HE WENT TO THEME PARK

Phonetic_transcription: DH EH0 N HH IY0 W EH0 N T T UW0 TH IY0 M P AA0 R K
```
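One possible way to approach this optional task is sketched below. It assumes that every example has the `words` structure shown earlier (a list of dicts, each with a `phones` list). The function name `add_phonetic_transcription` and the trimmed-down `example` are illustrative only.

```python
def add_phonetic_transcription(example: dict) -> dict:
    """Join the phones of every word into one space-separated string."""
    phones = [phone for word in example["words"] for phone in word["phones"]]
    return {"phonetic_transcription": " ".join(phones)}


# A trimmed-down example in the same shape as a speechocean762 row
example = {
    "text": "DORA IS NOT",
    "words": [
        {"text": "DORA", "phones": ["D", "AO1", "R", "AH0"]},
        {"text": "IS", "phones": ["IH0", "Z"]},
        {"text": "NOT", "phones": ["N", "AA0", "T"]},
    ],
}
print(add_phonetic_transcription(example)["phonetic_transcription"])
# D AO1 R AH0 IH0 Z N AA0 T
```

On the real dataset, the function could be passed to `map()`, which merges the returned dict into each example as a new column.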



# 2. Prepare Data

Here you will perform some simple steps for data preparation.

First, we will remove the columns that are not useful for fine-tuning.

**Task 1:**

Remove all columns from the dataset except `text` and `audio`, for example with the `remove_columns` function (or with `select_columns`, which keeps only the columns you name).

##

Once you are done with this, we will clean the text by removing punctuation marks and foreign or special characters.

**Task 2:**

Write a function that removes all special characters from the text. Take the language into account: language-specific characters that carry meaning should stay.

Hint: You can use regular expressions. You have already written similar functions in previous tutorials.  

##

Finally, we will create a vocabulary. In simple terms, the vocabulary is the set of all distinct characters present in your dataset. For example, if your data is in English, the 26 letters of the English alphabet can be your vocabulary.

**Task 3:**

Write a function to extract all the unique characters present in your dataset.

##

Once you are done with this, we will create a JSON-formatted vocab file. It has a key-value structure in which each character (key) is assigned a numerical value, for example `A:1, B:2, C:3`, and so on.
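If you get stuck on Task 3, the sketch below shows one way to collect the unique characters and number them. It works on a plain list of strings rather than on the 🤗 dataset. The names `build_vocab` and `toy_texts` are illustrative, and sorting the characters is a choice made here for reproducibility, so the ids will not match the `vocab.json` shown later.

```python
import json


def build_vocab(texts):
    """Map every distinct character in `texts` to an integer id."""
    chars = set()
    for text in texts:
        chars.update(text)
    # Sorting makes the ids reproducible from run to run
    return {char: idx for idx, char in enumerate(sorted(chars))}


toy_texts = ["DORA IS NOT A CLEANER", "TOM LIKES THE OLD SWEATER"]
toy_vocab = build_vocab(toy_texts)
print(json.dumps(toy_vocab))
```

With the real dataset you would pass the `text` column of both splits, so that characters appearing only in the test split also receive an id.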

**Question:**

For English, apart from the 26 letters, which punctuation marks need to stay in the vocabulary?

Should we also include a space `" "` in the vocabulary? Why?


##


If you really want to understand the fine-tuning of pretrained models like wav2vec 2.0 or HuBERT, it is important that you learn more about the Connectionist Temporal Classification (CTC) framework. CTC is used for sequence tasks in which the alignment between input and output is not known in advance. Some great resources for learning CTC are here: ([1](https://distill.pub/2017/ctc/)), ([2](https://medium.com/@kushagrabh13/modeling-sequences-with-ctc-part-1-91b14a0405b3))
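To give a first intuition before reading those resources, here is a minimal sketch of the greedy CTC decoding rule: merge consecutive repeated labels, then drop the blank token. The frame labels are made up for illustration, and the blank is written as `[PAD]` only because that is the token this tutorial uses for it later.

```python
def ctc_greedy_collapse(frame_labels, blank="[PAD]"):
    """Collapse repeated labels, then drop blanks (CTC greedy decoding)."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it starts a new run and is not the blank
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)


# One label per audio frame; the blank between the two L's keeps them separate
frames = ["H", "H", "[PAD]", "E", "L", "L", "[PAD]", "L", "O", "O"]
print(ctc_greedy_collapse(frames))  # HELLO
```

Without the blank between the two runs of `L`, the decoder would merge them into a single `L`, which is why CTC needs a blank token at all.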



I will give you some time for these three tasks. If you cannot finish them in that time, please complete them at home. I will also upload a `vocab.json` file with this tutorial so that we can continue with that.



In [51]:
# Your code here for Task 1
speechocean = speechocean.select_columns(["text", "audio"])

In [52]:
speechocean

DatasetDict({
    train: Dataset({
        features: ['text', 'audio'],
        num_rows: 2500
    })
    test: Dataset({
        features: ['text', 'audio'],
        num_rows: 2500
    })
})

In [53]:
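# Your code here for Task 2
# Remove special characters from the train/test transcripts
import re

# Keep word characters, whitespace, apostrophes (as in DON'T) and hyphens.
# The apostrophe is kept because it is part of the vocabulary used below.
SPECIAL_CHARS = re.compile(r"[^\w\s'-]+")

def remove_special_characters(example: dict) -> dict:
    # `map` passes one example (a dict) and merges the returned dict back into it
    return {"text": SPECIAL_CHARS.sub("", example["text"])}

speechocean = speechocean.map(remove_special_characters)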

Map: 100%|██████████| 2500/2500 [00:05<00:00, 440.43 examples/s]
Map: 100%|██████████| 2500/2500 [00:04<00:00, 596.20 examples/s]


In [54]:
speechocean["train"][15]

{'text': 'DORA IS NOT A CLEANER',
 'audio': {'path': '000010140.wav',
  'array': array([ 0.00021362, -0.0005188 , -0.00186157, ...,  0.00164795,
          0.00048828, -0.00079346]),
  'sampling_rate': 16000}}

In [None]:
# Your code here for Task 3
# Your output should look like this. For each unique character there should be a number.



{'D': 0,
 'B': 1,
 'L': 2,
 'X': 3,
 'Y': 4,
 'K': 5,
 'I': 6,
 'J': 7,
 'P': 8,
 'F': 9,
 'W': 10,
 'V': 11,
 'A': 12,
 ' ': 13,
 'R': 14,
 'Q': 15,
 'E': 16,
 'M': 17,
 "'": 18,
 'O': 19,
 'S': 20,
 'G': 21,
 'H': 22,
 'U': 23,
 'T': 24,
 'Z': 25,
 'N': 26,
 'C': 27}

In [29]:
# If you could not complete all three tasks, please finish them later. For now, you can load the provided vocab.json file directly.

import json
with open('../data/tutorial_10/vocab.json', 'r', encoding='utf-8') as file:
    vocab = json.load(file)

vocab

{'H': 0,
 'Q': 1,
 'F': 2,
 'Z': 3,
 'X': 4,
 'K': 5,
 'Y': 6,
 'R': 7,
 'N': 8,
 'M': 9,
 'A': 10,
 'C': 11,
 'O': 12,
 'J': 13,
 'T': 14,
 "'": 15,
 ' ': 16,
 'P': 17,
 'W': 18,
 'L': 19,
 'S': 20,
 'V': 21,
 'U': 22,
 'I': 23,
 'B': 24,
 'E': 25,
 'D': 26,
 'G': 27}

Now we will add two special tokens to the vocabulary: `[UNK]` and `[PAD]`. `[UNK]` stands in for characters that are not in the vocabulary, and `[PAD]` also serves as the blank token in CTC alignment. If you find this difficult to understand, please refer to the CTC blogs mentioned above.

In [35]:
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)
# Also, for convenience, replace the " " token with |
# so that it is more visible.
vocab["|"] = vocab[" "]
del vocab[" "]

In [36]:
vocab

{'H': 0,
 'Q': 1,
 'F': 2,
 'Z': 3,
 'X': 4,
 'K': 5,
 'Y': 6,
 'R': 7,
 'N': 8,
 'M': 9,
 'A': 10,
 'C': 11,
 'O': 12,
 'J': 13,
 'T': 14,
 "'": 15,
 'P': 17,
 'W': 18,
 'L': 19,
 'S': 20,
 'V': 21,
 'U': 22,
 'I': 23,
 'B': 24,
 'E': 25,
 'D': 26,
 'G': 27,
 '[UNK]': 28,
 '[PAD]': 29,
 '|': 16}

In [37]:
len(vocab)

30

Can you tell what this means? What will be the dimension of our output? How many classes will the output have?

In [38]:
# Save this vocab.json
import json
with open('../data/tutorial_10/vocab.json', 'w') as vocab_file:
    json.dump(vocab, vocab_file)

# 3. Create Tokenizer

Like Wav2Vec2 and HuBERT, Wav2Vec2-BERT is fine-tuned using connectionist temporal classification (CTC), so the model output has to be decoded using `Wav2Vec2CTCTokenizer`.
We will explain this in more detail in the next tutorial session.
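Conceptually, a character-level tokenizer maps each character of a transcript to its id in `vocab.json`, with `|` standing in for the space. The sketch below illustrates that idea only. It is not the `Wav2Vec2CTCTokenizer` implementation, and `encode`, `decode`, and `toy_vocab` are hypothetical names used for illustration.

```python
def encode(text, vocab, unk_token="[UNK]", word_delimiter="|"):
    """Turn a transcript into a list of character ids."""
    ids = []
    for char in text:
        token = word_delimiter if char == " " else char
        # Characters missing from the vocabulary fall back to the [UNK] id
        ids.append(vocab.get(token, vocab[unk_token]))
    return ids


def decode(ids, vocab, word_delimiter="|"):
    """Inverse of `encode`: ids back to text (no CTC collapsing here)."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return "".join(id_to_token[i] for i in ids).replace(word_delimiter, " ")


toy_vocab = {"A": 0, "D": 1, "O": 2, "R": 3, "|": 4, "[UNK]": 5, "[PAD]": 6}
ids = encode("DORA A", toy_vocab)
print(ids)                     # [1, 2, 3, 0, 4, 0]
print(decode(ids, toy_vocab))  # DORA A
```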

In [41]:
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("../data/tutorial_10", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")



In [None]:
# Push tokenizer to repository
tokenizer.push_to_hub(repo_name)

CommitInfo(commit_url='https://huggingface.co/Aditya3107/wav2vec2-bert-speechocean/commit/69cf1363167f97d118b2d99e6e00cae3c4d18840', commit_message='Upload tokenizer', commit_description='', oid='69cf1363167f97d118b2d99e6e00cae3c4d18840', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Aditya3107/wav2vec2-bert-speechocean', endpoint='https://huggingface.co', repo_type='model', repo_id='Aditya3107/wav2vec2-bert-speechocean'), pr_revision=None, pr_num=None)

# 4. Feature Extractor

When fine-tuning audio models such as Wav2Vec2, HuBERT, or Wav2Vec2-BERT, the feature extractor turns the raw waveform into the input representation that the model expects.

It converts the audio to a consistent format (e.g., sample rate, duration) that matches the model's pretraining setup.

It also extracts low-level features (such as spectrogram-like representations) directly from the waveform, so that the model can process the speech signal.



In [42]:
from transformers import SeamlessM4TFeatureExtractor

feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")

In [44]:
from transformers import Wav2Vec2BertProcessor

processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
# processor.push_to_hub(repo_name)

In [55]:
# The speechocean audio is already sampled at 16 kHz, so no resampling is needed here.
speechocean['train'][20]['audio']

{'path': '000050003.wav',
 'array': array([-0.02224731, -0.02105713, -0.0227356 , ...,  0.0010376 ,
        -0.00030518,  0.00030518]),
 'sampling_rate': 16000}

In [56]:
import numpy as np
print("Target text:", speechocean["train"][50]["text"])
print("Input array shape:", np.asarray(speechocean["train"][50]["audio"]["array"]).shape)
print("Sampling rate:", speechocean["train"][50]["audio"]["sampling_rate"])

Target text: TOM LIKES THE OLD SWEATER
Input array shape: (50880,)
Sampling rate: 16000
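The array shape and the sampling rate together give the length of the clip: duration = number of samples / sampling rate. A small worked example using the numbers printed above (the helper name `duration_seconds` is illustrative):

```python
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Length of an audio clip in seconds."""
    return num_samples / sampling_rate


print(duration_seconds(50880, 16000))  # 3.18
```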


# 5. Prepare for training

In the code below, we:
* Load the audio (it is resampled at this point if the `Audio` feature was cast to another sampling rate).
* Extract the `input_features` from the loaded audio; in our case these are log-mel features.
* Encode the transcription text into label ids.

In [57]:
def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["input_length"] = len(batch["input_features"])

    batch["labels"] = processor(text=batch["text"]).input_ids
    return batch

In [58]:
speechocean = speechocean.map(prepare_dataset)

Map: 100%|██████████| 2500/2500 [11:18<00:00,  3.69 examples/s]  
Map: 100%|██████████| 2500/2500 [10:31<00:00,  3.96 examples/s]  


## Data Collator

A data collator prepares batches of data during training. It ensures that the inputs and labels are appropriately padded and formatted for the model.

The inputs (audio features) and the labels (transcripts) vary in length from example to example, so they have to be padded to a common length within each batch.


`DataCollatorCTCWithPadding`

This class customizes how batches are created for Connectionist Temporal Classification (CTC)-based training tasks like speech recognition. It takes care of:

1. Padding Inputs: Handles variable-length input audio features by padding them to the longest sequence in a batch.
2. Padding Labels: Pads transcription labels separately.
3. Handling Loss Masking: Ensures that padding in the labels is ignored during loss computation by replacing padding tokens with -100.
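Steps 2 and 3 can be illustrated without any library. The sketch below pads label sequences to the longest one in the batch and then replaces the padded positions with -100. It uses plain Python lists instead of tensors, and the helper name `pad_labels` and the label ids are made up for illustration.

```python
def pad_labels(label_batch, pad_id, ignore_index=-100):
    """Pad label sequences to equal length, then mask the padding."""
    max_len = max(len(labels) for labels in label_batch)
    padded, attention_mask = [], []
    for labels in label_batch:
        n_pad = max_len - len(labels)
        padded.append(labels + [pad_id] * n_pad)
        attention_mask.append([1] * len(labels) + [0] * n_pad)
    # Positions where the mask is 0 are padding: replace them with -100
    masked = [
        [tok if keep else ignore_index for tok, keep in zip(row, mask)]
        for row, mask in zip(padded, attention_mask)
    ]
    return masked


batch = [[26, 12, 7, 10], [23, 20]]  # two transcripts of different length
print(pad_labels(batch, pad_id=29))
# [[26, 12, 7, 10], [23, 20, -100, -100]]
```

Masking through the attention mask rather than by comparing against `pad_id` means that a genuine label equal to `pad_id` would not be masked by accident.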

In [None]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:

    processor: Wav2Vec2BertProcessor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have different lengths and need
        # different padding methods
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )

        labels_batch = self.processor.pad(
            labels=label_features,
            padding=self.padding,
            return_tensors="pt",
        )
        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

Now we want to check how well our model performs, so we need to define an evaluation metric. We choose word error rate (WER). We will talk more about evaluation metrics in the next tutorial.
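As a preview, WER is the number of word substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below computes it with a standard edit-distance table. It is an illustration written for this tutorial, not the code behind the `evaluate` metric, and it assumes a non-empty reference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between the first i ref words and first j hyp words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


print(word_error_rate("TOM LIKES THE OLD SWEATER", "TOM LIKE THE OLD SWEATER"))  # 0.2
```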

In [None]:
from evaluate import load
wer_metric = load("wer")

In [None]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

## Load pretrained model

Now we will load the main pretrained model and provide the training arguments.

In [None]:
from transformers import Wav2Vec2BertForCTC

model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    attention_dropout=0.0,
    hidden_dropout=0.0,
    feat_proj_dropout=0.0,
    mask_time_prob=0.0,
    layerdrop=0.0,
    ctc_loss_reduction="mean",
    add_adapter=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

config.json:   0%|          | 0.00/1.87k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.32G [00:00<?, ?B/s]

Some weights of Wav2Vec2BertForCTC were not initialized from the model checkpoint at facebook/w2v-bert-2.0 and are newly initialized: ['adapter.layers.0.ffn.intermediate_dense.bias', 'adapter.layers.0.ffn.intermediate_dense.weight', 'adapter.layers.0.ffn.output_dense.bias', 'adapter.layers.0.ffn.output_dense.weight', 'adapter.layers.0.ffn_layer_norm.bias', 'adapter.layers.0.ffn_layer_norm.weight', 'adapter.layers.0.residual_conv.bias', 'adapter.layers.0.residual_conv.weight', 'adapter.layers.0.residual_layer_norm.bias', 'adapter.layers.0.residual_layer_norm.weight', 'adapter.layers.0.self_attn.linear_k.bias', 'adapter.layers.0.self_attn.linear_k.weight', 'adapter.layers.0.self_attn.linear_out.bias', 'adapter.layers.0.self_attn.linear_out.weight', 'adapter.layers.0.self_attn.linear_q.bias', 'adapter.layers.0.self_attn.linear_q.weight', 'adapter.layers.0.self_attn.linear_v.bias', 'adapter.layers.0.self_attn.linear_v.weight', 'adapter.layers.0.self_attn_conv.bias', 'adapter.layers.0.self_

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir=repo_name,
  group_by_length=True,
  per_device_train_batch_size=16,
  gradient_accumulation_steps=2,
  evaluation_strategy="steps",
  num_train_epochs=10,
  gradient_checkpointing=True,
  fp16=True,
  save_steps=600,
  eval_steps=300,
  logging_steps=300,
  learning_rate=5e-5,
  warmup_steps=500,
  save_total_limit=2,
  push_to_hub=True,
  report_to="wandb"
)
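It helps to translate these arguments into a rough number of optimizer steps. The sketch below is a back-of-the-envelope estimate that assumes a single GPU and the 2500 training examples shown earlier. `estimate_schedule` is a hypothetical helper, and the exact count reported by `Trainer` can differ slightly depending on how partial batches are handled.

```python
import math


def estimate_schedule(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Rough count of optimizer steps for a training run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


effective_batch, steps_per_epoch, total_steps = estimate_schedule(
    num_examples=2500, per_device_batch=16, grad_accum=2, epochs=10
)
print(effective_batch, steps_per_epoch, total_steps)  # 32 79 790
```

Under this estimate, `warmup_steps=500` covers more than half of the run, and with `eval_steps=300` only two evaluations (around steps 300 and 600) would take place, so you may want to adjust these values for a dataset of this size.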




In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=speechocean['train'],
    eval_dataset=speechocean['test'],
    tokenizer=processor.feature_extractor,
)




### Introduction to Weights and Biases
https://wandb.ai/site/

Weights and Biases (W&B) is a tool that helps you track, visualize, and organize your machine learning experiments.
It can:
1. Track metrics like loss, accuracy, and learning rate for each training step or epoch.
2. Visualize training progress with real-time graphs and dashboards that show how well your model is learning.
3. Store your model configurations and hyperparameters.

I highly recommend getting familiar with `WANDB` if you want to train or fine-tune AI models in the future. You can log in to `WANDB` from this notebook.

If you do not want to use `WANDB` then remove `report_to="wandb"` from training arguments.

Training can take multiple hours, depending on the GPU allocation. Still, this should give you a general idea of how fine-tuning works.

In [None]:
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33maditya3107[0m. Use [1m`wandb login --relogin`[0m to force relogin


Upload the result of fine-tuning to the 🤗 Hub.

In [None]:
trainer.push_to_hub()

This tutorial is closely adapted from a well-known blog post: https://huggingface.co/blog/fine-tune-wav2vec2-english

Please check it out in case you want more details.

In *the next tutorial*, we will use the model we fine-tuned here and evaluate its output. We will calculate the word error rate and character error rate of our predictions.

We will also look inside the output of the fine-tuned model to demonstrate the CTC algorithm.
