# **Fine-Tuning Whisper Model for Singlish Accented Speech**

Automatic speech recognition (ASR) systems often perform poorly in regional applications because accented speech poses particular difficulties. This study fine-tunes the Whisper ASR model to recognize Singlish-accented speech in Visual Acuity (VA) test scenarios. Singlish, the colloquial variety of English spoken in Singapore, presents distinctive phonetic and tonal challenges. The Whisper model was fine-tuned on a custom collection of Singlish speech samples focused on VA test utterances. On the test set, the fine-tuned model reached a Word Error Rate (WER) of **31.48%**, the value reported by the evaluation cell at the end of this notebook. The findings suggest that targeted fine-tuning is a practical way to adapt ASR systems to regional accents, which can make automated processes more precise and effective.

The command `!pip install accelerate -U` installs or updates the Hugging Face Accelerate library to the latest version. This library is used to optimize and efficiently run deep learning models on various hardware configurations like GPUs and TPUs.

In [None]:
!pip install accelerate -U



# **Install the Python libraries required for building, training, and deploying the model**

`datasets`: A library to access and process large datasets for machine learning.

`transformers`: The Hugging Face library for pre-trained transformer models like Whisper.

`librosa`: A library for audio processing tasks like resampling and feature extraction.

`evaluate`: For evaluation metrics like Word Error Rate (WER).

`jiwer`: Specifically for calculating WER.

`gradio`: A library to create interactive web interfaces for model inference and deployment.

In [None]:
!pip install datasets #>=2.6.1
!pip install git+https://github.com/huggingface/transformers
!pip install transformers[torch]
!pip install librosa
!pip install evaluate  # left unpinned: an unquoted ">=0.30" is parsed by the shell as an output redirect
!pip install jiwer
!pip install gradio
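
After the install cell has run, it can help to record which versions were actually resolved, since several of the packages above are installed without a pin. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an assumption introduced here for illustration and is not part of the notebook.

```python
from importlib import metadata

# Package names taken from the install cells above.
PACKAGES = ["accelerate", "datasets", "transformers", "librosa", "evaluate", "jiwer", "gradio"]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:<14}{version or 'not installed'}")
```

Saving this listing alongside the results makes it easier to reproduce a run later.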


This step requires a Hugging Face account and an **access token**, which lets you log in to the Hugging Face Hub from within a Jupyter notebook or Colab environment. The token links your local environment to your Hugging Face account, **enabling you to upload models and datasets or use private resources from the Hugging Face Hub**.


In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The cell below loads the Singlish dataset of audio-text pairs from the training and testing paths. It walks each directory tree looking for `WAV` folders, pairs every `.wav` file with the `.txt` transcription of the same name in the corresponding `TXT` folder, and skips pairs whose transcription is missing or empty. The resulting pairs are converted into a `DatasetDict` with `train` and `test` splits, ready for machine learning tasks.

In [None]:
from datasets import Dataset, DatasetDict
import os

# Define dataset paths
train_data_path = "/content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish Data_Train"
test_data_path = "/content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish_data_Test"

# Function to load dataset
def load_data(data_path):
    data_files = {"audio": [], "text": []}
    for root, dirs, files in os.walk(data_path):
        if "WAV" in root:
            wav_folder = root
            txt_folder = root.replace("WAV", "TXT")
            if not os.path.exists(txt_folder):
                continue
            for wav_file in os.listdir(wav_folder):
                if wav_file.endswith(".wav"):
                    wav_path = os.path.join(wav_folder, wav_file)
                    txt_file = os.path.splitext(wav_file)[0] + ".txt"
                    txt_path = os.path.join(txt_folder, txt_file)
                    if os.path.exists(txt_path):
                        with open(txt_path, "r") as f:
                            transcription = f.read().strip()
                        if transcription:  # Skip empty transcriptions
                            data_files["audio"].append(wav_path)
                            data_files["text"].append(transcription)
                        else:
                            print(f"Empty transcription for: {wav_path}")
                    else:
                        print(f"Missing TXT for WAV file: {wav_path}")
    if not data_files["audio"]:
        print(f"No audio files found in {data_path}")
    return Dataset.from_dict(data_files)

# Load training and test datasets
training_dataset = load_data(train_data_path)
test_dataset = load_data(test_data_path)

# Combine datasets into a DatasetDict
dataset = DatasetDict({
    "train": training_dataset,
    "test": test_dataset
})

# Display dataset summary
print(f"Dataset Summary:\n{dataset}")


Missing TXT for WAV file: /content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish Data_Train/Wireless/Noisy/M1/WAV/M1ENTest2715243USBAudio1.011072024150909.wav
Missing TXT for WAV file: /content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish Data_Train/Wireless/Noisy/F2/WAV/F2ENTest277473USBAudio1.030072024123220.386399.wav
Missing TXT for WAV file: /content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish Data_Train/Desktop/Noisy/M2/WAV/MENTest251693IntelDM12072024102702.688642.wav
Missing TXT for WAV file: /content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish Data_Train/Desktop/Noisy/F2/WAV/F2ENTest16CIntelDM30072024122210.290612.wav
Missing TXT for WAV file: /content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish Data_Train/Desktop/Quiet/M2/WAV/M2ENTest1EIntelDM12072024102702.688642.wav
Missing TXT for WAV file: /content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish_data_Test/wirel
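
The loader relies on a folder layout in which each speaker directory holds sibling `WAV` and `TXT` folders with matching file stems. The sketch below illustrates that pairing rule on a throwaway directory tree; the file names and the helper `pair_wav_with_txt` are hypothetical, and the helper matches the folder name `WAV` exactly, which is slightly stricter than the substring test used in the cell above.

```python
import os
import tempfile

def pair_wav_with_txt(data_path):
    """Return (paired, missing): WAV files with and without a matching TXT file."""
    paired, missing = [], []
    for root, _dirs, files in os.walk(data_path):
        if os.path.basename(root) != "WAV":
            continue
        # The transcriptions are expected in a sibling folder named TXT.
        txt_folder = os.path.join(os.path.dirname(root), "TXT")
        for wav_file in sorted(files):
            if not wav_file.endswith(".wav"):
                continue
            stem = os.path.splitext(wav_file)[0]
            txt_path = os.path.join(txt_folder, stem + ".txt")
            target = paired if os.path.exists(txt_path) else missing
            target.append(os.path.join(root, wav_file))
    return paired, missing

# Build a tiny hypothetical tree: <tmp>/Quiet/M1/{WAV,TXT}
with tempfile.TemporaryDirectory() as tmp:
    wav_dir = os.path.join(tmp, "Quiet", "M1", "WAV")
    txt_dir = os.path.join(tmp, "Quiet", "M1", "TXT")
    os.makedirs(wav_dir)
    os.makedirs(txt_dir)
    for name in ("clip_a.wav", "clip_b.wav"):
        open(os.path.join(wav_dir, name), "wb").close()
    with open(os.path.join(txt_dir, "clip_a.txt"), "w", encoding="utf-8") as f:
        f.write("N Q W X B")
    paired, missing = pair_wav_with_txt(tmp)
    paired_names = [os.path.basename(p) for p in paired]
    missing_names = [os.path.basename(p) for p in missing]

print("paired:", paired_names)
print("missing TXT:", missing_names)
```

A file that appears in the second list corresponds to the "Missing TXT for WAV file" messages printed by the loader.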

The code below **cleans the `text` field** of the Singlish dataset: where an entry includes a file path separated from the transcription by a tab (`\t`), only the transcription is kept. The `map` function applies this cleaning to both the training and test datasets, which are then recombined into a `DatasetDict`. Finally, the first four examples from each split (`train` and `test`) are printed for verification.

In [None]:
# Define a function to process the 'text' field
def clean_text(example):
    # Extract the transcription part from the 'text' field (remove file path)
    if '\t' in example['text']:
        example['text'] = example['text'].split('\t')[-1]
    return example

# Apply the function to the training and test datasets using map
training_dataset = training_dataset.map(clean_text)
test_dataset = test_dataset.map(clean_text)

# Combine the datasets back into a DatasetDict
dataset = DatasetDict({
    "train": training_dataset,
    "test": test_dataset
})

# Print the updated dataset in a more readable way
for split in dataset:
    print(f"\n--- {split.upper()} SPLIT ---")
    for i, example in enumerate(dataset[split]):
        print(f"Audio Path: {example['audio']}")
        print(f"Text: {example['text']}")  # Prints raw text without quotes
        if i >= 3:  # Stop after four examples (indices 0-3) per split for readability
            break


Map:   0%|          | 0/794 [00:00<?, ? examples/s]

Map:   0%|          | 0/449 [00:00<?, ? examples/s]


--- TRAIN SPLIT ---
Audio Path: /content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish Data_Train/Wireless/Quiet/M1/WAV/M1ENTest13NQWXB3USBAudio1.011072024150909.443893.wav
Text: N Q W X B
Audio Path: /content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish Data_Train/Wireless/Quiet/M1/WAV/M1ENTest10FDPLTCEO3USBAudio1.011072024150909.443893.wav
Text: F D P L T C E O
Audio Path: /content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish Data_Train/Wireless/Quiet/M1/WAV/M1ENTest12RVDH3USBAudio1.011072024150909.443893.wav
Text: R V D H
Audio Path: /content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish Data_Train/Wireless/Quiet/M1/WAV/M1ENTest14ONUHZB3USBAudio1.011072024150909.443893.wav
Text: O N U H Z B

--- TEST SPLIT ---
Audio Path: /content/drive/MyDrive/Colab Notebooks/Singlish Data_Train+Test/Singlish_data_Test/wireless/Quiet/M9/WAV/M9ENTest10FDPLTCEO3USBAudio1.006082024144738.711652.wav
Text: F D P L T C E O
Audio P
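
To make the cleaning rule concrete, the sketch below applies the same function to two hand-written entries. Both entries are hypothetical and only illustrate the "path, tab, transcription" layout described above.

```python
def clean_text(example):
    """Keep only the part after the last tab, if a tab is present."""
    if "\t" in example["text"]:
        example["text"] = example["text"].split("\t")[-1]
    return example

# Hypothetical raw entries: one with a "path<TAB>transcription" layout, one already clean.
with_path = {"text": "M1/clip_a.wav\tN Q W X B"}
already_clean = {"text": "R V D H"}

print(clean_text(with_path)["text"])
print(clean_text(already_clean)["text"])
```

Entries without a tab pass through unchanged, so the function can be applied safely more than once.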

To set up fine-tuning, the `audio` column is first cast to the `Audio` feature from the Hugging Face `datasets` library, so that each file path is decoded into a waveform array and sampling rate when accessed. The subsequent cells import the libraries needed for audio processing (`torchaudio`), for loading the Whisper model (`WhisperProcessor` and `WhisperForConditionalGeneration`), for training (`Trainer` and `TrainingArguments`), and for measuring the Word Error Rate (WER) with `jiwer`.

In [None]:
from datasets import Dataset, DatasetDict, Audio


In [None]:
dataset = dataset.cast_column("audio", Audio())
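
Because `Audio()` is called without a target sampling rate, the recordings keep their native rate, which is why the preprocessing step later checks for 16 kHz. A quick way to see what rate a file was recorded at is to read its header. The following is a minimal sketch with the standard-library `wave` module; it assumes uncompressed PCM WAV files, and the helper name `wav_info` and the temporary file are illustrative only.

```python
import os
import tempfile
import wave

def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) read from a WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate

# Write one second of 16-bit mono silence at 44.1 kHz to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.wav")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(44100)
        wav.writeframes(b"\x00\x00" * 44100)
    rate, seconds = wav_info(path)

needs_resampling = rate != 16000
print(rate, seconds, needs_resampling)
```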


In [None]:
import os
import torch
from datasets import Dataset, DatasetDict, Audio
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Trainer,
    TrainingArguments,
    TrainerCallback,
)
import torchaudio
from jiwer import wer

In [None]:
# Load Whisper processor
model_name = "openai/whisper-tiny"  # Use a smaller Whisper model for faster training
processor = WhisperProcessor.from_pretrained(model_name)



The preprocessing function below prepares the audio and text for training. It resamples the audio to `16 kHz`, converts it into log-mel spectrogram features, and pads or truncates the features to `3000` frames. Each transcription is tokenized into numerical IDs, and the original text is kept alongside the processed features for use during evaluation.

In [None]:
# Preprocessing function
def preprocess_function(batch):
    audio = batch["audio"]

    # Resample audio if necessary
    if audio["sampling_rate"] != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=audio["sampling_rate"], new_freq=16000)
        audio_array = resampler(torch.tensor(audio["array"], dtype=torch.float32))
    else:
        audio_array = torch.tensor(audio["array"], dtype=torch.float32)

    # Process audio and pad/truncate to 3000 frames
    inputs = processor(audio_array.numpy(), sampling_rate=16000, return_tensors="pt")
    mel_features = inputs.input_features.squeeze(0)

    mel_features = torch.nn.functional.pad(
        mel_features, (0, max(0, 3000 - mel_features.shape[-1]))
    )[:, :3000]

    batch["input_features"] = mel_features

    # Tokenize text
    batch["labels"] = processor.tokenizer(batch["text"], return_tensors="pt", padding=True).input_ids.squeeze(0)

    # The original "text" column is left in place so it is available for evaluation
    return batch
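
The figure of `3000` frames follows from the length of Whisper's input window. The arithmetic can be sketched as below, assuming a 30-second window and a hop of 160 samples between spectrogram frames; these values are assumptions about the feature extractor's defaults and are worth confirming against `processor.feature_extractor` in the running session.

```python
SAMPLE_RATE_HZ = 16_000   # target rate used in the preprocessing cell
CHUNK_SECONDS = 30        # assumed Whisper window length
HOP_LENGTH = 160          # assumed number of samples between successive frames

samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_SECONDS
frames_per_chunk = samples_per_chunk // HOP_LENGTH

def frames_for(duration_seconds):
    """Frames a clip would produce before padding or truncation to the fixed window."""
    return int(duration_seconds * SAMPLE_RATE_HZ) // HOP_LENGTH

print(samples_per_chunk)   # 480000
print(frames_per_chunk)    # 3000
print(frames_for(4.5))     # 450
```

Short utterances such as a single VA test line therefore occupy only a small part of the window, and the rest is padding.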


The cell below applies `preprocess_function` to every example in both splits and drops the original `audio` column once the features have been extracted.

In [None]:
processed_dataset = dataset.map(preprocess_function, remove_columns=["audio"])


The custom data collator below prepares batches of data for the Whisper model during training. It stacks `input_features` (mel spectrograms) from all samples into a single tensor and pads the `labels` (tokenized text) to the same length using the tokenizer's padding token ID. The collator ensures the input features and labels are properly formatted and aligned for batch processing.

In [None]:
# Custom data collator
def data_collator(features):
    input_features = torch.stack([torch.tensor(f["input_features"], dtype=torch.float32) for f in features])
    labels = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor(f["labels"], dtype=torch.long) for f in features],
        batch_first=True,
        padding_value=processor.tokenizer.pad_token_id
    )
    return {"input_features": input_features, "labels": labels}
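
The collator above pads labels with the tokenizer's padding ID, so padded positions are treated as ordinary targets when the loss is computed. A common alternative is to pad with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the difference on plain Python lists; the token IDs are hypothetical, `0` merely stands in for the real padding ID, and this is an illustration of the idea, not the notebook's collator.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def pad_labels(label_sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in label_sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in label_sequences]

# Hypothetical token IDs for two transcriptions of different lengths.
labels = [[11, 12, 13, 14], [21, 22]]

padded_with_pad_id = pad_labels(labels, pad_value=0)  # 0 stands in for pad_token_id
padded_with_ignore = pad_labels(labels, pad_value=IGNORE_INDEX)

print(padded_with_pad_id)   # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(padded_with_ignore)   # [[11, 12, 13, 14], [21, 22, -100, -100]]
```

Whether the choice changes the results for this dataset has not been measured here.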

In [None]:
model = WhisperForConditionalGeneration.from_pretrained(model_name)



The cell below defines the training arguments for fine-tuning Whisper with the Hugging Face `Trainer`: batch sizes, learning rate, gradient accumulation, number of epochs, evaluation frequency, and logging steps. It also enables mixed-precision training (`fp16`), reloads the checkpoint with the lowest evaluation loss at the end of training, and limits the number of saved checkpoints to conserve storage and avoid session crashes.

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=3,
    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
    logging_dir="./logs",
    logging_steps=50,
    fp16=True,
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    push_to_hub=False,
)
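
These settings determine how many optimizer steps the run takes and when evaluation happens. The sketch below works through the arithmetic for a single device, using the 794 training examples shown by the earlier `Map` progress bar. It assumes that a leftover partial accumulation at the end of an epoch is not counted as a step; under that assumption the total agrees with the `global_step=594` reported in the training output.

```python
import math

NUM_TRAIN_EXAMPLES = 794      # size of the train split reported by the Map progress bar
PER_DEVICE_BATCH_SIZE = 2
GRADIENT_ACCUMULATION = 2
NUM_EPOCHS = 3
EVAL_EVERY = 100              # eval_steps and save_steps

effective_batch_size = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION
batches_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / PER_DEVICE_BATCH_SIZE)
# Assumption: a trailing partial accumulation is not counted as an optimizer step.
updates_per_epoch = batches_per_epoch // GRADIENT_ACCUMULATION
total_steps = updates_per_epoch * NUM_EPOCHS
eval_points = list(range(EVAL_EVERY, total_steps + 1, EVAL_EVERY))

print(effective_batch_size)  # 4
print(total_steps)           # 594
print(eval_points)           # [100, 200, 300, 400, 500]
```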

The cell below initializes the Hugging Face `Trainer` with the model, the training arguments, the processed training and evaluation datasets, and the custom data collator. The `Trainer` handles gradient updates, logging, evaluation, and checkpointing. The test split is passed as the evaluation set, so the periodic validation loss is computed on it.

In [None]:
train_dataset = processed_dataset["train"]
test_dataset = processed_dataset["test"]

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
)


In [None]:
trainer.train()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 0.2627 | 0.288636 |
| 200 | 0.2051 | 0.231751 |
| 300 | 0.1251 | 0.202508 |
| 400 | 0.0635 | 0.207607 |
| 500 | 0.0375 | 0.1972 |


There were missing keys in the checkpoint model loaded: ['proj_out.weight'].


TrainOutput(global_step=594, training_loss=0.13727542145886404, metrics={'train_runtime': 10784.9946, 'train_samples_per_second': 0.221, 'train_steps_per_second': 0.055, 'total_flos': 5.839599550464e+16, 'train_loss': 0.13727542145886404, 'epoch': 2.9874055415617127})
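
With `load_best_model_at_end=True`, `metric_for_best_model="eval_loss"`, and `greater_is_better=False`, the checkpoint kept at the end is the one with the lowest validation loss. The sketch below applies that selection rule to the values in the training log above; it only mirrors the rule and does not call the `Trainer`.

```python
# (step, validation loss) pairs copied from the training log above.
eval_log = [
    (100, 0.288636),
    (200, 0.231751),
    (300, 0.202508),
    (400, 0.207607),
    (500, 0.1972),
]

# greater_is_better=False means the lowest metric value wins.
best_step, best_loss = min(eval_log, key=lambda row: row[1])

print(best_step, best_loss)  # 500 0.1972
```

The validation loss rises slightly at step 400 before reaching its minimum at step 500, while the training loss keeps falling throughout.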

The cell below evaluates the fine-tuned Whisper model by computing the **Word Error Rate** (**WER**) on the test dataset. For each sample, the model generates token IDs, which are decoded into text and compared with the ground-truth transcription. The `compute_wer` function returns the WER, which is then printed as a percentage to indicate the model's performance.

In [None]:
# Evaluate the model
def compute_wer(predictions, references):
    return wer(references, predictions)

def evaluate_model(model, processor, dataset):
    model.eval()
    predictions, references = [], []
    for sample in dataset:
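        # Add a batch dimension and place the features on the same device as the model
        inputs = torch.tensor(sample["input_features"]).unsqueeze(0).to(model.device)
        with torch.no_grad():
            predicted_ids = model.generate(inputs)
        # Special tokens are kept in the decoded text unless skip_special_tokens=True is passed
        transcription = processor.decode(predicted_ids[0], language="en")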
        predictions.append(transcription)
        references.append(sample["text"])
    return compute_wer(predictions, references)


test_wer = evaluate_model(model, processor, test_dataset)
print(f"Test WER: {test_wer * 100:.2f}%")

Test WER: 31.48%
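
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The sketch below is a small pure-Python version for a single sentence pair, written to illustrate the metric; the notebook itself uses `jiwer`, which also aggregates over many sentences. The function name `word_error_rate` and the example strings are illustrative, and the last string is a hypothetical decoder output that still contains special tokens.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

reference = "F D P L T C E O"
print(word_error_rate(reference, "F D P L T C E O"))   # 0.0
print(word_error_rate(reference, "F D B L T C E"))     # 0.25 (one substitution, one deletion)

# Hypothetical decoded string that still contains special tokens.
with_tokens = "<|startoftranscript|><|notimestamps|> F D P L T C E O<|endoftext|>"
print(word_error_rate(reference, with_tokens))         # 0.25
```

The last case shows that special tokens left in a transcription are scored as word errors even when every letter is recognized correctly, so stripping them before scoring (for example with `skip_special_tokens=True` when decoding) may yield a different WER than the one reported above.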
