# Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

This notebook demonstrates how to fine-tune OpenAI's Whisper model for Tajik speech recognition using synthetic data. The approach is inspired by the [Fine-tuning Whisper blog post](https://huggingface.co/blog/fine-tune-whisper) from Hugging Face.

## Introduction

Whisper is a powerful pre-trained model for automatic speech recognition (ASR) published by OpenAI. While it comes with multilingual capabilities, fine-tuning on specific languages and domains can significantly improve performance.

In this notebook, we'll:
1. Load synthetic Tajik speech data generated using Meta's MMS-TTS
2. Prepare the data for fine-tuning
3. Fine-tune the Whisper small model
4. Evaluate and save the results

In [None]:
!pip install -q git+https://github.com/openai/whisper.git
!pip uninstall -y transformers datasets peft accelerate huggingface_hub
!pip install transformers==4.38.2 datasets==2.18.0 accelerate==0.27.2 peft==0.8.2 huggingface_hub==0.22.2

In [2]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration, Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import load_dataset
import torchaudio

## Load Dataset

In [4]:
from datasets import Dataset, DatasetDict, Audio
import json
from pathlib import Path

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# # Path to your metadata file and audio folder
# metadata_path = "/content/drive/MyDrive/Colab Outputs/metadata.jsonl"
# audio_base_path = "/content/drive/MyDrive/Colab Outputs"

# with open(metadata_path, "r", encoding="utf-8") as f:
#     data = [json.loads(line) for line in f]

# for item in data:
#     item["audio"] = str(Path(item["audio_path"]).resolve())
#     item["sentence"] = item["text"]
#     del item["audio_path"]
#     del item["text"]
#     del item["duration"]

# full_dataset = Dataset.from_list(data)

# full_dataset = full_dataset.cast_column("audio", Audio())

# split_dataset = full_dataset.train_test_split(test_size=0.1, seed=42)

# custom_data = DatasetDict({
#     "train": split_dataset["train"],
#     "test": split_dataset["test"]
# })

In [5]:
metadata_path = "/content/drive/MyDrive/Colab Outputs/metadata.jsonl"
audio_base_path = "/content/drive/MyDrive/Colab Outputs"

with open(metadata_path, "r", encoding="utf-8") as f:
    data = [json.loads(line) for line in f]

data = data[:8000]

for item in data:
    item["audio"] = str(Path(item["audio_path"]).resolve())
    item["sentence"] = item["text"]
    del item["audio_path"]
    del item["text"]
    del item["duration"]

full_dataset = Dataset.from_list(data)

full_dataset = full_dataset.cast_column("audio", Audio())

custom_data = DatasetDict({
    "train": full_dataset.select(range(6000)),
    "test": full_dataset.select(range(6000, 8000))
})
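
The cell above keeps the first 8,000 metadata rows and splits them by position: rows 0 to 5,999 for training and rows 6,000 to 7,999 for testing, with no shuffling. A minimal, dependency-free sketch of that record mapping and contiguous split follows. The helper names (`build_records`, `contiguous_split`) and the three sample rows, including their `duration` values, are hypothetical and only illustrate the expected JSONL shape.

```python
import json

def build_records(jsonl_lines, limit=None):
    """Parse JSONL metadata lines into {'audio', 'sentence'} records."""
    records = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        # keep only the two fields used for training; 'duration' is dropped
        records.append({"audio": item["audio_path"], "sentence": item["text"]})
    return records if limit is None else records[:limit]

def contiguous_split(records, n_train):
    """First n_train records go to train, the remainder to test (no shuffling)."""
    return records[:n_train], records[n_train:]

# Hypothetical sample rows in the same shape as metadata.jsonl
sample_lines = [
    '{"audio_path": "audio/synthetic_000000.wav", "text": "first", "duration": 3.2}',
    '{"audio_path": "audio/synthetic_000001.wav", "text": "second", "duration": 2.7}',
    '{"audio_path": "audio/synthetic_000002.wav", "text": "third", "duration": 4.1}',
]

records = build_records(sample_lines)
train, test = contiguous_split(records, n_train=2)
print(len(train), len(test))
```

A positional split is reproducible, but if the metadata file is ordered (for example by topic or sentence length), the test set may not be representative; shuffling before splitting is the usual alternative.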

In [6]:
print(custom_data)

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 6000
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 2000
    })
})


## Prepare Feature Extractor, Tokenizer and Data

### Load WhisperFeatureExtractor

In [7]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")


### Load WhisperTokenizer

In [9]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small",
    language="tg",
    task="transcribe"
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Combine To Create A WhisperProcessor

In [10]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small",
    language="tg",
    task="transcribe"
  )

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Prepare Data

In [11]:
print(custom_data["train"][0])

{'audio': {'path': '/content/drive/My Drive/Colab Outputs/audio/synthetic_000000.wav', 'array': array([ 0.00473022,  0.00238037, -0.00137329, ..., -0.00201416,
       -0.00460815, -0.00314331]), 'sampling_rate': 16000}, 'sentence': 'ин кори шумо зиндагии худу рангину ширин месозад'}


In [12]:
from datasets import Audio

custom_data = custom_data.cast_column("audio", Audio(sampling_rate=16000))

In [13]:
print(custom_data["train"][0])

{'audio': {'path': '/content/drive/My Drive/Colab Outputs/audio/synthetic_000000.wav', 'array': array([ 0.00473022,  0.00238037, -0.00137329, ..., -0.00201416,
       -0.00460815, -0.00314331]), 'sampling_rate': 16000}, 'sentence': 'ин кори шумо зиндагии худу рангину ширин месозад'}


In [14]:
def prepare_dataset(batch):
    # read the decoded audio; it is already at 16 kHz after the cast above
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
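
To make the shape of `input_features` concrete, here is a small arithmetic sketch. It assumes the usual Whisper front-end settings (30-second window, hop of 160 samples, 80 Mel bins for `whisper-small`); these constants are assumptions stated here, not values read from the notebook, and can be checked against `feature_extractor` attributes.

```python
SAMPLING_RATE = 16_000  # Hz, matches the Audio(sampling_rate=16000) cast
CHUNK_SECONDS = 30      # assumed fixed window length used by Whisper
HOP_LENGTH = 160        # assumed hop between spectrogram frames, in samples
N_MELS = 80             # assumed number of Mel bins for whisper-small

def expected_feature_shape(sampling_rate=SAMPLING_RATE,
                           chunk_seconds=CHUNK_SECONDS,
                           hop_length=HOP_LENGTH,
                           n_mels=N_MELS):
    """Return (mel_bins, frames) for one padded or truncated audio window."""
    n_samples = sampling_rate * chunk_seconds  # samples in one window
    n_frames = n_samples // hop_length         # one frame per hop
    return (n_mels, n_frames)

def padding_fraction(duration_seconds, chunk_seconds=CHUNK_SECONDS):
    """Fraction of the window that is padding for a clip of the given length."""
    return max(0.0, 1.0 - duration_seconds / chunk_seconds)

print(expected_feature_shape())  # (80, 3000)
```

Under these assumptions every example has the same feature shape regardless of clip length, which is why the collator later only needs to convert the features to tensors rather than pad them to a batch maximum.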

In [15]:
custom_data = custom_data.map(prepare_dataset, remove_columns=custom_data.column_names["train"], num_proc=2)


## Training

In [16]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")


We can disable the automatic language detection performed during inference and force the model to generate in Tajik. To do so, we set the [language](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate.language)
and [task](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate.task)
arguments in the generation config. We also set [`forced_decoder_ids`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.forced_decoder_ids)
to None, since this was the legacy way of setting the language and
task arguments:

In [17]:
model.generation_config.language = "tg"
model.generation_config.task = "transcribe"

model.generation_config.forced_decoder_ids = None
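
As an illustration of what fixing the language and task amounts to, the sketch below assembles the special-token prefix that Whisper's decoder is conditioned on. The helper `decoder_prompt` is hypothetical and is not part of 🤗 Transformers; it only mirrors the token layout for readability.

```python
def decoder_prompt(language, task, timestamps=False):
    """Build the special-token prefix for a given language code and task."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # plain transcription without timestamp tokens
        tokens.append("<|notimestamps|>")
    return tokens

prompt = decoder_prompt("tg", "transcribe")
print("".join(prompt))  # <|startoftranscript|><|tg|><|transcribe|><|notimestamps|>
```

With the language token fixed to `<|tg|>`, the model no longer has to guess the language from the audio, which avoids transcripts drifting into a related language.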

### Define a Data Collator

In [18]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the previous tokenization step,
        # cut it here, as it is added again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [19]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
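
The label handling in the collator can be sketched without tensors. The example below pads variable-length label sequences directly with `-100` (the index the loss ignores) and strips a shared leading start token, which is the same end result as the pad-then-mask steps above. The function name and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def pad_and_mask(label_seqs, bos_id):
    """Right-pad label sequences with IGNORE_INDEX and drop a shared leading BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = [list(seq) + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]
    # only strip the first column when every row starts with the BOS id
    if all(row[0] == bos_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded

# Hypothetical token ids: 1 plays the role of decoder_start_token_id
batch_labels = pad_and_mask([[1, 7, 8, 9], [1, 5]], bos_id=1)
print(batch_labels)  # [[7, 8, 9], [5, -100, -100]]
```

Masking with `-100` rather than the pad token id matters: otherwise the model would be rewarded for predicting padding, which inflates the apparent training progress.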

### Evaluation Metrics

In [None]:
!pip install evaluate jiwer

In [21]:
import evaluate

metric = evaluate.load("wer")


In [22]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
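
For intuition about the number `compute_metrics` returns, here is a minimal pure-Python sketch of word error rate: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, pooled over all examples. It is an illustration under those assumptions and not the `evaluate` implementation; the hypothesis sentence is invented by dropping one word from the sample transcript shown earlier.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_error_rate(references, predictions):
    """Total word edits divided by total reference words."""
    edits = 0
    words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        edits += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    return edits / words

reference = "ин кори шумо зиндагии худу рангину ширин месозад"
hypothesis = "ин кори шумо зиндагии худу ширин месозад"  # one word missing
print(100 * word_error_rate([reference], [hypothesis]))  # 12.5
```

Because insertions count as errors, WER can exceed 100% when the model produces far more words than the reference contains.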

### Define the Training Configuration

In [23]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-tg",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)
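
Two consequences of these settings can be worked out by hand. The sketch below assumes a single GPU and the default linear learning-rate schedule with warmup; both are assumptions, since neither is set explicitly in the cell above. It computes how many passes over the 6,000 training examples 4,000 steps represent, and the learning rate at a given step.

```python
def epochs_covered(max_steps, batch_size, grad_accum, n_train):
    """Number of passes over the training set for a step budget (one device assumed)."""
    return max_steps * batch_size * grad_accum / n_train

def linear_schedule_lr(step, peak_lr, warmup_steps, max_steps):
    """Linear warmup to peak_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

n_epochs = epochs_covered(max_steps=4000, batch_size=16, grad_accum=1, n_train=6000)
print(round(n_epochs, 2))  # 10.67

for step in (250, 500, 2250, 4000):
    print(step, linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=500, max_steps=4000))
```

Roughly 10.67 epochs on 6,000 synthetic utterances is a lot of repetition, so it is worth watching the gap between training and validation loss during the run.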


In [24]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=custom_data["train"],
    eval_dataset=custom_data["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)



In [25]:
processor.save_pretrained(training_args.output_dir)

[]

### Training

In [26]:
trainer.train()

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


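| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|-----------|
| 1000 | 0.2093 | 0.319725 | 31.981174 |
| 2000 | 0.0185 | 0.307487 | 27.604059 |
| 3000 | 0.0033 | 0.326666 | 26.089131 |
| 4000 | 0.0015 | 0.337926 | 25.933225 |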
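
With `load_best_model_at_end=True`, `metric_for_best_model="wer"` and `greater_is_better=False`, the checkpoint with the lowest WER is the one restored after training. The sketch below applies that selection rule to the logged values; the variable names are illustrative, and the numbers are copied from the log above.

```python
eval_log = [
    {"step": 1000, "eval_loss": 0.319725, "wer": 31.981174},
    {"step": 2000, "eval_loss": 0.307487, "wer": 27.604059},
    {"step": 3000, "eval_loss": 0.326666, "wer": 26.089131},
    {"step": 4000, "eval_loss": 0.337926, "wer": 25.933225},
]

# lower WER is better, so the best checkpoint minimizes it
best_by_wer = min(eval_log, key=lambda row: row["wer"])
best_by_loss = min(eval_log, key=lambda row: row["eval_loss"])

# relative WER reduction between the first and the best evaluation
first_wer = eval_log[0]["wer"]
relative_gain = 100 * (first_wer - best_by_wer["wer"]) / first_wer

print(best_by_wer["step"], best_by_loss["step"])
print(round(relative_gain, 1))
```

The two criteria disagree here: validation loss is lowest at step 2000 and rises afterwards while the training loss approaches zero, which suggests overfitting, yet WER keeps improving slightly through step 4000.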



TrainOutput(global_step=4000, training_loss=0.21795023451931775, metrics={'train_runtime': 13353.2029, 'train_samples_per_second': 4.793, 'train_steps_per_second': 0.3, 'total_flos': 1.846946562048e+19, 'train_loss': 0.21795023451931775, 'epoch': 10.67})

In [27]:
!cp -r ./whisper-small-tg /content/drive/MyDrive/whisper-small-tg