# Fine Tuning Whisper using LoRA

In this tutorial, we demonstrate how to fine-tune a [Whisper](https://arxiv.org/abs/2212.04356) model using `adapters`. We add [LoRA](https://docs.adapterhub.ml/methods#lora) to Whisper and use a sequence-to-sequence head on top of the model to perform audio transcription.

Our tutorial builds on [this Whisper training guide](https://huggingface.co/blog/fine-tune-whisper), which provides a detailed step-by-step walkthrough of full-model fine-tuning with Whisper. In this notebook, we focus on the minimal changes required to swap traditional full-model fine-tuning for parameter-efficient fine-tuning using `adapters`.

For more information on the Whisper Model, please visit the [Hugging Face model card](https://huggingface.co/openai/whisper-large-v3) or see the original [OpenAI Blog Post](https://openai.com/index/whisper/).

### Installation

Before we can get started, we need to ensure the proper packages are installed. Here's a breakdown of what we need:

- `adapters` and `accelerate` for efficient fine-tuning and training optimization
- `librosa` and `datasets[audio]` for audio processing and data handling
- `evaluate` and `jiwer` for metric computation and model evaluation

In [1]:
!pip install -qq jiwer evaluate librosa
!pip install -qq -U adapters datasets[audio] accelerate

In [2]:
import os
#os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(["0", "1", "2", "3"])  # use this line if you have multiple GPUs available, e.g. 4 GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

### Dataset

In this tutorial, we will be using the `mozilla-foundation/common_voice_11_0` dataset, created by the Mozilla Foundation. This dataset is a comprehensive collection of voice recordings in multiple languages, making it ideal for training and fine-tuning speech recognition models.

For comparison purposes, we will fine-tune Whisper on Hindi, a low-resource language, but you can switch to any language you like. The dataset supports numerous languages; a list of all available ones can be found [here](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0#languages), on the dataset page on Hugging Face.
If you choose a high-resource language, make sure you have enough memory available, since the dataset will be much larger.


In [3]:
language = "hindi"
language_abbr = "hi"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

### Loading the Dataset

We load the dataset's predefined splits. Because Hindi is a low-resource language, we combine the training and validation splits for training and keep the test split for evaluation.
We then remove the columns that are not needed for training.

In [4]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", language_abbr, split="train+validation")
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", language_abbr, split="test")

print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 6540
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 2894
    })
})


In [5]:
common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"]
)

print(common_voice["train"][0])

{'audio': {'path': '/home/imhof/.cache/huggingface/datasets/downloads/extracted/8fcfd9e391a57582a1ced30aaf3434aa04b2903f54adfa3c588ef97e404d8bd9/hi_train_0/common_voice_hi_26008353.mp3', 'array': array([ 5.81611368e-26, -1.48634016e-25, -9.37040538e-26, ...,
        1.06425901e-07,  4.46416450e-08,  2.61450239e-09]), 'sampling_rate': 48000}, 'sentence': 'हमने उसका जन्मदिन मनाया।'}


### Data Preprocessing

For preprocessing audio data we require:
- a feature extractor which pre-processes the raw audio-inputs
- a tokenizer which post-processes the model outputs to text format

Hugging Face `transformers` offers a class for each; alternatively, you can use the `WhisperProcessor`, which wraps both into a single class.


In [6]:
from transformers import WhisperFeatureExtractor
from transformers import WhisperTokenizer
from transformers import WhisperProcessor

model_name_or_path = "openai/whisper-small"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name_or_path)
tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, language=language, task=task)
processor = WhisperProcessor.from_pretrained(model_name_or_path, language=language, task=task)
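
To get a feel for what the feature extractor produces, the following sketch works out the shape of the log-Mel input features with plain arithmetic. The constants (30-second window, hop length of 160 samples, 80 Mel bins) are the commonly documented Whisper defaults for `openai/whisper-small` and are assumptions here, not values read from the loaded `feature_extractor`. The helper name is made up for illustration.

```python
# Assumed Whisper defaults (not read from the checkpoint config).
SAMPLING_RATE = 16_000   # Hz, expected by the feature extractor
CHUNK_LENGTH_S = 30      # every clip is padded or truncated to 30 seconds
HOP_LENGTH = 160         # samples between consecutive spectrogram frames
N_MELS = 80              # Mel bins for whisper-small


def expected_feature_shape(sampling_rate=SAMPLING_RATE,
                           chunk_length_s=CHUNK_LENGTH_S,
                           hop_length=HOP_LENGTH,
                           n_mels=N_MELS):
    """Return (n_mels, n_frames) for one padded audio clip."""
    n_samples = sampling_rate * chunk_length_s
    n_frames = n_samples // hop_length
    return n_mels, n_frames


print(expected_feature_shape())  # (80, 3000)
```

Because every clip is padded or truncated to the same window, all input features end up with the same shape regardless of how long the recording is.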

Next, we need to address a sampling rate mismatch. The Common Voice dataset typically provides audio sampled at 48 kHz, but Whisper's feature extractor expects a sampling rate of 16 kHz. To resolve this, we need to *downsample* our audio data.

In [7]:
#sample down to 16000
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [8]:
print(common_voice["train"][0])

{'audio': {'path': '/home/imhof/.cache/huggingface/datasets/downloads/extracted/8fcfd9e391a57582a1ced30aaf3434aa04b2903f54adfa3c588ef97e404d8bd9/hi_train_0/common_voice_hi_26008353.mp3', 'array': array([ 3.81639165e-17,  2.42861287e-17, -1.73472348e-17, ...,
       -1.30981789e-07,  2.63096808e-07,  4.77157300e-08]), 'sampling_rate': 16000}, 'sentence': 'हमने उसका जन्मदिन मनाया।'}
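
The effect of the cast can be checked with simple arithmetic: resampling from 48 kHz to 16 kHz keeps the clip duration but cuts the number of samples to a third. The sketch below uses a hypothetical helper, `resampled_length`, and a made-up 3-second clip; it does no actual signal processing.

```python
def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Number of samples after resampling, keeping the duration constant."""
    return int(num_samples * target_sr / orig_sr)


orig_sr, target_sr = 48_000, 16_000
num_samples = 3 * orig_sr  # hypothetical 3-second clip at 48 kHz

new_len = resampled_length(num_samples, orig_sr, target_sr)
print(new_len)              # 48000
print(new_len / target_sr)  # 3.0 seconds, the duration is unchanged
```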


Now we can create a function `prepare_dataset` to make our data ready for the model with the following steps:

1) Grab the audio data from each sample in the batch (this reloading triggers the resampling operation)
2) Use the feature extractor to compute the input features from the 1-dim audio array
3) Encode the transcriptions to label ids with the tokenizer

In [9]:
def prepare_dataset(batch):
    # accessing the audio column decodes the file and resamples it from 48 kHz to 16 kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

We then use the `dataset.map()` function to apply the preparation function to every split of the dataset.

In [10]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1)

### Define a DataCollator

We now define a DataCollator class that will be responsible for batching and preprocessing our training data.

The input features and the labels are padded separately because they need different treatment. The input features only have to be collected into a single tensor by the feature extractor. The tokenized label sequences vary in length, so the tokenizer pads them to the longest sequence in the batch, and we then replace the padding positions with -100 so that these tokens are ignored when computing the loss.

In [11]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the previous tokenization step,
        # cut it here, since it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
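
The label handling in the collator can be illustrated without any tensors. The following is a minimal pure-Python sketch of the same idea, assuming a made-up padding id of `0` and toy token ids; the real collator delegates padding to `processor.tokenizer.pad` and uses the attention mask to find the padded positions.

```python
from typing import List

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def pad_and_mask(label_seqs: List[List[int]]) -> List[List[int]]:
    """Pad every sequence to the longest one, filling with IGNORE_INDEX."""
    max_len = max(len(seq) for seq in label_seqs)
    batch = []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        # Mask by position rather than by token value, so a genuine token
        # that shares its id with the padding token is left untouched.
        batch.append(seq + [IGNORE_INDEX] * n_pad)
    return batch


print(pad_and_mask([[7, 8, 9], [5]]))
# [[7, 8, 9], [5, -100, -100]]
```

Masking by position matters for Whisper because its padding token is the same as its end-of-sequence token, so masking by token value would also hide the genuine end-of-sequence label.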

We then initialize our DataCollator so we can apply it to our dataset during the training process.

In [12]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### Evaluation Metrics

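We'll use the word error rate (WER), a metric used primarily for evaluating automatic speech recognition models. It counts the word-level substitutions, deletions and insertions needed to turn the prediction into the reference, divided by the number of words in the reference.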

You can find more information about this metric [here](https://huggingface.co/metrics/wer).

In [13]:
import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
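
For intuition about what the metric measures, here is a minimal pure-Python sketch of WER as a word-level edit distance divided by the number of reference words. It is an illustration only, with a hypothetical function name, and it is not the implementation used by `evaluate` or `jiwer`, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1.0 (or 100 when scaled as in `compute_metrics`) when the prediction contains many inserted words.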

### Initialize the adapter model

Here we set up the model using the `WhisperAdapterModel` class from `adapters`. Compared to the original guide, we only swap out the `WhisperForConditionalGeneration` class from Hugging Face `transformers`, and this single change gives us all the parameter-efficient fine-tuning capabilities.

We can still download the pre-trained checkpoint from the Hugging Face Hub via `from_pretrained()`.


In [14]:
from adapters import WhisperAdapterModel
model_name_or_path = "openai/whisper-small"

model = WhisperAdapterModel.from_pretrained(
    model_name_or_path,
)

AdapterModels are typically instantiated without predefined heads, unlike static head Hugging Face models. This design allows for flexible addition and removal of heads as needed. However, when loading from a pre-trained checkpoint that contains head weights, the `adapters` library automatically converts these weights into a "default" head and adds it to the model. This behavior explains why the 'heads' attribute of our model is not empty.

While we could have added and trained a new head alongside the LoRA adapter, we will utilize the existing "default" head and train only the newly initialized adapter.

In [15]:
model.heads

ModuleDict(
  (default): Seq2SeqLMHead(
    (0): Linear(in_features=768, out_features=51865, bias=False)
  )
)

In [16]:
model.active_head

'default'

Now we add a new LoRA adapter, which we will fine-tune instead of the Whisper model parameters.
Additionally, we call `train_adapter`, which freezes the pre-trained model weights and activates the adapter, so that only the adapter weights are updated during training.

In [17]:
import adapters

name = "whisper_LoRA"

model.add_adapter(name, config="lora")
model.train_adapter(name)

print(model.adapter_summary())

Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
whisper_LoRA             lora                884,736       0.366       1       1
--------------------------------------------------------------------------------
Full model                               241,734,912     100.000               0
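
The parameter count in the summary can be reproduced by hand. The sketch below assumes LoRA rank 8 applied to the query and value projections of every attention block, and an `openai/whisper-small` layout of 12 encoder and 12 decoder layers with hidden size 768. These values are assumptions that happen to reproduce the reported number; they are not read from the adapter config.

```python
HIDDEN_SIZE = 768        # matches in_features of the head shown above
RANK = 8                 # assumed LoRA rank
MATRICES_PER_BLOCK = 2   # assumed: query and value projections

ENCODER_LAYERS = 12      # assumed whisper-small layout
DECODER_LAYERS = 12
# Each encoder layer has self-attention; each decoder layer has
# self-attention and cross-attention.
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS

# One LoRA module adds A (rank x hidden) and B (hidden x rank).
params_per_matrix = 2 * RANK * HIDDEN_SIZE
lora_params = attention_blocks * MATRICES_PER_BLOCK * params_per_matrix

full_model_params = 241_734_912  # from the summary above
print(lora_params)                                      # 884736
print(round(100 * lora_params / full_model_params, 3))  # 0.366
```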


### Model Training

We will now define the training arguments for our model. We will be using the `Seq2SeqTrainingArguments` class from `transformers` to define the training arguments. This class is similar to the `TrainingArguments` class, but it is specifically designed for sequence-to-sequence tasks like audio transcription.

The `Seq2SeqTrainingArguments` class has a special parameter called `predict_with_generate`, which, if set to `True` will enable using the `generate()` method during the evaluation loop for producing the prediction. 
Unfortunately, we are currently not able to use this parameter due to signature mismatches of the `forward()` method of the `WhisperAdapterModel`, therefore we fall back to normal training, which also works.
(This notebook is based on the current state of the `adapters` library, version 1.0.0,  and will be updated when a corresponding fix is implemented.)

In [18]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper_finetuning",  # change to a directory name of your choice
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-3,
    warmup_steps=50,
    #predict_with_generate=True,  # currently not supported
    num_train_epochs=3, # edit this based on the number of epochs you would like to train
    eval_strategy="epoch",
    fp16=True,
    per_device_eval_batch_size=8,
    logging_steps=25,
)
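
To see how batch size, gradient accumulation and the number of GPUs interact, the following sketch computes the number of optimizer steps for a run. The function name is hypothetical and the formula is a simplification of what the trainer does. Under these assumptions, the `global_step=615` reported in the training output further down is consistent with an effective batch size of 32 (for example, 4 GPUs with the settings above), whereas a single GPU would need 2454 steps.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size,
                          grad_accum_steps, num_devices, num_epochs):
    """Approximate number of optimizer updates for a full training run."""
    batches_per_epoch = math.ceil(
        num_examples / (per_device_batch_size * num_devices)
    )
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs


# 6540 training rows, batch size 8, no accumulation, 3 epochs
print(total_optimizer_steps(6540, 8, 1, num_devices=1, num_epochs=3))  # 2454
print(total_optimizer_steps(6540, 8, 1, num_devices=4, num_epochs=3))  # 615
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size, and therefore the number of optimizer steps, roughly the same.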

Now we can initialize the `Seq2SeqAdapterTrainer` class from `adapters` to train our model. We pass in the model, tokenizer, data collator, training arguments, and the training and evaluation datasets. The `Seq2SeqAdapterTrainer` class is specifically designed for sequence-to-sequence tasks like audio transcription. Alternatively, you could use the standard `AdapterTrainer` class for other tasks.

And that's it! We are ready to start training our model.

In [19]:
from adapters import Seq2SeqAdapterTrainer

trainer = Seq2SeqAdapterTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    #compute_metrics=compute_metrics,  # currently not supported
    args=training_args,
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!



But before we start the training, let's add an initial evaluation loop to have a baseline performance of the model on Hindi before training.
For this, because we currently cannot use the `generate()` method inside the trainer, we create a custom evaluation function that computes the WER score on the test set.

In [20]:
#initial_wer = trainer.evaluate()  # currently not supported

from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np
import gc

def eval_loop():
    eval_dataloader = DataLoader(common_voice["test"], batch_size=16, collate_fn=data_collator)
    
    model.eval()
    for step, batch in enumerate(tqdm(eval_dataloader)):
        with torch.cuda.amp.autocast():
            with torch.no_grad():
                generated_tokens = (
                    model.generate(
                        input_features=batch["input_features"].to("cuda"),
                        decoder_input_ids=batch["labels"][:, :4].to("cuda"),
                        max_new_tokens=255,
                    )
                    .cpu()
                    .numpy()
                )
                labels = batch["labels"].cpu().numpy()
                labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
                decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
                decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
                metric.add_batch(
                    predictions=decoded_preds,
                    references=decoded_labels,
                )
        del generated_tokens, labels, batch
        gc.collect()
    wer = 100 * metric.compute()
    return wer

initial_wer = eval_loop()
print(f"The initial wer score is: {initial_wer}")


The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 362/362 [45:36<00:00,  7.56s/it] 

The initial wer score is: 92.50825361889444





OK, now we are ready to start training! Just run the cell below.
Afterwards, we will compare the model's performance with the initial WER score.

In [21]:
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.3087        | 0.381261        |
| 2     | 0.2245        | 0.343649        |
| 3     | 0.1752        | 0.336221        |




TrainOutput(global_step=615, training_loss=0.3448017062210455, metrics={'train_runtime': 4402.9285, 'train_samples_per_second': 4.456, 'train_steps_per_second': 0.14, 'total_flos': 5.6870418235392e+18, 'train_loss': 0.3448017062210455, 'epoch': 3.0})

Training will take approximately 1-4 hours depending on your GPU and the batch size you have chosen. If you encounter a CUDA out-of-memory error, the batch size is too large for your GPU. In this case, reduce `per_device_train_batch_size` in the training arguments and increase `gradient_accumulation_steps` accordingly.

Let's now evaluate the model on the test set again and compute the WER score.

In [22]:
post_training_wer = eval_loop()
print(f"The wer score after training is: {post_training_wer}")

100%|██████████| 362/362 [29:22<00:00,  4.87s/it]

The wer score after training is: 40.22263607889613





After training, our WER score is 40.2, a significant improvement over the initial WER score of 92.5.
We did not match full fine-tuning in the original notebook, which reaches a WER score of 32.1. However:
- We only trained about 0.37% of the parameters that full-model fine-tuning updates
- We only trained for 1.5 hours, compared to 8 hours for full-model fine-tuning in the original notebook

This makes it a great result for parameter-efficient fine-tuning with adapters!

Feel free to experiment with the training arguments, e.g., the learning rate or the number of epochs and see if you can improve the model's performance further.
With more training time you will most likely reach a performance comparable to the original notebook.
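
To put the two scores in relation, the small calculation below derives the absolute and relative WER reduction from the values printed by the evaluation loop. The variable names for the two reductions are chosen here for illustration.

```python
initial_wer = 92.50825361889444        # printed before training
post_training_wer = 40.22263607889613  # printed after training

absolute_reduction = initial_wer - post_training_wer
relative_reduction = 100 * absolute_reduction / initial_wer

print(f"Absolute reduction: {absolute_reduction:.1f} WER points")  # 52.3
print(f"Relative reduction: {relative_reduction:.1f}%")            # 56.5%
```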

If you want to save your model on the Hugging Face Hub you need to first sign in via the cell below.

In [23]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [24]:
model.push_adapter_to_hub(
    "whisper",  # name of the Hub repository to push to
    name,  # the adapter we added and trained above ("whisper_LoRA")
    datasets_tag="mozilla-foundation/common_voice_11_0"
)

You can also save your adapter locally by using the `save_adapter` function.

In [25]:
# Define the directory to save the adapter to
save_directory = "./my_model_directory"
# Save only the adapter weights and configuration
model.save_adapter(save_directory, name)

### Using AutomaticSpeechRecognitionPipeline

Now that we have successfully fine-tuned our Whisper model using LoRA, it's time to put it to the test! In this section, we'll demonstrate how to use the `AutomaticSpeechRecognitionPipeline` to transcribe audio files into text using our newly trained model.

First, let's create the speech recognition pipeline.

In [26]:
from transformers import AutomaticSpeechRecognitionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Create the pipeline
pipe = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device
)

Now we'll create an inference function that takes a raw audio sample and the pipeline, and returns the audio transcription.

In [27]:
import numpy as np
from scipy import signal
import torch

def transcribe_audio_sample(audio_sample, pipe, processor, language, task):
    """
    Transcribe a single unprocessed audio sample from the Common Voice dataset.
    
    :param audio_sample: A single audio sample from the Common Voice dataset
    :param pipe: Pre-initialized AutomaticSpeechRecognitionPipeline
    :param processor: Pre-initialized WhisperProcessor
    :param language: Language code for transcription
    :param task: Task for the model (e.g., "transcribe")
    :return: Transcribed text
    """
    
    # Function to resample audio using scipy
    def resample_audio(audio_array, orig_sr, target_sr):
        resampled = signal.resample(audio_array, int(len(audio_array) * target_sr / orig_sr))
        return resampled
    
    # Resample the audio
    original_sr = audio_sample['audio']['sampling_rate']
    target_sr = processor.feature_extractor.sampling_rate  # Whisper expects 16kHz
    
    if original_sr != target_sr:
        resampled_audio = resample_audio(audio_sample['audio']['array'], original_sr, target_sr)
    else:
        resampled_audio = audio_sample['audio']['array']
    
    # Get forced decoder IDs - Used to ensure that the model generates output in a specific language and for a specific task
    forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)
    
    # Transcribe
    with torch.cuda.amp.autocast():
        result = pipe(
            resampled_audio,
            generate_kwargs={"forced_decoder_ids": forced_decoder_ids},
            max_new_tokens=256
        )["text"]
    
    return result

Next, we need some data to test our model. For simplicity, let's just reuse the test split of Common Voice.

In [28]:
from datasets import Audio, DatasetDict, load_dataset

# Reload the raw test split (the copy used for evaluation has already been preprocessed)
inference_test = DatasetDict()
inference_test["test"] = load_dataset("mozilla-foundation/common_voice_11_0", language_abbr, split="test")

# Select a subset of samples
inf_set = inference_test["test"].select(range(0,32))

Finally, let's transcribe a sample and print the result!

In [29]:
transcription = transcribe_audio_sample(inf_set[2], pipe, processor, language, task)
print(f"Transcription: {transcription}")



Transcription: वीराट कोली के लिए बेहत्रिन माउका
