# Fine Tuning Whisper using Adapters

In this tutorial, we demonstrate how to fine-tune a [Whisper](https://arxiv.org/abs/2212.04356) model using adapters. We will add [LoRA](https://docs.adapterhub.ml/methods#lora) to Whisper and put a sequence-to-sequence head on top of the model so that we can do speech recognition. This tutorial is built on this [Whisper PEFT-LoRA example notebook](https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb).

For more information on the Whisper Model, please visit the [Hugging Face model card](https://huggingface.co/openai/whisper-large-v3) or see the original [OpenAI Blog Post](https://openai.com/index/whisper/).

### Installation

Before we can get started with the model, we need to make sure the required packages are installed: `accelerate`, `bitsandbytes`, and `datasets`, along with the `adapters` library and several speech recognition libraries.

In [1]:
!pip install librosa
# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install "evaluate>=0.3.0"
!pip install jiwer
!pip install gradio
!pip install -qq -U adapters accelerate bitsandbytes datasets


### Datasets Configuration

In this tutorial, we will be using the `mozilla-foundation/common_voice_11_0` dataset. In this cell, we select the CUDA device so that training runs on the GPU and set the dataset configuration. You can change the config below to choose which dataset and language the adapter is trained on. More information on the Common Voice dataset can be found [here](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0).

In [2]:
# Select CUDA device index
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_name_or_path = "openai/whisper-tiny"
language = "cantonese"
language_abbr = "zh-HK"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

### Loading the Dataset

We load the dataset and split it into its respective train and test sets. We then remove some of the columns as they are not needed for training.

In [3]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

# Load the train and test splits using the configuration defined above.
# `token=True` uses your saved Hugging Face login token (it replaces the deprecated `use_auth_token`).
common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train", token=True)
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test", token=True)

print(common_voice)



DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 8423
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 5591
    })
})


In [4]:
common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"]
)

print(common_voice["train"][0])

{'audio': {'path': 'C:\\Users\\Jackson\\.cache\\huggingface\\datasets\\downloads\\extracted\\38272d4d8a5becb490327bdb81aef3d7a11ae9499ba16e02965a27567411ad93\\zh-HK_train_0/common_voice_zh-HK_22942304.mp3', 'array': array([ 0.00000000e+00,  4.30017540e-13,  6.87821111e-13, ...,
       -1.39293297e-06, -6.22257721e-06, -1.14162267e-05]), 'sampling_rate': 48000}, 'sentence': '才能勇往直前'}


### Data Preprocessing

The feature extractor, tokenizer, and processor loaded below handle data preprocessing and are specific to the `Whisper` model class.

In [5]:
from transformers import WhisperFeatureExtractor, WhisperProcessor, WhisperTokenizer

# Converts raw audio arrays into log-Mel spectrogram input features
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name_or_path)

# Converts transcripts into token ids for the chosen language and task
tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, language=language, task=task)

# Bundles the feature extractor and the tokenizer in a single object
processor = WhisperProcessor.from_pretrained(model_name_or_path, language=language, task=task)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



This section is dedicated to pre-processing the dataset we loaded into something that the model can use to train and learn from. 

The `Whisper` model expects audio sampled at 16,000 Hz, while the audio in the dataset is sampled at 48,000 Hz.

In [6]:
# Downsample the audio to 16 kHz

from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [7]:
print(common_voice["train"][0])

{'audio': {'path': 'C:\\Users\\Jackson\\.cache\\huggingface\\datasets\\downloads\\extracted\\38272d4d8a5becb490327bdb81aef3d7a11ae9499ba16e02965a27567411ad93\\zh-HK_train_0/common_voice_zh-HK_22942304.mp3', 'array': array([ 5.45696821e-12,  2.72848411e-12,  3.63797881e-12, ...,
        1.48210138e-05,  9.73203896e-07, -4.09249424e-06]), 'sampling_rate': 16000}, 'sentence': '才能勇往直前'}
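
Resampling changes how many samples represent each second of audio, not the duration of the clip. As a rough illustration, the expected array length after the cast can be estimated with simple arithmetic. The helper name `resampled_length` and the 3-second clip are hypothetical, and a real resampler also applies low-pass filtering, which this sketch ignores.

```python
def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Estimate the number of samples after resampling (duration is preserved)."""
    # Integer arithmetic with rounding to the nearest sample
    return (num_samples * target_sr + orig_sr // 2) // orig_sr


# A hypothetical 3-second clip recorded at 48 kHz
clip_len_48k = 3 * 48_000
clip_len_16k = resampled_length(clip_len_48k, orig_sr=48_000, target_sr=16_000)

print(clip_len_16k)  # 48000 samples, which is still 3 seconds at 16 kHz
```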


We now prepare a function `prepare_dataset` that takes in one sample at a time and processes its input and label. The argument is named `batch`, but `map` is called without `batched=True`, so it holds a single sample.

In `prepare_dataset` we:
1) Grab the audio data from the sample
2) Create a new column named `input_features` that contains the features extracted by applying the `WhisperFeatureExtractor` to the audio data
3) Create a new column called `labels` that contains the tokenized sentence

In [8]:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
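
For orientation, the shape of each `input_features` entry can be worked out by hand. The sketch below assumes Whisper's usual feature-extractor defaults (30-second windows, a hop of 160 samples, and 80 Mel bins for `whisper-tiny`). These values are assumptions here; they can be checked against the `chunk_length`, `hop_length`, and `feature_size` attributes of `feature_extractor`.

```python
SAMPLING_RATE = 16_000  # Hz, the rate Whisper expects
CHUNK_LENGTH_S = 30     # assumed: audio is padded or truncated to 30 seconds
HOP_LENGTH = 160        # assumed: samples between successive spectrogram frames
N_MELS = 80             # assumed: number of Mel bins for whisper-tiny

n_samples = SAMPLING_RATE * CHUNK_LENGTH_S  # samples in one padded window
n_frames = n_samples // HOP_LENGTH          # spectrogram frames in one window

print((N_MELS, n_frames))  # (80, 3000)
```

Under these assumptions every sample, whatever its original duration, yields input features of the same shape.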

We then use the built-in `map` function to apply the pre-processing function to the whole dataset before passing it to the model.

In [9]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1)

Map: 100%|██████████| 8423/8423 [04:02<00:00, 34.80 examples/s]  
Map: 100%|██████████| 5591/5591 [02:23<00:00, 39.03 examples/s]


### Initializing the DataCollator

We now define a DataCollator class that will be responsible for batching and preprocessing our speech-to-text data.

In the DataCollator, the input features and the tokenized labels are padded separately, since they have different lengths and need different padding methods. The padded positions in the labels are then replaced with -100 so that they are ignored when the loss is computed during training.

In [10]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here, as it is appended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
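
To illustrate the label-masking step in isolation, here is a minimal pure-Python sketch that pads label sequences to a common length and fills the padded positions with -100. The function name `pad_and_mask_labels` and the token ids are hypothetical; the collator above does the equivalent with the tokenizer's `pad` method and `masked_fill`.

```python
from typing import List

IGNORE_INDEX = -100  # label value that is skipped when the loss is computed


def pad_and_mask_labels(label_ids: List[List[int]]) -> List[List[int]]:
    """Right-pad every label sequence to the longest one, using IGNORE_INDEX as filler."""
    max_len = max(len(ids) for ids in label_ids)
    return [ids + [IGNORE_INDEX] * (max_len - len(ids)) for ids in label_ids]


# Two hypothetical label sequences of different lengths
batch_labels = [[5, 6, 7], [8]]
print(pad_and_mask_labels(batch_labels))  # [[5, 6, 7], [8, -100, -100]]
```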

We then initialize our DataCollator so we can apply it to our dataset during the training process.

In [11]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### Evaluation Metrics

We'll use the word error rate (WER), a metric used primarily to evaluate automatic speech recognition models. For more information, see the WER [docs](https://huggingface.co/metrics/wer).

In [12]:
import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
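
To make the metric concrete, here is a minimal sketch of WER as the word-level edit distance divided by the number of reference words. It is a simplified stand-in for what the `wer` metric computes, not the library's implementation, and the function names are hypothetical. It also shows a character-level variant, because text written without spaces (as Cantonese usually is) is treated by a whitespace split as a single word, which is worth keeping in mind when reading WER scores for this dataset.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over whitespace-separated words, normalized by reference length."""
    ref_words: List[str] = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Same idea at the character level, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


print(round(word_error_rate("the cat sat", "the cat sit"), 3))  # 0.333
# One wrong character in an unsegmented sentence counts as a whole wrong "word"
print(word_error_rate("才能勇往直前", "才能勇往之前"))            # 1.0
print(round(char_error_rate("才能勇往直前", "才能勇往之前"), 3))  # 0.167
```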

### Initializing the `Whisper` adapters model

Here we will set up the `adapters` `Whisper` model. The code used to initialize the model looks almost identical to the equivalent Hugging Face `transformers` code. This is because `adapters` acts as an add-on to the `transformers` library, so users familiar with `transformers` can easily combine their own code with `adapters` code. The cool thing here is that we can load the `WhisperConfig` directly from `transformers` and use it for our `Whisper` model specifications.

From the `adapters` module we then import the `WhisperAdapterModel` and initialize it using the config we previously imported.

In [13]:
from transformers import WhisperConfig
from adapters import WhisperAdapterModel

config = WhisperConfig.from_pretrained(
    model_name_or_path,
)
model = WhisperAdapterModel.from_pretrained(
    model_name_or_path,
    config=config,
)

We now add an untrained adapter named `whisper_adapter` that we will fine-tune instead of the `Whisper` model parameters. Afterwards we add a sequence-to-sequence language modeling head so we can generate tokens from the `Whisper` model.

Once these two components are added, we call the `train_adapter` method so that `adapters` knows which adapter weights need to be updated during training.

In [14]:
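import adapters
# Initialize the LoRA config to use as an adapter
from adapters import LoRAConfig

# Named `lora_config` so it does not shadow the `WhisperConfig` loaded earlier
lora_config = LoRAConfig(
    selfattn_lora=True, intermediate_lora=True, output_lora=True,  # where LoRA is injected
    attn_matrices=["q", "k", "v"],  # attention projections that receive LoRA
    alpha=16, r=64, dropout=0.1  # scaling factor, rank, and LoRA dropout
)
model.add_adapter("whisper_adapter", config=lora_config)
model.add_seq2seq_lm_head("whisper_adapter")
model.train_adapter("whisper_adapter")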

print(model.adapter_summary())

Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
whisper_adapter          lora              3,735,552       9.893       1       1
--------------------------------------------------------------------------------
Full model                                37,760,640     100.000               0
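
The adapter's parameter count in the summary above can be reproduced by hand. A LoRA pair on a weight matrix of shape `d_out x d_in` adds `r * (d_in + d_out)` parameters. The sketch below assumes the `whisper-tiny` dimensions (hidden size 384, feed-forward size 1536, 4 encoder and 4 decoder layers) and assumes that LoRA is applied to the query, key, and value projections of both the self-attention and the cross-attention blocks. These are assumptions; they are chosen because they match the reported total.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * (d_in + d_out)


# Assumed whisper-tiny dimensions
D_MODEL, D_FFN = 384, 1536
ENC_LAYERS, DEC_LAYERS = 4, 4
R, ALPHA = 64, 16  # values from the LoRAConfig above

attn = 3 * lora_params(D_MODEL, D_MODEL, R)  # q, k, v in one attention block
ffn = lora_params(D_MODEL, D_FFN, R) + lora_params(D_FFN, D_MODEL, R)  # intermediate + output

encoder = ENC_LAYERS * (attn + ffn)      # self-attention + feed-forward
decoder = DEC_LAYERS * (2 * attn + ffn)  # self-attention + cross-attention + feed-forward
total = encoder + decoder

print(total)      # 3735552
print(ALPHA / R)  # 0.25, the alpha / r scaling in the standard LoRA formulation
```

Dividing this total by the full model size reported in the summary (37,760,640) gives roughly 9.89%, in line with the `%Param` column.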


In [15]:
# Initialize the training arguments directly from Hugging Face `transformers`
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="temp",  # change to a directory name of your choice
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-3,
    warmup_steps=50,
    num_train_epochs=1, #edit this based on the number of epochs you would like to train
    evaluation_strategy="epoch",
    fp16=True,
    per_device_eval_batch_size=8,
    generation_max_length=128,
    logging_steps=25,
    remove_unused_columns=False,  # keep every column produced by `prepare_dataset`
    label_names=["labels"],  # batch key that holds the target token ids
)



In [16]:
# Use small subsets (500 training and 100 test examples) to keep this demo fast
subset1 = common_voice["train"].select(range(500))
subset2 = common_voice["test"].select(range(100))

We will import the sequence-to-sequence adapter trainer from `adapters`. It works much like the `Trainer` class in Hugging Face `transformers`, saving you the trouble of writing any extra code.

In [17]:
from adapters import Seq2SeqAdapterTrainer

trainer = Seq2SeqAdapterTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=subset1,
    eval_dataset=subset2,
    args=training_args,
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!



In [18]:
trainer.train()

{'loss': 3.2212, 'grad_norm': 3.6394524574279785, 'learning_rate': 0.00044, 'epoch': 0.4}
{'loss': 1.1574, 'grad_norm': 3.3313405513763428, 'learning_rate': 0.00094, 'epoch': 0.79}
{'eval_loss': 1.3545207977294922, 'eval_runtime': 18.6904, 'eval_samples_per_second': 5.35, 'eval_steps_per_second': 0.696, 'epoch': 1.0}
{'train_runtime': 150.2277, 'train_samples_per_second': 3.328, 'train_steps_per_second': 0.419, 'train_loss': 1.980199102371458, 'epoch': 1.0}

TrainOutput(global_step=63, training_loss=1.980199102371458, metrics={'train_runtime': 150.2277, 'train_samples_per_second': 3.328, 'train_steps_per_second': 0.419, 'total_flos': 1.499904e+16, 'train_loss': 1.980199102371458, 'epoch': 1.0})
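
The `global_step=63` in the output follows from the subset size and the batch size. The sketch below assumes a single device and `gradient_accumulation_steps=1`, so that every batch is one optimizer step; the helper name `num_batches` is hypothetical.

```python
import math


def num_batches(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to cover the dataset, counting the last partial batch."""
    return math.ceil(num_examples / batch_size)


train_steps = num_batches(500, 8)   # 500 training examples, per-device batch size 8
eval_batches = num_batches(100, 8)  # 100 evaluation examples, batch size 8

print(train_steps)   # 63
print(eval_batches)  # 13
```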

### Model Inference

We can use the cell below to see how well our fine-tuned `Whisper` model performs. We use the WER metric we initialized earlier to measure the model's performance. To learn more about WER, see the [WER metric page](https://huggingface.co/spaces/evaluate-metric/wer).

In [19]:
#model inference

from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np
import gc

eval_dataloader = DataLoader(subset2, batch_size=8, collate_fn=data_collator)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.autocast(device_type="cuda"):  # replaces the deprecated torch.cuda.amp.autocast()
        with torch.no_grad():
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to(device),
                    decoder_input_ids=batch["labels"][:, :4].to(device),
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
            labels = batch["labels"].cpu().numpy()
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            metric.add_batch(
                predictions=decoded_preds,
                references=decoded_labels,
            )
    del generated_tokens, labels, batch
    gc.collect()
wer = 100 * metric.compute()
print(f"{wer=}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 13/13 [01:04<00:00,  4.93s/it]


wer=98.0
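
The `decoder_input_ids=batch["labels"][:, :4]` argument in the loop above passes the first four label tokens to `generate` as the decoder prompt. The sketch below assumes that the Whisper tokenizer starts every label sequence with four special tokens (start of transcript, language, task, and no-timestamps). The token strings are readable stand-ins rather than real token ids, and real tokenization does not necessarily produce one token per character.

```python
# Hypothetical, human-readable stand-ins for one tokenized label sequence
labels = [
    "<|startoftranscript|>", "<|lang|>", "<|transcribe|>", "<|notimestamps|>",
    "才", "能", "勇", "往", "直", "前",
    "<|endoftext|>",
]

PREFIX_LEN = 4  # assumed number of special prefix tokens

decoder_prompt = labels[:PREFIX_LEN]        # what the model is given to start decoding
target_text_tokens = labels[PREFIX_LEN:-1]  # what the model is expected to generate

print(decoder_prompt)
print("".join(target_text_tokens))  # 才能勇往直前
```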


If you would like to save your model and/or publish it to the Hugging Face Hub, sign in to `huggingface_hub` via the cell below.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

You can save your adapter locally with the `save_adapter` function, or upload it directly to the Hugging Face Hub with `push_adapter_to_hub`. When uploading, make sure you include a tag parameter such as `datasets_tag` (used below) or `adapterhub_tag`.

In [None]:
# Define the directory to save the adapter in
save_directory = "./my_model_directory"
# Save the adapter weights and configuration
model.save_adapter(save_directory, "whisper_adapter")

In [None]:
model.push_adapter_to_hub(
    "whisper",
    "whisper_adapter",
    datasets_tag="mozilla-foundation/common_voice_11_0"
)