# Fine Tuning Whisper using Adapters

In this tutorial, we demonstrate how to fine-tune a [Whisper](https://arxiv.org/abs/2212.04356) model using the `adapters` framework. We add [LoRA adapters](https://docs.adapterhub.ml/methods#lora) to the Whisper model while keeping the pre-trained model weights frozen. We then add a sequence-to-sequence head on top of the model so that it can perform speech recognition.

For more information on the Whisper model, see the Hugging Face model card: https://huggingface.co/openai/whisper-large-v3

### Installation

Before getting started with the model, make sure the required packages are installed: `accelerate`, `bitsandbytes` and `datasets`, along with the `adapters` library and the speech recognition libraries listed below.

In [2]:
!pip install librosa
# quote the version specifier so the shell does not treat ">" as an output redirect
!pip install "evaluate>=0.3.0"
!pip install jiwer
!pip install gradio
!pip install -qq -U adapters accelerate bitsandbytes datasets

### Datasets Configuration

In this tutorial, we use the `mozilla-foundation/common_voice_11_0` dataset. In the cell below, we select the CUDA device so that training runs on the GPU and set the dataset configuration. You can change this configuration to train the adapter on a different dataset or language. More information on the Common Voice dataset can be found [here](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0).

In [4]:
# Select CUDA device index
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_name_or_path = "openai/whisper-large-v2"
language = "cantonese"
language_abbr = "zh-HK"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

### Loading the Dataset

We load the train and test splits of the dataset, then remove the columns that are not needed for training.

In [5]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

# reuse the dataset name and language code defined in the configuration cell
common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train", use_auth_token=True)
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test", use_auth_token=True)

print(common_voice)



DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 8423
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 5591
    })
})


In [None]:
common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"]
)

print(common_voice["train"][0])

### Load Feature Extractor, Tokenizer and Processor

The feature extractor, tokenizer and processor below are specific to the `Whisper` model class and are required to prepare its inputs and labels.

In [7]:
from transformers import WhisperFeatureExtractor, WhisperProcessor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name_or_path)
tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, language=language, task=task)
processor = WhisperProcessor.from_pretrained(model_name_or_path, language=language, task=task)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Data Preprocessing

This section converts the dataset we loaded into a form the model can train on.

The `Whisper` model expects audio sampled at 16,000 Hz, while the audio in the dataset is sampled at 48,000 Hz.

In [8]:
# resample the audio to 16 kHz

from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [None]:
print(common_voice["train"][0])
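As a rough sanity check on the resampling step, the arithmetic below shows how the sample count changes when going from 48 kHz to 16 kHz, and how many log-Mel frames a padded clip produces. The 30-second window, 160-sample hop and 80 Mel bins are the values commonly documented for the `whisper-small` and `whisper-large-v2` feature extractors. Treat them as assumptions and confirm them against `feature_extractor` in your own session. The helper name `resampled_length` is hypothetical.

```python
SOURCE_RATE = 48_000  # Hz, sampling rate of the Common Voice audio
TARGET_RATE = 16_000  # Hz, sampling rate expected by Whisper


def resampled_length(num_samples: int, source_rate: int, target_rate: int) -> int:
    """Approximate number of samples after resampling (rounded to nearest)."""
    return round(num_samples * target_rate / source_rate)


# Assumed feature-extractor settings (not taken from this notebook)
CHUNK_SECONDS = 30   # clips are padded or truncated to this duration
HOP_LENGTH = 160     # samples between consecutive spectrogram frames
NUM_MEL_BINS = 80    # Mel filter banks per frame

five_seconds_48k = 5 * SOURCE_RATE
five_seconds_16k = resampled_length(five_seconds_48k, SOURCE_RATE, TARGET_RATE)

padded_samples = CHUNK_SECONDS * TARGET_RATE
num_frames = padded_samples // HOP_LENGTH

print(five_seconds_48k, "->", five_seconds_16k)  # 240000 -> 80000
print("input_features shape per example:", (NUM_MEL_BINS, num_frames))  # (80, 3000)
```

Under these assumptions, every example ends up with the same `input_features` shape regardless of the clip's original duration.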

We now define a function `prepare_dataset` that takes a single example from the dataset and computes its model inputs and labels.

In `prepare_dataset` we:
1) Grab the audio data from the example
2) Create a new column named `input_features` that contains the features extracted by calling the `WhisperFeatureExtractor` on the audio data
3) Create a new column called `labels` that contains the tokenized sentence

In [10]:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

We then use the built-in `map` function to apply the pre-processing function to the whole dataset before passing it to the model.

In [11]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1)

### Initializing the DataCollator

We now define a data collator that assembles individual examples into batches for sequence-to-sequence training.

The input features and the tokenized labels have different lengths and need different padding methods, so the collator pads them separately: the input features through the feature extractor, and the label `input_ids` through the tokenizer. After padding, every sequence in a batch has the same length. The padded label positions are then replaced with -100 so that they are ignored when the loss is computed.

In [12]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was prepended in the previous tokenization step,
        # cut it here as it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

We then initialize our DataCollator so we can apply it to our dataset before training.

In [13]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
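To make the label-masking step concrete, here is a minimal pure-Python sketch of what the collator does to the labels: right-pad to the longest sequence in the batch, build an attention mask, then replace the padded positions with -100. The function name `pad_and_mask_labels`, the toy token IDs and the pad ID of 0 are made up for illustration; the real collator delegates padding to `processor.tokenizer.pad` and works on tensors.

```python
from typing import List

IGNORE_INDEX = -100  # default `ignore_index` of PyTorch's cross-entropy loss


def pad_and_mask_labels(label_seqs: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Right-pad label sequences to the batch maximum, then mask the padding."""
    max_len = max(len(seq) for seq in label_seqs)

    padded, attention_mask = [], []
    for seq in label_seqs:
        num_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_token_id] * num_pad)
        attention_mask.append([1] * len(seq) + [0] * num_pad)

    # Mirrors `masked_fill(attention_mask.ne(1), -100)` in the collator
    return [
        [token if keep == 1 else IGNORE_INDEX for token, keep in zip(row, mask)]
        for row, mask in zip(padded, attention_mask)
    ]


toy_labels = [[11, 12, 13, 14], [21, 22]]  # hypothetical token IDs
masked = pad_and_mask_labels(toy_labels, pad_token_id=0)
print(masked)  # [[11, 12, 13, 14], [21, 22, -100, -100]]
```

Masking through the attention mask rather than by comparing against the pad ID keeps genuine label tokens intact even if one of them happens to share the pad token's ID.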

### Evaluation Metrics

We use the word error rate (WER), a metric commonly used to evaluate automatic speech recognition models. For more information, see the WER [docs](https://huggingface.co/metrics/wer).

In [14]:
import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
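For intuition about what the metric measures, the sketch below computes WER for a single utterance as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. This is an illustrative re-implementation, not the code behind `evaluate.load("wer")`. The function names are hypothetical, and splitting on whitespace is an assumption of this sketch.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * example_wer:.2f}")  # 33.33
```

As in `compute_metrics`, the ratio is multiplied by 100 to report a percentage. Because insertions count as errors, WER can exceed 100.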

In [15]:
# Temporary workaround: switch to the local `adapters/src` checkout so the imports below resolve; skip this cell if `adapters` is installed via pip
import os
%cd ..
%cd src
os.getcwd()


d:\Documents\Github\adapters
d:\Documents\Github\adapters\src


'd:\\Documents\\Github\\adapters\\src'

### Initializing the `Whisper` adapters model

Here we initialize the `adapters` version of the `Whisper` model.

`adapters` acts as a wrapper around the `transformers` library, so we can use `WhisperConfig` directly for the model specification. We then import `WhisperAdapterModel` from the `adapters` module and initialize it with the config we just loaded.

In [16]:
from transformers import WhisperConfig
from adapters import WhisperAdapterModel

# Note: the model is loaded from "openai/whisper-small", not from `model_name_or_path`
config = WhisperConfig.from_pretrained(
    "openai/whisper-small",
)
model = WhisperAdapterModel.from_pretrained(
    "openai/whisper-small",
    config=config,
)

Some weights of WhisperAdapterModel were not initialized from the model checkpoint at openai/whisper-small and are newly initialized: ['heads.default.0.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# Initialize the LoRA config to use as an adapter

import adapters
from adapters import LoRAConfig

adapters.init(model)

# use a separate name so the WhisperConfig loaded earlier is not overwritten
lora_config = LoRAConfig(
    selfattn_lora=True, intermediate_lora=True, output_lora=True,
    attn_matrices=["q", "k", "v"],
    alpha=16, r=64, dropout=0.1
)
model.add_adapter("whisper_adapter", config=lora_config)
model.add_seq2seq_lm_head("whisper_adapter")
model.train_adapter("whisper_adapter")

print(model.adapter_summary())

Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
whisper_adapter          lora             22,413,312       9.272       1       1
--------------------------------------------------------------------------------
Full model                               241,734,912     100.000               0
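The adapter parameter count in the summary can be reproduced with a short back-of-the-envelope calculation. A LoRA pair on a `d_in x d_out` linear layer adds `r * (d_in + d_out)` parameters. The sketch below assumes the usual `whisper-small` dimensions (hidden size 768, feed-forward size 3072, 12 encoder and 12 decoder layers) and assumes that the q/k/v LoRA matrices are attached to both self-attention and cross-attention in the decoder. Neither assumption is stated in this notebook; they are chosen because the arithmetic then matches the reported number.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed whisper-small dimensions (not read from the model in this sketch)
D_MODEL = 768
D_FFN = 3072
ENCODER_LAYERS = 12
DECODER_LAYERS = 12
RANK = 64  # `r` in the LoRAConfig above

# attn_matrices=["q", "k", "v"]: three LoRA pairs per attention block
attn_block = 3 * lora_params(D_MODEL, D_MODEL, RANK)
# intermediate_lora and output_lora: one pair on each feed-forward projection
ffn_block = lora_params(D_MODEL, D_FFN, RANK) + lora_params(D_FFN, D_MODEL, RANK)

encoder_layer = attn_block + ffn_block      # self-attention + feed-forward
decoder_layer = 2 * attn_block + ffn_block  # self- and cross-attention (assumption) + feed-forward

total_lora = ENCODER_LAYERS * encoder_layer + DECODER_LAYERS * decoder_layer

FULL_MODEL_PARAMS = 241_734_912  # "Full model" row of the summary above
percent = round(100 * total_lora / FULL_MODEL_PARAMS, 3)

print(total_lora, percent)  # 22413312 9.272
```

Note that the rank dominates the count: halving `r` would halve the number of trainable adapter parameters under the same assumptions.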


In [18]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="temp",  # change to a repo name of your choice
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  # double this each time you halve the batch size
    learning_rate=1e-3,
    warmup_steps=50,
    num_train_epochs=1,  # adjust to the number of epochs you would like to train for
    evaluation_strategy="epoch",
    fp16=True,
    per_device_eval_batch_size=8,
    generation_max_length=128,
    logging_steps=25,
    remove_unused_columns=False,  # keep dataset columns that do not match the model's forward signature
    label_names=["labels"],  # tell the Trainer which batch key holds the labels
)
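The batch-size settings determine how many steps one epoch takes. The sketch below is a simplified calculation for a single device, using the split sizes printed when the dataset was loaded. The helper names are hypothetical, and the Trainer's exact rounding when `gradient_accumulation_steps` is greater than 1 may differ from this.

```python
import math


def batches_per_epoch(num_examples: int, per_device_batch_size: int, num_devices: int = 1) -> int:
    """Batches needed to see every example once (the last batch may be partial)."""
    return math.ceil(num_examples / (per_device_batch_size * num_devices))


def effective_batch_size(per_device_batch_size: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device_batch_size * grad_accum_steps * num_devices


train_batches = batches_per_epoch(8423, 8)  # train split: 8423 rows
eval_batches = batches_per_epoch(5591, 8)   # test split: 5591 rows
print(train_batches, eval_batches)  # 1053 699

# Halving the batch size while doubling accumulation keeps the effective size unchanged
print(effective_batch_size(8, 1), effective_batch_size(4, 2))  # 8 8
```

This is the reasoning behind the comment on `gradient_accumulation_steps`: if memory forces a smaller per-device batch, increasing accumulation by the same factor keeps each optimizer update based on the same number of examples.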

In [19]:
from adapters import AdapterTrainer

trainer = AdapterTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    args=training_args,
    # compute_metrics=compute_metrics,
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

In [20]:
trainer.train()

  0%|          | 0/1053 [00:00<?, ?it/s]

  attn_output = torch.nn.functional.scaled_dot_product_attention(


{'loss': 1.9546, 'grad_norm': 2.2400691509246826, 'learning_rate': 0.0005, 'epoch': 0.02}
{'loss': 0.627, 'grad_norm': 2.3768436908721924, 'learning_rate': 0.001, 'epoch': 0.05}
{'loss': 0.7108, 'grad_norm': 2.8131179809570312, 'learning_rate': 0.0009750747756729811, 'epoch': 0.07}
{'loss': 0.7192, 'grad_norm': 2.126743793487549, 'learning_rate': 0.0009501495513459621, 'epoch': 0.09}
{'loss': 0.7509, 'grad_norm': 3.2347044944763184, 'learning_rate': 0.000926221335992024, 'epoch': 0.12}
{'loss': 0.7545, 'grad_norm': 3.0252304077148438, 'learning_rate': 0.000901296111665005, 'epoch': 0.14}
{'loss': 0.7089, 'grad_norm': 2.927863121032715, 'learning_rate': 0.0008763708873379861, 'epoch': 0.17}
{'loss': 0.7539, 'grad_norm': 2.5152058601379395, 'learning_rate': 0.0008514456630109671, 'epoch': 0.19}
{'loss': 0.6881, 'grad_norm': 2.572906732559204, 'learning_rate': 0.000827517447657029, 'epoch': 0.21}
{'loss': 0.7325, 'grad_norm': 3.0144991874694824, 'learning_rate': 0.0008025922233300101, 'ep

  0%|          | 0/699 [00:00<?, ?it/s]

{'eval_loss': 0.34885716438293457, 'eval_runtime': 4764.7537, 'eval_samples_per_second': 1.173, 'eval_steps_per_second': 0.147, 'epoch': 1.0}
{'train_runtime': 31562.1048, 'train_samples_per_second': 0.267, 'train_steps_per_second': 0.033, 'train_loss': 0.6174209246947895, 'epoch': 1.0}


TrainOutput(global_step=1053, training_loss=0.6174209246947895, metrics={'train_runtime': 31562.1048, 'train_samples_per_second': 0.267, 'train_steps_per_second': 0.033, 'total_flos': 2.7032376545496e+18, 'train_loss': 0.6174209246947895, 'epoch': 1.0})
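The `learning_rate` values in the log are consistent with a linear warmup followed by a linear decay, which is the default scheduler type in the `transformers` training arguments. The following sketch re-implements that shape in plain Python with the settings used above (`learning_rate=1e-3`, `warmup_steps=50`, 1053 total steps). The function name is hypothetical and this is not the library's own code.

```python
def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BASE_LR = 1e-3
WARMUP_STEPS = 50
TOTAL_STEPS = 1053

# With logging_steps=25, the first log lines correspond to steps 25, 50, 75 and 100
for step in (25, 50, 75, 100):
    print(step, linear_schedule_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS))
```

The first four values agree with the first four log lines (0.0005, 0.001, then roughly 0.000975 and 0.000950). Later logged values differ slightly from this idealized formula. One possible but unconfirmed cause is that the scheduler does not advance on optimizer steps that are skipped under `fp16`.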

If you would like to save your model and/or publish it to the Hugging Face Hub, sign in via the cell below.

In [22]:
from huggingface_hub import notebook_login

notebook_login()


You can save the trained adapter with the `save_adapter` method.

In [27]:
# Directory in which to save the adapter
save_directory = "./my_model_directory"
# Save the adapter named "whisper_adapter"
model.save_adapter(save_directory, "whisper_adapter")


In [28]:
model.push_adapter_to_hub(
    "whisper",
    "whisper_adapter",
    adapterhub_tag="seq2seq/whisper",
    datasets_tag="mozilla-foundation/common_voice_11_0"
)

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model_head.bin:   0%|          | 0.00/159M [00:00<?, ?B/s]

pytorch_adapter.bin:   0%|          | 0.00/89.8M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/julian-fong/whisper/commit/daa1db299dbf330899707acad4d36b36572a48ae', commit_message='Upload model', commit_description='', oid='daa1db299dbf330899707acad4d36b36572a48ae', pr_url=None, pr_revision=None, pr_num=None)